V1MPIJob

polyaxon.polyflow.run.kubeflow.mpi_job.V1MPIJob(kind='mpi_job', clean_pod_policy=None, scheduling_policy=None, ssh_auth_mount_path=None, implementation=None, slots_per_worker=None, worker=None, launcher=None)

Kubeflow MPI-Job provides an interface to train distributed experiments with MPI.

  • Args:
    • kind: str, should be equal to mpijob
    • clean_pod_policy: str, optional, one of [All, Running, None]
    • scheduling_policy: V1SchedulingPolicy, optional
    • slots_per_worker: int, optional
    • ssh_auth_mount_path: str, optional
    • implementation: str, optional, one of [OpenMPI, Intel]
    • launcher: V1KFReplica, optional
    • worker: V1KFReplica, optional

YAML usage

run:
  kind: mpijob
  cleanPodPolicy:
  schedulingPolicy:
  slotsPerWorker:
  sshAuthMountPath:
  implementation:
  launcher:
  worker:
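
As a rough illustration of how the fields above fit together, the sketch below assembles the same run section as a plain Python dict and rejects values outside the enumerations listed in Args. It does not use the Polyaxon client: build_mpijob_run is a hypothetical helper, and the replica dicts only carry the replicas key because the container contents are elided in this page.

```python
import json

# Allowed values, copied from the Args list on this page.
CLEAN_POD_POLICIES = {"All", "Running", "None"}
IMPLEMENTATIONS = {"OpenMPI", "Intel"}


def build_mpijob_run(launcher, worker, clean_pod_policy=None,
                     slots_per_worker=None, ssh_auth_mount_path=None,
                     implementation=None):
    """Hypothetical helper: assemble the `run` section as a plain dict."""
    if clean_pod_policy is not None and clean_pod_policy not in CLEAN_POD_POLICIES:
        raise ValueError(f"unsupported cleanPodPolicy: {clean_pod_policy!r}")
    if implementation is not None and implementation not in IMPLEMENTATIONS:
        raise ValueError(f"unsupported implementation: {implementation!r}")

    run = {"kind": "mpijob"}
    optional = {
        "cleanPodPolicy": clean_pod_policy,
        "slotsPerWorker": slots_per_worker,
        "sshAuthMountPath": ssh_auth_mount_path,
        "implementation": implementation,
    }
    # Only emit the optional keys that were actually set.
    run.update({key: value for key, value in optional.items() if value is not None})
    run["launcher"] = launcher
    run["worker"] = worker
    return {"run": run}


spec = build_mpijob_run(
    launcher={"replicas": 1},
    worker={"replicas": 3},
    clean_pod_policy="All",
    slots_per_worker=2,
)
print(json.dumps(spec, indent=2))
```

Unset optional fields are left out of the dict rather than written as empty keys, so the server-side defaults described in the Fields section would still apply.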

Python usage

from polyaxon.polyflow import V1KFReplica, V1MPIJob
from polyaxon.k8s import k8s_schemas
mpi_job = V1MPIJob(
    clean_pod_policy='All',
    launcher=V1KFReplica(...),
    worker=V1KFReplica(...),
)

Fields

kind

The kind signals to the CLI, client, and other tools that this component's runtime is an mpijob.

If you are using the Python client to create the runtime, this field is not required and is set by default.

run:
  kind: mpijob

cleanPodPolicy

Controls the deletion of pods when a job terminates. The policy can be one of the following values: [All, Running, None]

run:
  kind: mpijob
  cleanPodPolicy: 'All'
 ...
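
This page does not spell out what each policy value does. The sketch below models the usual Kubeflow training-operator reading, which is an assumption here: All removes every pod, Running removes only pods that are still running, and None leaves all pods in place. The function pods_to_delete and the pod names are hypothetical and are not part of Polyaxon.

```python
def pods_to_delete(policy, pods):
    """Return the names of pods that would be removed when the job terminates.

    `pods` maps a pod name to its phase, e.g. "Running" or "Succeeded".
    The semantics modelled here are an assumption, not taken from this page.
    """
    if policy == "All":
        return sorted(pods)
    if policy == "Running":
        return sorted(name for name, phase in pods.items() if phase == "Running")
    if policy == "None":
        return []
    raise ValueError(f"unsupported cleanPodPolicy: {policy!r}")


# Hypothetical pod states at the moment the job finishes.
pods = {"launcher": "Succeeded", "worker-0": "Running", "worker-1": "Running"}
for policy in ("All", "Running", "None"):
    print(policy, pods_to_delete(policy, pods))
```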

schedulingPolicy

SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example minAvailable for gang-scheduling.

run:
  kind: mpijob
  schedulingPolicy:
    ...
 ...

slotsPerWorker

Specifies the number of slots per worker used in the hostfile. Defaults to 1.

run:
  kind: mpijob
  slotsPerWorker: 2
 ...
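
To make the effect of slotsPerWorker concrete, the sketch below renders an Open MPI-style hostfile, where each line has the form "host slots=N". The host names and both helper functions are hypothetical, and the file actually generated for a job may be laid out differently. The total number of MPI slots is then the worker replica count multiplied by slotsPerWorker.

```python
def render_hostfile(worker_hosts, slots_per_worker=1):
    """Render one 'host slots=N' line per worker (Open MPI-style hostfile)."""
    if slots_per_worker < 1:
        raise ValueError("slots_per_worker must be at least 1")
    return "\n".join(f"{host} slots={slots_per_worker}" for host in worker_hosts)


def total_slots(num_workers, slots_per_worker=1):
    """Total number of slots available across all workers."""
    return num_workers * slots_per_worker


# Hypothetical host names for a job with 3 worker replicas.
hosts = ["worker-0", "worker-1", "worker-2"]
print(render_hostfile(hosts, slots_per_worker=2))
print(total_slots(len(hosts), slots_per_worker=2))
```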

sshAuthMountPath

The directory where SSH keys are mounted. Defaults to "/root/.ssh".

run:
  kind: mpijob
  sshAuthMountPath: "/different/path/.ssh"
 ...

implementation

The MPI implementation. Options are "OpenMPI" (default) and "Intel".

run:
  kind: mpijob
  implementation: "Intel"
 ...

launcher

The launcher replica in the distributed mpijob.

run:
  kind: mpijob
  launcher:
    replicas: 1
    container:
      ...
 ...

worker

The workers do the actual work of training the model.

run:
  kind: mpijob
  worker:
    replicas: 3
    container:
      ...
 ...