V1MPIJob

polyaxon._flow.run.kubeflow.mpi_job.V1MPIJob()

Kubeflow MPI-Job provides an interface to train distributed experiments with MPI.

  • Args:
    • kind: str, should be equal to mpijob
    • clean_pod_policy: str, one of [All, Running, None]
    • scheduling_policy: V1SchedulingPolicy, optional
    • slots_per_worker: int, optional
    • launcher: V1KFReplica, optional
    • worker: V1KFReplica, optional

YAML usage

run:
  kind: mpijob
  cleanPodPolicy:
  schedulingPolicy:
  slotsPerWorker:
  launcher:
  worker:

Python usage

from polyaxon.schemas import V1KFReplica, V1MPIJob
mpi_job = V1MPIJob(
    clean_pod_policy='All',
    launcher=V1KFReplica(...),
    worker=V1KFReplica(...),
)
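The Python arguments use snake_case while the YAML fields use camelCase. The sketch below illustrates that naming correspondence with plain dictionaries. It is not Polyaxon's own serializer: `to_camel_case` and `build_run_section` are hypothetical helpers introduced only for this illustration, and the replica values are placeholders.

```python
def to_camel_case(name: str) -> str:
    """Convert a snake_case argument name to its camelCase YAML key."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def build_run_section(**kwargs) -> dict:
    """Build a plain dict shaped like the `run` section, skipping unset fields.

    Hypothetical helper: it only mirrors the field names listed above.
    """
    run = {"kind": "mpijob"}
    for name, value in kwargs.items():
        if value is not None:
            run[to_camel_case(name)] = value
    return run


run = build_run_section(
    clean_pod_policy="All",
    scheduling_policy=None,  # unset, so it is left out of the result
    slots_per_worker=2,
    launcher={"replicas": 1},
    worker={"replicas": 3},
)
print(run)
```

Unset optional fields are simply omitted, which matches how the YAML examples on this page leave out fields they do not need.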

Fields

kind

The kind signals to the CLI, client, and other tools that this component’s runtime is an mpijob.

If you are using the python client to create the runtime, this field is not required and is set by default.

run:
  kind: mpijob

cleanPodPolicy

Controls the deletion of pods when a job terminates. The policy can be one of the following values: [All, Running, None]

run:
  kind: mpijob
  cleanPodPolicy: 'All'
 ...
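As a minimal sketch, the allowed values can be checked before a spec is submitted. The helper name `check_clean_pod_policy` is hypothetical and not part of Polyaxon. Note that the policy None listed above is the string 'None', which this sketch assumes is distinct from Python's `None` object.

```python
ALLOWED_CLEAN_POD_POLICIES = ("All", "Running", "None")


def check_clean_pod_policy(value: str) -> str:
    """Return the value unchanged if it is an allowed policy, else raise."""
    if value not in ALLOWED_CLEAN_POD_POLICIES:
        raise ValueError(
            f"cleanPodPolicy must be one of {list(ALLOWED_CLEAN_POD_POLICIES)}, "
            f"got {value!r}"
        )
    return value


print(check_clean_pod_policy("All"))

try:
    check_clean_pod_policy("Never")  # not an allowed policy
except ValueError as exc:
    print(exc)
```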

schedulingPolicy

SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example minAvailable for gang-scheduling.

run:
  kind: mpijob
  schedulingPolicy:
    ...
 ...

slotsPerWorker

Specifies the number of slots per worker used in the hostfile. Defaults to 1.

run:
  kind: mpijob
  slotsPerWorker: 2
 ...
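To show what the slot count means in practice, the sketch below renders an Open MPI-style hostfile, where each line names a host and the number of slots (processes) it can run. The host names and the `render_hostfile` helper are assumptions made for illustration; the file actually generated for a job may be named and formatted differently.

```python
from typing import List


def render_hostfile(worker_hosts: List[str], slots_per_worker: int = 1) -> str:
    """Render one `<host> slots=<n>` line per worker (Open MPI hostfile style)."""
    return "\n".join(f"{host} slots={slots_per_worker}" for host in worker_hosts)


# Hypothetical host names for three worker replicas.
hosts = ["worker-0", "worker-1", "worker-2"]
slots_per_worker = 2

hostfile = render_hostfile(hosts, slots_per_worker)
total_slots = len(hosts) * slots_per_worker

print(hostfile)
print("total slots:", total_slots)
```

With 3 workers and 2 slots each, up to 6 MPI processes can be placed across the workers.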

launcher

The launcher replica in the distributed mpijob.

run:
  kind: mpijob
  launcher:
    replicas: 1
    container:
      ...
 ...

worker

The workers do the actual work of training the model.

run:
  kind: mpijob
  worker:
    replicas: 3
    container:
      ...
 ...