V1MPIJob
polyaxon.polyflow.run.kubeflow.mpi_job.V1MPIJob(kind='mpi_job', clean_pod_policy=None, scheduling_policy=None, ssh_auth_mount_path=None, implementation=None, slots_per_worker=None, worker=None, launcher=None)
Kubeflow MPI-Job provides an interface to train distributed experiments with MPI.
Args:
- kind: str, should be equal to mpijob
- clean_pod_policy: str, one of [All, Running, None]
- scheduling_policy: V1SchedulingPolicy, optional
- slots_per_worker: int, optional
- ssh_auth_mount_path: str, optional
- implementation: str, optional, one of [OpenMPI, Intel]
- launcher: V1KFReplica, optional
- worker: V1KFReplica, optional
YAML usage
run:
  kind: mpijob
  cleanPodPolicy:
  schedulingPolicy:
  slotsPerWorker:
  sshAuthMountPath:
  implementation:
  launcher:
  worker:
Python usage
from polyaxon.polyflow import V1KFReplica, V1MPIJob
from polyaxon.k8s import k8s_schemas
mpi_job = V1MPIJob(
    clean_pod_policy='All',
    launcher=V1KFReplica(...),
    worker=V1KFReplica(...),
)
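The Python arguments listed under Args use snake_case, while the YAML keys use camelCase. The following is a minimal sketch of that mapping using only the standard library. The helpers `to_camel_case` and `build_mpijob_run` are hypothetical names introduced for illustration and are not part of the Polyaxon API; the sketch produces a plain dict shaped like the `run` section rather than a real V1MPIJob instance.

```python
from typing import Any, Dict


def to_camel_case(name: str) -> str:
    """Convert a snake_case argument name to its camelCase YAML key."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def build_mpijob_run(**fields: Any) -> Dict[str, Any]:
    """Build a plain dict for the `run` section, skipping unset (None) fields.

    Hypothetical helper: it only mirrors the key names shown in the YAML usage.
    """
    run: Dict[str, Any] = {"kind": "mpijob"}
    for name, value in fields.items():
        if value is not None:
            run[to_camel_case(name)] = value
    return run


run_section = build_mpijob_run(
    clean_pod_policy="All",
    slots_per_worker=2,
    ssh_auth_mount_path=None,  # unset, so it is omitted from the result
    worker={"replicas": 3},
)
print(run_section)
```

Fields left as None are dropped here so that the resulting dict contains only the keys that were explicitly set.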
Fields
kind
The kind signals to the CLI, client, and other tools that this component’s runtime is an mpijob.
If you are using the python client to create the runtime, this field is not required and is set by default.
run:
  kind: mpijob
cleanPodPolicy
Controls the deletion of pods when a job terminates.
The policy can be one of the following values: [All, Running, None]
run:
  kind: mpijob
  cleanPodPolicy: 'All'
  ...
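Because one of the allowed policy values is the string None, it is easy to confuse it with Python's None object. The sketch below checks a value against the three documented options before it is placed in a spec. The function `check_clean_pod_policy` is a hypothetical helper, not a Polyaxon function, and treating Python's None as "field not set" is an assumption of this sketch.

```python
from typing import Optional

# The three values documented for cleanPodPolicy; "None" is a string here.
CLEAN_POD_POLICIES = ("All", "Running", "None")


def check_clean_pod_policy(value: Optional[str]) -> Optional[str]:
    """Return the value if it is a documented policy, else raise ValueError.

    Assumption: Python's None means the field was left unset and is passed through.
    """
    if value is None:
        return None
    if value not in CLEAN_POD_POLICIES:
        raise ValueError(
            f"cleanPodPolicy must be one of {CLEAN_POD_POLICIES}, got {value!r}"
        )
    return value


policy = check_clean_pod_policy("Running")
```

The check is case-sensitive, so a value such as 'all' is rejected rather than silently normalized.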
schedulingPolicy
SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example minAvailable for gang-scheduling.
run:
  kind: mpijob
  schedulingPolicy:
    ...
  ...
slotsPerWorker
Specifies the number of slots per worker used in hostfile. Defaults to 1.
run:
  kind: mpijob
  slotsPerWorker: 2
  ...
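To illustrate how the slot count relates to a hostfile, here is a minimal sketch that writes Open MPI style entries of the form `host slots=N`. The real hostfile is generated for the job at runtime; the function `hostfile_lines` and the host names `worker-0`, `worker-1`, `worker-2` are hypothetical and used only for illustration.

```python
from typing import List


def hostfile_lines(worker_hosts: List[str], slots_per_worker: int = 1) -> List[str]:
    """Return one Open MPI style hostfile entry per worker host."""
    if slots_per_worker < 1:
        raise ValueError("slots_per_worker must be at least 1")
    return [f"{host} slots={slots_per_worker}" for host in worker_hosts]


# Hypothetical host names for three worker replicas.
hosts = [f"worker-{i}" for i in range(3)]
slots = 2

lines = hostfile_lines(hosts, slots_per_worker=slots)
total_slots = len(hosts) * slots  # processes that can be placed across all workers

print("\n".join(lines))
print("total slots:", total_slots)
```

With three workers and two slots each, the sketch yields six slots in total, which is the number of processes the launcher could place across the workers under these assumptions.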
sshAuthMountPath
The directory where SSH keys are mounted. Defaults to “/root/.ssh”.
run:
  kind: mpijob
  sshAuthMountPath: "/different/path/.ssh"
  ...
implementation
The MPI implementation. Options are “OpenMPI” (default) and “Intel”.
run:
  kind: mpijob
  implementation: "Intel"
  ...
launcher
The launcher replica in the distributed mpijob.
run:
  kind: mpijob
  launcher:
    replicas: 1
    container:
      ...
  ...
worker
The workers do the actual work of training the model.
run:
  kind: mpijob
  worker:
    replicas: 3
    container:
      ...
  ...