V1PytorchJob
polyaxon._flow.run.kubeflow.pytorch_job.V1PytorchJob()
Kubeflow Pytorch-Job provides an interface to train distributed experiments with PyTorch.
- Args:
  - kind: str, should be equal to pytorchjob
  - clean_pod_policy: str, one of [All, Running, None]
  - scheduling_policy: V1SchedulingPolicy, optional
  - master: V1KFReplica, optional
  - worker: V1KFReplica, optional
YAML usage
run:
  kind: pytorchjob
  cleanPodPolicy:
  schedulingPolicy:
  master:
  worker:
Python usage
from polyaxon.schemas import V1KFReplica, V1PytorchJob
pytorch_job = V1PytorchJob(
    clean_pod_policy='All',
    master=V1KFReplica(...),
    worker=V1KFReplica(...),
)
Fields
kind
The kind signals to the CLI, client, and other tools that this component’s runtime is a pytorchjob.
If you are using the Python client to create the runtime, this field is not required and is set by default.
run:
  kind: pytorchjob
cleanPodPolicy
Controls the deletion of pods when a job terminates.
The policy can be one of the following values: [All, Running, None]
run:
  kind: pytorchjob
  cleanPodPolicy: 'All'
  ...
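As a minimal sketch, the three allowed values listed above can be checked before a spec is submitted. The helper name `check_clean_pod_policy` and the constant are hypothetical and not part of Polyaxon. Since the Args list gives the type as str, the sketch assumes the third value is the string "None" rather than Python's `None` object.

```python
# Allowed values copied from the cleanPodPolicy description above.
# Note: "None" is the string, not Python's None object (assumption based on the str type).
ALLOWED_CLEAN_POD_POLICIES = ("All", "Running", "None")


def check_clean_pod_policy(value: str) -> str:
    """Return the value unchanged if it is an allowed policy, else raise ValueError.

    Hypothetical helper for illustration only; it is not a Polyaxon function.
    """
    if value not in ALLOWED_CLEAN_POD_POLICIES:
        raise ValueError(
            f"cleanPodPolicy must be one of {ALLOWED_CLEAN_POD_POLICIES}, got {value!r}"
        )
    return value


policy = check_clean_pod_policy("All")
```

A check like this catches a misspelled policy locally instead of at submission time.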
schedulingPolicy
SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example minAvailable for gang-scheduling.
run:
  kind: pytorchjob
  schedulingPolicy:
    ...
  ...
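To illustrate what minAvailable means for gang-scheduling, the sketch below models the all-or-nothing rule: the job starts only when at least minAvailable pods can be placed at once. This is a conceptual illustration, not the scheduler's implementation; the function name `can_start` and its parameters are hypothetical.

```python
def can_start(schedulable_pods: int, min_available: int) -> bool:
    """All-or-nothing check: True only if at least min_available pods can be placed.

    Hypothetical illustration of the gang-scheduling idea, not a Polyaxon or Kubeflow API.
    """
    return schedulable_pods >= min_available


# Example: 1 master + 3 workers, all required to start together.
required = 1 + 3
enough_room = can_start(schedulable_pods=4, min_available=required)
too_little_room = can_start(schedulable_pods=3, min_available=required)
```

With room for only three pods, the job waits instead of starting a partial set of replicas that would block on the missing one.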
elasticPolicy
ElasticPolicy encapsulates various policies for an elastic distributed training job.
run:
  kind: pytorchjob
  elasticPolicy:
    ...
  ...
master
The master replica in the distributed PytorchJob.
run:
  kind: pytorchjob
  master:
    replicas: 1
    container:
      ...
  ...
worker
The workers do the actual work of training the model.
run:
  kind: pytorchjob
  worker:
    replicas: 3
    container:
      ...
  ...
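The master and worker examples can be combined into one run section. The sketch below assembles that section as a plain Python dict using only the field names documented on this page, then counts the total replicas. The helper `build_pytorchjob_run` is hypothetical, and the container definitions are left out because they are elided in the examples above.

```python
import json


def build_pytorchjob_run(worker_replicas: int, clean_pod_policy: str = "All") -> dict:
    """Assemble a dict mirroring the YAML run section (hypothetical helper)."""
    if worker_replicas < 0:
        raise ValueError("worker_replicas must be non-negative")
    return {
        "kind": "pytorchjob",
        "cleanPodPolicy": clean_pod_policy,
        # Container specs are omitted here; they are elided in the examples above.
        "master": {"replicas": 1},
        "worker": {"replicas": worker_replicas},
    }


run = build_pytorchjob_run(worker_replicas=3)
total_replicas = run["master"]["replicas"] + run["worker"]["replicas"]

# Print the structure for inspection; the keys match the YAML field names.
print(json.dumps({"run": run}, indent=2))
```

Keeping the structure as a plain dict makes it easy to inspect or serialize before turning it into an actual Polyaxon spec.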