V1PytorchJob

polyaxon._flow.run.kubeflow.pytorch_job.V1PytorchJob()

Kubeflow PyTorchJob provides an interface for training distributed experiments with PyTorch.

YAML usage

run:
  kind: pytorchjob
  cleanPodPolicy:
  schedulingPolicy:
  elasticPolicy:
  master:
  worker:

Python usage

from polyaxon.schemas import V1KFReplica, V1PytorchJob
pytorch_job = V1PytorchJob(
    clean_pod_policy='All',
    master=V1KFReplica(...),
    worker=V1KFReplica(...),
)
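The YAML specification uses camelCase keys (cleanPodPolicy), while the Python class takes snake_case keyword arguments (clean_pod_policy). The following is a minimal sketch of that naming correspondence. The helper `to_camel` is hypothetical and not part of Polyaxon, and the `scheduling_policy` and `elastic_policy` argument names are assumed by analogy with `clean_pod_policy`.

```python
def to_camel(name: str) -> str:
    """Convert a snake_case argument name to the camelCase form used in YAML."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


# `clean_pod_policy`, `master`, and `worker` appear in the Python usage above;
# the other two names are assumptions made by analogy.
python_args = [
    "clean_pod_policy",
    "scheduling_policy",
    "elastic_policy",
    "master",
    "worker",
]

# Map each Python argument to the key expected under `run:` in YAML.
yaml_keys = {arg: to_camel(arg) for arg in python_args}
```

Single-word names such as `master` and `worker` are the same in both forms.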

Fields

kind

The kind signals to the CLI, client, and other tools that this component’s runtime is a pytorchjob.

If you are using the Python client to create the runtime, this field is not required and is set by default.

run:
  kind: pytorchjob

cleanPodPolicy

Controls which pods are deleted when a job terminates. The policy can be one of the following values: All (delete every pod), Running (delete only pods that are still running), or None (delete no pods).

run:
  kind: pytorchjob
  cleanPodPolicy: 'All'
 ...
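Because only three values are accepted, a typo in cleanPodPolicy is easy to catch before submitting. The check below is a rough sketch that works on a plain dictionary mirroring the `run` section. The function `check_clean_pod_policy` is a hypothetical helper, not a Polyaxon API.

```python
# Values listed in this section. Note that "None" is the literal string,
# not Python's None object.
ALLOWED_CLEAN_POD_POLICIES = {"All", "Running", "None"}


def check_clean_pod_policy(run: dict) -> str:
    """Return the policy if it is one of the documented values, else raise."""
    policy = run.get("cleanPodPolicy")
    if policy not in ALLOWED_CLEAN_POD_POLICIES:
        allowed = ", ".join(sorted(ALLOWED_CLEAN_POD_POLICIES))
        raise ValueError(
            f"cleanPodPolicy must be one of: {allowed}; got {policy!r}"
        )
    return policy


# Dictionary form of the YAML example above.
run = {"kind": "pytorchjob", "cleanPodPolicy": "All"}
policy = check_clean_pod_policy(run)
```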

schedulingPolicy

SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example minAvailable for gang-scheduling.

run:
  kind: pytorchjob
  schedulingPolicy:
    ...
 ...

elasticPolicy

ElasticPolicy encapsulates various policies for elastic distributed training jobs.

run:
  kind: pytorchjob
  elasticPolicy:
    ...
 ...

master

The master replica in the distributed PyTorchJob.

run:
  kind: pytorchjob
  master:
    replicas: 1
    container:
      ...
 ...

worker

The workers do the actual work of training the model.

run:
  kind: pytorchjob
  worker:
    replicas: 3
    container:
      ...
 ...
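Taken together, the master and worker examples above declare 1 + 3 replicas. A small sketch of that arithmetic on a plain dictionary mirroring the `run` section is shown below. The helper `total_replicas` is hypothetical, and it only counts replicas that are declared explicitly. Assuming each replica runs a single training process, this total would also be the distributed world size.

```python
def total_replicas(run: dict) -> int:
    """Sum the explicitly declared replica counts of the master and workers."""
    return sum(
        run[role]["replicas"] for role in ("master", "worker") if role in run
    )


# Dictionary form of the master and worker examples, containers omitted.
run = {
    "kind": "pytorchjob",
    "master": {"replicas": 1},
    "worker": {"replicas": 3},
}

replica_count = total_replicas(run)  # 1 master + 3 workers = 4
```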