polyaxon.polyflow.run.kubeflow.tf_job.V1TFJob(kind='tfjob', clean_pod_policy=None, scheduling_policy=None, chief=None, worker=None, ps=None, evaluator=None)
Kubeflow TF-Job provides an interface to train distributed experiments with TensorFlow.
run:
  kind: tfjob
  cleanPodPolicy:
  schedulingPolicy:
  chief:
  ps:
  worker:
  evaluator:
from polyaxon.polyflow import V1KFReplica, V1TFJob
from polyaxon.k8s import k8s_schemas

tf_job = V1TFJob(
    clean_pod_policy='All',
    chief=V1KFReplica(...),
    ps=V1KFReplica(...),
    worker=V1KFReplica(...),
    evaluator=V1KFReplica(...),
)
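The Python field names in the constructor are snake_case, while the YAML keys are camelCase (`clean_pod_policy` becomes `cleanPodPolicy`). As a rough illustration of that mapping only, the sketch below converts plain keyword arguments into the `run` section layout shown above. The helpers `to_camel` and `build_run_section` are hypothetical and are not part of Polyaxon; the replica values are placeholders.

```python
import json


def to_camel(name: str) -> str:
    """Convert a snake_case field name to the camelCase key used in YAML."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def build_run_section(**fields) -> dict:
    """Build a plain dict shaped like the `run` section of a tfjob spec.

    Hypothetical helper for illustration; fields set to None are omitted.
    """
    run = {"kind": "tfjob"}
    for name, value in fields.items():
        if value is not None:
            run[to_camel(name)] = value
    return {"run": run}


spec = build_run_section(
    clean_pod_policy="All",
    scheduling_policy=None,  # left unset, so it is not emitted
    chief={"replicas": 1},   # placeholder replica definition
    worker={"replicas": 2},  # placeholder replica definition
)
print(json.dumps(spec, indent=2))
```

Because JSON is a subset of YAML, the printed output can be compared directly against the YAML form of the specification.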
The kind signals to the CLI, client, and other tools that this component’s runtime is a tfjob.
If you are using the Python client to create the runtime, this field is not required and is set by default.
run:
  kind: tfjob
Controls the deletion of pods when a job terminates.
The policy can be one of the following values: [Running, All, None]
run:
  kind: tfjob
  cleanPodPolicy: 'All'
  ...
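A minimal sketch of checking the value before submitting a specification, assuming the accepted values are `All`, `Running`, and `None` as defined by Kubeflow's clean pod policy. The function `check_clean_pod_policy` is a hypothetical helper, not a Polyaxon API.

```python
from typing import Optional

# Assumed set of accepted values, taken from Kubeflow's clean pod policy.
CLEAN_POD_POLICIES = ("All", "Running", "None")


def check_clean_pod_policy(value: Optional[str]) -> Optional[str]:
    """Return the value unchanged if it is unset or a known policy name."""
    if value is None:
        return None  # field left unset
    if value not in CLEAN_POD_POLICIES:
        raise ValueError(
            f"cleanPodPolicy must be one of {CLEAN_POD_POLICIES}, got {value!r}"
        )
    return value


print(check_clean_pod_policy("All"))
```

Note that the comparison is case-sensitive in this sketch, so `'all'` would be rejected.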
SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example minAvailable for gang-scheduling.
run:
  kind: tfjob
  schedulingPolicy:
    ...
  ...
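With gang scheduling, the pods of a job are generally started only once at least `minAvailable` of them can be placed together, which avoids a partially scheduled job holding resources. As a rough sketch, assuming one pod per replica and using hypothetical replica counts, the helpers below (not part of Polyaxon) compute the total pod count and check that a chosen `minAvailable` is feasible.

```python
def total_pods(replicas: dict) -> int:
    """Sum the replica counts across all replica groups."""
    return sum(replicas.values())


def check_min_available(min_available: int, replicas: dict) -> int:
    """Check that min_available is between 1 and the total pod count."""
    total = total_pods(replicas)
    if not 1 <= min_available <= total:
        raise ValueError(
            f"minAvailable={min_available} is outside the range 1..{total}"
        )
    return min_available


# Hypothetical replica counts for illustration.
replicas = {"chief": 1, "ps": 2, "worker": 2, "evaluator": 1}
print(total_pods(replicas))
print(check_min_available(6, replicas))
```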
The chief is responsible for orchestrating training and performing tasks like checkpointing the model.
run:
  kind: tfjob
  chief:
    replicas: 1
    container:
      ...
  ...
The ps are parameter servers; these servers provide a distributed data store for the model parameters.
run:
  kind: tfjob
  ps:
    replicas: 2
    container:
      ...
  ...
The workers do the actual work of training the model. In some cases, worker 0 might also act as the chief.
run:
  kind: tfjob
  worker:
    replicas: 2
    container:
      ...
  ...
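In distributed TensorFlow, each replica typically learns its role from the `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry. The sketch below assumes that layout and shows one way a training script might decide whether it is acting as the chief, including the case where no dedicated chief exists and worker 0 takes over. The function `is_chief` and the host names are hypothetical.

```python
import json
import os


def is_chief(tf_config: dict) -> bool:
    """Return True if this task should perform chief duties."""
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    task_type = task.get("type")
    task_index = int(task.get("index", 0))
    if task_type == "chief":
        return True
    # Without a dedicated chief replica, worker 0 acts as the chief.
    return "chief" not in cluster and task_type == "worker" and task_index == 0


# Hypothetical TF_CONFIG value, used only when the variable is not set.
example = {
    "cluster": {
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222", "ps-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
}

tf_config = json.loads(os.environ.get("TF_CONFIG") or json.dumps(example))
print(is_chief(tf_config))
```

A training script could use such a check to restrict checkpointing to a single replica.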
The evaluators can be used to compute evaluation metrics as the model is trained.
run:
  kind: tfjob
  evaluator:
    replicas: 1
    container:
      ...
  ...