V1TFJob
polyaxon._flow.run.kubeflow.tf_job.V1TFJob()
Kubeflow TF-Job provides an interface to train distributed experiments with TensorFlow.
- Args:
- kind: str, should be equal to tfjob
- clean_pod_policy: str, one of [All, Running, None]
- scheduling_policy: V1SchedulingPolicy, optional
- enable_dynamic_worker: boolean
- chief: V1KFReplica, optional
- ps: V1KFReplica, optional
- worker: V1KFReplica, optional
- evaluator: V1KFReplica, optional
YAML usage
run:
  kind: tfjob
  cleanPodPolicy:
  schedulingPolicy:
  enableDynamicWorker:
  chief:
  ps:
  worker:
  evaluator:
Python usage
from polyaxon.schemas import V1KFReplica, V1TFJob
tf_job = V1TFJob(
    clean_pod_policy='All',
    chief=V1KFReplica(...),
    ps=V1KFReplica(...),
    worker=V1KFReplica(...),
    evaluator=V1KFReplica(...),
)
Fields
kind
The kind signals to the CLI, client, and other tools that this component’s runtime is a tfjob.
If you are using the Python client to create the runtime, this field is not required and is set by default.
run:
  kind: tfjob
cleanPodPolicy
Controls the deletion of pods when a job terminates.
The policy can be one of the following values: [All, Running, None]
run:
  kind: tfjob
  cleanPodPolicy: 'All'
  ...
schedulingPolicy
SchedulingPolicy encapsulates various scheduling policies of the distributed training
job, for example minAvailable for gang-scheduling.
run:
  kind: tfjob
  schedulingPolicy:
    ...
  ...
enableDynamicWorker
Flag to enable dynamic worker.
run:
  kind: tfjob
  enableDynamicWorker: true
  ...
  ...
chief
The chief is responsible for orchestrating training and performing tasks like checkpointing the model.
run:
  kind: tfjob
  chief:
    replicas: 1
    container:
      ...
  ...
ps
The ps are parameter servers; these servers provide a distributed data store for the model parameters.
run:
  kind: tfjob
  ps:
    replicas: 2
    container:
      ...
  ...
worker
The workers do the actual work of training the model. In some cases, worker 0 might also act as the chief.
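The remark that worker 0 might also act as the chief can be illustrated with a small sketch. It assumes the training process receives a `TF_CONFIG` environment variable with `cluster` and `task` keys, as TensorFlow's distributed training conventions describe; the helper name `is_chief` is hypothetical and not part of Polyaxon.

```python
import json
import os


def is_chief(tf_config=None):
    """Return True if this replica should run chief-only tasks (e.g. checkpointing).

    Hypothetical helper: reads TF_CONFIG when no config dict is passed in.
    """
    if tf_config is None:
        tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    task_type = task.get("type")
    task_index = int(task.get("index", 0))

    if not cluster:
        # No cluster defined: assume single-process training.
        return True
    if "chief" in cluster:
        # A dedicated chief replica exists, so only it is the chief.
        return task_type == "chief"
    # No dedicated chief: worker 0 takes over the chief's duties.
    return task_type == "worker" and task_index == 0
```

A training script could guard checkpoint writes with `if is_chief(): ...` so that only one replica saves the model.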
run:
  kind: tfjob
  worker:
    replicas: 2
    container:
      ...
  ...
evaluator
The evaluators can be used to compute evaluation metrics as the model is trained.
run:
  kind: tfjob
  evaluator:
    replicas: 1
    container:
      ...
  ...
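To see how the fields above fit together in one specification, the following sketch assembles a complete `run` section as a plain Python dictionary and prints it as JSON, which is also valid YAML. It deliberately avoids the Polyaxon client and uses only the standard library. The helper `build_tfjob_run`, the shared container for all roles, and the image and command values are illustrative assumptions, not part of the Polyaxon API.

```python
import json

# Values taken from the field descriptions above.
ALLOWED_CLEAN_POD_POLICIES = ("All", "Running", "None")
REPLICA_ROLES = ("chief", "ps", "worker", "evaluator")


def build_tfjob_run(replicas, container, clean_pod_policy="All"):
    """Build a tfjob `run` section as a plain dict (hypothetical helper).

    replicas: mapping of role name to replica count, e.g. {"worker": 2}
    container: container spec reused for every role (a simplification)
    """
    if clean_pod_policy not in ALLOWED_CLEAN_POD_POLICIES:
        raise ValueError(f"cleanPodPolicy must be one of {ALLOWED_CLEAN_POD_POLICIES}")
    unknown = set(replicas) - set(REPLICA_ROLES)
    if unknown:
        raise ValueError(f"unknown replica roles: {sorted(unknown)}")

    run = {"kind": "tfjob", "cleanPodPolicy": clean_pod_policy}
    for role in REPLICA_ROLES:
        if role in replicas:
            # Copy the container so each role can be edited independently.
            run[role] = {"replicas": replicas[role], "container": dict(container)}
    return {"run": run}


spec = build_tfjob_run(
    replicas={"chief": 1, "ps": 2, "worker": 2, "evaluator": 1},
    # Hypothetical image and command, for illustration only.
    container={"image": "my-registry/my-tf-image", "command": ["python", "train.py"]},
)
print(json.dumps(spec, indent=2))
```

Validating `cleanPodPolicy` and the role names before serializing catches typos early, rather than at submission time.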