Scaling with distributed training
Running a distributed job is similar to running a normal job; Polyaxon offers several distributed runtimes.
In this example, we will show a distributed TensorFlow experiment based on TFJob, but the same principle applies to the other supported operators.
```yaml
version: 1.1
kind: component
run:
  kind: tfjob
  chief:
    connections: [my-training-dataset]
    container:
      image: image-with-default-entrypoint
  worker:
    replicas: 2
    environment:
      restartPolicy: OnFailure
    connections: [my-training-dataset]
    container:
      image: image-with-default-entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1
```
This will start a TFJob with one replica of type chief and two workers, each worker requesting one GPU.
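The same pattern carries over to the other operators, with only the runtime kind and the replica section names changing. The following is a minimal sketch of a PyTorch equivalent, assuming a `pytorchjob` runtime with `master` and `worker` sections; the connection name and image are reused as placeholders from the example above, so check the distributed jobs docs for the exact schema.

```yaml
version: 1.1
kind: component
run:
  kind: pytorchjob        # runtime kind swapped from tfjob (assumed schema)
  master:                 # pytorchjob typically uses master instead of chief (assumption)
    connections: [my-training-dataset]
    container:
      image: image-with-default-entrypoint
  worker:
    replicas: 2
    connections: [my-training-dataset]
    container:
      image: image-with-default-entrypoint
```

In either case the component is submitted like any other run, for example with `polyaxon run -f polyaxonfile.yaml`, assuming the manifest is saved as `polyaxonfile.yaml`.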
Learn More
You can check the distributed jobs section for more details.