Scaling with distributed training

Running a distributed job is similar to running a normal job; Polyaxon offers several distributed runtimes.

In this example, we will show a distributed TensorFlow experiment based on TFJob, but the same principle applies to the other supported operators.

version: 1.1
kind: component
run:
  kind: tfjob
  chief:
    connections: [my-training-dataset]
    container:
      image: image-with-default-entrypoint
  worker:
    replicas: 2
    environment:
      restartPolicy: OnFailure  # restart a worker pod if it exits with an error
    connections: [my-training-dataset]
    container:
      image: image-with-default-entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1

This will start a TFJob with one chief replica and two workers, each worker requesting one GPU.
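
Other supported operators follow the same structure; only the runtime kind and the replica section names change. As a minimal sketch (assuming the PyTorchJob runtime, whose replica sections are named master and worker, and reusing the same placeholder image and connection names from the example above), an equivalent PyTorch experiment could look like:

version: 1.1
kind: component
run:
  kind: pytorchjob
  master:
    connections: [my-training-dataset]
    container:
      image: image-with-default-entrypoint
  worker:
    replicas: 2
    connections: [my-training-dataset]
    container:
      image: image-with-default-entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1

In both cases the connections and resource requests are declared per replica, and you can submit the component with the Polyaxon CLI, for example polyaxon run -f component.yaml (the file name here is just an example).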

Learn More

You can check the distributed jobs section for more details.