Polyaxon currently supports and simplifies distributed training for the following frameworks: TensorFlow, MXNet, and PyTorch.
To parallelize your computations across processes and clusters of machines, you need to adapt your code and update your polyaxonfile to specify the cluster definition. Polyaxon then takes care of creating the tasks and exporting the environment variables needed to enable distributed training.
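For orientation, here is a minimal polyaxonfile sketch showing where the distribution settings described in the rest of this guide would live. The surrounding fields (version, kind, build, run) are only an assumed skeleton of a typical experiment file and may differ for your project and Polyaxon version:

---
version: 1

kind: experiment

# The distribution settings covered below go under the environment section.
environment:
  tensorflow:
    n_workers: 4
    n_ps: 2

# Assumed build/run skeleton for illustration only; adapt to your project.
build:
  image: tensorflow/tensorflow:1.8.0-py3

run:
  cmd: python train.py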
The current specification provides a section called environment to customize the computational resources and the distribution definition.
By default, when no parallelism is specified, Polyaxon creates a single task to run the experiment defined in the polyaxonfile. Sometimes users wish to specify the resources for this master task, e.g. to use GPUs:
environment:
  resources:
    cpu:
      requests: 1
      limits: 2
    memory:
      requests: 256
      limits: 1024
    gpu:
      requests: 1
      limits: 1
This same section is also used to specify the cluster definition for distributed training. Since parallelism is handled differently by each framework, this guide describes how to define a cluster for each of the supported frameworks.
Distributed TensorFlow
To distribute TensorFlow experiments, the user needs to define a cluster, which is a set of tasks that participate in the distributed execution.
TensorFlow defines 3 types of tasks: master, workers, and parameter servers.
To define a cluster in Polyaxon with a master, 2 parameter servers, and 4 workers, we need to add a tensorflow subsection to the environment section:
environment:
  ...

  tensorflow:
    n_workers: 4
    n_ps: 2
As mentioned before, Polyaxon always creates a master task.
You can have more control over the created tasks by defining the resources of each task the same way we defined the resources for the master.
Here's the same example with custom resources:
environment:
  resources:
    cpu:
      requests: 1
      limits: 2
    memory:
      requests: 256
      limits: 1024
    gpu:
      requests: 1
      limits: 1

  tensorflow:
    n_workers: 4
    n_ps: 2

    default_worker:
      resources:
        cpu:
          requests: 1
          limits: 2
        memory:
          requests: 256
          limits: 1024
        gpu:
          requests: 1
          limits: 1

    worker:
      - index: 2
        resources:
          cpu:
            requests: 1
            limits: 2
          memory:
            requests: 256
            limits: 1024

    default_ps:
      resources:
        cpu:
          requests: 1
          limits: 1
        memory:
          requests: 256
          limits: 256
The master's resources are defined in the resources section, i.e.
resources:
  cpu:
    requests: 1
    limits: 2
  memory:
    requests: 256
    limits: 1024
  gpu:
    requests: 1
    limits: 1
The third worker (worker with index == 2) has a specific resources definition:
worker:
  - index: 2
    resources:
      cpu:
        requests: 1
        limits: 2
      memory:
        requests: 256
        limits: 1024
And all the other workers have the same default worker resources definition, i.e.
default_worker:
  resources:
    cpu:
      requests: 1
      limits: 2
    memory:
      requests: 256
      limits: 1024
    gpu:
      requests: 1
      limits: 1
The same logic applies to the parameter servers, using default_ps and ps.
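For example, a sketch following the same pattern as the worker override above; the index and the requests/limits values below are illustrative, not taken from the original example:

default_ps:
  resources:
    cpu:
      requests: 1
      limits: 1
    memory:
      requests: 256
      limits: 256

ps:
  - index: 0
    resources:
      cpu:
        requests: 1
        limits: 2
      memory:
        requests: 512
        limits: 1024

With this definition, the first parameter server (index == 0) gets the overridden resources while the other one keeps the default_ps resources.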
Distributed MXNet
In a similar fashion, MXNet defines 3 types of tasks: scheduler, workers, and parameter servers.
As explained before, the master section will have the scheduler role in this case; the rest of the cluster needs to be defined in an mxnet subsection:
environment:
  ...

  mxnet:
    n_workers: 3
    n_ps: 1
And in a similar fashion, you can customize the resources of each task the same way it was explained in the TensorFlow section.
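For instance, assuming the mxnet subsection accepts the same default_worker and default_ps keys as the tensorflow one, as the text above suggests (the requests/limits values below are illustrative):

environment:
  ...

  mxnet:
    n_workers: 3
    n_ps: 1

    default_worker:
      resources:
        cpu:
          requests: 1
          limits: 2
        memory:
          requests: 256
          limits: 1024

    default_ps:
      resources:
        cpu:
          requests: 1
          limits: 1
        memory:
          requests: 256
          limits: 256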
Distributed PyTorch
Distributed PyTorch is also similar, but it only defines a master task (worker with rank 0) and a set of worker tasks.
To define a PyTorch cluster in Polyaxon with a master and 3 workers, we need to add a pytorch subsection to the environment section:
environment:
  ...

  pytorch:
    n_workers: 3
You can customize the resources of each task the same way it was explained in the TensorFlow section.
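For instance, assuming the pytorch subsection accepts the same default_worker key as the tensorflow one (the requests/limits values below are illustrative):

environment:
  ...

  pytorch:
    n_workers: 3

    default_worker:
      resources:
        cpu:
          requests: 1
          limits: 2
        memory:
          requests: 256
          limits: 1024
        gpu:
          requests: 1
          limits: 1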
Examples
You can find examples of distributed experiments for these frameworks in our examples GitHub repo.