Polyaxon allows you to schedule distributed Horovod experiments and supports tracking metrics, outputs, and models.

Experiments on a single node

To run experiments on a single node with Horovod, you don’t need to deploy the MPI Operator; you just need to provide the correct command and args to enable usage of Horovod.

version: 1.1
kind: component
inputs:
  - name: gpus
    isOptional: true
    type: int
    value: 2
run:
  kind: job
  container:
    image: IMAGE_TO_USE
    resources:
      limits:
        nvidia.com/gpu: "{{ gpus }}"
    command: ["horovodrun", "-np", "{{ gpus }}", "-H", "localhost:{{ gpus }}", "python", "-u", "mnist.py"]
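Once the component above is saved to a file, it can be scheduled with the Polyaxon CLI, overriding the optional `gpus` input with the `-P` flag. The file name below is an assumption for illustration:

```shell
# Schedule the single-node Horovod component on the cluster.
# "horovod-single-node.yaml" is a hypothetical name for the file above;
# -P overrides the default value of the "gpus" input.
polyaxon run -f horovod-single-node.yaml -P gpus=4
```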

Distributed experiments with the MPIJob Operator

Polyaxon provides support for Horovod via the MPIJob Operator, so you will need to deploy the operator first and then provide a valid MPIJob manifest.

Define the distributed topology

Please check the guide Running Horovod for more details on how to set up a Horovod experiment with MPI.

Example manifest:

version: 1.1
kind: component
run:
  kind: mpijob
  slotsPerWorker: 1
  launcher:
    replicas: 1
    container:
      image: docker.io/kubeflow/mpi-horovod-mnist
      command:
        - mpirun
      args:
        - -np
        - "2"
        - --allow-run-as-root
        - -bind-to
        - none
        - -map-by
        - slot
        - -x
        - LD_LIBRARY_PATH
        - -x
        - PATH
        - -mca
        - pml
        - ob1
        - -mca
        - btl
        - ^openib
        - python
        - /examples/tensorflow_mnist.py
      resources:
        limits:
          cpu: 1
          memory: 2Gi
  worker:
    replicas: 2
    container:
      image: docker.io/kubeflow/mpi-horovod-mnist
      resources:
        limits:
          cpu: 2
          memory: 4Gi
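As with the single-node example, hard-coded values in the manifest can be exposed as component inputs. A minimal sketch, assuming a hypothetical `workers` input that is not part of the original manifest:

```yaml
version: 1.1
kind: component
inputs:
  # Hypothetical input to parametrize the number of Horovod workers.
  - name: workers
    isOptional: true
    type: int
    value: 2
run:
  kind: mpijob
  slotsPerWorker: 1
  launcher:
    replicas: 1
    container:
      image: docker.io/kubeflow/mpi-horovod-mnist
      command: ["mpirun"]
      # The -np value must match the total number of worker slots.
      args: ["-np", "{{ workers }}", "--allow-run-as-root",
             "-bind-to", "none", "-map-by", "slot",
             "-x", "LD_LIBRARY_PATH", "-x", "PATH",
             "-mca", "pml", "ob1", "-mca", "btl", "^openib",
             "python", "/examples/tensorflow_mnist.py"]
  worker:
    replicas: "{{ workers }}"
    container:
      image: docker.io/kubeflow/mpi-horovod-mnist
```

This keeps the launcher's `-np` argument and the worker `replicas` in sync from a single parameter instead of editing both values by hand.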