Polyaxon allows you to schedule distributed Horovod experiments and supports tracking metrics, outputs, and models.
Experiments on a single node
To run experiments on a single node with Horovod, you don't need to deploy the MPIJob Operator; you just need to provide the correct command and args to enable Horovod.
version: 1.1
kind: component
inputs:
- name: gpus
  isOptional: true
  type: int
  value: 2
run:
  kind: job
  container:
    image: IMAGE_TO_USE
    resources:
      limits:
        nvidia.com/gpu: "{{ gpus }}"
    command: ["horovodrun", "-np", "{{ gpus }}", "-H", "localhost:{{ gpus }}", "python", "-u", "mnist.py"]
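Before the job starts, Polyaxon resolves the `{{ gpus }}` references in the manifest against the component's inputs. As a rough illustration of that substitution (a simplified sketch using plain string replacement, not Polyaxon's actual Jinja-based templating engine):

```python
# Simplified sketch: substitute an input value into the command args.
# Polyaxon itself uses a Jinja-based templating engine for this.
def render(template: str, params: dict) -> str:
    for name, value in params.items():
        template = template.replace("{{ " + name + " }}", str(value))
    return template

command = ["horovodrun", "-np", "{{ gpus }}", "-H", "localhost:{{ gpus }}",
           "python", "-u", "mnist.py"]
rendered = [render(arg, {"gpus": 2}) for arg in command]
print(rendered)
# ['horovodrun', '-np', '2', '-H', 'localhost:2', 'python', '-u', 'mnist.py']
```

With the default input value of 2, `horovodrun` launches two local processes, matching the two GPUs requested in the resource limits.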
Distributed experiments with the MPIJob Operator
Polyaxon provides support for Horovod via the MPIJob Operator, so you will need to deploy the operator first and then provide a valid MPIJob manifest.
Define the distributed topology
Please check the guide Running Horovod for more details on how to set up a Horovod experiment with MPI.
Example manifest:
version: 1.1
kind: component
run:
  kind: mpijob
  slotsPerWorker: 1
  launcher:
    replicas: 1
    container:
      image: docker.io/kubeflow/mpi-horovod-mnist
      command:
      - mpirun
      args:
      - -np
      - "2"
      - --allow-run-as-root
      - -bind-to
      - none
      - -map-by
      - slot
      - -x
      - LD_LIBRARY_PATH
      - -x
      - PATH
      - -mca
      - pml
      - ob1
      - -mca
      - btl
      - ^openib
      - python
      - /examples/tensorflow_mnist.py
      resources:
        limits:
          cpu: 1
          memory: 2Gi
  worker:
    replicas: 2
    container:
      image: docker.io/kubeflow/mpi-horovod-mnist
      resources:
        limits:
          cpu: 2
          memory: 4Gi
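When editing this manifest, keep the `-np` value passed to `mpirun` consistent with the total number of MPI slots, i.e. `worker.replicas * slotsPerWorker`. A quick sanity check of the values above (plain Python, with the numbers hardcoded from this example manifest):

```python
# Sanity check: mpirun's -np must not exceed the total MPI slots
# provided by the worker pods in the MPIJob manifest above.
slots_per_worker = 1   # slotsPerWorker in the manifest
worker_replicas = 2    # worker.replicas in the manifest
requested_np = 2       # the "-np", "2" args passed to mpirun

total_slots = worker_replicas * slots_per_worker
assert requested_np <= total_slots, "mpirun would fail: not enough slots"
print(total_slots)  # → 2
```

If you scale the workers up or down, adjust `-np` (and the GPU/CPU limits) accordingly.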