V1MXJob

polyaxon._flow.run.kubeflow.mx_job.V1MXJob()

Kubeflow MXNet-Job provides an interface to train distributed experiments with MXNet.

YAML usage

run:
  kind: mxjob
  cleanPodPolicy:
  schedulingPolicy:
  mode:
  scheduler:
  server:
  worker:
  tuner:
  tunerTracker:
  tunerServer:

Python usage

from polyaxon.schemas import V1KFReplica, V1MXJob
mx_job = V1MXJob(
    clean_pod_policy='All',
    scheduler=V1KFReplica(...),
    server=V1KFReplica(...),
    worker=V1KFReplica(...),
    tuner=V1KFReplica(...),
)
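
The Python keyword arguments use snake_case while the YAML keys use camelCase (`clean_pod_policy` versus `cleanPodPolicy`). The sketch below illustrates that naming correspondence using only the standard library. The helper `to_camel` is hypothetical and not part of Polyaxon. The snake_case names other than those shown in the example above are assumed by analogy with `clean_pod_policy`.

```python
def to_camel(name: str) -> str:
    """Convert a snake_case name to camelCase (hypothetical helper, not a Polyaxon API)."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


# Assumed Python-side names for the YAML keys listed under "YAML usage".
python_fields = [
    "clean_pod_policy",
    "scheduling_policy",
    "mode",
    "scheduler",
    "server",
    "worker",
    "tuner",
    "tuner_tracker",
    "tuner_server",
]

# Derive the camelCase keys as they would appear in the YAML specification.
yaml_keys = [to_camel(field) for field in python_fields]
print(yaml_keys)
```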

Fields

kind

The kind signals to the CLI, client, and other tools that this component’s runtime is an mxjob.

If you are using the Python client to create the runtime, this field is not required and is set by default.

run:
  kind: mxjob

cleanPodPolicy

Controls which pods are deleted when a job terminates. The policy must be one of the following values: All, Running, None.

run:
  kind: mxjob
  cleanPodPolicy: 'All'
 ...
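
As a minimal sketch, the three documented values can be checked before a specification is submitted. The function `check_clean_pod_policy` is a hypothetical helper, not a Polyaxon API, and it assumes the policy is passed as a plain string, so `None` here means the literal string `"None"` rather than Python's `None` object.

```python
# The values documented for cleanPodPolicy, treated as plain strings (assumption).
ALLOWED_CLEAN_POD_POLICIES = ("All", "Running", "None")


def check_clean_pod_policy(value: str) -> str:
    """Return the value unchanged if it is a documented policy, else raise ValueError."""
    if value not in ALLOWED_CLEAN_POD_POLICIES:
        allowed = ", ".join(ALLOWED_CLEAN_POD_POLICIES)
        raise ValueError(f"cleanPodPolicy must be one of: {allowed}; got {value!r}")
    return value


print(check_clean_pod_policy("All"))
```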

schedulingPolicy

SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example minAvailable for gang-scheduling.

run:
  kind: mxjob
  schedulingPolicy:
    ...
 ...

mode

The kind of MXJob to schedule. Different modes may use different replicas.

run:
  kind: mxjob
  mode: 'MXTrain'
 ...

scheduler

The scheduler replica in the distributed MXJob.

run:
  kind: mxjob
  scheduler:
    replicas: 2
    container:
      ...
 ...

server

The server replica in the distributed MXJob.

run:
  kind: mxjob
  server:
    replicas: 2
    container:
      ...
 ...

worker

The worker replica in the distributed MXJob.

run:
  kind: mxjob
  worker:
    replicas: 2
    container:
      ...
 ...

tuner

The tuner replica in the distributed MXJob.

run:
  kind: mxjob
  tuner:
    replicas: 1
    container:
      ...
 ...

tunerTracker

The tuner tracker replica in the distributed MXJob.

run:
  kind: mxjob
  tunerTracker:
    replicas: 1
    container:
      ...
 ...

tunerServer

The tuner server replica in the distributed MXJob.

run:
  kind: mxjob
  tunerServer:
    replicas: 1
    container:
      ...
 ...
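
The replica sections above all share the same shape, so a full `run` section can be assembled from a few counts. The following is a rough sketch using plain dictionaries and the standard library rather than the Polyaxon schemas. The helpers `replica` and `build_mx_run` are hypothetical, and the empty `container` mapping is only a placeholder for the spec that the examples elide.

```python
import json


def replica(replicas: int) -> dict:
    """Build one replica section; the container spec is an empty placeholder."""
    return {"replicas": replicas, "container": {}}


def build_mx_run(mode: str, **replica_counts: int) -> dict:
    """Assemble a dict mirroring the YAML `run` section (hypothetical helper)."""
    run = {"kind": "mxjob", "mode": mode}
    for name, count in replica_counts.items():
        run[name] = replica(count)
    return run


# Replica counts taken from the scheduler, server, and worker examples above.
run = build_mx_run("MXTrain", scheduler=2, server=2, worker=2)

# Sum the replicas across every replica section in the run.
total = sum(
    section["replicas"] for section in run.values() if isinstance(section, dict)
)

print(json.dumps({"run": run}, indent=2))
print("total replicas:", total)
```

The output is JSON rather than YAML only because the standard library has no YAML writer; the key structure is the same as in the YAML examples.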