The termination section lets you define and control when to stop an operation and how long to keep its resources on the cluster.
- max_retries: int, optional
- ttl: int, optional
- timeout: int, optional
from polyaxon.schemas import V1Termination
# Illustrative values only; all three fields are optional.
termination = V1Termination(max_retries=1, ttl=1000, timeout=50)
The max_retries field sets the maximum number of retries when an operation fails.
This field can be used with restartPolicy from the environment section.
This field is the equivalent of the Kubernetes backoffLimit. Polyaxon exposes a uniform specification and knows how to manage and inject this value into the underlying primitive of the runtime, e.g. Job, Service, TFJob CRD, RayJob CRD, …
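The retry limit described above can be pictured as a bounded retry loop. The sketch below is not Polyaxon code: `run_with_retries` is a hypothetical helper, and it assumes the common reading that max_retries counts the re-runs allowed after the first failed attempt.

```python
from typing import Callable


def run_with_retries(operation: Callable[[], bool], max_retries: int) -> int:
    """Run `operation` until it succeeds or the retries are exhausted.

    Hypothetical helper for illustration. Returns the number of attempts made.
    """
    attempts = 0
    # One initial attempt plus at most `max_retries` further attempts.
    while attempts <= max_retries:
        attempts += 1
        if operation():
            break
    return attempts


# An operation that fails twice and then succeeds.
outcomes = iter([False, False, True])
attempts_used = run_with_retries(lambda: next(outcomes), max_retries=2)
print(attempts_used)
```

With max_retries=2 the operation gets three attempts in total, so an operation that keeps failing is abandoned after the third one.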
Polyaxon automatically cleans up all resources just after they finish and after the various helpers finish collecting and archiving information from the cluster, such as logs, outputs, … This ensures that your cluster(s) are kept clean and that no resources are actively putting pressure on the API server.
That being said, users sometimes want to keep the resources after they finish, for sanity checks or debugging.
The ttl field lets you leverage the TTL controller provided by some primitives, for example ttlSecondsAfterFinished from the Job controller. Polyaxon has helpers for resources that don’t have a built-in TTL mechanism, such as services, so that you can have a uniform definition for all of your operations.
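To make the effect of ttl concrete, the sketch below computes the earliest time at which a finished operation's resources could be removed. `cleanup_due_at` is a hypothetical helper, not part of Polyaxon. It assumes ttl is expressed in seconds, by analogy with ttlSecondsAfterFinished, and it ignores the time the helpers spend collecting logs and outputs.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cleanup_due_at(finished_at: datetime, ttl: Optional[int]) -> datetime:
    """Return the earliest time the resources of a finished operation may be removed.

    Hypothetical helper; assumes `ttl` is in seconds. Without a ttl, the
    resources are treated as eligible for cleanup as soon as the run finishes.
    """
    return finished_at + timedelta(seconds=ttl or 0)


finished = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
due = cleanup_due_at(finished, ttl=3600)
print(due.isoformat())
```

Under these assumptions, a ttl of 3600 keeps the resources around for one hour after the operation finishes, which leaves a window for inspection or debugging.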
Sometimes you might want to stop an operation after a certain time. The timeout field lets you define how long Polyaxon waits before stopping that operation. This is the equivalent of activeDeadlineSeconds on Kubernetes Jobs, but you can use this field for all runtimes. For instance, you might want to stop a tensorboard after 12 hours, so that you don’t have to actively look for running tensorboards. Timeout is also how you can enforce SLAs (Service Level Agreements): if an operation has not succeeded within that time, Polyaxon will stop it. Timeout can be combined with hooks/notifications to notify a user or a system about the details of the failure or the stopping of the operation.
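The correspondence between the three termination fields and a Kubernetes Job can be sketched as a small translation function. This is an illustration only: `job_spec_fields` is a hypothetical helper, the pairing of timeout with activeDeadlineSeconds is an assumption based on the Kubernetes Job API, and how Polyaxon performs the translation internally for each runtime is not shown here.

```python
from typing import Any, Dict, Optional


def job_spec_fields(
    max_retries: Optional[int] = None,
    ttl: Optional[int] = None,
    timeout: Optional[int] = None,
) -> Dict[str, Any]:
    """Translate termination settings into Kubernetes Job spec fields.

    Hypothetical helper for illustration; unset fields are omitted.
    """
    mapping = {
        "backoffLimit": max_retries,
        "ttlSecondsAfterFinished": ttl,
        "activeDeadlineSeconds": timeout,
    }
    return {key: value for key, value in mapping.items() if value is not None}


# An operation allowed two retries and stopped after 12 hours.
fields = job_spec_fields(max_retries=2, timeout=12 * 60 * 60)
print(fields)
```

Leaving a field unset simply omits it from the result, which mirrors the fact that all three termination fields are optional.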