V1Termination
polyaxon._flow.termination.V1Termination()
The termination section lets you define and control when to stop an operation and how long to keep its resources on the cluster.
- Args:
  - max_retries: int, optional
  - ttl: int, optional
  - timeout: int, optional
  - culling: V1Culling, optional
  - probe: V1ActivityProbe, optional
  - pod_failure_policy: V1PodFailurePolicy, optional
YAML usage
termination:
  maxRetries:
  ttl:
  timeout:
  culling:
    timeout: 3600
  probe:
    http:
      path: "/api/status"
      port: 8888
  podFailurePolicy:
    rules:
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget

Python usage
from polyaxon.schemas import V1Termination, V1Culling, V1ActivityProbe, V1ActivityProbeHttp

termination = V1Termination(
    max_retries=1,
    ttl=1000,
    timeout=50,
    culling=V1Culling(timeout=3600),
    probe=V1ActivityProbe(
        http=V1ActivityProbeHttp(path="/api/status", port=8888)
    )
)

Fields
maxRetries
Maximum number of retries when an operation fails.
This field can be used with restartPolicy from the environment section.
This field is the equivalent of the backoffLimit. Polyaxon exposes a uniform specification and knows how to manage and inject this value into the underlying primitive of the runtime, i.e. Job, Service, TFJob CRD, RayCluster CRD, …
termination:
  maxRetries: 3

ttl
Polyaxon will automatically clean all resources just after they finish and after the various helpers finish collecting and archiving information from the cluster, such as logs, outputs, … This ensures that your cluster(s) are kept clean and no resources are actively putting pressure on the API server.
That being said, users sometimes want to keep the resources after they finish, for sanity checks or debugging.
The ttl field lets you leverage the TTL controller provided by some primitives, for example ttlSecondsAfterFinished from the Job controller. Polyaxon has helpers for resources that don’t have a built-in TTL mechanism, such as services, so that you can have a uniform definition for all of your operations.
termination:
  ttl: 1000

timeout
Sometimes you might want to stop an operation after a certain time. The timeout field lets you define how
long Polyaxon waits before stopping that operation. This is the equivalent of the Kubernetes Jobs
activeDeadlineSeconds,
but you can use this field for all runtimes. For instance, you might want to stop a
tensorboard after 12 hours, so that you don’t have to actively look for running tensorboards.
Timeout is also how you can enforce SLAs (Service Level Agreements):
if an operation has not succeeded within the timeout, Polyaxon will stop it.
Timeout can be combined with hooks/notifications to
notify a user or a system about the details of the failure or about the operation being stopped.
termination:
  timeout: 1000

culling
Note: Available from v2.12
Idle-based termination configuration for long-running services. Unlike the absolute timeout
which stops a service after a fixed duration, culling only triggers when the service has been
idle for the specified period. This is particularly useful for services like Jupyter notebooks
that may run for long periods but are only actively used occasionally.
Culling requires an activity probe (see probe field) to determine when the service is idle.
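The difference between the absolute timeout and idle-based culling can be sketched in a few lines of Python. This is only an illustrative model, not Polyaxon's implementation: the `stop_reason` helper, its parameter names, and the use of `>=` for the comparisons are assumptions made for this sketch, and all durations are in seconds.

```python
from typing import Optional


def stop_reason(
    elapsed: float,
    idle: float,
    timeout: Optional[float] = None,
    culling_timeout: Optional[float] = None,
) -> Optional[str]:
    """Return why a service would be stopped, or None if it keeps running.

    Hypothetical helper for illustration only; all values are in seconds.
    """
    # Absolute limit: applies regardless of how active the service is.
    if timeout is not None and elapsed >= timeout:
        return "timeout"
    # Idle limit: applies only once the service has been idle long enough.
    if culling_timeout is not None and idle >= culling_timeout:
        return "culling"
    return None


# A notebook running for 5 hours that was used 10 minutes ago keeps running.
print(stop_reason(elapsed=5 * 3600, idle=600, timeout=86400, culling_timeout=3600))
# The same notebook left idle for more than an hour is culled.
print(stop_reason(elapsed=5 * 3600, idle=4000, timeout=86400, culling_timeout=3600))
```

In this model, whichever limit is reached first decides the outcome, and a service that stays active is only ever stopped by the absolute timeout.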
termination:
  culling:
    timeout: 3600  # Stop after 1 hour of idle time

You can combine both timeout and culling. The service will be stopped when either
condition is met (whichever happens first):
termination:
  timeout: 86400  # Absolute: stop after 24 hours
  culling:
    timeout: 3600  # Idle: stop after 1 hour of inactivity
  probe:
    http:
      path: "/api/status"
      port: 8888

probe
Note: Available from v2.12
Activity probe configuration that defines how to check if a service is active or idle.
This is used in conjunction with the culling field to implement idle-based termination.
Two probe types are supported:
HTTP probe - Polls an HTTP endpoint to check for activity (recommended for Jupyter):
termination:
  probe:
    http:
      path: "/api/status"
      port: 8888

Exec probe - Runs a custom command to check for activity:
termination:
  probe:
    exec:
      command: ["bash", "-c", "check-activity.sh"]

The probe is periodically executed to determine if the service is active. For HTTP probes, the endpoint should return activity information. For exec probes, the command should exit with code 0 for active, 1 for idle.
See services timeout preset documentation for detailed examples and use cases.
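As an illustration of the exec probe exit-code convention (0 for active, 1 for idle), here is a minimal Python sketch of what an activity check could look like. Everything in it is an assumption made for the example: the marker-file approach, the `activity_exit_code` and `main` names, and the 300-second threshold are hypothetical and not part of Polyaxon.

```python
import os
import time

IDLE_AFTER_SECONDS = 300  # hypothetical threshold for this sketch


def activity_exit_code(
    last_activity: float, now: float, idle_after: float = IDLE_AFTER_SECONDS
) -> int:
    """Map the time since the last activity to a probe exit code: 0 active, 1 idle."""
    return 0 if (now - last_activity) < idle_after else 1


def main(marker_path: str) -> int:
    # Assumption: the service touches `marker_path` whenever it does work,
    # so the file's modification time is the last known activity.
    try:
        last_activity = os.path.getmtime(marker_path)
    except OSError:
        # No marker file found: report the service as idle.
        return 1
    return activity_exit_code(last_activity, time.time())


# In a real probe script, the result would become the process exit code, e.g.:
#     raise SystemExit(main("/tmp/last-activity"))
```

Keeping the decision in a pure function such as `activity_exit_code` makes the idle rule easy to test separately from the file system access.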
podFailurePolicy
Note: Available from v2.13. Requires Kubernetes v1.25+.
Pod failure policy configuration that defines fine-grained rules for how pod failures should be handled. This feature allows you to:
- Fail jobs immediately on certain exit codes (non-retriable errors)
- Ignore failures due to involuntary disruptions (preemption, eviction)
- Control which failures count towards the backoff limit
termination:
  maxRetries: 3
  podFailurePolicy:
    rules:
      # Fail immediately on exit code 42 (non-retriable error)
      - action: FailJob
        onExitCodes:
          containerName: main
          operator: In
          values: [42]
      # Ignore pod disruptions (preemption, eviction)
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget

Available actions:
- FailJob: Mark the job as failed immediately without further retries
- Ignore: Don’t count this failure towards the backoff limit
- Count: Count towards backoff limit (default behavior)
- FailIndex: Fail the index for indexed jobs
See Kubernetes Pod Failure Policy for more details.
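To make the rule semantics concrete, the following Python sketch models how an ordered list of rules could map a pod failure to an action. It is a simplified illustration, not the Kubernetes or Polyaxon implementation: the `matches` and `resolve_action` helpers are hypothetical, the sketch looks at a single container, and it assumes that rules are evaluated in order with the first match winning and Count as the fallback.

```python
from typing import Any, Dict, Iterable, List


def matches(rule: Dict[str, Any], container: str, exit_code: int, conditions: List[str]) -> bool:
    """Check whether a single rule applies to a failed container (simplified)."""
    on_exit = rule.get("onExitCodes")
    if on_exit is not None:
        name = on_exit.get("containerName")
        if name is not None and name != container:
            return False
        # This sketch only considers failed containers (non-zero exit code).
        if exit_code == 0:
            return False
        in_values = exit_code in on_exit["values"]
        return in_values if on_exit["operator"] == "In" else not in_values
    on_conditions = rule.get("onPodConditions") or []
    return any(cond["type"] in conditions for cond in on_conditions)


def resolve_action(
    rules: List[Dict[str, Any]],
    container: str,
    exit_code: int,
    conditions: Iterable[str] = (),
) -> str:
    """Return the action of the first matching rule, or "Count" if none match."""
    for rule in rules:
        if matches(rule, container, exit_code, list(conditions)):
            return rule["action"]
    return "Count"


# Same rules as the YAML example above, written as plain dictionaries.
rules = [
    {"action": "FailJob",
     "onExitCodes": {"containerName": "main", "operator": "In", "values": [42]}},
    {"action": "Ignore", "onPodConditions": [{"type": "DisruptionTarget"}]},
]

print(resolve_action(rules, "main", 42))                         # FailJob
print(resolve_action(rules, "main", 137, ["DisruptionTarget"]))  # Ignore
print(resolve_action(rules, "main", 1))                          # Count
```

With maxRetries set to 3, only the last kind of failure in this model would consume one of the retries.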