V1Termination

polyaxon._flow.termination.V1Termination()

The termination section lets you define and control when to stop an operation and how long to keep its resources on the cluster.

  • Args:
    • max_retries: int, optional
    • ttl: int, optional
    • timeout: int, optional
    • culling: V1Culling, optional
    • probe: V1ActivityProbe, optional
    • pod_failure_policy: V1PodFailurePolicy, optional

YAML usage

termination:
  maxRetries:
  ttl:
  timeout:
  culling:
    timeout: 3600
  probe:
    http:
      path: "/api/status"
      port: 8888
  podFailurePolicy:
    rules:
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget

Python usage

from polyaxon.schemas import V1Termination, V1Culling, V1ActivityProbe, V1ActivityProbeHttp

termination = V1Termination(
    max_retries=1,
    ttl=1000,
    timeout=50,
    culling=V1Culling(timeout=3600),
    probe=V1ActivityProbe(
        http=V1ActivityProbeHttp(path="/api/status", port=8888)
    )
)

Fields

maxRetries

Maximum number of retries when an operation fails.

This field can be used with restartPolicy from the environment section.

This field is the equivalent of the Kubernetes backoffLimit. Polyaxon exposes a uniform specification and knows how to manage and inject this value into the underlying primitive of the runtime, e.g. Job, Service, TFJob CRD, RayCluster CRD, …

termination:
  maxRetries: 3

ttl

Polyaxon automatically cleans up all resources once they finish and once the various helpers have finished collecting and archiving information from the cluster, such as logs, outputs, … This keeps your cluster(s) clean and ensures that no leftover resources put pressure on the API server.

That being said, users sometimes want to keep the resources after they finish for sanity checks or debugging.

The ttl field lets you leverage the TTL controller provided by some primitives, for example ttlSecondsAfterFinished from the Job controller. Polyaxon has helpers for resources that don’t have a built-in TTL mechanism, such as services, so that you can have a uniform definition for all of your operations.

termination:
  ttl: 1000

timeout

Sometimes you might want to stop an operation after a certain time. The timeout field lets you define how long Polyaxon waits before stopping that operation. This is the equivalent of the Kubernetes Job activeDeadlineSeconds, but you can use this field for all runtimes. For instance, you might want to stop a tensorboard after 12 hours, so that you don’t have to actively look for running tensorboards.

Timeout is also how you can enforce SLAs (Service Level Agreements): if an operation has not succeeded within the timeout, Polyaxon will stop it. Timeout can be combined with hooks/notifications to notify a user or a system about the details of the failure or the stopping of the operation.

termination:
  timeout: 1000
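
The equivalences named for maxRetries, ttl, and timeout can be summarized as a plain mapping onto Kubernetes Job spec fields. The sketch below is only an illustration of that correspondence: the helper name `termination_to_job_spec` is hypothetical and is not Polyaxon's actual converter, which also has to handle runtimes that lack these native fields.

```python
from typing import Dict, Optional

# Termination field -> Kubernetes Job spec field, as described in the text.
_JOB_FIELD_MAP = {
    "maxRetries": "backoffLimit",
    "ttl": "ttlSecondsAfterFinished",
    "timeout": "activeDeadlineSeconds",
}


def termination_to_job_spec(termination: Dict[str, Optional[int]]) -> Dict[str, int]:
    """Hypothetical helper: translate the termination fields that are set."""
    spec: Dict[str, int] = {}
    for field, job_field in _JOB_FIELD_MAP.items():
        value = termination.get(field)
        if value is not None:  # unset fields are simply left out
            spec[job_field] = value
    return spec


print(termination_to_job_spec({"maxRetries": 3, "ttl": 1000, "timeout": None}))
# {'backoffLimit': 3, 'ttlSecondsAfterFinished': 1000}
```

Fields that are left unset are omitted from the result, so the underlying primitive keeps its own defaults.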

culling

Note: Available from v2.12

Idle-based termination configuration for long-running services. Unlike the absolute timeout, which stops a service after a fixed duration, culling only triggers when the service has been idle for the specified period. This is particularly useful for services like Jupyter notebooks that may run for long periods but are only actively used occasionally.

Culling requires an activity probe (see probe field) to determine when the service is idle.

termination:
  culling:
    timeout: 3600  # Stop after 1 hour of idle time

You can combine both timeout and culling. The service will be stopped when either condition is met (whichever happens first):

termination:
  timeout: 86400   # Absolute: stop after 24 hours
  culling:
    timeout: 3600  # Idle: stop after 1 hour of inactivity
  probe:
    http:
      path: "/api/status"
      port: 8888
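
The "whichever happens first" rule can be pictured as two independent checks. This is a minimal sketch of that decision, not Polyaxon's implementation: the function `stop_reason` and its parameters are hypothetical, and treating the boundary as inclusive (`>=`) is an assumption.

```python
from typing import Optional


def stop_reason(
    elapsed: float,
    idle: float,
    timeout: Optional[int] = None,
    culling_timeout: Optional[int] = None,
) -> Optional[str]:
    """Return why a service should stop, or None if it should keep running.

    elapsed: seconds since the service started.
    idle: seconds since the activity probe last reported activity.
    """
    if timeout is not None and elapsed >= timeout:
        return "timeout"  # absolute limit reached
    if culling_timeout is not None and idle >= culling_timeout:
        return "culling"  # idle limit reached
    return None


# A notebook running for 2 hours that has been idle for the last hour.
print(stop_reason(elapsed=7200, idle=3600, timeout=86400, culling_timeout=3600))
# culling
```

With the values from the YAML above, an actively used service runs for at most 24 hours, while an unused one is stopped after its first idle hour.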

probe

Note: Available from v2.12

Activity probe configuration that defines how to check if a service is active or idle. This is used in conjunction with the culling field to implement idle-based termination.

Two probe types are supported:

HTTP probe - Polls an HTTP endpoint to check for activity (recommended for Jupyter):

termination:
  probe:
    http:
      path: "/api/status"
      port: 8888

Exec probe - Runs a custom command to check for activity:

termination:
  probe:
    exec:
      command: ["bash", "-c", "check-activity.sh"]

The probe is periodically executed to determine if the service is active. For HTTP probes, the endpoint should return activity information. For exec probes, the command should exit with code 0 for active, 1 for idle.
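
As an illustration of the exec probe convention (exit code 0 for active, 1 for idle), here is a sketch of the logic a script like check-activity.sh might implement, written in Python. Everything in it is an assumption made for the example: the idea of using file modification times as the activity signal, the 10-minute threshold, and the function names.

```python
import os
import time

IDLE_AFTER_SECONDS = 600  # assumption: no file change for 10 minutes means idle


def latest_mtime(root: str) -> float:
    """Return the most recent file modification time under root (0.0 if none)."""
    latest = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                latest = max(latest, os.path.getmtime(os.path.join(dirpath, name)))
            except OSError:
                continue  # file disappeared between listing and stat
    return latest


def probe_exit_code(root: str, now: float, idle_after: float = IDLE_AFTER_SECONDS) -> int:
    """Return 0 when active and 1 when idle, matching the exec probe convention."""
    return 0 if now - latest_mtime(root) < idle_after else 1
```

A probe script built on this would finish with something like `sys.exit(probe_exit_code(workspace_dir, time.time()))`, where `workspace_dir` is whatever directory the service writes to.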

See the services timeout preset documentation for detailed examples and use cases.

podFailurePolicy

Note: Available from v2.13. Requires Kubernetes v1.25+.

Pod failure policy configuration that defines fine-grained rules for how pod failures should be handled. This feature allows you to:

  • Fail jobs immediately on certain exit codes (non-retriable errors)
  • Ignore failures due to involuntary disruptions (preemption, eviction)
  • Control which failures count towards the backoff limit

termination:
  maxRetries: 3
  podFailurePolicy:
    rules:
      # Fail immediately on exit code 42 (non-retriable error)
      - action: FailJob
        onExitCodes:
          containerName: main
          operator: In
          values: [42]
      # Ignore pod disruptions (preemption, eviction)
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget

Available actions:

  • FailJob: Mark the job as failed immediately without further retries
  • Ignore: Don’t count this failure towards the backoff limit
  • Count: Count towards backoff limit (default behavior)
  • FailIndex: Fail the index for indexed jobs
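
To make the rule semantics concrete, the sketch below models how an ordered list of rules resolves to one action: the first matching rule wins, and a failure that matches no rule is counted as usual. It is a simplified model written for this page, not the Kubernetes Job controller's code; the function `resolve_action` is hypothetical and the status of pod conditions is not modeled.

```python
from typing import Any, Dict, List


def resolve_action(
    rules: List[Dict[str, Any]],
    container: str,
    exit_code: int,
    pod_conditions: List[str],
) -> str:
    """Return the action of the first matching rule, or "Count" if none match."""
    for rule in rules:
        on_exit = rule.get("onExitCodes")
        if on_exit is not None:
            name = on_exit.get("containerName")  # None means any container
            in_values = exit_code in on_exit["values"]
            matched = in_values if on_exit["operator"] == "In" else not in_values
            # Exit code rules only concern failed containers (non-zero exit).
            if exit_code != 0 and name in (None, container) and matched:
                return rule["action"]
        for condition in rule.get("onPodConditions", []):
            if condition["type"] in pod_conditions:
                return rule["action"]
    return "Count"  # default behavior: count towards the backoff limit


rules = [
    {"action": "FailJob",
     "onExitCodes": {"containerName": "main", "operator": "In", "values": [42]}},
    {"action": "Ignore", "onPodConditions": [{"type": "DisruptionTarget"}]},
]
print(resolve_action(rules, "main", 42, []))                     # FailJob
print(resolve_action(rules, "main", 137, ["DisruptionTarget"]))  # Ignore
print(resolve_action(rules, "main", 1, []))                      # Count
```

The `rules` list mirrors the YAML example above, so the three calls show a non-retriable exit code, a preempted pod, and an ordinary failure that still consumes one of the maxRetries attempts.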

See the Kubernetes Pod Failure Policy documentation for more details.