It’s important to ensure that jobs and services are robust and that they respect SLAs.
Polyaxon exposes a section for handling failure and managing termination.
You can set a default termination on the component level and override the values for each operation, or you can define termination only on specific operations and keep the component minimal. You can also use scheduling presets to define one or more termination configurations and reuse them across several operations.
Handling failures with max retries
Operations can fail for a variety of reasons:
- Pod preemption, node failure, …
- HTTP requests failing when fetching data or assets
- Service or external API down or unavailable for a short period of time
To make your operations resilient to such failures, Polyaxon provides a setting called max_retries, the maximum number of times an operation is retried after it fails.
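As a minimal sketch of the retry setting described above, the fragment below sets a retry limit on a component. It assumes the YAML key is spelled `maxRetries` (the camelCase form of `max_retries`); the image and command are placeholders, not part of the original text.

```yaml
version: 1.1
kind: component
termination:
  maxRetries: 3            # assumed YAML spelling of max_retries: retry up to 3 times
run:
  kind: job
  container:
    image: python:3.11     # placeholder image
    command: ["python", "-c", "print('fetching data')"]  # placeholder command
```

A small value is usually enough for transient failures such as preemption or a short API outage; a deterministic bug will simply fail on every retry.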
Enforcing SLAs with timeout
It’s also important to enforce SLAs (Service Level Agreements) for your operations. Polyaxon provides the timeout section, which stops a job or service if it has not succeeded or terminated on its own within the configured time window.
Timeout can be combined with hooks/notifications to deliver the necessary information to users or external services.
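For illustration, a timeout could be declared as in the sketch below. The value is in seconds, as in the culling examples later on this page; the one-hour limit, image, and command are placeholders chosen for the example.

```yaml
version: 1.1
kind: component
termination:
  timeout: 3600            # stop the run if it has not finished after 1 hour
run:
  kind: job
  container:
    image: python:3.11     # placeholder image
    command: ["python", "-c", "print('training')"]  # placeholder command
```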
Debugging with TTL
The third key in the termination section is ttl. By default, Polyaxon cleans up and removes all cluster resources as soon as an operation is done. It is often necessary to keep a job or a service after it’s done for sanity checks or debugging purposes; setting ttl delays that cleanup for the specified duration.
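A hedged sketch of a ttl setting is shown below. It assumes that ttl, like timeout, is expressed in seconds; the 30-minute value, image, and command are placeholders.

```yaml
version: 1.1
kind: component
termination:
  ttl: 1800                # assumed to be seconds: keep resources for 30 minutes after completion
run:
  kind: job
  container:
    image: python:3.11     # placeholder image
    command: ["python", "-c", "print('done')"]  # placeholder command
```

Keeping the resources around briefly makes it possible to inspect a finished pod before it is removed.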
Optimizing resource usage with service culling
For long-running services like Jupyter notebooks, TensorBoard, or RStudio, Polyaxon provides an idle-based culling mechanism that automatically stops services when they are not actively being used. This is particularly useful for:
- Services that request expensive resources (GPUs, high-memory nodes)
- Preventing resource waste from forgotten sessions
- Automatically freeing up cluster capacity for other workloads
Unlike the absolute timeout, which terminates after a fixed duration regardless of usage, culling only triggers when the service has been idle for the specified period.
Configuration
The culling feature is configured through the termination section with two key components:
termination:
  culling:
    timeout: 3600  # Idle timeout in seconds
  probe:
    http:
      path: "/api/status"
      port: 8888
- culling.timeout: Duration in seconds that the service must be idle before it is terminated
- probe: Defines how to check for activity (HTTP or Exec)
How culling works
- Polyaxon periodically checks the service’s activity status using the configured probe
- If the probe indicates activity, the idle timer resets
- If no activity is detected for the duration specified in culling.timeout, the service is automatically stopped
- On probe errors, the service is assumed to be active to avoid accidental termination
Activity probes
The culling feature works by periodically checking for activity using configurable activity probes:
HTTP probes
HTTP probes poll a service endpoint to detect activity. The endpoint must return a JSON response with a last_activity field containing an RFC3339 timestamp:
{
  "last_activity": "2024-01-15T10:30:00Z",
  "started": "2024-01-15T08:00:00Z"
}
Configuration example for Jupyter:
termination:
  culling:
    timeout: 3600
  probe:
    http:
      path: "/api/status"  # Default path if not specified
      port: 8888           # Uses first service port if not specified
Common HTTP probe paths:
- Jupyter/JupyterLab: /api/status (port 8888)
- Custom services: Implement an endpoint that returns the last_activity JSON field
Exec probes
Exec probes run custom commands inside the container to determine activity status:
termination:
  culling:
    timeout: 7200
  probe:
    exec:
      command: ["bash", "-c", "/scripts/check-activity.sh"]
The command runs in the container’s root (/) directory. Exit code 0 indicates activity, non-zero indicates idle.
Note: Exec probes are currently defined in the API but not yet fully implemented. Use HTTP probes for production workloads.
Combining timeout and culling
You can use both absolute timeout and idle-based culling together. The service will be stopped when either condition is met (whichever happens first):
termination:
  timeout: 86400   # Absolute: stop after 24 hours regardless of activity
  culling:
    timeout: 3600  # Idle: stop after 1 hour of inactivity
  probe:
    http:
      path: "/api/status"
      port: 8888
This configuration ensures:
- Active services are terminated after 24 hours maximum
- Idle services are terminated after 1 hour of inactivity
Troubleshooting
If culling is not working as expected:
- Verify probe configuration: Ensure the path and port match your service’s activity endpoint
- Check endpoint response: The HTTP endpoint must return valid JSON with last_activity in RFC3339 format
- Review logs: Check the mloperator logs for culling-related messages
- Service must be running: Culling only applies to services in the running state
See the services timeout preset documentation for more examples and preset configurations.