Reproducible & Portable Machine Learning packages

As we test release candidates for the upcoming Polyaxon v1.1, several users have been asking about the new Polyaxonfile changes and some incompatibilities with the previous version. This blog post aims to clarify what’s new in the Polyaxonfile specification, why we made some changes, and how to migrate your files to the new version.

What is a Polyaxonfile

Polyaxonfile is Polyaxon’s specification for packaging every aspect of the dependencies, artifacts, environments, and runtime of an operation to schedule on Kubernetes. The specification focuses on the following qualities:

Simplicity: It should be simple enough for a data scientist or an ML engineer to author, and easy to read for other stakeholders, such as DevOps and engineers.
Portability: Polyaxon focuses on running containerized workloads, so a Polyaxonfile is the data science equivalent of a Dockerfile.
Reproducibility: One of the main problems of ML projects is reproducibility. By capturing the information about how to run an operation in a simple manifest, Polyaxon gives its users a reproducibility engine: every time you run a Polyaxonfile, the system knows exactly which code references, inputs, outputs, container images, and artifacts are required to reproduce a similar execution and achieve similar results.
Customization: Polyaxon is a simple system for organizing experimentation and automation of ML projects. To give data scientists and ML engineers the agility to iterate rapidly and deliver more models, the specification should provide an interface to easily pass and check parameters.
Extensibility: Providing a simple interface that can be used by technical and non-technical people is important. But since ML projects are notoriously complex, users will often have advanced requirements to change the behaviour of the underlying containers, environments, or networking interfaces. The specification must open the door for advanced users to make the changes needed to tune their executables.

When Polyaxon started, the main objective was to make machine learning experimentation dead simple, which was to some extent the main driver for several teams adopting Polyaxon as an experimentation tool. To keep the tool and its interfaces simple, we made the conscious decision to restrict what users can and cannot do with the Polyaxonfile specification. This choice was good enough until we started noticing users working around the limitations of Polyaxon’s interface to achieve their workflows.

The new Polyaxonfile specification is as simple as the previous one, but it does not limit what users can do. Most of the time, users will not need to customize several areas of the new Polyaxonfile specification, but now the option is there for them to decide. Furthermore, the new architecture lets users manage the workflows they had before, and also optionally customize and extend several aspects of some plugins and integrations without any deep knowledge of Polyaxon’s internals and without requesting or submitting changes to the platform itself.

Kubernetes

In previous versions of Polyaxon, we exposed some fields from Kubernetes that we thought were necessary for most use cases. Because deployments and environments are complex and diverse, we received requests to support new requirements and to expose more fields and aspects of Kubernetes manifests. This led to two issues:

Users had to wait until such features were implemented.
We either had to support the field natively using a snake case format, which requires more work, or just consume raw Kubernetes camel case without validation.

The Polyaxonfile specification now has native support for several Kubernetes fields. We decided to make the full specification camel case to reduce the cognitive load of reading the manifests, while keeping the specification as simple as possible. For instance, this is the simplest manifest possible:

version: 1.1
kind: component
run:
  container:
    image: "ml-project"

This time, there is no limit on what users can take from the Kubernetes world to extend the manifest and solve their problems.

We are also now exposing a Python interface for authoring Polyaxonfiles, and the CLI and clients can consume both YAML/JSON manifests and Python manifests.

Differences with the previous specification

The new Polyaxonfile architecture resembles the previous one and uses similar naming for many sections, but it focuses on two main primitives: Components and Operations.

Component: A discrete, repeatable, and self-contained action that defines an environment and a runtime.
Operation: Logic to operationalize and execute a component by passing parameters and connections, and possibly patching the run environment.

Technically, Components and Operations codify your data and machine learning logic as specification metafiles. These manifests open the door to reproducing and automating several day-to-day workflows for MLOps and GitOps.

The new architecture is mainly driven and inspired by discussions and contributions from customers and a community of data scientists, ML engineers, developers and software engineers.

Architecture

A component is made of code that performs an action, such as container building, data processing, data transformation, model training, etc. Since Polyaxon schedules containerized workloads, you can use any language to write the logic of your components and just provide the container image to run.

An operation is how Polyaxon executes a component by passing parameters, connections, and a run environment.

With an operation users can:

Pass the parameters for required inputs/outputs or override the default values of optional inputs/outputs.
Patch the definition of the component to set environments, initializers, and resources.
Set termination logic and retries.
Set trigger logic to start a component in a pipeline context.
Parallelize or map the component over a matrix of parameters.
Provide parameters to the components using an iterative optimization algorithm.
Put an operation on a schedule.
Subscribe a component to events to trigger executions automatically.
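
The first and fifth items above (passing or overriding parameters, and mapping a component over a matrix) can be illustrated with a small Python sketch. This is a toy model only, not Polyaxon's Python interface: `Input`, `Component`, `resolve_params`, and `expand_grid` are hypothetical names introduced here, and the real specification may validate and resolve parameters differently.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Any, Dict, List

_MISSING = object()  # sentinel: an input without a default is required


@dataclass
class Input:
    """Hypothetical description of one component input."""
    name: str
    default: Any = _MISSING

    @property
    def required(self) -> bool:
        return self.default is _MISSING


@dataclass
class Component:
    """Hypothetical stand-in for a component: an image plus declared inputs."""
    image: str
    inputs: List[Input] = field(default_factory=list)


def resolve_params(component: Component, params: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the params passed by an operation with the component defaults."""
    declared = {inp.name for inp in component.inputs}
    unknown = set(params) - declared
    if unknown:
        raise ValueError(f"unknown params: {sorted(unknown)}")
    resolved = {}
    for inp in component.inputs:
        if inp.name in params:
            resolved[inp.name] = params[inp.name]  # passed or overridden
        elif inp.required:
            raise ValueError(f"missing required input: {inp.name}")
        else:
            resolved[inp.name] = inp.default  # optional input keeps its default
    return resolved


def expand_grid(matrix: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one params dict per combination of the matrix values."""
    names = list(matrix)
    combos = product(*(matrix[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


train = Component(
    image="ml-project",
    inputs=[Input("dataset"), Input("lr", default=0.01)],
)

# One operation: "dataset" is required, "lr" falls back to its default.
single = resolve_params(train, {"dataset": "mnist"})

# The same component mapped over a small matrix of learning rates.
runs = [
    resolve_params(train, {"dataset": "mnist", **combo})
    for combo in expand_grid({"lr": [0.1, 0.01]})
]
```

The component stays unchanged in both cases; only the operation side (the params and the matrix) varies, which is the separation the two primitives are meant to provide.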

Figure: Polyaxonfile Architecture

Context

Knowing exactly what could be resolved in a Polyaxonfile was a major pain: users were restricted to templating only the command section using the passed parameters. The new Polyaxonfile not only documents clearly what users can resolve from the context, in addition to the information they passed in the parameters, but also allows templating different sections of the manifest. It comes with a new multi-stage compiler that resolves information and augments the context as it progresses through the resolution pipeline.
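
To make the idea of a context that grows stage by stage more concrete, here is a minimal Python sketch. It is not the actual compiler: the `{{ key }}` placeholder syntax, the stage functions, and the context keys `run.uuid` and `run.outputs_path` are assumptions made up for this illustration.

```python
import re
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Stage = Callable[[Context], Context]

# Matches placeholders such as {{ lr }} or {{ run.outputs_path }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, context: Context) -> str:
    """Replace each placeholder, failing loudly on keys missing from the context."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in context:
            raise KeyError(f"cannot resolve '{key}' from the context")
        return str(context[key])

    return _PLACEHOLDER.sub(substitute, template)


def params_stage(params: Context) -> Stage:
    """Stage 1: seed the context with the params passed by the user."""
    return lambda context: {**context, **params}


def run_info_stage(context: Context) -> Context:
    """Stage 2: add values only known at scheduling time (hypothetical keys)."""
    run_uuid = "8aac02e3"  # placeholder value for illustration
    return {
        **context,
        "run.uuid": run_uuid,
        "run.outputs_path": f"/outputs/{run_uuid}",
    }


def compile_section(template: str, stages: List[Stage]) -> str:
    """Run every stage in order, then render the section with the final context."""
    context: Context = {}
    for stage in stages:
        context = stage(context)
    return render(template, context)


command = compile_section(
    "python train.py --lr={{ lr }} --out={{ run.outputs_path }}",
    [params_stage({"lr": 0.01}), run_info_stage],
)
```

Each stage returns a new, larger context, so a later section of the manifest can reference values that did not exist when the user wrote their parameters.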

For paying customers, the compiler also checks the Auth, ACL, and RBAC system to validate that the manifest can be executed by a specific user within a specific project. For instance, if a Polyaxonfile requires access to a volume, an S3 bucket, a secret, a namespace, or a cluster resource such as a TPU that the user cannot access, either globally or within the project, the run fails immediately with a CompilationError, even though the Polyaxonfile is correct and can be executed by other team members who have sufficient access rights.
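
The access check described above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the `grants` mapping, the `check_access` helper, and the local `CompilationError` class are stand-ins, and the real Auth, ACL, and RBAC checks are certainly richer than a set difference.

```python
from typing import Dict, Iterable, Set, Tuple

# (user, scope) -> resources the user may access; "*" marks a global grant.
Grants = Dict[Tuple[str, str], Set[str]]
GLOBAL = "*"


class CompilationError(Exception):
    """Local stand-in named after the error mentioned in the text."""


def check_access(user: str, project: str, required: Iterable[str], grants: Grants) -> None:
    """Raise CompilationError if any required resource is not granted to the user."""
    allowed = grants.get((user, GLOBAL), set()) | grants.get((user, project), set())
    missing = sorted(set(required) - allowed)
    if missing:
        raise CompilationError(f"{user} cannot access {missing} in project {project}")


grants: Grants = {
    ("alice", GLOBAL): {"s3://datasets"},
    ("alice", "vision"): {"tpu"},
    ("bob", "vision"): {"s3://datasets"},
}
required = ["s3://datasets", "tpu"]  # what the manifest asks for

# The same manifest compiles for alice and fails for bob.
check_access("alice", "vision", required, grants)

try:
    check_access("bob", "vision", required, grants)
    bob_error = None
except CompilationError as exc:
    bob_error = str(exc)
```

The point of failing at compile time is that the manifest itself is valid; only the combination of user, project, and requested resources is rejected, before anything is scheduled.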

Use cases

Traditionally, Polyaxon supported a couple of workflows:

Drive fast experimentation cycles.
Deploy and manage Notebooks and Tensorboards.

Both of these use cases are still handled in a similar fashion as in Polyaxon v0. With the new architecture, users can drive even faster experimentation cycles because the platform does not require a build step, and they no longer have to request changes to support new versions of Tensorboard or Jupyter Notebook/Hub. Instead of requesting new features, users can modify the public Tensorboard or Notebook components, or even create specific components with a custom runtime, and still start these components in a simple way, e.g.:

polyaxon run --hub jupyter-notebook

This new architecture also opens up a new set of possibilities for versioning components with domain knowledge, since they can be tagged without requiring new API endpoints or new CLI commands.

For instance, to start a Tensorboard based on a single run:

polyaxon run --hub tensorboard:single-run -P...

To start a Tensorboard based on multiple runs:

polyaxon run --hub tensorboard:multi-runs -P...

To start a Tensorboard based on a path:

polyaxon run --hub tensorboard:path -P...

Component Hub

Because authoring these new components and plugins is as simple as creating what users create every day for running their workloads, extensibility comes as a natural feature. Any employee with domain knowledge can provide a new component for processing data from a specific data store or database, or create a new dashboard based on Streamlit or Voila notebooks.

For instance, adding a VS Code component took less than 30 minutes of work, and now users can use a GPU-enabled remote environment to drive fast experimentation using either a Notebook or a VS Code session:

polyaxon run --hub vscode

Polyaxon already provides several reusable components that can be started using the CLI/Client, and they are live at https://github.com/polyaxon/polyaxon-hub/

For customers, we started a beta of a new feature: Private Component Hub. It integrates seamlessly with their team management and organization ACL and RBAC, so that they can create a new dashboard or an app based on their favorite open-source tools and share the results with the rest of the team, knowing that only people with read access can view the new dashboard.

Figure: Polyaxon Component Hub

Conclusion

The new specification would not have been possible without the feedback we have received. We are confident that this new architecture is here to stay, since it provides the interface and extensibility needed to drive new features and plugins in a simple way.

To learn more about Polyaxon’s new specification, please visit the specification docs.

Learn More about Polyaxon

Polyaxon continues to grow quickly and keeps improving and providing the simplest machine learning layer on Kubernetes.