Add artifact tracking and log metadata to the lineage table - Tracking

Polyaxon provides two methods for tracking assets and artifacts that you generate in your jobs:

Versioned assets.
Reference-only lineage metadata.

Overview

For each run, Polyaxon creates an artifacts folder with a predefined structure to organize your assets and outputs.

outputs: This is where users can put custom artifacts that are not generated by the Polyaxon library. In addition, both log_artifact(...) and log_model(...) move the file/dir under outputs if no step is provided.
assets: This folder is populated automatically when a log_... method is called; both an event file and an asset are created.
events: This folder is populated automatically when a log_... method is called.
plxlogs: This folder is populated automatically if archiving logs is enabled.
resources: This folder is populated automatically if resources tracking is enabled.
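
The layout above can be pictured with a small standard-library sketch. It does not call Polyaxon: the helper name make_run_layout and the temporary root directory are assumptions made for illustration, and in a real run Polyaxon creates and populates these folders itself.

```python
import tempfile
from pathlib import Path

# Sub-folders named in the list above.
RUN_SUBFOLDERS = ("outputs", "assets", "events", "plxlogs", "resources")


def make_run_layout(root: Path, run_uuid: str) -> Path:
    """Create an empty folder tree shaped like a run's artifacts folder.

    Hypothetical helper for illustration only, not part of Polyaxon.
    """
    run_path = root / run_uuid
    for name in RUN_SUBFOLDERS:
        (run_path / name).mkdir(parents=True, exist_ok=True)
    return run_path


with tempfile.TemporaryDirectory() as tmp:
    run_path = make_run_layout(Path(tmp), "run-uuid")
    created = sorted(p.name for p in run_path.iterdir())
    print(created)
```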

Users can also save content in other subpaths as they wish, for instance by uploading data to an uploads or code subpath.

Note: at runtime, users should only save their custom artifacts under outputs to leverage automatic artifact management. We suggest using get_outputs_path to get a new outputs subpath and ensure that it exists.

For each run, users can get the artifacts_path and outputs_path using:

tracking.get_artifacts_path()
tracking.get_outputs_path()

They can also get a relative path to these root paths:

tracking.get_artifacts_path("code/file.py")
tracking.get_outputs_path("model/model.h5")
tracking.get_outputs_path("sub-folder", is_dir=True)
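
To make the behavior of these helpers more concrete, here is a rough stand-in written with the standard library only. outputs_path is a hypothetical function, not Polyaxon's implementation. It assumes that a relative path is joined to the outputs root and that the call makes sure the needed directory exists: the directory itself when is_dir=True, the parent directory otherwise.

```python
import os
import tempfile


def outputs_path(outputs_root: str, rel_path: str = "", is_dir: bool = False) -> str:
    """Join rel_path to the outputs root and make sure its directory exists.

    Hypothetical stand-in for illustration; the real helper may behave differently.
    """
    if not rel_path:
        # No relative path: the outputs root itself is the directory to ensure.
        path, is_dir = outputs_root, True
    else:
        path = os.path.join(outputs_root, rel_path)
    # For a directory, create the path itself; for a file, create only its parent.
    to_create = path if is_dir else os.path.dirname(path)
    os.makedirs(to_create, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "run-uuid", "outputs")
    model_file = outputs_path(root, "model/model.h5")
    sub_folder = outputs_path(root, "sub-folder", is_dir=True)
    file_parent_exists = os.path.isdir(os.path.dirname(model_file))
    file_exists = os.path.exists(model_file)
    folder_exists = os.path.isdir(sub_folder)
```

In this sketch the file path is only reserved: its parent folder exists, but writing the file is still up to the caller.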

Polyaxon also exposes three types of logging methods:

Reference logging: these are the methods ending with _ref. They generally only log the lineage reference; the user has to save the artifact manually.
Versioned assets logging: these are the methods that save both the asset and the corresponding event file.
Generic methods: log_artifact and log_model. These two methods can save versioned assets (default behavior), or move the asset under the outputs folder and save the lineage if versioned=False.

Reference logging

Reference logging is useful when the user:

needs more control over where the asset is saved.
does not need to log multiple versions of the asset during the run, i.e. the asset is only saved once.

In reference logging, the user is responsible for saving the artifacts. You can use a path relative to the run’s root artifacts path or to the run’s outputs path:

tracking.get_outputs_path("model/model.h5")

After saving the artifact under that path, you can decide if you want to create a new entry in the lineage table:

asset_path = tracking.get_outputs_path("custom_artifacts/filename.ext")
custom_logic_to_save_the_file(asset_path)

# This is optional, it tracks lineage only
tracking.log_artifact_ref(path=asset_path, ...)

Sometimes you will need to save an artifact on a different backend rather than on the artifacts store. You can still use the ref method to log the lineage:

asset_path = "{}/file.ext".format(S3_URI)
custom_logic_to_save_in_s3(asset_path)

# This is optional, it tracks lineage only
tracking.log_artifact_ref(path=asset_path, name="myfile", summary={"extra_key": "extra_value"}, ...)

In this case the file was not saved to the default artifacts store but to a custom S3 bucket, and a new entry with that information was added to the lineage table.
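
The idea that a reference only records where an artifact lives, without copying or moving it, can be sketched with plain Python. Everything here is hypothetical: LineageEntry, lineage_table and log_ref are illustrative names, the fields simply mirror the arguments shown above (path, name, kind, summary), and the bucket URI is made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LineageEntry:
    """Hypothetical stand-in for one row of the lineage table."""
    name: str
    path: str
    kind: str = "file"
    summary: Dict[str, Any] = field(default_factory=dict)


# In-memory list standing in for the lineage table.
lineage_table: List[LineageEntry] = []


def log_ref(path: str, name: str, kind: str = "file",
            summary: Optional[Dict[str, Any]] = None) -> LineageEntry:
    """Record where an artifact lives without copying or moving it."""
    entry = LineageEntry(name=name, path=path, kind=kind, summary=dict(summary or {}))
    lineage_table.append(entry)
    return entry


# "s3://my-bucket/run-data" is a made-up URI standing in for S3_URI.
asset_path = "{}/file.ext".format("s3://my-bucket/run-data")
entry = log_ref(asset_path, name="myfile", summary={"extra_key": "extra_value"})
```

Because the entry stores only a path and metadata, the path can point anywhere the user chose to save the file.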

Saving a model reference

import tensorflow as tf
from polyaxon import tracking
...
tracking.init()
...
model_dir = tracking.get_outputs_path("model", is_dir=True)
classifier = tf.estimator.LinearClassifier(
    model_dir=model_dir,
    feature_columns=[...],
    n_classes=2
)
tracking.log_model_ref(model_dir, framework="tensorflow", summary={...}, ...)
...

This is similar to log_artifact, and it sets kind=V1ArtifactKind.MODEL automatically.

Note: In reference lineage tracking, you can also provide an extra key/value summary to augment the lineage information. log_artifact_ref also accepts a kind argument to specify the artifact kind.

Saving artifacts and reference logging

Some methods support saving artifacts, i.e. moving the file/dir under outputs, and logging the lineage reference in one call:

log artifact

tracking.log_artifact(name="file", path="path/from/file.csv", kind=V1ArtifactKind.CSV)

This will both move the file.csv under the run-uuid/outputs/ folder and save a lineage reference.

You can also control where under run-uuid/outputs/ the file is saved by using the rel_path argument:

tracking.log_artifact(name="file", path="path/from/file.csv", kind=V1ArtifactKind.CSV, rel_path="csv_files/new/")
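
The effect of rel_path on the destination can be illustrated without Polyaxon. The sketch below is an assumption-laden stand-in: store_under_outputs is a hypothetical helper, it copies the file rather than moving it so the source stays in place, and it does not record any lineage.

```python
import os
import shutil
import tempfile


def store_under_outputs(src: str, outputs_root: str, rel_path: str = "") -> str:
    """Copy src into outputs_root/rel_path and return the destination file path.

    Hypothetical helper for illustration only.
    """
    dest_dir = os.path.join(outputs_root, rel_path)
    os.makedirs(dest_dir, exist_ok=True)
    # shutil.copy returns the path of the newly created file.
    return shutil.copy(src, dest_dir)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "file.csv")
    with open(src, "w") as f:
        f.write("a,b\n1,2\n")
    outputs_root = os.path.join(tmp, "run-uuid", "outputs")
    default_dest = store_under_outputs(src, outputs_root)
    custom_dest = store_under_outputs(src, outputs_root, rel_path="csv_files/new/")
    # Paths relative to the outputs root, for easier comparison.
    default_rel = os.path.relpath(default_dest, outputs_root)
    custom_rel = os.path.relpath(custom_dest, outputs_root)
    both_exist = os.path.isfile(default_dest) and os.path.isfile(custom_dest)
```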

log model

Similar behaviour can be achieved with log_model:

tracking.log_model(name="file", path="path/from/model.h5", framework="scikit", summary={"additional_key": "value"})

This will both save the model under the run-uuid/outputs/model/ folder and save a lineage reference.

You can change the default model sub-folder by passing a rel_path:

tracking.log_model(name="file", path="path/from/model.h5", framework="scikit", summary={"additional_key": "value"}, rel_path="different_folder")

Versioned assets logging

Versioned assets logging is useful when the user needs to save the same asset several times in the same run, based on timestamp values and/or steps. For that, the tracking module automatically generates a new subpath under the assets sub-folder, e.g. assets/model/dirname_STEP_NUMBER or assets/audio/filename_STEP_NUMBER. Each time a logging method is called with the same artifact name, it creates a new entry in the event file and a new subpath with the step number.

Users should usually use versioned logging when they need an easy way to explore a specific versioned artifact in the dashboards tab. The UI creates a widget with a step slider that loads a new file version based on the step number.

Note: some logging functions do not save assets; they only populate the event file with new values, for instance scalars/metrics/text tracking.
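
The naming pattern described above can be sketched in a few lines. This is only an illustration of the assets/KIND/NAME_STEP shape taken from the text: versioned_subpath and log_versioned are hypothetical helpers, the events dictionary stands in for an event file, and the exact names Polyaxon generates may differ.

```python
from typing import Dict, List


def versioned_subpath(kind: str, name: str, step: int) -> str:
    """Build a subpath shaped like assets/<kind>/<name>_<step>."""
    return "assets/{}/{}_{}".format(kind, name, step)


# One list of entries per artifact name, standing in for an event file.
events: Dict[str, List[dict]] = {}


def log_versioned(kind: str, name: str, step: int) -> str:
    """Append an event entry for name and return the new versioned subpath."""
    subpath = versioned_subpath(kind, name, step)
    events.setdefault(name, []).append({"step": step, "path": subpath})
    return subpath


# Logging the same artifact name twice yields two entries and two subpaths.
first = log_versioned("model", "model", step=3)
second = log_versioned("model", "model", step=140)
steps = [e["step"] for e in events["model"]]
```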

Example logging a custom curve several times during the runtime:

tracking.log_curve("custom-curve", x_array, y_array, step=14)
...
tracking.log_curve("custom-curve", x_array, y_array, step=140)

Example logging a model for every checkpoint:

tracking.log_model(model_path, name="model", framework="tensorflow", step=3)
...
tracking.log_model(model_path, name="model", framework="tensorflow", step=140)

Example logging a generic artifact several times:

tracking.log_artifact(artifact_path, kind=V1ArtifactKind.DATAFRAME, name="df-pickle", step=3)
...
tracking.log_artifact(artifact_path, kind=V1ArtifactKind.DATAFRAME, name="df-pickle", step=140)
