You can use one or more buckets on Google Cloud Storage (GCS) to access data directly in your machine learning experiments and jobs.

Create a Google Cloud Storage bucket

You should create a Google Cloud Storage bucket (e.g. plx-data) and assign the appropriate permissions to it.

Google Cloud Storage provides an easy way to download the access key as a JSON file. You should create a secret based on that JSON file.

Create a secret on Kubernetes

You can create a secret with an environment variable holding the content of gcs-key.json:

  • GC_KEYFILE_DICT or GOOGLE_KEYFILE_DICT
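For example, a sketch of creating such a secret, assuming the key file was downloaded to path/to/gcs-key.json (the path is an assumption; adjust it to your setup):

```shell
# Store the JSON key's content under the GC_KEYFILE_DICT key of the secret
kubectl create secret generic gcs-secret \
  --from-literal=GC_KEYFILE_DICT="$(cat path/to/gcs-key.json)" \
  -n polyaxon
```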

Or you can create a secret to be mounted as a volume:

  • kubectl create secret generic gcs-secret --from-file=gc-secret.json=path/to/gcs-key.json -n polyaxon

Use the secret name and secret key in your connection definition

You can use the default mount path /plx-context/.gc; Polyaxon will set GC_KEY_PATH to /plx-context/.gc/gc-secret.json, so the secret must contain a key named gc-secret.json (i.e. it must be created with --from-file=gc-secret.json=...).

connections:
- name: gcs-dataset1
  kind: gcs
  schema:
    bucket: "gs://gcs-datasets"
  secret:
    name: "gcs-secret"
    mountPath: /plx-context/.gc
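To illustrate the mechanics of the key-path convention above (this is only a sketch, not Polyaxon's actual implementation), a client resolves the mounted key file through the GC_KEY_PATH env var and loads the JSON credentials from it:

```python
import json
import os
import tempfile

# Illustrative sketch only: stand in for the mounted secret with a temp file,
# then resolve it the way a client could, via the GC_KEY_PATH env var.
key = {"type": "service_account", "project_id": "demo-project"}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(key, f)

# In a real job, Polyaxon points GC_KEY_PATH at the mounted secret file
os.environ["GC_KEY_PATH"] = f.name
with open(os.environ["GC_KEY_PATH"]) as keyfile:
    credentials = json.load(keyfile)
print(credentials["project_id"])
```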

You can also use a different mount path, e.g. /etc/gcs, in which case you need to provide an env var to tell the SDK where to look:

kubectl create configmap gcs-key-path --from-literal=GC_KEY_PATH="/etc/gcs/gc-secret.json" -n polyaxon

connections:
- name: gcs-dataset1
  kind: gcs
  schema:
    bucket: "gs://gcs-datasets"
  secret:
    name: "gcs-secret"
    mountPath: /etc/gcs
  configMap:
    name: gcs-key-path

If you want to access multiple datasets using the same secret:

connections:
- name: gcs-dataset1
  kind: gcs
  schema:
    bucket: "gs://gcs-datasets/path1"
  secret:
    name: "gcs-secret"
    mountPath: /etc/gcs
  configMap:
    name: gcs-key-path
- name: gcs-dataset2
  kind: gcs
  schema:
    bucket: "gs://gcs-datasets/path2"
  secret:
    name: "gcs-secret"
    mountPath: /etc/gcs
  configMap:
    name: gcs-key-path

Update/Install Polyaxon deployment

You can deploy/upgrade your Polyaxon CE or Polyaxon Agent deployment with access to data on GCS.

Access to data in your experiments/jobs

To expose the connection secret to one of the containers in your jobs or services:

run:
  kind: job
  connections: [gcs-dataset1]

Or

run:
  kind: job
  connections: [gcs-dataset1, s3-dataset1]

Use the initializer to load the dataset

To use the artifacts initializer to load the dataset:

run:
  kind: job
  init:
    - artifacts: {dirs: [...], files: [...]}
      connection: "gcs-dataset1"

Access the dataset programmatically

This is optional; you can use any language or logic to interact with Google Cloud Storage buckets.

For instance, you can install the gcloud CLI, and it will be configured automatically when you request the GCS connection.
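For example, assuming the gcloud CLI is installed in the container image and the connection's credentials are exposed to the job, listing the bucket could look like:

```shell
# List the contents of the dataset bucket (bucket name from the example connection)
gsutil ls gs://gcs-datasets
```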

You can also use Polyaxon’s fs library to get a fully resolved gcsfs instance:

To use that logic, install the GCS extra (quoted to avoid shell globbing):

pip install "polyaxon[gcs]"

Creating a sync instance of the gcsfs client:

from polyaxon.fs import get_fs_from_name

...
fs = get_fs_from_name("gcs-dataset1")  # You can pass additional kwargs to the function
# The resolved instance follows the fsspec API (e.g. fs.ls, fs.open)
...

Creating an async instance of the gcsfs client:

from polyaxon.fs import get_fs_from_name

...
fs = get_fs_from_name("gcs-dataset1",
                      asynchronous=True)  # You can pass additional kwargs to the function
...