You can use one or multiple buckets on Google Cloud Storage (GCS) to access data directly in your machine learning experiments and jobs.
Create a Google Cloud Storage bucket
You should create a Google Cloud Storage bucket (e.g. plx-data) and assign the required permissions to it.
Google Cloud Storage provides an easy way to download the access key as a JSON file. You should create a secret based on that JSON file.
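For example, you can generate such a key file with the gcloud CLI for a service account that has access to the bucket; the account and project names below are illustrative placeholders:

```bash
# The service account and project are hypothetical placeholders;
# use a service account that has access to your bucket.
gcloud iam service-accounts keys create gcs-key.json \
  --iam-account=plx-sa@my-project.iam.gserviceaccount.com
```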
Create a secret on Kubernetes
You can create a secret with an environment variable holding the content of gcs-key.json, named `GC_KEYFILE_DICT` or `GOOGLE_KEYFILE_DICT`:
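A minimal sketch of creating such a secret, assuming the key file lives at path/to/gcs-key.json:

```bash
# Store the JSON key content under the GC_KEYFILE_DICT key of the secret.
kubectl create secret generic gcs-secret \
  --from-literal=GC_KEYFILE_DICT="$(cat path/to/gcs-key.json)" -n polyaxon
```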
Or you can create a secret to be mounted as a volume:
```bash
kubectl create secret generic gcs-secret --from-file=gc-secret.json=path/to/gcs-key.json -n polyaxon
```
Use the secret name and secret key in your connection definition
You can use the default mount path `/plx-context/.gc`; in that case Polyaxon will set `GC_KEY_PATH` to `/plx-context/.gc/gc-secret.json`, so the secret must contain a key named `gc-secret.json` (i.e. it was created with `--from-file=gc-secret.json=...`):
```yaml
connections:
  - name: gcs-dataset1
    kind: gcs
    schema:
      bucket: "gs://gcs-datasets"
    secret:
      name: "gcs-secret"
      mountPath: /plx-context/.gc
```
You can also use a different mount path, e.g. `/etc/gcs`, in which case you need to provide an environment variable to tell the SDK where to look:
```bash
kubectl create configmap gcs-key-path --from-literal=GC_KEY_PATH="/etc/gcs/gc-secret.json" -n polyaxon
```
```yaml
connections:
  - name: gcs-dataset1
    kind: gcs
    schema:
      bucket: "gs://gcs-datasets"
    secret:
      name: "gcs-secret"
      mountPath: /etc/gcs
    configMap:
      name: gcs-key-path
```
If you want to access multiple datasets using the same secret:
```yaml
connections:
  - name: gcs-dataset1
    kind: gcs
    schema:
      bucket: "gs://gcs-datasets/path1"
    secret:
      name: "gcs-secret"
      mountPath: /etc/gcs
    configMap:
      name: gcs-key-path
  - name: gcs-dataset2
    kind: gcs
    schema:
      bucket: "gs://gcs-datasets/path2"
    secret:
      name: "gcs-secret"
      mountPath: /etc/gcs
    configMap:
      name: gcs-key-path
```
Update/Install Polyaxon deployment
You can deploy/upgrade your Polyaxon CE or Polyaxon Agent deployment with access to data on GCS.
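For example, with the Helm chart, a minimal sketch of an upgrade that picks up the connections defined above, assuming they live in a values file named polyaxon-config.yaml (the filename is illustrative):

```bash
# Add the Polyaxon charts repo and upgrade (or install) the deployment
# with a values file that contains the connections section shown above.
helm repo add polyaxon https://charts.polyaxon.com
helm repo update
helm upgrade --install polyaxon polyaxon/polyaxon --namespace=polyaxon -f polyaxon-config.yaml
```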
Access to data in your experiments/jobs
To expose the connection secret to one of the containers in your jobs or services:
```yaml
run:
  kind: job
  connections: [gcs-dataset1]
```

Or, to expose multiple connections:

```yaml
run:
  kind: job
  connections: [gcs-dataset1, s3-dataset1]
```
Use the initializer to load the dataset
To use the artifacts initializer to load the dataset:

```yaml
run:
  kind: job
  init:
    - artifacts: {dirs: [...], files: [...]}
      connection: "gcs-dataset1"
```
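For instance, a hypothetical job that initializes one folder and one file from the dataset (the paths below are illustrative):

```yaml
run:
  kind: job
  init:
    # dirs/files are paths relative to the connection's bucket; these are examples.
    - artifacts: {dirs: ["path1/images"], files: ["path1/labels.csv"]}
      connection: "gcs-dataset1"
```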
Use Polyaxon to access the dataset
This is optional; you can use any language or logic to interact with Google Cloud Storage.
Polyaxon has some built-in logic that you can leverage if you want.
To use that logic:
```bash
pip install "polyaxon[gcs]"
```
The available methods:

```python
from polyaxon.connections.gcp.gcs import GCSService

store = GCSService(...)
store.delete()
store.ls()
store.upload_file()
store.upload_dir()
store.download_file()
store.download_dir()
```
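For instance, a minimal sketch of listing and downloading data; the constructor and method arguments shown here are illustrative assumptions, so check the signatures in your Polyaxon version:

```python
from polyaxon.connections.gcp.gcs import GCSService

# Assumption: with the secret exposed to the container as described above,
# the service can be built without explicit credentials; the exact
# constructor and method arguments may differ in your Polyaxon version.
store = GCSService()

# List a bucket path, then download a single blob to a local file
# (the GCS paths below are hypothetical).
print(store.ls("gs://gcs-datasets/path1"))
store.download_file("gs://gcs-datasets/path1/data.csv", "/tmp/data.csv")
```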