You can use one or more S3 buckets to access data directly in your machine learning experiments and jobs.
Create an S3 bucket
You should create an S3 bucket (e.g. plx-storage).
You need to expose the information required to connect to the blob storage. The standard way is to expose these keys:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
And optionally these keys:
AWS_ENDPOINT_URL
AWS_SECURITY_TOKEN
AWS_REGION
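As a quick sanity check, the following sketch reports which of the required keys are missing from an environment. The helper name `missing_s3_keys` is hypothetical and not part of Polyaxon; it only inspects the variables listed above.

```python
import os

# Keys listed above as required, plus the additional optional ones.
REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
OPTIONAL_KEYS = ("AWS_ENDPOINT_URL", "AWS_SECURITY_TOKEN", "AWS_REGION")


def missing_s3_keys(env=None):
    """Return the required keys that are absent or empty in `env`.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_s3_keys()
if missing:
    print("Missing required keys:", ", ".join(missing))
else:
    print("All required S3 keys are set.")
```

Running this inside a container that requests the connection is one way to confirm the secret was exposed as expected.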
Create a secret or a config map for storing these keys
We recommend using a secret to store your access information:
kubectl create secret -n polyaxon generic s3-secret --from-literal=AWS_ACCESS_KEY_ID=key-id --from-literal=AWS_SECRET_ACCESS_KEY=hash-key
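If you prefer to keep the secret as a manifest rather than an imperative command, the sketch below builds an equivalent `Opaque` Secret object. It assumes the same placeholder values as the command above (`key-id`, `hash-key`); the function name `build_secret_manifest` is hypothetical.

```python
import base64
import json


def build_secret_manifest(name, namespace, values):
    """Build a Kubernetes Secret manifest from plain-text key/value pairs.

    Kubernetes stores the entries of `data` base64-encoded, so each value is
    encoded here. Hypothetical helper for illustration only.
    """
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "Opaque",
        "metadata": {"name": name, "namespace": namespace},
        "data": encoded,
    }


manifest = build_secret_manifest(
    "s3-secret",
    "polyaxon",
    {"AWS_ACCESS_KEY_ID": "key-id", "AWS_SECRET_ACCESS_KEY": "hash-key"},
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`. Note that base64 is an encoding, not encryption, so treat the file as sensitive.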
Use the secret name and secret key in your data persistence definition
connections:
  - name: s3-dataset1
    kind: s3
    schema:
      bucket: "s3://bucket/"
    secret:
      name: "s3-secret"
If you want to access multiple datasets using the same secret:
connections:
  - name: s3-dataset1
    kind: s3
    schema:
      bucket: "s3://bucket/path1"
    secret:
      name: "s3-secret"
  - name: s3-dataset2
    kind: s3
    schema:
      bucket: "s3://bucket/path2"
    secret:
      name: "s3-secret"
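Both connections above point at the same bucket and differ only in the key prefix, which is why a single secret can serve them, assuming its credentials grant access to both prefixes. The sketch below shows how such a URL splits into a bucket name and a prefix; `split_s3_url` is a hypothetical helper, not a Polyaxon function.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an `s3://bucket/prefix` URL into (bucket_name, key_prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {url!r}")
    # The URL path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


for url in ("s3://bucket/path1", "s3://bucket/path2"):
    bucket, prefix = split_s3_url(url)
    print(f"bucket={bucket} prefix={prefix}")
```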
Update/Install Polyaxon deployment
You can deploy/upgrade your Polyaxon CE or Polyaxon Agent deployment with access to data on S3.
Access to the dataset in your experiments/jobs
To expose the connection secret to one of the containers in your jobs or services:
run:
  kind: job
  connections: [s3-dataset1]
Or, to request it together with other connections:
run:
  kind: job
  connections: [s3-dataset1, azure-dataset1]
Use the initializer to load the dataset
To load the dataset with the artifacts initializer, add an init section that references the connection:
run:
  kind: job
  init:
    - artifacts: {dirs: [...], files: [...]}
      connection: "s3-dataset1"
Access the dataset programmatically
This is optional; you can use any language or logic to interact with S3 buckets.
For instance, you can install boto3, and it will be configured automatically when you request the S3 connection.
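A minimal sketch of what that can look like with boto3 follows. boto3 normally reads the access key and secret key from the environment on its own; the explicit mapping here is an assumption for setups where, depending on the boto3 version, `AWS_REGION` or `AWS_ENDPOINT_URL` is not picked up automatically. The helper names `s3_client_kwargs` and `list_keys` are hypothetical.

```python
import os


def s3_client_kwargs(env=None):
    """Map the connection's environment keys to `boto3.client` keyword arguments.

    Only keys that are present and non-empty are included.
    """
    env = os.environ if env is None else env
    mapping = {
        "AWS_ACCESS_KEY_ID": "aws_access_key_id",
        "AWS_SECRET_ACCESS_KEY": "aws_secret_access_key",
        "AWS_SECURITY_TOKEN": "aws_session_token",
        "AWS_ENDPOINT_URL": "endpoint_url",
        "AWS_REGION": "region_name",
    }
    return {kwarg: env[key] for key, kwarg in mapping.items() if env.get(key)}


def list_keys(bucket, prefix=""):
    """List object keys under `prefix` (first page of results only)."""
    import boto3  # imported lazily so the mapping helper works without boto3

    client = boto3.client("s3", **s3_client_kwargs())
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [item["Key"] for item in response.get("Contents", [])]
```

Inside a job that requests the connection, a call such as `list_keys("plx-storage", "path1")` would return the keys stored under that prefix, where the bucket name and prefix are placeholders.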
You can also use Polyaxon’s fs library to get a fully resolved s3fs instance:
pip install "polyaxon[s3]"
Creating a sync instance of the s3fs client:
from polyaxon.fs import get_fs_from_name
...
fs = get_fs_from_name("s3-dataset1") # You can pass additional kwargs to the function
...
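Once you have the filesystem object, it can be used like any other fsspec-style filesystem. The sketch below is an illustration under that assumption: `list_files` and `list_dataset_csvs` are hypothetical helpers, and the path `bucket/path1` is a placeholder.

```python
def list_files(fs, root, suffix=""):
    """Return the sorted paths directly under `root` that end with `suffix`.

    `fs` is expected to expose the fsspec `ls(path, detail=False)` call,
    which returns plain path strings.
    """
    return sorted(path for path in fs.ls(root, detail=False) if path.endswith(suffix))


def list_dataset_csvs():
    """List CSV files in the dataset; requires a job with the S3 connection."""
    from polyaxon.fs import get_fs_from_name

    fs = get_fs_from_name("s3-dataset1")
    return list_files(fs, "bucket/path1", suffix=".csv")
```

Keeping the listing logic separate from the connection lookup makes it easy to exercise with a local or in-memory filesystem before running against S3.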
Creating an async instance of the s3fs client:
from polyaxon.fs import get_fs_from_name
...
fs = get_fs_from_name("s3-dataset1", asynchronous=True)  # You can pass additional kwargs to the function
...