This is part of our commercial offering.
V1Spark
polyaxon.polyflow.run.spark.spark.V1Spark(kind='spark', connections=None, volumes=None, type=None, spark_version=None, python_version=None, deploy_mode=None, main_class=None, main_application_file=None, arguments=None, hadoop_conf=None, spark_conf=None, spark_config_map=None, hadoop_config_map=None, executor=None, driver=None)
Spark jobs are used to run Spark applications on Kubernetes.
Apache Spark is a data-processing engine.
- Args:
- kind: str, should be equal to spark
- connections: List[str], optional
- volumes: List[Kubernetes Volume], optional
- type: str, one of: JAVA, SCALA, PYTHON, R
- spark_version: str, optional
- python_version: str, optional
- deploy_mode: str, optional
- main_class: str, optional
- main_application_file: str, optional
- arguments: List[str], optional
- hadoop_conf: Dict[str, str], optional
- spark_conf: Dict[str, str], optional
- hadoop_config_map: str, optional
- spark_config_map: str, optional
- executor: V1SparkReplica, optional
- driver: V1SparkReplica, optional
YAML usage
run:
kind: spark
connections:
volumes:
type:
sparkVersion:
deployMode:
mainClass:
mainApplicationFile:
arguments:
hadoopConf:
sparkConf:
hadoopConfigMap:
sparkConfigMap:
executor:
driver:
Python usage
from polyaxon.polyflow import V1Spark, V1SparkReplica, V1SparkType
from polyaxon.k8s import k8s_schemas
spark_job = V1Spark(
connections=["connection-name1"],
volumes=[k8s_schemas.V1Volume(...)],
type=V1SparkType.PYTHON,
spark_version="3.0.0",
spark_conf={...},
driver=V1SparkReplica(...),
executor=V1SparkReplica(...),
)
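The elided driver and executor values above are V1SparkReplica specs. Below is a minimal, hypothetical sketch of a complete PySpark job, assuming V1SparkReplica accepts replicas and a container spec; the image name, file path, and configuration values are placeholders, not defaults:
from polyaxon.polyflow import V1Spark, V1SparkReplica, V1SparkType
from polyaxon.k8s import k8s_schemas

# Hypothetical PySpark job: one driver and two executors.
spark_job = V1Spark(
    type=V1SparkType.PYTHON,
    spark_version="3.0.0",
    deploy_mode="cluster",
    main_application_file="local:///app/main.py",  # placeholder path
    arguments=["--input", "s3a://bucket/data"],  # placeholder arguments
    spark_conf={"spark.executor.memory": "2g"},
    driver=V1SparkReplica(
        replicas=1,
        container=k8s_schemas.V1Container(name="driver", image="my-spark-image:tag"),
    ),
    executor=V1SparkReplica(
        replicas=2,
        container=k8s_schemas.V1Container(name="executor", image="my-spark-image:tag"),
    ),
)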
Fields
kind
The kind signals to the CLI, client, and other tools that this component’s runtime is a Spark job.
If you are using the Python client to create the runtime, this field is not required and is set by default.
run:
kind: spark
connections
A list of connection names to resolve for the job.
If you are referencing a connection it must be configured. All referenced connections will be checked:
- If they are accessible in the context of the project of this run
- If the user running the operation has access to those connections
After the checks pass, the connections will be resolved and will inject any volumes, secrets, config maps, and environment variables needed for your main container to function correctly.
run:
kind: spark
connections: [connection1, connection2]
volumes
A list of Kubernetes Volumes to resolve and mount for your jobs.
This is an advanced use-case where configuring a connection is not an option.
When you add a volume you need to mount it manually to your container(s).
run:
kind: spark
volumes:
- name: volume1
persistentVolumeClaim:
claimName: pvc1
...
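Because volumes declared here are not mounted automatically, you must add a matching volume mount to the container(s) that need them. A minimal sketch in Python, assuming V1SparkReplica exposes a container field; the volume and claim names are the placeholders from the example above:
from polyaxon.polyflow import V1Spark, V1SparkReplica
from polyaxon.k8s import k8s_schemas

# Declare the volume on the job, then mount it explicitly in the driver container.
spark_job = V1Spark(
    volumes=[
        k8s_schemas.V1Volume(
            name="volume1",
            persistent_volume_claim=k8s_schemas.V1PersistentVolumeClaimVolumeSource(
                claim_name="pvc1",
            ),
        ),
    ],
    driver=V1SparkReplica(
        container=k8s_schemas.V1Container(
            name="driver",
            volume_mounts=[
                k8s_schemas.V1VolumeMount(name="volume1", mount_path="/data"),  # placeholder path
            ],
        ),
    ),
)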
type
Tells the type of the Spark application, possible values: Java, Scala, Python, R.
run:
kind: spark
type: Python
...
sparkVersion
The version of Spark the application uses.
run:
kind: spark
sparkVersion: 3.0.0
...
deployMode
The deployment mode of the Spark application.
run:
kind: spark
deployMode: cluster
...
mainClass
The fully-qualified main class of the Spark application. This only applies to Java/Scala Spark applications.
run:
kind: spark
mainClass: ...
...
mainApplicationFile
The path to a bundled JAR, Python, or R file of the application.
run:
kind: spark
mainApplicationFile: ...
...
arguments
List of arguments to be passed to the application.
run:
kind: spark
arguments: [...]
...
hadoopConf
HadoopConf carries user-specified Hadoop configuration properties as they would use the “--conf” option in spark-submit. The SparkApplication controller automatically adds the prefix “spark.hadoop.” to Hadoop configuration properties.
run:
kind: spark
hadoopConf: {...}
...
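For instance, because of the automatic prefixing, each entry below ends up as a spark.hadoop.-prefixed Spark property. A small sketch with hypothetical S3A settings:
from polyaxon.polyflow import V1Spark

# Each key is forwarded as if passed via `--conf spark.hadoop.<key>=<value>`.
spark_job = V1Spark(
    hadoop_conf={
        "fs.s3a.endpoint": "http://minio:9000",  # becomes spark.hadoop.fs.s3a.endpoint
        "fs.s3a.path.style.access": "true",  # becomes spark.hadoop.fs.s3a.path.style.access
    },
)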
sparkConf
Carries user-specified Spark configuration properties as they would use the “--conf” option in spark-submit.
run:
kind: spark
sparkConf: {...}
...
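As a sketch, the entries below are equivalent to passing --conf spark.executor.memory=2g --conf spark.ui.enabled=false to spark-submit; the property values are illustrative:
from polyaxon.polyflow import V1Spark

# Equivalent to repeated `--conf key=value` flags on spark-submit.
spark_job = V1Spark(
    spark_conf={
        "spark.executor.memory": "2g",
        "spark.ui.enabled": "false",
    },
)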
hadoopConfigMap
Carries the name of the ConfigMap containing Hadoop configuration files such as core-site.xml. The controller will add environment variable HADOOP_CONF_DIR to the path where the ConfigMap is mounted to.
run:
kind: spark
hadoopConfigMap: config-map-name
...
sparkConfigMap
Carries the name of the ConfigMap containing Spark configuration files such as log4j.properties. The controller will add environment variable SPARK_CONF_DIR to the path where the ConfigMap is mounted to.
run:
kind: spark
sparkConfigMap: config-map-name
...
executor
Executor is a Spark replica specification.
run:
kind: spark
executor:
replicas: 1
...
...
driver
Driver is a Spark replica specification.
run:
kind: spark
driver:
replicas: 1
...
...
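In Python, both the driver and executor sections map to V1SparkReplica objects. A hedged sketch, assuming V1SparkReplica accepts replicas, environment, and container fields; the node selector and resource values are illustrative:
from polyaxon.polyflow import V1Environment, V1SparkReplica
from polyaxon.k8s import k8s_schemas

# One driver pod pinned by a placeholder node selector, three executors with explicit resources.
driver = V1SparkReplica(
    replicas=1,
    environment=V1Environment(node_selector={"disk": "ssd"}),
    container=k8s_schemas.V1Container(
        name="driver",
        resources=k8s_schemas.V1ResourceRequirements(
            requests={"cpu": "1", "memory": "1Gi"},
        ),
    ),
)
executor = V1SparkReplica(
    replicas=3,
    container=k8s_schemas.V1Container(
        name="executor",
        resources=k8s_schemas.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "4Gi"},
        ),
    ),
)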