This is part of our commercial offering.

V1Spark

polyaxon.polyflow.run.spark.spark.V1Spark(kind='spark', connections=None, volumes=None, type=None, spark_version=None, python_version=None, deploy_mode=None, main_class=None, main_application_file=None, arguments=None, hadoop_conf=None, spark_conf=None, spark_config_map=None, hadoop_config_map=None, executor=None, driver=None)

Spark jobs are used to run Spark applications on Kubernetes.

Apache Spark is an analytics engine for large-scale data processing.

  • Args:
    • kind: str, should be equal to spark
    • connections: List[str], optional
    • volumes: List[Kubernetes Volume], optional
    • type: str [JAVA, SCALA, PYTHON, R]
    • spark_version: str, optional
    • python_version: str, optional
    • deploy_mode: str, optional
    • main_class: str, optional
    • main_application_file: str, optional
    • arguments: List[str], optional
    • hadoop_conf: Dict[str, str], optional
    • spark_conf: Dict[str, str], optional
    • hadoop_config_map: str, optional
    • spark_config_map: str, optional
    • executor: V1SparkReplica
    • driver: V1SparkReplica

YAML usage

run:
  kind: spark
  connections:
  volumes:
  type:
  sparkVersion:
  pythonVersion:
  deployMode:
  mainClass:
  mainApplicationFile:
  arguments:
  hadoopConf:
  sparkConf:
  hadoopConfigMap:
  sparkConfigMap:
  executor:
  driver:

Python usage

from polyaxon.polyflow import V1Spark, V1SparkReplica, V1SparkType
from polyaxon.k8s import k8s_schemas
spark_job = V1Spark(
    connections=["connection-name1"],
    volumes=[k8s_schemas.V1Volume(...)],
    type=V1SparkType.PYTHON,
    spark_version="3.0.0",
    spark_conf={...},
    driver=V1SparkReplica(...),
    executor=V1SparkReplica(...),
)

Fields

kind

The kind signals to the CLI, client, and other tools that this component’s runtime is a Spark job.

If you are using the python client to create the runtime, this field is not required and is set by default.

run:
  kind: spark

connections

A list of connection names to resolve for the job.

If you are referencing a connection, it must be configured. All referenced connections will be checked:
  • If they are accessible in the context of the project of this run

  • If the user running the operation has access to those connections

After the checks, the connections will be resolved and will inject any volumes, secrets, configMaps, and environment variables required for your main container to function correctly.

run:
  kind: spark
  connections: [connection1, connection2]

volumes

A list of Kubernetes Volumes to resolve and mount for your jobs.

This is an advanced use-case where configuring a connection is not an option.

When you add a volume, you need to mount it manually in your container(s); a mount sketch follows the example below.

run:
  kind: spark
  volumes:
    - name: volume1
      persistentVolumeClaim:
        claimName: pvc1
  ...
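For instance, to mount volume1 inside the executor's main container, you can reference it in the replica's container spec. This is a minimal sketch, assuming the replica's container accepts standard Kubernetes volumeMounts; the mount path is a placeholder:

run:
  kind: spark
  volumes:
    - name: volume1
      persistentVolumeClaim:
        claimName: pvc1
  executor:
    container:
      volumeMounts:
        # mount the volume declared above; the path is illustrative
        - name: volume1
          mountPath: /mnt/volume1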

type

Specifies the type of the Spark application. Possible values: Java, Scala, Python, R.

run:
  kind: spark
  type: Python
  ...

sparkVersion

The version of Spark the application uses.

run:
  kind: spark
  sparkVersion: 3.0.0
  ...

deployMode

The deploy mode of the Spark application, e.g. cluster or client.

run:
  kind: spark
  deployMode: cluster
  ...

mainClass

The fully-qualified main class of the Spark application. This only applies to Java/Scala Spark applications.

run:
  kind: spark
  mainClass: ...
  ...
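For example, to run the SparkPi class that ships with the Spark distribution (the class name here is illustrative of any fully-qualified entry point):

run:
  kind: spark
  type: Scala
  mainClass: org.apache.spark.examples.SparkPi
  ...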

mainApplicationFile

The path to a bundled JAR, Python, or R file of the application.

run:
  kind: spark
  mainApplicationFile: ...
  ...
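The path typically uses a scheme understood by spark-submit, such as local:// for a file packaged inside the image. A sketch with an illustrative path:

run:
  kind: spark
  type: Python
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  ...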

arguments

List of arguments to be passed to the application.

run:
  kind: spark
  arguments: [...]
  ...
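Flags and positional arguments are passed through as a list of strings; the flags below are hypothetical:

run:
  kind: spark
  arguments: ["--input", "/data/events", "--output", "/data/results"]
  ...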

hadoopConf

Carries user-specified Hadoop configuration properties, as they would be set with the --conf option of spark-submit. The SparkApplication controller automatically adds the prefix spark.hadoop. to these properties.

run:
  kind: spark
  hadoopConf: {...}
  ...
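Keys are therefore given without the spark.hadoop. prefix. A sketch with a placeholder value:

run:
  kind: spark
  hadoopConf:
    # resolved by the controller to spark.hadoop.fs.gs.project.id
    "fs.gs.project.id": my-project
  ...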

sparkConf

Carries user-specified Spark configuration properties, as they would be set with the --conf option of spark-submit.

run:
  kind: spark
  sparkConf: {...}
  ...
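For example, to enable Spark event logging (the directory is illustrative):

run:
  kind: spark
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "/tmp/spark-events"
  ...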

hadoopConfigMap

Carries the name of the ConfigMap containing Hadoop configuration files such as core-site.xml. The controller will add the environment variable HADOOP_CONF_DIR pointing to the path where the ConfigMap is mounted.

run:
  kind: spark
  hadoopConfigMap: ...
  ...

sparkConfigMap

Carries the name of the ConfigMap containing Spark configuration files such as log4j.properties. The controller will add the environment variable SPARK_CONF_DIR pointing to the path where the ConfigMap is mounted.

run:
  kind: spark
  sparkConfigMap: ...
  ...

executor

The executor is a Spark replica specification (V1SparkReplica).

run:
  kind: spark
  executor:
    replicas: 1
    ...
  ...

driver

The driver is a Spark replica specification (V1SparkReplica).

run:
  kind: spark
  driver:
    replicas: 1
    ...
  ...
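Putting the two together, a fuller sketch of a run with driver and executor specifications, assuming V1SparkReplica exposes replicas and a standard Kubernetes container spec; the resource values are illustrative:

run:
  kind: spark
  driver:
    replicas: 1
    container:
      resources:
        requests:
          cpu: "1"
          memory: 512Mi
  executor:
    replicas: 2
    container:
      resources:
        requests:
          cpu: "1"
          memory: 1Gi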