Spark v4.x - Introduction

This article presents Apache Spark v4.x and shows how to create a local Spark cluster with Docker.

What’s Spark

Apache Spark is a fast, general‑purpose cluster‑computing system that unifies batch, streaming, machine‑learning, and graph workloads under a single execution engine. It was created to address the limitations of Hadoop MapReduce by leveraging in‑memory data structures and an advanced optimizer.

Core features:

  • RDD (Resilient Distributed Dataset): Immutable, fault-tolerant distributed collections that expose transformations (map, filter, reduce) and actions.
  • DataFrames / Datasets: Schema-aware, relational API that blends SQL and programmatic access.
  • Spark SQL: SQL engine on top of Catalyst and Tungsten, supporting ANSI SQL, UDFs, and built-in functions.
  • Catalyst Optimizer: A rule-based and cost-based planner that rewrites logical plans into an efficient physical plan.
  • Tungsten Execution Engine: Binary, off-heap storage and code generation for CPU-efficient processing.
  • Structured Streaming: Continuous query model that treats streaming data as an unbounded DataFrame.
  • MLlib: Scalable machine-learning library built on top of Spark Core and Spark SQL.
  • GraphX: Graph-processing API that runs on the same cluster and execution engine as DataFrames.

What’s new with Spark v4.x

Apache Spark 4.0 was officially released in May 2025 and Spark 4.1.1 was released in January 2026. Apache Spark 4.0 represents the first major version release since Spark 3.0 in June 2020.

This release marks a real shift toward making Spark a “SQL-first” engine that rivals traditional cloud data warehouses in usability, while continuing to strengthen its distributed-computing core.

Some examples of Spark 4.x support among managed service providers (February 2026):

  • Databricks: Spark 4.0 is supported starting from Databricks Runtime 17.2 (September 2025), and Spark 4.1 starting from Databricks Runtime 18.0 (January 2026).
  • Snowflake: There does not appear to be a Snowflake Connector supporting Spark 4.x yet; Spark 3.5 is the most recent supported version.
  • AWS EMR: EMR 7.12.0 (latest available version) supports Spark 3.5.6.
  • AWS Glue: Glue 5.1 (latest available version) supports Spark 3.5.6.
  • Microsoft Fabric: Runtime 1.3 (latest GA version in February 2026) supports Spark 3.5. Runtime 2.0, under development, will support Spark 4.0 (experimental status, no release date).

New key features

  • Support:
    • Support for Java 17 and Java 21
    • Support for Python 3.10+
    • Support for Scala 2.13
  • Enhanced Adaptive Query Execution (AQE):
    • Data Skew Optimization: AQE can dynamically split skewed partitions into smaller parts to better distribute the workload across cluster cores.
    • Post-shuffle Partition Coalescing: AQE can now merge partitions after a shuffle operation if they are too small, which reduces the number of tasks and scheduling overhead. This optimizes the execution plan without requiring manual configuration.
    • Join Conversion: AQE can convert sort-merge joins into broadcast-hash joins if one of the tables is small enough, which is much more efficient. In version 4.x, this detection and conversion are more robust and granular.
  • Structured Streaming:
    • State Store Data Source: Access internal streaming state as a Table/DataFrame (for debugging, monitoring, and auditing).
    • transformWithState, the new arbitrary stateful operator:
      • The transformWithState API: Designed to be flexible and extensible.
      • Multiple State Variables: Instead of a single monolithic object, it allows declaring multiple typed state variables for the same key.
      • Schema Evolution: Allows modifying the structure of your state data (adding or removing fields) without breaking compatibility with existing checkpoints.
      • Improved Timer Management: Handling of time (Event Time and Processing Time) has been simplified and made more robust.
  • Observability:
    • Structured Logs: Spark 4.0 transitions from an unstructured text log format to a structured JSON format.

How to create a local cluster

To test this new version of Spark, we will create a Docker image and the necessary elements to set up a cluster composed of one Driver and two Workers. The directories used for this project are:

  • application: Directory to store all PySpark scripts
  • data: Directory to store datasets (input and output)
  • logs: Directory to store execution logs from the Spark cluster
  • spark-image-docker: Directory to store the files needed to create the Docker image

Create a Docker image

Steps from the spark-image-docker directory:

  1. Create the spark-defaults.conf file to define the configuration elements for the Spark cluster
  2. Create the spark-env.sh file to define the environment elements for the Spark cluster
  3. Create the spark-start.sh file to define the execution script for Spark cluster applications
  4. Create the Dockerfile to define the content of the Spark image for creating the Spark cluster
  5. Execute the Docker image build command: docker build -t spark4 spark-image-docker --no-cache

Content of the spark-defaults.conf file:

# --- History Server Configuration ---
spark.eventLog.enabled              true
spark.eventLog.dir                  file:///opt/spark/event_logs
spark.history.fs.logDirectory       file:///opt/spark/event_logs

# --- Database Configurations ---
spark.sql.warehouse.dir             file:///opt/spark/data/warehouse

Content of the spark-env.sh file:

#!/usr/bin/env bash

export SPARK_LOCAL_IP=$(hostname -i)
export SPARK_PUBLIC_DNS=$(hostname -f)

Content of the spark-start.sh file:

#!/bin/bash

# Start the SSH daemon
if ! /usr/sbin/sshd; then
    echo "Failed to start SSH server. Exiting."
    exit 1
fi

if [ "$SPARK_MODE" = "master" ]; then
    echo "Starting Spark Master..."
    # Spark Master/Driver
    "$SPARK_HOME"/sbin/start-master.sh
    # Spark Connect
    "$SPARK_HOME"/sbin/start-connect-server.sh
    # History server
    "$SPARK_HOME"/sbin/start-history-server.sh
elif [ "$SPARK_MODE" = "worker" ]; then
    echo "Starting Spark Worker..."
    # Spark Worker
    "$SPARK_HOME"/sbin/start-worker.sh "$SPARK_MASTER_URL"
else
    echo "Unknown SPARK_MODE: $SPARK_MODE"
    exit 1
fi

# Keep the container alive
tail -f "$SPARK_HOME"/logs/*

Content of the Dockerfile:

# Use OpenJDK base image
FROM eclipse-temurin:21-jdk-jammy

# Define env variables
ENV SPARK_MASTER="spark://spark-master:7077"
ENV SPARK_MASTER_HOST=spark-master
ENV SPARK_MASTER_PORT=7077
ENV PYSPARK_PYTHON=python3
ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
# Define Spark version
ENV SPARK_VERSION="4.1.1"

# Install tools
RUN apt-get update \
    && apt-get install -y --no-install-recommends wget tar iputils-ping rsync openssh-server openssh-client \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Manage SSH information
RUN mkdir -p /root/.ssh/ \
    && ssh-keygen -t rsa -f /root/.ssh/id_rsa -q -N "" \
    && cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys \
    && chmod 600 /root/.ssh/authorized_keys \
    && echo "Host *" >> /root/.ssh/config \
    && echo "    StrictHostKeyChecking no" >> /root/.ssh/config \
    && chmod 600 /root/.ssh/config \
    && mkdir -p /var/run/sshd \
    && ssh-keygen -A

# Download and install Spark
RUN wget https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz
RUN tar -xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz \
    && rm spark-${SPARK_VERSION}-bin-hadoop3.tgz \
    && mv spark-${SPARK_VERSION}-bin-hadoop3 ${SPARK_HOME} \
    && chown -R root:root ${SPARK_HOME} \
    && mkdir -p ${SPARK_HOME}/logs \
    && mkdir -p ${SPARK_HOME}/event_logs

# Load and set up Spark configuration for logging and history server
COPY spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf

# Load and set up Spark configuration scripts (env and start)
COPY spark-env.sh $SPARK_HOME/conf/spark-env.sh
COPY spark-start.sh $SPARK_HOME/spark-start.sh
RUN chmod +x $SPARK_HOME/conf/spark-env.sh
RUN chmod +x $SPARK_HOME/spark-start.sh

# Expose needed ports
EXPOSE 7077 8080 4040 15002 22

# Entrypoint config
CMD ["/opt/spark/spark-start.sh"]

Content of the log from the build command:

[+] Building 36.2s (15/15) FINISHED
 => [internal] load build definition from Dockerfile
 => => transferring dockerfile: 3.01kB
 => [internal] load metadata for docker.io/library/eclipse-temurin:21-jdk-jammy
 => [internal] load .dockerignore
 => => transferring context: 2B
 => [ 1/10] FROM docker.io/library/eclipse-temurin:21-jdk-jammy@sha256:9119073a0b783fd380fbf4b131a40955525c8e2a66083681c87fb15a39ae01d0
 => => resolve docker.io/library/eclipse-temurin:21-jdk-jammy@sha256:9119073a0b783fd380fbf4b131a40955525c8e2a66083681c87fb15a39ae01d0
 => => sha256:9119073a0b783fd380fbf4b131a40955525c8e2a66083681c87fb15a39ae01d0 5.25kB / 5.25kB
 => => sha256:86cbb2dbf3b68d3c30280722d5597afc845af596fe7f6398db56e5c9f9e0bc4e 1.94kB / 1.94kB
 => => sha256:b3aadbf953f8337a9a81487d7a8a93b72853d9e50781b984c57c7552c8778c7a 5.94kB / 5.94kB
 => => sha256:b1cba2e842ca52b95817f958faf99734080c78e92e43ce609cde9244867b49ed 29.54MB / 29.54MB
 => => sha256:1dde4b555a697f85138e99e7759480373b71f47cbad9f7c0fa6cba34f2f5fe1e 20.69MB / 20.69MB
 => => sha256:878e917e8d81357d9636aab51478d4fe889d01e89988159f87e55dbc3bba337b 157.87MB / 157.87MB
 => => sha256:2049ec1cef96e43c3946f1be57d5efb77052e16d9e7f1fa3d7c8e4030155eac7 158B / 158B
 => => sha256:9760149be10a5530ec0649fba4393ef8c2a058a9a97978c7cf43287b890dfcc0 2.28kB / 2.28kB
 => => extracting sha256:b1cba2e842ca52b95817f958faf99734080c78e92e43ce609cde9244867b49ed
 => => extracting sha256:1dde4b555a697f85138e99e7759480373b71f47cbad9f7c0fa6cba34f2f5fe1e
 => => extracting sha256:878e917e8d81357d9636aab51478d4fe889d01e89988159f87e55dbc3bba337b
 => => extracting sha256:2049ec1cef96e43c3946f1be57d5efb77052e16d9e7f1fa3d7c8e4030155eac7
 => => extracting sha256:9760149be10a5530ec0649fba4393ef8c2a058a9a97978c7cf43287b890dfcc0
 => [internal] load build context
 => => transferring context: 2.11kB
 => [ 2/10] RUN apt-get update     && apt-get install -y --no-install-recommends wget tar iputils-ping rsync openssh-server openssh-client     && apt-get install -y --no-install-recommends python3 python3-pip     && rm -rf /var/lib/apt/lists/*
 => [ 3/10] RUN mkdir -p /root/.ssh/     && ssh-keygen -t rsa -f /root/.ssh/id_rsa -q -N ""     && cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys     && chmod 600 /root/.ssh/authorized_keys     && echo "Host *" >> /root/.ssh/config     && echo "    StrictHostKeyChecking no" >> /root/.ssh/config     && chmod 600 /root/.ssh/config     &&
 => [ 4/10] RUN wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
 => [ 5/10] RUN tar -xzf spark-4.1.1-bin-hadoop3.tgz     && rm spark-4.1.1-bin-hadoop3.tgz     && mv spark-4.1.1-bin-hadoop3 /opt/spark     && chown -R root:root /opt/spark     && mkdir -p /opt/spark/logs     && mkdir -p /opt/spark/event_logs
 => [ 6/10] COPY spark-defaults.conf /opt/spark/conf/spark-defaults.conf
 => [ 7/10] COPY spark-env.sh /opt/spark/conf/spark-env.sh
 => [ 8/10] COPY spark-start.sh /opt/spark/spark-start.sh
 => [ 9/10] RUN chmod +x /opt/spark/conf/spark-env.sh
 => [10/10] RUN chmod +x /opt/spark/spark-start.sh
 => exporting to image
 => => exporting layers
 => => writing image sha256:ecbea515a6ce1fce621a2b7c8962957862f7069fa2c4f1fc30352fdb8abcb72c
 => => naming to docker.io/library/spark4

Create a Docker Compose file

Steps to perform from the root directory:

  1. Create the compose.yml file to define the various elements of the cluster, which will consist of one Master node and two Worker nodes
  2. Start the local cluster with the command docker compose up -d
  3. Stop the local cluster with the command docker compose down

Content of the compose.yml file:

services:
  spark-master:
    image: spark4
    container_name: spark-master
    hostname: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=false
      - SPARK_RPC_ENCRYPTION_ENABLED=false
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=false
      - SPARK_SSL_ENABLED=false
      - SPARK_PUBLIC_DNS=spark-master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_DRIVER_MEMORY=1g
      - SPARK_DRIVER_CORES=1
      - SPARK_EXECUTOR_MEMORY=1g
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - "4040:4040"   # Application UI (job details)
      - "8080:8080"   # Master web UI
      - "7077:7077"   # Spark communication port
      - "15002:15002" # Spark Connect port
      - "18080:18080" # History Server UI
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
    volumes:
      - ./data:/opt/spark/data
      - ./logs/events:/opt/spark/event_logs
      - ./application/src:/home/root/src
    networks:
      - spark-network

  spark-worker-1:
    image: spark4
    container_name: spark-worker-1
    hostname: spark-worker-1
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_RPC_AUTHENTICATION_ENABLED=false
      - SPARK_RPC_ENCRYPTION_ENABLED=false
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=false
      - SPARK_SSL_ENABLED=false
      - SPARK_PUBLIC_DNS=spark-worker-1
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2g
      - SPARK_EXECUTOR_MEMORY=1g
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - "8081:8081"   # Worker web UI
    volumes:
      - ./data:/opt/spark/data
    networks:
      - spark-network
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G

  spark-worker-2:
    image: spark4
    container_name: spark-worker-2
    hostname: spark-worker-2
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_RPC_AUTHENTICATION_ENABLED=false
      - SPARK_RPC_ENCRYPTION_ENABLED=false
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=false
      - SPARK_SSL_ENABLED=false
      - SPARK_PUBLIC_DNS=spark-worker-2
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2g
      - SPARK_EXECUTOR_MEMORY=1g
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - "8082:8081"   # Worker web UI
    volumes:
      - ./data:/opt/spark/data
    networks:
      - spark-network
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G

networks:
  spark-network:
    driver: bridge

Content of the log from the command docker compose up -d:

[+] up 4/4
 ✔ Network spark4_spark-network Created
 ✔ Container spark-master       Created
 ✔ Container spark-worker-1     Created
 ✔ Container spark-worker-2     Created

Content of the log from the stop command docker compose down:

[+] down 4/4
 ✔ Container spark-worker-2     Removed
 ✔ Container spark-worker-1     Removed
 ✔ Container spark-master       Removed
 ✔ Network spark4_spark-network Removed

How to execute a PySpark application

List of ports and interfaces

Based on the configuration in the compose.yml file:

  • 8080: Spark Master web UI (http://localhost:8080)
  • 8081: Spark Worker 1 web UI (http://localhost:8081)
  • 8082: Spark Worker 2 web UI (http://localhost:8082)
  • 4040: Application UI for the running job (http://localhost:4040)
  • 18080: History Server UI (http://localhost:18080)
  • 7077: Spark Master RPC port (used by workers and spark-submit)
  • 15002: Spark Connect endpoint

Script execution on the cluster

To execute a PySpark script directly on the local cluster:

  1. Create a Python script named spark-submit-app.py in the application/src directory
  2. Execute the following Docker command to run the created Python script: docker exec -it spark-master spark-submit --master spark://spark-master:7077 --conf spark.driver.host=spark-master --name "TestApp" /home/root/src/spark-submit-app.py
  3. After successful execution of the script, you will find a new file named test.parquet in the data/files directory, corresponding to the file created in the example

Note: the host's application/src directory is mapped to the /home/root/src directory in the container, as defined in the compose.yml file.

Content of the spark-submit-app.py Python script:

from pyspark.sql import SparkSession

REP_DATA_FILES = "file:///opt/spark/data/files"

spark = SparkSession.builder \
    .appName("TestSubmitApp") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
df = spark.createDataFrame(data, ["id", "name"])
df.show()
df.write.mode("overwrite").format("parquet").save(f"{REP_DATA_FILES}/test.parquet")

spark.stop()

Content of the log from the execution command:

WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/02/27 16:31:13 INFO SparkContext: Running Spark version 4.1.1
26/02/27 16:31:13 INFO SparkContext: OS info Linux, 6.6.87.2-microsoft-standard-WSL2, amd64
26/02/27 16:31:13 INFO SparkContext: Java version 21.0.10+7-LTS
26/02/27 16:31:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/02/27 16:31:13 INFO ResourceUtils: ==============================================================
26/02/27 16:31:13 INFO ResourceUtils: No custom resources configured for spark.driver.
26/02/27 16:31:13 INFO ResourceUtils: ==============================================================
26/02/27 16:31:13 INFO SparkContext: Submitted application: TestSubmitApp
26/02/27 16:31:13 INFO SecurityManager: Changing view acls to: root
26/02/27 16:31:13 INFO SecurityManager: Changing modify acls to: root
26/02/27 16:31:13 INFO SecurityManager: Changing view acls groups to: root
26/02/27 16:31:13 INFO SecurityManager: Changing modify acls groups to: root
26/02/27 16:31:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: root groups with view permissions: EMPTY; users with modify permissions: root; groups with modify permissions: EMPTY; RPC SSL disabled
26/02/27 16:31:13 INFO Utils: Successfully started service 'sparkDriver' on port 34403.
26/02/27 16:31:13 INFO SparkEnv: Registering MapOutputTracker
26/02/27 16:31:13 INFO SparkEnv: Registering BlockManagerMaster
26/02/27 16:31:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
26/02/27 16:31:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
26/02/27 16:31:13 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
26/02/27 16:31:13 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-f5f0ee24-17ba-442b-9614-1ee27607cec2
26/02/27 16:31:13 INFO SparkEnv: Registering OutputCommitCoordinator
26/02/27 16:31:14 INFO JettyUtils: Start Jetty 172.19.0.2:4040 for SparkUI
26/02/27 16:31:14 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
26/02/27 16:31:14 INFO Utils: Successfully started service 'SparkUI' on port 4041.
26/02/27 16:31:14 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
26/02/27 16:31:14 INFO ResourceProfile: Limiting resource is cpu
26/02/27 16:31:14 INFO ResourceProfileManager: Added ResourceProfile id: 0
26/02/27 16:31:14 INFO SecurityManager: Changing view acls to: root
26/02/27 16:31:14 INFO SecurityManager: Changing modify acls to: root
26/02/27 16:31:14 INFO SecurityManager: Changing view acls groups to: root
26/02/27 16:31:14 INFO SecurityManager: Changing modify acls groups to: root
26/02/27 16:31:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: root groups with view permissions: EMPTY; users with modify permissions: root; groups with modify permissions: EMPTY; RPC SSL disabled
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-master:7077...
26/02/27 16:31:14 INFO TransportClientFactory: Successfully created connection to spark-master/172.19.0.2:7077 after 17 ms (0 ms spent in bootstraps)
26/02/27 16:31:14 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20260227163114-0001
26/02/27 16:31:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41333.
26/02/27 16:31:14 INFO NettyBlockTransferService: Server created on spark-master:41333
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20260227163114-0001/0 on worker-20260227161953-172.19.0.4-40325 (172.19.0.4:40325) with 2 core(s)
26/02/27 16:31:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
26/02/27 16:31:14 INFO StandaloneSchedulerBackend: Granted executor ID app-20260227163114-0001/0 on hostPort 172.19.0.4:40325 with 2 core(s), 1024.0 MiB RAM
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20260227163114-0001/1 on worker-20260227161953-172.19.0.3-38037 (172.19.0.3:38037) with 2 core(s)
26/02/27 16:31:14 INFO StandaloneSchedulerBackend: Granted executor ID app-20260227163114-0001/1 on hostPort 172.19.0.3:38037 with 2 core(s), 1024.0 MiB RAM
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20260227163114-0001/0 is now RUNNING
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20260227163114-0001/1 is now RUNNING
26/02/27 16:31:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, spark-master, 41333, None)
26/02/27 16:31:14 INFO BlockManagerMasterEndpoint: Registering block manager spark-master:41333 with 413.9 MiB RAM, BlockManagerId(driver, spark-master, 41333, None)
26/02/27 16:31:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, spark-master, 41333, None)
26/02/27 16:31:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, spark-master, 41333, None)
26/02/27 16:31:15 INFO RollingEventLogFilesWriter: Logging events to file:/opt/spark/event_logs/eventlog_v2_app-20260227163114-0001/events_1_app-20260227163114-0001.zstd
26/02/27 16:31:15 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
26/02/27 16:31:16 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.19.0.3:39714) with ID 1, ResourceProfileId 0
26/02/27 16:31:16 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.19.0.4:55040) with ID 0, ResourceProfileId 0
26/02/27 16:31:16 INFO BlockManagerMasterEndpoint: Registering block manager 172.19.0.4:39007 with 434.4 MiB RAM, BlockManagerId(0, 172.19.0.4, 39007, None)
26/02/27 16:31:16 INFO BlockManagerMasterEndpoint: Registering block manager 172.19.0.3:40421 with 434.4 MiB RAM, BlockManagerId(1, 172.19.0.3, 40421, None)
26/02/27 16:31:18 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
26/02/27 16:31:18 INFO SharedState: Warehouse path is 'file:/opt/spark/data/warehouse'.
26/02/27 16:31:19 INFO CodeGenerator: Code generated in 314.329914 ms
26/02/27 16:31:19 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
26/02/27 16:31:19 INFO DAGScheduler: Got job 0 (showString at NativeMethodAccessorImpl.java:0) with 1 output partitions
26/02/27 16:31:19 INFO DAGScheduler: Final stage: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0)
26/02/27 16:31:19 INFO DAGScheduler: Parents of final stage: List()
26/02/27 16:31:19 INFO DAGScheduler: Missing parents: List()
26/02/27 16:31:19 INFO DAGScheduler: Missing parents found for ResultStage 0: List()
26/02/27 16:31:19 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0), which has no missing parents
26/02/27 16:31:20 INFO MemoryStore: MemoryStore started with capacity 413.9 MiB
26/02/27 16:31:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 13.7 KiB, free 413.9 MiB)
26/02/27 16:31:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.1 KiB, free 413.9 MiB)
26/02/27 16:31:20 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1686
26/02/27 16:31:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
26/02/27 16:31:20 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
26/02/27 16:31:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (172.19.0.4,executor 0, partition 0, PROCESS_LOCAL, 9679 bytes)
26/02/27 16:31:21 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1178 ms on 172.19.0.4 (executor 0) (1/1)
26/02/27 16:31:21 INFO TaskSchedulerImpl: Removed TaskSet 0.0 whose tasks have all completed, from pool
26/02/27 16:31:21 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 49893
26/02/27 16:31:21 INFO DAGScheduler: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0) finished in 1397 ms
26/02/27 16:31:21 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
26/02/27 16:31:21 INFO TaskSchedulerImpl: Canceling stage 0
26/02/27 16:31:21 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
26/02/27 16:31:21 INFO DAGScheduler: Job 0 finished: showString at NativeMethodAccessorImpl.java:0, took 1480.625651 ms
26/02/27 16:31:21 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
26/02/27 16:31:21 INFO DAGScheduler: Got job 1 (showString at NativeMethodAccessorImpl.java:0) with 3 output partitions
26/02/27 16:31:21 INFO DAGScheduler: Final stage: ResultStage 1 (showString at NativeMethodAccessorImpl.java:0)
26/02/27 16:31:21 INFO DAGScheduler: Parents of final stage: List()
26/02/27 16:31:21 INFO DAGScheduler: Missing parents: List()
26/02/27 16:31:21 INFO DAGScheduler: Missing parents found for ResultStage 1: List()
26/02/27 16:31:21 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0), which has no missing parents
26/02/27 16:31:21 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 13.7 KiB, free 413.9 MiB)
26/02/27 16:31:21 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 7.1 KiB, free 413.9 MiB)
26/02/27 16:31:21 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1686
26/02/27 16:31:21 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(1, 2, 3))
26/02/27 16:31:21 INFO TaskSchedulerImpl: Adding task set 1.0 with 3 tasks resource profile 0
26/02/27 16:31:21 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (172.19.0.4,executor 0, partition 1, PROCESS_LOCAL, 9716 bytes)
26/02/27 16:31:21 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2) (172.19.0.3,executor 1, partition 2, PROCESS_LOCAL, 9714 bytes)
26/02/27 16:31:21 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3) (172.19.0.4,executor 0, partition 3, PROCESS_LOCAL, 9718 bytes)
26/02/27 16:31:21 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 179 ms on 172.19.0.4 (executor 0) (1/3)
26/02/27 16:31:21 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 211 ms on 172.19.0.4 (executor 0) (2/3)
26/02/27 16:31:22 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 1160 ms on 172.19.0.3 (executor 1) (3/3)
26/02/27 16:31:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0 whose tasks have all completed, from pool
26/02/27 16:31:22 INFO DAGScheduler: ResultStage 1 (showString at NativeMethodAccessorImpl.java:0) finished in 1168 ms
26/02/27 16:31:22 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
26/02/27 16:31:22 INFO TaskSchedulerImpl: Canceling stage 1
26/02/27 16:31:22 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
26/02/27 16:31:22 INFO DAGScheduler: Job 1 finished: showString at NativeMethodAccessorImpl.java:0, took 1170.986858 ms
26/02/27 16:31:22 INFO CodeGenerator: Code generated in 10.428925 ms
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+

26/02/27 16:31:22 INFO ParquetUtils: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
26/02/27 16:31:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
26/02/27 16:31:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
26/02/27 16:31:22 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
26/02/27 16:31:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
26/02/27 16:31:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
26/02/27 16:31:22 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
26/02/27 16:31:22 INFO CodeGenerator: Code generated in 8.009026 ms
26/02/27 16:31:22 INFO SparkContext: Starting job: save at NativeMethodAccessorImpl.java:0
26/02/27 16:31:22 INFO DAGScheduler: Got job 2 (save at NativeMethodAccessorImpl.java:0) with 4 output partitions
26/02/27 16:31:22 INFO DAGScheduler: Final stage: ResultStage 2 (save at NativeMethodAccessorImpl.java:0)
26/02/27 16:31:22 INFO DAGScheduler: Parents of final stage: List()
26/02/27 16:31:22 INFO DAGScheduler: Missing parents: List()
26/02/27 16:31:22 INFO DAGScheduler: Missing parents found for ResultStage 2: List()
26/02/27 16:31:22 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[8] at save at NativeMethodAccessorImpl.java:0), which has no missing parents
26/02/27 16:31:23 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 238.7 KiB, free 413.7 MiB)
26/02/27 16:31:23 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 86.9 KiB, free 413.6 MiB)
26/02/27 16:31:23 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1686
26/02/27 16:31:23 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 2 (MapPartitionsRDD[8] at save at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
26/02/27 16:31:23 INFO TaskSchedulerImpl: Adding task set 2.0 with 4 tasks resource profile 0
26/02/27 16:31:23 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4) (172.19.0.4,executor 0, partition 0, PROCESS_LOCAL, 9679 bytes)
26/02/27 16:31:23 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5) (172.19.0.3,executor 1, partition 1, PROCESS_LOCAL, 9716 bytes)
26/02/27 16:31:23 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 6) (172.19.0.4,executor 0, partition 2, PROCESS_LOCAL, 9714 bytes)
26/02/27 16:31:23 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 7) (172.19.0.3,executor 1, partition 3, PROCESS_LOCAL, 9718 bytes)
26/02/27 16:31:23 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 6) in 778 ms on 172.19.0.4 (executor 0) (1/4)
26/02/27 16:31:23 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 778 ms on 172.19.0.4 (executor 0) (2/4)
26/02/27 16:31:23 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 789 ms on 172.19.0.3 (executor 1) (3/4)
26/02/27 16:31:23 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 7) in 788 ms on 172.19.0.3 (executor 1) (4/4)
26/02/27 16:31:23 INFO TaskSchedulerImpl: Removed TaskSet 2.0 whose tasks have all completed, from pool
26/02/27 16:31:23 INFO DAGScheduler: ResultStage 2 (save at NativeMethodAccessorImpl.java:0) finished in 865 ms
26/02/27 16:31:23 INFO DAGScheduler: Job 2 is finished. Cancelling potential speculative or zombie tasks for this job
26/02/27 16:31:23 INFO TaskSchedulerImpl: Canceling stage 2
26/02/27 16:31:23 INFO TaskSchedulerImpl: Killing all running tasks in stage 2: Stage finished
26/02/27 16:31:23 INFO DAGScheduler: Job 2 finished: save at NativeMethodAccessorImpl.java:0, took 868.464989 ms
14926/02/27 16:31:23 INFO FileFormatWriter: Start to commit write Job 1d2cef1f-35fd-49e3-9031-06b56f61da8c.
15026/02/27 16:31:24 INFO FileFormatWriter: Write Job 1d2cef1f-35fd-49e3-9031-06b56f61da8c committed. Elapsed time: 261 ms.
15126/02/27 16:31:24 INFO FileFormatWriter: Finished processing stats for write job 1d2cef1f-35fd-49e3-9031-06b56f61da8c.
15226/02/27 16:31:24 INFO SparkContext: SparkContext is stopping with exitCode 0 from stop at NativeMethodAccessorImpl.java:0.
15326/02/27 16:31:24 INFO SparkUI: Stopped Spark web UI at http://spark-master:4041
15426/02/27 16:31:24 INFO StandaloneSchedulerBackend: Shutting down all executors
15526/02/27 16:31:24 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Asking each executor to shut down
15626/02/27 16:31:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15726/02/27 16:31:24 INFO MemoryStore: MemoryStore cleared
15826/02/27 16:31:24 INFO BlockManager: BlockManager stopped
15926/02/27 16:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
16026/02/27 16:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16126/02/27 16:31:24 INFO SparkContext: Successfully stopped SparkContext (Uptime: 10922 ms)

Script execution using Spark Connect

To execute a PySpark script using Spark Connect:

  1. Create a Python script named spark-connect-app.py in the application/src directory
  2. Execute the created Python script: python spark-connect-app.py
  3. After the script completes successfully, you will find a new file named test_connect.parquet in the data/files directory

Note: to see details about the execution of a Python script run through Spark Connect, go to the Connect tab of the Jobs application interface

Content of the spark-connect-app.py Python script :

from pyspark.sql import SparkSession

REMOTE_URL = "sc://localhost:15002"
REP_DATA_FILES = "file:///opt/spark/data/files"

print("Attempting to connect to Spark Connect server...")

try:
    # Use the .remote() builder method to connect to the Spark Connect server
    spark = SparkSession.builder.remote(REMOTE_URL).getOrCreate()

    print("Successfully connected to Spark!")
    print(f"Spark version: {spark.version}")

    data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
    df = spark.createDataFrame(data, ["id", "name"])

    df.show()
    # Write the DataFrame as parquet into the shared data/files directory
    df.write.mode("overwrite").format("parquet").save(f"{REP_DATA_FILES}/test_connect.parquet")

except Exception as e:
    print(f"❌ Failed to connect or run Spark job: {e}")

finally:
    # Stop the Spark session
    if 'spark' in locals():
        spark.stop()
    print("\nSpark session stopped.")

Content of the log from the Python script execution :

Attempting to connect to Spark Connect server...
Successfully connected to Spark!
Spark version: 4.1.1
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+

Spark session stopped.