Spark v4.x - Introduction
This article presents Spark v4.x and shows how to create a local cluster with Docker.
What’s Spark
Apache Spark is a fast, general‑purpose cluster‑computing system that unifies batch, streaming, machine‑learning, and graph workloads under a single execution engine. It was created to address the limitations of Hadoop MapReduce by leveraging in‑memory data structures and an advanced optimizer.
Core features:
- RDD (Resilient Distributed Dataset): Immutable, fault-tolerant distributed collections that expose transformations (map, filter, reduce) and actions.
- DataFrames / Datasets: Schema-aware, relational API that blends SQL and programmatic access.
- Spark SQL: SQL engine built on Catalyst and Tungsten, supporting ANSI SQL, UDFs, and built-in functions.
- Catalyst Optimizer: A rule-based and cost-based planner that rewrites logical plans into an efficient physical plan.
- Tungsten Execution Engine: Binary, off-heap storage and code generation for CPU-efficient processing.
- Structured Streaming: Continuous query model that treats streaming data as an unbounded DataFrame.
- MLlib: Scalable machine-learning library built on top of Spark Core and Spark SQL.
- GraphX: Graph-processing API that runs on the same cluster engine as DataFrames.
What’s new with Spark v4.x
Apache Spark 4.0 was officially released in May 2025, and Spark 4.1.1 followed in January 2026. Spark 4.0 is the first major version release since Spark 3.0 in June 2020.
This release marks a real shift toward making Spark a “SQL-first” engine that rivals traditional cloud data warehouses in usability, while improving its distributed-computing capabilities.
Key new features:
- Spark Connect: Decouples client applications from Spark cluster processes through a gRPC-based client-server architecture.
- Collation Support: Spark 4.0 introduces explicit collation specification, enabling linguistic sorting (e.g., German ä sorting near a) and case-insensitive operations without custom UDFs.
- ANSI SQL Compliance: ANSI mode is now enabled by default, enforcing stricter SQL semantics that match the ISO/IEC 9075 standard.
- Session SQL Variables: Enable dynamic parameterization of queries without string concatenation or external configuration files.
- Multi-Statement SQL Scripts: Allow executing multiple SQL commands in a single script with procedural control flow.
- Variant Data Type: The VARIANT data type stores semi-structured data (e.g., JSON with nested objects) without a predefined schema.
- Spark Declarative Pipelines (SDP): A declarative framework for building reliable, maintainable, and testable data pipelines on Spark, simplifying ELT development.
- And many performance improvements for Structured Streaming, the Python API, and more.
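A quick sketch of some of these SQL additions (table and column names here are made up for illustration; the syntax follows the Spark 4.0 SQL reference):

```sql
-- Session SQL variable: parameterize a query without string concatenation
DECLARE VARIABLE min_amount INT DEFAULT 100;
SET VAR min_amount = 250;
SELECT * FROM orders WHERE amount > min_amount;

-- Collation: case-insensitive ordering without a custom UDF
SELECT name
FROM customers
ORDER BY name COLLATE UTF8_LCASE;

-- VARIANT: ingest semi-structured JSON without a predefined schema
SELECT PARSE_JSON('{"user": {"id": 1, "name": "Alice"}}') AS v;
```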
Some examples of Spark 4.x support among managed service providers (February 2026):
- Databricks: Spark 4.0 is supported starting from Databricks Runtime 17.2 (September 2025), and Spark 4.1 starting from Databricks Runtime 18.0 (January 2026).
- Snowflake: There does not appear to be a Snowflake Connector supporting Spark 4.x yet; Spark 3.5 is the most recent supported version.
- AWS EMR: EMR 7.12.0 (latest available version) supports Spark 3.5.6.
- AWS Glue: Glue 5.1 (latest available version) supports Spark 3.5.6.
- Microsoft Fabric: Runtime 1.3 (latest GA version as of February 2026) supports Spark 3.5. Runtime 2.0, under development, will support Spark 4.0 (experimental status, no release date).
Additional Information
- Support:
- Support for Java 17 and Java 21
- Support for Python 3.10+
- Support for Scala 2.13
- Enhanced Adaptive Query Execution (AQE)
- Data Skew Optimization: AQE can dynamically split skewed partitions into smaller parts to better distribute the workload across cluster cores.
- Post-shuffle Partition Coalescing: AQE can merge partitions after a shuffle operation if they are too small, which reduces the number of tasks and scheduling overhead. This optimizes the execution plan without requiring manual configuration.
- Join Conversion: AQE can convert sort-merge joins into broadcast-hash joins if one of the tables is small enough, which is much more efficient. In version 4.x, this detection and conversion are more robust and granular.
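These AQE behaviors are driven by a handful of configuration keys; a minimal spark-defaults.conf fragment might look like the following (key names come from the Spark configuration reference; the values shown are illustrative, and most of these are already the defaults in recent versions):

```
spark.sql.adaptive.enabled                    true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.skewJoin.enabled           true
# Size threshold under which AQE may convert a sort-merge join to a broadcast-hash join
spark.sql.adaptive.autoBroadcastJoinThreshold 10MB
```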
- Structured Streaming:
- State Store Data Source: Access internal streaming state as a Table/DataFrame (for debugging, monitoring, and auditing).
- TransformWithState, the new arbitrary stateful operator:
- The transformWithState API: Designed to be flexible and extensible.
- Multiple State Variables: Instead of a single monolithic object, it allows declaring multiple types of state variables for the same key.
- Schema Evolution: Allows modifying the structure of your state data (adding or removing fields) without breaking compatibility with existing checkpoints.
- Improved Timer Management: Management of time (Event Time and Processing Time) has been simplified and made more robust.
- Observability:
- Structured Logs: Spark 4.0 transitions from an unstructured text log format to a structured JSON format.
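Structured JSON logging is opt-in; it is enabled through a single configuration key (the `spark.log.structuredLogging.enabled` flag below comes from the Spark 4.0 release notes and can be set in spark-defaults.conf like any other property):

```
spark.log.structuredLogging.enabled true
```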
How to create a local cluster
To test this new version of Spark, we will create a Docker image and the necessary elements to set up a cluster composed of one Driver and two Workers. The directories used for this project are:
- application: Directory to store all PySpark scripts
- data: Directory to store datasets (input and output)
- logs: Directory to store execution logs from the Spark cluster
- spark-image-docker: Directory to store the files needed to create the Docker image
Create a Docker image
Steps from the spark-image-docker directory:
- Create the spark-defaults.conf file to define the configuration elements for the Spark cluster
- Create the spark-env.sh file to define the environment elements for the Spark cluster
- Create the spark-start.sh file to define the execution script for Spark cluster applications
- Create the Dockerfile to define the content of the Spark image for creating the Spark cluster
- Execute the Docker image build command: docker build -t spark4 spark-image-docker --no-cache
Content of the spark-defaults.conf file:
# --- History Server Configuration ---
spark.eventLog.enabled true
spark.eventLog.dir file:///opt/spark/event_logs
spark.history.fs.logDirectory file:///opt/spark/event_logs

# --- Database Configurations ---
spark.sql.warehouse.dir file:///opt/spark/data/warehouse
Content of the spark-env.sh file:
#!/usr/bin/env bash

export SPARK_LOCAL_IP=$(hostname -i)
export SPARK_PUBLIC_DNS=$(hostname -f)
Content of the spark-start.sh file:
#!/bin/bash

# Start the SSH daemon
/usr/sbin/sshd
if [ $? -ne 0 ]; then
    echo "Failed to start SSH server. Exiting."
    exit 1
fi

if [ "$SPARK_MODE" = "master" ]; then
    echo "Starting Spark Master..."
    # Spark Master/Driver
    $SPARK_HOME/sbin/start-master.sh
    # Spark Connect
    $SPARK_HOME/sbin/start-connect-server.sh
    # History server
    $SPARK_HOME/sbin/start-history-server.sh
elif [ "$SPARK_MODE" = "worker" ]; then
    echo "Starting Spark Worker..."
    # Spark Worker
    $SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER_URL
else
    echo "Unknown SPARK_MODE: $SPARK_MODE"
    exit 1
fi

# Keep the container alive
tail -f $SPARK_HOME/logs/*
Content of the Dockerfile file:
# Use OpenJDK base image
FROM eclipse-temurin:21-jdk-jammy

# Define env variables
ENV SPARK_MASTER="spark://spark-master:7077"
ENV SPARK_MASTER_HOST=spark-master
ENV SPARK_MASTER_PORT=7077
ENV PYSPARK_PYTHON=python3
ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
# Define Spark Version
ENV SPARK_VERSION="4.1.1"

# Install tools
RUN apt-get update \
    && apt-get install -y --no-install-recommends wget tar iputils-ping rsync openssh-server openssh-client \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Manage SSH information
RUN mkdir -p /root/.ssh/ \
    && ssh-keygen -t rsa -f /root/.ssh/id_rsa -q -N "" \
    && cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys \
    && chmod 600 /root/.ssh/authorized_keys \
    && echo "Host *" >> /root/.ssh/config \
    && echo " StrictHostKeyChecking no" >> /root/.ssh/config \
    && chmod 600 /root/.ssh/config \
    && mkdir -p /var/run/sshd \
    && ssh-keygen -A

# Download and install Spark
RUN wget https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz
RUN tar -xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz \
    && rm spark-${SPARK_VERSION}-bin-hadoop3.tgz \
    && mv spark-${SPARK_VERSION}-bin-hadoop3 ${SPARK_HOME} \
    && chown -R root:root ${SPARK_HOME} \
    && mkdir -p ${SPARK_HOME}/logs \
    && mkdir -p ${SPARK_HOME}/event_logs

# Load and set up Spark configuration for logging and history server
COPY spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf

# Load and set up Spark configuration scripts (env and start)
COPY spark-env.sh $SPARK_HOME/conf/spark-env.sh
COPY spark-start.sh $SPARK_HOME/spark-start.sh
RUN chmod +x $SPARK_HOME/conf/spark-env.sh
RUN chmod +x $SPARK_HOME/spark-start.sh

# Expose needed ports
EXPOSE 7077 8080 4040 15002 22

# Entrypoint config
CMD ["/opt/spark/spark-start.sh"]
Content of the log from the build command:
[+] Building 36.2s (15/15) FINISHED
 => [internal] load build definition from Dockerfile
 => => transferring dockerfile: 3.01kB
 => [internal] load metadata for docker.io/library/eclipse-temurin:21-jdk-jammy
 => [internal] load .dockerignore
 => => transferring context: 2B
 => [ 1/10] FROM docker.io/library/eclipse-temurin:21-jdk-jammy@sha256:9119073a0b783fd380fbf4b131a40955525c8e2a66083681c87fb15a39ae01d0
 => => resolve docker.io/library/eclipse-temurin:21-jdk-jammy@sha256:9119073a0b783fd380fbf4b131a40955525c8e2a66083681c87fb15a39ae01d0
 => => sha256:9119073a0b783fd380fbf4b131a40955525c8e2a66083681c87fb15a39ae01d0 5.25kB / 5.25kB
 => => sha256:86cbb2dbf3b68d3c30280722d5597afc845af596fe7f6398db56e5c9f9e0bc4e 1.94kB / 1.94kB
 => => sha256:b3aadbf953f8337a9a81487d7a8a93b72853d9e50781b984c57c7552c8778c7a 5.94kB / 5.94kB
 => => sha256:b1cba2e842ca52b95817f958faf99734080c78e92e43ce609cde9244867b49ed 29.54MB / 29.54MB
 => => sha256:1dde4b555a697f85138e99e7759480373b71f47cbad9f7c0fa6cba34f2f5fe1e 20.69MB / 20.69MB
 => => sha256:878e917e8d81357d9636aab51478d4fe889d01e89988159f87e55dbc3bba337b 157.87MB / 157.87MB
 => => sha256:2049ec1cef96e43c3946f1be57d5efb77052e16d9e7f1fa3d7c8e4030155eac7 158B / 158B
 => => sha256:9760149be10a5530ec0649fba4393ef8c2a058a9a97978c7cf43287b890dfcc0 2.28kB / 2.28kB
 => => extracting sha256:b1cba2e842ca52b95817f958faf99734080c78e92e43ce609cde9244867b49ed
 => => extracting sha256:1dde4b555a697f85138e99e7759480373b71f47cbad9f7c0fa6cba34f2f5fe1e
 => => extracting sha256:878e917e8d81357d9636aab51478d4fe889d01e89988159f87e55dbc3bba337b
 => => extracting sha256:2049ec1cef96e43c3946f1be57d5efb77052e16d9e7f1fa3d7c8e4030155eac7
 => => extracting sha256:9760149be10a5530ec0649fba4393ef8c2a058a9a97978c7cf43287b890dfcc0
 => [internal] load build context
 => => transferring context: 2.11kB
 => [ 2/10] RUN apt-get update && apt-get install -y --no-install-recommends wget tar iputils-ping rsync openssh-server openssh-client && apt-get install -y --no-install-recommends python3 python3-pip && rm -rf /var/lib/apt/lists/*
 => [ 3/10] RUN mkdir -p /root/.ssh/ && ssh-keygen -t rsa -f /root/.ssh/id_rsa -q -N "" && cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys && echo "Host *" >> /root/.ssh/config && echo " StrictHostKeyChecking no" >> /root/.ssh/config && chmod 600 /root/.ssh/config &&
 => [ 4/10] RUN wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
 => [ 5/10] RUN tar -xzf spark-4.1.1-bin-hadoop3.tgz && rm spark-4.1.1-bin-hadoop3.tgz && mv spark-4.1.1-bin-hadoop3 /opt/spark && chown -R root:root /opt/spark && mkdir -p /opt/spark/logs && mkdir -p /opt/spark/event_logs
 => [ 6/10] COPY spark-defaults.conf /opt/spark/conf/spark-defaults.conf
 => [ 7/10] COPY spark-env.sh /opt/spark/conf/spark-env.sh
 => [ 8/10] COPY spark-start.sh /opt/spark/spark-start.sh
 => [ 9/10] RUN chmod +x /opt/spark/conf/spark-env.sh
 => [10/10] RUN chmod +x /opt/spark/spark-start.sh
 => exporting to image
 => => exporting layers
 => => writing image sha256:ecbea515a6ce1fce621a2b7c8962957862f7069fa2c4f1fc30352fdb8abcb72c
 => => naming to docker.io/library/spark4
Create a Docker Compose file
Steps to perform from the root directory:
- Create the compose.yml file to define the various elements of the cluster, which will consist of one Master node and two Worker nodes
- Start the local cluster with the command docker compose up -d
- Stop the local cluster with the command docker compose down
Content of the compose.yml:
services:
  spark-master:
    image: spark4
    container_name: spark-master
    hostname: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=false
      - SPARK_RPC_ENCRYPTION_ENABLED=false
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=false
      - SPARK_SSL_ENABLED=false
      - SPARK_PUBLIC_DNS=spark-master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_DRIVER_MEMORY=1g
      - SPARK_DRIVER_CORES=1
      - SPARK_EXECUTOR_MEMORY=1g
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - "4040:4040"   # Application UI (job details)
      - "8080:8080"   # Master web UI
      - "7077:7077"   # Spark communication port
      - "15002:15002" # Spark Connect port
      - "18080:18080" # History Server UI
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
    volumes:
      - ./data:/opt/spark/data
      - ./logs/events:/opt/spark/event_logs
      - ./application/src:/home/root/src
    networks:
      - spark-network

  spark-worker-1:
    image: spark4
    container_name: spark-worker-1
    hostname: spark-worker-1
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_RPC_AUTHENTICATE=false
      - SPARK_RPC_ENCRYPTION=false
      - SPARK_LOCAL_STORAGE_ENCRYPTION=false
      - SPARK_SSL_ENABLED=no
      - SPARK_PUBLIC_DNS=spark-worker-1
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2g
      - SPARK_EXECUTOR_MEMORY=1g
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - "8081:8081" # Worker web UI
    volumes:
      - ./data:/opt/spark/data
    networks:
      - spark-network
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G

  spark-worker-2:
    image: spark4
    container_name: spark-worker-2
    hostname: spark-worker-2
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_RPC_AUTHENTICATE=false
      - SPARK_RPC_ENCRYPTION=false
      - SPARK_LOCAL_STORAGE_ENCRYPTION=false
      - SPARK_SSL_ENABLED=no
      - SPARK_PUBLIC_DNS=spark-worker-2
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2g
      - SPARK_EXECUTOR_MEMORY=1g
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - "8082:8081" # Worker web UI
    volumes:
      - ./data:/opt/spark/data
    networks:
      - spark-network
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G

networks:
  spark-network:
    driver: bridge
Content of the log from the command docker compose up -d:
[+] up 4/4
 ✔ Network spark4_spark-network Created
 ✔ Container spark-master Created
 ✔ Container spark-worker-1 Created
 ✔ Container spark-worker-2 Created
Content of the log from the stop command docker compose down:
[+] down 4/4
 ✔ Container spark-worker-2 Removed
 ✔ Container spark-worker-1 Removed
 ✔ Container spark-master Removed
 ✔ Network spark4_spark-network Removed
How to execute a PySpark application
List of ports and interfaces
Based on the configuration in the compose.yml file:
- 4040: Communication port for the application UI (Jobs UI)
- 7077: Communication port for the Spark cluster (internal)
- 8080: Communication port for the Master node web interface (driver)
- 8081: Communication port for the Worker #1 node web interface
- 8082: Communication port for the Worker #2 node web interface
- 15002: Communication port for the Spark Connect server
- 18080: Communication port for the History Server interface
Script execution on the cluster
To execute a PySpark script directly on the local cluster:
- Create a Python script named spark-submit-app.py in the application/src directory
- Execute the following Docker command to run the created Python script: docker exec -it spark-master spark-submit --master spark://spark-master:7077 --conf spark.driver.host=spark-master --name "TestApp" /home/root/src/spark-submit-app.py
- After successful execution of the script, you will find a new file named test.parquet in the data/files directory, corresponding to the file created in the example

Note: the user's application/src directory is mapped by default, in the configuration defined in the compose.yml file, to the /home/root/src directory.
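Since the image also starts a Spark Connect server (exposed on port 15002), the same kind of job could alternatively be driven from a client outside the containers. The following is an untested sketch; it assumes a local pyspark installation with the Connect client, matching the server version:

```python
from pyspark.sql import SparkSession

# Connect to the Spark Connect server exposed by the spark-master container
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.show()

spark.stop()
```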
Content of the spark-submit-app.py Python script:
from pyspark.sql import SparkSession

REP_DATA_FILES = "file:///opt/spark/data/files"

spark = SparkSession.builder \
    .appName("TestSubmitApp") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
df = spark.createDataFrame(data, ["id", "name"])
df.show()
df.write.mode("overwrite").format("parquet").save(f"{REP_DATA_FILES}/test.parquet")

spark.stop()
Content of the log from the execution command:
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/02/27 16:31:13 INFO SparkContext: Running Spark version 4.1.1
26/02/27 16:31:13 INFO SparkContext: OS info Linux, 6.6.87.2-microsoft-standard-WSL2, amd64
26/02/27 16:31:13 INFO SparkContext: Java version 21.0.10+7-LTS
26/02/27 16:31:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/02/27 16:31:13 INFO ResourceUtils: ==============================================================
26/02/27 16:31:13 INFO ResourceUtils: No custom resources configured for spark.driver.
26/02/27 16:31:13 INFO ResourceUtils: ==============================================================
26/02/27 16:31:13 INFO SparkContext: Submitted application: TestSubmitApp
26/02/27 16:31:13 INFO SecurityManager: Changing view acls to: root
26/02/27 16:31:13 INFO SecurityManager: Changing modify acls to: root
26/02/27 16:31:13 INFO SecurityManager: Changing view acls groups to: root
26/02/27 16:31:13 INFO SecurityManager: Changing modify acls groups to: root
26/02/27 16:31:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: root groups with view permissions: EMPTY; users with modify permissions: root; groups with modify permissions: EMPTY; RPC SSL disabled
26/02/27 16:31:13 INFO Utils: Successfully started service 'sparkDriver' on port 34403.
26/02/27 16:31:13 INFO SparkEnv: Registering MapOutputTracker
26/02/27 16:31:13 INFO SparkEnv: Registering BlockManagerMaster
26/02/27 16:31:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
26/02/27 16:31:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
26/02/27 16:31:13 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
26/02/27 16:31:13 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-f5f0ee24-17ba-442b-9614-1ee27607cec2
26/02/27 16:31:13 INFO SparkEnv: Registering OutputCommitCoordinator
26/02/27 16:31:14 INFO JettyUtils: Start Jetty 172.19.0.2:4040 for SparkUI
26/02/27 16:31:14 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
26/02/27 16:31:14 INFO Utils: Successfully started service 'SparkUI' on port 4041.
26/02/27 16:31:14 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
26/02/27 16:31:14 INFO ResourceProfile: Limiting resource is cpu
26/02/27 16:31:14 INFO ResourceProfileManager: Added ResourceProfile id: 0
26/02/27 16:31:14 INFO SecurityManager: Changing view acls to: root
26/02/27 16:31:14 INFO SecurityManager: Changing modify acls to: root
26/02/27 16:31:14 INFO SecurityManager: Changing view acls groups to: root
26/02/27 16:31:14 INFO SecurityManager: Changing modify acls groups to: root
26/02/27 16:31:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: root groups with view permissions: EMPTY; users with modify permissions: root; groups with modify permissions: EMPTY; RPC SSL disabled
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-master:7077...
26/02/27 16:31:14 INFO TransportClientFactory: Successfully created connection to spark-master/172.19.0.2:7077 after 17 ms (0 ms spent in bootstraps)
26/02/27 16:31:14 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20260227163114-0001
26/02/27 16:31:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41333.
26/02/27 16:31:14 INFO NettyBlockTransferService: Server created on spark-master:41333
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20260227163114-0001/0 on worker-20260227161953-172.19.0.4-40325 (172.19.0.4:40325) with 2 core(s)
26/02/27 16:31:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
26/02/27 16:31:14 INFO StandaloneSchedulerBackend: Granted executor ID app-20260227163114-0001/0 on hostPort 172.19.0.4:40325 with 2 core(s), 1024.0 MiB RAM
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20260227163114-0001/1 on worker-20260227161953-172.19.0.3-38037 (172.19.0.3:38037) with 2 core(s)
26/02/27 16:31:14 INFO StandaloneSchedulerBackend: Granted executor ID app-20260227163114-0001/1 on hostPort 172.19.0.3:38037 with 2 core(s), 1024.0 MiB RAM
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20260227163114-0001/0 is now RUNNING
26/02/27 16:31:14 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20260227163114-0001/1 is now RUNNING
26/02/27 16:31:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, spark-master, 41333, None)
26/02/27 16:31:14 INFO BlockManagerMasterEndpoint: Registering block manager spark-master:41333 with 413.9 MiB RAM, BlockManagerId(driver, spark-master, 41333, None)
26/02/27 16:31:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, spark-master, 41333, None)
26/02/27 16:31:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, spark-master, 41333, None)
26/02/27 16:31:15 INFO RollingEventLogFilesWriter: Logging events to file:/opt/spark/event_logs/eventlog_v2_app-20260227163114-0001/events_1_app-20260227163114-0001.zstd
26/02/27 16:31:15 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
26/02/27 16:31:16 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.19.0.3:39714) with ID 1, ResourceProfileId 0
26/02/27 16:31:16 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.19.0.4:55040) with ID 0, ResourceProfileId 0
26/02/27 16:31:16 INFO BlockManagerMasterEndpoint: Registering block manager 172.19.0.4:39007 with 434.4 MiB RAM, BlockManagerId(0, 172.19.0.4, 39007, None)
26/02/27 16:31:16 INFO BlockManagerMasterEndpoint: Registering block manager 172.19.0.3:40421 with 434.4 MiB RAM, BlockManagerId(1, 172.19.0.3, 40421, None)
26/02/27 16:31:18 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
26/02/27 16:31:18 INFO SharedState: Warehouse path is 'file:/opt/spark/data/warehouse'.
26/02/27 16:31:19 INFO CodeGenerator: Code generated in 314.329914 ms
26/02/27 16:31:19 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
26/02/27 16:31:19 INFO DAGScheduler: Got job 0 (showString at NativeMethodAccessorImpl.java:0) with 1 output partitions
26/02/27 16:31:19 INFO DAGScheduler: Final stage: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0)
26/02/27 16:31:19 INFO DAGScheduler: Parents of final stage: List()
26/02/27 16:31:19 INFO DAGScheduler: Missing parents: List()
26/02/27 16:31:19 INFO DAGScheduler: Missing parents found for ResultStage 0: List()
26/02/27 16:31:19 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0), which has no missing parents
26/02/27 16:31:20 INFO MemoryStore: MemoryStore started with capacity 413.9 MiB
26/02/27 16:31:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 13.7 KiB, free 413.9 MiB)
26/02/27 16:31:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.1 KiB, free 413.9 MiB)
26/02/27 16:31:20 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1686
26/02/27 16:31:20 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
26/02/27 16:31:20 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
26/02/27 16:31:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (172.19.0.4,executor 0, partition 0, PROCESS_LOCAL, 9679 bytes)
26/02/27 16:31:21 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1178 ms on 172.19.0.4 (executor 0) (1/1)
26/02/27 16:31:21 INFO TaskSchedulerImpl: Removed TaskSet 0.0 whose tasks have all completed, from pool
26/02/27 16:31:21 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 49893
26/02/27 16:31:21 INFO DAGScheduler: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0) finished in 1397 ms
26/02/27 16:31:21 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
26/02/27 16:31:21 INFO TaskSchedulerImpl: Canceling stage 0
26/02/27 16:31:21 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
26/02/27 16:31:21 INFO DAGScheduler: Job 0 finished: showString at NativeMethodAccessorImpl.java:0, took 1480.625651 ms
26/02/27 16:31:21 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
26/02/27 16:31:21 INFO DAGScheduler: Got job 1 (showString at NativeMethodAccessorImpl.java:0) with 3 output partitions
26/02/27 16:31:21 INFO DAGScheduler: Final stage: ResultStage 1 (showString at NativeMethodAccessorImpl.java:0)
26/02/27 16:31:21 INFO DAGScheduler: Parents of final stage: List()
26/02/27 16:31:21 INFO DAGScheduler: Missing parents: List()
26/02/27 16:31:21 INFO DAGScheduler: Missing parents found for ResultStage 1: List()
26/02/27 16:31:21 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0), which has no missing parents
26/02/27 16:31:21 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 13.7 KiB, free 413.9 MiB)
26/02/27 16:31:21 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 7.1 KiB, free 413.9 MiB)
26/02/27 16:31:21 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1686
26/02/27 16:31:21 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(1, 2, 3))
26/02/27 16:31:21 INFO TaskSchedulerImpl: Adding task set 1.0 with 3 tasks resource profile 0
26/02/27 16:31:21 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (172.19.0.4,executor 0, partition 1, PROCESS_LOCAL, 9716 bytes)
26/02/27 16:31:21 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2) (172.19.0.3,executor 1, partition 2, PROCESS_LOCAL, 9714 bytes)
26/02/27 16:31:21 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3) (172.19.0.4,executor 0, partition 3, PROCESS_LOCAL, 9718 bytes)
26/02/27 16:31:21 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 179 ms on 172.19.0.4 (executor 0) (1/3)
26/02/27 16:31:21 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 211 ms on 172.19.0.4 (executor 0) (2/3)
26/02/27 16:31:22 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 1160 ms on 172.19.0.3 (executor 1) (3/3)
26/02/27 16:31:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0 whose tasks have all completed, from pool
26/02/27 16:31:22 INFO DAGScheduler: ResultStage 1 (showString at NativeMethodAccessorImpl.java:0) finished in 1168 ms
26/02/27 16:31:22 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
26/02/27 16:31:22 INFO TaskSchedulerImpl: Canceling stage 1
26/02/27 16:31:22 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
26/02/27 16:31:22 INFO DAGScheduler: Job 1 finished: showString at NativeMethodAccessorImpl.java:0, took 1170.986858 ms
26/02/27 16:31:22 INFO CodeGenerator: Code generated in 10.428925 ms
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+

26/02/27 16:31:22 INFO ParquetUtils: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
26/02/27 16:31:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
26/02/27 16:31:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
26/02/27 16:31:22 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
26/02/27 16:31:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
26/02/27 16:31:22 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
26/02/27 16:31:22 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
26/02/27 16:31:22 INFO CodeGenerator: Code generated in 8.009026 ms
26/02/27 16:31:22 INFO SparkContext: Starting job: save at NativeMethodAccessorImpl.java:0
26/02/27 16:31:22 INFO DAGScheduler: Got job 2 (save at NativeMethodAccessorImpl.java:0) with 4 output partitions
26/02/27 16:31:22 INFO DAGScheduler: Final stage: ResultStage 2 (save at NativeMethodAccessorImpl.java:0)
26/02/27 16:31:22 INFO DAGScheduler: Parents of final stage: List()
26/02/27 16:31:22 INFO DAGScheduler: Missing parents: List()
26/02/27 16:31:22 INFO DAGScheduler: Missing parents found for ResultStage 2: List()
26/02/27 16:31:22 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[8] at save at NativeMethodAccessorImpl.java:0), which has no missing parents
26/02/27 16:31:23 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 238.7 KiB, free 413.7 MiB)
26/02/27 16:31:23 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 86.9 KiB, free 413.6 MiB)
26/02/27 16:31:23 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1686
26/02/27 16:31:23 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 2 (MapPartitionsRDD[8] at save at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
26/02/27 16:31:23 INFO TaskSchedulerImpl: Adding task set 2.0 with 4 tasks resource profile 0
26/02/27 16:31:23 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4) (172.19.0.4,executor 0, partition 0, PROCESS_LOCAL, 9679 bytes)
26/02/27 16:31:23 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5) (172.19.0.3,executor 1, partition 1, PROCESS_LOCAL, 9716 bytes)
26/02/27 16:31:23 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 6) (172.19.0.4,executor 0, partition 2, PROCESS_LOCAL, 9714 bytes)
26/02/27 16:31:23 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 7) (172.19.0.3,executor 1, partition 3, PROCESS_LOCAL, 9718 bytes)
26/02/27 16:31:23 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 6) in 778 ms on 172.19.0.4 (executor 0) (1/4)
26/02/27 16:31:23 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 778 ms on 172.19.0.4 (executor 0) (2/4)
26/02/27 16:31:23 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 789 ms on 172.19.0.3 (executor 1) (3/4)
26/02/27 16:31:23 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 7) in 788 ms on 172.19.0.3 (executor 1) (4/4)
26/02/27 16:31:23 INFO TaskSchedulerImpl: Removed TaskSet 2.0 whose tasks have all completed, from pool
26/02/27 16:31:23 INFO DAGScheduler: ResultStage 2 (save at NativeMethodAccessorImpl.java:0) finished in 865 ms
26/02/27 16:31:23 INFO DAGScheduler: Job 2 is finished. Cancelling potential speculative or zombie tasks for this job
26/02/27 16:31:23 INFO TaskSchedulerImpl: Canceling stage 2
26/02/27 16:31:23 INFO TaskSchedulerImpl: Killing all running tasks in stage 2: Stage finished
26/02/27 16:31:23 INFO DAGScheduler: Job 2 finished: save at NativeMethodAccessorImpl.java:0, took 868.464989 ms
14926/02/27 16:31:23 INFO FileFormatWriter: Start to commit write Job 1d2cef1f-35fd-49e3-9031-06b56f61da8c.
15026/02/27 16:31:24 INFO FileFormatWriter: Write Job 1d2cef1f-35fd-49e3-9031-06b56f61da8c committed. Elapsed time: 261 ms.
15126/02/27 16:31:24 INFO FileFormatWriter: Finished processing stats for write job 1d2cef1f-35fd-49e3-9031-06b56f61da8c.
15226/02/27 16:31:24 INFO SparkContext: SparkContext is stopping with exitCode 0 from stop at NativeMethodAccessorImpl.java:0.
15326/02/27 16:31:24 INFO SparkUI: Stopped Spark web UI at http://spark-master:4041
15426/02/27 16:31:24 INFO StandaloneSchedulerBackend: Shutting down all executors
15526/02/27 16:31:24 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Asking each executor to shut down
15626/02/27 16:31:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15726/02/27 16:31:24 INFO MemoryStore: MemoryStore cleared
15826/02/27 16:31:24 INFO BlockManager: BlockManager stopped
15926/02/27 16:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
16026/02/27 16:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16126/02/27 16:31:24 INFO SparkContext: Successfully stopped SparkContext (Uptime: 10922 ms)
Script execution using Spark Connect
To execute a PySpark script using Spark Connect:
- Create a Python script named spark-connect-app.py in the application/src directory
- Execute the created Python script: python spark-connect-app.py
- After successful execution of the script, you will find a new file named test_connect.parquet in the data/files directory, corresponding to the file created in the example
Note: to get information about the execution of the Python script with Spark Connect, you need to go to the Connect tab in the Jobs application interface
Content of the spark-connect-app.py Python script :
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

REMOTE_URL = "sc://localhost:15002"
REP_DATA_FILES = "file:///opt/spark/data/files"

print("Attempting to connect to Spark Connect server...")

try:
    # Use the .remote() builder method to connect to the Spark Connect server
    spark = SparkSession.builder.remote(REMOTE_URL).getOrCreate()

    print("Successfully connected to Spark!")
    print(f"Spark version: {spark.version}")

    data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
    df = spark.createDataFrame(data, ["id", "name"])

    df.show()
    df.write.mode("overwrite").format("parquet").save(f"{REP_DATA_FILES}/test_connect.parquet")

except Exception as e:
    print(f"❌ Failed to connect or run Spark job: {e}")

finally:
    # Stop the Spark session
    if 'spark' in locals():
        spark.stop()
        print("\nSpark session stopped.")
Content of the log from the Python script execution :
Attempting to connect to Spark Connect server...
Successfully connected to Spark!
Spark version: 4.1.1
+---+-------+
| id| name|
+---+-------+
| 1| Alice|
| 2| Bob|
| 3|Charlie|
+---+-------+


Spark session stopped.
