Spark : v4.x - Features - SQL Collation Support
This article covers the string SQL collation support feature introduced in Spark v4.x.
Introduction
Collation controls how Spark compares and sorts string data. Spark 4.0 introduces explicit collation specification, enabling linguistic sorting and case-insensitive operations without custom UDFs.
Default behavior remains UTF-8 binary comparison for backward compatibility.
Critical for multi-locale analytics, data quality rules, and migrations from traditional databases. Adds query planning overhead and can cause performance regression on large string operations.
Detail
Collation defines the rules for comparing string values. Traditional databases support locale-specific collations (e.g., en_CI or de_AI) that handle language-specific sorting rules and case sensitivity.
Before Spark 4.0, all string operations used UTF-8 binary comparison: byte-by-byte matching with no linguistic awareness.
UTF-8 binary collation sorts by Unicode code point values. This means ‘Z’ (U+005A) sorts before ‘a’ (U+0061) because uppercase letters have lower code points. Accented characters sort at the end of the alphabet rather than near their base letters. Case-insensitive operations required lower() function calls, which create new strings and increase memory usage.
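The code-point behavior described above is easy to reproduce outside Spark, since Python's default string comparison is also by Unicode code point. A minimal illustration (plain Python, not Spark):

```python
names = ["apple", "Zebra", "Éclair", "echo"]

# Binary ordering: 'Z' (U+005A) sorts before 'a' (U+0061), and the accented
# 'É' (U+00C9) lands after all the ASCII entries rather than near 'e'.
binary_order = sorted(names)
# → ['Zebra', 'apple', 'echo', 'Éclair']

# Pre-collation workaround: compare via lower(), which allocates a new
# lowercased copy of every string just to get case-insensitive ordering.
ci_order = sorted(names, key=str.lower)
# → ['apple', 'echo', 'Zebra', 'Éclair']
```

This is exactly the cost collation support removes: the ordering rule moves into the comparison itself instead of requiring transformed copies of the data.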
Spark 4.0 adds the COLLATE clause and the spark.sql.session.collation.enabled configuration. You can now specify collation at column definition, expression level, or session default. The implementation leverages ICU (International Components for Unicode) library for linguistic rules.
Three primary collation families exist in Spark 4.0:
- UTF8_BINARY: Default byte-by-byte comparison (backward compatible)
- UTF8_LCASE: Case-insensitive UTF-8 comparison (most common use case)
- UNICODE: Linguistic collation with locale-specific rules (e.g., UNICODE_DE for German)
The collation system integrates with the Catalyst optimizer. Spark analyzes collation requirements during query planning and injects appropriate comparison logic. This adds planning overhead but enables vectorized execution for collation-aware operations rather than falling back to row-by-row processing.
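The planner-level idea can be sketched in plain Python: compute a comparison key once per value, then hash and join on that key. This mirrors the collationkey() expression that appears in Spark's physical plans for collated columns, but it is only an illustration of the concept, not Spark's implementation:

```python
def collation_key(s: str) -> str:
    # For UTF8_LCASE-style semantics, a casefolded copy can act as the key.
    return s.casefold()

crm = {"C001": "Alice.Smith@company.com", "C004": "david.brown@tech.com"}
txn = {"T001": "alice.smith@company.com", "T004": "DAVID.BROWN@tech.com"}

# Build a hash index on the key (the analogue of hash-partitioning on the
# collation key rather than on the raw string bytes).
index = {}
for cid, email in crm.items():
    index.setdefault(collation_key(email), []).append(cid)

matches = [(tid, cid)
           for tid, email in txn.items()
           for cid in index.get(collation_key(email), [])]
# → [('T001', 'C001'), ('T004', 'C004')]
```

Because the key is derived once per row, the join stays hash-based and vectorizable instead of evaluating linguistic comparison logic for every candidate pair.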
Key Collations :
| Identifier | Description |
|---|---|
| UTF8_BINARY | Historical default. Byte-by-byte binary comparison. |
| UTF8_LCASE | Case-insensitive, accent-sensitive. |
| *_CI_* | Case-insensitive. |
| *_CS_* | Case-sensitive. |
| *_AI_* | Accent-insensitive. |
| *_AS_* | Accent-sensitive. |
| fr* | French linguistic order. |
| en* | English linguistic order. |
| de* | German linguistic order. |
Collation SQL functions :
- collate : Marks a given expression with the specified collation.
- collation : Returns the collation name of a given expression.
- collations : Returns all string collations supported by Spark SQL.
Advantages
- Internationalization support : Applications serving multiple regions can apply locale-specific sorting. This improves data quality for customer-facing reports and regulatory compliance.
- Code simplification : Eliminates custom UDFs for case-insensitive operations. With the UTF8_LCASE collation, comparison happens directly, without applying a lower(col) function.
- Database migration compatibility : Aligns with traditional databases (PostgreSQL, MySQL, SQL Server, …) that have supported collations for a long time. Teams migrating from these systems can now preserve collation semantics without query rewrites, which reduces migration risk and testing burden.
Limitations
- Performance Overhead on Non-Binary Collations : Linguistic collation requires ICU library calls that are slower than binary comparison. The overhead increases with string length and complexity of collation rules.
- Collation Mixing Restrictions : Spark prohibits operations between columns with different collations without an explicit COLLATE cast. This breaks existing queries that implicitly compare string columns whose collations differ. Joins across different collations require an explicit cast, which can trigger re-shuffling.
- Very Limited Collation Family Support : Spark 4 supports a limited set of collations, so check that the collation you need is available in your Spark version. In Spark 4.0.1, only UTF8 collations plus selected locale-specific collations are supported.
- Limited Collation Functionality Support : Not all collations are supported for partitioned columns, and there is no collation inference from data sources (you must specify the collation manually in Spark object definitions).
When not to use Collation :
- For performance-critical joins.
- For partition keys.
Real-World Use Cases
- Use Case 1 : Financial Institution Customer Matching
- A bank merges customer records from acquired institutions. Customer names contain accented characters (José, François, Müller). Deduplication logic needs to match “Mueller” with “Müller” according to German collation rules.
The de_AI collation provides linguistic equivalence without custom normalization functions.
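The accent-insensitive half of this matching can be approximated in plain Python with Unicode normalization. This is a hedged illustration, not Spark's de_AI implementation: NFD decomposition strips accents generally, but it does not cover German-specific equivalences such as "ü" ↔ "ue", which need real locale rules:

```python
import unicodedata

def strip_accents(s: str) -> str:
    # NFD splits 'é' into 'e' plus a combining accent; drop the combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

strip_accents("José")      # → 'Jose'
strip_accents("François")  # → 'Francois'
strip_accents("Müller")    # → 'Muller' (but NOT 'Mueller': that needs German rules)
```

A collation-aware engine moves this normalization into the comparison operator itself, so no derived columns or helper functions are needed.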
- Use Case 2 : E-Commerce Product Catalog with Locale-Specific Sorting
- An international retailer displays product listings sorted alphabetically by name. Spanish customers expect “ñ” to sort between “n” and “o”, while default UTF-8 binary sorts it at the end. Different collations per region enable correct sort order without maintaining separate datasets.
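The "ñ" behavior is easy to see in plain Python, where default sorting is by code point. The hand-rolled sort key below is a toy illustration of what a Spanish collation does (real engines use ICU locale data, not string substitution):

```python
# 'ñ' is U+00F1, so code-point order puts it after 'z', not between 'n' and 'o'.
words = ["oso", "ñandú", "nube"]

def spanish_key(s: str) -> str:
    # Toy key: map 'ñ' to a sequence that orders after every 'n...' word
    # but before 'o'. Illustration only; not how ICU builds sort keys.
    return s.lower().replace("ñ", "n\x7f")

sorted(words)                    # → ['nube', 'oso', 'ñandú']  (code-point order)
sorted(words, key=spanish_key)   # → ['nube', 'ñandú', 'oso']  (ñ between n and o)
```

With collation support, the equivalent is simply `ORDER BY name COLLATE es` per region, with no derived sort-key columns to maintain.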
Codes
Example 1 : Case-Insensitive User Matching with UTF8_LCASE
This example demonstrates user record matching where email addresses may have inconsistent capitalization, a common scenario in data quality pipelines that merge records from multiple source systems.
Spark
# PySpark Example - Case-Insensitive User Matching
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

URL_MASTER = "spark://spark-master:7077"


# Initialize Spark with collation support enabled
# spark.sql.session.collation.enabled must be true (default in 4.0+)
# spark.sql.collation.defaultCollation sets session-wide default
spark = SparkSession.builder \
    .appName("SparkCollationCaseInsensitive_1") \
    .config("spark.sql.session.collation.enabled", "true") \
    .config("spark.sql.collation.defaultCollation", "UTF8_BINARY") \
    .master(URL_MASTER) \
    .getOrCreate()

print("""
##########################
##### Build Datasets #####
##########################
""")

# Source system 1: Customer records from CRM
# Emails stored with original capitalization from user input
crm_customers = [
    ("C001", "Alice.Smith@company.com", "Alice Smith", "2021-01-15"),
    ("C002", "bob.jones@enterprise.net", "Bob Jones", "2025-02-20"),
    ("C003", "Carol.WHITE@startup.io", "Carol White", "2024-03-10"),
    ("C004", "david.brown@tech.com", "David Brown", "2024-04-05"),
    ("C005", "Eve.DAVIS@innovation.com", "Eve Davis", "2025-05-12"),
]

# Source system 2: Transaction records from payment processor
# Emails normalized to lowercase by payment gateway
transaction_emails = [
    ("T001", "alice.smith@company.com", 450.00, "2026-01-10"),
    ("T002", "bob.jones@enterprise.net", 1200.00, "2026-01-15"),
    ("T003", "carol.white@startup.io", 300.00, "2026-01-20"),
    ("T004", "DAVID.BROWN@tech.com", 890.00, "2026-01-25"),
    ("T005", "eve.davis@innovation.com", 550.00, "2026-02-01"),
    ("T006", "frank.miller@unknown.com", 200.00, "2026-02-05"),
]


# Create DataFrames without collation (default UTF8_BINARY)
crm_df = spark.createDataFrame(crm_customers, ["customer_id", "email", "full_name", "registration_date"])
print(f"crm_customers dataset: {crm_df.count()}")
crm_df.show(truncate=False)

transactions_df = spark.createDataFrame(transaction_emails, ["transaction_id", "email", "amount", "transaction_date"])
print(f"transaction_emails dataset: {transactions_df.count()}")
transactions_df.show(truncate=False)


# Register DataFrames as temporary views for SQL access
crm_df.createOrReplaceTempView("crm_customers")
transactions_df.createOrReplaceTempView("transactions")


print("""
###################################
##### 1. Standard binary join #####
###################################
""")

# Standard binary join - will MISS matches due to case differences
print("=== Binary Collation Join (Default Behavior) ===")
binary_join = crm_df.join(transactions_df, crm_df.email == transactions_df.email, "inner") \
    .select(
        crm_df.customer_id,
        crm_df.full_name,
        crm_df.email.alias("crm_email"),
        transactions_df.email.alias("txn_email"),
        transactions_df.amount
    )
print(f"Matched records: {binary_join.count()}")  # Will show only 1 match
binary_join.show(truncate=False)



print("""
##############################################################
##### 2. Case-Insensitive Join with UTF8_LCASE Collation #####
##############################################################
""")
# Apply collation at expression level using SQL COLLATE syntax

# SQL approach: Apply collation in the join condition
case_insensitive_sql = spark.sql("""
    SELECT
        c.customer_id,
        c.full_name,
        c.email as crm_email,
        t.email as txn_email,
        t.amount,
        t.transaction_date
    FROM crm_customers c
    INNER JOIN transactions t
        ON c.email COLLATE UTF8_LCASE = t.email COLLATE UTF8_LCASE
    ORDER BY c.customer_id
""")

print(f"Matched records: {case_insensitive_sql.count()}")  # Will show 5 matches
case_insensitive_sql.show(truncate=False)




print("""
##########################################
##### 3. Collation Defined in Schema #####
##########################################
""")
# Define collation at table level

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW crm_customers_collated AS
    SELECT
        customer_id,
        email COLLATE UTF8_LCASE as email,
        full_name,
        registration_date
    FROM crm_customers
""")

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW transactions_collated AS
    SELECT
        transaction_id,
        email COLLATE UTF8_LCASE as email,
        amount,
        transaction_date
    FROM transactions
""")

# Join without explicit COLLATE - the collation is inherited from the schema
schema_collation_join = spark.sql("""
    SELECT
        c.customer_id,
        c.full_name,
        c.email as crm_email,
        t.email as txn_email,
        t.amount,
        t.transaction_date
    FROM crm_customers_collated c
    INNER JOIN transactions_collated t ON c.email = t.email
    ORDER BY c.customer_id
""")

print(f"Matched records: {schema_collation_join.count()}")  # Shows 5 matches
schema_collation_join.show(truncate=False)



print("""
############################################
##### 4. Collation metadata inspection #####
############################################
""")

print("crm_customers_collated schema :")
print(spark.sql("SELECT email FROM crm_customers_collated").schema)

# Performance comparison: Show that the collation join avoids string transformation
# Binary case-insensitive matching requires LOWER(), which allocates new strings
lower_join = spark.sql("""
    SELECT
        c.customer_id,
        c.full_name,
        COUNT(*) as match_count
    FROM crm_customers c
    INNER JOIN transactions t
        ON LOWER(c.email) = LOWER(t.email)
    GROUP BY c.customer_id, c.full_name
""")

print("\n########## Lower() Function Approach (before Spark 4.x) ##########")
lower_join.explain("formatted")  # Plan shows lower() calls on the join keys
print(lower_join.collect())

print("\n########## Collation Function Approach (Spark 4.x) ##########")
case_insensitive_sql.explain("formatted")  # Plan shows collate() on the join keys instead of lower()

print("\n########## Collation Column Approach (Spark 4.x) ##########")
schema_collation_join.explain("formatted")  # Plan shows collationkey() on the join keys

spark.stop()
Result

##########################
##### Build Datasets #####
##########################

crm_customers dataset: 5
+-----------+------------------------+-----------+-----------------+
|customer_id|email                   |full_name  |registration_date|
+-----------+------------------------+-----------+-----------------+
|C001       |Alice.Smith@company.com |Alice Smith|2021-01-15       |
|C002       |bob.jones@enterprise.net|Bob Jones  |2025-02-20       |
|C003       |Carol.WHITE@startup.io  |Carol White|2024-03-10       |
|C004       |david.brown@tech.com    |David Brown|2024-04-05       |
|C005       |Eve.DAVIS@innovation.com|Eve Davis  |2025-05-12       |
+-----------+------------------------+-----------+-----------------+

transaction_emails dataset: 6
+--------------+------------------------+------+----------------+
|transaction_id|email                   |amount|transaction_date|
+--------------+------------------------+------+----------------+
|T001          |alice.smith@company.com |450.0 |2026-01-10      |
|T002          |bob.jones@enterprise.net|1200.0|2026-01-15      |
|T003          |carol.white@startup.io  |300.0 |2026-01-20      |
|T004          |DAVID.BROWN@tech.com    |890.0 |2026-01-25      |
|T005          |eve.davis@innovation.com|550.0 |2026-02-01      |
|T006          |frank.miller@unknown.com|200.0 |2026-02-05      |
+--------------+------------------------+------+----------------+


###################################
##### 1. Standard binary join #####
###################################

=== Binary Collation Join (Default Behavior) ===
Matched records: 1
+-----------+---------+------------------------+------------------------+------+
|customer_id|full_name|crm_email               |txn_email               |amount|
+-----------+---------+------------------------+------------------------+------+
|C002       |Bob Jones|bob.jones@enterprise.net|bob.jones@enterprise.net|1200.0|
+-----------+---------+------------------------+------------------------+------+


##############################################################
##### 2. Case-Insensitive Join with UTF8_LCASE Collation #####
##############################################################

Matched records: 5
+-----------+-----------+------------------------+------------------------+------+----------------+
|customer_id|full_name  |crm_email               |txn_email               |amount|transaction_date|
+-----------+-----------+------------------------+------------------------+------+----------------+
|C001       |Alice Smith|Alice.Smith@company.com |alice.smith@company.com |450.0 |2026-01-10      |
|C002       |Bob Jones  |bob.jones@enterprise.net|bob.jones@enterprise.net|1200.0|2026-01-15      |
|C003       |Carol White|Carol.WHITE@startup.io  |carol.white@startup.io  |300.0 |2026-01-20      |
|C004       |David Brown|david.brown@tech.com    |DAVID.BROWN@tech.com    |890.0 |2026-01-25      |
|C005       |Eve Davis  |Eve.DAVIS@innovation.com|eve.davis@innovation.com|550.0 |2026-02-01      |
+-----------+-----------+------------------------+------------------------+------+----------------+


##########################################
##### 3. Collation Defined in Schema #####
##########################################

Matched records: 5
+-----------+-----------+------------------------+------------------------+------+----------------+
|customer_id|full_name  |crm_email               |txn_email               |amount|transaction_date|
+-----------+-----------+------------------------+------------------------+------+----------------+
|C001       |Alice Smith|Alice.Smith@company.com |alice.smith@company.com |450.0 |2026-01-10      |
|C002       |Bob Jones  |bob.jones@enterprise.net|bob.jones@enterprise.net|1200.0|2026-01-15      |
|C003       |Carol White|Carol.WHITE@startup.io  |carol.white@startup.io  |300.0 |2026-01-20      |
|C004       |David Brown|david.brown@tech.com    |DAVID.BROWN@tech.com    |890.0 |2026-01-25      |
|C005       |Eve Davis  |Eve.DAVIS@innovation.com|eve.davis@innovation.com|550.0 |2026-02-01      |
+-----------+-----------+------------------------+------------------------+------+----------------+


############################################
##### 4. Collation metadata inspection #####
############################################

crm_customers_collated schema :
StructType([StructField('email', StringType('UTF8_LCASE'), True)])

########## Lower() Function Approach (before Spark 4.x) ##########
== Physical Plan ==
AdaptiveSparkPlan (16)
+- HashAggregate (15)
   +- Exchange (14)
      +- HashAggregate (13)
         +- Project (12)
            +- SortMergeJoin Inner (11)
               :- Sort (5)
               :  +- Exchange (4)
               :     +- Project (3)
               :        +- Filter (2)
               :           +- Scan ExistingRDD (1)
               +- Sort (10)
                  +- Exchange (9)
                     +- Project (8)
                        +- Filter (7)
                           +- Scan ExistingRDD (6)


(1) Scan ExistingRDD
Output [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
Arguments: [customer_id#0, email#1, full_name#2, registration_date#3], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(2) Filter
Input [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
Condition : isnotnull(email#1)

(3) Project
Output [3]: [customer_id#0, email#1, full_name#2]
Input [4]: [customer_id#0, email#1, full_name#2, registration_date#3]

(4) Exchange
Input [3]: [customer_id#0, email#1, full_name#2]
Arguments: hashpartitioning(lower(email#1), 200), ENSURE_REQUIREMENTS, [plan_id=1080]

(5) Sort
Input [3]: [customer_id#0, email#1, full_name#2]
Arguments: [lower(email#1) ASC NULLS FIRST], false, 0

(6) Scan ExistingRDD
Output [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
Arguments: [transaction_id#25, email#26, amount#27, transaction_date#28], MapPartitionsRDD[16] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(7) Filter
Input [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
Condition : isnotnull(email#26)

(8) Project
Output [1]: [email#26]
Input [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]

(9) Exchange
Input [1]: [email#26]
Arguments: hashpartitioning(lower(email#26), 200), ENSURE_REQUIREMENTS, [plan_id=1081]

(10) Sort
Input [1]: [email#26]
Arguments: [lower(email#26) ASC NULLS FIRST], false, 0

(11) SortMergeJoin
Left keys [1]: [lower(email#1)]
Right keys [1]: [lower(email#26)]
Join type: Inner
Join condition: None

(12) Project
Output [2]: [customer_id#0, full_name#2]
Input [4]: [customer_id#0, email#1, full_name#2, email#26]

(13) HashAggregate
Input [2]: [customer_id#0, full_name#2]
Keys [2]: [customer_id#0, full_name#2]
Functions [1]: [partial_count(1)]
Aggregate Attributes [1]: [count#158L]
Results [3]: [customer_id#0, full_name#2, count#159L]

(14) Exchange
Input [3]: [customer_id#0, full_name#2, count#159L]
Arguments: hashpartitioning(customer_id#0, full_name#2, 200), ENSURE_REQUIREMENTS, [plan_id=1088]

(15) HashAggregate
Input [3]: [customer_id#0, full_name#2, count#159L]
Keys [2]: [customer_id#0, full_name#2]
Functions [1]: [count(1)]
Aggregate Attributes [1]: [count(1)#157L]
Results [3]: [customer_id#0, full_name#2, count(1)#157L AS match_count#156L]

(16) AdaptiveSparkPlan
Output [3]: [customer_id#0, full_name#2, match_count#156L]
Arguments: isFinalPlan=false


[Row(customer_id='C002', full_name='Bob Jones', match_count=1), Row(customer_id='C001', full_name='Alice Smith', match_count=1), Row(customer_id='C005', full_name='Eve Davis', match_count=1), Row(customer_id='C004', full_name='David Brown', match_count=1), Row(customer_id='C003', full_name='Carol White', match_count=1)]

########## Collation Function Approach (Spark 4.x) ##########
== Physical Plan ==
AdaptiveSparkPlan (13)
+- Sort (12)
   +- Exchange (11)
      +- Project (10)
         +- SortMergeJoin Inner (9)
            :- Sort (4)
            :  +- Exchange (3)
            :     +- Project (2)
            :        +- Scan ExistingRDD (1)
            +- Sort (8)
               +- Exchange (7)
                  +- Project (6)
                     +- Scan ExistingRDD (5)


(1) Scan ExistingRDD
Output [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
Arguments: [customer_id#0, email#1, full_name#2, registration_date#3], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(2) Project
Output [3]: [customer_id#0, email#1, full_name#2]
Input [4]: [customer_id#0, email#1, full_name#2, registration_date#3]

(3) Exchange
Input [3]: [customer_id#0, email#1, full_name#2]
Arguments: hashpartitioning(collate(email#1, UTF8_LCASE), 200), ENSURE_REQUIREMENTS, [plan_id=1275]

(4) Sort
Input [3]: [customer_id#0, email#1, full_name#2]
Arguments: [collate(email#1, UTF8_LCASE) ASC NULLS FIRST], false, 0

(5) Scan ExistingRDD
Output [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
Arguments: [transaction_id#25, email#26, amount#27, transaction_date#28], MapPartitionsRDD[16] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(6) Project
Output [3]: [email#26, amount#27, transaction_date#28]
Input [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]

(7) Exchange
Input [3]: [email#26, amount#27, transaction_date#28]
Arguments: hashpartitioning(collate(email#26, UTF8_LCASE), 200), ENSURE_REQUIREMENTS, [plan_id=1276]

(8) Sort
Input [3]: [email#26, amount#27, transaction_date#28]
Arguments: [collate(email#26, UTF8_LCASE) ASC NULLS FIRST], false, 0

(9) SortMergeJoin
Left keys [1]: [collate(email#1, UTF8_LCASE)]
Right keys [1]: [collate(email#26, UTF8_LCASE)]
Join type: Inner
Join condition: None

(10) Project
Output [6]: [customer_id#0, full_name#2, email#1 AS crm_email#77, email#26 AS txn_email#78, amount#27, transaction_date#28]
Input [6]: [customer_id#0, email#1, full_name#2, email#26, amount#27, transaction_date#28]

(11) Exchange
Input [6]: [customer_id#0, full_name#2, crm_email#77, txn_email#78, amount#27, transaction_date#28]
Arguments: rangepartitioning(customer_id#0 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=1282]

(12) Sort
Input [6]: [customer_id#0, full_name#2, crm_email#77, txn_email#78, amount#27, transaction_date#28]
Arguments: [customer_id#0 ASC NULLS FIRST], true, 0

(13) AdaptiveSparkPlan
Output [6]: [customer_id#0, full_name#2, crm_email#77, txn_email#78, amount#27, transaction_date#28]
Arguments: isFinalPlan=false



########## Collation Column Approach (Spark 4.x) ##########
== Physical Plan ==
AdaptiveSparkPlan (13)
+- Sort (12)
   +- Exchange (11)
      +- Project (10)
         +- SortMergeJoin Inner (9)
            :- Sort (4)
            :  +- Exchange (3)
            :     +- Project (2)
            :        +- Scan ExistingRDD (1)
            +- Sort (8)
               +- Exchange (7)
                  +- Project (6)
                     +- Scan ExistingRDD (5)


(1) Scan ExistingRDD
Output [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
Arguments: [customer_id#0, email#1, full_name#2, registration_date#3], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(2) Project
Output [3]: [customer_id#0, collate(email#1, UTF8_LCASE) AS email#114, full_name#2]
Input [4]: [customer_id#0, email#1, full_name#2, registration_date#3]

(3) Exchange
Input [3]: [customer_id#0, email#114, full_name#2]
Arguments: hashpartitioning(collationkey(email#114), 200), ENSURE_REQUIREMENTS, [plan_id=1313]

(4) Sort
Input [3]: [customer_id#0, email#114, full_name#2]
Arguments: [collationkey(email#114) ASC NULLS FIRST], false, 0

(5) Scan ExistingRDD
Output [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
Arguments: [transaction_id#25, email#26, amount#27, transaction_date#28], MapPartitionsRDD[16] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(6) Project
Output [3]: [collate(email#26, UTF8_LCASE) AS email#119, amount#27, transaction_date#28]
Input [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]

(7) Exchange
Input [3]: [email#119, amount#27, transaction_date#28]
Arguments: hashpartitioning(collationkey(email#119), 200), ENSURE_REQUIREMENTS, [plan_id=1314]

(8) Sort
Input [3]: [email#119, amount#27, transaction_date#28]
Arguments: [collationkey(email#119) ASC NULLS FIRST], false, 0

(9) SortMergeJoin
Left keys [1]: [collationkey(email#114)]
Right keys [1]: [collationkey(email#119)]
Join type: Inner
Join condition: None

(10) Project
Output [6]: [customer_id#0, full_name#2, email#114 AS crm_email#110, email#119 AS txn_email#111, amount#27, transaction_date#28]
Input [6]: [customer_id#0, email#114, full_name#2, email#119, amount#27, transaction_date#28]

(11) Exchange
Input [6]: [customer_id#0, full_name#2, crm_email#110, txn_email#111, amount#27, transaction_date#28]
Arguments: rangepartitioning(customer_id#0 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=1320]

(12) Sort
Input [6]: [customer_id#0, full_name#2, crm_email#110, txn_email#111, amount#27, transaction_date#28]
Arguments: [customer_id#0 ASC NULLS FIRST], true, 0

(13) AdaptiveSparkPlan
Output [6]: [customer_id#0, full_name#2, crm_email#110, txn_email#111, amount#27, transaction_date#28]
Arguments: isFinalPlan=false
Example 2 : Linguistic Sorting for International Names
This example demonstrates locale-specific collation for customer name sorting, showing the difference between binary UTF-8 sorting and German linguistic rules.
Spark
# PySpark Example - Locale Collation Sorting
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

URL_MASTER = "spark://spark-master:7077"


# Initialize Spark with collation support enabled
# spark.sql.session.collation.enabled must be true (default in 4.0+)
# spark.sql.collation.defaultCollation sets session-wide default
spark = SparkSession.builder \
    .appName("SparkCollationLocaleSorting_2") \
    .config("spark.sql.session.collation.enabled", "true") \
    .config("spark.sql.collation.defaultCollation", "UTF8_BINARY") \
    .master(URL_MASTER) \
    .getOrCreate()

print("""
##########################
##### Build Datasets #####
##########################
""")

# Source system 1: German clients from a legacy CRM
german_customers = [
    ("D001", "Müller", "Hans", "Munich", "2024-01-10"),
    ("D002", "Mueller", "Anna", "Berlin", "2024-01-15"),
    ("D003", "Möbius", "Klaus", "Hamburg", "2024-01-22"),
    ("D004", "Moeller", "Sophie", "Frankfurt", "2024-01-28"),
    ("D007", "Äpfel", "Johann", "Dresden", "2024-02-12"),
    ("D008", "Apfel", "Lisa", "Leipzig", "2024-03-15"),
]


# Create DataFrame without collation (default UTF8_BINARY)
german_df = spark.createDataFrame(german_customers, ["customer_id", "last_name", "first_name", "city", "registration_date"])
print(f"german_customers dataset: {german_df.count()}")
german_df.show(truncate=False)

# Register DataFrame as a temporary view for SQL access
german_df.createOrReplaceTempView("german_customers")


print("""
##################################################################
##### 1. Binary UTF-8 Sorting (Default Spark < 4.0 behavior) #####
##################################################################
""")
# Sorts by Unicode code point values, not linguistic rules

binary_sort = spark.sql("""
    SELECT
        customer_id,
        last_name,
        first_name,
        city
    FROM german_customers
    ORDER BY last_name
""")

# Result: Apfel, Moeller, Mueller, Möbius, Müller, Äpfel
binary_sort.show(truncate=False)



print("""
#####################################################
##### 2. German Linguistic Sorting (de_AI) #####
#####################################################
""")

# Sorts according to German dictionary rules (de_AI)
german_sort = spark.sql("""
    SELECT
        customer_id,
        last_name,
        first_name,
        city
    FROM german_customers
    ORDER BY last_name COLLATE de_AI
""")

# Result: Äpfel, Apfel, Möbius, Moeller, Mueller, Müller
german_sort.show(truncate=False)


print("""
###################################################
##### 3. Linguistic Search (Finding ü with u) #####
###################################################
""")

# Binary search: exact match only
binary_search = spark.sql("""
    SELECT *
    FROM german_customers
    WHERE last_name COLLATE UTF8_BINARY = 'Muller'
""")
print("Searching for 'Muller' with binary collation:")
print(f"Matched records: {binary_search.count()}")  # Result: 0 matches
binary_search.show(truncate=False)

# Accent-insensitive German search: 'ü' matches 'u'
german_search = spark.sql("""
    SELECT *
    FROM german_customers
    WHERE last_name COLLATE de_AI = 'Muller'
""")
print("Searching for 'Muller' with 'de' collation:")
print(f"Matched records: {german_search.count()}")  # Result: 1 match
german_search.show(truncate=False)

spark.stop()
Result
##########################
##### Build Datasets #####
##########################

german_customers dataset: 6
+-----------+---------+----------+---------+-----------------+
|customer_id|last_name|first_name|city     |registration_date|
+-----------+---------+----------+---------+-----------------+
|D001       |Müller   |Hans      |Munich   |2024-01-10       |
|D002       |Mueller  |Anna      |Berlin   |2024-01-15       |
|D003       |Möbius   |Klaus     |Hamburg  |2024-01-22       |
|D004       |Moeller  |Sophie    |Frankfurt|2024-01-28       |
|D007       |Äpfel    |Johann    |Dresden  |2024-02-12       |
|D008       |Apfel    |Lisa      |Leipzig  |2024-03-15       |
+-----------+---------+----------+---------+-----------------+


##################################################################
##### 1. Binary UTF-8 Sorting (Default Spark < 4.0 behavior) #####
##################################################################

+-----------+---------+----------+---------+
|customer_id|last_name|first_name|city     |
+-----------+---------+----------+---------+
|D008       |Apfel    |Lisa      |Leipzig  |
|D004       |Moeller  |Sophie    |Frankfurt|
|D002       |Mueller  |Anna      |Berlin   |
|D003       |Möbius   |Klaus     |Hamburg  |
|D001       |Müller   |Hans      |Munich   |
|D007       |Äpfel    |Johann    |Dresden  |
+-----------+---------+----------+---------+


#####################################################
##### 2. German Linguistic Sorting (de_AI) #####
#####################################################

+-----------+---------+----------+---------+
|customer_id|last_name|first_name|city     |
+-----------+---------+----------+---------+
|D007       |Äpfel    |Johann    |Dresden  |
|D008       |Apfel    |Lisa      |Leipzig  |
|D003       |Möbius   |Klaus     |Hamburg  |
|D004       |Moeller  |Sophie    |Frankfurt|
|D002       |Mueller  |Anna      |Berlin   |
|D001       |Müller   |Hans      |Munich   |
+-----------+---------+----------+---------+


###################################################
##### 3. Linguistic Search (Finding ü with u) #####
###################################################

Searching for 'Muller' with binary collation:
Matched records: 0
+-----------+---------+----------+----+-----------------+
|customer_id|last_name|first_name|city|registration_date|
+-----------+---------+----------+----+-----------------+
+-----------+---------+----------+----+-----------------+

Searching for 'Muller' with 'de' collation:
Matched records: 1
+-----------+---------+----------+------+-----------------+
|customer_id|last_name|first_name|city  |registration_date|
+-----------+---------+----------+------+-----------------+
|D001       |Müller   |Hans      |Munich|2024-01-10       |
+-----------+---------+----------+------+-----------------+
