Spark : v4.x - Features - SQL Collation Support

This article presents the string SQL collation support feature introduced in Spark v4.x.

Introduction

Collation controls how Spark compares and sorts string data. Spark 4.0 introduces explicit collation specification, enabling linguistic sorting and case-insensitive operations without custom UDFs.

Default behavior remains UTF-8 binary comparison for backward compatibility.

Critical for multi-locale analytics, data quality rules, and migrations from traditional databases. Adds query planning overhead and can cause performance regression on large string operations.

Detail

Collation defines the rules for comparing string values. Traditional databases support locale-specific collations (e.g., en_CI or de_AI) that handle language-specific sorting rules and case sensitivity. Before Spark 4.0, all string operations used UTF-8 binary comparison: byte-by-byte matching with no linguistic awareness.

UTF-8 binary collation sorts by Unicode code point values. This means ‘Z’ (U+005A) sorts before ‘a’ (U+0061) because uppercase letters have lower code points. Accented characters sort at the end of the alphabet rather than near their base letters. Case-insensitive operations required lower() function calls, which create new strings and increase memory usage.
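The binary ordering is easy to reproduce outside Spark: Python's default string comparison is also code-point based, so a plain-Python sketch (no Spark involved) shows the same surprising order.

```python
# Code-point ordering: uppercase sorts before lowercase, accents sort last
names = ["apple", "Zebra", "Éclair"]
print(sorted(names))                 # ['Zebra', 'apple', 'Éclair']
print(ord("Z"), ord("a"), ord("É"))  # 90 97 201

# Pre-4.0 case-insensitive matching: lower both sides, allocating new strings
print("Alice.Smith@company.com".lower() == "alice.smith@company.com")  # True
```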

Spark 4.0 adds the COLLATE clause and the spark.sql.session.collation.enabled configuration. You can now specify collation at column definition, expression level, or as a session default. The implementation leverages the ICU (International Components for Unicode) library for linguistic rules.

Three primary collation families exist in Spark 4.0:

  • UTF8_BINARY: Default byte-by-byte comparison (backward compatible)
  • UTF8_LCASE: Case-insensitive UTF-8 comparison (most common use case)
  • UNICODE: Linguistic collation with locale-specific rules (e.g., UNICODE_DE for German)
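As a rough mental model (a plain-Python sketch with hypothetical helper names; in Spark the actual comparisons are evaluated by ICU inside the engine):

```python
# Approximate equality semantics of the collation families (illustration only)
def eq_utf8_binary(a: str, b: str) -> bool:
    return a == b                        # exact code-point match

def eq_utf8_lcase(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()  # case-insensitive, accent-sensitive

print(eq_utf8_binary("Mueller", "mueller"))  # False
print(eq_utf8_lcase("Mueller", "mueller"))   # True
print(eq_utf8_lcase("Müller", "Muller"))     # False: accents still differ
# UNICODE collations additionally apply locale rules, which plain Python
# cannot reproduce without an ICU binding (e.g. PyICU).
```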

The collation system integrates with the Catalyst optimizer. Spark analyzes collation requirements during query planning and injects appropriate comparison logic. This adds planning overhead but enables vectorized execution for collation-aware operations rather than falling back to row-by-row processing.

Key Collations :

  • UTF8_BINARY : Historical default. Byte-by-byte binary comparison.
  • UTF8_LCASE : Case-insensitive, accent-sensitive.
  • _CI : Case-insensitive modifier.
  • _CS : Case-sensitive modifier.
  • _AI : Accent-insensitive modifier.
  • _AS : Accent-sensitive modifier.
  • fr : French linguistic order.
  • en : American English linguistic order.
  • de : German linguistic order.

Collation SQL functions :

  • Collate : Marks a given expression with the specified collation.
  • Collation : Returns the collation name of a given expression.
  • Collations : Returns all string collations supported by Spark SQL.

Advantages

  1. Internationalization support : Applications serving multiple regions can apply locale-specific sorting. This improves data quality for customer-facing reports and regulatory compliance.
  2. Code simplification : Eliminates custom UDFs for case-insensitive operations. With UTF8_LCASE collation, comparison happens directly without applying lower(col) function.
  3. Database Migration Compatibility : Alignment with traditional databases (PostgreSQL, MySQL, SQL Server, …) that have supported collations for a long time. Teams migrating from these systems can now preserve collation semantics without query rewrites, which reduces migration risk and testing burden.

Limitations

  1. Performance Overhead on Non-Binary Collations : Linguistic collation requires ICU library calls that are slower than binary comparison. The overhead increases with string length and complexity of collation rules.
  2. Collation Mixing Restrictions : Spark prohibits operations between columns with different collations without explicit COLLATE casting. This breaks existing queries that implicitly compare string columns if collations differ. Joins between different collations require an explicit cast, which can trigger re-shuffling.
  3. Very Limited Collation Family Support : Spark 4 supports only a limited set of collations; check that the collation you need is available in your Spark version. As of Spark 4.0.1, only the UTF8 family plus selected locale-specific collations are supported.
  4. Limited Collation Functionality Support : Not all collations are supported for partitioned columns, and there is no collation inference from data sources (you must specify the collation manually in the Spark object definition).

When not to use Collation :

  • For performance-critical joins.
  • For partition keys.

Real-World Use Cases

  • Use Case 1 : Financial Institution Customer Matching
    • A bank merges customer records from acquired institutions. Customer names contain accented characters (José, François, Müller). Deduplication logic needs to match “Muller” with “Müller”. The accent-insensitive German collation de_AI provides this linguistic equivalence without custom normalization functions.
  • Use Case 2 : E-Commerce Product Catalog with Locale-Specific Sorting
    • An international retailer displays product listings sorted alphabetically by name. Spanish customers expect “ñ” to sort between “n” and “o”, while default UTF-8 binary sorts it at the end. Different collations per region enable correct sort order without maintaining separate datasets.
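The ñ behaviour from Use Case 2 can be checked with a plain-Python sketch of binary ordering (the locale-aware order itself would need an ICU-backed collator, such as a Spanish collation in Spark):

```python
products = ["mano", "nata", "ñandú", "oso"]
# Binary / code-point order: 'ñ' (U+00F1) sorts after every ASCII letter
print(sorted(products))  # ['mano', 'nata', 'oso', 'ñandú']
# A Spanish linguistic collation would instead keep 'ñ' between 'n' and 'o':
#   ['mano', 'nata', 'ñandú', 'oso']
```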

Codes

Example 1 : Case-Insensitive User Matching with UTF8_LCASE

This example demonstrates user record matching where email addresses may have inconsistent capitalization. Common scenario in data quality pipelines merging records from multiple source systems.

Spark

# PySpark Example - Case-Insensitive User Matching
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

URL_MASTER = "spark://spark-master:7077"


# Initialize Spark with collation support enabled
# spark.sql.session.collation.enabled must be true (default in 4.0+)
# spark.sql.collation.defaultCollation sets the session-wide default
spark = SparkSession.builder \
    .appName("SparkCollationCaseInsensitive_1") \
    .config("spark.sql.session.collation.enabled", "true") \
    .config("spark.sql.collation.defaultCollation", "UTF8_BINARY") \
    .master(URL_MASTER) \
    .getOrCreate()

print("""
##########################
##### Build Datasets #####
##########################
""")

# Source system 1: Customer records from CRM
# Emails stored with original capitalization from user input
crm_customers = [
    ("C001", "Alice.Smith@company.com", "Alice Smith", "2021-01-15"),
    ("C002", "bob.jones@enterprise.net", "Bob Jones", "2025-02-20"),
    ("C003", "Carol.WHITE@startup.io", "Carol White", "2024-03-10"),
    ("C004", "david.brown@tech.com", "David Brown", "2024-04-05"),
    ("C005", "Eve.DAVIS@innovation.com", "Eve Davis", "2025-05-12"),
]

# Source system 2: Transaction records from payment processor
# Emails normalized to lowercase by payment gateway
transaction_emails = [
    ("T001", "alice.smith@company.com", 450.00, "2026-01-10"),
    ("T002", "bob.jones@enterprise.net", 1200.00, "2026-01-15"),
    ("T003", "carol.white@startup.io", 300.00, "2026-01-20"),
    ("T004", "DAVID.BROWN@tech.com", 890.00, "2026-01-25"),
    ("T005", "eve.davis@innovation.com", 550.00, "2026-02-01"),
    ("T006", "frank.miller@unknown.com", 200.00, "2026-02-05"),
]


# Create DataFrames without collation (default UTF8_BINARY)
crm_df = spark.createDataFrame(crm_customers, ["customer_id", "email", "full_name", "registration_date"])
print(f"crm_customers dataset: {crm_df.count()}")
crm_df.show(truncate=False)

transactions_df = spark.createDataFrame(transaction_emails, ["transaction_id", "email", "amount", "transaction_date"])
print(f"transaction_emails dataset: {transactions_df.count()}")
transactions_df.show(truncate=False)


# Register DataFrames as temporary views for SQL access
crm_df.createOrReplaceTempView("crm_customers")
transactions_df.createOrReplaceTempView("transactions")


print("""
###################################
##### 1. Standard binary join #####
###################################
""")

# Standard binary join - will MISS matches due to case differences
print("=== Binary Collation Join (Default Behavior) ===")
binary_join = crm_df.join(transactions_df, crm_df.email == transactions_df.email, "inner") \
                    .select(
                            crm_df.customer_id,
                            crm_df.full_name,
                            crm_df.email.alias("crm_email"),
                            transactions_df.email.alias("txn_email"),
                            transactions_df.amount
                    )
print(f"Matched records: {binary_join.count()}")  # Will show only 1 match
binary_join.show(truncate=False)



print("""
##############################################################
##### 2. Case-Insensitive Join with UTF8_LCASE Collation #####
##############################################################
""")
# Apply collation at expression level using the SQL COLLATE syntax

# SQL approach: apply collation directly in the join condition
case_insensitive_sql = spark.sql("""
    SELECT
        c.customer_id,
        c.full_name,
        c.email as crm_email,
        t.email as txn_email,
        t.amount,
        t.transaction_date
    FROM crm_customers c
    INNER JOIN transactions t
        ON c.email COLLATE UTF8_LCASE = t.email COLLATE UTF8_LCASE
    ORDER BY c.customer_id
""")

print(f"Matched records: {case_insensitive_sql.count()}")  # Will show 5 matches
case_insensitive_sql.show(truncate=False)



print("""
##########################################
##### 3. Collation Defined in Schema #####
##########################################
""")
# Define collation at view level

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW crm_customers_collated AS
    SELECT
        customer_id,
        email COLLATE UTF8_LCASE as email,
        full_name,
        registration_date
    FROM crm_customers
""")

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW transactions_collated AS
    SELECT
        transaction_id,
        email COLLATE UTF8_LCASE as email,
        amount,
        transaction_date
    FROM transactions
""")

# Join without explicit COLLATE - the collation is inherited from the schema
schema_collation_join = spark.sql("""
    SELECT
        c.customer_id,
        c.full_name,
        c.email as crm_email,
        t.email as txn_email,
        t.amount,
        t.transaction_date
    FROM crm_customers_collated c
    INNER JOIN transactions_collated t ON c.email = t.email
    ORDER BY c.customer_id
""")

print(f"Matched records: {schema_collation_join.count()}")  # Shows 5 matches
schema_collation_join.show(truncate=False)



print("""
############################################
##### 4. Collation metadata inspection #####
############################################
""")

print("crm_customers_collated schema :")
print(spark.sql("SELECT email FROM crm_customers_collated").schema)

# Performance comparison: show that the collation join avoids string transformation
# Binary case-insensitive matching requires LOWER(), which allocates new strings
lower_join = spark.sql("""
    SELECT
        c.customer_id,
        c.full_name,
        COUNT(*) as match_count
    FROM crm_customers c
    INNER JOIN transactions t
        ON LOWER(c.email) = LOWER(t.email)
    GROUP BY c.customer_id, c.full_name
""")

print("\n########## Lower() Function Approach (before Spark 4.x) ##########")
lower_join.explain("formatted")  # Join keys are lower() calls
print(lower_join.collect())

print("\n########## Collation Function Approach (Spark 4.x) ##########")
case_insensitive_sql.explain("formatted")  # Join keys are collate() markers, no lower()

print("\n########## Collation Column Approach (Spark 4.x) ##########")
schema_collation_join.explain("formatted")  # Join keys use collationkey(), no lower()

spark.stop()

Result

  1
  2##########################
  3##### Build Datasets #####
  4##########################
  5
  6crm_customers dataset: 5
  7+-----------+------------------------+-----------+-----------------+
  8|customer_id|email                   |full_name  |registration_date|
  9+-----------+------------------------+-----------+-----------------+
 10|C001       |Alice.Smith@company.com |Alice Smith|2021-01-15       |
 11|C002       |bob.jones@enterprise.net|Bob Jones  |2025-02-20       |
 12|C003       |Carol.WHITE@startup.io  |Carol White|2024-03-10       |
 13|C004       |david.brown@tech.com    |David Brown|2024-04-05       |
 14|C005       |Eve.DAVIS@innovation.com|Eve Davis  |2025-05-12       |
 15+-----------+------------------------+-----------+-----------------+
 16
 17transaction_emails dataset: 6
 18+--------------+------------------------+------+----------------+
 19|transaction_id|email                   |amount|transaction_date|
 20+--------------+------------------------+------+----------------+
 21|T001          |alice.smith@company.com |450.0 |2026-01-10      |
 22|T002          |bob.jones@enterprise.net|1200.0|2026-01-15      |
 23|T003          |carol.white@startup.io  |300.0 |2026-01-20      |
 24|T004          |DAVID.BROWN@tech.com    |890.0 |2026-01-25      |
 25|T005          |eve.davis@innovation.com|550.0 |2026-02-01      |
 26|T006          |frank.miller@unknown.com|200.0 |2026-02-05      |
 27+--------------+------------------------+------+----------------+
 28
 29
 30###################################
 31##### 1. Standard binary join #####
 32###################################
 33
 34=== Binary Collation Join (Default Behavior) ===
 35Matched records: 1
 36+-----------+---------+------------------------+------------------------+------+
 37|customer_id|full_name|crm_email               |txn_email               |amount|
 38+-----------+---------+------------------------+------------------------+------+
 39|C002       |Bob Jones|bob.jones@enterprise.net|bob.jones@enterprise.net|1200.0|
 40+-----------+---------+------------------------+------------------------+------+
 41
 42
 43##############################################################
 44##### 2. Case-Insensitive Join with UTF8_LCASE Collation #####
 45##############################################################
 46
 47Matched records: 5
 48+-----------+-----------+------------------------+------------------------+------+----------------+
 49|customer_id|full_name  |crm_email               |txn_email               |amount|transaction_date|
 50+-----------+-----------+------------------------+------------------------+------+----------------+
 51|C001       |Alice Smith|Alice.Smith@company.com |alice.smith@company.com |450.0 |2026-01-10      |
 52|C002       |Bob Jones  |bob.jones@enterprise.net|bob.jones@enterprise.net|1200.0|2026-01-15      |
 53|C003       |Carol White|Carol.WHITE@startup.io  |carol.white@startup.io  |300.0 |2026-01-20      |
 54|C004       |David Brown|david.brown@tech.com    |DAVID.BROWN@tech.com    |890.0 |2026-01-25      |
 55|C005       |Eve Davis  |Eve.DAVIS@innovation.com|eve.davis@innovation.com|550.0 |2026-02-01      |
 56+-----------+-----------+------------------------+------------------------+------+----------------+
 57
 58
 59##########################################
 60##### 3. Collation Defined in Schema #####
 61##########################################
 62
 63Matched records: 5
 64+-----------+-----------+------------------------+------------------------+------+----------------+
 65|customer_id|full_name  |crm_email               |txn_email               |amount|transaction_date|
 66+-----------+-----------+------------------------+------------------------+------+----------------+
 67|C001       |Alice Smith|Alice.Smith@company.com |alice.smith@company.com |450.0 |2026-01-10      |
 68|C002       |Bob Jones  |bob.jones@enterprise.net|bob.jones@enterprise.net|1200.0|2026-01-15      |
 69|C003       |Carol White|Carol.WHITE@startup.io  |carol.white@startup.io  |300.0 |2026-01-20      |
 70|C004       |David Brown|david.brown@tech.com    |DAVID.BROWN@tech.com    |890.0 |2026-01-25      |
 71|C005       |Eve Davis  |Eve.DAVIS@innovation.com|eve.davis@innovation.com|550.0 |2026-02-01      |
 72+-----------+-----------+------------------------+------------------------+------+----------------+
 73
 74
 75############################################
 76##### 4. Collation metadata inspection #####
 77############################################
 78
 79crm_customers_collated schema :
 80StructType([StructField('email', StringType('UTF8_LCASE'), True)])
 81
 82########## Lower() Function Approach (before Spark 4.x) ##########
 83== Physical Plan ==
 84AdaptiveSparkPlan (16)
 85+- HashAggregate (15)
 86   +- Exchange (14)
 87      +- HashAggregate (13)
 88         +- Project (12)
 89            +- SortMergeJoin Inner (11)
 90               :- Sort (5)
 91               :  +- Exchange (4)
 92               :     +- Project (3)
 93               :        +- Filter (2)
 94               :           +- Scan ExistingRDD (1)
 95               +- Sort (10)
 96                  +- Exchange (9)
 97                     +- Project (8)
 98                        +- Filter (7)
 99                           +- Scan ExistingRDD (6)
100
101
102(1) Scan ExistingRDD
103Output [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
104Arguments: [customer_id#0, email#1, full_name#2, registration_date#3], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
105
106(2) Filter
107Input [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
108Condition : isnotnull(email#1)
109
110(3) Project
111Output [3]: [customer_id#0, email#1, full_name#2]
112Input [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
113
114(4) Exchange
115Input [3]: [customer_id#0, email#1, full_name#2]
116Arguments: hashpartitioning(lower(email#1), 200), ENSURE_REQUIREMENTS, [plan_id=1080]
117
118(5) Sort
119Input [3]: [customer_id#0, email#1, full_name#2]
120Arguments: [lower(email#1) ASC NULLS FIRST], false, 0
121
122(6) Scan ExistingRDD
123Output [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
124Arguments: [transaction_id#25, email#26, amount#27, transaction_date#28], MapPartitionsRDD[16] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
125
126(7) Filter
127Input [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
128Condition : isnotnull(email#26)
129
130(8) Project
131Output [1]: [email#26]
132Input [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
133
134(9) Exchange
135Input [1]: [email#26]
136Arguments: hashpartitioning(lower(email#26), 200), ENSURE_REQUIREMENTS, [plan_id=1081]
137
138(10) Sort
139Input [1]: [email#26]
140Arguments: [lower(email#26) ASC NULLS FIRST], false, 0
141
142(11) SortMergeJoin
143Left keys [1]: [lower(email#1)]
144Right keys [1]: [lower(email#26)]
145Join type: Inner
146Join condition: None
147
148(12) Project
149Output [2]: [customer_id#0, full_name#2]
150Input [4]: [customer_id#0, email#1, full_name#2, email#26]
151
152(13) HashAggregate
153Input [2]: [customer_id#0, full_name#2]
154Keys [2]: [customer_id#0, full_name#2]
155Functions [1]: [partial_count(1)]
156Aggregate Attributes [1]: [count#158L]
157Results [3]: [customer_id#0, full_name#2, count#159L]
158
159(14) Exchange
160Input [3]: [customer_id#0, full_name#2, count#159L]
161Arguments: hashpartitioning(customer_id#0, full_name#2, 200), ENSURE_REQUIREMENTS, [plan_id=1088]
162
163(15) HashAggregate
164Input [3]: [customer_id#0, full_name#2, count#159L]
165Keys [2]: [customer_id#0, full_name#2]
166Functions [1]: [count(1)]
167Aggregate Attributes [1]: [count(1)#157L]
168Results [3]: [customer_id#0, full_name#2, count(1)#157L AS match_count#156L]
169
170(16) AdaptiveSparkPlan
171Output [3]: [customer_id#0, full_name#2, match_count#156L]
172Arguments: isFinalPlan=false
173
174
175[Row(customer_id='C002', full_name='Bob Jones', match_count=1), Row(customer_id='C001', full_name='Alice Smith', match_count=1), Row(customer_id='C005', full_name='Eve Davis', match_count=1), Row(customer_id='C004', full_name='David Brown', match_count=1), Row(customer_id='C003', full_name='Carol White', match_count=1)]
176
177########## Collation Function Approach (Spark 4.x) ##########
178== Physical Plan ==
179AdaptiveSparkPlan (13)
180+- Sort (12)
181   +- Exchange (11)
182      +- Project (10)
183         +- SortMergeJoin Inner (9)
184            :- Sort (4)
185            :  +- Exchange (3)
186            :     +- Project (2)
187            :        +- Scan ExistingRDD (1)
188            +- Sort (8)
189               +- Exchange (7)
190                  +- Project (6)
191                     +- Scan ExistingRDD (5)
192
193
194(1) Scan ExistingRDD
195Output [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
196Arguments: [customer_id#0, email#1, full_name#2, registration_date#3], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
197
198(2) Project
199Output [3]: [customer_id#0, email#1, full_name#2]
200Input [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
201
202(3) Exchange
203Input [3]: [customer_id#0, email#1, full_name#2]
204Arguments: hashpartitioning(collate(email#1, UTF8_LCASE), 200), ENSURE_REQUIREMENTS, [plan_id=1275]
205
206(4) Sort
207Input [3]: [customer_id#0, email#1, full_name#2]
208Arguments: [collate(email#1, UTF8_LCASE) ASC NULLS FIRST], false, 0
209
210(5) Scan ExistingRDD
211Output [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
212Arguments: [transaction_id#25, email#26, amount#27, transaction_date#28], MapPartitionsRDD[16] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
213
214(6) Project
215Output [3]: [email#26, amount#27, transaction_date#28]
216Input [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
217
218(7) Exchange
219Input [3]: [email#26, amount#27, transaction_date#28]
220Arguments: hashpartitioning(collate(email#26, UTF8_LCASE), 200), ENSURE_REQUIREMENTS, [plan_id=1276]
221
222(8) Sort
223Input [3]: [email#26, amount#27, transaction_date#28]
224Arguments: [collate(email#26, UTF8_LCASE) ASC NULLS FIRST], false, 0
225
226(9) SortMergeJoin
227Left keys [1]: [collate(email#1, UTF8_LCASE)]
228Right keys [1]: [collate(email#26, UTF8_LCASE)]
229Join type: Inner
230Join condition: None
231
232(10) Project
233Output [6]: [customer_id#0, full_name#2, email#1 AS crm_email#77, email#26 AS txn_email#78, amount#27, transaction_date#28]
234Input [6]: [customer_id#0, email#1, full_name#2, email#26, amount#27, transaction_date#28]
235
236(11) Exchange
237Input [6]: [customer_id#0, full_name#2, crm_email#77, txn_email#78, amount#27, transaction_date#28]
238Arguments: rangepartitioning(customer_id#0 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=1282]
239
240(12) Sort
241Input [6]: [customer_id#0, full_name#2, crm_email#77, txn_email#78, amount#27, transaction_date#28]
242Arguments: [customer_id#0 ASC NULLS FIRST], true, 0
243
244(13) AdaptiveSparkPlan
245Output [6]: [customer_id#0, full_name#2, crm_email#77, txn_email#78, amount#27, transaction_date#28]
246Arguments: isFinalPlan=false
247
248
249
250########## Collation Column Approach (Spark 4.x) ##########
251== Physical Plan ==
252AdaptiveSparkPlan (13)
253+- Sort (12)
254   +- Exchange (11)
255      +- Project (10)
256         +- SortMergeJoin Inner (9)
257            :- Sort (4)
258            :  +- Exchange (3)
259            :     +- Project (2)
260            :        +- Scan ExistingRDD (1)
261            +- Sort (8)
262               +- Exchange (7)
263                  +- Project (6)
264                     +- Scan ExistingRDD (5)
265
266
267(1) Scan ExistingRDD
268Output [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
269Arguments: [customer_id#0, email#1, full_name#2, registration_date#3], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
270
271(2) Project
272Output [3]: [customer_id#0, collate(email#1, UTF8_LCASE) AS email#114, full_name#2]
273Input [4]: [customer_id#0, email#1, full_name#2, registration_date#3]
274
275(3) Exchange
276Input [3]: [customer_id#0, email#114, full_name#2]
277Arguments: hashpartitioning(collationkey(email#114), 200), ENSURE_REQUIREMENTS, [plan_id=1313]
278
279(4) Sort
280Input [3]: [customer_id#0, email#114, full_name#2]
281Arguments: [collationkey(email#114) ASC NULLS FIRST], false, 0
282
283(5) Scan ExistingRDD
284Output [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
285Arguments: [transaction_id#25, email#26, amount#27, transaction_date#28], MapPartitionsRDD[16] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
286
287(6) Project
288Output [3]: [collate(email#26, UTF8_LCASE) AS email#119, amount#27, transaction_date#28]
289Input [4]: [transaction_id#25, email#26, amount#27, transaction_date#28]
290
291(7) Exchange
292Input [3]: [email#119, amount#27, transaction_date#28]
293Arguments: hashpartitioning(collationkey(email#119), 200), ENSURE_REQUIREMENTS, [plan_id=1314]
294
295(8) Sort
296Input [3]: [email#119, amount#27, transaction_date#28]
297Arguments: [collationkey(email#119) ASC NULLS FIRST], false, 0
298
299(9) SortMergeJoin
300Left keys [1]: [collationkey(email#114)]
301Right keys [1]: [collationkey(email#119)]
302Join type: Inner
303Join condition: None
304
305(10) Project
306Output [6]: [customer_id#0, full_name#2, email#114 AS crm_email#110, email#119 AS txn_email#111, amount#27, transaction_date#28]
307Input [6]: [customer_id#0, email#114, full_name#2, email#119, amount#27, transaction_date#28]
308
309(11) Exchange
310Input [6]: [customer_id#0, full_name#2, crm_email#110, txn_email#111, amount#27, transaction_date#28]
311Arguments: rangepartitioning(customer_id#0 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=1320]
312
313(12) Sort
314Input [6]: [customer_id#0, full_name#2, crm_email#110, txn_email#111, amount#27, transaction_date#28]
315Arguments: [customer_id#0 ASC NULLS FIRST], true, 0
316
317(13) AdaptiveSparkPlan
318Output [6]: [customer_id#0, full_name#2, crm_email#110, txn_email#111, amount#27, transaction_date#28]
319Arguments: isFinalPlan=false

Example 2 : Linguistic Sorting for International Names

This example demonstrates locale-specific collation for customer name sorting and shows the difference between binary UTF-8 sorting and German linguistic rules.

Spark

# PySpark Example - Locale Collation Sorting
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

URL_MASTER = "spark://spark-master:7077"


# Initialize Spark with collation support enabled
# spark.sql.session.collation.enabled must be true (default in 4.0+)
# spark.sql.collation.defaultCollation sets the session-wide default
spark = SparkSession.builder \
    .appName("SparkCollationLocaleSorting_2") \
    .config("spark.sql.session.collation.enabled", "true") \
    .config("spark.sql.collation.defaultCollation", "UTF8_BINARY") \
    .master(URL_MASTER) \
    .getOrCreate()

print("""
##########################
##### Build Datasets #####
##########################
""")

# Source system: German clients from a legacy CRM
german_customers = [
    ("D001", "Müller", "Hans", "Munich", "2024-01-10"),
    ("D002", "Mueller", "Anna", "Berlin", "2024-01-15"),
    ("D003", "Möbius", "Klaus", "Hamburg", "2024-01-22"),
    ("D004", "Moeller", "Sophie", "Frankfurt", "2024-01-28"),
    ("D007", "Äpfel", "Johann", "Dresden", "2024-02-12"),
    ("D008", "Apfel", "Lisa", "Leipzig", "2024-03-15"),
]


# Create a DataFrame without collation (default UTF8_BINARY)
german_df = spark.createDataFrame(german_customers, ["customer_id", "last_name", "first_name", "city", "registration_date"])
print(f"german_customers dataset: {german_df.count()}")
german_df.show(truncate=False)

# Register the DataFrame as a temporary view for SQL access
german_df.createOrReplaceTempView("german_customers")


print("""
##################################################################
##### 1. Binary UTF-8 Sorting (Default Spark < 4.0 behavior) #####
##################################################################
""")
# Sorts by Unicode code point values, not linguistic rules

binary_sort = spark.sql("""
    SELECT
        customer_id,
        last_name,
        first_name,
        city
    FROM german_customers
    ORDER BY last_name
""")

# Result: Apfel, Moeller, Mueller, Möbius, Müller, Äpfel
binary_sort.show(truncate=False)



print("""
#####################################################
##### 2. German Linguistic Sorting (de_AI) #####
#####################################################
""")

# Sorts according to German accent-insensitive rules (de_AI)
german_sort = spark.sql("""
    SELECT
        customer_id,
        last_name,
        first_name,
        city
    FROM german_customers
    ORDER BY last_name COLLATE de_AI
""")

# Result: Äpfel, Apfel, Möbius, Moeller, Mueller, Müller
german_sort.show(truncate=False)


print("""
###################################################
##### 3. Linguistic Search (Finding ü with u) #####
###################################################
""")

# Binary search: exact match only
binary_search = spark.sql("""
    SELECT *
    FROM german_customers
    WHERE last_name COLLATE UTF8_BINARY = 'Muller'
""")
print("Searching for 'Muller' with binary collation:")
print(f"Matched records: {binary_search.count()}")  # Result: 0 matches
binary_search.show(truncate=False)

# German accent-insensitive search: ü and u compare as equal
german_search = spark.sql("""
    SELECT *
    FROM german_customers
    WHERE last_name COLLATE de_AI = 'Muller'
""")
print("Searching for 'Muller' with 'de' collation:")
print(f"Matched records: {german_search.count()}")  # Result: 1 match
german_search.show(truncate=False)

spark.stop()

Result

 1##########################
 2##### Build Datasets #####
 3##########################
 4
 5german_customers dataset: 6
 6+-----------+---------+----------+---------+-----------------+
 7|customer_id|last_name|first_name|city     |registration_date|
 8+-----------+---------+----------+---------+-----------------+
 9|D001       |Müller   |Hans      |Munich   |2024-01-10       |
10|D002       |Mueller  |Anna      |Berlin   |2024-01-15       |
11|D003       |Möbius   |Klaus     |Hamburg  |2024-01-22       |
12|D004       |Moeller  |Sophie    |Frankfurt|2024-01-28       |
13|D007       |Äpfel    |Johann    |Dresden  |2024-02-12       |
14|D008       |Apfel    |Lisa      |Leipzig  |2024-03-15       |
15+-----------+---------+----------+---------+-----------------+
16
17
18##################################################################
19##### 1. Binary UTF-8 Sorting (Default Spark < 4.0 behavior) #####
20##################################################################
21
22+-----------+---------+----------+---------+
23|customer_id|last_name|first_name|city     |
24+-----------+---------+----------+---------+
25|D008       |Apfel    |Lisa      |Leipzig  |
26|D004       |Moeller  |Sophie    |Frankfurt|
27|D002       |Mueller  |Anna      |Berlin   |
28|D003       |Möbius   |Klaus     |Hamburg  |
29|D001       |Müller   |Hans      |Munich   |
30|D007       |Äpfel    |Johann    |Dresden  |
31+-----------+---------+----------+---------+
32
33
34#####################################################
35##### 2. German Linguistic Sorting (de_AI) #####
36#####################################################
37
38+-----------+---------+----------+---------+
39|customer_id|last_name|first_name|city     |
40+-----------+---------+----------+---------+
41|D007       |Äpfel    |Johann    |Dresden  |
42|D008       |Apfel    |Lisa      |Leipzig  |
43|D003       |Möbius   |Klaus     |Hamburg  |
44|D004       |Moeller  |Sophie    |Frankfurt|
45|D002       |Mueller  |Anna      |Berlin   |
46|D001       |Müller   |Hans      |Munich   |
47+-----------+---------+----------+---------+
48
49
50###################################################
51##### 3. Linguistic Search (Finding ü with u) #####
52###################################################
53
54Searching for 'Muller' with binary collation:
55Matched records: 0
56+-----------+---------+----------+----+-----------------+
57|customer_id|last_name|first_name|city|registration_date|
58+-----------+---------+----------+----+-----------------+
59+-----------+---------+----------+----+-----------------+
60
61Searching for 'Muller' with 'de' collation:
62Matched records: 1
63+-----------+---------+----------+------+-----------------+
64|customer_id|last_name|first_name|city  |registration_date|
65+-----------+---------+----------+------+-----------------+
66|D001       |Müller   |Hans      |Munich|2024-01-10       |
67+-----------+---------+----------+------+-----------------+