Exploding Arrays and Structs in Apache Spark

In real-world data processing, it’s common to encounter JSON files where the data is deeply nested. These nested JSONs are difficult to query or analyze directly unless they are flattened or transformed into tabular formats.
A nested JSON is a JSON structure where values themselves can be other JSON objects or arrays. This results in hierarchical or tree-like data instead of flat rows and columns.
Imagine an e-commerce platform that logs every customer’s order in JSON format. Here's a small sample:
{
  "order_id": "1234",
  "customer": {
    "name": "Alice",
    "address": {
      "city": "Mumbai",
      "zip": "400001"
    }
  },
  "items": [
    {"product": "Laptop", "price": 70000},
    {"product": "Mouse", "price": 1500}
  ]
}
This JSON is nested at multiple levels:
- customer is an object
- address is nested within customer
- items is an array of objects

Why can't we just use this JSON directly as a flat table in Spark?

Because Spark SQL and DataFrame APIs expect rows and columns. A nested structure needs to be exploded or flattened so that each field is directly accessible for analysis or filtering.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NestedJSON").getOrCreate()

# Load a sample nested JSON; multiLine=True lets Spark parse a
# pretty-printed file like the one above, instead of expecting
# one JSON object per line (the JSON Lines default)
df = spark.read.json("nested_orders.json", multiLine=True)

df.printSchema()
df.show(truncate=False)
The schema will show nested structs and arrays. This helps identify what needs to be flattened.
root
 |-- customer: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- zip: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: long (nullable = true)
 |    |    |-- product: string (nullable = true)
 |-- order_id: string (nullable = true)
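Schema inference works well for small samples, but on large files it costs an extra pass over the data and can guess types inconsistently. If you already know the layout, you can supply an explicit schema instead. Here is a minimal sketch mirroring the printSchema() output above (the field names match this example file; adapt them to your data):

from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, ArrayType
)

# Hand-written schema mirroring the nested layout printed above
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StructType([
        StructField("name", StringType()),
        StructField("address", StructType([
            StructField("city", StringType()),
            StructField("zip", StringType()),
        ])),
    ])),
    StructField("items", ArrayType(StructType([
        StructField("product", StringType()),
        StructField("price", LongType()),
    ]))),
])

# Skip inference entirely by passing the schema to the reader
df = spark.read.json("nested_orders.json", schema=order_schema, multiLine=True)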
To extract nested fields from structs, we use dot notation.
from pyspark.sql.functions import col

# Pull nested struct fields up to top-level columns using dot notation
flat_df = df.select(
    col("order_id"),
    col("customer.name").alias("customer_name"),
    col("customer.address.city").alias("city"),
    col("customer.address.zip").alias("zip")
)
flat_df.show()
+--------+-------------+--------+------+
|order_id|customer_name|city    |zip   |
+--------+-------------+--------+------+
|1234    |Alice        |Mumbai  |400001|
+--------+-------------+--------+------+
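If you want every field of a struct without listing each one, Spark also accepts a star on the struct column, which expands all of its fields into top-level columns. A quick sketch using the same df:

# Expand all fields of the customer struct into top-level columns;
# nested structs such as address remain structs and need their own star
df.select("order_id", "customer.*").show(truncate=False)
df.select("order_id", "customer.name", "customer.address.*").show(truncate=False)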
The items field is an array. We use explode() to convert each array element into a new row.
from pyspark.sql.functions import explode

# explode() produces one output row per array element
items_df = df.select(
    col("order_id"),
    explode(col("items")).alias("item"),
    col("customer.name").alias("customer_name")
)

# Flatten the inner struct produced by explode()
final_df = items_df.select(
    "order_id",
    "customer_name",
    col("item.product").alias("product"),
    col("item.price").alias("price")
)
final_df.show()
+--------+-------------+--------+------+
|order_id|customer_name|product |price |
+--------+-------------+--------+------+
|1234    |Alice        |Laptop  |70000 |
|1234    |Alice        |Mouse   |1500  |
+--------+-------------+--------+------+
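The same flattening can also be written in Spark SQL using LATERAL VIEW, which is convenient when the pipeline is SQL-first. A sketch, assuming we register the DataFrame as a temporary view named orders:

# Register the DataFrame so it can be queried with SQL
df.createOrReplaceTempView("orders")

spark.sql("""
    SELECT order_id,
           customer.name AS customer_name,
           item.product  AS product,
           item.price    AS price
    FROM orders
    LATERAL VIEW explode(items) exploded AS item
""").show()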
What if the items array had 1000 products? explode() will create 1000 separate rows, one per array element. That row-per-element shape is exactly what Spark's distributed engine is built for: each row can be filtered, aggregated, and processed in parallel, even at large scale.
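One caveat worth knowing: explode() silently drops rows whose array is null or empty, so an order with no items would disappear from the result. If such rows should survive, explode_outer() keeps them with a null item. A minimal sketch:

from pyspark.sql.functions import explode_outer

# Like explode(), but an order with a null or empty items array
# still produces one row, with item set to null
items_df = df.select(
    col("order_id"),
    explode_outer(col("items")).alias("item")
)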
Understanding how to transform nested JSON into tabular form is a crucial skill in Big Data workflows, especially when working with semi-structured data.