In many real-world scenarios, data is not stored in flat tables but comes in nested or hierarchical formats. A common example is JSON files from APIs or NoSQL databases like MongoDB. Flattening hierarchical data means transforming this complex nested structure into a flat table where each value has its own column.
Most data processing and analysis tools (like SQL engines or DataFrames) work best with flat, tabular structures. Flattening allows us to query individual nested records, run aggregations, and join the result with other flat datasets.
Suppose we receive customer data in the following nested JSON format:
nested_data = [
    {
        "customer_id": 1,
        "name": "Alice",
        "orders": [
            {"order_id": 101, "amount": 250},
            {"order_id": 102, "amount": 450}
        ]
    },
    {
        "customer_id": 2,
        "name": "Bob",
        "orders": [
            {"order_id": 103, "amount": 150}
        ]
    }
]
This structure is common in API responses. It contains an array of customers, where each customer has an array of orders. We need to flatten this so that each order becomes its own row, with customer details repeated.
After flattening, our table should look like this:
customer_id | name  | order_id | amount
------------|-------|----------|-------
1           | Alice | 101      | 250
1           | Alice | 102      | 450
2           | Bob   | 103      | 150
Why is flattening useful here?
Flattening allows us to analyze individual orders across customers, compute average order values, and join with other flat datasets like products or regions.
Let's write PySpark code to flatten such hierarchical JSON structures.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder \
    .appName("Flatten Hierarchical Data") \
    .getOrCreate()
data = [
    {
        "customer_id": 1,
        "name": "Alice",
        "orders": [
            {"order_id": 101, "amount": 250},
            {"order_id": 102, "amount": 450}
        ]
    },
    {
        "customer_id": 2,
        "name": "Bob",
        "orders": [
            {"order_id": 103, "amount": 150}
        ]
    }
]
df = spark.createDataFrame(data)
df.printSchema()
df.show(truncate=False)
root
 |-- customer_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- order_id: long (nullable = true)
 |    |    |-- amount: long (nullable = true)

+-----------+-----+------------------------+
|customer_id|name |orders                  |
+-----------+-----+------------------------+
|1          |Alice|[{101, 250}, {102, 450}]|
|2          |Bob  |[{103, 150}]            |
+-----------+-----+------------------------+
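Letting Spark infer the schema from a Python list is fine for a small sample like this, but in production pipelines the schema is usually pinned explicitly so that type drift is caught early. Below is a minimal sketch under that assumption; the variable names and the file path "customers.json" are illustrative, not part of the example above.

from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Explicit schema matching the structure inferred above
customer_schema = StructType([
    StructField("customer_id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("orders", ArrayType(StructType([
        StructField("order_id", LongType(), True),
        StructField("amount", LongType(), True)
    ])), True)
])

df_typed = spark.createDataFrame(data, schema=customer_schema)

# The same schema can be applied when reading nested JSON from a file;
# the path below is a hypothetical example:
# df_from_file = spark.read.schema(customer_schema).json("customers.json", multiLine=True)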
The explode() function converts each element of the nested array into its own row.
from pyspark.sql.functions import col
flat_df = df.withColumn("order", explode(col("orders"))).select(
    col("customer_id"),
    col("name"),
    col("order.order_id").alias("order_id"),
    col("order.amount").alias("amount")
)
flat_df.show()
+-----------+-----+--------+------+
|customer_id|name |order_id|amount|
+-----------+-----+--------+------+
|1          |Alice|     101|   250|
|1          |Alice|     102|   450|
|2          |Bob  |     103|   150|
+-----------+-----+--------+------+
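Once the data is flat, the analyses mentioned earlier (such as average order values) reduce to ordinary DataFrame operations. A quick sketch using the flat_df built above:

from pyspark.sql.functions import avg

# Average order amount per customer on the flattened table
flat_df.groupBy("customer_id", "name") \
    .agg(avg("amount").alias("avg_order_amount")) \
    .show()

As a shorthand, select("customer_id", "name", "order.*") expands every field of the exploded struct at once, which avoids listing each column by hand.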
What happens if the array is empty or null?
The explode() function simply drops that row, so customers without orders disappear from the result. If you need to keep them, use explode_outer(), which emits a row with null order fields instead, and then clean up those rows with fillna() or when/otherwise expressions.
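Here is a small sketch of the explode_outer() alternative; the third customer with an empty orders array is a hypothetical addition used only to illustrate the difference:

from pyspark.sql.functions import col, explode_outer

# Hypothetical customer with no orders; explode() would drop this row entirely
data_with_empty = data + [{"customer_id": 3, "name": "Carol", "orders": []}]
df2 = spark.createDataFrame(data_with_empty)

# explode_outer() keeps Carol, emitting nulls for the order fields
df2.withColumn("order", explode_outer(col("orders"))) \
    .select("customer_id", "name", "order.order_id", "order.amount") \
    .show()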
Flattening hierarchical data is an important step when working with nested JSON or complex NoSQL structures. It converts semi-structured data into a structured format that is easier to work with, analyze, and model. PySpark provides powerful, scalable tools like explode() and column operations to perform this transformation efficiently.