Flattening Hierarchical Data
In many real-world scenarios, data does not arrive in flat tables but in nested or hierarchical formats. A common example is JSON from APIs or NoSQL databases such as MongoDB. Flattening hierarchical data means transforming the nested structure into a flat table in which nested fields become columns and repeated (array) elements become rows.
Why Do We Flatten Data?
Most data processing and analysis tools (like SQL or DataFrames) work best with flat, tabular structures. Flattening allows us to:
- Make querying easier
- Prepare data for machine learning models
- Export to CSV or Excel formats
- Visualize the data more effectively
Real-World Example: Customer Orders
Suppose we receive customer data in the following nested JSON format:
nested_data = [
{
"customer_id": 1,
"name": "Alice",
"orders": [
{"order_id": 101, "amount": 250},
{"order_id": 102, "amount": 450}
]
},
{
"customer_id": 2,
"name": "Bob",
"orders": [
{"order_id": 103, "amount": 150}
]
}
]
This structure is common in API responses. It contains an array of customers, where each customer has an array of orders. We need to flatten this so that each order becomes its own row, with customer details repeated.
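To see the transformation in miniature before bringing in Spark, here is a plain-Python sketch over the nested_data list above (the flat_rows name is just for illustration):

# One output row per order, with the parent customer's fields repeated
flat_rows = [
    {
        "customer_id": customer["customer_id"],
        "name": customer["name"],
        "order_id": order["order_id"],
        "amount": order["amount"],
    }
    for customer in nested_data
    for order in customer["orders"]
]
# flat_rows now holds three dicts: two for Alice's orders, one for Bob's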
How Will the Flattened Output Look?
After flattening, our table should look like this:
customer_id | name  | order_id | amount
------------|-------|----------|-------
1           | Alice | 101      | 250
1           | Alice | 102      | 450
2           | Bob   | 103      | 150
Question:
Why is flattening useful here?
Answer:
Flattening allows us to analyze individual orders across customers, compute average order values, and join with other flat datasets like products or regions.
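For example, continuing the illustrative flat_rows list from the earlier sketch, the average order value per customer is a short loop over the flattened rows:

from collections import defaultdict

# name -> [running total, order count]
totals = defaultdict(lambda: [0, 0])
for row in flat_rows:
    totals[row["name"]][0] += row["amount"]
    totals[row["name"]][1] += 1

for name, (total, count) in totals.items():
    print(name, total / count)
# Alice 350.0
# Bob 150.0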
Flattening in PySpark
Let's write PySpark code to flatten such hierarchical JSON structures.
Step 1: Set up SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = (
    SparkSession.builder
    .appName("Flatten Hierarchical Data")
    .getOrCreate()
)
Step 2: Create a DataFrame with Nested Structure
data = [
{
"customer_id": 1,
"name": "Alice",
"orders": [
{"order_id": 101, "amount": 250},
{"order_id": 102, "amount": 450}
]
},
{
"customer_id": 2,
"name": "Bob",
"orders": [
{"order_id": 103, "amount": 150}
]
}
]
df = spark.createDataFrame(data)
df.printSchema()
df.show(truncate=False)
root
 |-- customer_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- order_id: long (nullable = true)
 |    |    |-- amount: long (nullable = true)

+-----------+-----+----------------------------+
|customer_id|name |orders                      |
+-----------+-----+----------------------------+
|1          |Alice|[{101, 250}, {102, 450}]    |
|2          |Bob  |[{103, 150}]                |
+-----------+-----+----------------------------+
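A side note: inferring the schema from Python dicts is convenient for a demo, but in production you may prefer to declare it explicitly. Here is a minimal sketch that mirrors the inferred schema above:

from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType
)

# Explicit schema matching the one Spark inferred above
schema = StructType([
    StructField("customer_id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("orders", ArrayType(StructType([
        StructField("order_id", LongType(), True),
        StructField("amount", LongType(), True),
    ])), True),
])

df = spark.createDataFrame(data, schema=schema)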
Step 3: Use explode() to Flatten
The explode() function converts each element of a nested array into its own row.
from pyspark.sql.functions import col
flat_df = df.withColumn("order", explode(col("orders"))).select(
col("customer_id"),
col("name"),
col("order.order_id").alias("order_id"),
col("order.amount").alias("amount")
)
flat_df.show()
+-----------+-----+--------+------+
|customer_id|name |order_id|amount|
+-----------+-----+--------+------+
|1          |Alice|     101|   250|
|1          |Alice|     102|   450|
|2          |Bob  |     103|   150|
+-----------+-----+--------+------+
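With the data flat, the analyses mentioned earlier become ordinary DataFrame operations. As a quick sketch, here is the average order amount per customer (the alias avg_order_amount is just illustrative):

from pyspark.sql.functions import avg

# Average order amount per customer on the flattened table
flat_df.groupBy("customer_id", "name") \
    .agg(avg("amount").alias("avg_order_amount")) \
    .show()
# Alice averages 350.0 across two orders; Bob's single order gives 150.0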
Question:
What happens if the array is empty or null?
Answer:
By default, explode() drops rows where the array is null or empty, so those customers disappear from the result. To keep them, use explode_outer(), which emits one row with null order values instead; any remaining nulls can then be handled with fillna() or when/otherwise expressions.
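As a small sketch of the difference, add a hypothetical third customer with no orders and compare (the customer "Carol" and the variable names are purely illustrative):

from pyspark.sql.functions import explode_outer

# Hypothetical customer with an empty orders array (illustration only)
data_with_empty = data + [{"customer_id": 3, "name": "Carol", "orders": []}]
df2 = spark.createDataFrame(data_with_empty)

# explode() would drop Carol entirely; explode_outer() keeps her,
# emitting nulls for order_id and amount
df2.withColumn("order", explode_outer(col("orders"))) \
    .select("customer_id", "name", "order.order_id", "order.amount") \
    .show()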
Conclusion
Flattening hierarchical data is a crucial step when working with nested JSON or complex NoSQL structures. It converts semi-structured data into a structured format that is easier to query, analyze, and model. PySpark provides powerful, scalable tools such as explode() and related column operations to perform this transformation efficiently.