Flattening Hierarchical Data
In many real-world scenarios, data does not arrive in flat tables but in nested or hierarchical formats. A common example is JSON from APIs or NoSQL databases such as MongoDB. Flattening hierarchical data means transforming the nested structure into a flat table in which nested fields become columns and repeated (array) elements become rows.
Why Do We Flatten Data?
Most data processing and analysis tools (like SQL or DataFrames) work best with flat, tabular structures. Flattening allows us to:
- Make querying easier
- Prepare data for machine learning models
- Export to CSV or Excel formats
- Visualize the data more effectively
Real-World Example: Customer Orders
Suppose we receive customer data in the following nested JSON format:
nested_data = [
{
"customer_id": 1,
"name": "Alice",
"orders": [
{"order_id": 101, "amount": 250},
{"order_id": 102, "amount": 450}
]
},
{
"customer_id": 2,
"name": "Bob",
"orders": [
{"order_id": 103, "amount": 150}
]
}
]
This structure is common in API responses. It contains an array of customers, where each customer has an array of orders. We need to flatten this so that each order becomes its own row, with customer details repeated.
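To see the transformation in miniature before bringing in Spark, here is a plain-Python sketch over the nested_data list above (the flat_rows name is just for illustration):

# One output row per order, with the parent customer's fields repeated
flat_rows = [
    {
        "customer_id": customer["customer_id"],
        "name": customer["name"],
        "order_id": order["order_id"],
        "amount": order["amount"],
    }
    for customer in nested_data
    for order in customer["orders"]
]
# flat_rows now holds three dicts: two for Alice's orders, one for Bob's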
How Will the Flattened Output Look?
After flattening, our table should look like this:
customer_id | name  | order_id | amount
------------|-------|----------|-------
1           | Alice | 101      | 250
1           | Alice | 102      | 450
2           | Bob   | 103      | 150
Question:
Why is flattening useful here?
Answer:
Flattening allows us to analyze individual orders across customers, compute average order values, and join with other flat datasets like products or regions.
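For example, continuing the illustrative flat_rows list from the earlier sketch, the average order value per customer is a short loop over the flattened rows:

from collections import defaultdict

# name -> [running total, order count]
totals = defaultdict(lambda: [0, 0])
for row in flat_rows:
    totals[row["name"]][0] += row["amount"]
    totals[row["name"]][1] += 1

for name, (total, count) in totals.items():
    print(name, total / count)
# Alice 350.0
# Bob 150.0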
Flattening in PySpark
Let's write PySpark code to flatten such hierarchical JSON structures.
Step 1: Set up SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = (
    SparkSession.builder
    .appName("Flatten Hierarchical Data")
    .getOrCreate()
)
Step 2: Create a DataFrame with Nested Structure
data = [
{
"customer_id": 1,
"name": "Alice",
"orders": [
{"order_id": 101, "amount": 250},
{"order_id": 102, "amount": 450}
]
},
{
"customer_id": 2,
"name": "Bob",
"orders": [
{"order_id": 103, "amount": 150}
]
}
]
df = spark.createDataFrame(data)
df.printSchema()
df.show(truncate=False)
root
 |-- customer_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- order_id: long (nullable = true)
 |    |    |-- amount: long (nullable = true)

+-----------+-----+----------------------------+
|customer_id|name |orders                      |
+-----------+-----+----------------------------+
|1          |Alice|[{101, 250}, {102, 450}]    |
|2          |Bob  |[{103, 150}]                |
+-----------+-----+----------------------------+
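A side note: inferring the schema from Python dicts is convenient for a demo, but in production you may prefer to declare it explicitly. Here is a minimal sketch that mirrors the inferred schema above:

from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType
)

# Explicit schema matching the one Spark inferred above
schema = StructType([
    StructField("customer_id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("orders", ArrayType(StructType([
        StructField("order_id", LongType(), True),
        StructField("amount", LongType(), True),
    ])), True),
])

df = spark.createDataFrame(data, schema=schema)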
Step 3: Use explode() to Flatten
The explode() function converts each element of a nested array into its own row.
from pyspark.sql.functions import col
flat_df = df.withColumn("order", explode(col("orders"))).select(
col("customer_id"),
col("name"),
col("order.order_id").alias("order_id"),
col("order.amount").alias("amount")
)
flat_df.show()
+-----------+-----+--------+------+
|customer_id|name |order_id|amount|
+-----------+-----+--------+------+
|1          |Alice|     101|   250|
|1          |Alice|     102|   450|
|2          |Bob  |     103|   150|
+-----------+-----+--------+------+
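With the data flat, the analyses mentioned earlier become ordinary DataFrame operations. As a quick sketch, here is the average order amount per customer (the alias avg_order_amount is just illustrative):

from pyspark.sql.functions import avg

# Average order amount per customer on the flattened table
flat_df.groupBy("customer_id", "name") \
    .agg(avg("amount").alias("avg_order_amount")) \
    .show()
# Alice averages 350.0 across two orders; Bob's single order gives 150.0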
Question:
What happens if the array is empty or null?
Answer:
By default, explode() drops rows where the array is null or empty, so those customers disappear from the result. To keep them, use explode_outer(), which emits one row with null order values instead; any remaining nulls can then be handled with fillna() or when/otherwise expressions.
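As a small sketch of the difference, add a hypothetical third customer with no orders and compare (the customer "Carol" and the variable names are purely illustrative):

from pyspark.sql.functions import explode_outer

# Hypothetical customer with an empty orders array (illustration only)
data_with_empty = data + [{"customer_id": 3, "name": "Carol", "orders": []}]
df2 = spark.createDataFrame(data_with_empty)

# explode() would drop Carol entirely; explode_outer() keeps her,
# emitting nulls for order_id and amount
df2.withColumn("order", explode_outer(col("orders"))) \
    .select("customer_id", "name", "order.order_id", "order.amount") \
    .show()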
Conclusion
Flattening hierarchical data is a crucial step when working with nested JSON or complex NoSQL structures. It converts semi-structured data into a structured format that is easier to query, analyze, and model. PySpark provides powerful, scalable tools such as explode() and related column operations to perform this transformation efficiently.