Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Flattening Hierarchical Data

In many real-world scenarios, data is not stored in flat tables but comes in nested or hierarchical formats. A common example is JSON files from APIs or NoSQL databases like MongoDB. Flattening hierarchical data means transforming this complex nested structure into a flat table where each value has its own column.

Why Do We Flatten Data?

Most data processing and analysis tools (such as SQL engines and DataFrames) work best with flat, tabular structures. Flattening allows us to query individual nested records directly, compute aggregates across them, and join the result with other flat datasets.

Real-World Example: Customer Orders

Suppose we receive customer data in the following nested JSON format:


nested_data = [
  {
    "customer_id": 1,
    "name": "Alice",
    "orders": [
      {"order_id": 101, "amount": 250},
      {"order_id": 102, "amount": 450}
    ]
  },
  {
    "customer_id": 2,
    "name": "Bob",
    "orders": [
      {"order_id": 103, "amount": 150}
    ]
  }
]
  

This structure is common in API responses. It contains an array of customers, where each customer has an array of orders. We need to flatten this so that each order becomes its own row, with customer details repeated.

How Will the Flattened Output Look?

After flattening, our table should look like this:

customer_id | name  | order_id | amount
------------|-------|----------|--------
1           | Alice | 101      | 250
1           | Alice | 102      | 450
2           | Bob   | 103      | 150
  

Question:

Why is flattening useful here?

Answer:

Flattening allows us to analyze individual orders across customers, compute average order values, and join with other flat datasets like products or regions.
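
Before switching to Spark, it helps to see the same transformation in plain Python. The sketch below (assuming the nested_data list defined above) produces one flat dictionary per order using a nested list comprehension:

# Plain-Python sketch: flatten nested_data into one dict per order.
flat_rows = [
    {
        "customer_id": customer["customer_id"],
        "name": customer["name"],
        "order_id": order["order_id"],
        "amount": order["amount"],
    }
    for customer in nested_data
    for order in customer["orders"]
]

for row in flat_rows:
    print(row)

This works for small payloads, but it runs on a single machine; the PySpark version below performs the same transformation in a distributed, scalable way.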

Flattening in PySpark

Let's write PySpark code to flatten such hierarchical JSON structures.

Step 1: Set up SparkSession


from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder \
    .appName("Flatten Hierarchical Data") \
    .getOrCreate()
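
Note that getOrCreate() reuses an existing session if one is already running (for example in a notebook). If you are running outside a managed cluster, you may also want to set the master explicitly; a minimal sketch, assuming a local run:

# Optional: run Spark locally, using all available CPU cores.
spark = SparkSession.builder \
    .appName("Flatten Hierarchical Data") \
    .master("local[*]") \
    .getOrCreate()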
  

Step 2: Create a DataFrame with Nested Structure


data = [
  {
    "customer_id": 1,
    "name": "Alice",
    "orders": [
      {"order_id": 101, "amount": 250},
      {"order_id": 102, "amount": 450}
    ]
  },
  {
    "customer_id": 2,
    "name": "Bob",
    "orders": [
      {"order_id": 103, "amount": 150}
    ]
  }
]

df = spark.createDataFrame(data)
df.printSchema()
df.show(truncate=False)
  
root
 |-- customer_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- order_id: long (nullable = true)
 |    |    |-- amount: long (nullable = true)

+-----------+-----+----------------------------+
|customer_id|name |orders                      |
+-----------+-----+----------------------------+
|1          |Alice|[{101, 250}, {102, 450}]    |
|2          |Bob  |[{103, 150}]                |
+-----------+-----+----------------------------+
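
In real pipelines the nested records usually arrive as JSON files (for example, dumped from an API) rather than as a Python list. A minimal sketch, assuming a hypothetical file customers.json holding the same records:

# Hypothetical: load the same nested structure directly from a JSON file.
# multiLine=True is needed when each record spans multiple lines.
df_from_file = spark.read.json("customers.json", multiLine=True)
df_from_file.printSchema()

Spark infers the nested schema (including the orders array of structs) automatically, so the flattening step that follows is the same.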
  

Step 3: Use explode() to Flatten

The explode() function converts each element of the nested array into its own row, repeating the other columns for every element.


from pyspark.sql.functions import col

flat_df = df.withColumn("order", explode(col("orders"))) \
            .select(
                col("customer_id"),
                col("name"),
                col("order.order_id").alias("order_id"),
                col("order.amount").alias("amount")
            )

flat_df.show()
  
+-----------+-----+--------+------+
|customer_id|name |order_id|amount|
+-----------+-----+--------+------+
|1          |Alice|     101|   250|
|1          |Alice|     102|   450|
|2          |Bob  |     103|   150|
+-----------+-----+--------+------+
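
With the data in this flat shape, the kinds of analysis mentioned earlier become one-liners. A small sketch computing the average order amount per customer from flat_df:

from pyspark.sql.functions import avg

# Average order amount per customer on the flattened DataFrame.
flat_df.groupBy("customer_id", "name") \
       .agg(avg("amount").alias("avg_order_amount")) \
       .show()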
  

Question:

What happens if the array is empty or null?

Answer:

The explode() function drops rows whose array is empty or null, so those customers disappear from the output. If you need to keep them, use explode_outer(), which emits a single row with nulls for the exploded columns instead.
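
A minimal sketch of this, assuming we append a hypothetical third customer who has no orders (this record is not part of the original data):

from pyspark.sql.functions import explode_outer

# Hypothetical extra customer with no orders, added only for illustration.
data_with_empty = data + [{"customer_id": 3, "name": "Carol", "orders": None}]
df2 = spark.createDataFrame(data_with_empty)

flat_outer_df = df2.withColumn("order", explode_outer(col("orders"))) \
                   .select(
                       col("customer_id"),
                       col("name"),
                       col("order.order_id").alias("order_id"),
                       col("order.amount").alias("amount")
                   )

flat_outer_df.show()

With explode_outer(), Carol appears as a single row with null order_id and amount, so no customer is silently dropped.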

Conclusion

Flattening hierarchical data is a crucial step when working with nested JSON or complex NoSQL structures. It helps convert semi-structured data into a structured format that’s easier to work with, analyze, and model. PySpark provides powerful and scalable tools like explode and column operations to perform this transformation efficiently.


