Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Dealing with Nested JSON in Apache Spark

In real-world data processing, it’s common to encounter JSON files where the data is deeply nested. These nested JSONs are difficult to query or analyze directly unless they are flattened or transformed into tabular formats.

What is a Nested JSON?

A nested JSON is a JSON structure where values themselves can be other JSON objects or arrays. This results in hierarchical or tree-like data instead of flat rows and columns.

Real-World Example: Customer Order Data

Imagine an e-commerce platform that logs every customer’s order in JSON format. Here's a small sample:


{
  "order_id": "1234",
  "customer": {
    "name": "Alice",
    "address": {
      "city": "Mumbai",
      "zip": "400001"
    }
  },
  "items": [
    {"product": "Laptop", "price": 70000},
    {"product": "Mouse", "price": 1500}
  ]
}
    

This JSON is nested at multiple levels:

- customer is a struct containing name and a further struct, address.
- address contains city and zip.
- items is an array of structs, each holding a product and a price.

Question:

Why can't we just use this JSON directly as a flat table in Spark?

Answer:

Because Spark SQL and the DataFrame API operate on rows and columns. Nested structs have to be flattened and arrays exploded so that each field becomes a directly accessible column for filtering and analysis.
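As a quick illustration (a minimal sketch, assuming the sample order has already been loaded into a DataFrame named df, as shown in Step 1 below), a bare column name only resolves against the top-level columns order_id, customer, and items:

from pyspark.sql.functions import col

# Fails: `city` is not a top-level column of the DataFrame
# df.filter(col("city") == "Mumbai")   # raises AnalysisException

# Works: the nested field is reached through its parent structs
df.filter(col("customer.address.city") == "Mumbai").show()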

Step-by-Step: Loading and Flattening Nested JSON using PySpark

Step 1: Load the JSON File


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NestedJSON").getOrCreate()

# Load a sample nested JSON
df = spark.read.json("nested_orders.json")
df.printSchema()
df.show(truncate=False)
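Note that spark.read.json() expects JSON Lines by default, i.e. one complete JSON object per line. If the file stores pretty-printed objects that span multiple lines, like the sample above, enable the multiLine option (a small variation on the load step, still assuming the same nested_orders.json file):

# For pretty-printed JSON documents spanning multiple lines
df = spark.read.option("multiLine", "true").json("nested_orders.json")
df.printSchema()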
    

Step 2: Understanding the Schema

The schema will show nested structs and arrays. This helps identify what needs to be flattened.

root
 |-- customer: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- zip: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: long (nullable = true)
 |    |    |-- product: string (nullable = true)
 |-- order_id: string (nullable = true)
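If you prefer to inspect the schema programmatically instead of reading the printed tree, the field types can be checked directly (a minimal sketch using the df loaded in Step 1):

from pyspark.sql.types import StructType, ArrayType

# Classify each top-level column by what it needs next
for field in df.schema.fields:
    if isinstance(field.dataType, StructType):
        print(f"{field.name}: struct -> flatten with dot notation")
    elif isinstance(field.dataType, ArrayType):
        print(f"{field.name}: array -> explode into rows")
    else:
        print(f"{field.name}: {field.dataType.simpleString()}")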
    

Step 3: Flattening Struct Fields

To extract nested fields from structs, we use dot notation.


from pyspark.sql.functions import col

flat_df = df.select(
    col("order_id"),
    col("customer.name").alias("customer_name"),
    col("customer.address.city").alias("city"),
    col("customer.address.zip").alias("zip")
)

flat_df.show()
    
+--------+-------------+--------+------+
|order_id|customer_name|city    |zip   |
+--------+-------------+--------+------+
|1234    |Alice        |Mumbai  |400001|
+--------+-------------+--------+------+
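Selecting each field by hand works when the schema is small and known. For deeply nested data with many struct columns, a small helper that walks the schema and builds the dotted column list saves repetition. The sketch below is one reasonable approach, not part of the Spark API; flatten_structs is a hypothetical helper name:

from pyspark.sql.types import StructType
from pyspark.sql.functions import col

def flatten_structs(df):
    # Hypothetical helper: expand every top-level struct column into
    # parent_child columns; arrays are left untouched (they need explode()).
    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            for sub in field.dataType.fields:
                cols.append(col(f"{field.name}.{sub.name}")
                            .alias(f"{field.name}_{sub.name}"))
        else:
            cols.append(col(field.name))
    return df.select(cols)

# One pass expands `customer`; a second pass expands `customer_address`
flattened = flatten_structs(flatten_structs(df))
flattened.printSchema()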
    

Step 4: Exploding Array of Items

The items field is an array. We use explode() to convert each array element into a new row.


from pyspark.sql.functions import explode

items_df = df.select(
    col("order_id"),
    explode(col("items")).alias("item"),
    col("customer.name").alias("customer_name")
)

# Flatten the inner struct
final_df = items_df.select(
    "order_id",
    "customer_name",
    col("item.product").alias("product"),
    col("item.price").alias("price")
)

final_df.show()
    
+--------+-------------+--------+------+
|order_id|customer_name|product |price |
+--------+-------------+--------+------+
|1234    |Alice        |Laptop  |70000 |
|1234    |Alice        |Mouse   |1500  |
+--------+-------------+--------+------+
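One caveat worth knowing: explode() drops rows whose array is empty or null, so an order with no items would vanish from the result. If such orders should be kept (with null product and price), explode_outer() can be used instead, following the same pattern as above:

from pyspark.sql.functions import explode_outer, col

# Keeps orders even when `items` is empty or null
items_outer_df = df.select(
    col("order_id"),
    col("customer.name").alias("customer_name"),
    explode_outer(col("items")).alias("item")
)

items_outer_df.select(
    "order_id",
    "customer_name",
    col("item.product").alias("product"),
    col("item.price").alias("price")
).show()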
    

Question:

What if the items array had 1000 products?

Answer:

The explode() function produces one output row per array element, so 1000 items become 1000 rows for that order, each repeating order_id and customer_name. Because the result is an ordinary flat DataFrame, Spark can partition, filter, and aggregate those rows in parallel, even at large scale.
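Once each item is its own row, standard DataFrame operations apply no matter how many rows explode() produced. For example, the total value of each order is a simple group-by over the final_df built in Step 4:

from pyspark.sql.functions import sum as spark_sum

# Total spend per order, computed across however many item rows exist
order_totals = final_df.groupBy("order_id", "customer_name").agg(
    spark_sum("price").alias("order_total")
)
order_totals.show()
# For the sample order: 70000 + 1500 = 71500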

Summary

Nested JSON is handled in two moves: dot notation pulls fields out of structs, and explode() turns array elements into rows. Mastering this transformation from nested JSON into tabular form is a crucial skill in Big Data workflows, especially when working with semi-structured data.


