Dealing with Nested JSON in Apache Spark
In real-world data processing, it’s common to encounter JSON files where the data is deeply nested. These nested JSONs are difficult to query or analyze directly unless they are flattened or transformed into tabular formats.
What is a Nested JSON?
A nested JSON is a JSON structure where values themselves can be other JSON objects or arrays. This results in hierarchical or tree-like data instead of flat rows and columns.
Real-World Example: Customer Order Data
Imagine an e-commerce platform that logs every customer’s order in JSON format. Here's a small sample:
{
  "order_id": "1234",
  "customer": {
    "name": "Alice",
    "address": {
      "city": "Mumbai",
      "zip": "400001"
    }
  },
  "items": [
    {"product": "Laptop", "price": 70000},
    {"product": "Mouse", "price": 1500}
  ]
}
This JSON is nested at multiple levels:
- customer is an object
- address is nested within customer
- items is an array of objects
Question:
Why can't we just use this JSON directly as a flat table in Spark?
Answer:
Because Spark SQL and the DataFrame API operate on rows and columns. A nested structure needs to be exploded or flattened so that each field becomes its own column and is directly accessible for analysis or filtering.
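To be precise, Spark SQL can reach into struct fields with dot notation even without flattening, but array elements still require explode(), and fully flat columns are much easier to filter, join, and export. A quick sketch, using the df loaded in Step 1 below:
# Struct fields can be addressed directly in SQL with dot notation.
df.createOrReplaceTempView("orders")
spark.sql("SELECT order_id, customer.address.city AS city FROM orders").show()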
Step-by-Step: Loading and Flattening Nested JSON using PySpark
Step 1: Load the JSON File
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NestedJSON").getOrCreate()
# Load a sample nested JSON.
# Spark's JSON reader expects one object per line by default; our
# sample spans multiple lines, so enable the multiLine option.
df = spark.read.option("multiLine", True).json("nested_orders.json")
df.printSchema()
df.show(truncate=False)
Step 2: Understanding the Schema
The schema will show nested structs and arrays. This helps identify what needs to be flattened.
root
 |-- customer: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- zip: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: long (nullable = true)
 |    |    |-- product: string (nullable = true)
 |-- order_id: string (nullable = true)
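For larger schemas, reading printSchema() output by eye gets tedious. A minimal sketch, assuming df was loaded as in Step 1, that walks the schema programmatically to spot which columns are structs and which are arrays:
from pyspark.sql.types import StructType, ArrayType

# Classify each top-level column by how it will need to be flattened.
for field in df.schema.fields:
    if isinstance(field.dataType, StructType):
        print(f"{field.name}: struct -> flatten with dot notation")
    elif isinstance(field.dataType, ArrayType):
        print(f"{field.name}: array -> flatten with explode()")
    else:
        print(f"{field.name}: {field.dataType.simpleString()}")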
Step 3: Flattening Struct Fields
To extract nested fields from structs, we use dot notation.
from pyspark.sql.functions import col
flat_df = df.select(
col("order_id"),
col("customer.name").alias("customer_name"),
col("customer.address.city").alias("city"),
col("customer.address.zip").alias("zip")
)
flat_df.show()
+--------+-------------+--------+------+
|order_id|customer_name|city    |zip   |
+--------+-------------+--------+------+
|1234    |Alice        |Mumbai  |400001|
+--------+-------------+--------+------+
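If you want every immediate field of a struct without listing each one, Spark also accepts the star syntax on struct columns. A small sketch; note that customer.* expands only one level, so address would still come out as a struct:
# Expand all top-level fields of the customer struct at once.
expanded_df = df.select("order_id", "customer.*")
expanded_df.printSchema()  # order_id, address (still a struct), name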
Step 4: Exploding Array of Items
The items field is an array. We use explode() to convert each array element into a new row.
from pyspark.sql.functions import explode
items_df = df.select(
col("order_id"),
explode(col("items")).alias("item"),
col("customer.name").alias("customer_name")
)
# Flatten the inner struct
final_df = items_df.select(
"order_id",
"customer_name",
col("item.product").alias("product"),
col("item.price").alias("price")
)
final_df.show()
+--------+-------------+--------+------+
|order_id|customer_name|product |price |
+--------+-------------+--------+------+
|1234    |Alice        |Laptop  |70000 |
|1234    |Alice        |Mouse   |1500  |
+--------+-------------+--------+------+
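One caveat: explode() silently drops any row whose array is null or empty. If some orders might have no items, explode_outer() keeps those orders with a null item instead. A minimal sketch:
from pyspark.sql.functions import explode_outer

# Orders with a null or empty items array survive as a row with item = null.
safe_items_df = df.select(
    col("order_id"),
    explode_outer(col("items")).alias("item")
)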
Question:
What if the items array had 1000 products?
Answer:
The explode() function will create 1000 separate rows, one per array element. Because exploded rows are distributed across partitions like any other DataFrame rows, Spark can still process and filter them efficiently at scale.
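Relatedly, if the position of each element in the array matters (for example, the order in which products were added to the cart), posexplode() emits the index alongside the value. A short sketch:
from pyspark.sql.functions import posexplode

# posexplode() generates two columns: the element's position and the element.
pos_df = df.select(
    col("order_id"),
    posexplode(col("items")).alias("pos", "item")
)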
Summary
- Nested JSON is common in real-world systems like APIs and logs
- Spark provides tools to access and flatten nested fields using dot notation and explode()
- Flattened DataFrames are easier to query and analyze using Spark SQL
Understanding how to transform nested JSON into tabular form is a crucial skill in Big Data workflows, especially when working with semi-structured data.
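As a closing sketch, the struct-flattening step can be generalized. The helper below is a hypothetical utility, not part of Spark's API: it walks the schema and expands struct columns one level at a time until none remain, leaving arrays for explode() to handle first:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col

def flatten_structs(df):
    # Hypothetical helper: repeatedly expand one level of struct
    # columns until the schema contains no structs.
    while True:
        struct_cols = [f.name for f in df.schema.fields
                       if isinstance(f.dataType, StructType)]
        if not struct_cols:
            return df
        selected = []
        for f in df.schema.fields:
            if f.name in struct_cols:
                # Promote each child field, prefixing it with the parent name.
                for child in f.dataType.fields:
                    selected.append(col(f"{f.name}.{child.name}")
                                    .alias(f"{f.name}_{child.name}"))
            else:
                selected.append(col(f.name))
        df = df.select(selected)
Applied to our example, flatten_structs(df) would yield order_id, items, customer_name, customer_address_city, and customer_address_zip as columns.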