Exploding Arrays and Structs in Apache Spark

In real-world data processing, it’s common to encounter JSON files where the data is deeply nested. These nested JSONs are difficult to query or analyze directly unless they are flattened or transformed into tabular formats.
A nested JSON is a JSON structure where values themselves can be other JSON objects or arrays. This results in hierarchical or tree-like data instead of flat rows and columns.
Imagine an e-commerce platform that logs every customer’s order in JSON format. Here's a small sample:
{
  "order_id": "1234",
  "customer": {
    "name": "Alice",
    "address": {
      "city": "Mumbai",
      "zip": "400001"
    }
  },
  "items": [
    {"product": "Laptop", "price": 70000},
    {"product": "Mouse", "price": 1500}
  ]
}
This JSON is nested at multiple levels:
- customer is an object
- address is nested within customer
- items is an array of objects

Why can't we just use this JSON directly as a flat table in Spark?

Because Spark SQL and DataFrame APIs expect rows and columns. A nested structure needs to be exploded or flattened so that each field is directly accessible for analysis or filtering.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NestedJSON").getOrCreate()

# Load a sample nested JSON; multiLine=True lets Spark parse a
# pretty-printed file like the one above, instead of expecting
# one JSON object per line (the JSON Lines default)
df = spark.read.json("nested_orders.json", multiLine=True)

df.printSchema()
df.show(truncate=False)
The schema will show nested structs and arrays. This helps identify what needs to be flattened.
root
 |-- customer: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- zip: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: long (nullable = true)
 |    |    |-- product: string (nullable = true)
 |-- order_id: string (nullable = true)
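Schema inference works well for small samples, but on large files it costs an extra pass over the data and can guess types inconsistently. If you already know the layout, you can supply an explicit schema instead. Here is a minimal sketch mirroring the printSchema() output above (the field names match this example file; adapt them to your data):

from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, ArrayType
)

# Hand-written schema mirroring the nested layout printed above
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StructType([
        StructField("name", StringType()),
        StructField("address", StructType([
            StructField("city", StringType()),
            StructField("zip", StringType()),
        ])),
    ])),
    StructField("items", ArrayType(StructType([
        StructField("product", StringType()),
        StructField("price", LongType()),
    ]))),
])

# Skip inference entirely by passing the schema to the reader
df = spark.read.json("nested_orders.json", schema=order_schema, multiLine=True)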
To extract nested fields from structs, we use dot notation.
from pyspark.sql.functions import col

# Pull nested struct fields up to top-level columns using dot notation
flat_df = df.select(
    col("order_id"),
    col("customer.name").alias("customer_name"),
    col("customer.address.city").alias("city"),
    col("customer.address.zip").alias("zip")
)
flat_df.show()
+--------+-------------+--------+------+
|order_id|customer_name|city    |zip   |
+--------+-------------+--------+------+
|1234    |Alice        |Mumbai  |400001|
+--------+-------------+--------+------+
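If you want every field of a struct without listing each one, Spark also accepts a star on the struct column, which expands all of its fields into top-level columns. A quick sketch using the same df:

# Expand all fields of the customer struct into top-level columns;
# nested structs such as address remain structs and need their own star
df.select("order_id", "customer.*").show(truncate=False)
df.select("order_id", "customer.name", "customer.address.*").show(truncate=False)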
The items field is an array. We use explode() to convert each array element into a new row.
from pyspark.sql.functions import explode

# explode() produces one output row per array element
items_df = df.select(
    col("order_id"),
    explode(col("items")).alias("item"),
    col("customer.name").alias("customer_name")
)

# Flatten the inner struct produced by explode()
final_df = items_df.select(
    "order_id",
    "customer_name",
    col("item.product").alias("product"),
    col("item.price").alias("price")
)
final_df.show()
+--------+-------------+--------+------+
|order_id|customer_name|product |price |
+--------+-------------+--------+------+
|1234    |Alice        |Laptop  |70000 |
|1234    |Alice        |Mouse   |1500  |
+--------+-------------+--------+------+
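The same flattening can also be written in Spark SQL using LATERAL VIEW, which is convenient when the pipeline is SQL-first. A sketch, assuming we register the DataFrame as a temporary view named orders:

# Register the DataFrame so it can be queried with SQL
df.createOrReplaceTempView("orders")

spark.sql("""
    SELECT order_id,
           customer.name AS customer_name,
           item.product  AS product,
           item.price    AS price
    FROM orders
    LATERAL VIEW explode(items) exploded AS item
""").show()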
What if the items array had 1000 products? explode() will create 1000 separate rows, one per array element. That row-per-element shape is exactly what Spark's distributed engine is built for: each row can be filtered, aggregated, and processed in parallel, even at large scale.
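One caveat worth knowing: explode() silently drops rows whose array is null or empty, so an order with no items would disappear from the result. If such rows should survive, explode_outer() keeps them with a null item. A minimal sketch:

from pyspark.sql.functions import explode_outer

# Like explode(), but an order with a null or empty items array
# still produces one row, with item set to null
items_df = df.select(
    col("order_id"),
    explode_outer(col("items")).alias("item")
)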
Understanding how to transform nested JSON into tabular form is a crucial skill in Big Data workflows, especially when working with semi-structured data.