Flattening Hierarchical Data

In many real-world datasets, data is not always stored in simple rows and columns. Instead, we often find complex nested structures like arrays and structs inside DataFrame columns. To make this data easier to analyze, we need to "explode" or flatten it into a tabular format.
An array is a collection of elements stored in a single column. For example, a product can have multiple tags or a user can have multiple email addresses.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("ExplodeArrays").getOrCreate()

# Each person has an array of skills stored in a single column
data = [
    ("Alice", ["python", "spark"]),
    ("Bob", ["java", "hadoop", "kafka"])
]
df = spark.createDataFrame(data, ["name", "skills"])
df.show(truncate=False)
+-----+---------------------+
|name |skills               |
+-----+---------------------+
|Alice|[python, spark]      |
|Bob  |[java, hadoop, kafka]|
+-----+---------------------+
Each row has a list of skills in an array. This is hard to analyze if we want one row per skill. That's where explode() helps.
# explode() produces one output row per array element
exploded_df = df.withColumn("skill", explode(df.skills))
exploded_df.show(truncate=False)
+-----+---------------------+------+
|name |skills               |skill |
+-----+---------------------+------+
|Alice|[python, spark]      |python|
|Alice|[python, spark]      |spark |
|Bob  |[java, hadoop, kafka]|java  |
|Bob  |[java, hadoop, kafka]|hadoop|
|Bob  |[java, hadoop, kafka]|kafka |
+-----+---------------------+------+
Why do we get multiple rows for one person?
Because explode() flattens each element of the array into its own row. It's useful when you want to perform analysis per item (e.g., count how many people know Python).
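For example, a minimal sketch of that per-item analysis, using the exploded_df built above:

# Count how many people list each skill
skill_counts = exploded_df.groupBy("skill").count()
skill_counts.show()

Because every array element now sits in its own row, an ordinary groupBy does the counting.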
A struct is like a mini table or dictionary inside a column. It groups related fields together. Think of an address with street, city, and zip inside one column.
from pyspark.sql import Row
# Nested Row objects create a struct column with street, city, and zip fields
data = [
    Row(name="Alice", address=Row(street="1st Ave", city="New York", zip="10001")),
    Row(name="Bob", address=Row(street="2nd St", city="Los Angeles", zip="90001"))
]
df_struct = spark.createDataFrame(data)
df_struct.show(truncate=False)
+-----+----------------------------+
|name |address                     |
+-----+----------------------------+
|Alice|{1st Ave, New York, 10001}  |
|Bob  |{2nd St, Los Angeles, 90001}|
+-----+----------------------------+
We can access individual fields using dot notation like address.city.
# Promote each struct field to a top-level column
df_flat = df_struct.select(
"name",
"address.street",
"address.city",
"address.zip"
)
df_flat.show()
+-----+-------+-----------+-----+
|name |street |city       |zip  |
+-----+-------+-----------+-----+
|Alice|1st Ave|New York   |10001|
|Bob  |2nd St |Los Angeles|90001|
+-----+-------+-----------+-----+
Do we need to explode structs like arrays?
No, structs don't need to be exploded. They are accessed using dot notation and expanded into multiple columns, unlike arrays, which require row duplication.
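As a shortcut, Spark can also expand every field of a struct at once with the * wildcard; a quick sketch using the df_struct from above:

# Expand all fields of the address struct in a single select
df_struct.select("name", "address.*").show()

This yields the same three address columns without listing each field by name.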
Sometimes, arrays can contain structs, making things even more nested. Let’s look at one such case:
# Reusing Row (imported above) so the emails column is inferred as an array of
# structs; plain Python dicts would be inferred as maps instead
data = [
    ("Alice", [Row(type="home", email="alice@home.com"), Row(type="work", email="alice@work.com")]),
    ("Bob", [Row(type="work", email="bob@company.com")])
]
df_nested = spark.createDataFrame(data, ["name", "emails"])
df_nested.show(truncate=False)
+-----+------------------------------------------------+
|name |emails                                          |
+-----+------------------------------------------------+
|Alice|[{home, alice@home.com}, {work, alice@work.com}]|
|Bob  |[{work, bob@company.com}]                       |
+-----+------------------------------------------------+
from pyspark.sql.functions import col

# First explode the array so each email struct gets its own row,
# then flatten the struct fields into top-level columns
df_exploded = df_nested.withColumn("email_struct", explode("emails"))
df_final = df_exploded.select(
    "name",
    col("email_struct.type").alias("email_type"),
    col("email_struct.email").alias("email_address")
)
df_final.show()
+-----+----------+---------------+
|name |email_type|email_address  |
+-----+----------+---------------+
|Alice|home      |alice@home.com |
|Alice|work      |alice@work.com |
|Bob  |work      |bob@company.com|
+-----+----------+---------------+
This example demonstrates the real power of combining explode() with nested fields. Now, each row represents a single email with its type, ready for analysis.
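Once the data is flat, standard DataFrame operations apply directly; for example, a quick sketch counting emails by type on the df_final built above:

# Count emails per type across all users
df_final.groupBy("email_type").count().show()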
To summarize: use dot notation to flatten structs into columns, and explode() to create one row per array element. Understanding how to work with arrays and structs is essential for handling complex JSON or semi-structured data in Apache Spark. As a beginner, practice these transformations to build confidence in handling real-world nested data.
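The examples above built DataFrames by hand, but the same steps apply to JSON files read from disk. A minimal sketch, assuming a hypothetical users.json with one JSON object per line in the shape of the emails example:

# Hypothetical file; each line looks like:
# {"name": "Alice", "emails": [{"type": "home", "email": "alice@home.com"}]}
df_json = spark.read.json("users.json")
df_json.withColumn("e", explode("emails")) \
    .select("name", col("e.type"), col("e.email")) \
    .show()

spark.read.json infers JSON arrays of objects as arrays of structs, so the explode-then-select pattern carries over unchanged.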