Exploding Arrays and Structs in Apache Spark
In many real-world datasets, data is not always stored in simple rows and columns. Instead, we often find complex nested structures like arrays and structs inside DataFrame columns. To make this data easier to analyze, we need to "explode" or flatten it into a tabular format.
What is an Array in Spark?
An array is a collection of elements stored in a single column. For example, a product can have multiple tags or a user can have multiple email addresses.
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.appName("ExplodeArrays").getOrCreate()
data = [
("Alice", ["python", "spark"]),
("Bob", ["java", "hadoop", "kafka"])
]
df = spark.createDataFrame(data, ["name", "skills"])
df.show(truncate=False)
+-----+---------------------+
|name |skills               |
+-----+---------------------+
|Alice|[python, spark]      |
|Bob  |[java, hadoop, kafka]|
+-----+---------------------+
Each row has a list of skills in an array. This is hard to analyze if we want one row per skill. That's where explode() helps.
Exploding the Array:
exploded_df = df.withColumn("skill", explode(df.skills))
exploded_df.show()
+-----+---------------------+------+
|name |skills               |skill |
+-----+---------------------+------+
|Alice|[python, spark]      |python|
|Alice|[python, spark]      |spark |
|Bob  |[java, hadoop, kafka]|java  |
|Bob  |[java, hadoop, kafka]|hadoop|
|Bob  |[java, hadoop, kafka]|kafka |
+-----+---------------------+------+
Question:
Why do we get multiple rows for one person?
Answer:
Because explode() flattens each element of the array into its own row. It's useful when you want to perform analysis per item (e.g., count how many people know Python).
What is a Struct in Spark?
A struct is like a mini table or dictionary inside a column. It groups related fields together. Think of an address with street, city, and zip inside one column.
Example:
from pyspark.sql import Row
data = [
Row(name="Alice", address=Row(street="1st Ave", city="New York", zip="10001")),
Row(name="Bob", address=Row(street="2nd St", city="Los Angeles", zip="90001"))
]
df_struct = spark.createDataFrame(data)
df_struct.show(truncate=False)
+-----+----------------------------+
|name |address                     |
+-----+----------------------------+
|Alice|{1st Ave, New York, 10001}  |
|Bob  |{2nd St, Los Angeles, 90001}|
+-----+----------------------------+
Accessing Fields from a Struct
We can access individual fields using dot notation, such as address.city.
Extracting Struct Fields:
df_flat = df_struct.select(
"name",
"address.street",
"address.city",
"address.zip"
)
df_flat.show()
+-----+-------+-----------+-----+
|name |street |city       |zip  |
+-----+-------+-----------+-----+
|Alice|1st Ave|New York   |10001|
|Bob  |2nd St |Los Angeles|90001|
+-----+-------+-----------+-----+
Question:
Do we need to explode structs like arrays?
Answer:
No, structs don’t need to be exploded. They are accessed using dot notation and expanded into multiple columns, unlike arrays which require row duplication.
Combining Explode with Structs
Sometimes, arrays can contain structs, making things even more nested. Let’s look at one such case:
Example: Array of Structs
from pyspark.sql import Row
# Use Row (not dict) for the elements: Python dicts are inferred as maps,
# while Row objects are inferred as structs
data = [
    ("Alice", [Row(type="home", email="alice@home.com"), Row(type="work", email="alice@work.com")]),
    ("Bob", [Row(type="work", email="bob@company.com")])
]
df_nested = spark.createDataFrame(data, ["name", "emails"])
df_nested.show(truncate=False)
+-----+------------------------------------------------+
|name |emails                                          |
+-----+------------------------------------------------+
|Alice|[{home, alice@home.com}, {work, alice@work.com}]|
|Bob  |[{work, bob@company.com}]                       |
+-----+------------------------------------------------+
Exploding and Flattening Structs Inside Arrays:
from pyspark.sql.functions import col
df_exploded = df_nested.withColumn("email_struct", explode("emails"))
df_final = df_exploded.select(
"name",
col("email_struct.type").alias("email_type"),
col("email_struct.email").alias("email_address")
)
df_final.show()
+-----+----------+---------------+
|name |email_type|email_address  |
+-----+----------+---------------+
|Alice|home      |alice@home.com |
|Alice|work      |alice@work.com |
|Bob  |work      |bob@company.com|
+-----+----------+---------------+
This example demonstrates the real power of combining explode() with nested fields. Now, each row represents a single email with its type, ready for analysis.
Summary
- Arrays are lists inside columns. Use explode() to create one row per element.
- Structs are groupings of related fields. Access them using dot notation.
- Array of Structs can be exploded and then accessed with dot notation to fully flatten the data.
Understanding how to work with arrays and structs is essential for handling complex JSON or semi-structured data in Apache Spark. As a beginner, practice these transformations to build confidence in handling real-world nested data.