Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Exploding Arrays and Structs in Apache Spark

In many real-world datasets, data is not always stored in simple rows and columns. Instead, we often find complex nested structures like arrays and structs inside DataFrame columns. To make this data easier to analyze, we need to "explode" or flatten it into a tabular format.

What is an Array in Spark?

An array is a collection of elements stored in a single column. For example, a product can have multiple tags or a user can have multiple email addresses.

Example:


from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("ExplodeArrays").getOrCreate()

data = [
    ("Alice", ["python", "spark"]),
    ("Bob", ["java", "hadoop", "kafka"])
]

df = spark.createDataFrame(data, ["name", "skills"])
df.show(truncate=False)
    
+-----+---------------------+
|name |skills               |
+-----+---------------------+
|Alice|[python, spark]      |
|Bob  |[java, hadoop, kafka]|
+-----+---------------------+
    

Each row has a list of skills in an array. This is hard to analyze if we want one row per skill. That's where explode() helps.

Exploding the Array:


exploded_df = df.withColumn("skill", explode(df.skills))
exploded_df.show(truncate=False)
    
+-----+---------------------+------+
|name |skills               |skill |
+-----+---------------------+------+
|Alice|[python, spark]      |python|
|Alice|[python, spark]      |spark |
|Bob  |[java, hadoop, kafka]|java  |
|Bob  |[java, hadoop, kafka]|hadoop|
|Bob  |[java, hadoop, kafka]|kafka |
+-----+---------------------+------+
    

Question:

Why do we get multiple rows for one person?

Answer:

Because explode() flattens each element of the array into its own row. It's useful when you want to perform analysis per item (e.g., count how many people know Python).
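For example, building on exploded_df from above, a per-skill analysis becomes a simple aggregation. A quick sketch using only what we have already defined:


# With one row per (name, skill), counting people per skill is trivial
exploded_df.groupBy("skill").count().show()

# Or answer a specific question: how many people know Python?
exploded_df.filter(exploded_df.skill == "python").count()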

What is a Struct in Spark?

A struct is like a mini table or dictionary inside a column. It groups related fields together. Think of an address with street, city, and zip inside one column.

Example:


from pyspark.sql import Row

data = [
    Row(name="Alice", address=Row(street="1st Ave", city="New York", zip="10001")),
    Row(name="Bob", address=Row(street="2nd St", city="Los Angeles", zip="90001"))
]

df_struct = spark.createDataFrame(data)
df_struct.show(truncate=False)
    
+-----+-----------------------------+
|name |address                      |
+-----+-----------------------------+
|Alice|{1st Ave, New York, 10001}   |
|Bob  |{2nd St, Los Angeles, 90001} |
+-----+-----------------------------+
    

Accessing Fields from a Struct

We can access individual fields using dot notation like address.city.

Extracting Struct Fields:


df_flat = df_struct.select(
    "name",
    "address.street",
    "address.city",
    "address.zip"
)

df_flat.show()
    
+-----+--------+-------------+-----+
|name |street  |city         |zip  |
+-----+--------+-------------+-----+
|Alice|1st Ave |New York     |10001|
|Bob  |2nd St  |Los Angeles  |90001|
+-----+--------+-------------+-----+
    

Question:

Do we need to explode structs like arrays?

Answer:

No, structs don’t need to be exploded. They are accessed using dot notation and expanded into multiple columns, unlike arrays which require row duplication.
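If you want all of a struct's fields without listing each one, the star syntax expands them in a single step. A quick sketch using df_struct from above:


# "address.*" promotes every struct field to a top-level column
df_struct.select("name", "address.*").show(truncate=False)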

Combining Explode with Structs

Sometimes, arrays can contain structs, making things even more nested. Let’s look at one such case:

Example: Array of Structs


data = [
    ("Alice", [("home", "alice@home.com"), ("work", "alice@work.com")]),
    ("Bob", [("work", "bob@company.com")])
]

# Pass an explicit schema: nested Python dicts would be inferred as maps, not structs
df_nested = spark.createDataFrame(
    data, "name string, emails array<struct<type:string, email:string>>"
)
df_nested.show(truncate=False)
    
+-----+------------------------------------------------+
|name |emails                                          |
+-----+------------------------------------------------+
|Alice|[{home, alice@home.com}, {work, alice@work.com}]|
|Bob  |[{work, bob@company.com}]                       |
+-----+------------------------------------------------+
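If you are ever unsure how Spark typed a nested column, printSchema() makes the structure explicit. For df_nested it should report an array of structs:


df_nested.printSchema()

root
 |-- name: string (nullable = true)
 |-- emails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- email: string (nullable = true)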
    

Exploding and Flattening Structs Inside Arrays:


from pyspark.sql.functions import col

df_exploded = df_nested.withColumn("email_struct", explode("emails"))
df_final = df_exploded.select(
    "name",
    col("email_struct.type").alias("email_type"),
    col("email_struct.email").alias("email_address")
)

df_final.show()
    
+-----+----------+------------------+
|name |email_type|email_address     |
+-----+----------+------------------+
|Alice|home      |alice@home.com    |
|Alice|work      |alice@work.com    |
|Bob  |work      |bob@company.com   |
+-----+----------+------------------+
    

This example demonstrates the real power of combining explode() with nested fields. Now, each row represents a single email with its type, ready for analysis.
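One caveat worth knowing: explode() drops rows whose array is empty or null, so a user with no emails would disappear from df_final entirely. explode_outer() keeps such rows, filling the exploded column with null. A minimal sketch, assuming a hypothetical third user, Carol, with an empty email list:


from pyspark.sql.functions import explode_outer

# Carol (hypothetical) has no emails; plain explode() would drop her row
df_gap = spark.createDataFrame(
    data + [("Carol", [])],
    "name string, emails array<struct<type:string, email:string>>"
)

# explode_outer() keeps Carol, with null in the email_struct column
df_gap.withColumn("email_struct", explode_outer("emails")).show(truncate=False)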

Summary

Understanding how to work with arrays and structs is essential for handling JSON and other semi-structured data in Apache Spark. Practice these transformations to build confidence with real-world nested datasets.


