Data aggregation is the process of summarizing and transforming data into a more understandable format. It involves operations like calculating sums, averages, counts, and other metrics grouped by certain keys or categories.
When working with large datasets, you often need summarized insights. For example, instead of looking at every single sale, you might want to know the total sales per region or the average order value per customer.
Why not look at every individual record instead of aggregating?
Because summaries help identify trends and patterns more easily. Aggregation reduces data size while increasing interpretability.
A pivot operation is used to rotate rows into columns. It’s like converting tall data into wide format — very useful for comparing metrics side-by-side.
Suppose you have monthly sales data for different products. A pivot can help you turn rows of months into columns like Jan, Feb, Mar... so you can compare product sales across months easily.
Let’s work with a simple sales dataset in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg
spark = SparkSession.builder.appName("AggregationPivotExample").getOrCreate()
data = [
("North", "ProductA", "January", 100),
("North", "ProductA", "February", 120),
("North", "ProductB", "January", 80),
("South", "ProductA", "January", 90),
("South", "ProductB", "February", 150),
("South", "ProductB", "January", 130),
]
columns = ["Region", "Product", "Month", "Sales"]
df = spark.createDataFrame(data, columns)
df.show()
+------+--------+--------+-----+
|Region|Product |Month   |Sales|
+------+--------+--------+-----+
|North |ProductA|January |100  |
|North |ProductA|February|120  |
|North |ProductB|January |80   |
|South |ProductA|January |90   |
|South |ProductB|February|150  |
|South |ProductB|January |130  |
+------+--------+--------+-----+
# Total sales per region
df.groupBy("Region").agg(sum("Sales").alias("Total_Sales")).show()
+------+-----------+
|Region|Total_Sales|
+------+-----------+
|North |300        |
|South |370        |
+------+-----------+
This tells us that the South region (370) had slightly higher total sales than the North region (300).
Can we also find average sales per region?
Yes, just use the avg() function instead of sum().
For a finer breakdown, group by both Region and Product:
df.groupBy("Region", "Product").agg(sum("Sales").alias("Total_Sales")).show()
+------+--------+-----------+
|Region|Product |Total_Sales|
+------+--------+-----------+
|North |ProductA|220        |
|North |ProductB|80         |
|South |ProductA|90         |
|South |ProductB|280        |
+------+--------+-----------+
This shows how each product performs in different regions.
Let’s pivot the data to see product sales per month in a table format:
df.groupBy("Product").pivot("Month").agg(sum("Sales")).show()
+--------+-------+--------+
|Product |January|February|
+--------+-------+--------+
|ProductA|190    |120     |
|ProductB|210    |150     |
+--------+-------+--------+
Now it's easier to visually compare monthly sales for each product.
To make it more detailed, you can group by both Region and Product before pivoting:
df.groupBy("Region", "Product").pivot("Month").agg(sum("Sales")).show()
+------+--------+-------+--------+
|Region|Product |January|February|
+------+--------+-------+--------+
|North |ProductA|100    |120     |
|North |ProductB|80     |null    |
|South |ProductA|90     |null    |
|South |ProductB|130    |150     |
+------+--------+-------+--------+
Aggregation and pivot operations are essential for summarizing large datasets and extracting meaningful insights. They help analysts and engineers convert raw data into structured, insightful reports.
As you move to bigger datasets, Spark’s distributed nature makes these operations faster and scalable, especially when handling millions of rows spread across clusters.