Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Data Aggregation and Pivots in PySpark



What is Data Aggregation?

Data aggregation is the process of summarizing and transforming data into a more understandable format. It involves operations like calculating sums, averages, counts, and other metrics grouped by certain keys or categories.

Why Do We Aggregate Data?

When working with large datasets, you often need summarized insights. For example, instead of looking at every single sale, you might want to know the total sales per region or the average order value per customer.

Question:

Why not look at every individual record instead of aggregating?

Answer:

Because summaries help identify trends and patterns more easily. Aggregation reduces data size while increasing interpretability.

Common Aggregation Functions
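
PySpark exposes these through pyspark.sql.functions. The ones you will use most often are count(), sum(), avg() (also available as mean()), min(), max(), and countDistinct(), typically applied inside agg() after a groupBy().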

What is a Pivot?

A pivot operation is used to rotate rows into columns. It’s like converting tall data into wide format — very useful for comparing metrics side-by-side.

Example:

Suppose you have monthly sales data for different products. A pivot can help you turn rows of months into columns like Jan, Feb, Mar... so you can compare product sales across months easily.

Example Dataset

Let’s work with a simple sales dataset in PySpark:


from pyspark.sql import SparkSession
# sum and avg here are Spark column functions; note that this import shadows Python's built-in sum()
from pyspark.sql.functions import sum, avg

spark = SparkSession.builder.appName("AggregationPivotExample").getOrCreate()

data = [
    ("North", "ProductA", "January", 100),
    ("North", "ProductA", "February", 120),
    ("North", "ProductB", "January", 80),
    ("South", "ProductA", "January", 90),
    ("South", "ProductB", "February", 150),
    ("South", "ProductB", "January", 130),
]

columns = ["Region", "Product", "Month", "Sales"]

df = spark.createDataFrame(data, columns)
df.show()
    
+------+--------+--------+-----+
|Region|Product |Month   |Sales|
+------+--------+--------+-----+
|North |ProductA|January |100  |
|North |ProductA|February|120  |
|North |ProductB|January |80   |
|South |ProductA|January |90   |
|South |ProductB|February|150  |
|South |ProductB|January |130  |
+------+--------+--------+-----+
    

Grouping and Aggregating Sales by Region


# Total sales per region
df.groupBy("Region").agg(sum("Sales").alias("Total_Sales")).show()
    
+------+-----------+
|Region|Total_Sales|
+------+-----------+
|North |300        |
|South |370        |
+------+-----------+
    

This tells us that the South region (370) had higher total sales than the North region (300).

Question:

Can we also find average sales per region?

Answer:

Yes, just use the avg() function instead of sum().
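
For example, a minimal sketch reusing the df defined above:


# Average sales per region
df.groupBy("Region").agg(avg("Sales").alias("Avg_Sales")).show()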

Aggregating by Region and Product


# Total sales per region and product
df.groupBy("Region", "Product").agg(sum("Sales").alias("Total_Sales")).show()
    
+------+--------+-----------+
|Region|Product |Total_Sales|
+------+--------+-----------+
|North |ProductA|220        |
|North |ProductB|80         |
|South |ProductA|90         |
|South |ProductB|280        |
+------+--------+-----------+
    

This shows how each product performs in different regions.
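
agg() also accepts several aggregations at once, so you can compute totals, averages, and counts in a single pass. A quick sketch (count is imported here in addition to the functions used above):


from pyspark.sql.functions import count

df.groupBy("Region", "Product").agg(
    sum("Sales").alias("Total_Sales"),
    avg("Sales").alias("Avg_Sales"),
    count("Sales").alias("Order_Count"),
).show()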

Performing a Pivot

Let’s pivot the data to see product sales per month in a table format. Passing the list of month values to pivot() keeps the columns in calendar order and lets Spark skip the extra pass it would otherwise need to discover them:


df.groupBy("Product").pivot("Month", ["January", "February"]).agg(sum("Sales")).show()
    
+--------+-------+--------+
|Product |January|February|
+--------+-------+--------+
|ProductA|190    |120     |
|ProductB|210    |150     |
+--------+-------+--------+
    

Now it's easier to visually compare monthly sales for each product.

Combining Pivot and Region

To make it more detailed, you can group by both Region and Product before pivoting (again passing the month values explicitly):


df.groupBy("Region", "Product").pivot("Month", ["January", "February"]).agg(sum("Sales")).show()
    
+------+--------+-------+--------+
|Region|Product |January|February|
+------+--------+-------+--------+
|North |ProductA|100    |120     |
|North |ProductB|80     |null    |
|South |ProductA|90     |null    |
|South |ProductB|130    |150     |
+------+--------+-------+--------+
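
The null values mark month/product combinations with no sales rows (for example, ProductB had no February sales in the North). If you prefer zeros in the report, a small sketch using fillna:


# Replace missing combinations with 0 for a cleaner report
pivoted = df.groupBy("Region", "Product").pivot("Month", ["January", "February"]).agg(sum("Sales"))
pivoted.fillna(0).show()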
    

Summary

Aggregation and pivot operations are essential for summarizing large datasets and extracting meaningful insights. They help analysts and engineers convert raw data into structured, insightful reports.

Final Thought:

As you move to bigger datasets, Spark’s distributed execution keeps these operations fast and scalable, even when the rows number in the millions and are spread across a cluster.


