Data aggregation is the process of summarizing and transforming data into a more understandable format. It involves operations like calculating sums, averages, counts, and other metrics grouped by certain keys or categories.
When working with large datasets, you often need summarized insights. For example, instead of looking at every single sale, you might want to know the total sales per region or the average order value per customer.
Why not look at every individual record instead of aggregating?
Because summaries help identify trends and patterns more easily. Aggregation reduces data size while increasing interpretability.
A pivot operation is used to rotate rows into columns. It’s like converting tall data into wide format — very useful for comparing metrics side-by-side.
Suppose you have monthly sales data for different products. A pivot can help you turn rows of months into columns like Jan, Feb, Mar... so you can compare product sales across months easily.
Let’s work with a simple sales dataset in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg
spark = SparkSession.builder.appName("AggregationPivotExample").getOrCreate()
data = [
("North", "ProductA", "January", 100),
("North", "ProductA", "February", 120),
("North", "ProductB", "January", 80),
("South", "ProductA", "January", 90),
("South", "ProductB", "February", 150),
("South", "ProductB", "January", 130),
]
columns = ["Region", "Product", "Month", "Sales"]
df = spark.createDataFrame(data, columns)
df.show()
+------+--------+--------+-----+
|Region|Product |Month   |Sales|
+------+--------+--------+-----+
|North |ProductA|January |100  |
|North |ProductA|February|120  |
|North |ProductB|January |80   |
|South |ProductA|January |90   |
|South |ProductB|February|150  |
|South |ProductB|January |130  |
+------+--------+--------+-----+
# Total sales per region
df.groupBy("Region").agg(sum("Sales").alias("Total_Sales")).show()
+------+-----------+
|Region|Total_Sales|
+------+-----------+
|North |300        |
|South |370        |
+------+-----------+
This tells us that the South region (370) had slightly higher total sales than the North region (300).
Can we also find average sales per region?
Yes, just use the avg() function instead of sum().
For a finer breakdown, group by both Region and Product:
df.groupBy("Region", "Product").agg(sum("Sales").alias("Total_Sales")).show()
+------+--------+-----------+
|Region|Product |Total_Sales|
+------+--------+-----------+
|North |ProductA|220        |
|North |ProductB|80         |
|South |ProductA|90         |
|South |ProductB|280        |
+------+--------+-----------+
This shows how each product performs in different regions.
Let’s pivot the data to see product sales per month in a table format:
df.groupBy("Product").pivot("Month").agg(sum("Sales")).show()
+--------+-------+--------+
|Product |January|February|
+--------+-------+--------+
|ProductA|190    |120     |
|ProductB|210    |150     |
+--------+-------+--------+
Now it's easier to visually compare monthly sales for each product.
To make it more detailed, you can group by both Region and Product before pivoting:
df.groupBy("Region", "Product").pivot("Month").agg(sum("Sales")).show()
+------+--------+-------+--------+
|Region|Product |January|February|
+------+--------+-------+--------+
|North |ProductA|100    |120     |
|North |ProductB|80     |null    |
|South |ProductA|90     |null    |
|South |ProductB|130    |150     |
+------+--------+-------+--------+
Aggregation and pivot operations are essential for summarizing large datasets and extracting meaningful insights. They help analysts and engineers convert raw data into structured, insightful reports.
As you move to bigger datasets, Spark’s distributed nature makes these operations faster and scalable, especially when handling millions of rows spread across clusters.