Common Aggregations and Joins in Spark SQL
In data analysis, aggregations and joins are two of the most commonly used operations. Spark SQL makes these tasks simple and efficient, even when working with very large datasets. In this lesson, you'll learn how to apply common aggregation functions and perform join operations using PySpark.
What are Aggregations?
Aggregations allow us to summarize data. For example, we can calculate totals, averages, minimums, and maximums across a dataset. In Spark SQL, we use groupBy and agg together with built-in functions such as count, avg, sum, min, and max.
Example: Total Sales by Region
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, count
# Start a Spark session
spark = SparkSession.builder.appName("AggregationsExample").getOrCreate()
# Sample data
data = [
    ("North", "Product A", 100),
    ("North", "Product B", 200),
    ("South", "Product A", 150),
    ("South", "Product B", 50),
    ("East", "Product C", 300)
]
columns = ["region", "product", "sales"]
df = spark.createDataFrame(data, columns)
# Group by region and calculate total sales
agg_df = df.groupBy("region").agg(sum("sales").alias("total_sales"))
agg_df.show()
+------+-----------+
|region|total_sales|
+------+-----------+
| North|        300|
| South|        200|
|  East|        300|
+------+-----------+
Question:
Why do we use alias("total_sales") in the aggregation?
Answer:
By default, the aggregated column would have a name like sum(sales). The alias() function renames it to a more readable column name like total_sales.
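The same aggregation can also be written as a plain SQL query by registering the DataFrame as a temporary view. The following is a minimal sketch; the view name sales is an illustrative choice, not something defined earlier in the lesson.
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("sales")
# Equivalent GROUP BY query executed through spark.sql
sql_agg_df = spark.sql("""
    SELECT region, SUM(sales) AS total_sales
    FROM sales
    GROUP BY region
""")
sql_agg_df.show()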
Additional Aggregations
We can calculate more than one aggregation at a time.
agg_df = df.groupBy("region").agg(
sum("sales").alias("total_sales"),
avg("sales").alias("average_sales"),
count("product").alias("product_count")
)
agg_df.show()
+------+-----------+-------------+-------------+
|region|total_sales|average_sales|product_count|
+------+-----------+-------------+-------------+
| North|        300|        150.0|            2|
| South|        200|        100.0|            2|
|  East|        300|        300.0|            1|
+------+-----------+-------------+-------------+
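Aggregations don't have to be grouped at all. As a quick sketch using the same df as above, agg can be called directly on the DataFrame to summarize the entire dataset in a single row:
# Aggregate over the whole DataFrame (no groupBy) to get overall totals
overall_df = df.agg(
    sum("sales").alias("overall_sales"),
    avg("sales").alias("average_sale")
)
overall_df.show()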
What are Joins?
Joins allow you to combine two DataFrames based on a common column (similar to SQL joins). This is useful when your data is split across multiple sources or tables.
Types of Joins in Spark SQL
- Inner Join: Returns rows that have matching values in both DataFrames.
- Left Join (Left Outer): Returns all rows from the left DataFrame and matched rows from the right.
- Right Join (Right Outer): Returns all rows from the right DataFrame and matched rows from the left.
- Full Outer Join: Returns all rows from both DataFrames, matching rows where possible and filling in nulls where there is no match.
Example: Joining Sales Data with Region Managers
# Sales data
sales_data = [
    ("North", "Product A", 100),
    ("South", "Product B", 50),
    ("East", "Product C", 300)
]
sales_columns = ["region", "product", "sales"]
# Manager data
manager_data = [
    ("North", "Alice"),
    ("South", "Bob"),
    ("West", "Charlie")
]
manager_columns = ["region", "manager"]
sales_df = spark.createDataFrame(sales_data, sales_columns)
manager_df = spark.createDataFrame(manager_data, manager_columns)
# Perform inner join
joined_df = sales_df.join(manager_df, on="region", how="inner")
joined_df.show()
+------+---------+-----+-------+
|region|  product|sales|manager|
+------+---------+-----+-------+
| North|Product A|  100|  Alice|
| South|Product B|   50|    Bob|
+------+---------+-----+-------+
Question:
Why is "East" region not included in the result?
Answer:
Because it's an inner join, only regions that exist in both DataFrames are included. The "East" region doesn’t have a corresponding manager entry.
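Like the aggregation earlier, this join can also be expressed as a SQL query over temporary views. This is a small sketch; the view names region_sales and region_managers are illustrative choices:
# Register both DataFrames as temporary views so they can be joined in SQL
sales_df.createOrReplaceTempView("region_sales")
manager_df.createOrReplaceTempView("region_managers")
sql_joined_df = spark.sql("""
    SELECT s.region, s.product, s.sales, m.manager
    FROM region_sales s
    INNER JOIN region_managers m ON s.region = m.region
""")
sql_joined_df.show()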
Example: Left Join
left_join_df = sales_df.join(manager_df, on="region", how="left")
left_join_df.show()
+------+---------+-----+-------+
|region|  product|sales|manager|
+------+---------+-----+-------+
| North|Product A|  100|  Alice|
| South|Product B|   50|    Bob|
|  East|Product C|  300|   null|
+------+---------+-----+-------+
Notice how the "East" row is still present, but the manager column is null. That’s the behavior of a left join — all rows from the left DataFrame are kept.
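The other join types from the list above work the same way. As a sketch reusing the same two DataFrames, a right join keeps every manager row, and a full outer join keeps rows from both sides, filling in nulls wherever there is no match:
# Right join: all rows from manager_df, with matching sales where available
right_join_df = sales_df.join(manager_df, on="region", how="right")
right_join_df.show()
# Full outer join: every region that appears in either DataFrame
full_join_df = sales_df.join(manager_df, on="region", how="outer")
full_join_df.show()
In the full outer result, "East" has a null manager and "West" has null product and sales, because each of those regions appears in only one of the two DataFrames.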
Summary
- Use groupBy with aggregation functions to summarize data in Spark SQL.
- Joins combine datasets using a shared key, and Spark supports inner, left, right, and full outer joins.
- Aggregations and joins are essential for real-world data analysis tasks, such as combining customer orders with customer details or summarizing sales performance by region; the short sketch below combines both operations.
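As a closing sketch that reuses sales_df, manager_df, and the sum function imported earlier, the two operations are often combined: join the datasets first, then summarize the result.
# Join sales with managers, then total the sales handled by each manager
sales_by_manager = (
    sales_df.join(manager_df, on="region", how="inner")
            .groupBy("manager")
            .agg(sum("sales").alias("total_sales"))
)
sales_by_manager.show()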