
Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Common Aggregations and Joins in Spark SQL


In data analysis, aggregations and joins are two of the most commonly used operations. Spark SQL makes these tasks simple and efficient, even when working with very large datasets. In this lesson, you'll learn how to perform common aggregation functions and join operations using PySpark.

What are Aggregations?

Aggregations allow us to summarize data. For example, we can calculate totals, averages, minimums, and maximums across a dataset. In Spark SQL, we do this with the groupBy and agg methods together with built-in functions like count, avg, sum, min, and max.

Example: Total Sales by Region

from pyspark.sql import SparkSession
# Note: importing sum, avg, and count this way shadows Python's built-ins of the same name
from pyspark.sql.functions import sum, avg, count

# Start a Spark session
spark = SparkSession.builder.appName("AggregationsExample").getOrCreate()

# Sample data
data = [
    ("North", "Product A", 100),
    ("North", "Product B", 200),
    ("South", "Product A", 150),
    ("South", "Product B", 50),
    ("East", "Product C", 300)
]
columns = ["region", "product", "sales"]

df = spark.createDataFrame(data, columns)

# Group by region and calculate total sales
agg_df = df.groupBy("region").agg(sum("sales").alias("total_sales"))

agg_df.show()
+------+-----------+
|region|total_sales|
+------+-----------+
| North|        300|
| South|        200|
|  East|        300|
+------+-----------+

Question:

Why do we use alias("total_sales") in the aggregation?

Answer:

By default, the aggregated column would have a name like sum(sales). The alias() function renames it to a more readable column name like total_sales.

Additional Aggregations

We can calculate more than one aggregation at a time.

agg_df = df.groupBy("region").agg(
    sum("sales").alias("total_sales"),
    avg("sales").alias("average_sales"),
    count("product").alias("product_count")
)

agg_df.show()
+------+-----------+-------------+-------------+
|region|total_sales|average_sales|product_count|
+------+-----------+-------------+-------------+
| North|        300|        150.0|            2|
| South|        200|        100.0|            2|
|  East|        300|        300.0|            1|
+------+-----------+-------------+-------------+

What are Joins?

Joins allow you to combine two DataFrames based on a common column (similar to SQL joins). This is useful when your data is split across multiple sources or tables.

Types of Joins in Spark SQL

  • Inner Join: Returns rows that have matching values in both DataFrames.
  • Left Join (Left Outer): Returns all rows from the left DataFrame and matched rows from the right.
  • Right Join (Right Outer): Returns all rows from the right DataFrame and matched rows from the left.
  • Full Outer Join: Returns all rows from both DataFrames, filling in null wherever a row has no match on the other side.

Example: Joining Sales Data with Region Managers

# Sales data
sales_data = [
    ("North", "Product A", 100),
    ("South", "Product B", 50),
    ("East", "Product C", 300)
]
sales_columns = ["region", "product", "sales"]

# Manager data
manager_data = [
    ("North", "Alice"),
    ("South", "Bob"),
    ("West", "Charlie")
]
manager_columns = ["region", "manager"]

sales_df = spark.createDataFrame(sales_data, sales_columns)
manager_df = spark.createDataFrame(manager_data, manager_columns)

# Perform inner join
joined_df = sales_df.join(manager_df, on="region", how="inner")

joined_df.show()
+------+---------+-----+-------+
|region|  product|sales|manager|
+------+---------+-----+-------+
| North|Product A|  100|  Alice|
| South|Product B|   50|    Bob|
+------+---------+-----+-------+

Question:

Why is the "East" region not included in the result?

Answer:

Because it's an inner join, only regions that exist in both DataFrames are included. The "East" region doesn’t have a corresponding manager entry.

Example: Left Join

left_join_df = sales_df.join(manager_df, on="region", how="left")

left_join_df.show()
+------+---------+-----+-------+
|region|  product|sales|manager|
+------+---------+-----+-------+
| North|Product A|  100|  Alice|
| South|Product B|   50|    Bob|
|  East|Product C|  300|   null|
+------+---------+-----+-------+

Notice how the "East" row is still present, but the manager column is null. That’s the behavior of a left join — all rows from the left DataFrame are kept.

Summary

  • Use groupBy with aggregation functions to summarize data in Spark SQL.
  • Joins combine datasets using a shared key, and Spark supports inner, left, right, and full outer joins.
  • Aggregations and joins are essential for real-world data analysis tasks, such as combining customer orders with customer details or summarizing sales performance by region.
