Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Common Aggregations and Joins in Spark SQL

In data analysis, aggregations and joins are two of the most commonly used operations. Spark SQL makes these tasks simple and efficient, even when working with very large datasets. In this lesson, you'll learn how to apply common aggregation functions and perform join operations using PySpark.

What are Aggregations?

Aggregations allow us to summarize data: for example, we can calculate totals, averages, minimums, and maximums across a dataset. In Spark SQL, we group rows with groupBy and then apply built-in aggregate functions such as count, avg, sum, min, and max inside agg.

Example: Total Sales by Region


from pyspark.sql import SparkSession
# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum() in this script
from pyspark.sql.functions import sum, avg, count

# Start a Spark session
spark = SparkSession.builder.appName("AggregationsExample").getOrCreate()

# Sample data
data = [
    ("North", "Product A", 100),
    ("North", "Product B", 200),
    ("South", "Product A", 150),
    ("South", "Product B", 50),
    ("East", "Product C", 300)
]
columns = ["region", "product", "sales"]

df = spark.createDataFrame(data, columns)

# Group by region and calculate total sales
agg_df = df.groupBy("region").agg(sum("sales").alias("total_sales"))

agg_df.show()
    
+------+-----------+
|region|total_sales|
+------+-----------+
| North|        300|
| South|        200|
|  East|        300|
+------+-----------+
    

Question:

Why do we use alias("total_sales") in the aggregation?

Answer:

By default, the aggregated column would have a name like sum(sales). The alias() function renames it to a more readable column name like total_sales.
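Since min and max were listed above but not demonstrated, here is a minimal sketch that reuses the same df to compute the smallest and largest single sale per region (the output column names lowest_sale and highest_sale are just illustrative choices):


from pyspark.sql.functions import min, max

# Smallest and largest single sale per region
range_df = df.groupBy("region").agg(
    min("sales").alias("lowest_sale"),
    max("sales").alias("highest_sale")
)

range_df.show()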

Additional Aggregations

We can calculate more than one aggregation at a time.


agg_df = df.groupBy("region").agg(
    sum("sales").alias("total_sales"),
    avg("sales").alias("average_sales"),
    count("product").alias("product_count")
)

agg_df.show()
    
+------+-----------+-------------+-------------+
|region|total_sales|average_sales|product_count|
+------+-----------+-------------+-------------+
| North|        300|        150.0|            2|
| South|        200|        100.0|            2|
|  East|        300|        300.0|            1|
+------+-----------+-------------+-------------+
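Because this module is about Spark SQL, it is worth seeing the same aggregation written as a literal SQL query. This is a minimal sketch: the temporary view name sales is an arbitrary choice, and the result is the same as agg_df above.


# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("sales")

sql_agg_df = spark.sql("""
    SELECT region,
           SUM(sales) AS total_sales,
           AVG(sales) AS average_sales,
           COUNT(product) AS product_count
    FROM sales
    GROUP BY region
""")

sql_agg_df.show()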
    

What are Joins?

Joins allow you to combine two DataFrames based on a common column (similar to SQL joins). This is useful when your data is split across multiple sources or tables.

Types of Joins in Spark SQL

Spark SQL supports the usual SQL join types through the how argument of DataFrame.join:

inner – keeps only rows whose join key appears in both DataFrames (the default)
left (left_outer) – keeps all rows from the left DataFrame, filling nulls where there is no match
right (right_outer) – keeps all rows from the right DataFrame
full (full_outer) – keeps all rows from both sides
left_semi – keeps left rows that have a match, without adding columns from the right
left_anti – keeps left rows that have no match
cross – returns the Cartesian product of the two DataFrames

Example: Joining Sales Data with Region Managers


# Sales data
sales_data = [
    ("North", "Product A", 100),
    ("South", "Product B", 50),
    ("East", "Product C", 300)
]
sales_columns = ["region", "product", "sales"]

# Manager data
manager_data = [
    ("North", "Alice"),
    ("South", "Bob"),
    ("West", "Charlie")
]
manager_columns = ["region", "manager"]

sales_df = spark.createDataFrame(sales_data, sales_columns)
manager_df = spark.createDataFrame(manager_data, manager_columns)

# Perform inner join
joined_df = sales_df.join(manager_df, on="region", how="inner")

joined_df.show()
    
+------+---------+-----+-------+
|region|  product|sales|manager|
+------+---------+-----+-------+
| North|Product A|  100|  Alice|
| South|Product B|   50|    Bob|
+------+---------+-----+-------+

Question:

Why is "East" region not included in the result?

Answer:

Because this is an inner join, only regions that appear in both DataFrames are included. The "East" region has no matching manager entry (and likewise "West" has a manager but no sales rows, so it is dropped too).
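A left_anti join (listed in the types above) is a quick way to surface exactly those unmatched rows. This short sketch, an addition to the lesson's own examples, keeps the sales rows whose region has no manager:


# Rows from sales_df whose region has no entry in manager_df
unmanaged_df = sales_df.join(manager_df, on="region", how="left_anti")

# Expected to contain only the East row
unmanaged_df.show()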

Example: Left Join


left_join_df = sales_df.join(manager_df, on="region", how="left")

left_join_df.show()
    
+------+---------+-----+-------+
|region|  product|sales|manager|
+------+---------+-----+-------+
| North|Product A|  100|  Alice|
| South|Product B|   50|    Bob|
|  East|Product C|  300|   null|
+------+---------+-----+-------+

Notice how the "East" row is still present, but the manager column is null. That is the behavior of a left join: all rows from the left DataFrame are kept, whether or not they have a match on the right.
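To keep the unmatched rows from both sides at once, you can use a full outer join. This sketch is an addition to the lesson: it keeps both the "East" row (sales but no manager) and the "West" row (manager but no sales), with nulls filling the gaps.


# Full outer join keeps every region from both DataFrames
full_join_df = sales_df.join(manager_df, on="region", how="full")

full_join_df.show()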

Summary

groupBy combined with agg lets you compute aggregations such as sum, avg, count, min, and max, and alias() gives the result columns readable names.
join combines two DataFrames on a common column; the how argument selects the join type, with inner joins keeping only matched rows and left joins keeping every row from the left DataFrame.
