Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Window Operations and Aggregations in Spark Streaming



Understanding Window Operations and Aggregations in Spark Streaming

In real-time data processing, events do not always arrive neatly one after another: data can arrive late, out of order, or in a continuous flood. That’s where window operations in Spark Streaming become useful. They let us apply aggregation functions (such as count, sum, and average) over chunks of time called "windows".

What Are Window Operations?

Window operations let you divide your data stream into manageable time intervals (windows) and apply computations to each window independently. This is crucial when working with live data such as server logs, financial transactions, or sensor readings.

Key Concepts

- Window duration: the length of time each window covers (for example, 10 seconds).
- Slide interval: how often a new window starts. If it is shorter than the window duration, consecutive windows overlap.
- Event time: the timestamp attached to each record, which Spark uses to decide which window(s) the record belongs to.
- Watermark: how long Spark waits for late data before finalizing a window (discussed later in this section).

Types of Windows

- Tumbling windows: fixed-size, non-overlapping intervals; the slide interval equals the window duration.
- Sliding windows: fixed-size intervals that overlap because the slide interval is shorter than the window duration.
- Session windows: variable-size windows that stay open as long as events keep arriving within a gap timeout (available via session_window in recent Spark versions).

The sketch below shows the difference between tumbling and sliding windows in code.
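As a rough illustration (not a complete job), the only difference in PySpark is whether you pass a slide interval to the window function. The events DataFrame here is assumed to be a streaming DataFrame with a timestamp column:

from pyspark.sql.functions import window, col

# Tumbling: 10-second windows with no overlap (slide defaults to the duration)
tumbling = events.groupBy(window(col("timestamp"), "10 seconds")).count()

# Sliding: 10-second windows starting every 5 seconds, so each event
# can fall into two windows
sliding = events.groupBy(window(col("timestamp"), "10 seconds", "5 seconds")).count()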

Real-World Example: Counting Page Views Every 10 Seconds

Suppose you run a news website and want to count how many pages are viewed in real-time, grouped by the user’s country. You want to count the views in every 10-second window, updated every 5 seconds.

Question:

Why would you want overlapping windows?

Answer:

To get more frequent updates. For example, in fraud detection or server traffic monitoring, waiting for a full window (say, 10 minutes) might delay alerts. Overlapping (sliding) windows give you early signals; the sketch below makes the overlap concrete.
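Here is a small plain-Python sketch (not Spark code) that lists which windows a single event falls into. Like Spark's default behavior, it assumes window start times are aligned to multiples of the slide interval:

from datetime import datetime

def windows_for_event(event_time, window_sec=10, slide_sec=5):
    """List the [start, end) windows containing event_time for a
    window of window_sec seconds sliding every slide_sec seconds."""
    t = int(event_time.timestamp())
    # Window starts are the multiples of slide_sec in (t - window_sec, t]
    first = ((t - window_sec) // slide_sec + 1) * slide_sec
    return [
        (datetime.fromtimestamp(s), datetime.fromtimestamp(s + window_sec))
        for s in range(first, t + 1, slide_sec)
    ]

# An event at 11:00:07 lands in two overlapping windows:
# [11:00:00, 11:00:10) and [11:00:05, 11:00:15)
for start, end in windows_for_event(datetime(2025, 5, 2, 11, 0, 7)):
    print(start.time(), "->", end.time())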

Sample PySpark Code: Windowed Aggregation

Let’s write a working Spark Structured Streaming program that reads data from a socket and performs windowed counts.


from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("WindowedAggregationExample") \
    .getOrCreate()

# Read data from a socket; includeTimestamp attaches an arrival-time
# "timestamp" column to every line
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", "true") \
    .load()

# Simulate streaming page view data (format: "country")
views = lines.selectExpr("value AS country", "timestamp")

# Apply window operation: count by country in 10-second windows, sliding every 5 seconds
windowedCounts = views.groupBy(
    window(col("timestamp"), "10 seconds", "5 seconds"),
    col("country")
).count()

# Write output to console
query = windowedCounts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", False) \
    .start()

query.awaitTermination()
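To try this locally, start a simple socket server in another terminal with netcat (nc -lk 9999) and type country names such as India or USA, one per line; each line becomes one page-view event in the stream.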
    

Expected Output (Example)

+------------------------------------------+--------+-----+
|window                                    |country |count|
+------------------------------------------+--------+-----+
|{2025-05-02 11:00:00, 2025-05-02 11:00:10}|India   |  12 |
|{2025-05-02 11:00:00, 2025-05-02 11:00:10}|USA     |   8 |
+------------------------------------------+--------+-----+
    

Explanation of the Code

- readStream with the socket source turns each line received on localhost:9999 into one input row; the includeTimestamp option adds a timestamp column recording when each line arrived.
- window(col("timestamp"), "10 seconds", "5 seconds") assigns every event to each 10-second window that contains its timestamp, with a new window starting every 5 seconds.
- Grouping by the window and the country, then calling count(), produces a running count per (window, country) pair.
- outputMode("complete") rewrites the full result table to the console on every trigger, which is why the counts keep updating as data arrives.

Question:

What if my data has a delay or comes out-of-order?

Answer:

Spark supports watermarking, which tells Spark how long to keep waiting for late data. Calling withWatermark("timestamp", "5 minutes") on the streaming DataFrame tolerates events that arrive up to five minutes late, as in the sketch below.
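A minimal sketch, reusing the views DataFrame from the example above. Note that update mode, rather than complete, is the usual companion to a watermark, since it lets Spark drop state for windows that are already past the watermark:

# Wait up to 5 minutes for late events before finalizing a window
lateTolerantCounts = views \
    .withWatermark("timestamp", "5 minutes") \
    .groupBy(
        window(col("timestamp"), "10 seconds", "5 seconds"),
        col("country")
    ).count()

query = lateTolerantCounts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()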

Use Cases of Windowed Aggregations

- Website traffic monitoring, such as the per-country page-view counts above.
- Fraud detection, where sliding windows surface suspicious bursts of activity early.
- Server log analysis, for example error rates per minute.
- Sensor and IoT analytics, such as average readings per window (see the sketch below).
- Monitoring financial transactions over rolling intervals.
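For instance, a sketch of a different aggregation: averaging readings from a hypothetical sensor stream over non-overlapping one-minute windows. The readings DataFrame and its sensor_id, timestamp, and temperature columns are assumptions for illustration, not part of the example above:

from pyspark.sql.functions import avg, window, col

# Average temperature per sensor in tumbling 1-minute windows
avgTemps = readings.groupBy(
    window(col("timestamp"), "1 minute"),   # tumbling: no slide interval
    col("sensor_id")
).agg(avg("temperature").alias("avg_temperature"))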

Summary

Window operations allow you to segment streaming data into fixed or sliding time frames. This makes it possible to apply aggregations like counts, sums, and averages in real-time. They are especially useful for time-based analysis in Spark Streaming pipelines.


