In real-time data processing, events do not always arrive neatly one after another. Data might arrive late or continuously. That’s where window operations in Spark Streaming become useful. They allow us to apply aggregation functions (like count, sum, average) over chunks of time called "windows".
Window operations let you divide your data stream into manageable time intervals (windows) and apply computations to each window independently. This is important when working with live data such as server logs, financial transactions, or sensor readings.
Suppose you run a news website and want to count how many pages are viewed in real-time, grouped by the user’s country. You want to count the views in every 10-second window, updated every 5 seconds.
Why would you want overlapping windows?
To get more frequent updates. For example, in fraud detection or monitoring server traffic, waiting for a full window (say 10 mins) might delay alerts. Overlapping (sliding) windows give you early signals.
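As a quick sketch of the difference (assuming a streaming DataFrame named events with an event-time column event_time; both names are illustrative), the only change between a non-overlapping "tumbling" window and an overlapping sliding window is the extra slide-duration argument to window():

from pyspark.sql.functions import window, col

# Tumbling windows: fixed 10-minute buckets with no overlap;
# each event belongs to exactly one window
tumbling = events.groupBy(window(col("event_time"), "10 minutes")).count()

# Sliding windows: 10-minute buckets recomputed every 5 minutes;
# each event falls into two overlapping windows
sliding = events.groupBy(window(col("event_time"), "10 minutes", "5 minutes")).count()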
Let’s write a working Spark Structured Streaming example that reads data from a socket and performs windowed counts.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("WindowedAggregationExample") \
    .getOrCreate()

# Read from a socket; includeTimestamp attaches an arrival time to each line
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", True) \
    .load()

# Simulate streaming page view data (each line is a country name)
pageViews = lines.selectExpr("value AS country", "timestamp")

# Apply window operation: count by country, 10-second window sliding every 5 seconds
windowedCounts = pageViews.groupBy(
    window(col("timestamp"), "10 seconds", "5 seconds"),
    col("country")
).count()

# Write output to console
query = windowedCounts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", False) \
    .start()
query.awaitTermination()
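To try this locally, you can feed the socket by running nc -lk 9999 in a terminal and typing country names, one per line. Each micro-batch then prints a table like the one below to the console.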
+------------------------------------------+-------+-----+
|window                                    |country|count|
+------------------------------------------+-------+-----+
|{2025-05-02 11:00:00, 2025-05-02 11:00:10}|India  |12   |
|{2025-05-02 11:00:00, 2025-05-02 11:00:10}|USA    |8    |
+------------------------------------------+-------+-----+
readStream: Reads live text data from the specified host and port.
window: Groups records into time windows based on their event time (here, the arrival timestamp attached by the socket source's includeTimestamp option).
groupBy(...).count(): Groups the data by window and country, then counts the records in each group.
outputMode("complete"): Outputs the full counts for every window each time the result is updated.

What if my data arrives late or out of order?
Spark supports watermarking, which tells Spark to wait for late data up to a certain time. You can use withWatermark("timestamp", "5 minutes") to handle delays.
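As a minimal sketch (reusing pageViews and its timestamp column from the example above; the 5-minute tolerance is purely illustrative), the watermark is declared before the windowed aggregation:

# Accept records that arrive up to 5 minutes late; records older than
# the watermark can be dropped and their windows finalized
lateTolerantCounts = pageViews \
    .withWatermark("timestamp", "5 minutes") \
    .groupBy(
        window(col("timestamp"), "10 seconds", "5 seconds"),
        col("country")
    ).count()

Note that watermark-based state cleanup only applies in the append and update output modes; with outputMode("complete"), Spark must retain every window, so the watermark does not purge old state.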
Window operations allow you to segment streaming data into fixed or sliding time frames. This makes it possible to apply aggregations like counts, sums, and averages in real-time. They are especially useful for time-based analysis in Spark Streaming pipelines.