Micro-batching and Continuous Data Processing in Spark Streaming
When we think about real-time data processing, we often imagine data being handled instantly as it arrives. Spark Streaming instead handles real-time data with a method called micro-batching: it divides the continuous stream into small chunks (micro-batches) and processes each chunk at a short, regular interval. This gives developers a balance between traditional batch processing and fully real-time streaming.
What is Micro-batching?
Micro-batching is a technique where incoming data is collected for a short interval (e.g., every 1 or 2 seconds), and then processed as a small batch. Spark processes this mini-batch using the same distributed processing engine it uses for regular batch jobs.
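As a minimal sketch of this idea, the snippet below uses Spark's built-in rate source (which generates synthetic rows) so the batch rhythm is easy to observe; the 2-second interval and app name are just illustrative choices:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TriggerIntervalSketch").getOrCreate()

# The built-in "rate" source generates rows continuously, making the
# micro-batch rhythm easy to see on the console
df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Each trigger collects roughly 2 seconds of new data and processes it
# as one small batch
query = (df.writeStream
    .format("console")
    .trigger(processingTime="2 seconds")
    .start())

query.awaitTermination()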
How is Micro-batching Different from Traditional Batching?
In traditional batching, we collect data for hours or days before processing it. In contrast, micro-batching happens in small, rapid intervals, allowing data to be analyzed almost in real time.
Question:
Is micro-batching the same as real-time processing?
Answer:
Not exactly. Micro-batching is near real-time. There is always a slight delay (a few seconds) because the system needs to wait to collect enough data for the mini-batch. Fully real-time systems (like Apache Flink) handle each record as it arrives without waiting.
What is Continuous Data Processing?
Continuous data processing refers to systems that ingest and analyze data as soon as it arrives, without any buffering or batching. It’s useful in use-cases where even milliseconds of delay can be critical — such as fraud detection or stock market trading.
Structured Streaming in Spark introduced experimental support for continuous processing mode starting from Spark 2.3. This mode provides lower latency by pushing records through the processing engine as fast as possible — but it comes with limitations in terms of supported operations.
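A rough sketch of enabling this mode follows (the rate source and the 1-second value are illustrative). Note that the continuous trigger value sets the checkpoint interval, not a batch interval, and only map-like operations such as select and filter are supported:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ContinuousModeSketch").getOrCreate()

# Continuous mode supports only map-like operations (select, filter, ...)
df = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
    .filter("value % 2 == 0"))

# trigger(continuous=...) enables continuous processing; the interval is
# how often progress is checkpointed, not how often data is batched
query = (df.writeStream
    .format("console")
    .trigger(continuous="1 second")
    .start())

query.awaitTermination()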
Real-Life Example: Analyzing Tweets in Real-Time
Imagine we want to analyze Twitter hashtags as they come in, and count how often each hashtag appears every few seconds.
Using Spark Structured Streaming, we can process these tweets using micro-batches and show live hashtag trends.
Question:
Why not wait until the end of the day and analyze all tweets at once?
Answer:
Because trends change fast. If we wait too long, we might miss what's trending right now. Real-time analysis gives us immediate insights that batch processing cannot.
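Here is a minimal sketch of that hashtag counter, assuming tweets arrive as plain text lines over a local socket (the host and port are placeholders standing in for a real tweet feed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("HashtagTrends").getOrCreate()

# Hypothetical source: each line on the socket is the text of one tweet
tweets = (spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

# Split each tweet into words and keep only the hashtags
hashtags = (tweets
    .select(explode(split(col("value"), " ")).alias("word"))
    .filter(col("word").startswith("#")))

# Running count per hashtag, refreshed with every micro-batch
trends = hashtags.groupBy("word").count()

query = (trends.writeStream
    .outputMode("complete")
    .format("console")
    .trigger(processingTime="5 seconds")
    .start())

query.awaitTermination()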
Example in PySpark: Simulating Micro-batching with a Stream
Let's simulate a streaming job that reads data from a directory (e.g., CSV files dropped every few seconds) and maintains a running row count that is updated after every micro-batch.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("MicroBatchingExample").getOrCreate()

# Streaming file sources need an explicit schema; adjust the column
# name/type to match your CSV files (any schema works for counting)
schema = StructType([StructField("value", StringType(), True)])

# Read a stream from a folder (CSV files are added every few seconds)
stream_df = (spark.readStream
    .schema(schema)
    .option("header", "true")
    .csv("stream_input/"))

# Running count of all rows seen so far, updated after each micro-batch
row_count = stream_df.groupBy().count()

# Print the count to the console, triggering a micro-batch every 5 seconds
query = (row_count.writeStream
    .outputMode("complete")
    .format("console")
    .trigger(processingTime="5 seconds")
    .start())

query.awaitTermination()
Sample console output:

+--------+
|   count|
+--------+
|     120|
+--------+
In this example:
- The data is read every 5 seconds (the micro-batch interval set by the trigger).
- Files dropped into the stream_input/ directory are picked up automatically.
- The running count is printed to the console after every micro-batch.
Question:
What if no new files are added during a batch interval?
Answer:
Spark checks the source directory at every trigger interval. If no new files have arrived, it typically skips running a batch for that interval rather than producing an empty result; the query keeps running and processes the next files as soon as they appear.
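To see this per-batch behavior directly, here is a small sketch using foreachBatch (available since Spark 2.4), which hands each micro-batch to a function as a regular DataFrame along with its batch ID; the schema is a placeholder, as before:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("PerBatchCount").getOrCreate()

# Placeholder schema; adjust to match your CSV files
schema = StructType([StructField("value", StringType(), True)])
stream_df = (spark.readStream
    .schema(schema)
    .option("header", "true")
    .csv("stream_input/"))

# Called once per micro-batch; batch_df is a plain (non-streaming) DataFrame
def handle_batch(batch_df, batch_id):
    print(f"Batch {batch_id}: {batch_df.count()} rows")

query = (stream_df.writeStream
    .foreachBatch(handle_batch)
    .trigger(processingTime="5 seconds")
    .start())

query.awaitTermination()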
When to Use Micro-batching vs Continuous Processing
| Scenario | Best Option |
|---|---|
| Real-time sensor monitoring (1–2 seconds of latency is acceptable) | Micro-batching |
| High-frequency trading (needs millisecond latency) | Continuous processing |
| Daily sales reports | Traditional batch |
Summary
Spark Streaming processes real-time data using a powerful micro-batching approach. This gives us the benefit of near real-time insights while still using the familiar Spark batch APIs. For ultra-low latency use cases, Spark’s continuous processing mode (though limited) can also be considered.