Apache Spark Course

Module 12: Project – Real-World Data Pipeline

What is Spark Streaming?

Spark Streaming is a powerful feature of Apache Spark that allows you to process real-time data streams. Unlike traditional batch processing (which works with data already stored), Spark Streaming processes data in near real-time as it arrives — such as logs from web servers, messages from Kafka, or clicks from websites.

Why Do We Need Streaming?

In many industries, waiting hours or even minutes for data processing is not acceptable. Imagine fraud detection, live stock market monitoring, or traffic control systems — they all need instant data processing to react quickly.

How Spark Streaming Works

Under the hood, Spark Streaming breaks the live data stream into small chunks (called micro-batches) and processes them using the Spark engine. Each batch is treated like a small DataFrame or RDD, which Spark knows how to process efficiently.

Question:

Is Spark Streaming truly real-time?

Answer:

Spark Streaming works in micro-batches, so it’s near real-time. Tools like Apache Flink or Kafka Streams offer true event-by-event processing, but Spark Streaming gives a great balance of real-time performance and batch-style reliability.
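You can also choose how often those micro-batches run by setting a trigger interval on the streaming query. Below is a minimal sketch, assuming a socket source like the one used later in this lesson; the 10-second interval is just an example value, not a recommendation.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("TriggerSketch") \
    .master("local[2]") \
    .getOrCreate()

# Socket source, same idea as the full example later in this lesson
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# trigger(processingTime=...) controls how often a new micro-batch starts.
# The 10-second interval here is an arbitrary example value.
query = lines.writeStream \
    .format("console") \
    .trigger(processingTime="10 seconds") \
    .start()

query.awaitTermination()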

Real-Life Example 1: Monitoring Website Traffic

Suppose you run an e-commerce website and want to monitor how many users visit your site every minute. You can set up a stream to read log data continuously from a file or socket, then count hits per minute using Spark Streaming.

Python Code Example (Socket Streaming)

Let’s simulate a simple example using PySpark and a TCP socket stream.


from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# Create Spark session with streaming support
spark = SparkSession.builder \
    .appName("StreamingExample") \
    .master("local[2]") \
    .getOrCreate()

# Set log level to WARN to avoid extra logs
spark.sparkContext.setLogLevel("WARN")

# Read streaming data from socket (nc -lk 9999)
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split each line into words
words = lines.select(explode(split(lines.value, " ")).alias("word"))

# Count word frequency
word_counts = words.groupBy("word").count()

# Output the results to console
query = word_counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
    

To test this code:

  1. Run the Python script.
  2. In a separate terminal, use nc -lk 9999 to simulate streaming input.
  3. Type some words and see Spark count them in real time.

Question:

Why do we use explode and split in the above code?

Answer:

Each line from the socket is a string. split turns the line into a list of words, and explode converts each word into its own row so we can count them individually.
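The word count above is the classic demo. To actually count hits per minute, as described in Real-Life Example 1, you can group by a time window instead of by word. Here is a minimal sketch, assuming each socket line represents one page hit; it uses the socket source's includeTimestamp option to attach an arrival timestamp to every line.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder \
    .appName("HitsPerMinute") \
    .master("local[2]") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# includeTimestamp adds a "timestamp" column with each line's arrival time
hits = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", True) \
    .load()

# Count how many lines (hits) arrive in each 1-minute window
hits_per_minute = hits.groupBy(window(hits.timestamp, "1 minute")).count()

query = hits_per_minute.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", False) \
    .start()

query.awaitTermination()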

Real-Life Example 2: Processing Real-Time Orders from Kafka

In a production environment, data often comes through tools like Apache Kafka. For instance, every order placed on an app is sent to Kafka as an event. Spark Streaming can read from Kafka topics, process the order stream, and generate live reports like total sales per product.

This is especially useful for fraud detection, inventory tracking, or sending live notifications.
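Here is a minimal sketch of reading such an order stream with Structured Streaming. The broker address, topic name, and order schema below are illustrative assumptions, and running it also requires the spark-sql-kafka connector package on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder \
    .appName("KafkaOrders") \
    .master("local[2]") \
    .getOrCreate()

# Hypothetical order schema: adjust to match your actual events
order_schema = StructType([
    StructField("product", StringType()),
    StructField("amount", DoubleType()),
])

# Read the "orders" topic from a local Kafka broker (both names are assumptions)
orders_raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "orders") \
    .load()

# Kafka values arrive as bytes; cast to string and parse the JSON payload
orders = orders_raw.selectExpr("CAST(value AS STRING) AS json") \
    .select(from_json(col("json"), order_schema).alias("order")) \
    .select("order.*")

# Live total sales per product
sales_per_product = orders.groupBy("product").sum("amount")

query = sales_per_product.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()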

Question:

Can we store the output of a stream to a file or database?

Answer:

Yes. Spark Streaming supports output to console, files, Kafka topics, or databases like Cassandra and PostgreSQL. You just need to change the writeStream configuration.
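For example, persisting the incoming words from the socket example as Parquet files is mostly a change to the writeStream part. A minimal sketch, with placeholder paths; note that the file sink runs in append mode and requires a checkpoint location.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder \
    .appName("FileSinkExample") \
    .master("local[2]") \
    .getOrCreate()

lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

words = lines.select(explode(split(lines.value, " ")).alias("word"))

# The file sink needs a checkpoint location; both paths are placeholders
query = words.writeStream \
    .format("parquet") \
    .option("path", "/tmp/streaming_words") \
    .option("checkpointLocation", "/tmp/streaming_checkpoint") \
    .outputMode("append") \
    .start()

query.awaitTermination()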

Structured Streaming vs DStreams

Spark has two streaming APIs: the original DStream API, which works directly on RDDs of each micro-batch, and Structured Streaming, which builds on DataFrames and is what the examples in this lesson use. We now mostly use Structured Streaming for its simplicity, scalability, and better integration with SQL and machine learning APIs.
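For comparison, the same socket word count written with the legacy DStream API looks roughly like the sketch below. It is shown only for contrast; the RDD-style operations (flatMap, map, reduceByKey) replace the DataFrame operations used earlier.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, 5)  # 5-second batch interval

# DStreams expose each micro-batch as an RDD
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

counts.pprint()

ssc.start()
ssc.awaitTermination()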

Summary

Spark Streaming enables you to process live data streams using the same APIs as batch processing. You can count words, detect trends, flag suspicious activity — all while data is still flowing in. Whether you're monitoring website traffic or analyzing sensor data in real time, Spark Streaming gives you powerful tools with simple Python code.


