Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Spark Ecosystem Components: Core, SQL, Streaming, MLlib

Apache Spark is not just a single tool — it is an entire ecosystem of components that together enable powerful distributed data processing. Each component serves a different purpose but is built on top of the same Spark Core engine.

1. Spark Core

Spark Core is the foundational engine of Apache Spark. It provides basic functionalities such as task scheduling, memory management, fault recovery, and interaction with storage systems.

Spark Core works with RDDs (Resilient Distributed Datasets), which are immutable collections of objects distributed across a cluster.

Example: Create a basic RDD


from pyspark import SparkContext

sc = SparkContext("local", "CoreExample")
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

print("RDD Elements:", rdd.collect())
    
RDD Elements: [1, 2, 3, 4, 5]
    

This code initializes a SparkContext, creates an RDD from a Python list, and prints the elements. In real-world scenarios, this RDD could be created from huge files or streaming data.

Question:

Why use RDDs instead of Python lists?

Answer:

RDDs can be distributed across multiple machines and processed in parallel, making them ideal for large-scale data operations.
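
To make this concrete, here is a minimal sketch reusing the sc created above; the log file path is a placeholder, and in practice it could point at HDFS or S3:


# Build an RDD from a text file split into partitions (path is hypothetical)
lines = sc.textFile("/path/to/large_logs.txt", minPartitions=4)

# Each transformation runs on every partition in parallel
error_count = lines.map(lambda line: line.lower()) \
                   .filter(lambda line: "error" in line) \
                   .count()

print("Number of error lines:", error_count)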

2. Spark SQL

Spark SQL allows you to run SQL queries on structured data. It supports reading data from various sources like CSV, JSON, Hive, and Parquet, and offers DataFrames — a tabular form similar to a database table or Excel sheet.

Example: Load and query data using Spark SQL


import urllib.request
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLExample").getOrCreate()

# Spark cannot read over HTTP directly, so download the CSV to a local file first
urllib.request.urlretrieve("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv", "airtravel.csv")
df = spark.read.csv("airtravel.csv", header=True, inferSchema=True)

df.createOrReplaceTempView("flights")
# Backticks quote the column name, since "1958" would otherwise be parsed as a number
result = spark.sql("SELECT * FROM flights WHERE `1958` > 340")
result.show()
    
+-----+-----+-----+-----+
|Month| 1958| 1959| 1960|
+-----+-----+-----+-----+
|  MAR|  362|  406|  419|
|  APR|  348|  396|  461|
|  MAY|  363|  420|  472|
...
+-----+-----+-----+-----+
    

Here, we read a CSV file into a DataFrame and execute an SQL query to filter rows. This is powerful for analysts familiar with SQL who want to process large datasets efficiently.
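
The same filter can also be written with the DataFrame API instead of a SQL string; a minimal sketch using the df loaded above:


from pyspark.sql.functions import col

# Equivalent to the SQL query above, expressed with DataFrame methods
df.filter(col("1958") > 340).show()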

Question:

Why use Spark SQL instead of a traditional database?

Answer:

Spark SQL can handle massive datasets that might not fit in a single database machine, and can process data stored across distributed file systems like HDFS or S3.
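
As an illustration, here is a hedged sketch of pointing the same readers at distributed storage; the HDFS and S3 locations below are hypothetical, and S3 access additionally requires the Hadoop S3 connector and credentials to be configured:


# Hypothetical paths on distributed storage; replace with real locations
hdfs_df = spark.read.parquet("hdfs://namenode:9000/warehouse/flights/")
s3_df = spark.read.csv("s3a://my-bucket/raw/flights.csv", header=True, inferSchema=True)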

3. Spark Streaming

Spark Streaming allows processing of real-time data streams — data that arrives continuously like logs, sensor readings, or social media feeds.

It processes incoming data in small batches (the micro-batch model), enabling near real-time analytics.

Example Use Case:

Imagine you are building a dashboard that updates every second with the number of users visiting your website. Logs are written continuously, and Spark Streaming reads and analyzes them in real-time.

Question:

How is Spark Streaming different from batch processing?

Answer:

Batch processing handles data already stored (like files or databases), while Spark Streaming works with data that is continuously generated and never fully "stored".

Though a full streaming example requires a socket or Kafka setup, here's a conceptual snippet:


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("StreamExample").getOrCreate()

# Streaming file sources require an explicit schema (adjust the fields to your CSV)
schema = StructType([StructField("value", StringType(), True)])

# Simulate streaming by watching a folder for newly arriving CSV files
stream_df = spark.readStream.option("header", True).schema(schema).csv("/path/to/streaming-folder")

query = stream_df.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
    

This would read streaming CSV files and print their contents to the console, which is useful in environments like IoT sensors or live feeds.
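
Tying this back to the dashboard use case, here is a hedged sketch of a windowed count; it assumes the stream carries an event-time column named timestamp (a hypothetical name that would have to be part of the schema above):


from pyspark.sql.functions import window, col

# Count arriving records per 1-second window, keyed on an assumed "timestamp" column
visits_per_second = stream_df.groupBy(window(col("timestamp"), "1 second")).count()

query = (visits_per_second.writeStream
         .outputMode("complete")   # aggregations need "complete" or "update" mode
         .format("console")
         .start())
query.awaitTermination()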

4. MLlib (Machine Learning Library)

MLlib is Spark’s machine learning library. It provides scalable implementations of algorithms for classification, regression, clustering, collaborative filtering, and feature engineering.

Example: Logistic Regression with MLlib


from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MLExample").getOrCreate()

# Binary labels (0.0 / 1.0): LogisticRegression expects class indices, not arbitrary values
data = [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0), (4.0, 1.0)]
df = spark.createDataFrame(data, ["feature", "label"])

# MLlib expects features packed into a single vector column
assembler = VectorAssembler(inputCols=["feature"], outputCol="features")
train_data = assembler.transform(df)

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_data)

predictions = model.transform(train_data)
predictions.select("features", "label", "prediction").show()
    
+--------+-----+----------+
|features|label|prediction|
+--------+-----+----------+
|   [1.0]|  0.0|       0.0|
|   [2.0]|  0.0|       0.0|
|   [3.0]|  1.0|       1.0|
|   [4.0]|  1.0|       1.0|
+--------+-----+----------+
    

This demonstrates building a simple classification model. In real-world cases, datasets would be much larger and more complex — and Spark MLlib handles them efficiently using distributed computation.
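
As a small extension, here is a hedged sketch of scoring the model above with MLlib's built-in evaluator, using the predictions DataFrame from the example:


from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Area under the ROC curve for the binary classifier trained above
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction")
print("AUC:", evaluator.evaluate(predictions))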

Summary

The Spark ecosystem consists of powerful components, each solving specific data problems: Spark Core for distributed computation with RDDs, Spark SQL for querying structured data, Spark Streaming for real-time pipelines, and MLlib for scalable machine learning.

As a beginner, understanding these modules will help you choose the right tool for the right job and build robust data pipelines in Spark.


