Apache Spark CourseApache Spark Course1

Module 12: Project – Real-World Data PipelineModule 12: Project – Real-World Data Pipeline1

Overview of Spark MLlib



Overview of Spark MLlib

Spark MLlib is Apache Spark’s built-in machine learning library. It provides scalable, distributed machine learning algorithms for classification, regression, clustering, recommendation, and more.

MLlib is designed to work efficiently with large datasets distributed across multiple machines. This makes it perfect for real-world big data machine learning tasks that can't be handled by tools running on a single computer.

Why Use Spark MLlib?

Common Tasks Supported by MLlib

Real-Life Example: Predicting Flight Delays

Suppose an airline wants to predict whether a flight will be delayed based on data like weather conditions, day of the week, departure time, etc. Using Spark MLlib, we can process millions of flight records and build a classification model to predict delays.

Question:

Why not use regular Python libraries like scikit-learn for this task?

Answer:

Scikit-learn works great for small datasets that fit in memory. But when the dataset is massive (like millions of flight records), it becomes inefficient or fails. Spark MLlib distributes the workload across a cluster, making it scalable and faster.

How MLlib Works in Spark

MLlib uses the concept of a pipeline, which is a sequence of steps applied to data. Each step is either a Transformer (e.g., scale or encode data) or an Estimator (e.g., train a model).

A typical MLlib pipeline includes:

  1. Reading and cleaning the data
  2. Indexing or encoding categorical features
  3. Assembling all features into a single vector
  4. Training the model (e.g., logistic regression)
  5. Evaluating the model

PySpark Example: Logistic Regression with MLlib

Let’s walk through a simple example using PySpark to train a logistic regression model.

Dataset:

We’ll use the classic Iris dataset (converted for binary classification for simplicity).


from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# Create Spark session
spark = SparkSession.builder.appName("MLlibIntro").getOrCreate()

# Load sample dataset
data = spark.read.csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv", header=True, inferSchema=True)

# Convert to binary classification: Setosa vs Non-Setosa
data = data.withColumn("label", (data["species"] != "setosa").cast("int"))

# Assemble features into a single vector
assembler = VectorAssembler(inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"], outputCol="features")

# Create Logistic Regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Create pipeline
pipeline = Pipeline(stages=[assembler, lr])

# Split data
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

# Train model
model = pipeline.fit(train_data)

# Predict on test data
predictions = model.transform(test_data)

# Show predictions
predictions.select("features", "label", "prediction", "probability").show()
  
+-----------------+-----+----------+--------------------+
|         features|label|prediction|         probability|
+-----------------+-----+----------+--------------------+
|[4.9,2.5,4.5,1.7]|    1|       1.0|[0.007,0.993]       |
|[5.0,2.0,3.5,1.0]|    1|       1.0|[0.022,0.978]       |
|[5.2,3.5,1.5,0.2]|    0|       0.0|[0.950,0.050]       |
|[5.7,2.8,4.1,1.3]|    1|       1.0|[0.028,0.972]       |
+-----------------+-----+----------+--------------------+
  

Explanation of the Code

Summary

Spark MLlib enables you to build scalable and efficient machine learning models for big data. It offers tools for classification, regression, clustering, and pipelines. With PySpark, beginners can write Pythonic code that runs efficiently on a distributed cluster.

In upcoming lessons, we will dive deeper into MLlib pipelines and explore other algorithms like Decision Trees and KMeans.



Welcome to ProgramGuru

Sign up to start your journey with us

Support ProgramGuru.org

Mention your name, and programguru.org in the message. Your name shall be displayed in the sponsers list.

PayPal

UPI

PhonePe QR

MALLIKARJUNA M