Before we can feed data into a machine learning model, we must convert it into a numerical format. In PySpark, we use tools like StringIndexer and VectorAssembler to transform raw data into features that models can understand and learn from.
Machine learning algorithms only work with numbers. However, real-world data often contains text, categories, or even dates. Feature engineering transforms this raw data into numerical feature vectors.
StringIndexer is used to convert categorical string columns (like "male", "female") into numerical indices (like 0.0, 1.0). This is important because algorithms don’t understand text.
Imagine a dataset with a column "gender" containing values: "male" and "female". Here's how StringIndexer helps:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder.appName("StringIndexerExample").getOrCreate()
# Sample data
data = spark.createDataFrame([
    (0, "male"),
    (1, "female"),
    (2, "female"),
    (3, "male")
], ["id", "gender"])
# Apply StringIndexer
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
indexed = indexer.fit(data).transform(data)
indexed.show()
+---+------+------------+
| id|gender|gender_index|
+---+------+------------+
|  0|  male|         1.0|
|  1|female|         0.0|
|  2|female|         0.0|
|  3|  male|         1.0|
+---+------+------------+
Why not just use text labels in the ML model?
Because MLlib algorithms operate on numeric columns: if you pass raw strings, training either fails or the values are treated as unrecognized data. Indexing gives each category a consistent numeric code and allows efficient processing.
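A note on the ordering: by default, StringIndexer assigns indices by label frequency (the most frequent label gets 0.0, with ties broken alphabetically), and this behavior can be changed through its stringOrderType parameter. If you later need the original strings back, for example after a prediction, IndexToString reverses the mapping. A minimal sketch, reusing the indexed DataFrame from above:

from pyspark.ml.feature import IndexToString

# IndexToString reads the label metadata that StringIndexer attached to
# "gender_index" and restores the original string for each row.
converter = IndexToString(inputCol="gender_index", outputCol="gender_original")
converter.transform(indexed).select("gender_index", "gender_original").show()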
VectorAssembler is used to combine multiple feature columns into a single vector column. This is the format that Spark MLlib expects as input to machine learning algorithms.
Suppose we have a dataset with age, salary, and a gender_index column (created using StringIndexer). We use VectorAssembler to combine these into a single features column.
from pyspark.ml.feature import VectorAssembler
# Sample data with numeric and categorical fields
data = spark.createDataFrame([
    (25, 50000, 1.0),
    (30, 60000, 0.0),
    (45, 80000, 1.0)
], ["age", "salary", "gender_index"])
# Assemble features into a single vector
assembler = VectorAssembler(
    inputCols=["age", "salary", "gender_index"],
    outputCol="features"
)
output = assembler.transform(data)
output.select("features").show(truncate=False)
+-------------------+
|features           |
+-------------------+
|[25.0,50000.0,1.0] |
|[30.0,60000.0,0.0] |
|[45.0,80000.0,1.0] |
+-------------------+
Why combine columns into a single vector?
Machine learning models in Spark expect a single input column containing feature vectors. Combining all relevant columns using VectorAssembler standardizes the input format.
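To see why this matters, here is a minimal sketch of how the assembled column would be consumed by an estimator. The binary "label" values below are made up purely for illustration; any supervised MLlib model simply takes one vector column (featuresCol) and one label column (labelCol):

from pyspark.ml.classification import LogisticRegression

# Hypothetical labeled data: same feature columns as above plus a made-up label.
labeled = spark.createDataFrame([
    (25, 50000, 1.0, 0.0),
    (30, 60000, 0.0, 1.0),
    (45, 80000, 1.0, 1.0)
], ["age", "salary", "gender_index", "label"])

# Reuse the assembler defined above to build the single "features" column.
train_df = assembler.transform(labeled)

# The estimator reads the vector column directly.
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)
model.transform(train_df).select("features", "prediction").show(truncate=False)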
Now let’s combine everything into a mini-pipeline where we index a categorical column and assemble it with numeric ones.
# Create full dataset
data = spark.createDataFrame([
    (25, 50000, "male"),
    (30, 60000, "female"),
    (45, 80000, "male")
], ["age", "salary", "gender"])
# Step 1: Index the gender column
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
data_indexed = indexer.fit(data).transform(data)
# Step 2: Assemble features
assembler = VectorAssembler(
    inputCols=["age", "salary", "gender_index"],
    outputCol="features"
)
final_data = assembler.transform(data_indexed)
final_data.select("age", "salary", "gender", "gender_index", "features").show(truncate=False)
+---+------+------+------------+-------------------+
|age|salary|gender|gender_index|features           |
+---+------+------+------------+-------------------+
|25 |50000 |male  |1.0         |[25.0,50000.0,1.0] |
|30 |60000 |female|0.0         |[30.0,60000.0,0.0] |
|45 |80000 |male  |1.0         |[45.0,80000.0,1.0] |
+---+------+------+------------+-------------------+
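These two steps are often chained with Spark's Pipeline API, so that a single fit() and transform() run the indexer and the assembler in order. A minimal sketch using the indexer, assembler, and data defined above:

from pyspark.ml import Pipeline

# A Pipeline runs its stages in sequence: first the StringIndexer is fitted
# and applied, then the VectorAssembler builds the "features" column.
pipeline = Pipeline(stages=[indexer, assembler])
pipeline_model = pipeline.fit(data)
pipeline_model.transform(data).select("gender", "gender_index", "features").show(truncate=False)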
StringIndexer helps convert text labels into numeric values, and VectorAssembler combines multiple features into a single vector column. These are essential steps in preparing any dataset for machine learning with Spark MLlib.
With these tools, we ensure that the data is in the correct shape and format for Spark to learn patterns and make predictions efficiently.