Feature Engineering with VectorAssembler and StringIndexer
Before we can feed data into a machine learning model, we must convert it into a numerical format. In PySpark, we use tools like StringIndexer and VectorAssembler to transform raw data into features that models can understand and learn from.
Why Feature Engineering?
Machine learning algorithms only work with numbers. However, real-world data often contains text, categories, or even dates. Feature engineering transforms this raw data into numerical feature vectors.
What is StringIndexer?
StringIndexer converts a categorical string column (such as "male" and "female") into numerical indices (such as 0.0 and 1.0). By default it orders labels by descending frequency, so the most common label receives index 0.0, with ties broken alphabetically. This is important because algorithms don’t understand text.
Example: Converting Gender to Numeric
Imagine a dataset with a column "gender" containing values: "male" and "female". Here's how StringIndexer helps:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder.appName("StringIndexerExample").getOrCreate()
# Sample data
data = spark.createDataFrame([
    (0, "male"),
    (1, "female"),
    (2, "female"),
    (3, "male")
], ["id", "gender"])
# Apply StringIndexer
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
indexed = indexer.fit(data).transform(data)
indexed.show()
+---+------+------------+
| id|gender|gender_index|
+---+------+------------+
|  0|  male|         1.0|
|  1|female|         0.0|
|  2|female|         0.0|
|  3|  male|         1.0|
+---+------+------------+
Question:
Why not just use text labels in the ML model?
Answer:
Because Spark MLlib estimators expect numeric feature columns; passing raw strings leads to type errors. Indexing gives each category a consistent numeric code that the algorithms can process efficiently.
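In practice, data scored later may contain categories that were not present when the indexer was fitted. Here is a minimal sketch, reusing the data DataFrame from the example above, showing how to keep unseen labels instead of failing and how to inspect the mapping the indexer learned:
from pyspark.ml.feature import StringIndexer
# handleInvalid="keep" gives unseen labels an extra index instead of raising an error
indexer = StringIndexer(
    inputCol="gender",
    outputCol="gender_index",
    handleInvalid="keep"
)
model = indexer.fit(data)
# The position of each label in this list is the index it was assigned
print(model.labels)  # e.g. ['female', 'male']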
What is VectorAssembler?
VectorAssembler combines multiple feature columns into a single vector column, which is the input format that Spark MLlib expects for machine learning algorithms.
Example: Combining Age, Salary, and Gender Index
Suppose we have a dataset with age, salary, and a gender_index column (created using StringIndexer). We use VectorAssembler to combine these into a single features column.
from pyspark.ml.feature import VectorAssembler
# Sample data with numeric and categorical fields
data = spark.createDataFrame([
    (25, 50000, 1.0),
    (30, 60000, 0.0),
    (45, 80000, 1.0)
], ["age", "salary", "gender_index"])
# Assemble features into a single vector
assembler = VectorAssembler(
    inputCols=["age", "salary", "gender_index"],
    outputCol="features"
)
output = assembler.transform(data)
output.select("features").show(truncate=False)
+-------------------+
|features           |
+-------------------+
|[25.0,50000.0,1.0] |
|[30.0,60000.0,0.0] |
|[45.0,80000.0,1.0] |
+-------------------+
Question:
Why combine columns into a single vector?
Answer:
Machine learning models in Spark expect a single input column containing feature vectors. Combining all relevant columns using VectorAssembler standardizes the input format.
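One practical detail: VectorAssembler fails by default if any of its input columns contains a null. A minimal sketch, assuming a hypothetical DataFrame data_with_nulls with the same three columns, showing how to skip such rows instead:
from pyspark.ml.feature import VectorAssembler
# handleInvalid="skip" drops rows containing nulls; "keep" emits NaN entries; "error" is the default
assembler = VectorAssembler(
    inputCols=["age", "salary", "gender_index"],
    outputCol="features",
    handleInvalid="skip"
)
clean_output = assembler.transform(data_with_nulls)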
Putting It All Together: StringIndexer + VectorAssembler
Now let’s combine everything into a mini-pipeline where we index a categorical column and assemble it with numeric ones.
# Create full dataset
data = spark.createDataFrame([
    (25, 50000, "male"),
    (30, 60000, "female"),
    (45, 80000, "male")
], ["age", "salary", "gender"])
# Step 1: Index the gender column
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
data_indexed = indexer.fit(data).transform(data)
# Step 2: Assemble features
assembler = VectorAssembler(
    inputCols=["age", "salary", "gender_index"],
    outputCol="features"
)
final_data = assembler.transform(data_indexed)
final_data.select("age", "salary", "gender", "gender_index", "features").show(truncate=False)
+---+------+------+------------+-------------------+
|age|salary|gender|gender_index|features           |
+---+------+------+------------+-------------------+
|25 |50000 |male  |0.0         |[25.0,50000.0,0.0] |
|30 |60000 |female|1.0         |[30.0,60000.0,1.0] |
|45 |80000 |male  |0.0         |[45.0,80000.0,0.0] |
+---+------+------+------------+-------------------+
Note that "male" maps to 0.0 here, unlike in the earlier example, because it is now the most frequent label in the dataset and StringIndexer assigns indices by descending frequency.
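The same two steps can also be chained with a Pipeline, so indexing and assembling run as one reusable workflow. Here is a minimal sketch of the mini-pipeline above using pyspark.ml.Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Define both stages once; the Pipeline applies them in order
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
assembler = VectorAssembler(
    inputCols=["age", "salary", "gender_index"],
    outputCol="features"
)
pipeline = Pipeline(stages=[indexer, assembler])
# Fit on the raw data, then transform it in a single call
pipeline_model = pipeline.fit(data)
pipeline_output = pipeline_model.transform(data)
pipeline_output.select("features").show(truncate=False)
A fitted pipeline model can later be applied to new data with the same schema, which keeps the indexing consistent between training and scoring.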
Summary
StringIndexer helps convert text labels into numeric values, and VectorAssembler combines multiple features into a single vector column. These are essential steps in preparing any dataset for machine learning with Spark MLlib.
With these tools, we ensure that the data is in the correct shape and format for Spark to learn patterns and make predictions efficiently.