Before we can feed data into a machine learning model, we must convert it into a numerical format. In PySpark, we use tools like StringIndexer and VectorAssembler to transform raw data into features that models can understand and learn from.
Machine learning algorithms only work with numbers. However, real-world data often contains text, categories, or even dates. Feature engineering transforms this raw data into numerical feature vectors.
StringIndexer is used to convert categorical string columns (like "male", "female") into numerical indices (like 0.0, 1.0). This is important because algorithms don’t understand text.
Imagine a dataset with a column "gender" containing values: "male" and "female". Here's how StringIndexer helps:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder.appName("StringIndexerExample").getOrCreate()
# Sample data
data = spark.createDataFrame([
    (0, "male"),
    (1, "female"),
    (2, "female"),
    (3, "male")
], ["id", "gender"])
# Apply StringIndexer
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
indexed = indexer.fit(data).transform(data)
indexed.show()
+---+------+------------+
| id|gender|gender_index|
+---+------+------------+
|  0|  male|         1.0|
|  1|female|         0.0|
|  2|female|         0.0|
|  3|  male|         1.0|
+---+------+------------+
Why not just use text labels in the ML model?
Because MLlib algorithms operate on numeric columns: if you pass raw strings, training either fails or the values are treated as unrecognized data. Indexing gives each category a consistent numeric code and allows efficient processing.
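A note on the ordering: by default, StringIndexer assigns indices by label frequency (the most frequent label gets 0.0, with ties broken alphabetically), and this behavior can be changed through its stringOrderType parameter. If you later need the original strings back, for example after a prediction, IndexToString reverses the mapping. A minimal sketch, reusing the indexed DataFrame from above:

from pyspark.ml.feature import IndexToString

# IndexToString reads the label metadata that StringIndexer attached to
# "gender_index" and restores the original string for each row.
converter = IndexToString(inputCol="gender_index", outputCol="gender_original")
converter.transform(indexed).select("gender_index", "gender_original").show()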
VectorAssembler is used to combine multiple feature columns into a single vector column. This is the format that Spark MLlib expects as input to machine learning algorithms.
Suppose we have a dataset with age, salary, and a gender_index column (created using StringIndexer). We use VectorAssembler to combine these into a single features column.
from pyspark.ml.feature import VectorAssembler
# Sample data with numeric and categorical fields
data = spark.createDataFrame([
    (25, 50000, 1.0),
    (30, 60000, 0.0),
    (45, 80000, 1.0)
], ["age", "salary", "gender_index"])
# Assemble features into a single vector
assembler = VectorAssembler(
    inputCols=["age", "salary", "gender_index"],
    outputCol="features"
)
output = assembler.transform(data)
output.select("features").show(truncate=False)
+-------------------+
|features           |
+-------------------+
|[25.0,50000.0,1.0] |
|[30.0,60000.0,0.0] |
|[45.0,80000.0,1.0] |
+-------------------+
Why combine columns into a single vector?
Machine learning models in Spark expect a single input column containing feature vectors. Combining all relevant columns using VectorAssembler standardizes the input format.
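To see why this matters, here is a minimal sketch of how the assembled column would be consumed by an estimator. The binary "label" values below are made up purely for illustration; any supervised MLlib model simply takes one vector column (featuresCol) and one label column (labelCol):

from pyspark.ml.classification import LogisticRegression

# Hypothetical labeled data: same feature columns as above plus a made-up label.
labeled = spark.createDataFrame([
    (25, 50000, 1.0, 0.0),
    (30, 60000, 0.0, 1.0),
    (45, 80000, 1.0, 1.0)
], ["age", "salary", "gender_index", "label"])

# Reuse the assembler defined above to build the single "features" column.
train_df = assembler.transform(labeled)

# The estimator reads the vector column directly.
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)
model.transform(train_df).select("features", "prediction").show(truncate=False)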
Now let’s combine everything into a mini-pipeline where we index a categorical column and assemble it with numeric ones.
# Create full dataset
data = spark.createDataFrame([
    (25, 50000, "male"),
    (30, 60000, "female"),
    (45, 80000, "male")
], ["age", "salary", "gender"])
# Step 1: Index the gender column
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
data_indexed = indexer.fit(data).transform(data)
# Step 2: Assemble features
assembler = VectorAssembler(
    inputCols=["age", "salary", "gender_index"],
    outputCol="features"
)
final_data = assembler.transform(data_indexed)
final_data.select("age", "salary", "gender", "gender_index", "features").show(truncate=False)
+---+------+------+------------+-------------------+
|age|salary|gender|gender_index|features           |
+---+------+------+------------+-------------------+
|25 |50000 |male  |1.0         |[25.0,50000.0,1.0] |
|30 |60000 |female|0.0         |[30.0,60000.0,0.0] |
|45 |80000 |male  |1.0         |[45.0,80000.0,1.0] |
+---+------+------+------------+-------------------+
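These two steps are often chained with Spark's Pipeline API, so that a single fit() and transform() run the indexer and the assembler in order. A minimal sketch using the indexer, assembler, and data defined above:

from pyspark.ml import Pipeline

# A Pipeline runs its stages in sequence: first the StringIndexer is fitted
# and applied, then the VectorAssembler builds the "features" column.
pipeline = Pipeline(stages=[indexer, assembler])
pipeline_model = pipeline.fit(data)
pipeline_model.transform(data).select("gender", "gender_index", "features").show(truncate=False)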
StringIndexer helps convert text labels into numeric values, and VectorAssembler combines multiple features into a single vector column. These are essential steps in preparing any dataset for machine learning with Spark MLlib.
With these tools, we ensure that the data is in the correct shape and format for Spark to learn patterns and make predictions efficiently.