📉 Dimensionality Reduction with PCA
In real-world machine learning problems, we often deal with datasets that have a large number of features (also called dimensions). These high-dimensional datasets are computationally expensive to process, difficult to visualize, and prone to overfitting. This is where Principal Component Analysis (PCA) becomes useful.
What is PCA?
PCA (Principal Component Analysis) is a mathematical technique used to reduce the number of input variables in a dataset, while still preserving as much information (variance) as possible. It does this by transforming the original features into a new set of uncorrelated variables called principal components.
- First Principal Component: The direction of maximum variance in the data.
- Second Principal Component: Orthogonal to the first and captures the next highest variance.
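To make those two definitions concrete, here is a minimal NumPy sketch of the idea: center the data, compute the covariance matrix, and project onto its top eigenvectors. Treat it purely as an illustration; scikit-learn's PCA, used in the examples below, does the same job more robustly via SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # toy data: 200 samples, 3 features

X_centered = X - X.mean(axis=0)          # PCA works on mean-centered data
cov = np.cov(X_centered, rowvar=False)   # 3x3 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors are the principal directions
order = np.argsort(eigvals)[::-1]        # sort by variance, largest first
components = eigvecs[:, order[:2]]       # keep the top 2 directions

X_2d = X_centered @ components           # project the data onto those 2 components
print(X_2d.shape)                        # (200, 2)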
Why use PCA?
- To speed up model training
- To remove redundant or noisy features
- To visualize high-dimensional data (e.g., reduce to 2D or 3D)
Example 1: PCA on a Synthetic Dataset
Let’s create a simple 3D dataset and reduce it to 2D using PCA.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection (only needed on older Matplotlib versions)
import pandas as pd
# Step 1: Create synthetic dataset
X, _ = make_classification(n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42)
# Step 2: Convert to DataFrame for visualization
df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3'])
# Step 3: 3D Visualization
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['Feature1'], df['Feature2'], df['Feature3'], c='skyblue')
ax.set_title('Original 3D Data')
plt.show()
# Step 4: Apply PCA to reduce to 2D
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Step 5: Plot reduced 2D data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c='orange')
plt.title('Data after PCA (2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
A 3D scatter plot of the original data followed by a 2D plot showing the PCA result.
Code Description:
- make_classification generates synthetic features with variance.
- PCA(n_components=2) reduces our 3D data into 2D.
- The result is visualized to show how PCA compresses the data while preserving its structure.
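A useful follow-up check, reusing the pca object fitted above, is to ask how much of the original variance the two components actually retain. scikit-learn exposes this through the explained_variance_ratio_ attribute:
# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# Their sum tells you how much of the original information was kept
print(pca.explained_variance_ratio_.sum())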
Why does PCA work even without labels?
Because PCA is an unsupervised technique — it looks only at the variance of features, not the target values.
✳️ Let's test your understanding
What happens if we apply PCA to a dataset where all features are highly correlated?
➤ The data will compress well into just one or two components, because most of the variance lies along a single direction; PCA captures that shared information efficiently.
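You can verify this with a small sketch (the way the correlated features are built here is purely illustrative): three features that are noisy copies of one another collapse almost entirely onto the first component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
base = rng.normal(size=(300, 1))
# Three highly correlated features: copies of `base` plus a little noise
X_corr = np.hstack([base,
                    base + 0.05 * rng.normal(size=(300, 1)),
                    base + 0.05 * rng.normal(size=(300, 1))])

pca_corr = PCA(n_components=3)
pca_corr.fit(X_corr)
print(pca_corr.explained_variance_ratio_)  # the first component should dominate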
Example 2: PCA on the Iris Dataset
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Step 3: Visualize the result
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.grid(True)
plt.colorbar(label='Species')
plt.show()
A 2D scatter plot showing clusters of the 3 Iris species after PCA.
Code Description:
- StandardScaler is important before PCA so that all features carry equal weight.
- load_iris() gives us 4D data, and we reduce it to 2D using PCA.
- The visualization shows clear clusters, meaning PCA retained useful structure.
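Each principal component is a weighted combination of the four original Iris measurements. If you are curious what those weights look like, you can inspect pca.components_ (this reuses the pca and iris objects from the example above):
import pandas as pd
# Rows are principal components, columns are the original features;
# each entry is the weight of that feature in the component
weights = pd.DataFrame(pca.components_,
                       columns=iris.feature_names,
                       index=['PC1', 'PC2'])
print(weights)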
Is PCA the same as feature selection?
➤ No. PCA creates new features (principal components), while feature selection chooses from the original features.
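Here is a small sketch of that contrast on the Iris data from Example 2, using SelectKBest as one common feature-selection method (note that it needs the labels y, whereas PCA does not):
from sklearn.feature_selection import SelectKBest, f_classif

# Feature selection: keeps 2 of the 4 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
print(selector.get_support())   # boolean mask over the original features

# PCA: builds 2 brand-new columns, each a mix of all 4 original features
print(X_pca[:5])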
Quick Recap Questions
What are the benefits of PCA?
➤ Speed, visualization, noise reduction, better generalization.
When should you apply PCA?
➤ When your dataset has too many features, or features are correlated, or you want better performance/visualization.
Summary
- PCA reduces dimensionality by transforming features into principal components.
- Useful in preprocessing, visualization, and noise reduction.
- Standardize features before applying PCA, especially when they are on different scales.
Now that you understand PCA, try using it on a dataset with 10+ features and see how it performs!
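If you want a concrete starting point, here is one possible sketch using scikit-learn's digits dataset (64 features per image); the plotting choices are illustrative, not prescriptive.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 1797 images of handwritten digits, each described by 64 pixel features
digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='tab10', s=10)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on the digits dataset (64D to 2D)')
plt.colorbar(label='Digit')
plt.show()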