
Dimensionality Reduction with PCA in Machine Learning



📉 Dimensionality Reduction with PCA

In real-world machine learning problems, we often work with datasets that have a large number of features (also called dimensions). Such high-dimensional datasets are computationally expensive to process, difficult to visualize, and can make models more prone to overfitting. This is where Principal Component Analysis (PCA) becomes useful.

What is PCA?

PCA (Principal Component Analysis) is a mathematical technique used to reduce the number of input variables in a dataset, while still preserving as much information (variance) as possible. It does this by transforming the original features into a new set of uncorrelated variables called principal components.
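
To make "uncorrelated principal components" concrete, here is a minimal NumPy sketch of the idea behind PCA: center the data, compute the covariance matrix, and project onto its top eigenvectors. This is only an illustration; the examples below use scikit-learn's ready-made PCA.

import numpy as np

def pca_manual(X, n_components):
    # 1. Center each feature around zero
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvalues/eigenvectors (eigh returns them in ascending order)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the eigenvectors with the largest variance
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    # 5. Project the data onto the chosen components
    return X_centered @ components

# Quick check: 200 samples, 3 features -> 2 components
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_2d = pca_manual(X_demo, n_components=2)
print(X_2d.shape)                                 # (200, 2)
print(np.corrcoef(X_2d, rowvar=False).round(3))   # off-diagonal ~0: the components are uncorrelated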

Why use PCA?

  1. To speed up model training
  2. To remove redundant or noisy features
  3. To visualize high-dimensional data (e.g., reduce to 2D or 3D)

Example 1: PCA on a Synthetic Dataset

Let’s create a simple 3D dataset and reduce it to 2D using PCA.


from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd

# Step 1: Create synthetic dataset
X, _ = make_classification(n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42)

# Step 2: Convert to DataFrame for visualization
df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3'])

# Step 3: 3D Visualization
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['Feature1'], df['Feature2'], df['Feature3'], c='skyblue')
ax.set_title('Original 3D Data')
plt.show()

# Step 4: Apply PCA to reduce to 2D
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Step 5: Plot reduced 2D data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c='orange')
plt.title('Data after PCA (2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

Output: a 3D scatter plot of the original data, followed by a 2D scatter plot showing the PCA result.

Code Description:

We generate a 300-sample dataset with three informative features, visualize it in 3D, then fit PCA with n_components=2 and plot the projected data. fit_transform learns the principal components from X and returns the same samples expressed in the new 2D coordinate system.

Why does PCA work even without labels?

Because PCA is an unsupervised technique — it looks only at the variance of features, not the target values.
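
Continuing with the pca object fitted in Example 1, you can check how much of that variance the two components keep using scikit-learn's explained_variance_ratio_ attribute (the exact numbers depend on the generated data):

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)         # e.g. [0.6, 0.3]
print(pca.explained_variance_ratio_.sum())   # total variance retained in 2D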

✳️ Let's test your understanding

What happens if we apply PCA to a dataset where all features are highly correlated?

➤ It will likely compress well into fewer components, because most of the variance lies along a single direction; PCA captures that shared information efficiently.
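
A quick way to verify this answer: build two nearly identical features and look at how PCA splits the variance between components. This is a standalone sketch with synthetic data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Two highly correlated features: the second is the first plus a little noise
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.05, size=300)
X_corr = np.column_stack([x1, x2])

pca_corr = PCA(n_components=2).fit(X_corr)

# Nearly all the variance lies along the first component (roughly [0.99, 0.01])
print(pca_corr.explained_variance_ratio_)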


Example 2: PCA on the Iris Dataset


from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Step 3: Visualize the result
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.grid(True)
plt.colorbar(label='Species')
plt.show()

Output: a 2D scatter plot showing clusters of the three Iris species after PCA.

Code Description:

The Iris features are first standardized with StandardScaler, because PCA is sensitive to feature scale. The four standardized features are then projected onto the first two principal components, and the scatter plot colored by species shows that the three classes remain largely separable in just two dimensions.

Is PCA the same as feature selection?

➤ No. PCA creates new features (principal components), while feature selection chooses from the original features.
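
To see the difference in code, here is a short sketch on the Iris data from Example 2 (it reuses the X_scaled, y, and iris variables defined there): SelectKBest keeps two of the original four columns unchanged, while PCA builds two new columns that mix all four. Note also that SelectKBest needs the labels y, whereas PCA does not.

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Feature selection: keep 2 of the original 4 Iris features as-is
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
print(iris.feature_names)        # the original feature names
print(selector.get_support())    # which original columns were kept

# PCA: create 2 new features, each a weighted mix of all 4 originals
pca_iris = PCA(n_components=2).fit(X_scaled)
print(pca_iris.components_.round(2))   # weight of each original feature in each component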

Quick Recap Questions

What are the benefits of PCA?

➤ Speed, visualization, noise reduction, better generalization.

When should you apply PCA?

➤ When your dataset has too many features, or features are correlated, or you want better performance/visualization.
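
A practical way to decide how many components to keep is to look at the cumulative explained variance. Here is a minimal sketch that reuses X_scaled from the Iris example:

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components and inspect the cumulative explained variance
pca_full = PCA().fit(X_scaled)
print(np.cumsum(pca_full.explained_variance_ratio_).round(3))   # e.g. the first two components already cover about 95%

# Shortcut: pass a fraction and let PCA pick enough components to keep 95% of the variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)
print(X_95.shape)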

Summary

PCA reduces a dataset to a smaller set of uncorrelated principal components that capture most of the original variance, which makes models faster to train, easier to visualize, and less sensitive to noisy or redundant features. Now that you understand PCA, try using it on a dataset with 10+ features and see how it performs!
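
As a starting point, the Wine dataset bundled with scikit-learn has 13 features; here is a minimal sketch of that exercise:

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the 13-feature Wine dataset and standardize it
wine = load_wine()
X_wine = StandardScaler().fit_transform(wine.data)

# Reduce 13 dimensions to 2 and check how much variance survives
pca_wine = PCA(n_components=2)
X_wine_2d = pca_wine.fit_transform(X_wine)
print(pca_wine.explained_variance_ratio_.sum())   # fraction of variance kept by the 2D view

# Plot the result, colored by wine class
plt.scatter(X_wine_2d[:, 0], X_wine_2d[:, 1], c=wine.target, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on the Wine Dataset (13 features to 2)')
plt.show()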


