
Dimensionality Reduction with PCA in Machine Learning



📉 Dimensionality Reduction with PCA

In real-world machine learning problems, we often deal with datasets that have a large number of features (also called dimensions). These high-dimensional datasets can be computationally expensive, difficult to visualize, and may lead to overfitting. This is where Principal Component Analysis (PCA) becomes useful.

What is PCA?

PCA (Principal Component Analysis) is a mathematical technique used to reduce the number of input variables in a dataset, while still preserving as much information (variance) as possible. It does this by transforming the original features into a new set of uncorrelated variables called principal components.

  • First Principal Component: The direction of maximum variance in the data.
  • Second Principal Component: Orthogonal to the first and captures the next highest variance.
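
To make the two bullet points above concrete, here is a minimal sketch (not part of the original examples) that finds the principal components of a tiny two-feature dataset directly with NumPy, by eigendecomposing the covariance matrix of the centered data; scikit-learn's PCA arrives at the same directions internally (via SVD).

import numpy as np

# Two correlated features: Feature2 is roughly twice Feature1
rng = np.random.default_rng(42)
f1 = rng.normal(size=200)
f2 = 2 * f1 + rng.normal(scale=0.5, size=200)
X = np.column_stack([f1, f2])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Eigendecompose the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort by decreasing eigenvalue: each eigenvector is a principal
#    component direction, each eigenvalue is the variance along it
order = np.argsort(eigenvalues)[::-1]
print("Variance captured per component:", eigenvalues[order])
print("First principal component direction:", eigenvectors[:, order[0]])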

Why use PCA?

  1. To speed up model training
  2. To remove redundant or noisy features
  3. To visualize high-dimensional data (e.g., reduce to 2D or 3D)

Example 1: PCA on a Synthetic Dataset

Let’s create a simple 3D dataset and reduce it to 2D using PCA.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd

# Step 1: Create synthetic dataset
X, _ = make_classification(n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42)

# Step 2: Convert to DataFrame for visualization
df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3'])

# Step 3: 3D Visualization
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['Feature1'], df['Feature2'], df['Feature3'], c='skyblue')
ax.set_title('Original 3D Data')
plt.show()

# Step 4: Apply PCA to reduce to 2D
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Step 5: Plot reduced 2D data
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c='orange')
plt.title('Data after PCA (2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

Output: a 3D scatter plot of the original data, followed by a 2D scatter plot showing the PCA result.

Code Description:

  • make_classification generates a synthetic dataset with three informative features.
  • PCA(n_components=2) reduces our 3D data into 2D.
  • The result is visualized to show how PCA compresses data while preserving structure.
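
If you want to know how much of the original 3D variance survived the reduction, the fitted PCA object exposes an explained_variance_ratio_ attribute; a quick check you could append to the example above:

# Fraction of the total variance carried by each principal component
print(pca.explained_variance_ratio_)

# Total fraction of variance preserved after dropping to 2D
print(pca.explained_variance_ratio_.sum())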

Why does PCA work even without labels?

➤ Because PCA is an unsupervised technique: it looks only at the variance of the features, not the target values.

✳️ Let's test your understanding

What happens if we apply PCA to a dataset where all features are highly correlated?

➤ It will likely reduce well to fewer components, because the variance is along a single direction — PCA captures that shared information efficiently.
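
As a quick sanity check of that answer, here is a small sketch (with hypothetical synthetic data, not part of the original examples) where three features are nearly linear copies of one another; the first component ends up carrying almost all of the variance:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))

# Three features that are (noisy) linear copies of the same signal
X_corr = np.hstack([
    base,
    2 * base + 0.01 * rng.normal(size=(500, 1)),
    -base + 0.01 * rng.normal(size=(500, 1)),
])

pca = PCA(n_components=3)
pca.fit(X_corr)
print(pca.explained_variance_ratio_)  # the first value will be very close to 1.0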


Example 2: PCA on the Iris Dataset

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Step 3: Visualize the result
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.grid(True)
plt.colorbar(label='Species')
plt.show()

Output: a 2D scatter plot showing clusters of the 3 Iris species after PCA.

Code Description:

  • StandardScaler puts all features on the same scale before PCA, so no single feature dominates the variance that PCA measures.
  • load_iris() gives us 4D data, and we reduce it to 2D using PCA.
  • Visualization shows clusters, meaning PCA retained useful structure.
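
To see which of the four original measurements drive each component, you can also inspect the fitted object's components_ matrix (one row of feature weights per component) together with explained_variance_ratio_; for example:

# Each row is a principal component expressed as weights
# over the four original Iris features
print(pca.components_)

# How much variance PC1 and PC2 each retain
print(pca.explained_variance_ratio_)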

Is PCA the same as feature selection?

➤ No. PCA creates new features (principal components), while feature selection chooses from the original features.
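
To make the contrast concrete, here is a short sketch (using SelectKBest as just one possible feature-selection method, applied to the Iris variables from the example above):

from sklearn.feature_selection import SelectKBest, f_classif

# Feature selection: keeps 2 of the 4 original Iris columns unchanged
# (note that SelectKBest needs the labels y, unlike PCA)
X_selected = SelectKBest(f_classif, k=2).fit_transform(X_scaled, y)

# PCA: builds 2 brand-new columns, each a mix of all 4 original features
X_new = PCA(n_components=2).fit_transform(X_scaled)

print(X_selected.shape, X_new.shape)  # both (150, 2), but with different meanings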

Quick Recap Questions

What are the benefits of PCA?

➤ Speed, visualization, noise reduction, better generalization.

When should you apply PCA?

➤ When your dataset has many features, when features are strongly correlated, or when you want faster training or easier visualization.

Summary

  • PCA reduces dimensionality by transforming features into principal components.
  • Useful in preprocessing, visualization, and noise reduction.
  • Always standardize features before applying PCA.

Now that you understand PCA, try using it on a dataset with 10+ features and see how it performs!
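
If you want a concrete starting point for that exercise, one option (an assumption on our part, any wide dataset will do) is scikit-learn's digits dataset, which has 64 pixel features per image:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

digits = load_digits()                                  # 1797 samples, 64 features
X_scaled = StandardScaler().fit_transform(digits.data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='tab10', s=10)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on the digits dataset (64D to 2D)')
plt.colorbar(label='Digit')
plt.show()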


