What is K-Nearest Neighbors (KNN)?
K-Nearest Neighbors (KNN) is one of the simplest yet most effective machine learning algorithms. It is a supervised learning algorithm used for both classification and regression problems.
Real-Life Analogy to Understand KNN
Imagine you're new to a city and want to decide which restaurant to try. You ask your 5 nearest neighbors (K=5) for recommendations. If 3 out of 5 suggest an Italian place, you'll probably go there. That's KNN!
Question:
Why do we consider "nearest" neighbors?
Answer:
Because similar data points are likely to belong to the same class. "Nearest" means "most similar" in terms of feature distance (usually Euclidean distance).
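As a quick illustration, here is how Euclidean distance can be computed in Python using the standard library's math.dist (the two points shown are made up for this example):

```python
import math

# Two fruits described by (weight, size) features
apple = (150, 7)
new_fruit = (160, 7)

# Euclidean distance: sqrt((150 - 160)**2 + (7 - 7)**2)
print(math.dist(apple, new_fruit))  # 10.0
```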
How KNN Works (Step-by-Step)
- Choose the number of neighbors, K.
- Calculate the distance between the new data point and all existing points.
- Select the K closest neighbors.
- Take a majority vote (for classification) or the average (for regression) — see the sketch below.
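To make these steps concrete, here is a minimal from-scratch sketch of a KNN classifier (the function name and data layout are illustrative choices, not part of any library):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the query to every training point
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Step 3: keep the k closest neighbors
    neighbors = sorted(distances)[:k]
    # Step 4: majority vote over the neighbor labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Example using the fruit data from the next section
points = [(150, 7), (130, 6), (180, 8), (170, 7)]
labels = ['Apple', 'Apple', 'Orange', 'Orange']
print(knn_predict(points, labels, (160, 7)))  # Orange
```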
Example: Predicting Fruit Type Based on Weight and Size
Let’s say you have a dataset of fruits with their weight, size, and a label: Apple or Orange.
| Weight (g) | Size | Fruit |
|---|---|---|
| 150 | 7 | Apple |
| 130 | 6 | Apple |
| 180 | 8 | Orange |
| 170 | 7 | Orange |
| 160 | 7 | ? |
Now we want to classify the new fruit (160 g, size 7). We’ll use K=3. The algorithm will find the 3 closest neighbors using distance and count how many are Apples vs. Oranges.
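Working the distances out by hand with the Euclidean formula:

- To (150, 7), Apple: √(10² + 0²) = 10
- To (170, 7), Orange: √(10² + 0²) = 10
- To (180, 8), Orange: √(20² + 1²) ≈ 20.02
- To (130, 6), Apple: √(30² + 1²) ≈ 30.02

The 3 nearest neighbors are one Apple and two Oranges, so the majority vote says Orange.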
Question:
What if there's a tie? (e.g., 2 neighbors are Apple, 2 are Orange with K=4)
Answer:
In the case of a tie, reduce K (an odd K avoids ties in binary classification) or use distance-weighted voting, where closer neighbors get more weight.
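scikit-learn supports distance-weighted voting directly through the weights parameter; a minimal sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

# weights='distance' makes closer neighbors count more in the vote,
# which also acts as a natural tie-breaker for even values of K
knn = KNeighborsClassifier(n_neighbors=4, weights='distance')
```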
Python Code: KNN Using Scikit-Learn
```python
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

# Step 1: Sample dataset
data = {
    'Weight': [150, 130, 180, 170, 160],
    'Size': [7, 6, 8, 7, 7],
    'Fruit': ['Apple', 'Apple', 'Orange', 'Orange', 'Unknown']
}
df = pd.DataFrame(data)

# Separate known and unknown fruits
known_df = df[df['Fruit'] != 'Unknown']
unknown_df = df[df['Fruit'] == 'Unknown']

# Step 2: Features and target
X = known_df[['Weight', 'Size']]
y = known_df['Fruit']

# Step 3: Train the model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Step 4: Predict the unknown fruit
prediction = knn.predict(unknown_df[['Weight', 'Size']])
print("Predicted Fruit:", prediction[0])
```
Output:

```
Predicted Fruit: Orange
```
Code Explanation:
- Step 1: We create a simple DataFrame of known fruits and one unknown.
- Step 2: We separate the features (`Weight`, `Size`) and the labels.
- Step 3: We train the KNN model with `n_neighbors=3`.
- Step 4: We predict the label for the unknown fruit using the trained model.
Question:
Should you always pick K=3?
Answer:
No. You should try different K values and use cross-validation to choose the best one based on accuracy.
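A minimal sketch of that search, using scikit-learn's cross_val_score on the built-in Iris dataset (our 4-row fruit table is too small for cross-validation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several K values and compare mean 5-fold cross-validation accuracy
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k}: mean accuracy = {scores.mean():.3f}")
```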
Key Points to Remember
- KNN is simple and works well with small datasets.
- It doesn’t learn an internal model; it memorizes the training data (a "lazy learner").
- Prediction is slow for large datasets, since distances to every stored point are computed at query time.
- Feature scaling (normalization or standardization) is important in KNN, because features with large ranges dominate the distance (see the sketch below).
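For example, in our fruit data Weight (130 to 180) has a much larger range than Size (6 to 8), so unscaled distances are dominated by weight. A minimal sketch of scaling inside a pipeline, reusing X, y, and unknown_df from the fruit example above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# StandardScaler standardizes each feature before distances are computed,
# so Weight no longer dominates Size
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)  # X, y from the fruit example above
print(model.predict(unknown_df[['Weight', 'Size']])[0])
```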
Next Steps
In the next topic, we will learn about Decision Trees, another popular supervised learning technique that builds internal decision rules instead of relying on neighbors.