Project Overview
In this project, we will build a machine learning model that predicts whether a passenger survived the Titanic shipwreck based on features such as age, sex, and passenger class.
This is one of the most famous beginner datasets from Kaggle and is perfect for understanding real-world ML workflows.
Dataset Description
We will use the Titanic dataset, which contains the following columns:
- PassengerId – Unique ID for each passenger
- Survived – 0 = No, 1 = Yes
- Pclass – Ticket class (1 = upper, 2 = middle, 3 = lower)
- Name – Passenger's name
- Sex – Passenger's gender
- Age – Age in years
- SibSp – Number of siblings/spouses aboard
- Parch – Number of parents/children aboard
- Ticket – Ticket number
- Fare – Ticket fare
- Cabin – Cabin number
- Embarked – Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
🕵️‍♀️ Why do you think 'Sex' might be an important feature in survival prediction?
☞ Because during the evacuation, the policy was "women and children first". Hence, gender likely played a major role in determining survival chances.
Step 1: Load the Data
import pandas as pd
# Load the Titanic dataset (from URL or local CSV)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
# Display first 5 rows
df.head()
   PassengerId  Survived  Pclass     Name     Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked
0            1         0       3   Braund    male  22.0      1      0  A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings  female  38.0      1      0   PC 17599  71.2833   C85        C
...
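As a quick sanity check on the 'Sex' question above, we can compute the survival rate by gender right after loading (a minimal exploration sketch; in the standard dataset the rate for women is roughly 74% versus about 19% for men):
# Mean of 'Survived' per group = survival rate by gender
df.groupby('Sex')['Survived'].mean()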
🔧 Step 2: Data Preprocessing
Let’s clean and prepare the data for the ML model:
# Drop columns we won’t use
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
# Handle missing values (assign the result back instead of calling
# fillna with inplace=True on a column, which is deprecated in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Convert categorical to numeric
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
df.head()
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_Q  Embarked_S
0         0       3    0  22.0      1      0   7.2500           0           1
1         1       1    1  38.0      1      0  71.2833           0           0
...
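Before moving on, it is worth verifying that the cleaning worked, i.e. that no missing values remain and every column is numeric (a quick check, not part of the original pipeline):
# Confirm there are no NaNs left and all dtypes are numeric
print(df.isnull().sum())
print(df.dtypes)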
Why do we convert categorical variables like 'Sex' and 'Embarked' to numbers?
☞ Because most machine learning models, including scikit-learn's, operate on numeric arrays; they cannot interpret raw text or strings directly.
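To see what get_dummies with drop_first=True actually does, here is a toy example (the 'port' column is purely illustrative). One category is dropped because it is implied when all remaining dummy columns are 0:
import pandas as pd

toy = pd.DataFrame({'port': ['S', 'C', 'Q', 'S']})
# 'C' becomes the baseline: a row with port_Q=0 and port_S=0 means 'C'
print(pd.get_dummies(toy, columns=['port'], drop_first=True))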
🧠 Step 3: Train-Test Split
from sklearn.model_selection import train_test_split
X = df.drop('Survived', axis=1)
y = df['Survived']
# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
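Since the classes are somewhat imbalanced (more passengers died than survived), an optional refinement is a stratified split, which preserves the class ratio in both sets:
# Optional variant: stratify on y so train and test keep
# the same survived/died proportion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)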
🤖 Step 4: Train the Model
from sklearn.ensemble import RandomForestClassifier
# Initialize and train the model (random_state makes the results reproducible)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
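Once the forest is fitted, we can inspect which features it relied on most (a quick diagnostic sketch using scikit-learn's impurity-based importances):
# Rank features by the forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))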
Step 5: Evaluate the Model
from sklearn.metrics import accuracy_score, classification_report
# Make predictions
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.81

              precision    recall  f1-score   support

           0       0.84      0.84      0.84       105
           1       0.75      0.75      0.75        74

    accuracy                           0.81       179
🧠 Model Explanation
- Accuracy: Proportion of total correct predictions.
- Precision: How many predicted positives were actually positive.
- Recall: How many actual positives were captured by the model.
- F1-score: Harmonic mean of precision and recall.
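All four metrics derive from the confusion matrix, which is worth printing alongside the report (a short sketch reusing the predictions from Step 5):
from sklearn.metrics import confusion_matrix

# Rows = actual class, columns = predicted class:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))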
✨ Why use Random Forest instead of a single Decision Tree?
☞ Random Forest combines many decision trees, reducing overfitting and improving prediction accuracy.
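You can check this claim on this dataset with cross-validation (a sketch; exact scores will vary with the data and library version):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compare mean 5-fold accuracy of a single tree vs. the forest
for clf in (DecisionTreeClassifier(random_state=42), RandomForestClassifier(random_state=42)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))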
Summary
- We explored, cleaned, and preprocessed the Titanic dataset.
- We trained a Random Forest Classifier to predict survival.
- We evaluated it using accuracy and classification metrics.
🚀 What’s Next?
Try experimenting with other algorithms such as Logistic Regression or K-Nearest Neighbors, and improve the model further with feature engineering or hyperparameter tuning.
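As a starting point for hyperparameter tuning, a small grid search over the Random Forest might look like this (the grid values are illustrative starting points, not tuned recommendations):
from sklearn.model_selection import GridSearchCV

# Illustrative grid; widen or narrow as needed
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))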