Project Overview
In this project, we will build a machine learning model that predicts whether a passenger survived the Titanic shipwreck based on features such as age, sex, and passenger class.
This is one of the most famous beginner datasets from Kaggle and is perfect for understanding real-world ML workflows.
Dataset Description
We will use the Titanic dataset, which contains the following columns:
- PassengerId – Unique ID for each passenger
- Survived – 0 = No, 1 = Yes
- Pclass – Ticket class (1 = upper, 2 = middle, 3 = lower)
- Name – Passenger's name
- Sex – Passenger's gender
- Age – Age in years
- SibSp – Number of siblings/spouses aboard
- Parch – Number of parents/children aboard
- Ticket – Ticket number
- Fare – Ticket fare
- Cabin – Cabin number
- Embarked – Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
🕵️‍♀️ Why do you think 'Sex' might be an important feature in survival prediction?
☞ Because during the evacuation, the policy was "women and children first". Hence, gender likely played a major role in determining survival chances.
Step 1: Load the Data
import pandas as pd
# Load the Titanic dataset (from URL or local CSV)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
# Display first 5 rows
df.head()
   PassengerId  Survived  Pclass     Name     Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked
0            1         0       3   Braund    male  22.0      1      0  A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings  female  38.0      1      0   PC 17599  71.2833   C85        C
...
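As a quick sanity check on the 'Sex' question above, we can compute the survival rate by gender right after loading (a minimal exploration sketch; in the standard dataset the rate for women is roughly 74% versus about 19% for men):
# Mean of 'Survived' per group = survival rate by gender
df.groupby('Sex')['Survived'].mean()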
🔧 Step 2: Data Preprocessing
Let’s clean and prepare the data for the ML model:
# Drop columns we won’t use
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
# Handle missing values (assign the result back instead of calling
# fillna with inplace=True on a column, which is deprecated in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Convert categorical to numeric
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
df.head()
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_Q  Embarked_S
0         0       3    0  22.0      1      0   7.2500           0           1
1         1       1    1  38.0      1      0  71.2833           0           0
...
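Before moving on, it is worth verifying that the cleaning worked, i.e. that no missing values remain and every column is numeric (a quick check, not part of the original pipeline):
# Confirm there are no NaNs left and all dtypes are numeric
print(df.isnull().sum())
print(df.dtypes)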
Why do we convert categorical variables like 'Sex' and 'Embarked' to numbers?
☞ Because most machine learning models, including scikit-learn's, operate on numeric arrays; they cannot interpret raw text or strings directly.
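To see what get_dummies with drop_first=True actually does, here is a toy example (the 'port' column is purely illustrative). One category is dropped because it is implied when all remaining dummy columns are 0:
import pandas as pd

toy = pd.DataFrame({'port': ['S', 'C', 'Q', 'S']})
# 'C' becomes the baseline: a row with port_Q=0 and port_S=0 means 'C'
print(pd.get_dummies(toy, columns=['port'], drop_first=True))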
🧠 Step 3: Train-Test Split
from sklearn.model_selection import train_test_split
X = df.drop('Survived', axis=1)
y = df['Survived']
# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
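Since the classes are somewhat imbalanced (more passengers died than survived), an optional refinement is a stratified split, which preserves the class ratio in both sets:
# Optional variant: stratify on y so train and test keep
# the same survived/died proportion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)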
🤖 Step 4: Train the Model
from sklearn.ensemble import RandomForestClassifier
# Initialize and train the model (random_state makes the results reproducible)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
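Once the forest is fitted, we can inspect which features it relied on most (a quick diagnostic sketch using scikit-learn's impurity-based importances):
# Rank features by the forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))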
Step 5: Evaluate the Model
from sklearn.metrics import accuracy_score, classification_report
# Make predictions
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.81

              precision    recall  f1-score   support

           0       0.84      0.84      0.84       105
           1       0.75      0.75      0.75        74

    accuracy                           0.81       179
🧠 Model Explanation
- Accuracy: Proportion of total correct predictions.
- Precision: How many predicted positives were actually positive.
- Recall: How many actual positives were captured by the model.
- F1-score: Harmonic mean of precision and recall.
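All four metrics derive from the confusion matrix, which is worth printing alongside the report (a short sketch reusing the predictions from Step 5):
from sklearn.metrics import confusion_matrix

# Rows = actual class, columns = predicted class:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))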
✨ Why use Random Forest instead of a single Decision Tree?
☞ Random Forest combines many decision trees, reducing overfitting and improving prediction accuracy.
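You can check this claim on this dataset with cross-validation (a sketch; exact scores will vary with the data and library version):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compare mean 5-fold accuracy of a single tree vs. the forest
for clf in (DecisionTreeClassifier(random_state=42), RandomForestClassifier(random_state=42)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))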
Summary
- We explored, cleaned, and preprocessed the Titanic dataset.
- We trained a Random Forest Classifier to predict survival.
- We evaluated it using accuracy and classification metrics.
🚀 What’s Next?
Try experimenting with other algorithms such as Logistic Regression or K-Nearest Neighbors, and improve the model further with feature engineering or hyperparameter tuning.
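As a starting point for hyperparameter tuning, a small grid search over the Random Forest might look like this (the grid values are illustrative starting points, not tuned recommendations):
from sklearn.model_selection import GridSearchCV

# Illustrative grid; widen or narrow as needed
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))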