In this project, we will build a machine learning model to predict whether a passenger survived the Titanic shipwreck based on features such as age, sex, passenger class, etc.
This is one of the most famous beginner datasets from Kaggle and is perfect for understanding real-world ML workflows.
We will use the Titanic dataset, which contains the following columns:
PassengerId – Unique ID for each passenger
Survived – 0 = No, 1 = Yes
Pclass – Ticket class (1 = upper, 2 = middle, 3 = lower)
Name – Passenger name
Sex – Passenger sex
Age – Age in years
SibSp – Number of siblings/spouses aboard
Parch – Number of parents/children aboard
Ticket – Ticket number
Fare – Ticket fare
Cabin – Cabin number
Embarked – Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
☞ Why should a feature like Sex matter? During the evacuation, the policy was "women and children first", so gender likely played a major role in determining survival chances.
import pandas as pd
# Load the Titanic dataset (from URL or local CSV)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
# Display first 5 rows
df.head()
   PassengerId  Survived  Pclass     Name     Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked
0            1         0       3   Braund    male  22.0      1      0  A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings  female  38.0      1      0   PC 17599  71.2833   C85        C
...
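Before cleaning the data, it is worth checking two things: which columns contain missing values, and whether the "women and children first" pattern mentioned above actually shows up in the data. A quick exploration sketch (not part of the original walkthrough, using only the df loaded above):

# Count missing values per column; Age, Cabin and Embarked have gaps
print(df.isnull().sum())

# Survival rate by sex: females survived at a much higher rate
print(df.groupby('Sex')['Survived'].mean())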
Let’s clean and prepare the data for the ML model:
# Drop columns we won’t use
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# Handle missing values (assign back instead of fillna(..., inplace=True),
# which is deprecated for chained calls in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Convert categorical to numeric
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# dtype=int keeps the dummy columns as 0/1 integers (newer pandas defaults to bool)
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True, dtype=int)

df.head()
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_Q  Embarked_S
0         0       3    0  22.0      1      0   7.2500           0           1
1         1       1    1  38.0      1      0  71.2833           0           0
...
☞ Why convert text columns to numbers? Machine learning models work only with numbers; they cannot interpret text or strings directly.
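As a quick verification (a small sketch, not in the original), you can confirm that every remaining column is now numeric and ready for the model:

# All columns should now be int or float; dtype=int above keeps the
# dummy columns as 0/1 integers instead of booleans
print(df.dtypes)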
from sklearn.model_selection import train_test_split
X = df.drop('Survived', axis=1)
y = df['Survived']
# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
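As a sanity check (an added step, not in the original), you can confirm the split sizes. The dataset has 891 rows, so an 80/20 split yields 712 training rows and 179 test rows, which matches the support of 179 in the evaluation report below:

# Verify the 80/20 split: expect (712, 8) for training and (179, 8) for testing
print(X_train.shape, X_test.shape)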
from sklearn.ensemble import RandomForestClassifier
# Initialize and train the model (random_state added for reproducibility;
# without it, accuracy varies slightly between runs)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
from sklearn.metrics import accuracy_score, classification_report
# Make predictions
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.81

              precision    recall  f1-score   support

           0       0.84      0.84      0.84       105
           1       0.75      0.75      0.75        74

    accuracy                           0.81       179
☞ Random Forest combines many decision trees, reducing overfitting and improving prediction accuracy.
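Because the forest aggregates many trees, it can also report how much each feature contributed to its decisions. A short sketch (not in the original) using scikit-learn's standard feature_importances_ attribute:

# Rank features by how much the trained forest relied on them
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

On this dataset you will typically see Sex, Age and Fare near the top, consistent with the "women and children first" observation earlier.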
Try experimenting with other algorithms such as Logistic Regression or K-Nearest Neighbors, and see whether feature engineering or hyperparameter tuning improves the model.
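For example, here is a minimal comparison sketch (assuming the X_train/X_test split from above; max_iter is raised so Logistic Regression converges on these unscaled features):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Train and score two alternative classifiers on the same split
for name, clf in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("K-Nearest Neighbors", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))

Note that KNN is distance-based, so scaling the features first (for example with StandardScaler) usually improves its accuracy noticeably.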