Machine Learning for Beginners

Titanic Survival Prediction Using Machine Learning



Project Overview

In this project, we will build a machine learning model to predict whether a passenger survived the Titanic shipwreck based on features such as age, sex, passenger class, etc.

This is one of the most famous beginner datasets from Kaggle and is perfect for understanding real-world ML workflows.

Dataset Description

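We will use the Titanic dataset, which contains the following columns: PassengerId, Survived (the target: 0 = did not survive, 1 = survived), Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.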

🕵️‍♀️ Why do you think 'Sex' might be an important feature in survival prediction?

☞ Because during the evacuation, the policy was "women and children first". Hence, gender likely played a major role in determining survival chances.

Step 1: Load the Data


import pandas as pd

# Load the Titanic dataset (from URL or local CSV)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)

# Display first 5 rows
df.head()
   PassengerId  Survived  Pclass     Name   Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked
0            1         0       3  Braund   male  22.0      1      0  A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings  female 38.0      1      0  PC 17599  71.2833   C85        C
...
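Before cleaning, it can help to check which columns actually contain missing values. The sketch below uses a small hand-made frame named `sample` (the rows are illustrative stand-ins, not an excerpt of the real dataset) so that it runs without a download. On the real data, the same calls would be made on `df`.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame that mimics a few Titanic columns (illustrative values only).
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Share of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a large share of missing values are candidates for dropping, while columns with only a few gaps are usually filled in instead.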

🔧 Step 2: Data Preprocessing

Let’s clean and prepare the data for the ML model:


# Drop columns we won’t use
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

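# Handle missing values (assign the result back instead of using a chained
# inplace=True, which newer pandas versions warn about and may not apply)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])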

# Convert categorical to numeric
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
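# dtype=int keeps the dummy columns as 0/1 (recent pandas versions default to True/False)
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True, dtype=int)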

df.head()
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_Q  Embarked_S
0         0       3    0  22.0      1      0   7.2500           0           1
1         1       1    1  38.0      1      0  71.2833           0           0
...

Why do we convert categorical variables like 'Sex' and 'Embarked' to numbers?

☞ Because most machine learning algorithms, including the scikit-learn estimators used here, operate on numeric arrays and cannot interpret text labels directly.
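To make the two encodings concrete, here is a minimal sketch on a three-row toy frame (the frame `toy` and its values are made up for illustration). It applies the same `map` and `get_dummies` calls used in this step, with `dtype=int` so the dummy columns come out as 0/1.

```python
import pandas as pd

# Toy frame with one binary and one three-valued categorical column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Binary column: map each label to an integer.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Multi-valued column: one-hot encode. drop_first=True removes the first
# level ("C"), so a row with both dummies equal to 0 means Embarked == "C".
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids a redundant column: with three ports, two dummy columns are enough to identify each one.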

🧠 Step 3: Train-Test Split


from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis=1)
y = df['Survived']

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
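As a quick check of what the split produces, the sketch below runs `train_test_split` on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical names, not part of the tutorial). It also passes `stratify=y_demo`, an optional argument the code above does not use, which keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 40% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 40 + [0] * 60)

# Same 80/20 split as above, plus stratification on the label (an assumption
# added here for illustration).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # number of rows and columns in each part
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```

Stratifying tends to matter most on small or imbalanced datasets, where a purely random split can leave the test set with an unrepresentative class mix.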

🤖 Step 4: Train the Model


from sklearn.ensemble import RandomForestClassifier

# Initialize and train the model
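# A fixed random_state makes the run reproducible; exact scores can still
# differ slightly from the sample output shown in Step 5
model = RandomForestClassifier(random_state=42)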
model.fit(X_train, y_train)

Step 5: Evaluate the Model


from sklearn.metrics import accuracy_score, classification_report

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.81
              precision    recall  f1-score   support

           0       0.84      0.84      0.84       105
           1       0.75      0.75      0.75        74

    accuracy                           0.81       179
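The precision and recall figures in the report all derive from the confusion matrix. The sketch below works through that arithmetic on eight made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not the model's actual predictions).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for illustration.
y_true_demo = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

# Metrics for class 1 (survived), computed by hand.
precision_1 = tp / (tp + fp)   # of those predicted to survive, how many did
recall_1 = tp / (tp + fn)      # of those who survived, how many were found
accuracy = (tp + tn) / cm.sum()

print(cm)
print(precision_1, recall_1, accuracy)
print(accuracy_score(y_true_demo, y_pred_demo))  # matches the manual value
```

Reading the report this way makes it easier to see whether the errors are mostly false positives or false negatives, which accuracy alone hides.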

🧠 Model Explanation

✨ Why use Random Forest instead of a single Decision Tree?

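☞ A single decision tree tends to overfit the training data. Random Forest trains many trees, each on a bootstrap sample of the rows with a random subset of features considered at each split, and combines their predictions. Averaging over many de-correlated trees reduces variance, which usually improves accuracy on unseen data.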
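One way to see the difference is to cross-validate both models on the same data. The sketch below uses a synthetic dataset from `make_classification` rather than the Titanic data, so the numbers it prints are only illustrative. The forest often scores higher, but that is not guaranteed on every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in, not the Titanic dataset).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(tree, X_syn, y_syn, cv=5)
forest_scores = cross_val_score(forest, X_syn, y_syn, cv=5)

print("Decision tree:", tree_scores.mean())
print("Random forest:", forest_scores.mean())
```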

Summary

🚀 What’s Next?

Try experimenting with other algorithms like Logistic Regression or K-Nearest Neighbors, and try improving the model using feature engineering or hyperparameter tuning.
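As a starting point for those experiments, here is a minimal sketch that cross-validates Logistic Regression and K-Nearest Neighbors, then tunes one hyperparameter with `GridSearchCV`. It runs on synthetic data, and the scaling step and the grid of `n_neighbors` values are assumptions chosen for illustration. To try it on the Titanic data, substitute `X` and `y` from Step 3.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with X and y from the tutorial.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

# Both models are sensitive to feature scale, so each gets a scaler in front.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean 5-fold cross-validated accuracy per model.
results = {
    name: cross_val_score(est, X_syn, y_syn, cv=5).mean()
    for name, est in candidates.items()
}
print(results)

# Hyperparameter tuning: search over the number of neighbors for KNN.
# make_pipeline names each step after its lowercased class name.
search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]},
    cv=5,
)
search.fit(X_syn, y_syn)
print(search.best_params_, search.best_score_)
```

Wrapping the scaler and the model in one pipeline means the scaler is refit on each training fold, which avoids leaking information from the validation fold.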


