Machine Learning – Loading Datasets in Python (CSV, Excel, Built-in, and URLs)


Before we build any Machine Learning model, the first step is to load the dataset into memory. In this tutorial, you’ll learn how to load datasets using Python with libraries like pandas and scikit-learn. We'll cover the following:

Loading CSV files
Loading Excel files
Loading built-in datasets from scikit-learn
Loading datasets from URLs

Prerequisites

Make sure you have the following libraries installed:

pip install pandas scikit-learn openpyxl

1. Loading CSV Files using Pandas

CSV (Comma-Separated Values) is one of the most common formats for tabular datasets. Each line is a row, and the values in a row are separated by commas.

import pandas as pd

# Load dataset from local CSV file
df = pd.read_csv("data/iris.csv")

# Show first 5 rows
print(df.head())

Q: What if the CSV file is not in the same folder as your code?

A: You need to provide the full or relative path to the file. For example:

df = pd.read_csv("/Users/yourname/Downloads/iris.csv")
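A wrong path is probably the most common reason a load fails, so it can help to check the path before reading. The sketch below is one possible approach, not the only one: load_csv is a hypothetical helper name, and the temporary file only stands in for a real dataset on disk.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, failing early with a clear message if it is missing."""
    path = Path(path).expanduser()  # expands "~" to the home directory
    if not path.is_file():
        raise FileNotFoundError(f"No CSV file found at {path.resolve()}")
    return pd.read_csv(path)


# Demo with a temporary file standing in for a real dataset
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "iris.csv"
    sample.write_text("sepal_length,species\n5.1,setosa\n")
    df = load_csv(sample)

    try:
        load_csv(Path(tmp) / "missing.csv")
    except FileNotFoundError as err:
        message = str(err)

print(df)
print(message)
```

With a real file, a call such as load_csv("~/Downloads/iris.csv") would work the same way.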

Useful Parameters:

sep=";": If the file uses a semicolon instead of a comma as the delimiter
header=None: If the file has no header row
names=["col1", "col2"]: To assign column names manually
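The three parameters above can be combined in a single call. Here is a minimal sketch, assuming a made-up semicolon-separated file with no header row. The sample values and column names are hypothetical, and io.StringIO stands in for a file path so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with no header row
raw = "5.1;3.5;setosa\n6.2;2.9;versicolor\n"

df = pd.read_csv(
    io.StringIO(raw),  # stands in for a path such as "data/iris.csv"
    sep=";",           # values are separated by semicolons
    header=None,       # the first line is data, not column names
    names=["sepal_length", "sepal_width", "species"],  # assign names manually
)

print(df)
```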

2. Loading Excel Files

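Excel files usually have .xls or .xlsx extensions. Use the read_excel function from pandas. Reading .xlsx files requires the openpyxl package; legacy .xls files need xlrd instead.

import pandas as pd

# You need openpyxl to read .xlsx files
df = pd.read_excel("data/sales_data.xlsx")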

# Show top rows
print(df.head())

Q: How do you load a specific sheet?

A: Use the sheet_name parameter:

df = pd.read_excel("data/sales_data.xlsx", sheet_name="Q1")
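sheet_name also accepts a 0-based position, or None to load every sheet at once. The sketch below first writes a small two-sheet workbook so that it can run on its own. The sheet names and sales figures are made up for illustration, and it assumes openpyxl is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sales_data.xlsx"

    # Create a hypothetical workbook with two sheets
    with pd.ExcelWriter(path) as writer:
        pd.DataFrame({"region": ["North", "South"], "sales": [100, 150]}).to_excel(
            writer, sheet_name="Q1", index=False
        )
        pd.DataFrame({"region": ["North", "South"], "sales": [120, 130]}).to_excel(
            writer, sheet_name="Q2", index=False
        )

    # Select a sheet by position (0-based): 1 is the second sheet
    q2 = pd.read_excel(path, sheet_name=1)

    # sheet_name=None returns a dict mapping sheet name -> DataFrame
    all_sheets = pd.read_excel(path, sheet_name=None)

print(q2)
print(list(all_sheets))
```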

3. Loading Built-in Datasets from Scikit-Learn

Scikit-learn comes with several built-in datasets like iris, digits, wine, and breast cancer. These are perfect for practice.

from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Features (X) and labels (y)
X = iris.data
y = iris.target

# Inspect the feature names, the first sample, and its label
print("Feature names:", iris.feature_names)
print("First row of data:", X[0])
print("Target:", y[0])

Q: What is the format of built-in datasets in sklearn?

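A: These datasets are returned as a Bunch object, which behaves like a dictionary whose keys can also be read as attributes (for example, iris["data"] and iris.data refer to the same array).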
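To make the Bunch structure concrete, the sketch below lists its keys and compares the two access styles. It also uses the as_frame=True option, which asks scikit-learn to return a pandas DataFrame. This assumes a reasonably recent scikit-learn with pandas installed.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch exposes its fields like a dictionary
print(iris.keys())

# Key access and attribute access return the same underlying array
same_object = iris["data"] is iris.data

# as_frame=True returns pandas objects; .frame holds features plus the target
iris_df = load_iris(as_frame=True).frame
print(iris_df.head())
```

Working with the DataFrame version is often more convenient when you want to explore the data with the same pandas tools used for CSV and Excel files.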

Other datasets:

from sklearn.datasets import load_wine, load_digits

wine = load_wine()
digits = load_digits()

4. Loading Datasets from URLs

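If the data is hosted online as a raw file (e.g. in a public GitHub repository), you can pass the URL directly to pd.read_csv. Sites such as Kaggle may require you to sign in first, so a plain URL will not always work there.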

import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

df = pd.read_csv(url)

print(df.head())

Q: What if the dataset is zipped or needs authentication?

A: You may need extra tools, depending on the site: the requests library to download the file, the zipfile module to extract it, or an API key to authenticate.
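As a rough sketch of the zipped case, the helper below reads one CSV out of a zip archive held in memory. read_csv_from_zip_bytes and the file name iris_sample.csv are hypothetical names chosen for this example. To keep it runnable offline, the archive is built locally. With a real download, the bytes would typically come from something like requests.get(url).content, with whatever headers or API key the site requires.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip_bytes(zip_bytes, member):
    """Read one CSV file out of a zip archive that is already in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as f:
            return pd.read_csv(f)


# Build a tiny archive in memory so the example needs no network access.
# In practice zip_bytes would be the body of an HTTP response.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "iris_sample.csv",
        "sepal_length,species\n5.1,setosa\n6.2,versicolor\n",
    )

df = read_csv_from_zip_bytes(buffer.getvalue(), "iris_sample.csv")
print(df)
```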

Summary

Use pd.read_csv() for CSV files
Use pd.read_excel() for Excel files
Use sklearn.datasets for built-in datasets
Pass a URL directly to pd.read_csv() to load a dataset hosted online

Intuition Check

Q: Why do we usually convert the loaded data into X and y?

A: In supervised learning, we separate features (X) and target (y) for model training. X helps the model learn patterns, and y is what we want the model to predict.
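For data loaded into a DataFrame, this split is usually a matter of selecting columns. Here is a minimal sketch, assuming a hypothetical DataFrame in which the column named species is the target and every other column is a feature.

```python
import pandas as pd

# Hypothetical data: two feature columns and one target column
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 6.2, 5.9],
        "sepal_width": [3.5, 2.9, 3.0],
        "species": ["setosa", "versicolor", "virginica"],
    }
)

X = df.drop(columns=["species"])  # features: everything except the target
y = df["species"]                 # target: the column to predict

print(X.shape, y.shape)
```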

Next Step

Now that you’ve learned how to load data, the next step is to preprocess it: handling missing values, encoding categorical features, scaling numeric features, and so on.
