Machine Learning – Loading Datasets in Python
Before we build any machine learning model, the first step is to load the dataset into memory. In this tutorial, you’ll learn how to load datasets in Python with libraries such as pandas and scikit-learn. We'll cover the following:
- Loading CSV files
- Loading Excel files
- Loading built-in datasets from scikit-learn
- Loading datasets from URLs
Prerequisite
Make sure you have the following libraries installed:
pip install pandas scikit-learn openpyxl
1. Loading CSV Files using Pandas
CSV (Comma-Separated Values) is one of the most common formats for tabular datasets.
import pandas as pd
# Load dataset from local CSV file
df = pd.read_csv("data/iris.csv")
# Show first 5 rows
print(df.head())
Q: What if the CSV file is not in the same folder as your code?
A: You need to provide the full or relative path to the file. For example:
df = pd.read_csv("/Users/yourname/Downloads/iris.csv")
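As a minimal sketch of the path handling described above, the snippet below builds a path with pathlib and checks that the file exists before loading it. The temporary folder and the file name iris_sample.csv are assumptions made only so the example runs on its own; in your project you would point the path at your real file.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Create a throwaway folder and CSV so the example is self-contained.
# The file name "iris_sample.csv" is hypothetical.
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "iris_sample.csv"
csv_path.write_text("sepal_length,species\n5.1,setosa\n")

# pathlib joins path parts with the correct separator for the OS,
# and read_csv accepts Path objects as well as strings.
if csv_path.exists():
    df_path = pd.read_csv(csv_path)
    print(df_path)
else:
    print(f"File not found: {csv_path.resolve()}")
```

Checking the path first tends to give a clearer message than the FileNotFoundError raised by read_csv when the working directory is not what you expected.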
Useful Parameters:
- sep=";": If the file uses semicolons instead of commas
- header=None: If your file has no column names
- names=["col1", "col2"]: To assign column names manually
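The parameters above can be combined in a single call. The following sketch reads a small semicolon-separated sample from memory; the sample values and the column names are assumptions chosen for illustration, not part of any real file.

```python
import io

import pandas as pd

# Hypothetical in-memory file: semicolon-separated, with no header row.
raw = io.StringIO("5.1;3.5;setosa\n6.2;2.9;versicolor\n")

df_params = pd.read_csv(
    raw,
    sep=";",        # fields are separated by semicolons
    header=None,    # the first line is data, not column names
    names=["sepal_length", "sepal_width", "species"],  # assign names manually
)

print(df_params)
```

Without header=None, pandas would treat the first data row as the column names and silently drop it from the data.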
2. Loading Excel Files
Excel files usually have .xls or .xlsx extensions. Use the read_excel function from pandas (older .xls files need the xlrd package instead of openpyxl):
import pandas as pd
# Reading .xlsx files requires the openpyxl package
df = pd.read_excel("data/sales_data.xlsx")
# Show top rows
print(df.head())
Q: How do you load a specific sheet?
A: Use the sheet_name parameter:
df = pd.read_excel("data/sales_data.xlsx", sheet_name="Q1")
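To try sheet selection without an existing workbook, the sketch below first writes a small two-sheet file to a temporary folder and then reads it back. The file name, sheet contents, and column names are all assumptions for illustration, and openpyxl must be installed for the .xlsx round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical workbook with two sheets, written to a temporary folder.
xlsx_path = Path(tempfile.mkdtemp()) / "sales_sample.xlsx"
with pd.ExcelWriter(xlsx_path) as writer:
    pd.DataFrame({"region": ["North"], "revenue": [100]}).to_excel(
        writer, sheet_name="Q1", index=False
    )
    pd.DataFrame({"region": ["South"], "revenue": [250]}).to_excel(
        writer, sheet_name="Q2", index=False
    )

# Select one sheet by name.
q2 = pd.read_excel(xlsx_path, sheet_name="Q2")
print(q2)

# sheet_name=None loads every sheet into a dict keyed by sheet name.
all_sheets = pd.read_excel(xlsx_path, sheet_name=None)
print(list(all_sheets))
```

Loading all sheets at once is handy when you do not yet know which sheet holds the data you need.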
3. Loading Built-in Datasets from Scikit-Learn
Scikit-learn comes with several built-in datasets like iris, digits, wine, and breast cancer. These are perfect for practice.
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
# Features (X) and labels (y)
X = iris.data
y = iris.target
# Column names
print("Feature names:", iris.feature_names)
print("First row of data:", X[0])
print("Target:", y[0])
Q: What is the format of built-in datasets in sklearn?
A: These datasets are returned as a Bunch object, similar to a dictionary.
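A short sketch of what "similar to a dictionary" means in practice: the same data can be reached by key or by attribute. The as_frame=True option shown at the end assumes a reasonably recent scikit-learn version with pandas installed.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch supports both dictionary-style and attribute-style access.
print(list(iris.keys()))
same_array = iris["data"] is iris.data
print(same_array)

# as_frame=True returns pandas objects; .frame holds the features plus
# a "target" column in a single DataFrame.
iris_df = load_iris(as_frame=True).frame
print(iris_df.shape)
```

The DataFrame form is often more convenient for exploration, while the plain arrays are what most estimators consume.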
Other datasets:
from sklearn.datasets import load_wine, load_digits
wine = load_wine()
digits = load_digits()
4. Loading Datasets from URLs
If the data is hosted at a publicly accessible URL (e.g. a raw file on GitHub), you can load it directly by passing the URL to pandas.
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
print(df.head())
Q: What if the dataset is zipped or needs authentication?
A: You may need to use requests, zipfile, or API keys, depending on the site.
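For the zipped case, one possible approach is sketched below. To keep it runnable offline, the archive is built in memory and stands in for a download; in practice the bytes might come from something like requests.get(url).content. The member name iris_sample.csv and its contents are assumptions for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a small zip archive in memory to stand in for a downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("iris_sample.csv", "sepal_length,species\n5.1,setosa\n4.9,setosa\n")
zip_bytes = buffer.getvalue()

# Open the archive from bytes and read the CSV member straight into pandas.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    print(archive.namelist())
    with archive.open("iris_sample.csv") as member:
        df_zip = pd.read_csv(member)

print(df_zip)
```

Authentication is site-specific, so it is left out here; check the provider's documentation for how to pass a token or API key.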
Summary
- Use pd.read_csv() for CSV files
- Use pd.read_excel() for Excel files
- Use sklearn.datasets for built-in datasets
- You can load datasets directly from a URL with pandas
Intuition Check
Q: Why do we usually convert the loaded data into X and y?
A: In supervised learning, we separate features (X) and target (y) for model training. X helps the model learn patterns, and y is what we want the model to predict.
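When the data arrives as a DataFrame rather than a Bunch, the split into X and y can be sketched as follows. The tiny dataset and the choice of "species" as the target column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical small dataset; "species" is the column we want to predict.
df_xy = pd.DataFrame({
    "sepal_length": [5.1, 6.2, 5.9],
    "sepal_width": [3.5, 2.9, 3.0],
    "species": ["setosa", "versicolor", "virginica"],
})

X = df_xy.drop(columns=["species"])  # features: every column except the target
y = df_xy["species"]                 # target: the label column

print(X.shape, y.shape)
```

Keeping the target out of X matters: if the label leaks into the features, the model can appear accurate without learning anything useful.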
Next Step
Now that you’ve learned how to load data, the next step is to preprocess it: handling missing values, encoding categorical variables, scaling features, and so on.