
Machine Learning – Loading Datasets in Python (CSV, Excel, Built-in, and URLs)




Before we build any Machine Learning model, the first step is to load the dataset into memory. In this tutorial, you’ll learn how to load datasets in Python with pandas and scikit-learn. We'll cover four sources: local CSV files, Excel files, scikit-learn's built-in datasets, and files hosted at a URL.

Prerequisites

Make sure you have the following libraries installed:

pip install pandas scikit-learn openpyxl

1. Loading CSV Files using Pandas

CSV (Comma-Separated Values) is the most common format used for datasets.

import pandas as pd

# Load dataset from local CSV file
df = pd.read_csv("data/iris.csv")

# Show first 5 rows
print(df.head())

Q: What if the CSV file is not in the same folder as your code?

A: Provide an absolute path or a path relative to the directory you run the script from. For example:

df = pd.read_csv("/Users/yourname/Downloads/iris.csv")
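One way to keep paths from breaking when the script is run from a different folder is to build them with pathlib. The sketch below is only an illustration: a temporary folder stands in for your own data directory, and the file name and its contents are made up so the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A temporary folder stands in for your own data directory (hypothetical).
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "iris_sample.csv"

# Write a tiny made-up CSV so the example runs on its own.
csv_path.write_text("sepal_length,species\n5.1,setosa\n7.0,versicolor\n")

# read_csv accepts a Path object as well as a plain string.
df = pd.read_csv(csv_path)

print(csv_path)
print(df)
```

Joining path parts with the / operator avoids hard-coding separators, which differ between Windows and macOS/Linux.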

Useful Parameters:
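A few read_csv parameters that come up often are sep, header, names, na_values, and nrows. The sketch below tries them on a small in-memory string rather than a real file; the data and the column names are made up for illustration.

```python
import io

import pandas as pd

# Made-up, semicolon-separated data with no header row and "?" as a placeholder.
raw = "5.1;3.5;setosa\n4.9;?;setosa\n7.0;3.2;versicolor\n6.3;3.3;virginica\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",          # column delimiter (the default is ",")
    header=None,      # the data has no header row
    names=["sepal_length", "sepal_width", "species"],  # supply column names
    na_values=["?"],  # treat "?" as a missing value
    nrows=3,          # read only the first 3 data rows
)

print(df)
print(df.dtypes)
```

Because "?" is declared as a missing value, the sepal_width column is parsed as numeric instead of falling back to text.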


2. Loading Excel Files

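Excel files usually have .xls or .xlsx extensions. Use the read_excel function from pandas. The openpyxl package installed above handles .xlsx files; the older .xls format needs a different engine such as xlrd.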

import pandas as pd

# Reading .xlsx files requires the openpyxl package
df = pd.read_excel("data/sales_data.xlsx")

# Show top rows
print(df.head())

Q: How do you load a specific sheet?

A: Use the sheet_name parameter:

df = pd.read_excel("data/sales_data.xlsx", sheet_name="Q1")
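Besides a sheet name, sheet_name also accepts a zero-based position, and passing None loads every sheet at once. The sketch below first writes a small two-sheet workbook so it can run on its own; the file name and the numbers are made up, and writing the file assumes openpyxl is installed as in the prerequisites.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small two-sheet workbook with made-up numbers (hypothetical file).
xlsx_path = Path(tempfile.mkdtemp()) / "sales_sample.xlsx"
with pd.ExcelWriter(xlsx_path) as writer:
    pd.DataFrame({"month": ["Jan", "Feb"], "sales": [100, 120]}).to_excel(
        writer, sheet_name="Q1", index=False
    )
    pd.DataFrame({"month": ["Apr", "May"], "sales": [90, 150]}).to_excel(
        writer, sheet_name="Q2", index=False
    )

# sheet_name=None returns a dict mapping each sheet name to a DataFrame.
sheets = pd.read_excel(xlsx_path, sheet_name=None)

# sheet_name=0 selects the first sheet by position.
first_sheet = pd.read_excel(xlsx_path, sheet_name=0)

print(list(sheets))
print(first_sheet)
```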

3. Loading Built-in Datasets from Scikit-Learn

Scikit-learn comes with several built-in datasets like iris, digits, wine, and breast cancer. These are perfect for practice.

from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Features (X) and labels (y)
X = iris.data
y = iris.target

# Feature names, the first sample, and its label
print("Feature names:", iris.feature_names)
print("First row of data:", X[0])
print("Target:", y[0])

Q: What is the format of built-in datasets in sklearn?

A: They are returned as a Bunch object, a dictionary-like container whose keys (such as data, target, and feature_names) can also be accessed as attributes.
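To make the dictionary comparison concrete, the sketch below accesses the same array by key and by attribute, then uses the as_frame=True option (available in recent scikit-learn versions) to get a pandas DataFrame instead of NumPy arrays. The variable names are chosen here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch supports both dictionary-style and attribute-style access.
print(list(iris.keys()))
same_array = iris["data"] is iris.data

# as_frame=True asks scikit-learn to return pandas objects instead of arrays.
iris_frame = load_iris(as_frame=True)
df = iris_frame.frame  # the four feature columns plus a "target" column

print(df.head())
```

Working with the DataFrame form can be convenient when you want to inspect the data with the same pandas tools used for CSV and Excel files.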

Other datasets:

from sklearn.datasets import load_wine, load_digits

wine = load_wine()
digits = load_digits()

4. Loading Datasets from URLs

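If the data is hosted online as a publicly accessible file (for example, a raw file on GitHub), you can pass the URL directly to read_csv. Sites that require a login, such as Kaggle, generally need an extra download step first.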

import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

df = pd.read_csv(url)

print(df.head())

Q: What if the dataset is zipped or needs authentication?

A: You may need to download the file first with a library such as requests, unpack it with zipfile, or supply an API key, depending on the site.
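As a rough sketch of the zipped case, the example below reads a CSV out of a zip archive held in memory. To stay runnable offline it builds the archive itself from made-up data; in practice the bytes might come from something like requests.get(url).content, and the member name inside the archive is an assumption you would replace with the real one.

```python
import io
import zipfile

import pandas as pd

# Build a zip archive in memory so the example runs offline.
# In a real script these bytes would be the downloaded file contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "iris_sample.csv",  # hypothetical member name
        "sepal_length,species\n5.1,setosa\n6.3,virginica\n",
    )
zip_bytes = buffer.getvalue()

# Open the archive and read the CSV member straight into a DataFrame.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    member_names = archive.namelist()
    with archive.open("iris_sample.csv") as member:
        df = pd.read_csv(member)

print(member_names)
print(df)
```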


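Summary

Local CSV file: pd.read_csv("data/iris.csv")
Excel file: pd.read_excel("data/sales_data.xlsx", sheet_name="Q1")
Built-in scikit-learn dataset: load_iris(), load_wine(), load_digits()
CSV file at a URL: pd.read_csv(url)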

Intuition Check

Q: Why do we usually convert the loaded data into X and y?

A: In supervised learning, we separate features (X) and target (y) for model training. X helps the model learn patterns, and y is what we want the model to predict.
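When the data arrives as a DataFrame rather than a Bunch, one common way to make this split is to drop the target column for X and select it for y. The sketch below uses a few made-up rows, with "species" assumed to be the target column.

```python
import pandas as pd

# Made-up rows; "species" plays the role of the target column.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3],
        "sepal_width": [3.5, 3.2, 3.3],
        "species": ["setosa", "versicolor", "virginica"],
    }
)

X = df.drop(columns=["species"])  # every column except the target
y = df["species"]                 # the column the model should predict

print(X)
print(y)
```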


Next Step

Now that you’ve learned how to load data, the next step is to preprocess it: handling missing values, encoding categorical features, scaling, and so on.


