Practical Examples with Real Datasets
In this section, we'll explore how to use PySpark with real datasets. These hands-on examples will help you apply everything you’ve learned so far about DataFrames and transformations.
Setting Up PySpark
Before working with datasets, we must initialize PySpark in our Python environment.
from pyspark.sql import SparkSession
# Create a Spark session
spark = (
    SparkSession.builder
    .appName("PracticalExamples")
    .getOrCreate()
)
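If you want a quick check that the session is up, printing the version and creating a tiny DataFrame is enough (optional):
# Optional smoke test: confirm the session works
print(spark.version)   # e.g. "3.5.0", depending on your installation
spark.range(3).show()  # tiny DataFrame with a single "id" column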
Example 1: Analyzing a Cars Dataset
Let's work with a small real-world sample dataset of cars. It contains each car's make, fuel efficiency (MPG), and number of cylinders (Cyl).
Loading the dataset
We'll load a CSV file into a Spark DataFrame. For demonstration, we use a public sample file. Note that Spark's CSV reader cannot fetch HTTP(S) URLs directly, so we first pull the file down with SparkFiles and read it from the local copy.
from pyspark import SparkFiles
# Download the sample CSV, then load it from the local copy
url = "https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/cars.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("cars.csv"),
                    header=True, inferSchema=True)
# Show first few rows
df.show(5)
+----------+---+---+
|      Make|MPG|Cyl|
+----------+---+---+
| Chevrolet| 18|  8|
|     Buick| 20|  6|
|    Toyota| 24|  4|
|      Ford| 22|  6|
|Volkswagen| 30|  4|
+----------+---+---+
Basic analysis
# Count total records
print("Total rows:", df.count())
# View the schema
df.printSchema()
# Group by cylinder and find average MPG
df.groupBy("Cyl").avg("MPG").show()
Total rows: 5

root
 |-- Make: string (nullable = true)
 |-- MPG: integer (nullable = true)
 |-- Cyl: integer (nullable = true)

+---+--------+
|Cyl|avg(MPG)|
+---+--------+
|  4|    27.0|
|  6|    21.0|
|  8|    18.0|
+---+--------+
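If you want named aggregate columns, agg() with alias() keeps the output readable. Here is a small variation on the query above (the avg_mpg and num_cars names are just illustrative):
from pyspark.sql.functions import avg, count

# Same grouping, but with readable column names and sorted groups
df.groupBy("Cyl").agg(
    avg("MPG").alias("avg_mpg"),    # average fuel efficiency per group
    count("*").alias("num_cars"),   # number of cars in each group
).orderBy("Cyl").show()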
Question:
Why do we use inferSchema=True while reading the CSV?
Answer:
By default, Spark reads every CSV column as a string. Setting inferSchema=True makes Spark scan the values and detect the actual data types, such as integer or double.
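Schema inference requires an extra pass over the data, so on large files you may prefer to declare the schema yourself. A minimal sketch for the cars data above:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

cars_schema = StructType([
    StructField("Make", StringType(), True),
    StructField("MPG", IntegerType(), True),
    StructField("Cyl", IntegerType(), True),
])

# Passing an explicit schema skips inference entirely
df = spark.read.csv("file://" + SparkFiles.get("cars.csv"),
                    header=True, schema=cars_schema)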
Example 2: COVID-19 Dataset Analysis
Let’s now analyze a simplified version of COVID-19 data. This dataset includes country, total cases, total deaths, and population.
Loading the dataset
# Download the OWID snapshot, then load it from the local copy
covid_url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv"
spark.sparkContext.addFile(covid_url)
covid_df = spark.read.csv("file://" + SparkFiles.get("owid-covid-latest.csv"),
                          header=True, inferSchema=True)
# Select relevant columns
covid_df = covid_df.select("location", "total_cases", "total_deaths", "population")
# Show some data
covid_df.show(5)
+-----------+-----------+------------+-----------+
|   location|total_cases|total_deaths| population|
+-----------+-----------+------------+-----------+
|Afghanistan|   225105.0|      7820.0| 38928346.0|
|    Albania|   334457.0|      3598.0|  2877797.0|
|    Algeria|   271441.0|      6881.0| 43851044.0|
|    Andorra|    47523.0|       165.0|    77265.0|
|     Angola|   105095.0|      1934.0| 32866268.0|
+-----------+-----------+------------+-----------+
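Before computing ratios, it is worth checking how many rows are missing the columns we need. A minimal check using the column names above:
from pyspark.sql.functions import col

# Count rows where total_cases or total_deaths is missing
missing = covid_df.filter(
    col("total_cases").isNull() | col("total_deaths").isNull()
).count()
print("Rows with missing case or death counts:", missing)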
Data transformation: Death rate
Let’s create a new column called death_rate = total_deaths / total_cases and sort countries by this rate.
from pyspark.sql.functions import col
# Add death_rate column
covid_df = covid_df.withColumn("death_rate", col("total_deaths") / col("total_cases"))
# Sort by highest death rate
covid_df.orderBy(col("death_rate").desc()).show(10)
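If the raw ratio is hard to read, you can also express it as a rounded percentage. A small optional variation (death_rate_pct is an illustrative name):
from pyspark.sql.functions import round as spark_round

# Same ranking, but shown as a percentage with two decimals
covid_df.withColumn(
    "death_rate_pct", spark_round(col("death_rate") * 100, 2)
).orderBy(col("death_rate").desc()).show(10)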
Question:
What happens if total_cases is 0 or null?
Answer:
Division by null yields null, and with ANSI mode disabled (Spark's long-standing default) division by zero yields null as well, so death_rate is null for those rows. Because desc() places nulls last by default, those rows fall to the bottom of the sorted output; they won't appear in the top results unless you handle them explicitly with a filter or a fill.
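One way to handle this explicitly is to filter out zero and null case counts before sorting. A minimal sketch:
# Keep only rows where the ratio is well-defined
covid_df.filter(
    col("total_cases").isNotNull() & (col("total_cases") > 0)
).orderBy(col("death_rate").desc()).show(10)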
Best Practices with Real Datasets
- Always inspect the data types using printSchema().
- Use dropna() or fillna() to handle missing values.
- When chaining multiple operations, prefer readable formatting using line breaks.
- Always test with a small sample before applying transformations to the full data (see the sketch below).
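For the last two practices, sample() gives you a small random subset to prototype on, and dropna()/fillna() handle missing values. A quick sketch (the fraction, seed, and fill value are illustrative):
# Prototype on roughly 10% of the rows before running on the full dataset
sample_df = covid_df.sample(fraction=0.1, seed=42)

# Drop rows missing key columns, or fill them with a default instead
clean_df = covid_df.dropna(subset=["total_cases", "total_deaths"])
filled_df = covid_df.fillna({"total_deaths": 0})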
Summary
Working with real datasets in PySpark teaches you how to load, inspect, clean, transform, and analyze data at scale. These practical examples not only build confidence but also lay the foundation for working with larger data pipelines using Spark.