Practical Examples with Real Datasets
In this section, we'll explore how to use PySpark with real datasets. These hands-on examples will help you apply everything you’ve learned so far about DataFrames and transformations.
Setting Up PySpark
Before working with datasets, we must initialize PySpark in our Python environment.
from pyspark.sql import SparkSession
# Create a Spark session
spark = (
    SparkSession.builder
    .appName("PracticalExamples")
    .getOrCreate()
)
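If you want a quick check that the session is up, printing the version and creating a tiny DataFrame is enough (optional):
# Optional smoke test: confirm the session works
print(spark.version)   # e.g. "3.5.0", depending on your installation
spark.range(3).show()  # tiny DataFrame with a single "id" column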
Example 1: Analyzing a Cars Dataset
Let's work with a small real-world sample dataset of cars. It contains each car's make, fuel efficiency (MPG), and number of cylinders (Cyl).
Loading the dataset
We'll load a CSV file into a Spark DataFrame. For demonstration, we use a public sample file. Note that Spark's CSV reader cannot fetch HTTP(S) URLs directly, so we first pull the file down with SparkFiles and read it from the local copy.
from pyspark import SparkFiles
# Download the sample CSV, then load it from the local copy
url = "https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/cars.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("cars.csv"),
                    header=True, inferSchema=True)
# Show first few rows
df.show(5)
+----------+---+---+
|      Make|MPG|Cyl|
+----------+---+---+
| Chevrolet| 18|  8|
|     Buick| 20|  6|
|    Toyota| 24|  4|
|      Ford| 22|  6|
|Volkswagen| 30|  4|
+----------+---+---+
Basic analysis
# Count total records
print("Total rows:", df.count())
# View the schema
df.printSchema()
# Group by cylinder and find average MPG
df.groupBy("Cyl").avg("MPG").show()
Total rows: 5

root
 |-- Make: string (nullable = true)
 |-- MPG: integer (nullable = true)
 |-- Cyl: integer (nullable = true)

+---+--------+
|Cyl|avg(MPG)|
+---+--------+
|  4|    27.0|
|  6|    21.0|
|  8|    18.0|
+---+--------+
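If you want named aggregate columns, agg() with alias() keeps the output readable. Here is a small variation on the query above (the avg_mpg and num_cars names are just illustrative):
from pyspark.sql.functions import avg, count

# Same grouping, but with readable column names and sorted groups
df.groupBy("Cyl").agg(
    avg("MPG").alias("avg_mpg"),    # average fuel efficiency per group
    count("*").alias("num_cars"),   # number of cars in each group
).orderBy("Cyl").show()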
Question:
Why do we use inferSchema=True while reading the CSV?
Answer:
By default, Spark reads every CSV column as a string. Setting inferSchema=True makes Spark scan the values and detect the actual data types, such as integer or double.
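Schema inference requires an extra pass over the data, so on large files you may prefer to declare the schema yourself. A minimal sketch for the cars data above:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

cars_schema = StructType([
    StructField("Make", StringType(), True),
    StructField("MPG", IntegerType(), True),
    StructField("Cyl", IntegerType(), True),
])

# Passing an explicit schema skips inference entirely
df = spark.read.csv("file://" + SparkFiles.get("cars.csv"),
                    header=True, schema=cars_schema)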
Example 2: COVID-19 Dataset Analysis
Let’s now analyze a simplified version of COVID-19 data. This dataset includes country, total cases, total deaths, and population.
Loading the dataset
# Download the OWID snapshot, then load it from the local copy
covid_url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv"
spark.sparkContext.addFile(covid_url)
covid_df = spark.read.csv("file://" + SparkFiles.get("owid-covid-latest.csv"),
                          header=True, inferSchema=True)
# Select relevant columns
covid_df = covid_df.select("location", "total_cases", "total_deaths", "population")
# Show some data
covid_df.show(5)
+-----------+-----------+------------+-----------+
|   location|total_cases|total_deaths| population|
+-----------+-----------+------------+-----------+
|Afghanistan|   225105.0|      7820.0| 38928346.0|
|    Albania|   334457.0|      3598.0|  2877797.0|
|    Algeria|   271441.0|      6881.0| 43851044.0|
|    Andorra|    47523.0|       165.0|    77265.0|
|     Angola|   105095.0|      1934.0| 32866268.0|
+-----------+-----------+------------+-----------+
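Before computing ratios, it is worth checking how many rows are missing the columns we need. A minimal check using the column names above:
from pyspark.sql.functions import col

# Count rows where total_cases or total_deaths is missing
missing = covid_df.filter(
    col("total_cases").isNull() | col("total_deaths").isNull()
).count()
print("Rows with missing case or death counts:", missing)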
Data transformation: Death rate
Let’s create a new column called death_rate = total_deaths / total_cases and sort countries by this rate.
from pyspark.sql.functions import col
# Add death_rate column
covid_df = covid_df.withColumn("death_rate", col("total_deaths") / col("total_cases"))
# Sort by highest death rate
covid_df.orderBy(col("death_rate").desc()).show(10)
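If the raw ratio is hard to read, you can also express it as a rounded percentage. A small optional variation (death_rate_pct is an illustrative name):
from pyspark.sql.functions import round as spark_round

# Same ranking, but shown as a percentage with two decimals
covid_df.withColumn(
    "death_rate_pct", spark_round(col("death_rate") * 100, 2)
).orderBy(col("death_rate").desc()).show(10)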
Question:
What happens if total_cases is 0 or null?
Answer:
Division by null yields null, and with ANSI mode disabled (Spark's long-standing default) division by zero yields null as well, so death_rate is null for those rows. Because desc() places nulls last by default, those rows fall to the bottom of the sorted output; they won't appear in the top results unless you handle them explicitly with a filter or a fill.
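One way to handle this explicitly is to filter out zero and null case counts before sorting. A minimal sketch:
# Keep only rows where the ratio is well-defined
covid_df.filter(
    col("total_cases").isNotNull() & (col("total_cases") > 0)
).orderBy(col("death_rate").desc()).show(10)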
Best Practices with Real Datasets
- Always inspect the data types using printSchema().
- Use dropna() or fillna() to handle missing values.
- When chaining multiple operations, prefer readable formatting using line breaks.
- Always test with a small sample before applying transformations to the full data (see the sketch below).
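For the last two practices, sample() gives you a small random subset to prototype on, and dropna()/fillna() handle missing values. A quick sketch (the fraction, seed, and fill value are illustrative):
# Prototype on roughly 10% of the rows before running on the full dataset
sample_df = covid_df.sample(fraction=0.1, seed=42)

# Drop rows missing key columns, or fill them with a default instead
clean_df = covid_df.dropna(subset=["total_cases", "total_deaths"])
filled_df = covid_df.fillna({"total_deaths": 0})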
Summary
Working with real datasets in PySpark teaches you how to load, inspect, clean, transform, and analyze data at scale. These practical examples not only build confidence but also lay the foundation for working with larger data pipelines using Spark.