Apache Spark CourseApache Spark Course1

Module 12: Project – Real-World Data PipelineModule 12: Project – Real-World Data Pipeline1

Practical Examples with Real Datasets



Practical Examples with Real Datasets

In this section, we'll explore how to use PySpark with real datasets. These hands-on examples will help you apply everything you’ve learned so far about DataFrames and transformations.

Setting Up PySpark

Before working with datasets, we must initialize PySpark in our Python environment.


from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder     .appName("PracticalExamples")     .getOrCreate()
    

Example 1: Analyzing a Flight Delay Dataset

Let’s work with a real-world dataset: flight delays. This data contains information about flight dates, carriers, departure/arrival times, and delays.

Loading the dataset

We'll load a CSV file into a Spark DataFrame. For demonstration, we use a public sample dataset.


# Load CSV from a public URL
df = spark.read.csv("https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/cars.csv", header=True, inferSchema=True)

# Show first few rows
df.show(5)
    
+-------------+---+----+
|    Make     |MPG|Cyl |
+-------------+---+----+
|Chevrolet    | 18|  8 |
|Buick        | 20|  6 |
|Toyota       | 24|  4 |
|Ford         | 22|  6 |
|Volkswagen   | 30|  4 |
+-------------+---+----+
    

Basic analysis


# Count total records
print("Total rows:", df.count())

# View the schema
df.printSchema()

# Group by cylinder and find average MPG
df.groupBy("Cyl").avg("MPG").show()
    
Total rows: 5
root
 |-- Make: string (nullable = true)
 |-- MPG: integer (nullable = true)
 |-- Cyl: integer (nullable = true)

+----+--------+
|Cyl |avg(MPG)|
+----+--------+
|  4 |    27.0|
|  6 |    21.0|
|  8 |    18.0|
+----+--------+
    

Question:

Why do we use inferSchema=True while reading the CSV?

Answer:

By default, Spark reads all columns as strings. Using inferSchema=True tells Spark to detect the actual data types like integer or float based on the values.

Example 2: COVID-19 Dataset Analysis

Let’s now analyze a simplified version of COVID-19 data. This dataset includes country, total cases, total deaths, and population.

Loading the dataset


covid_df = spark.read.csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv", header=True, inferSchema=True)

# Select relevant columns
covid_df = covid_df.select("location", "total_cases", "total_deaths", "population")

# Show some data
covid_df.show(5)
    
+-----------+-----------+------------+-----------+
|  location |total_cases|total_deaths|population |
+-----------+-----------+------------+-----------+
|Afghanistan|   225105.0|      7820.0|38928346.0 |
|Albania    |   334457.0|      3598.0|2877797.0  |
|Algeria    |   271441.0|      6881.0|43851044.0 |
|Andorra    |    47523.0|        165.0|77265.0    |
|Angola     |   105095.0|      1934.0|32866268.0 |
+-----------+-----------+------------+-----------+
    

Data transformation: Death rate

Let’s create a new column called death_rate = total_deaths / total_cases and sort countries by this rate.


from pyspark.sql.functions import col

# Add death_rate column
covid_df = covid_df.withColumn("death_rate", col("total_deaths") / col("total_cases"))

# Sort by highest death rate
covid_df.orderBy(col("death_rate").desc()).show(10)
    

Question:

What happens if total_cases is 0 or null?

Answer:

Spark will return null for the division, and that record will not show in sorted results unless handled explicitly with a filter or fill.

Best Practices with Real Datasets

Summary

Working with real datasets in PySpark teaches you how to load, inspect, clean, transform, and analyze data at scale. These practical examples not only build confidence but also lay the foundation for working with larger data pipelines using Spark.



Welcome to ProgramGuru

Sign up to start your journey with us

Support ProgramGuru.org

Mention your name, and programguru.org in the message. Your name shall be displayed in the sponsers list.

PayPal

UPI

PhonePe QR

MALLIKARJUNA M