In this section, we'll explore how to use PySpark with real datasets. These hands-on examples will help you apply everything you’ve learned so far about DataFrames and transformations.
Before working with datasets, we must initialize PySpark in our Python environment.
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName("PracticalExamples") \
    .getOrCreate()
Let’s work with a real-world dataset of cars. This data contains information about each car's make, fuel efficiency (MPG), and number of cylinders.
We'll load a CSV file into a Spark DataFrame. For demonstration, we use a public sample dataset.
# Spark can't read directly from an HTTP URL, so download the file first
from pyspark import SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/cars.csv")
df = spark.read.csv("file://" + SparkFiles.get("cars.csv"), header=True, inferSchema=True)
# Show first few rows
df.show(5)
+-------------+---+----+
|Make         |MPG|Cyl |
+-------------+---+----+
|Chevrolet    | 18| 8  |
|Buick        | 20| 6  |
|Toyota       | 24| 4  |
|Ford         | 22| 6  |
|Volkswagen   | 30| 4  |
+-------------+---+----+
# Count total records
print("Total rows:", df.count())
# View the schema
df.printSchema()
# Group by cylinder and find average MPG
df.groupBy("Cyl").avg("MPG").show()
Total rows: 5

root
 |-- Make: string (nullable = true)
 |-- MPG: integer (nullable = true)
 |-- Cyl: integer (nullable = true)

+----+--------+
|Cyl |avg(MPG)|
+----+--------+
|   4|    27.0|
|   6|    21.0|
|   8|    18.0|
+----+--------+
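groupBy() is not limited to a single average. As a rough sketch, you can compute several aggregates in one pass with agg() and give the results readable names (aliases like num_cars and avg_mpg below are just illustrative):

from pyspark.sql import functions as F

# Several aggregates per cylinder count, with readable column names
df.groupBy("Cyl").agg(
    F.count("*").alias("num_cars"),
    F.avg("MPG").alias("avg_mpg"),
    F.min("MPG").alias("min_mpg"),
    F.max("MPG").alias("max_mpg")
).orderBy("Cyl").show()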
Why do we use inferSchema=True while reading the CSV?
By default, Spark reads every column as a string. Setting inferSchema=True tells Spark to scan the values and detect the actual data types, such as integer or double.
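Note that schema inference requires an extra pass over the file. For larger datasets you can avoid that cost by declaring the schema yourself. Here is a minimal sketch, assuming the three columns shown in the sample output above and reusing the file we downloaded earlier with addFile:

from pyspark import SparkFiles
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the schema up front instead of inferring it
cars_schema = StructType([
    StructField("Make", StringType(), True),
    StructField("MPG", IntegerType(), True),
    StructField("Cyl", IntegerType(), True),
])
df_typed = spark.read.csv("file://" + SparkFiles.get("cars.csv"), header=True, schema=cars_schema)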
Let’s now analyze a simplified version of COVID-19 data. This dataset includes country, total cases, total deaths, and population.
# As before, download the file first because Spark cannot read an HTTP URL directly
spark.sparkContext.addFile("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv")
covid_df = spark.read.csv("file://" + SparkFiles.get("owid-covid-latest.csv"), header=True, inferSchema=True)
# Select relevant columns
covid_df = covid_df.select("location", "total_cases", "total_deaths", "population")
# Show some data
covid_df.show(5)
+-----------+-----------+------------+-----------+
|   location|total_cases|total_deaths| population|
+-----------+-----------+------------+-----------+
|Afghanistan|   225105.0|      7820.0| 38928346.0|
|    Albania|   334457.0|      3598.0|  2877797.0|
|    Algeria|   271441.0|      6881.0| 43851044.0|
|    Andorra|    47523.0|       165.0|    77265.0|
|     Angola|   105095.0|      1934.0| 32866268.0|
+-----------+-----------+------------+-----------+
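Real-world data like this usually has gaps. Before deriving anything from these columns, it helps to count how many nulls each one contains. A minimal sketch:

from pyspark.sql import functions as F

# Count null values in each of the selected columns
covid_df.select([
    F.sum(F.col(c).isNull().cast("int")).alias(c + "_nulls")
    for c in covid_df.columns
]).show()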
Let’s create a new column called death_rate = total_deaths / total_cases and sort countries by this rate.
from pyspark.sql.functions import col
# Add death_rate column
covid_df = covid_df.withColumn("death_rate", col("total_deaths") / col("total_cases"))
# Sort by highest death rate
covid_df.orderBy(col("death_rate").desc()).show(10)
What happens if total_cases is 0 or null?
Spark will return null for the division (dividing by zero or by null both yield null in Spark's default, non-ANSI mode). Rows with a null death_rate are placed last in a descending sort, so they won't appear among the top results, but they stay in the DataFrame unless handled explicitly with a filter or fill.
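If you want to handle those rows explicitly rather than leave them as null, you can guard the division with when() and then filter out rows where the rate could not be computed. A minimal sketch (safe_df is just an illustrative name):

from pyspark.sql.functions import col, when

# Only compute the rate when total_cases is present and non-zero; otherwise it stays null
safe_df = covid_df.withColumn(
    "death_rate",
    when(col("total_cases") > 0, col("total_deaths") / col("total_cases"))
)

# Drop the rows where the rate is still null before sorting
safe_df.filter(col("death_rate").isNotNull()).orderBy(col("death_rate").desc()).show(10)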
Before running calculations like this, it also helps to check the schema with printSchema() and to handle missing values with dropna() or fillna().
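For example, on the COVID DataFrame you could either drop incomplete rows or fill the gaps with a default value (covid_complete and covid_filled are illustrative names):

# Drop rows that are missing any of the columns we rely on
covid_complete = covid_df.dropna(subset=["total_cases", "total_deaths", "population"])

# Or keep every row and replace missing numeric values with 0
covid_filled = covid_df.fillna(0, subset=["total_cases", "total_deaths"])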
Working with real datasets in PySpark teaches you how to load, inspect, clean, transform, and analyze data at scale. These practical examples not only build confidence but also lay the foundation for working with larger data pipelines using Spark.