Apache Spark CourseApache Spark Course1

Module 12: Project – Real-World Data PipelineModule 12: Project – Real-World Data Pipeline1

Movie Ratings Project with Apache Spark



Movie Ratings Project with Apache Spark

In this project, we will use Apache Spark and PySpark to analyze a real-world movie ratings dataset. The goal is to understand how to build a data pipeline: from loading raw data, performing transformations, analyzing results, and drawing insights.

Dataset Overview

We will use a sample version of the popular MovieLens dataset, which includes user-generated movie ratings. This dataset is widely used in recommendation systems research.

Project Objectives

Step 1: Load the Data

We’ll start by loading the two CSV files into Spark DataFrames using PySpark.


from pyspark.sql import SparkSession

spark = SparkSession.builder     .appName("Movie Ratings Project")     .getOrCreate()

# Load ratings and movies
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
movies = spark.read.csv("movies.csv", header=True, inferSchema=True)

ratings.show(5)
movies.show(5)
    
+-------+-------+------+-------------+
|userId |movieId|rating|timestamp    |
+-------+-------+------+-------------+
|1      |31     |2.5   |1260759144   |
|1      |1029   |3.0   |1260759179   |
|1      |1061   |3.0   |1260759182   |
|1      |1129   |2.0   |1260759185   |
|1      |1172   |4.0   |1260759205   |
+-------+-------+------+-------------+
    

Step 2: Merge Ratings with Movie Titles

To analyze movie names instead of just IDs, we join both DataFrames on movieId.


combined = ratings.join(movies, on="movieId", how="inner")
combined.select("title", "rating").show(5)
    
+--------------------+------+
|title               |rating|
+--------------------+------+
|Dangerous Minds (...|2.5   |
|Dumbo (1941)        |3.0   |
|Sleepers (1996)     |3.0   |
|Escape from L.A. ...|2.0   |
|Lost Highway (1997) |4.0   |
+--------------------+------+
    

Step 3: Average Rating per Movie

We calculate the average rating for each movie.


from pyspark.sql.functions import avg

average_ratings = combined.groupBy("title").agg(avg("rating").alias("avg_rating"))
average_ratings.orderBy("avg_rating", ascending=False).show(10)
    
+------------------------+----------+
|title                   |avg_rating|
+------------------------+----------+
|Shawshank Redemption...|4.8       |
|Godfather, The (1972)  |4.7       |
|Fight Club (1999)      |4.6       |
|Inception (2010)       |4.6       |
|Pulp Fiction (1994)    |4.5       |
+------------------------+----------+
    

Question:

Why do some average ratings seem artificially high?

Answer:

Some movies have only 1 or 2 reviews. With such low volume, a single 5-star rating can skew the average. We’ll fix this next.

Step 4: Filter by Minimum Number of Ratings

Let’s keep only those movies that have at least 100 ratings.


from pyspark.sql.functions import count

ratings_count = combined.groupBy("title").agg(
    count("rating").alias("rating_count"),
    avg("rating").alias("avg_rating")
)

filtered = ratings_count.filter("rating_count >= 100")
filtered.orderBy("avg_rating", ascending=False).show(10)
    
+------------------------+-------------+----------+
|title                   |rating_count |avg_rating|
+------------------------+-------------+----------+
|Shawshank Redemption...|1200         |4.7       |
|Forrest Gump (1994)     |1100         |4.6       |
|Matrix, The (1999)      |1050         |4.6       |
|Pulp Fiction (1994)     |1001         |4.5       |
+------------------------+-------------+----------+
    

Step 5: Most Popular Genre by Average Rating

We now find the best-rated genres. First, we need to explode the genres column.


from pyspark.sql.functions import explode, split

# Split genre string into array and explode
genres_split = movies.withColumn("genre", explode(split("genres", "\|")))
joined = ratings.join(genres_split, "movieId")

# Average rating by genre
genre_avg = joined.groupBy("genre").agg(avg("rating").alias("avg_rating"))
genre_avg.orderBy("avg_rating", ascending=False).show()
    
+------------+----------+
|genre       |avg_rating|
+------------+----------+
|Documentary |4.3       |
|Drama       |4.1       |
|Animation   |4.0       |
|Mystery     |3.9       |
|Comedy      |3.8       |
+------------+----------+
    

Conclusion

In this project, you learned how to build a real-world Spark data pipeline to process and analyze movie ratings data. You:

By doing this project, you now understand how Spark works on real data and how scalable processing helps in making informed decisions from large datasets.


Welcome to ProgramGuru

Sign up to start your journey with us

Support ProgramGuru.org

Mention your name, and programguru.org in the message. Your name shall be displayed in the sponsers list.

PayPal

UPI

PhonePe QR

MALLIKARJUNA M