Apache Spark Course
Module 12: Project – Real-World Data Pipeline

Movie Ratings Project with Apache Spark

In this project, we will use Apache Spark and PySpark to analyze a real-world movie ratings dataset. The goal is to learn how to build a data pipeline: loading raw data, performing transformations, analyzing the results, and drawing insights.

Dataset Overview

We will use a sample version of the popular MovieLens dataset, which includes user-generated movie ratings. This dataset is widely used in recommendation systems research.

  • ratings.csv: Contains userId, movieId, rating, and timestamp
  • movies.csv: Contains movieId, title, and genres

Project Objectives

  • Read and merge the datasets
  • Clean and preprocess the data
  • Analyze average ratings per movie
  • Find top-rated movies by genre
  • Filter movies with a minimum number of reviews

Step 1: Load the Data

We’ll start by loading the two CSV files into Spark DataFrames using PySpark.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Movie Ratings Project") \
    .getOrCreate()

# Load ratings and movies
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
movies = spark.read.csv("movies.csv", header=True, inferSchema=True)

ratings.show(5)
movies.show(5)
+-------+-------+------+-------------+
|userId |movieId|rating|timestamp    |
+-------+-------+------+-------------+
|1      |31     |2.5   |1260759144   |
|1      |1029   |3.0   |1260759179   |
|1      |1061   |3.0   |1260759182   |
|1      |1129   |2.0   |1260759185   |
|1      |1172   |4.0   |1260759205   |
+-------+-------+------+-------------+
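
The timestamp column stores Unix epoch seconds. Inside Spark you would convert it with pyspark.sql.functions.from_unixtime; as a quick plain-Python sanity check on the first value shown above:

```python
from datetime import datetime, timezone

# timestamp values in ratings.csv are Unix epoch seconds
ts = 1260759144  # first row of the output above
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.isoformat())  # 2009-12-14T02:52:24+00:00
```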

Step 2: Merge Ratings with Movie Titles

To analyze movie names instead of just IDs, we join both DataFrames on movieId.

combined = ratings.join(movies, on="movieId", how="inner")
combined.select("title", "rating").show(5)
+--------------------+------+
|title               |rating|
+--------------------+------+
|Dangerous Minds (...|2.5   |
|Dumbo (1941)        |3.0   |
|Sleepers (1996)     |3.0   |
|Escape from L.A. ...|2.0   |
|Lost Highway (1997) |4.0   |
+--------------------+------+
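
It helps to see exactly what the inner join computes. This plain-Python sketch (not Spark, all values hypothetical) matches each rating row to its title and drops ratings whose movieId has no movie row:

```python
# Plain-Python illustration of an inner join on movieId.
# Values are hypothetical, not from the real dataset.
ratings_rows = [
    {"userId": 1, "movieId": 31, "rating": 2.5},
    {"userId": 1, "movieId": 1029, "rating": 3.0},
    {"userId": 2, "movieId": 9999, "rating": 4.0},  # no matching movie row
]
titles_by_id = {31: "Movie A", 1029: "Movie B"}

# inner join: keep only ratings whose movieId appears in movies
combined = [
    {"title": titles_by_id[r["movieId"]], "rating": r["rating"]}
    for r in ratings_rows
    if r["movieId"] in titles_by_id
]
print(combined)
```

In Spark itself, since movies is small relative to ratings, you could also hint a broadcast join with pyspark.sql.functions.broadcast: ratings.join(broadcast(movies), "movieId").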

Step 3: Average Rating per Movie

We calculate the average rating for each movie.

from pyspark.sql.functions import avg

average_ratings = combined.groupBy("title").agg(avg("rating").alias("avg_rating"))
average_ratings.orderBy("avg_rating", ascending=False).show(10)
+------------------------+----------+
|title                   |avg_rating|
+------------------------+----------+
|Shawshank Redemption...|4.8       |
|Godfather, The (1972)  |4.7       |
|Fight Club (1999)      |4.6       |
|Inception (2010)       |4.6       |
|Pulp Fiction (1994)    |4.5       |
+------------------------+----------+

Question:

Why do some average ratings seem artificially high?

Answer:

Some movies have only 1 or 2 reviews. With such low volume, a single 5-star rating can skew the average. We’ll fix this next.
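
The skew is easy to see with plain arithmetic (all values below are hypothetical):

```python
# One enthusiastic reviewer gives a niche movie a perfect average,
# while a movie with many ratings settles lower.
niche = [5.0]                                        # rated once
popular = [4.0, 4.5, 5.0, 3.5, 4.0, 4.5, 4.0, 3.5]  # rated many times

print(sum(niche) / len(niche))      # 5.0
print(sum(popular) / len(popular))  # 4.125
```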

Step 4: Filter by Minimum Number of Ratings

Let’s keep only those movies that have at least 100 ratings.

from pyspark.sql.functions import count

ratings_count = combined.groupBy("title").agg(
    count("rating").alias("rating_count"),
    avg("rating").alias("avg_rating")
)

filtered = ratings_count.filter("rating_count >= 100")
filtered.orderBy("avg_rating", ascending=False).show(10)
+------------------------+-------------+----------+
|title                   |rating_count |avg_rating|
+------------------------+-------------+----------+
|Shawshank Redemption...|1200         |4.7       |
|Forrest Gump (1994)     |1100         |4.6       |
|Matrix, The (1999)      |1050         |4.6       |
|Pulp Fiction (1994)     |1001         |4.5       |
+------------------------+-------------+----------+
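
The count-then-filter logic can be sketched in plain Python (not Spark), using a hypothetical threshold of 2 ratings instead of 100 so the example stays tiny:

```python
# Group ratings by title, compute (count, average), then keep only
# titles that meet a minimum rating count. Values are hypothetical.
from collections import defaultdict

rows = [
    ("Movie A", 5.0),
    ("Movie B", 4.0), ("Movie B", 3.0), ("Movie B", 5.0),
]

by_title = defaultdict(list)
for title, rating in rows:
    by_title[title].append(rating)

MIN_RATINGS = 2  # Spark version above uses 100
filtered = {
    title: (len(rs), sum(rs) / len(rs))
    for title, rs in by_title.items()
    if len(rs) >= MIN_RATINGS
}
print(filtered)  # {'Movie B': (3, 4.0)}
```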

Step 5: Best-Rated Genres by Average Rating

We now find the best-rated genres. First, we need to explode the genres column.

from pyspark.sql.functions import explode, split

# Split genre string into array and explode
genres_split = movies.withColumn("genre", explode(split("genres", r"\|")))
joined = ratings.join(genres_split, "movieId")

# Average rating by genre
genre_avg = joined.groupBy("genre").agg(avg("rating").alias("avg_rating"))
genre_avg.orderBy("avg_rating", ascending=False).show()
+------------+----------+
|genre       |avg_rating|
+------------+----------+
|Documentary |4.3       |
|Drama       |4.1       |
|Animation   |4.0       |
|Mystery     |3.9       |
|Comedy      |3.8       |
+------------+----------+
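
What split plus explode does can be illustrated in plain Python: one row whose genres string is pipe-delimited becomes one row per genre (the genre string below is hypothetical):

```python
# A movie row with a pipe-delimited genres string...
movie = {"movieId": 1, "genres": "Animation|Children|Comedy"}

# ...becomes one (movieId, genre) row per genre, as explode(split(...)) does.
exploded = [
    {"movieId": movie["movieId"], "genre": g}
    for g in movie["genres"].split("|")
]
print([r["genre"] for r in exploded])  # ['Animation', 'Children', 'Comedy']
```

Note that split in Spark takes a regular expression, which is why the pipe is escaped as r"\|" in the Spark code above, while Python's str.split here treats "|" literally.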

Conclusion

In this project, you learned how to build a real-world Spark data pipeline to process and analyze movie ratings data. You:

  • Read CSV data into Spark DataFrames
  • Performed joins and aggregations
  • Filtered results based on business logic (minimum ratings)
  • Exploded and processed multi-value columns (genres)

This project shows how Spark handles real data and how scalable processing helps you draw informed conclusions from large datasets.