Movie Ratings Project with Apache Spark
In this project, we will use Apache Spark and PySpark to analyze a real-world movie ratings dataset. The goal is to understand how to build a data pipeline: from loading raw data to performing transformations, analyzing the results, and drawing insights.
Dataset Overview
We will use a sample version of the popular MovieLens dataset, which includes user-generated movie ratings. This dataset is widely used in recommendation systems research.
- ratings.csv: Contains userId, movieId, rating, and timestamp
- movies.csv: Contains movieId, title, and genres
Project Objectives
- Read and merge the datasets
- Clean and preprocess the data
- Analyze average ratings per movie
- Find top-rated movies by genre
- Filter movies with a minimum number of reviews
Step 1: Load the Data
We’ll start by loading the two CSV files into Spark DataFrames using PySpark.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Movie Ratings Project").getOrCreate()
# Load ratings and movies
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
movies = spark.read.csv("movies.csv", header=True, inferSchema=True)
ratings.show(5)
movies.show(5)
+-------+-------+------+-------------+
|userId |movieId|rating|timestamp |
+-------+-------+------+-------------+
|1 |31 |2.5 |1260759144 |
|1 |1029 |3.0 |1260759179 |
|1 |1061 |3.0 |1260759182 |
|1 |1129 |2.0 |1260759185 |
|1 |1172 |4.0 |1260759205 |
+-------+-------+------+-------------+
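To see what header=True and inferSchema=True are doing, here is a pure-Python sketch (not Spark): the first CSV line supplies the column names, and each value is cast from a string, much like schema inference. The sample rows come from the output above.

```python
# Pure-Python illustration (not Spark) of header + schema inference:
# the first line supplies column names; values are cast from strings.
csv_text = (
    "userId,movieId,rating,timestamp\n"
    "1,31,2.5,1260759144\n"
    "1,1029,3.0,1260759179"
)
lines = csv_text.splitlines()
header = lines[0].split(",")

def cast(value):
    # Mimic schema inference: try int, then float, else keep the string
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            pass
    return value

rows = [dict(zip(header, map(cast, line.split(",")))) for line in lines[1:]]
print(rows[0])  # {'userId': 1, 'movieId': 31, 'rating': 2.5, 'timestamp': 1260759144}
```

Note that inferSchema costs Spark an extra pass over the file; for large production data, supplying an explicit schema is the usual alternative.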
Step 2: Merge Ratings with Movie Titles
To work with movie titles instead of just numeric IDs, we join the two DataFrames on movieId.
combined = ratings.join(movies, on="movieId", how="inner")
combined.select("title", "rating").show(5)
+--------------------+------+
|title |rating|
+--------------------+------+
|Dangerous Minds (...|2.5 |
|Dumbo (1941) |3.0 |
|Sleepers (1996) |3.0 |
|Escape from L.A. ...|2.0 |
|Lost Highway (1997) |4.0 |
+--------------------+------+
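An inner join keeps only rows whose key appears on both sides. This pure-Python sketch (toy data; the movieId 9999 is an illustrative unknown ID) shows the effect: a rating with no matching movie is simply dropped.

```python
# Toy stand-ins for the two DataFrames
ratings = [(1, 31, 2.5), (1, 1029, 3.0), (1, 9999, 4.0)]  # (userId, movieId, rating)
movies = {31: "Dangerous Minds (1995)", 1029: "Dumbo (1941)"}  # movieId -> title

# Inner join on movieId: a rating survives only if its movie is known
combined = [(movies[mid], r) for (_, mid, r) in ratings if mid in movies]
print(combined)  # [('Dangerous Minds (1995)', 2.5), ('Dumbo (1941)', 3.0)]
```

If you instead wanted to keep unmatched ratings (with a null title), that would be how="left" in the Spark join.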
Step 3: Average Rating per Movie
We calculate the average rating for each movie.
from pyspark.sql.functions import avg
average_ratings = combined.groupBy("title").agg(avg("rating").alias("avg_rating"))
average_ratings.orderBy("avg_rating", ascending=False).show(10)
+------------------------+----------+
|title |avg_rating|
+------------------------+----------+
|Shawshank Redemption...|4.8 |
|Godfather, The (1972) |4.7 |
|Fight Club (1999) |4.6 |
|Inception (2010) |4.6 |
|Pulp Fiction (1994) |4.5 |
+------------------------+----------+
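The groupBy/agg pattern above is just a grouped average. Here is the same logic in pure Python on a few toy (title, rating) pairs:

```python
from collections import defaultdict

# (title, rating) pairs, as in combined.select("title", "rating")
pairs = [("Dumbo (1941)", 3.0), ("Dumbo (1941)", 4.0), ("Sleepers (1996)", 3.0)]

totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of ratings, count]
for title, rating in pairs:
    totals[title][0] += rating
    totals[title][1] += 1

avg_ratings = {title: s / n for title, (s, n) in totals.items()}
# Equivalent of orderBy("avg_rating", ascending=False)
top = sorted(avg_ratings.items(), key=lambda kv: kv[1], reverse=True)
print(top)  # [('Dumbo (1941)', 3.5), ('Sleepers (1996)', 3.0)]
```

Spark performs the same sum-and-count aggregation, but partially on each partition before combining, which is what makes it scale.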
Question:
Why do some average ratings seem artificially high?
Answer:
Some movies have only 1 or 2 reviews. With such low volume, a single 5-star rating can skew the average. We’ll fix this next.
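The skew is easy to see with plain arithmetic: a single 5-star vote yields a perfect average, while a well-reviewed movie with real volume cannot reach it (the ratings below are made up for illustration).

```python
def avg(xs):
    return sum(xs) / len(xs)

one_vote = [5.0]                         # a single enthusiastic reviewer
many_votes = [5.0, 4.5, 4.0, 4.5, 4.0]   # a popular, well-reviewed movie

print(avg(one_vote), avg(many_votes))  # 5.0 4.4
```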
Step 4: Filter by Minimum Number of Ratings
Let’s keep only those movies that have at least 100 ratings.
from pyspark.sql.functions import count
ratings_count = combined.groupBy("title").agg(
count("rating").alias("rating_count"),
avg("rating").alias("avg_rating")
)
filtered = ratings_count.filter("rating_count >= 100")
filtered.orderBy("avg_rating", ascending=False).show(10)
+------------------------+-------------+----------+
|title |rating_count |avg_rating|
+------------------------+-------------+----------+
|Shawshank Redemption...|1200 |4.7 |
|Forrest Gump (1994) |1100 |4.6 |
|Matrix, The (1999) |1050 |4.6 |
|Pulp Fiction (1994) |1001 |4.5 |
+------------------------+-------------+----------+
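The count-then-filter idea can be sketched in pure Python; MIN_RATINGS stands in for the 100-rating threshold above, and the titles and ratings are toy data.

```python
from collections import defaultdict

ratings_by_title = defaultdict(list)
for title, rating in [("Obscure Film", 5.0), ("Popular Film", 4.0),
                      ("Popular Film", 4.5), ("Popular Film", 4.0)]:
    ratings_by_title[title].append(rating)

MIN_RATINGS = 3  # stand-in for the 100-rating threshold used above
filtered = {
    title: (len(rs), sum(rs) / len(rs))  # (rating_count, avg_rating)
    for title, rs in ratings_by_title.items()
    if len(rs) >= MIN_RATINGS
}
print(filtered)  # the 5.0-average "Obscure Film" is filtered out
```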
Step 5: Most Popular Genre by Average Rating
We now find the best-rated genres. First, we need to explode the genres column.
from pyspark.sql.functions import explode, split
# Split genre string into array and explode
genres_split = movies.withColumn("genre", explode(split("genres", r"\|")))
joined = ratings.join(genres_split, "movieId")
# Average rating by genre
genre_avg = joined.groupBy("genre").agg(avg("rating").alias("avg_rating"))
genre_avg.orderBy("avg_rating", ascending=False).show()
+------------+----------+
|genre |avg_rating|
+------------+----------+
|Documentary |4.3 |
|Drama |4.1 |
|Animation |4.0 |
|Mystery |3.9 |
|Comedy |3.8 |
+------------+----------+
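The split/explode combination can be mimicked in plain Python: split turns the pipe-delimited genres string into a list, and explode emits one output row per list element. The genre string below is illustrative, not the movie's full genre list.

```python
# One movies.csv-style row: (movieId, title, genres)
row = (1, "Toy Story (1995)", "Adventure|Animation|Children")
movie_id, title, genres = row

# split(...) then explode(...): one output row per genre
exploded = [(movie_id, title, g) for g in genres.split("|")]
print(exploded)
# [(1, 'Toy Story (1995)', 'Adventure'),
#  (1, 'Toy Story (1995)', 'Animation'),
#  (1, 'Toy Story (1995)', 'Children')]
```

Because Spark's split takes a regular expression and | is a regex metacharacter, the pattern must be escaped as \| in the Spark version.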
Conclusion
In this project, you learned how to build a real-world Spark data pipeline to process and analyze movie ratings data. You:
- Read CSV data into Spark DataFrames
- Performed joins and aggregations
- Filtered results based on business logic (minimum ratings)
- Exploded and processed multi-value columns (genres)
By completing this project, you have seen how Spark operates on real data and how scalable processing helps turn large datasets into informed decisions.