Movie Ratings Project with Apache Spark
In this project, we will use Apache Spark and PySpark to analyze a real-world movie ratings dataset. The goal is to understand how to build a data pipeline: loading raw data, transforming it, analyzing the results, and drawing insights.
Dataset Overview
We will use a sample version of the popular MovieLens dataset, which includes user-generated movie ratings. This dataset is widely used in recommendation systems research.
- ratings.csv: contains userId, movieId, rating, and timestamp
- movies.csv: contains movieId, title, and genres
Project Objectives
- Read and merge the datasets
- Clean and preprocess the data
- Analyze average ratings per movie
- Find top-rated movies by genre
- Filter movies with a minimum number of reviews
Step 1: Load the Data
We’ll start by loading the two CSV files into Spark DataFrames using PySpark.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Movie Ratings Project") \
    .getOrCreate()
# Load ratings and movies
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
movies = spark.read.csv("movies.csv", header=True, inferSchema=True)
ratings.show(5)
movies.show(5)
+------+-------+------+----------+
|userId|movieId|rating|timestamp |
+------+-------+------+----------+
|1     |31     |2.5   |1260759144|
|1     |1029   |3.0   |1260759179|
|1     |1061   |3.0   |1260759182|
|1     |1129   |2.0   |1260759185|
|1     |1172   |4.0   |1260759205|
+------+-------+------+----------+
Step 2: Merge Ratings with Movie Titles
To analyze movie titles instead of just IDs, we join the two DataFrames on movieId.
combined = ratings.join(movies, on="movieId", how="inner")
combined.select("title", "rating").show(5)
+--------------------+------+
|title               |rating|
+--------------------+------+
|Dangerous Minds (...|2.5   |
|Dumbo (1941)        |3.0   |
|Sleepers (1996)     |3.0   |
|Escape from L.A. ...|2.0   |
|Lost Highway (1997) |4.0   |
+--------------------+------+
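For intuition, the inner join above can be mimicked in plain Python. This is only a sketch with made-up sample rows (the titles are taken from the output above); it illustrates the join semantics, not how Spark executes the join.

```python
# Plain-Python sketch of an inner join on movieId (illustrative sample rows).
rating_rows = [
    {"userId": 1, "movieId": 31, "rating": 2.5},
    {"userId": 1, "movieId": 1029, "rating": 3.0},
    {"userId": 1, "movieId": 9999, "rating": 4.0},  # no matching movie
]
movie_titles = {31: "Dangerous Minds (1995)", 1029: "Dumbo (1941)"}

# Inner join: keep only ratings whose movieId has a matching movie row.
combined_rows = [
    {"title": movie_titles[r["movieId"]], "rating": r["rating"]}
    for r in rating_rows
    if r["movieId"] in movie_titles
]
print(combined_rows)  # the unmatched movieId 9999 row is dropped
```

Spark performs the same matching, but distributed across partitions and with the smaller side potentially broadcast to every executor.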
Step 3: Average Rating per Movie
We calculate the average rating for each movie.
from pyspark.sql.functions import avg
average_ratings = combined.groupBy("title").agg(avg("rating").alias("avg_rating"))
average_ratings.orderBy("avg_rating", ascending=False).show(10)
+------------------------+----------+
|title                   |avg_rating|
+------------------------+----------+
|Shawshank Redemption... |4.8       |
|Godfather, The (1972)   |4.7       |
|Fight Club (1999)       |4.6       |
|Inception (2010)        |4.6       |
|Pulp Fiction (1994)     |4.5       |
+------------------------+----------+
Question:
Why do some average ratings seem artificially high?
Answer:
Some movies have only 1 or 2 reviews. With such low volume, a single 5-star rating can skew the average. We’ll fix this next.
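The skew is easy to demonstrate with hypothetical numbers: a single enthusiastic rating produces a "perfect" average, while a widely reviewed movie settles near a more trustworthy estimate.

```python
def avg(xs):
    return sum(xs) / len(xs)

niche = [5.0]  # one enthusiastic rating
popular = [4.0, 4.5, 3.5, 4.0, 4.5, 5.0, 4.0, 3.5]  # hypothetical sample

print(avg(niche))    # 5.0 -- tops the chart on a single vote
print(avg(popular))  # 4.125 -- many votes, a steadier estimate
```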
Step 4: Filter by Minimum Number of Ratings
Let’s keep only those movies that have at least 100 ratings.
from pyspark.sql.functions import count
ratings_count = combined.groupBy("title").agg(
count("rating").alias("rating_count"),
avg("rating").alias("avg_rating")
)
filtered = ratings_count.filter("rating_count >= 100")
filtered.orderBy("avg_rating", ascending=False).show(10)
+------------------------+------------+----------+
|title                   |rating_count|avg_rating|
+------------------------+------------+----------+
|Shawshank Redemption... |1200        |4.7       |
|Forrest Gump (1994)     |1100        |4.6       |
|Matrix, The (1999)      |1050        |4.6       |
|Pulp Fiction (1994)     |1001        |4.5       |
+------------------------+------------+----------+
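The count-then-filter logic of Step 4 can be sketched in plain Python as well. This is a hedged illustration with invented sample data and a lower threshold (3 instead of 100), not the Spark execution:

```python
from collections import defaultdict

# Group hypothetical (title, rating) pairs by title.
ratings_by_title = defaultdict(list)
for title, rating in [("A", 5.0), ("B", 4.0), ("B", 4.5), ("B", 3.5)]:
    ratings_by_title[title].append(rating)

MIN_RATINGS = 3  # stand-in for the 100-rating threshold above

# Keep only titles with enough ratings; compute (count, average) per title.
qualified = {
    t: (len(rs), sum(rs) / len(rs))
    for t, rs in ratings_by_title.items()
    if len(rs) >= MIN_RATINGS
}
print(qualified)  # only "B" survives: 3 ratings, average 4.0
```

The Spark version does the same two aggregations (`count` and `avg`) in a single `groupBy`, then filters on the count column.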
Step 5: Most Popular Genre by Average Rating
We now find the best-rated genres. Since the genres column stores multiple genres in a single pipe-delimited string (e.g. "Adventure|Animation|Comedy"), we first split it into an array and explode it into one row per genre.
from pyspark.sql.functions import explode, split
# Split genre string into array and explode
genres_split = movies.withColumn("genre", explode(split("genres", r"\|")))
joined = ratings.join(genres_split, "movieId")
# Average rating by genre
genre_avg = joined.groupBy("genre").agg(avg("rating").alias("avg_rating"))
genre_avg.orderBy("avg_rating", ascending=False).show()
+-----------+----------+
|genre      |avg_rating|
+-----------+----------+
|Documentary|4.3       |
|Drama      |4.1       |
|Animation  |4.0       |
|Mystery    |3.9       |
|Comedy     |3.8       |
+-----------+----------+
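A plain-Python analogue clarifies what `split` + `explode` produce: one (movieId, genre) row per genre in the pipe-delimited string. The sample rows below are hypothetical.

```python
movie_rows = [
    {"movieId": 1, "genres": "Adventure|Animation|Comedy"},
    {"movieId": 2, "genres": "Drama"},
]

# "Explode": emit one output row per genre in each movie's genres string.
exploded = [
    {"movieId": m["movieId"], "genre": g}
    for m in movie_rows
    for g in m["genres"].split("|")
]
print(exploded)  # 4 rows: three for movie 1, one for movie 2
```

After this reshaping, a groupBy on the genre column can average ratings per genre, exactly as the Spark code above does.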
Conclusion
In this project, you learned how to build a real-world Spark data pipeline to process and analyze movie ratings data. You:
- Read CSV data into Spark DataFrames
- Performed joins and aggregations
- Filtered results based on business logic (minimum ratings)
- Exploded and processed multi-value columns (genres)
Working through this project shows how Spark operates on real data and how scalable processing supports drawing informed conclusions from large datasets.