My SCAMP Final Project — A Simple Movie Recommender System with Python

May 20, 2021

SCAMP — She Code Africa Mentoring Program

Three months ago, I enrolled in the Data Science Track of the SCAMP. As a data enthusiast, I was looking forward to the network and mentorship. I was particularly excited about the cool projects I could work on as a data science mentee. For my final project, I decided on a simple movie recommender system as I was fascinated with the content recommendation systems on YouTube and Netflix.

Contents

Introduction
Dataset
Methodology
Results
Conclusion
References

Introduction

A popular application of data science is the recommender system. Recommender systems are used to predict the “rating” or “preference” that a user would give to an item. Tech companies apply them in different forms: Amazon uses them to suggest products to customers, and YouTube uses them to decide which video to play next on autoplay.

A recommender system is a filtration program with the primary objective of predicting a user’s preference for a domain-specific object or item. The domain-specific item in this project is a movie, so the main goal of this recommendation system is to filter and predict only those movies that a user would prefer based on some information about the user.

There are two major types of filtering: content-based filtering and collaborative filtering.

Content-based filtering is based on the information provided about the items. The algorithm recommends products that are similar to the ones that a user has liked in the past. This similarity is computed from the data we have about the items as well as the user’s past preferences. The disadvantage of this type is that the user is not exposed to a wide range of products.
Collaborative filtering is a strategy that compares and contrasts a user’s behavior with the behavior of other users in the database. The history of all users plays an important role in this algorithm.

The main difference between content-based filtering and collaborative filtering is that in the latter, the interaction of all users with the items influences the recommendation algorithm, while in content-based filtering only the concerned user’s data is taken into account.
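To make the contrast concrete, here is a minimal sketch with made-up data (the movie names, genre flags and ratings are all hypothetical, not taken from MovieLens). The content-based view compares movies by their own attributes, while the collaborative view compares them by how users rated them.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Content-based view: each movie is described by its own attributes.
# Hypothetical genre flags in the order [Action, Sci-Fi, Romance].
genre_features = {
    "Movie A": np.array([1, 1, 0]),
    "Movie B": np.array([1, 1, 0]),
    "Movie C": np.array([0, 0, 1]),
}

# Collaborative view: each movie is described by how users rated it.
# Hypothetical ratings from four users (0 means "not rated").
user_ratings = {
    "Movie A": np.array([5.0, 4.0, 0.0, 1.0]),
    "Movie B": np.array([1.0, 0.0, 5.0, 4.0]),
    "Movie C": np.array([5.0, 5.0, 0.0, 1.0]),
}

# Same genres, so the content-based similarity of A and B is at its maximum
content_sim_ab = cosine_similarity(genre_features["Movie A"], genre_features["Movie B"])

# Different audiences liked A and B, so their collaborative similarity is low
collab_sim_ab = cosine_similarity(user_ratings["Movie A"], user_ratings["Movie B"])

# The same users liked A and C, so their collaborative similarity is high
collab_sim_ac = cosine_similarity(user_ratings["Movie A"], user_ratings["Movie C"])

print(content_sim_ab, collab_sim_ab, collab_sim_ac)
```

In this toy data, Movie A and Movie C share no genre, yet the collaborative view treats them as close because the same users rated both highly.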

Dataset

The dataset in this project is the MovieLens Latest Datasets. The dataset consists of movies released on or before August 2018. This dataset captures 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.

This dataset consists of the following files:

movies.csv: This file contains information on ~9,000 movies featured in the Full MovieLens dataset. Each line of this file after the header row represents one movie, and has the following format: movieId, title, genres. Movie titles include the year of release in parentheses. Genres are a pipe-separated list.
tags.csv: Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format: userId, movieId, tag, timestamp. Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase.
links.csv: This file contains the TMDB and IMDB IDs of all the movies featured in the dataset. Each line of this file after the header row represents one movie, and has the following format: movieId, imdbId, tmdbId.
ratings.csv: Each line of this file after the header row represents one rating of one movie by one user, and has the following format: userId, movieId, rating, timestamp. Ratings are made on a 5-star scale, with half-star increments (0.5 stars — 5.0 stars).
READ_ME.txt: This file contains the data summary, citations, usage licenses and data descriptions for each csv file.
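As a small illustration of the movies.csv layout described above, the sketch below parses two abbreviated sample rows. The rows are typed in by hand for the example, and the column names year and genre_list are my own, not part of the dataset.

```python
import io
import pandas as pd

# Two abbreviated sample rows in the movies.csv layout (movieId, title, genres)
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
sample_movies = pd.read_csv(sample_csv)

# Pull the four-digit release year out of the trailing parentheses
sample_movies["year"] = (
    sample_movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
)

# Turn the pipe-separated genre string into a Python list
sample_movies["genre_list"] = sample_movies["genres"].str.split("|")

print(sample_movies[["title", "year", "genre_list"]])
```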

The Full MovieLens Dataset comprises 27 million ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. It also contains tag genome data with 14 million relevance scores across 1,100 tags. It can be accessed from the official GroupLens website and was last updated in September 2018.

Methodology

In this project, I used the Python libraries Pandas, NumPy, scikit-learn and SciPy. I ran the code in Google Colab, which allows me to write and execute Python in my browser. I used the ratings.csv and movies.csv files for this project. The first step was importing the data into my Colab notebook.

# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# importing datasets
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

The .head() command is then used to print the first 5 rows of each of the datasets.

movies.head()

ratings.head()

From the output above, it can be seen that the user with userId 1 has watched the movies with movieId 1, 3, 6, 47 and 50 and rated them 4.0, 4.0, 4.0, 5.0 and 5.0 respectively. But the user did not rate some movies, such as movieId 2, 4 and 7. To make this dataframe easier to interpret, I applied the pivot method to the ratings dataset, so that each column represents a unique userId and each row represents a unique movieId.

dataset = ratings.pivot(index='movieId', columns='userId', values='rating')
dataset.head()
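To see what the pivot does on a small scale, here is a sketch with a hypothetical four-row ratings table (the IDs and ratings are invented). Any user and movie pair without a rating shows up as NaN in the wide table.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.5, 2.0],
})

# Wide format: one row per movie, one column per user
toy_matrix = toy_ratings.pivot(index="movieId", columns="userId", values="rating")

print(toy_matrix)
```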

I replaced the NaN values with zeros, as missing values could create problems when feeding the data to the machine learning algorithm. The result is also easier to interpret: for instance, userId 1 has rated movieId 1 and 3 with 4.0 each but has not rated movieId 2, 4 or 5 at all.

dataset.fillna(0, inplace=True)
dataset.head()

Ratings are thinly distributed in the real world. To improve the credibility of the system, I decided to reduce noise by keeping only movies that have been rated by more than 10 users and users who have rated more than 50 movies. Hence the data points collected would come from popular movies and highly engaged users. The first step is to count the number of users who rated each movie and the number of movies rated by each user.

no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')

Next, I filtered the dataset based on the threshold values: the number of ratings per movie against the threshold of 10, and the number of ratings per user against the threshold of 50.

dataset = dataset.loc[no_user_voted[no_user_voted > 10].index,:]
dataset= dataset.loc[:,no_movies_voted[no_movies_voted > 50].index]
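The same two-step filter can be sketched on a toy table. The data and the thresholds of 1 below are hypothetical and sized for the example; the comparison is strictly "greater than", matching the code above.

```python
import pandas as pd

# Hypothetical ratings: movie 30 and user 3 each appear only once
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 4.5, 2.0, 3.5],
})
toy_matrix = toy_ratings.pivot(
    index="movieId", columns="userId", values="rating"
).fillna(0)

# Count ratings per movie and per user
ratings_per_movie = toy_ratings.groupby("movieId")["rating"].count()
ratings_per_user = toy_ratings.groupby("userId")["rating"].count()

# Hypothetical thresholds sized for the toy data
movie_threshold = 1
user_threshold = 1

# Keep rows (movies) and columns (users) with more ratings than the threshold
popular_movies = ratings_per_movie[ratings_per_movie > movie_threshold].index
active_users = ratings_per_user[ratings_per_user > user_threshold].index
filtered = toy_matrix.loc[popular_movies, active_users]

print(filtered)
```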

The final dataset has dimensions of 2121 rows by 378 columns, and it is sparse: most of its values are zero. Sparsity is a problem for recommender systems because storing and processing all those zeros in a dense array is very CPU- and memory-inefficient. To address this, I converted the data to compressed sparse row (CSR) format with the csr_matrix function from the SciPy library, which stores only the non-zero values.

from scipy.sparse import csr_matrix
csr_data = csr_matrix(dataset.values)
dataset.reset_index(inplace=True)
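To give a rough sense of the saving, the sketch below builds a hypothetical random rating matrix (not the MovieLens data; the shape and the 2% fill rate are assumptions) and compares the memory used by the dense array with the memory used by the three arrays that back a CSR matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 1,000 x 500 rating matrix in which about 2% of cells are filled
rng = np.random.default_rng(seed=0)
dense = np.zeros((1000, 500))
filled = rng.random(dense.shape) < 0.02
dense[filled] = rng.integers(1, 11, size=filled.sum()) / 2  # ratings 0.5 to 5.0

sparse = csr_matrix(dense)

# Fraction of cells that are zero
sparsity = 1.0 - sparse.nnz / dense.size

# A CSR matrix stores the non-zero values plus two index arrays
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"sparsity: {sparsity:.3f}")
print(f"dense: {dense_bytes} bytes, CSR: {sparse_bytes} bytes")
```

The conversion does not change the data itself: converting back with sparse.toarray() gives the original array.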

I used the K-Nearest Neighbors (KNN) algorithm with the cosine distance metric to compute similarity between movies. Cosine distance is fast to compute and is a common choice for sparse rating data. KNN is a widely used machine learning algorithm for both supervised and unsupervised problems, and it is popular because its output is easy to interpret and its calculation time is low. In this project I used scikit-learn's NearestNeighbors class, which implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm. KNN works by calculating the distance from one test observation to every observation in the training dataset and then finding its K nearest neighbors. This is repeated for every test observation, and that is how it finds similarities in the data.

For calculating distances, KNN uses a distance metric chosen from a list of available metrics. The algorithm works best on a particular dataset when the most appropriate distance metric is chosen. The cosine distance metric is used mainly to calculate similarity between two vectors. It is measured by the cosine of the angle between two vectors and determines whether they are pointing in the same direction. When used with KNN, this distance can reveal patterns in the data that might not be seen using the Euclidean or Manhattan distances.

The formula for the cosine similarity of two vectors A and B is:

cos θ = (A · B) / (‖A‖ ‖B‖)

The formula gives a value that tells us how similar the two vectors are, and 1 - cos θ gives us their cosine distance.

For non-negative vectors such as ratings, this distance takes values between 0 and 1, where 0 means the vectors point in exactly the same direction (100% similar) and 1 means they are not similar at all.
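As a quick check of the two extremes, here is a minimal sketch that computes the cosine distance by hand with NumPy and compares it with SciPy's scipy.spatial.distance.cosine. The rating vectors and the helper name cosine_distance are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import cosine

def cosine_distance(a, b):
    """Return 1 - cos(theta) for two non-zero vectors."""
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos_theta

# Hypothetical rating vectors
u = np.array([4.0, 5.0, 0.0])
v = np.array([2.0, 2.5, 0.0])   # same direction as u, half the magnitude
w = np.array([0.0, 0.0, 3.0])   # no rated items in common with u

same_direction = cosine_distance(u, v)   # close to 0: identical rating pattern
no_overlap = cosine_distance(u, w)       # 1: nothing in common

print(same_direction, no_overlap)
print(cosine(u, v), cosine(u, w))        # SciPy's version for comparison
```

Note that u and v differ in magnitude but not in direction, so cosine distance treats them as identical. This is one way it differs from Euclidean distance.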

from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data)
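To show what the fitted model returns, here is a small sketch on a hypothetical 4 x 3 movie-by-user matrix (the ratings and the names toy and toy_knn are invented for the example). kneighbors returns two arrays, the distances and the row indices of the nearest rows, and the query row itself comes back as its own closest neighbor.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical movie-by-user rating matrix (4 movies, 3 users)
toy = csr_matrix(np.array([
    [5.0, 4.0, 0.0],   # movie 0
    [4.0, 5.0, 0.0],   # movie 1: rated much like movie 0
    [0.0, 0.0, 5.0],   # movie 2: rated by a different user
    [5.0, 0.0, 1.0],   # movie 3
]))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
toy_knn.fit(toy)

# Ask for the 2 nearest rows to movie 0; the first one is movie 0 itself
distances, indices = toy_knn.kneighbors(toy[0], n_neighbors=2)

print(indices)     # row positions of the neighbors
print(distances)   # their cosine distances from movie 0
```

This is why a recommendation function needs to request one more neighbor than the number of recommendations it wants and then discard the first result.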

Next, I implemented a Python function to retrieve the 10 movies most similar to the input movie. The function first checks whether the movie name supplied as input is available in the database. If it is, the recommendation system is used to find similar movies. The function then sorts them by their distance from the input movie and outputs only the top 10 movies with their distances.

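def get_movie_recommendation(movie_name):
    n_movies_to_recommend = 10
    # Find every title that contains the search string
    movie_list = movies[movies['title'].str.contains(movie_name)]
    if len(movie_list):
        # Use the first match and look up its row in the filtered dataset
        movie_id = movie_list.iloc[0]['movieId']
        matches = dataset[dataset['movieId'] == movie_id]
        if matches.empty:
            # The movie exists but was dropped by the rating-count filter
            return "No movies found. Please check your input"
        movie_idx = matches.index[0]
        # Ask for one extra neighbor because the closest match is the movie itself
        distances, indices = knn.kneighbors(csr_data[movie_idx], n_neighbors=n_movies_to_recommend + 1)
        # Sort by distance, drop the movie itself, and reverse so the closest match is listed last
        rec_movie_indices = sorted(zip(indices.squeeze().tolist(), distances.squeeze().tolist()), key=lambda x: x[1])[:0:-1]
        recommend_frame = []
        for row_idx, distance in rec_movie_indices:
            # Map the row position back to a movieId, then to a title
            rec_movie_id = dataset.iloc[row_idx]['movieId']
            title = movies.loc[movies['movieId'] == rec_movie_id, 'title'].values[0]
            recommend_frame.append({'Title': title, 'Distance': distance})
        return pd.DataFrame(recommend_frame, index=range(1, n_movies_to_recommend + 1))
    else:
        return "No movies found. Please check your input"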

Results

With my model ready, I generated some recommendations. As a superhero fan, I input my favorite superhero movie, “Iron Man”. The movies returned are superhero or animation movies, which suit fans of the superhero genre.

get_movie_recommendation('Iron Man')

Conclusion

In this article, I explained what a simple recommender system is and created one using the KNN algorithm in Python. The model works reasonably well for a movie recommendation system based on users' behavior. It is a basic recommender that ranks items by a single metric, in this case cosine distance.

The Github link for this project can be found here.

The deployed model link can be found here.

You can connect with me on LinkedIn and Twitter if you have any questions or interesting ideas to share. Happy reading!

References

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>
So_ham. “Build A Movie Recommendation System on Your Own.” Analytics Vidhya. 24 Nov. 2020. Web. 20 May 2021.
“Most Popular Distance Metrics Used in KNN and When to Use Them.” KDnuggets. Web. 20 May 2021.
“(Tutorial) Recommender Systems in Python.” DataCamp Community. Web. 20 May 2021.

Written by Victoria Akintomide