#CloudGuruChallenge — Machine Learning on AWS
This project is part of an A Cloud Guru challenge set by Kesha Williams. You can find the details here: https://acloudguru.com/blog/engineering/cloudguruchallenge-machine-learning-on-aws#
I used two IMDb datasets for the recommendations: 1) title_basics_data.tsv and 2) title_ratings_data.tsv. You can find the datasets at the following link: https://datasets.imdbws.com/?opt_id=oeu1603156399603r0.4083130683432188
There are many ways to build a recommendation system. I built a cluster-based one: you give it a movie name and it recommends up to 15 other titles. I used K-means clustering to group titles by genre. The code was written in an AWS SageMaker Jupyter notebook.
I followed several steps, from data cleaning to data analysis. We will look at each step one by one.
We use several libraries for different parts of the project. Boto3 is the AWS SDK for Python. It lets you create, update, and delete AWS resources directly from your Python scripts.
Creating S3 bucket and merging datasets
Here we create an S3 bucket. Knowing the region is very important in AWS, because every bucket is created in a specific region. If everything goes well, we get a message that the S3 bucket was created successfully.
We read both datasets from the S3 bucket. Both datasets have a 'tconst' column, and we merge them on that column. We then save the merged dataframe so we do not need to repeat the merge, and upload it to the output path in our S3 bucket.
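Here is a minimal sketch of how the bucket creation could look with Boto3. The helper names and the bucket name are my own placeholders, not the ones from the notebook. The one detail worth knowing is that us-east-1 must not be passed as a LocationConstraint, while every other region must.

```python
def bucket_create_kwargs(bucket_name, region):
    """Build the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default region and is rejected as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(s3_client, bucket_name, region):
    """Create the bucket and print whether it worked."""
    try:
        s3_client.create_bucket(**bucket_create_kwargs(bucket_name, region))
    except Exception as err:  # botocore raises ClientError for AWS-side failures
        print(f"S3 error: {err}")
        return False
    print("S3 bucket created successfully")
    return True


# Usage inside a notebook with AWS credentials (placeholder bucket name):
#   import boto3
#   region = boto3.session.Session().region_name
#   client = boto3.client("s3", region_name=region)
#   create_bucket(client, "my-movie-recommender-bucket", region)
```

Passing the client in as an argument keeps the helper easy to try out with a stub object before touching a real AWS account.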
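The merge step can be sketched with pandas as follows. The rows are toy values I made up, the inner join is an assumption about the join type, and the in-memory buffer stands in for the S3 output path.

```python
import io

import pandas as pd

# Toy stand-ins for the two IMDb files. In the notebook these would come from
# something like pd.read_csv(path, sep="\t", na_values="\\N").
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "titleType": ["movie", "tvEpisode", "movie"],
    "primaryTitle": ["Title A", "Title B", "Title C"],
    "genres": ["Drama", "Comedy", "Comedy,Drama"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003"],
    "averageRating": [7.1, 6.4],
    "numVotes": [1200, 300],
})

# Inner join: keep only titles present in both files (assumed join type).
merged = basics.merge(ratings, on="tconst", how="inner")

# Save the merged frame once so later runs can skip the merge.
# A buffer is used here; a file path would work the same way.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)

print(merged.shape)
```

With an inner join, the title without a rating (tt0000002) drops out of the merged data.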
Here is our merged dataframe. There are 9 features, including:
- averageRating : average rating of each movie, TV show, etc.
- numVotes : number of people who voted for that movie, TV show, etc.
- titleType : format of the title, such as movie, TV episode, etc.
- primaryTitle : title of the movie, TV show, etc.
- isAdult : whether the title is adult content or not
- startYear : year the title was released
- genres : genres of the movie, TV show, etc.
We can see that there are 700K rows and 9 columns.
Data Analysis and Visualization
Ratings matter most for any title: we usually rely on reviews to decide what to watch. We first analyze how the average rating is distributed across the data, using seaborn for the visualization.
We can see that the distribution peaks between 7 and 8, so most titles have an average rating in that range.
Next, we analyze how many titles were released each year. We visualize the last 10 years of data.
We can see that around 30K titles were released each year, except in 2020, which is lower because of the COVID-19 pandemic.
Now we find the average rating for each year to see which years produced more well-rated titles.
We can see that the average rating over the last 10 years is around 7.
We have seen that there are many different title formats. Now we visualize how many records we have for each format.
We can see that there are 10 different formats. TV episodes have the most records and video games have the fewest.
Next, we look at the average rating for each format. We group by titleType and sum the average ratings, then merge the counts with the summed ratings and create a new column that holds ratings / count.
We can see that the average rating across formats is around 6.8. TV episodes have the highest average rating at 7.44, while movies average only 6.14. We can assume that many movie records have low ratings.
Next, we want a more reliable picture of the rating distribution. The more votes a record has, the more we can trust its rating. So we filter the data to records with more than 50K votes and an average rating greater than 6.25.
We can see that, under these filters, most records have an average rating between 6.75 and 7.75.
Now we find which genres are most common in our dataset. We plot a word cloud and a histogram.
A word cloud draws more frequent genres in larger text. We can see that Drama and Comedy are the most common genres.
We also plot a bar graph of the genres, and we can see that Comedy and Drama are the most common, with around 25K records.
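The per-format average can be sketched like this, on toy rows I made up. It follows the same sum-then-divide idea; the output column names are placeholders of mine.

```python
import pandas as pd

df = pd.DataFrame({
    "titleType": ["movie", "movie", "tvEpisode", "tvEpisode", "short"],
    "averageRating": [6.0, 7.0, 7.5, 8.5, 5.0],
})

# Sum of ratings and number of records per format, in one pass.
per_type = df.groupby("titleType")["averageRating"].agg(rating_sum="sum", count="count")

# ratings / count gives the average rating of each format.
per_type["mean_rating"] = per_type["rating_sum"] / per_type["count"]

print(per_type.sort_values("mean_rating", ascending=False))
```

The same numbers come out of df.groupby("titleType")["averageRating"].mean(), so the sum and count columns are only needed if you want to show them too.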
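The vote and rating filter described above is a single boolean mask in pandas. A small sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "averageRating": [9.5, 7.2, 6.0, 8.1],
    "numVotes": [12, 80_000, 150_000, 51_000],
})

# Keep records with more than 50K votes and an average rating above 6.25.
popular = df[(df["numVotes"] > 50_000) & (df["averageRating"] > 6.25)]

print(popular["primaryTitle"].tolist())
```

Title A has the best rating but only 12 votes, so it is dropped, which is the point of the filter.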
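The counts behind the word cloud and the bar graph can be sketched with the standard library. This assumes the genres column holds comma-separated strings, as in the IMDb files; the sample values are made up.

```python
from collections import Counter

# Toy genres column; None stands for a missing value.
genres_column = ["Drama", "Comedy,Drama", "Comedy", "Action,Comedy,Drama", None]

genre_counts = Counter(
    genre
    for entry in genres_column
    if entry  # skip missing values
    for genre in entry.split(",")
)

print(genre_counts.most_common())
```

A title with several genres counts once for each of them, so the totals add up to more than the number of rows.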
To create recommendations, I divided the code into 3 parts: 1) weighted ratings, 2) clustering, 3) the final function. First, we use a weighted rating formula to give each record a score based on its average rating and number of votes. The formula is:
- weighted rating (WR) = (v / (v + m)) * R + (m / (v + m)) * C
R = average rating of the title (mean)
v = number of votes for the title
m = minimum votes required to be counted
C = the mean vote across the whole dataset
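To make the formula concrete, here is a small worked sketch. The values of m and C are placeholders I picked for illustration, not the ones used in the notebook.

```python
def weighted_rating(R, v, m, C):
    """WR = v/(v+m) * R + m/(v+m) * C"""
    return (v / (v + m)) * R + (m / (v + m)) * C


C = 6.9   # placeholder mean rating across the dataset
m = 1000  # placeholder minimum number of votes

# A 9.0 rating backed by only 50 votes is pulled strongly toward C.
few_votes = weighted_rating(R=9.0, v=50, m=m, C=C)

# An 8.0 rating backed by 100,000 votes stays close to 8.0.
many_votes = weighted_rating(R=8.0, v=100_000, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2))  # 7.0 7.99
```

So a record with few votes cannot jump to the top of the recommendations on the strength of a handful of high scores.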
Then we create a new column for each genre and set it to 1 or 0 depending on whether the record has that genre. This creates a total of 38 columns. We take all the genre columns and use a scaler to standardize them. Standardization rescales each column so that it has a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, but it puts every column on the same scale, which matters for distance-based methods such as K-means.
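Here is a rough sketch of the genre columns and the standardization, on made-up rows. I compute the z-score by hand with pandas; using the population standard deviation (ddof=0) should match what scikit-learn's StandardScaler does, though which scaler the notebook used is an assumption on my part.

```python
import pandas as pd

df = pd.DataFrame({"genres": ["Drama", "Comedy,Drama", "Action", "Comedy"]})

# One 0/1 column per genre, split on the comma separator.
genre_flags = df["genres"].str.get_dummies(sep=",")

# Standardize each column: subtract its mean, divide by its standard deviation.
scaled = (genre_flags - genre_flags.mean()) / genre_flags.std(ddof=0)

print(genre_flags.columns.tolist())
print(scaled.round(2))
```

A column where every value is the same has a standard deviation of 0, so it would need special handling before dividing.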
We use K-means clustering and create an elbow graph to find the optimal value of k. We loop over the first 50 values of k and plot the result.
We choose the value of k at the elbow, where the curve stops dropping sharply and starts decreasing roughly linearly. K = 28 is a good choice, since from K = 28 onward the elbow curve decreases linearly.
We now fit K-means with this value of k and predict the cluster for each record. This will help us recommend titles based on genres.
This is our final function. It takes 5 arguments, and we need to provide at least one of them. title takes a movie/show name, types is the format of the title, ratings is the minimum rating required, year is a particular year to search for, and genres takes a genre name. Based on the provided arguments, it recommends up to 15 records.
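The elbow loop can be sketched with scikit-learn as below. The data is a synthetic stand-in (three blobs) rather than the real genre matrix, and the loop covers only a few values of k to keep it quick.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled genre matrix: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia is the sum of squared distances from each point to its cluster center.
inertias = {}
for k in range(1, 8):  # the notebook loops over the first 50 values
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))

# Once k is chosen, fit_predict returns one cluster label per row.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

Plotting inertia against k gives the elbow graph. On this toy data the drop is steep up to k = 3 and nearly flat afterwards; on real data the bend is usually less obvious.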
get_recommendations(title='thor', types='movie', ratings=7, genres='drama')
get_recommendations(types='movie', ratings=7, genres='drama')
get_recommendations(title='thor', types='movie', ratings=7)
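To give an idea of how such a function can work, here is a simplified sketch. It is not the code from my notebook: the function name, the weighted_rating and cluster column names, and the toy rows with invented values are all placeholders.

```python
import pandas as pd


def recommend_sketch(df, title=None, types=None, ratings=None, year=None,
                     genres=None, limit=15):
    """Filter by the given arguments and return the top rows by weighted rating."""
    if all(arg is None for arg in (title, types, ratings, year, genres)):
        raise ValueError("provide at least one argument")
    result = df
    if title is not None:
        matches = df[df["primaryTitle"].str.contains(title, case=False, regex=False, na=False)]
        if matches.empty:
            return matches
        # Keep titles from the same cluster(s), excluding the matched titles.
        result = df[df["cluster"].isin(matches["cluster"]) & ~df.index.isin(matches.index)]
    if types is not None:
        result = result[result["titleType"].str.lower() == types.lower()]
    if ratings is not None:
        result = result[result["averageRating"] >= ratings]
    if year is not None:
        result = result[result["startYear"] == year]
    if genres is not None:
        result = result[result["genres"].str.contains(genres, case=False, regex=False, na=False)]
    return result.sort_values("weighted_rating", ascending=False).head(limit)


# Toy catalog with invented values, only to exercise the function.
catalog = pd.DataFrame({
    "primaryTitle": ["Thor", "Hero Film", "Quiet Drama", "Hero Show", "Old Hero Film"],
    "titleType": ["movie", "movie", "movie", "tvSeries", "movie"],
    "startYear": [2011, 2012, 2015, 2013, 1990],
    "genres": ["Action,Adventure", "Action,Adventure", "Drama", "Action,Adventure", "Action"],
    "averageRating": [7.0, 7.4, 8.0, 8.2, 6.1],
    "weighted_rating": [6.9, 7.2, 7.5, 7.9, 6.0],
    "cluster": [0, 0, 1, 0, 0],
})

by_title = recommend_sketch(catalog, title="thor", types="movie", ratings=7)
print(by_title["primaryTitle"].tolist())
```

When a title is given, the cluster narrows the candidates first and the other arguments filter what is left. Without a title, the filters run over the whole catalog.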
That's it. The movie recommendation system is complete. During this project, I learned about AWS SageMaker, how to work with S3 buckets, how to understand the data, and how to build a recommendation system from it. I will improve this code later and add more features to make a better recommendation system. You can find my code in my GitHub repository: https://github.com/dhruvilshah35/Movie-Recommendation-System
I enjoyed doing this challenge and learning new things. Please like this post and share your suggestions for improving it. Thank you so much.