---
title: "Movielens Recommender"
author: "Kostas Voulgaropoulos"
date: "September 2019"
output: pdf_document
---
# Problem Setting and Approach
Movielens, a platform for non-commercial personalized movie recommendations, has made available several movie datasets of different sizes. Throughout this report, we will use the 10-million-row dataset to create a movie recommender.
Predictors include userId, movieId, movie title, year of release, timestamp of rating and genre. Our target variable is continuous, so we will measure the success of our algorithm by the [residual mean squared error (RMSE)](https://en.wikipedia.org/wiki/Mean_squared_error).
To plan a successful strategy, we will visually explore our data and clean/engineer it wherever possible.
After experimenting with various approaches, our final prediction attempt will focus on matrix factorization.
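Since RMSE drives every decision below, here is a minimal sketch of the metric itself (the `rmse_manual` helper is invented for illustration; the report later relies on `caret::RMSE` and `ModelMetrics::rmse`, which compute the same quantity):
```{r rmse_sketch, eval=FALSE, echo=TRUE}
#RMSE: the square root of the average squared difference
#between predicted and actual ratings
rmse_manual <- function(actual, predicted){
  sqrt(mean((actual - predicted)^2))
}
rmse_manual(c(4, 3.5, 5), c(3.8, 3, 4.5)) #~0.42
```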
## Model Basis
We will approach the prediction of a rating in the following manner:
* The **mean rating** of all movies is the easiest, yet justified, prediction we could give. A look at the distribution of ratings shows that the mean is the single most likely value. If we were asked to pick a rating for a movie we don't know, from a user we also don't know, the safest guess would be the mean. But since we have features available, we will try to use them as follows:
* On top of the mean rating, we can add the effect of the **specific movie** whose rating we want to predict. For example, Lord of the Rings is widely considered a good movie. Its mean rating will be much higher than the overall mean (which is shaped by both renowned and less acclaimed movies). As a result, a good movie's average will raise our prediction, while a bad movie's will lower it. A common word in statistics for such effects is bias. Thus, we'll call it the movie bias and calculate it as the difference between the mean rating of the specific movie and the mean rating of all movies. For example, if the mean rating of all movies is 3.5 but Lord of the Rings has a mean rating of 4.5, then the movie bias of Lord of the Rings is 1 (= 4.5 - 3.5).
* Accordingly, the **same applies to users**. Some users are demanding and tend to rate lower than others; we need to take this into consideration as well. Continuing the example above, a user with a mean rating of 2 contributes a user bias of -1.5 (= 2 - 3.5) to our prediction. A small sketch of both bias computations follows this list.
* Finally, we will try to encapsulate a more abstract form of **genres**. By abstract we mean that we will not explore the presence of genres per se, but rather the presence of *concepts*. A concept can be whichever factor describes a preference: each user has a tendency towards a specific factor, and each movie is somehow associated (more or less, positively or negatively) with it.
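As a minimal sketch of the bias idea on hypothetical toy data (the `toy` tibble is invented for illustration; the same `dplyr` pattern reappears in the modelling section):
```{r bias_sketch, eval=FALSE, echo=TRUE}
library(dplyr)
#Hypothetical toy ratings
toy <- tibble(userId  = c(1, 1, 2, 2, 3),
              movieId = c(10, 20, 10, 30, 20),
              rating  = c(5, 3, 4, 2, 4))
mu  <- mean(toy$rating)                    #overall mean rating
b_i <- toy %>% group_by(movieId) %>%       #movie bias: how far each movie's
  summarize(b_i = mean(rating - mu))       #average sits from the overall mean
b_u <- toy %>% left_join(b_i, by = "movieId") %>%
  group_by(userId) %>%                     #user bias: what remains after
  summarize(b_u = mean(rating - mu - b_i)) #removing mu and the movie bias
```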
## Considering the number of ratings
The fact that not all movies bear the same number of ratings creates a distortion. For movies with many ratings (the same applies to users), it is safe to base our prediction on their calculated bias. But movies with few ratings may lead us to assign biases they don't actually deserve. This discrepancy can be rectified by adjusting the way we compute each movie's average rating. Movie bias is calculated by taking the difference between each rating and the mean rating of all movies, summing up these differences, and dividing by the number of the movie's ratings to make the measure an average. In order to control for the number of ratings (that is, trust a movie's ratings more when there are many of them, and less when there are few), we will add a penalty term to the denominator. This way, the biases of movies with many ratings will hardly be influenced by the penalty at all, while movies with few ratings might see their average bias halved or more.
To make this more concrete, consider a very popular movie with a sum of differences from the general mean of 2,000, a ratings count of 2,500 and a penalty term of 4. Before penalization, the popular movie would yield a movie bias of +0.8 (= 2,000 / 2,500). After penalization, it would return +0.799 (= 2,000 / 2,504), almost the same as before.
On the other hand, a not-so-popular movie with just two ratings and a sum of (two) differences of +3 would return, before penalization, a movie bias of +1.5 (= 3 / 2). After penalization it becomes significantly lower, at +0.5 (= 3 / (2 + 4)).
Thus, penalization shrinks biases based on few ratings, while it leaves biases backed by a great number of ratings almost untouched. A common term for this penalty is lambda (the Greek letter $\lambda$), and that is what we'll call it throughout this analysis. Since lambda is a tuning parameter, we need to choose it well. We will cross-validate (i.e. try different values and compare their performance) to obtain an optimal penalty term for both the movie and the user biases; a numeric sketch of the penalty follows.
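A minimal numeric sketch of the penalty at work, using the hypothetical numbers from the example above:
```{r penalty_sketch, eval=FALSE, echo=TRUE}
#Regularized movie bias: divide the summed differences from the mean
#by the number of ratings plus the penalty term lambda
penalized_bias <- function(sum_diffs, n_ratings, lambda){
  sum_diffs / (n_ratings + lambda)
}
penalized_bias(2000, 2500, 4) #popular movie: ~0.799, barely changed
penalized_bias(3,    2,    4) #obscure movie: 0.5 instead of 1.5
```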
## Factorizing the residuals
So far, our prediction takes the following form:

prediction = mean_rating_of_all_movies + movie_bias + user_bias

or, in symbols,

$$\hat{y} = \mu + b_i + b_u$$
The difference between prediction and reality is the residual (denoted with the Greek letter epsilon, $\epsilon$).
Although this naive model is a good basis, it is not enough. We need to analyze its residuals and extract more from them. Inside the ratings matrix exist many groups of users and movies that are somehow correlated. Some users like thrillers and tend to rate them higher, others like Leonardo DiCaprio, others despise romance movies, and so on. Such trends are usually called latent factors or concepts, and they are practically infinite. The current state of our model ignores these factors and leaves similar residuals among them.
In order to detect and account for the most important of them, we will employ a technique called matrix factorization.
The first step in factorizing the residuals is to spread them into a sparse matrix: each user becomes a row, each movie a column, and the corresponding residuals become the values of the matrix. Its dimensions are the number of distinct users x the number of distinct movies. Since users haven't rated all movies, this matrix is very sparse, containing NAs wherever a user has not rated a movie.
Our goal is to decompose this sparse matrix of residuals into three dense matrices which, multiplied together, return a dense matrix very close to our initial sparse one. To achieve this we will employ singular value decomposition (SVD for short). If we name our initial $(m \times n)$ matrix $A$ and the number of underlying concepts $k$, we can represent $A$ as the product of an $(m \times k)$ matrix (commonly denoted $U$), a $(k \times k)$ nonnegative diagonal matrix (denoted $\Sigma$) and a $(k \times n)$ matrix (the transpose of a matrix denoted $V$):
$$A = U \Sigma V^T$$
This fairly computationally expensive process will be carried out with the funkSVD function from the recommenderlab package. Once this transformation is achieved, we have decomposed our residuals $\epsilon$: each user-movie residual can be represented as a sum of products of values from the factor matrices. The residual of movie $i$ and user $j$ is rewritten as:

$$\epsilon_{i,j} = p_{i,1} q_{1,j} + p_{i,2} q_{2,j} + \dots + p_{i,k} q_{k,j}$$

where $p$ and $q$ represent values from the matrices $U$ and $V$ obtained by the SVD, and $k$ is the number of concepts, with the variability of each term decreasing as its index grows (thus, concept 1 has more predictive power than concept 2, and so on).
This decomposition can be embedded in our model:

$$\hat{y} = \mu + b_i + b_u + p_{i,1} q_{1,j} + p_{i,2} q_{2,j} + \dots + p_{i,k} q_{k,j}$$
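As a small illustration of the decomposition itself, here is a sketch using base R's `svd()` on a tiny, fully observed toy matrix (invented numbers; the real residual matrix is sparse, which is why we later switch to funkSVD):
```{r svd_sketch, eval=FALSE, echo=TRUE}
#Toy 3 x 4 "residuals" matrix, fully observed for illustration only
A <- matrix(c( 1.2, -0.3,  0.8, -1.0,
               0.9, -0.1,  0.6, -0.7,
              -1.1,  0.4, -0.9,  1.2), nrow = 3, byrow = TRUE)
s <- svd(A)                           #returns u, d (diagonal of Sigma) and v
A_hat <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(A, A_hat)                   #TRUE: A = U Sigma V^T
#A rank-1 approximation keeps only the strongest "concept"
A_1 <- s$d[1] * s$u[, 1] %*% t(s$v[, 1])
```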
# Libraries and Data Import
Let's start our journey by importing libraries.
```{r libraries_import, warning=FALSE, message=FALSE}
#Importing necessary libraries
#Data manipulation
if(!require(tidyverse)) install.packages("tidyverse",
repos = "http://cran.us.r-project.org")
# Train/Test split and Model Training
if(!require(caret)) install.packages("caret",
repos = "http://cran.us.r-project.org")
#Handle Tables
if(!require(data.table)) install.packages("data.table",
repos = "http://cran.us.r-project.org")
#SVD decomposition
if(!require(recommenderlab)) install.packages("recommenderlab",
repos = "http://cran.us.r-project.org")
# to plot multiple grobs alongside
if(!require(gridExtra)) install.packages("gridExtra",
repos = "http://cran.us.r-project.org")
# to separate stacked genres
if(!require(splitstackshape))install.packages("splitstackshape",
repos = "http://cran.us.r-project.org")
#to transform shape of our data
if(!require(reshape2)) install.packages("reshape2",
repos = "http://cran.us.r-project.org")
#For rmse function
if(!require(ModelMetrics)) install.packages("ModelMetrics",
repos = "http://cran.us.r-project.org")
#For date handling with year(); lubridate is not attached by tidyverse here
if(!require(lubridate)) install.packages("lubridate",
repos = "http://cran.us.r-project.org")
```
Now we need to download our data, read each file as a table and finally merge them into one.
Note: because my PC could not handle such a large dataset, I produced this report with a quarter of its rows. If you want to run the whole dataset, disable the command that slices it.
```{r Data_Import_0 , echo=TRUE, eval = TRUE}
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(text = gsub("::", "\t", readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
col.names = c("userId", "movieId", "rating", "timestamp"))
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
movies <- as.data.frame(movies) %>% mutate(movieId = as.numeric(levels(movieId))[movieId],
title = as.character(title),
genres = as.character(genres))
movielens <- left_join(ratings, movies, by = "movieId")
#if you want to run the whole dataset, disable the following command
movielens <- movielens %>% slice(1:floor(nrow(movielens)/4))
rm( ratings, movies )
```
# EDA and Data Cleaning
Now that we have successfully imported our data, we can start exploring it. Some quick thoughts from a first glance:
* **movieId** and title contain almost the **same information**. At times, titles are repeated, so we expect unique movieIds to outnumber unique titles. For our model, we will keep only movieId as a predictor.
* **year** likely carries **little to no predictive power**. People who like space movies will rate "2001: A Space Odyssey" as highly as "Interstellar". It stays for the EDA, but we probably won't need it in our prediction.
* **genres**, even though they are very hard to use directly in a model, will be a good signal for the existence of underlying **concepts** in our data.
* **timestamp** probably contains **no information** at all. The time a rating is placed has nothing to do with a user's preferences or the quality of a film. However, we will briefly explore it below.
* **title** nests another feature, the **year of release**. We will unnest it to test whether it carries any power.
```{r rid_n_factors, echo=FALSE}
#convert the raw timestamp (seconds since epoch) to a date
movielens$timestamp <- movielens$timestamp %>% as.Date.POSIXct()
#extract the year each rating was placed (year() comes from lubridate)
movielens$timestamp_year <- movielens$timestamp %>% year()
movielens <- movielens %>%
mutate_if(is.integer, as.factor) #integers to factors
```
```{r title_unnest, echo=FALSE}
movielens$year <- movielens %>%
pull(title) %>%
str_extract("\\([:digit:]{4}\\)") %>%
str_remove_all("\\(|\\)") %>%
as.integer()
movielens$title <-movielens$title %>% str_remove("\\([:digit:]{4}\\)")
```
## Distribution of Ratings
Throughout our analysis, the mean of ratings will play a significant role. It will be the base of our prediction. Let's take a look at the distribution that shapes it.
```{r ratings_distribution, echo=FALSE}
mu <- mean(movielens$rating)
med <- median(movielens$rating)
movielens %>% ggplot( aes(x=rating)) +
geom_density(aes(y=..density..), fill = "#99beff", col = '#d9a757', alpha = .2 )+
geom_histogram(aes(y=..density..),binwidth = .5,fill="#498ee3")+
geom_vline( aes( xintercept = mu, colour = "mean"), linetype ='dashed' ) +
geom_vline(aes( xintercept = med, colour = "median"), linetype ='dashed' ) +
theme_minimal()+
labs(title = "Histogram and Density of Ratings", x = "Rating", y = "Density")+
theme(legend.title = element_blank())
```
Both the mean (`r round(mean(movielens$rating),2)`) and the median (`r round(median(movielens$rating),2)`) are greater than the midpoint of the scale (i.e., 2.5). That suggests a tendency towards high ratings: it seems much more likely for someone to rate a movie she liked than one she didn't. Users are also lenient with movies they don't like, and seldom place ratings below 2.
## Count vs Mean Rating
Now we want to check, both for users and movies, whether the number of ratings goes along with the mean rating. In other words, we seek to answer the question: "Do movies (users) with many ratings tend to have (give) high ratings and vice versa?"
```{r count_vs_mean, echo=FALSE}
#Create a new dataframe containing the count of each user
cnt <- movielens %>% group_by(userId) %>% summarise( user_cnt = n() )
#average rating of each user
rat <- movielens %>% group_by(userId) %>% summarise( mean_rating = mean(rating) )
p1 <- cnt %>% left_join( rat, by = "userId" ) %>%
ggplot( aes(y = user_cnt, x = mean_rating ) ) +
geom_point(show.legend = F, alpha = .1,col = "#4685f2") +
labs( title = "Count of User vs Mean_Rating", x = "", y = "Times each user rated" )
#count of each movie
cnt_mov<- movielens %>% group_by(movieId) %>% summarise( movie_cnt = n() )
#average rating of each movie
rat_mov <- movielens %>% group_by(movieId) %>% summarise( mean_rating = mean(rating) )
mov_df <- cnt_mov %>% left_join(rat_mov, by = "movieId" )
p2 <- mov_df %>%
ggplot( aes(y = movie_cnt, x = mean_rating ) ) +
geom_point(show.legend = F, alpha = .1,col = "#4685f2") +
labs( title = "Count of Movie vs Mean_Rating", x = "Mean Ratings", y = "Times each movie was rated" )
grid.arrange(p1,p2)
rm(cnt,rat,cnt_mov,rat_mov,p1,p2)
```
We notice a correlation for the movies, which could be a hint that popular movies tend to have high ratings. Users, on the other hand, appear randomly scattered, showing no strong tendency.
## Best and Worst Movies by Rating
For starters, let's take a look at the best and the worst movies, without taking into account the number of ratings they have.
```{r best_worst_without, echo= FALSE}
#Add the title from original dataset
mov_df$title <- movielens[match(mov_df$movieId, movielens$movieId),"title"]
worst_movies <- mov_df %>%
arrange((mean_rating)) %>% #arrange from highest to lowest rating
slice(1:10) %>% #grab the first 10
ggplot( aes(x = reorder(title,mean_rating), y = mean_rating) ) +
geom_point(size = 9, col = "#baaa6e") +
coord_flip() +
geom_vline(xintercept = 1:10, col = "grey80") + #lines on which our dots "walk"
theme_linedraw()+
theme(axis.line = element_blank()) +
geom_text( aes(label = mean_rating), col = "black" ) +
labs(title = "Worst Movies", subtitle = "Regardless of the Number of Ratings" , y = "Mean Rating", x = "")
best_movies <- mov_df %>%
arrange(desc(mean_rating)) %>% #arrange from highest to lowest rating
slice(1:10) %>% #grab the first 10
ggplot( aes(x = reorder(title,mean_rating), y = mean_rating) ) +
geom_point(size = 9, col = "#ceabd1") +
coord_flip() +
geom_vline(xintercept = 1:10, col = "grey80") + #lines on which our dots "walk"
theme_linedraw()+
theme(axis.line = element_blank()) +
geom_text( aes(label = mean_rating), col = "black" ) +
labs(title = "Best Movies", subtitle = "Regardless of the Number of Ratings" , y = "Mean Rating", x = "")
grid.arrange(best_movies, worst_movies)
```
The best movies all share the same rating, and so do the worst. Oddly enough, they are not very recognizable. To account for their popularity, we will replot them, this time labelling each dot with the number of ratings instead of the average rating.
```{r best_worst_without2, echo = FALSE}
worst_movies <- mov_df %>%
arrange((mean_rating)) %>% #arrange from highest to lowest rating
slice(1:10) %>% #grab the first 10
ggplot( aes(x = reorder(title,mean_rating), y = mean_rating) ) +
geom_point(size = 9, col = "#baaa6e") +
coord_flip() +
geom_vline(xintercept = 1:10, col = "grey80") + #lines on which our dots "walk"
theme_linedraw()+
theme(axis.line = element_blank()) +
geom_text( aes(label = movie_cnt), col = "black" ) +
labs(title = "Worst Movies", subtitle = "Regardless of the Number of Ratings" , y = "Mean Rating", x = "")
best_movies <- mov_df %>%
arrange(desc(mean_rating)) %>% #arrange from highest to lowest rating
slice(1:10) %>% #grab the first 10
ggplot( aes(x = reorder(title,mean_rating), y = mean_rating) ) +
geom_point(size = 9, col = "#ceabd1") +
coord_flip() +
geom_vline(xintercept = 1:10, col = "grey80") + #lines on which our dots "walk"
theme_linedraw()+
theme(axis.line = element_blank()) +
geom_text( aes(label = movie_cnt), col = "black" ) +
labs(title = "Best Movies", subtitle = "Regardless of the Number of Ratings" , y = "Mean Rating", x = "")
grid.arrange(best_movies, worst_movies)
```
As expected, almost all of the plotted movies have a single rating, which explains the extreme means.
To account for this discrepancy, we will filter on the number of ratings, using the third quartile of the ratings count as a lower limit.
```{r low_limit_ratings }
#use the 3rd quartile of the movie count (5th element of summary()) as a lower limit
limit <- (summary(mov_df$movie_cnt))[5]
```
```{r best_worst_with, echo=FALSE}
best_movies <- mov_df %>%
filter(movie_cnt > limit) %>% #throw away movies with few ratings
arrange(desc(mean_rating)) %>% #arrange from highest to lowest rating
slice(1:10) %>% #grab the first 10
ggplot( aes(x = reorder(title,mean_rating), y = mean_rating) ) +
geom_point(size = 9, col = "#ceabd1") + coord_flip() +
geom_vline(xintercept = 1:10, col = "grey80") + #lines on which our dots "walk"
theme_classic()+
theme(axis.line = element_blank()) +
geom_text( aes(label = movie_cnt), col = "black" ) +
labs(title = paste("Best Movies with more than",limit,"Ratings"), y = "Mean Rating", x = "", subtitle = "Number of Ratings inside the Dots" )
worst_movies <- mov_df %>%
filter(movie_cnt > limit) %>% #throw away movies with few ratings
arrange((mean_rating)) %>% #arrange from highest to lowest rating
slice(1:10) %>% #grab the first 10
ggplot( aes(x = reorder(title,mean_rating), y = mean_rating) ) +
geom_point(size = 9, col = "#baaa6e") + coord_flip() +
geom_vline(xintercept = 1:10, col = "grey80") + #lines on which our dots "walk"
theme_classic()+
theme(axis.line = element_blank()) +
geom_text( aes(label = movie_cnt), col = "black" ) +
labs(title = paste("Worst Movies with more than",limit,"Ratings"), y = "Mean Rating", x = "", subtitle = "Number of Ratings inside the Dots" )
grid.arrange(best_movies, worst_movies)
rm(mov_df,best_movies,worst_movies)
```
Now we start to recognize the movies. Among the best are some all-time classics, while among the worst are many widely criticized movies.
We need to take this effect into consideration. Later, in the modelling section, we will apply a penalty term to each movie's mean rating so that it accounts for the number of ratings it was shaped from.
## Genres
The genres column potentially contains great value. Unfortunately for us, many movies belong to more than one genre, and all of them are bound together in one column, separated by "|". We need to unnest them before we start exploring them.
```{r unnest_genres }
#For every genre entry create a separate row containing only this entry
movielens_genres <- movielens %>%
cSplit( "genres", #split the genres column
sep="|", #on separator |
direction = "long" ) #towards a long direction
```
Now that we have split the genres apart, let's look at the unique values they contain and the frequency with which they appear.
```{r count_genres, echo=FALSE }
print(" Genres count of distinct values : ")
movielens_genres %>% pull(genres) %>% table() %>% sort(decreasing = TRUE) %>% print()
```
We notice that some genres are quite similar. To consolidate them a bit, we will merge:
* Horror to Thriller
* Adventure to Action
* War to Action
* Sci-Fi to Fantasy
```{r merge_genres, echo=FALSE}
movielens_genres$genres <- plyr::revalue(movielens_genres$genres, c("Horror" = "Thriller",
"Adventure" = "Action",
"War" = "Action" ,
"Sci-Fi" = "Fantasy"))
```
And the table of frequencies changes to:
```{r count_genres_2, echo=FALSE }
print(" Genres count of distinct values : ")
movielens_genres %>% pull(genres) %>% table() %>% sort(decreasing = TRUE) %>% print()
```
### 10 Most Frequent Genres
Which genres do people rate most often? Let's take a look at the 10 most popular genres after the aforementioned merge.
```{r most_frequent_genres, echo=FALSE }
pop_genres <- movielens_genres %>%
group_by( genres ) %>%
summarize( n = n() ) %>%
arrange(desc(n)) %>%
head(10)
pop_genres %>%
ggplot( aes(x = reorder(genres,n), y = n) ) +
geom_bar( aes(fill = genres), stat = "identity",width = .6, show.legend = FALSE)+
coord_flip()+
theme_linedraw()+
labs( title = "Count of Ratings by Genre", y = "Count", x = "" )
```
### Ratings by Genre
Does popularity come with high ratings? We will create a plot with the genres in the same order, but this time showing their average rating.
```{r mean_rating_most_freq, echo=FALSE}
movielens_genres %>%
filter(genres %in% pop_genres$genres ) %>%
group_by(genres) %>%
summarize(mean_rating = mean(rating)) %>%
mutate( n = pop_genres$n[(match(genres, pop_genres$genres))] ) %>%
ggplot( aes( x = reorder(genres,n), y = mean_rating ) )+
geom_point(size = 9, col = "#2ecff0") +
geom_vline( xintercept = 1:10, col = "gray" )+
theme(axis.line = element_blank()) +
geom_text( aes(label = round(mean_rating,2), size = 8 ), col = "black", show.legend = FALSE ) +
coord_flip() +
theme_minimal() +
labs( title = "Mean Rating per Genre", subtitle = "Ordered by number of ratings", x = "", y = "Mean Rating" )
```
We can't spot a pattern here: popularity has little to do with the average rating.
To examine this further, we will also make a violin plot of the distribution of ratings for each genre. This makes it clearer which values shaped these means.
```{r violin_genres, echo=FALSE}
movielens_genres %>%
filter(genres %in% pop_genres$genres ) %>%
ggplot() +
geom_violin( aes( x = genres, y = rating, fill = genres ), show.legend = FALSE ) +
coord_flip() +
labs( title = "Ratings by Genre", x = "", y = "Rating" ) +
theme_linedraw()
```
The distributions of ratings across genres look nearly identical. This comes as no surprise, since the means were not very different. In addition, all genres contain both good and bad movies, hence the wide distributions.
As a result, it looks like genre alone is not an efficient predictor: knowing the genre of a movie seems insufficient to predict its rating.
Perhaps genres need to be examined along with the user who places the rating. To investigate this, we will employ a lollipop plot to make clear whether mean ratings per genre vary by userId.
Because there are far too many genres to form informative plots, we will keep the top 6 by number of ratings. The same applies to users.
```{r genre_per_user_plot, echo=FALSE}
#store top genres and users in variables for ease of filtering
top_genres <- movielens_genres %>%
group_by( genres ) %>%
summarize( n = n() ) %>%
arrange(n%>% desc()) %>%
head(6) %>%
pull(genres)
top_users <- movielens_genres %>%
group_by(userId) %>%
summarize( user_count = n() ) %>%
arrange(desc(user_count)) %>%
head(6) %>%
pull(userId)
movielens_genres <- movielens_genres %>%
filter(genres %in% top_genres, userId %in% top_users)#keep only top genres and users
#transform data in wide format
movielens_genres <- movielens_genres %>%
dcast(userId + movieId + title + rating ~ genres,
fun.aggregate = length,
value.var = "rating") #convert to dummy columns
#Replace 1's with the actual rating
movielens_genres[,4:ncol(movielens_genres)] <- movielens_genres[,4:ncol(movielens_genres)] %>%
mutate_all( ~case_when( . !=0 ~ rating, . ==0 ~ 0 ) ) #whenever each dummy column is not zero, place the equivalent rating
movielens_genres[,c(1,5:10)] %>%
group_by(userId) %>%
summarize_all(mean) %>% #compute the mean of each user in each genre
gather(key = "Genre", value = "Average_Rating" , -userId) %>% #gather it for easier plotting
ggplot( aes(x = Genre, y = Average_Rating)) +
geom_point(col = "#b8182d") +
facet_grid(~userId) +
geom_segment(aes(x = Genre, y = 0, xend = Genre, yend = Average_Rating-.02), color = "#ffc64a") +
labs(title = "Average Rating per Genre", subtitle = "Faceted by UserId", y = "Average Rating", x = "User")+
theme_linedraw()+
theme(axis.text.x = element_text(face = "bold", color = "#2b05b3", size = 8, angle = 60))
rm(movielens_genres,top_genres,top_users, pop_genres)
```
Pretty enlightening! No user is balanced in their preferences: some love comedy, others like action, and so on. That is a strong hint of the presence of hidden "concepts" in our data.
We will exploit this insight later by applying SVD in an effort to unravel such latent factors.
## Time
Is time an effective predictor of rating? Does it matter when a movie was released or rated? Let's explore each in turn.
### Of Release
To gain some insight into the effect of the release date, let's visualize the mean rating for each year of release.
```{r mean_rating_per_release_year, echo = FALSE }
movielens %>% group_by(year) %>%
summarize( average_rating = mean(rating) ) %>%
ggplot(aes(x = year , y = average_rating) ) +
geom_line(lwd = 1.1, col = "navyblue")+
#geom_point( aes(size ) +
theme_minimal() +
scale_y_continuous(limits = c(0,5))+
labs(y = "Yearly Average Rating",x = "Year", title = "Average Rating by Release Year")
```
As expected, no apparent pattern.
### Of Rating
```{r mean_rating_per_year, echo = FALSE}
movielens %>% group_by( timestamp_year ) %>%
summarise(avg_rating = mean(rating), Number_of_Ratings = n()) -> temp
medi <- median(temp$avg_rating)
temp %>%
ggplot( aes( x = timestamp_year, y = avg_rating ) ) +
#geom_line(lwd = 1.1, col = "navyblue")+
geom_point(aes(size = Number_of_Ratings ), col = "navyblue")+
geom_hline( yintercept = medi)+
theme_minimal() +
scale_y_continuous(limits = c(0,5))+
labs(y = "Yearly Average Rating",x = "Year",
title = "Average Rating by Year each Rating placed",
subtitle = "Median Average Rating as Horizontal line")
```
Again, no apparent pattern. It seems the time variables are not of much value to us, and they do not qualify for the modelling section.
# Modelling
The time has come to start creating our predictive algorithm.
Because the data is vast, regression models would take forever to train. As an alternative, we will construct a similar formula logically, step by step.
For that, we will make use of only 3 columns: *userId*, *movieId* and *rating*.
```{r select_features, echo=TRUE}
movielens <- movielens %>% select( c(movieId,userId,rating) )
```
## Train/Test Split
The RMSE of our model will be calculated on the validation set. To avoid overfitting, we will not use the validation set to optimize our model's parameters. Instead, we will split the initial train set to create a test set out of it that will help us optimize.
### Train / Validation Split
We will perform a 90/10 split. However, the split alone is not enough: we need to make sure that all users and movies contained in the test set exist in the train set as well. Otherwise, our algorithm will be unable to predict a rating and will return an NA.
As a result, we need to remove some rows from the test set. However, we will not let these rows go to waste: we will add them to the train set instead. Even though they won't be deployed directly, they will contribute to the unravelling of concepts.
To make this clear, consider an extremely simplified example. Suppose all the rows intended for removal were thrillers. Users keen on thrillers would have given them high ratings, while those with a preference for romance would have rated them low. This set of movies would contribute greatly to the latent factor corresponding to the thriller genre and would help our algorithm subtly gauge each user's preference for it.
```{r split, echo=TRUE, message=FALSE,warning=FALSE}
set.seed(1, sample.kind = "Rounding")
# if using R 3.5 or earlier, use `set.seed(1)` instead
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]
validation <- temp %>%
semi_join(edx, by = "movieId") %>%
semi_join(edx, by = "userId")
# Add rows removed from validation set back into edx set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)
mu <- mean(edx$rating)
rm(test_index, temp, movielens, removed)
```
### Train / Test Split
As said above, in order to optimize without using the validation set, we will split the existing train set once more.
For speed of computation, we keep the training side of this split small: the partition below assigns 80% of the rows to the held-out test side, leaving roughly 20% to train on.
```{r second_split, echo=TRUE, message=FALSE,warning=FALSE}
set.seed(1, sample.kind = "Rounding")
# if using R 3.5 or earlier, use `set.seed(1)` instead
test_index <- createDataPartition(y = edx$rating, times = 1, p = 0.8, list = FALSE)
train_set <- edx[-test_index,]
temp <- edx[test_index,]
test_set <- temp %>%
semi_join(train_set, by = "movieId") %>%
semi_join(train_set, by = "userId")
# Add rows removed from test set back into train set
removed <- anti_join(temp, test_set)
train_set <- rbind(train_set, removed)
rm(test_index, temp, removed)
```
## User and Movie Bias
Now we need to compute each movie's and each user's bias. To control for the number of ratings, we penalize with the term commonly referred to as lambda. To find the optimal lambda, we create a function that takes as input our data frame and a vector of candidate lambdas, computes the RMSE each lambda returns, and outputs the lambda that yields the lowest RMSE.
```{r l_optimizer}
l_optimizer <- function(train_set, lambdas){
#Note: test_set and mu are read from the enclosing (global) environment
rmses <- sapply(lambdas, function(l){
#Calculate movie bias
b_i <- train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+l))
#Calculate user bias
b_u <- train_set %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+l))
#Predict ratings
predicted_ratings <- test_set %>%
left_join(b_i, by='movieId') %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu + b_i + b_u) %>%
pull(pred)
#Calculate the error for the specific lambda
return(RMSE(predicted_ratings, test_set$rating))
})
return(lambdas[which.min(rmses)])
}
```
```{r run_l_optimizer, echo=FALSE }
#A vector of possible lambdas
lambdas <- seq(0, 5, 0.5)
#Assign the optimal lambda to a variable
lambda <- l_optimizer(train_set, lambdas)
#Take a look at it
print(paste("Optimal Lambda :",lambda))
```
Now that we know the optimal lambda, we will go on to create the biases that will lead us to the residuals we seek to decompose.
```{r biases, echo=TRUE }
#Calculate movie bias
b_i <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambda))
#Calculate user bias
b_u <- edx %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambda))
```
## Residuals
Now that we have obtained the regularized user and movie biases, we need to calculate what is left unexplained, i.e. the residuals. In other words, we compute all the differences between the actual ratings and our predictions, so that we can analyze these differences further.
```{r resids_function}
#Function to calculate the residuals
resids <- function( train_set, mu ){
#This function takes a training set, calculates the residuals left after subtracting
#mu, user bias and movie bias from the actual rating, and spreads them into a sparse
#matrix with movieIds as columns and userIds as rows
df <- train_set %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(res = rating - (mu + b_i + b_u) ) %>%
select(userId, movieId, res) %>%
spread( movieId, res ) %>% as.data.frame()
#Move the userId column into the row names
rownames(df) <- df[,1]
df <- df[,-1]
#return the residual matrix explicitly
return(df)
}
```
```{r resids_glance, echo=FALSE }
#convert our train set to residuals
#tr <- resids(train_set = train_set,mu=mu)
#A quick look at the residual values
#print("Residuals of the first 10 users and movies")
#tr[1:10,1:10] %>% print()
```
## SVD
Now that we have the matrix of residuals, we decompose it in the best possible way. For that reason we will use the funkSVD function from the recommenderlab package. As its name suggests, it is not the traditional SVD: it uses a slightly different method, decomposing the initial matrix into two (not three) matrices containing the latent factors for users and movies that minimize a specific objective function. For more details, refer [here](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)).
Because it takes many parameters, we will build a function that optimizes some of them; a minimal standalone sketch of funkSVD itself follows the parameter list below.
These parameters include:
* train set (named tr) : the train-set matrix of residuals
* test set (named test_set; later, validation) : the set on which we want to predict
* ks : the number of features, or rank of the approximation; in other words, how many "concepts" we seek to unravel. A parameter with high computational cost.
* gammas : the regularization term
* lambdas : the learning rate, i.e. how much we update our weights after each iteration. A low lambda can make our algorithm train for days, while a high one can make us miss the optimal weights.
* min_improvements : the minimum decrease of the objective that still counts as an improvement
* min_epochs : the minimum number of iterations per feature
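Before tuning, here is a minimal sketch of funkSVD on its own, using a tiny hypothetical matrix with missing entries (the parameter values are placeholders, not tuned choices):
```{r funksvd_sketch, eval=FALSE, echo=TRUE}
library(recommenderlab)
#Toy residual matrix with NAs wherever a user has not rated a movie
toy <- matrix(c( 0.5,   NA, -0.3,
                  NA,  0.2,   NA,
                -0.4,   NA,  0.6), nrow = 3, byrow = TRUE)
fit <- funkSVD(toy, k = 1, gamma = 0.015, lambda = 0.001, verbose = FALSE)
#Recompose a dense approximation: every user-movie cell now has a value
round(tcrossprod(fit$U, fit$V), 2)
```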
```{r optimizer_def}
optimizer <- function(tr = tr,
test_set = test_set,
ks,gammas,lambdas,min_epochs, min_improvements){
#Inputs: parameters of funkSVD function
d <- data.frame(k = NULL,
gamma = NULL,
lambda = NULL,
min_epochs = NULL,
min_improvement = NULL,
rmse = NULL)
for (k in ks){
for (g in gammas){
for (l in lambdas){
for (e in min_epochs){
for (imp in min_improvements) {
#decompose residuals with funk SVD
a <- funkSVD(tr,
k = k,
gamma = g,
lambda = l,
min_epochs = e,
max_epochs = 200,
min_improvement = imp,
verbose = TRUE )
#recompose them (returns a full matrix in place of the sparse tr)
r <- tcrossprod(a$U, a$V)
#pass the original row and column names to the new matrix
rownames(r) <- rownames(tr)
colnames(r) <- colnames(tr)
#create a new vector (called re) on test set
#containing the appropriate p * q (recomposed) term
test_set$re <- rep(0, nrow(test_set)) #to-be-filled vector of zeros
#for each row of test set
for (i in 1:nrow(test_set)){
#fill the vector with the recomposed p*q quantity for this user-movie pair
test_set$re[i] <- r[ as.character(test_set$userId[i]), as.character(test_set$movieId[i]) ]
}
#calculate our prediction
tes <- test_set %>%
left_join(b_i, by = "movieId") %>% #bring the movie bias
left_join(b_u, by = "userId") %>% #bring the user bias
mutate(pred = mu + b_i + b_u + re) # calculate our prediction
#store results in a data frame
d <- d %>% rbind(data.frame(k = k, gamma = g, lambda = l,
min_epochs = e, min_improvement = imp,
rmse = ModelMetrics::rmse(tes$rating, tes$pred) ) )
}}}}}
return(d)
}
```
Now that we have defined our function, we also define the parameters we want to experiment with.
To keep computations light, we won't run the optimizer while knitting. I ran the optimization process several times with smaller datasets. In case you want to optimize further, re-run the script with your own preferences.
```{r params_def}
#Define the parameters
ks <- c(1)
gammas <- c(.009)
lambdas <- c(.005)
min_epochs <- c(5)
min_improvements <- c(.001)
```
# Results
Then, we
* run our optimizer,
* store the optimizer's results in a data frame,
* look at the optimal parameters that gave the minimum RMSE.
```{r optim_results, echo=FALSE, message=FALSE,warning=FALSE}
optimization <- FALSE
if(optimization){
results <- optimizer(tr = tr, test_set = test_set, ks,gammas,lambdas,min_epochs, min_improvements)
results %>%
ggplot( aes( x = reorder(k,rmse) , y = (rmse)) ) +
geom_point(aes(col = as.factor(gamma) ), size = 9, alpha = .8) +
coord_flip() +
geom_vline(xintercept = 1:nrow(results), col = "grey80") + #lines on which our dots "walk"
theme_minimal()+
theme(axis.line = element_blank()) +
labs(title = "RMSE results", subtitle = "" , y = "RMSE", x = "Ranks", col = "Gamma", shape = "Lambda")
opt_params <- results[which(results$rmse == min(results$rmse)),]
}else{
results <- c(ks,gammas,lambdas,min_epochs,min_improvements)
opt_params <- results
}
```
```{r optimal_print, echo = FALSE}
"Optimal Parameters : " %>% print()
opt_params %>% print()
rm(train_set,test_set)
```
The optimizer function can also be used to predict on the validation set. We will re-run it, but this time with:
* edx as the train set (after passing it through the function that returns the residuals),
* validation as the test set,
* the rest of the (optimal) parameters from the data frame we created above.
```{r predict_validation, echo=TRUE}
#subtract the mean, movie and user biases
tr <- resids(train_set = edx, mu=mu)
final_RMSE <- optimizer( tr = tr,
test_set = validation,
ks = opt_params[1],
gammas = opt_params[2],
lambdas = opt_params[3],
min_epochs = opt_params[4],
min_improvements = opt_params[5])
print("Final RMSE:")
final_RMSE$rmse %>% print()
```
As you can see, torturing the data enough gave us an RMSE of `r final_RMSE$rmse`.
# Conclusion
Throughout this report, we explored movie ratings from different angles in order to come up with an efficient and computationally affordable movie recommender.
At first, we took a brief look at their **distribution** and noticed a tendency towards **high ratings**, since both the mean and the median rating were above 2.5.
Then, we visualized the mean rating against the number of ratings, both for movies and for users. While users showed a randomly scattered plot, movies exhibited a slight positive tendency. That led us to explore the **mean** rating **in relation to the number of ratings** each movie had. This step made clear that we shouldn't rely on a mean rating unless it was shaped by a safe number of ratings, and it introduced the **need to regularize each movie's mean rating**. The issue was tackled later with an optimization function that detects the optimal regularization parameter.
Then, we gravitated towards the **genres** column. After bringing it into a proper format, we explored the mean rating per genre in relation to its popularity. Overall, genres did not appear very discriminating: they held roughly the **same distribution of ratings**, and their respective means were close to each other. But once we viewed them from the **perspective of each user's preferences**, they suddenly became insightful. It became clear that each user tends to rate certain genres higher (or lower) and that preferences vary across users. This indicated that genres contain valuable information: if we knew the preferences of a user and how strongly each movie matches those preferences, we could embed them in our model and substantially improve its predictive power. For that, we employed matrix factorization, which helped us unveil such preferences in an abstract way.
With all this insight, we built our model. After regularizing each movie's and user's mean rating (the **biases**), we focused on exploiting the residuals of this preliminary model. For that, we employed the funkSVD function, tuned on a sub-split of our original train set. Finally, we plotted the RMSE obtained by each combination of parameters and chose the optimal one to train our final model.
A short comparison of the models is summarized in the following plot.
```{r model_comparison_plot, echo = FALSE}
#Just the mean
mean_only <- rmse(validation$rating,rep(mu,nrow(validation)))
#Adding the movie bias
pred_bi <- validation %>%
left_join(b_i, by = "movieId") %>%
mutate(pred = (mu + b_i) ) %>%
pull(pred)
mean_bi <- rmse(validation$rating,pred_bi)
#Adding the user bias on top of the movie bias
pred_bi_bu <- validation %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = (mu + b_i + b_u) ) %>%
pull(pred)
mean_bi_bu <- rmse(validation$rating,pred_bi_bu)
data.frame(model = c("Mean Only","Mean + Movie Bias", "Mean + Movie + User Bias", "Mean + Movie + User Bias + Matrix Factorization") ,
rmses = c(mean_only, mean_bi, mean_bi_bu, final_RMSE$rmse)) %>%
ggplot( aes(x = model, y = rmses) ) +
geom_point( size = 6, col = "#4c7329" ) +
coord_flip()+
theme_linedraw()+
labs( title = "Comparison of Models", y = "RMSE", x ="")
```
# Appendix (Code)
```{r appendix_code, eval=FALSE,echo=TRUE}
#Importing necessary libraries
#Data manipulation
if(!require(tidyverse)) install.packages("tidyverse",
repos = "http://cran.us.r-project.org")
# Train/Test split and Model Training
if(!require(caret)) install.packages("caret",
repos = "http://cran.us.r-project.org")
#Handle Tables
if(!require(data.table)) install.packages("data.table",
repos = "http://cran.us.r-project.org")
#SVD decomposition
if(!require(recommenderlab)) install.packages("recommenderlab",
repos = "http://cran.us.r-project.org")
# to plot multiple grobs alongside
if(!require(gridExtra)) install.packages("gridExtra",
repos = "http://cran.us.r-project.org")
# to separate stacked genres
if(!require(splitstackshape))install.packages("splitstackshape",
repos = "http://cran.us.r-project.org")
#to transform shape of our data
if(!require(reshape2)) install.packages("reshape2",
repos = "http://cran.us.r-project.org")
#For rmse function
if(!require(ModelMetrics)) install.packages("ModelMetrics",
repos = "http://cran.us.r-project.org")
#For date handling with year()
if(!require(lubridate)) install.packages("lubridate",
repos = "http://cran.us.r-project.org")
#Data Import
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(text = gsub("::", "\t", readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
col.names = c("userId", "movieId", "rating", "timestamp"))
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
movies <- as.data.frame(movies) %>% mutate(movieId = as.numeric(levels(movieId))[movieId],
title = as.character(title),
genres = as.character(genres))
movielens <- left_join(ratings, movies, by = "movieId")
#if you want to run the whole dataset, disable the following command
movielens <- movielens %>% slice(1:floor(nrow(movielens)/4))
rm( ratings, movies )
movielens$timestamp <- movielens$timestamp %>% as.Date.POSIXct()
movielens$timestamp_year <- movielens$timestamp %>% year()
movielens <- movielens %>%
mutate_if(is.integer, as.factor) #integers to factors
movielens$year <- movielens %>%
pull(title) %>%
str_extract("\\([:digit:]{4}\\)") %>%
str_remove_all("\\(|\\)") %>%
as.integer()
movielens$title <-movielens$title %>% str_remove("\\([:digit:]{4}\\)")
mu <- mean(movielens$rating)
med <- median(movielens$rating)