MachineLearningwithPython/en_Collaborative Filtering.srt at main · camara94/MachineLearningwithPython · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
0
00:00:00,729 --> 00:00:05,200
Hello, and welcome! In this video, we’ll be covering a recommender

1
00:00:05,200 --> 00:00:10,839
system technique called, Collaborative filtering. So let’s get started.

2
00:00:10,839 --> 00:00:15,629
Collaborative filtering is based on the fact that relationships exist between products

3
00:00:15,629 --> 00:00:21,039
and people’s interests. Many recommendation systems use Collaborative

4
00:00:21,039 --> 00:00:26,660
filtering to find these relationships and to give an accurate recommendation of a product

5
00:00:26,660 --> 00:00:31,050
that the user might like or be interested in.

6
00:00:31,050 --> 00:00:37,510
Collaborative filtering has basically two approaches: User-based and Item-based.

7
00:00:37,510 --> 00:00:43,329
User-based collaborative filtering is based on the user’s similarity or neighborhood.

8
00:00:43,329 --> 00:00:47,930
Item-based collaborative filtering is based on similarity among items.

9
00:00:47,930 --> 00:00:53,039
Let’s first look at the intuition behind the “user-based” approach.

10
00:00:53,039 --> 00:00:58,230
In user-based collaborative filtering, we have an active user for whom the recommendation

11
00:00:58,230 --> 00:01:02,460
is aimed. The collaborative filtering engine, first

12
00:01:02,460 --> 00:01:07,841
looks for users who are similar, that is, users who share the active user’s rating

13
00:01:07,841 --> 00:01:12,270
patterns. Collaborative filtering bases this similarity

14
00:01:12,270 --> 00:01:18,770
on things like history, preference, and choices that users make when buying, watching, or

15
00:01:18,770 --> 00:01:23,350
enjoying something. For example, movies that similar users have

16
00:01:23,350 --> 00:01:27,660
rated highly. Then, it uses the ratings from these similar

17
00:01:27,660 --> 00:01:34,250
users to predict the possible ratings by the active user for a movie that she had not previously

18
00:01:34,250 --> 00:01:39,120
watched. For instance, if 2 users are similar or are

19
00:01:39,120 --> 00:01:45,360
neighbors, in terms of their interest in movies, we can recommend a movie to the active user

20
00:01:45,360 --> 00:01:50,830
that her neighbor has already seen. Now, let’s dive into the algorithm to see

21
00:01:50,830 --> 00:01:53,830
how all of this works.

22
00:01:53,830 --> 00:02:00,580
Assume that we have a simple user-item matrix, which shows the ratings of 4 users for 5 different

23
00:02:00,580 --> 00:02:04,580
movies. Let’s also assume that our active user has

24
00:02:04,580 --> 00:02:11,549
watched and rated 3 out of these 5 movies. Let’s find out which of the two movies that

25
00:02:11,549 --> 00:02:16,239
our active user hasn’t watched, should be recommended to her.

26
00:02:16,239 --> 00:02:21,870
The first step is to discover how similar the active user is to the other users.

27
00:02:21,870 --> 00:02:25,849
How do we do this? Well, this can be done through several different

28
00:02:25,849 --> 00:02:32,159
statistical and vectorial techniques such as distance or similarity measurements, including

29
00:02:32,159 --> 00:02:38,640
Euclidean Distance, Pearson Correlation, Cosine Similarity, and so on.

30
00:02:38,640 --> 00:02:44,420
To calculate the level of similarity between 2 users, we use the 3 movies that both the

31
00:02:44,420 --> 00:02:49,799
users have rated in the past. Regardless of what we use for similarity measurement,

32
00:02:49,799 --> 00:02:59,969
let’s say, for example, the similarity, could be 0.7, 0.9, and 0.4 between the active

33
00:02:59,969 --> 00:03:05,439
user and other users. These numbers represent similarity weights,

34
00:03:05,439 --> 00:03:11,569
or proximity of the active user to other users in the dataset.

35
00:03:11,569 --> 00:03:15,730
The next step is to create a weighted rating matrix.

36
00:03:15,730 --> 00:03:21,120
We just calculated the similarity of users to our active user in the previous slide.

37
00:03:21,120 --> 00:03:27,730
Now we can use it to calculate the possible opinion of the active user about our 2 target

38
00:03:27,730 --> 00:03:31,290
movies. This is achieved by multiplying the similarity

39
00:03:31,290 --> 00:03:37,810
weights to the user ratings. It results in a weighted ratings matrix, which

40
00:03:37,810 --> 00:03:45,079
represents the user’s neighbour’s opinion about our 2 candidate movies for recommendation.

41
00:03:45,079 --> 00:03:50,609
In fact, it incorporates the behaviour of other users and gives more weight to the ratings

42
00:03:50,609 --> 00:03:55,180
of those users who are more similar to the active user.

43
00:03:55,180 --> 00:04:01,430
Now we can generate the recommendation matrix by aggregating all of the weighted rates.

44
00:04:01,430 --> 00:04:08,290
However, as 3 users rated the first potential movie, and 2 users rated the second movie,

45
00:04:08,290 --> 00:04:14,959
we have to normalize the weighted rating values. We do this by dividing it by the sum of the

46
00:04:14,959 --> 00:04:20,350
similarity index for users. The result is the potential rating that our

47
00:04:20,350 --> 00:04:26,940
active user will give to these movies, based on her similarity to other users.

48
00:04:26,940 --> 00:04:32,190
It is obvious that we can use it to rank the movies for providing recommendation to our

49
00:04:32,190 --> 00:04:33,530
active user.

50
00:04:33,530 --> 00:04:39,660
Now, let’s examine what’s different between “User-based” and “Item-based” Collaborative

51
00:04:39,660 --> 00:04:43,721
filtering: In the User-based approach, the recommendation

52
00:04:43,721 --> 00:04:50,200
is based on users of the same neighborhood, with whom he or she shares common preferences.

53
00:04:50,200 --> 00:04:58,720
For example, as User1 and User3 both liked Item 3 and Item 4, we consider them as similar

54
00:04:58,720 --> 00:05:08,070
or neighbor users, and recommend Item 1, which is positively rated by User1 to User3.

55
00:05:08,070 --> 00:05:13,420
In the item-based approach, similar items build neighborhoods on the behavior of users.

56
00:05:13,420 --> 00:05:18,780
(Please note, however, that it is NOT based on their content).

57
00:05:18,780 --> 00:05:24,790
For example, Item 1 and Item 3 are considered neighbors, as they were positively rated by

58
00:05:24,790 --> 00:05:31,771
both User1 and User2. So, Item 1 can be recommended to User 3 as

59
00:05:31,771 --> 00:05:37,810
he has already shown interest in Item3. Therefore, the recommendations here are based

60
00:05:37,810 --> 00:05:42,960
on the items in the neighborhood that a user might prefer.

61
00:05:42,960 --> 00:05:47,650
Collaborative filtering is a very effective recommendation system, however, there are

62
00:05:47,650 --> 00:05:53,110
some challenges with it as well. One of them is Data Sparsity.

63
00:05:53,110 --> 00:05:57,910
Data sparsity happens when you have a large dataset of users, who generally, rate only

64
00:05:57,910 --> 00:06:03,430
a limited number of items. As mentioned, collaborative-based recommenders

65
00:06:03,430 --> 00:06:08,300
can only predict scoring of an item if there are other users who have rated it.

66
00:06:08,300 --> 00:06:14,080
Due to sparsity, we might not have enough ratings in the user-item dataset, which makes

67
00:06:14,080 --> 00:06:20,410
it impossible to provide proper recommendations. Another issue to keep in mind is something

68
00:06:20,410 --> 00:06:25,720
called ‘cold start.’ Cold start refers to the difficulty the recommendation

69
00:06:25,720 --> 00:06:32,110
system has when there is a new user and, as such, a profile doesn’t exist for them yet.

70
00:06:32,110 --> 00:06:38,140
Cold start can also happen when we have a new item, which has not received a rating.

71
00:06:38,140 --> 00:06:44,000
Scalability can become an issue, as well. As the number of users or items increases

72
00:06:44,000 --> 00:06:49,750
and the amount of data expands, Collaborative filtering algorithms will begin to suffer

73
00:06:49,750 --> 00:06:55,500
drops in performance, simply due to growth in the similarity computation.

74
00:06:55,500 --> 00:07:01,000
There are some solutions for each of these challenges, such as using hybrid-based recommender

75
00:07:01,000 --> 00:07:04,350
systems, but they are out of scope of this course.

76
00:07:04,350 --> 00:07:05,720
Thanks for watching!