Lecture 3 _ Loss Functions and Optimization.srt
1
00:00:08,555 --> 00:00:12,482
- Okay so welcome to
CS 231N Lecture three.
2
00:00:12,482 --> 00:00:15,394
Today we're going to talk about
loss functions and optimization
3
00:00:15,394 --> 00:00:17,520
but as usual, before we
get to the main content
4
00:00:17,520 --> 00:00:19,762
of the lecture, there's a
couple administrative things
5
00:00:19,762 --> 00:00:20,929
to talk about.
6
00:00:21,894 --> 00:00:25,347
So the first thing is that
assignment one has been released.
7
00:00:25,347 --> 00:00:27,689
You can find the link up on the website.
8
00:00:27,689 --> 00:00:29,176
And since we were a little bit late
9
00:00:29,176 --> 00:00:30,886
in getting this assignment
out to you guys,
10
00:00:30,886 --> 00:00:33,981
we've decided to change
the due date to Thursday,
11
00:00:33,981 --> 00:00:36,064
April 20th at 11:59 p.m.,
12
00:00:37,174 --> 00:00:40,081
this will give you a full
two weeks from the assignment
13
00:00:40,081 --> 00:00:43,502
release date to go and
actually finish and work on it,
14
00:00:43,502 --> 00:00:47,299
so we'll update the syllabus
for this new due date
15
00:00:47,299 --> 00:00:49,887
in a little bit later today.
16
00:00:49,887 --> 00:00:51,825
And as a reminder, when you
complete the assignment,
17
00:00:51,825 --> 00:00:55,417
you should go turn in the
final zip file on Canvas
18
00:00:55,417 --> 00:00:57,579
so we can grade it and get
your grades back as quickly
19
00:00:57,579 --> 00:00:58,579
as possible.
20
00:00:59,599 --> 00:01:04,233
So the next thing is always
check out Piazza for interesting
21
00:01:04,233 --> 00:01:05,679
administrative stuff.
22
00:01:05,679 --> 00:01:08,588
So this week I wanted to
highlight that we have several
23
00:01:08,588 --> 00:01:12,232
example project ideas as
a pinned post on Piazza.
24
00:01:12,232 --> 00:01:15,730
So we went out and solicited
example project ideas
25
00:01:15,730 --> 00:01:18,020
from various people in the
Stanford community or affiliated
26
00:01:18,020 --> 00:01:20,951
with Stanford, and they came
up with some interesting
27
00:01:20,951 --> 00:01:23,713
suggestions for projects
that they might want students
28
00:01:23,713 --> 00:01:25,383
in the class to work on.
29
00:01:25,383 --> 00:01:27,786
So check out this pinned post
on Piazza and if you want
30
00:01:27,786 --> 00:01:31,384
to work on any of these projects,
then feel free to contact
31
00:01:31,384 --> 00:01:35,031
the project mentors
directly about these things.
32
00:01:35,031 --> 00:01:37,890
Additionally we posted office
hours on the course website,
33
00:01:37,890 --> 00:01:41,556
this is a Google calendar, so
this is something that people
34
00:01:41,556 --> 00:01:45,877
have been asking about
and now it's up there.
35
00:01:45,877 --> 00:01:49,107
The final administrative
note is about Google Cloud,
36
00:01:49,107 --> 00:01:52,545
as a reminder, because we're
supported by Google Cloud
37
00:01:52,545 --> 00:01:55,131
in this class, we're able to
give each of you an additional
38
00:01:55,131 --> 00:01:57,833
$100 credit for Google Cloud
to work on your assignments
39
00:01:57,833 --> 00:02:01,487
and projects, and the exact
details of how to redeem
40
00:02:01,487 --> 00:02:06,046
that credit will go out later
today, most likely on Piazza.
41
00:02:06,046 --> 00:02:08,610
So if there's, I guess if
there's no questions about
42
00:02:08,610 --> 00:02:12,777
administrative stuff then we'll
move on to course content.
43
00:02:14,240 --> 00:02:15,073
Okay cool.
44
00:02:16,359 --> 00:02:18,797
So recall from last time in lecture two,
45
00:02:18,797 --> 00:02:21,212
we were really talking about
the challenges of recognition
46
00:02:21,212 --> 00:02:23,002
and trying to hone in on this idea
47
00:02:23,002 --> 00:02:25,276
of a data-driven approach.
48
00:02:25,276 --> 00:02:27,721
We talked about this idea
of image classification,
49
00:02:27,721 --> 00:02:29,960
talked about why it's hard,
there's this semantic gap
50
00:02:29,960 --> 00:02:34,002
between the giant grid of
numbers that the computer sees
51
00:02:34,002 --> 00:02:36,612
and the actual image that you see.
52
00:02:36,612 --> 00:02:38,445
We talked about various
challenges regarding this
53
00:02:38,445 --> 00:02:40,757
around illumination,
deformation, et cetera,
54
00:02:40,757 --> 00:02:42,924
and why this is actually a
really, really hard problem
55
00:02:42,924 --> 00:02:44,986
even though it's super
easy for people to do
56
00:02:44,986 --> 00:02:48,712
with their human eyes
and human visual system.
57
00:02:48,712 --> 00:02:51,221
Then also recall last time
we talked about the k-nearest
58
00:02:51,221 --> 00:02:54,289
neighbor classifier as kind
of a simple introduction
59
00:02:54,289 --> 00:02:56,109
to this whole data-driven mindset.
60
00:02:56,109 --> 00:02:58,792
We talked about the CIFAR-10
data set where you can see
61
00:02:58,792 --> 00:03:01,624
an example of these images
on the upper left here,
62
00:03:01,624 --> 00:03:04,488
where CIFAR-10 gives you
these 10 different categories,
63
00:03:04,488 --> 00:03:06,587
airplane, automobile, whatnot,
64
00:03:06,587 --> 00:03:09,427
and we talked about how the
k-nearest neighbor classifier
65
00:03:09,427 --> 00:03:12,002
can be used to learn decision boundaries
66
00:03:12,002 --> 00:03:14,404
to separate these data points into classes
67
00:03:14,404 --> 00:03:16,546
based on the training data.
68
00:03:16,546 --> 00:03:19,399
This also led us to a
discussion of the idea of cross
69
00:03:19,399 --> 00:03:21,755
validation and setting
hyperparameters by dividing
70
00:03:21,755 --> 00:03:25,990
your data into train,
validation and test sets.
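The train/validation/test split recalled here can be sketched in a few lines (an illustrative sketch only; the dataset size and the 70/15/15 proportions are made up for the example, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.permutation(1000)   # stand-in for 1,000 shuffled examples

train = data[:700]             # fit the classifier on this portion
val = data[700:850]            # choose hyperparameters (e.g. k) here
test = data[850:]              # touch only once, for the final evaluation
```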
71
00:03:25,990 --> 00:03:28,008
Then also recall last time
we talked about linear
72
00:03:28,008 --> 00:03:30,857
classification as the first
sort of building block
73
00:03:30,857 --> 00:03:33,210
as we move toward neural networks.
74
00:03:33,210 --> 00:03:35,526
Recall that the linear
classifier is an example
75
00:03:35,526 --> 00:03:39,338
of a parametric classifier
where all of our knowledge
76
00:03:39,338 --> 00:03:41,328
about the training data gets summarized
77
00:03:41,328 --> 00:03:44,146
into this parameter matrix W that is set
78
00:03:44,146 --> 00:03:46,244
during the process of training.
79
00:03:46,244 --> 00:03:49,248
And this linear classifier
recall is super simple,
80
00:03:49,248 --> 00:03:51,115
where we're going to take
the image and stretch it out
81
00:03:51,115 --> 00:03:52,610
into a long vector.
82
00:03:52,610 --> 00:03:55,774
So here the image is x and
then we take that image
83
00:03:55,774 --> 00:03:59,095
which might be 32 by 32 by
3 pixels, stretch it out
84
00:03:59,095 --> 00:04:02,051
into a long column vector of 32 times 32
85
00:04:02,051 --> 00:04:03,718
times 3 entries,
86
00:04:05,144 --> 00:04:07,203
where the 32 and 32 are
the height and width,
87
00:04:07,203 --> 00:04:09,023
and the 3 give you
the three color channels,
88
00:04:09,023 --> 00:04:10,522
red, green, blue.
89
00:04:10,522 --> 00:04:14,361
Then there exists some parameter matrix, W
90
00:04:14,361 --> 00:04:16,481
which will take this long column vector
91
00:04:16,481 --> 00:04:19,317
representing the image
pixels, and convert this
92
00:04:19,317 --> 00:04:21,642
and give you 10 numbers giving scores
93
00:04:21,642 --> 00:04:25,187
for each of the 10 classes
in the case of CIFAR-10.
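The scoring step just described can be written out in a few lines of NumPy (a minimal sketch, with random numbers standing in for a real image and trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# A CIFAR-10 image: 32 x 32 pixels with 3 color channels (red, green, blue).
image = rng.random((32, 32, 3))

# Stretch it out into a long column vector of 32 * 32 * 3 = 3072 entries.
x = image.reshape(32 * 32 * 3)

# Parameter matrix W: one row of 3072 weights for each of the 10 classes.
W = rng.standard_normal((10, 32 * 32 * 3)) * 0.01

# W times x gives 10 numbers: one score per class.
scores = W.dot(x)              # shape (10,)
```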
94
00:04:25,187 --> 00:04:26,916
Where we kind of had this interpretation
95
00:04:26,916 --> 00:04:30,417
where larger values of those scores,
96
00:04:30,417 --> 00:04:33,147
so a larger value for the cat
class means the classifier
97
00:04:33,147 --> 00:04:35,681
thinks that the cat is
more likely for that image,
98
00:04:35,681 --> 00:04:38,350
and lower values for
maybe the dog or car class
99
00:04:38,350 --> 00:04:41,353
indicate lower probabilities
of those classes being present
100
00:04:41,353 --> 00:04:43,243
in the image.
101
00:04:43,243 --> 00:04:46,564
Also, so I think this point
was a little bit unclear
102
00:04:46,564 --> 00:04:50,209
last time that linear classification
has this interpretation
103
00:04:50,209 --> 00:04:52,425
as learning templates per class,
104
00:04:52,425 --> 00:04:55,128
where if you look at the
diagram on the lower left,
105
00:04:55,128 --> 00:04:58,299
you think that, so for
every pixel in the image,
106
00:04:58,299 --> 00:05:00,411
and for every one of our 10 classes,
107
00:05:00,411 --> 00:05:03,244
there exists some entry in this matrix W,
108
00:05:03,244 --> 00:05:07,354
telling us how much does that
pixel influence that class.
109
00:05:07,354 --> 00:05:10,416
So that means that each of
these rows in the matrix W
110
00:05:10,416 --> 00:05:13,212
ends up corresponding to
a template for the class.
111
00:05:13,212 --> 00:05:15,479
And if we take those rows and unravel,
112
00:05:15,479 --> 00:05:17,724
so each of those rows again corresponds
113
00:05:17,724 --> 00:05:20,540
to a weighting
114
00:05:20,540 --> 00:05:23,351
between the pixel values of
the image and that class,
115
00:05:23,351 --> 00:05:26,246
so if we take that row and
unravel it back into an image,
116
00:05:26,246 --> 00:05:28,787
then we can visualize the
learned template for each
117
00:05:28,787 --> 00:05:30,700
of these classes.
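Concretely, pulling a row out of W and unraveling it back into image shape might look like this (a sketch, assuming the same one-row-per-class layout as above; the class index 3 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# One row of 32 * 32 * 3 = 3072 weights per class, for 10 classes.
W = rng.standard_normal((10, 32 * 32 * 3))

# Take the row for one class and unravel it back into a 32 x 32 x 3
# array -- this is the learned "template" that can be visualized as
# an image (after rescaling its values into a displayable range).
template = W[3].reshape(32, 32, 3)
```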
118
00:05:30,700 --> 00:05:33,324
We also had this interpretation
of linear classification
119
00:05:33,324 --> 00:05:36,199
as learning linear decision
boundaries between pixels
120
00:05:36,199 --> 00:05:38,588
in some high dimensional
space where the dimensions
121
00:05:38,588 --> 00:05:41,611
of the space correspond
to the values of the pixel
122
00:05:41,611 --> 00:05:44,574
intensity values of the image.
123
00:05:44,574 --> 00:05:48,371
So this is kind of where
we left off last time.
124
00:05:48,371 --> 00:05:51,615
And so where we kind of
stopped, where we ended up last
125
00:05:51,615 --> 00:05:54,941
time is we got this idea
of a linear classifier,
126
00:05:54,941 --> 00:05:58,354
and we didn't talk about how
to actually choose the W.
127
00:05:58,354 --> 00:06:00,189
How to actually use the training data
128
00:06:00,189 --> 00:06:03,428
to determine which value
of W should be best.
129
00:06:03,428 --> 00:06:05,256
So kind of where we stopped off at
130
00:06:05,256 --> 00:06:09,092
is that for some setting
of W, we can use this W
131
00:06:09,092 --> 00:06:12,868
to come up with our 10
class scores for any image.
132
00:06:12,868 --> 00:06:16,397
So some of these class
scores might be better or worse.
133
00:06:16,397 --> 00:06:17,964
So here in this simple example,
134
00:06:17,964 --> 00:06:21,633
we've shown maybe just a
training data set of three images
135
00:06:21,633 --> 00:06:25,384
along with the 10 class scores
predicted for some value of W
136
00:06:25,384 --> 00:06:26,846
for those images.
137
00:06:26,846 --> 00:06:28,647
And you can see that some
of these scores are better
138
00:06:28,647 --> 00:06:30,306
or worse than others.
139
00:06:30,306 --> 00:06:33,144
So for example in the image
on the left, if you look up,
140
00:06:33,144 --> 00:06:35,042
it's actually a cat because you're a human
141
00:06:35,042 --> 00:06:36,724
and you can tell these things,
142
00:06:36,724 --> 00:06:39,752
but if we look at the
assigned probabilities, cat,
143
00:06:39,752 --> 00:06:41,868
well not probabilities but scores,
144
00:06:41,868 --> 00:06:44,236
then the classifier maybe
for this setting of W
145
00:06:44,236 --> 00:06:48,882
gave the cat class a score
of 2.9 for this image,
146
00:06:48,882 --> 00:06:51,818
whereas the frog class gave 3.78.
147
00:06:51,818 --> 00:06:53,909
So maybe the classifier
is not doing so well
148
00:06:53,909 --> 00:06:56,236
on this image, that's bad,
we wanted the true class
149
00:06:56,236 --> 00:06:58,720
to be actually the highest class score,
150
00:06:58,720 --> 00:07:00,909
whereas for some of these
other examples, like the car
151
00:07:00,909 --> 00:07:03,529
for example, you see
that the automobile class
152
00:07:03,529 --> 00:07:05,193
has a score of six which is much higher
153
00:07:05,193 --> 00:07:07,619
than any of the others, so that's good.
154
00:07:07,619 --> 00:07:11,433
And the frog, the predicted
score is maybe negative four,
155
00:07:11,433 --> 00:07:13,637
which is much lower
than all the other ones,
156
00:07:13,637 --> 00:07:15,157
so that's actually bad.
157
00:07:15,157 --> 00:07:17,331
So this is kind of a hand wavy approach,
158
00:07:17,331 --> 00:07:19,140
just kind of looking at
the scores and eyeballing
159
00:07:19,140 --> 00:07:21,454
which ones are good
and which ones are bad.
160
00:07:21,454 --> 00:07:23,610
But to actually write
algorithms about these things
161
00:07:23,610 --> 00:07:26,064
and to actually determine
automatically which W
162
00:07:26,064 --> 00:07:29,660
will be best, we need some
way to quantify the badness
163
00:07:29,660 --> 00:07:31,832
of any particular W.
164
00:07:31,832 --> 00:07:35,826
And that's this function
that takes in a W,
165
00:07:35,826 --> 00:07:39,283
looks at the scores and then
tells us how bad quantitatively
166
00:07:39,283 --> 00:07:42,787
is that W, is something that
we'll call a loss function.
167
00:07:42,787 --> 00:07:45,467
And in this lecture we'll
see a couple examples
168
00:07:45,467 --> 00:07:48,093
of different loss functions
that you can use for this image
169
00:07:48,093 --> 00:07:50,582
classification problem.
170
00:07:50,582 --> 00:07:53,483
So then once we've got this
idea of a loss function,
171
00:07:53,483 --> 00:07:57,532
this allows us to quantify
for any given value of W,
172
00:07:57,532 --> 00:07:59,298
how good or bad is it?
173
00:07:59,298 --> 00:08:00,834
But then we actually need to find
174
00:08:00,834 --> 00:08:02,730
and come up with an efficient procedure
175
00:08:02,730 --> 00:08:05,570
for searching through the
space of all possible Ws
176
00:08:05,570 --> 00:08:08,934
and actually come up with
what is the correct value
177
00:08:08,934 --> 00:08:11,488
of W that is the least bad,
178
00:08:11,488 --> 00:08:13,660
and this process will be
an optimization procedure
179
00:08:13,660 --> 00:08:17,076
and we'll talk more about
that in this lecture.
180
00:08:17,076 --> 00:08:19,091
So I'm going to shrink
this example a little bit
181
00:08:19,091 --> 00:08:21,803
because 10 classes is
a little bit unwieldy.
182
00:08:21,803 --> 00:08:24,731
So we'll kind of work with
this tiny toy data set
183
00:08:24,731 --> 00:08:27,551
of three examples and
three classes going forward
184
00:08:27,551 --> 00:08:29,686
in this lecture.
185
00:08:29,686 --> 00:08:33,639
So again, in this example, the
cat is maybe not so correctly
186
00:08:33,639 --> 00:08:38,407
classified, the car is correctly
classified, and the frog,
187
00:08:38,407 --> 00:08:41,320
this setting of W got this
frog image totally wrong,
188
00:08:41,320 --> 00:08:45,225
because the frog score is
much lower than others.
189
00:08:45,225 --> 00:08:47,764
So to formalize this a little
bit, usually when we talk
190
00:08:47,764 --> 00:08:49,617
about a loss function, we imagine
191
00:08:49,617 --> 00:08:53,670
that we have some training
data set of xs and ys,
192
00:08:53,670 --> 00:08:56,996
usually N examples of these
where the xs are the inputs
193
00:08:56,996 --> 00:09:00,004
to the algorithm; in the
image classification case,
194
00:09:00,004 --> 00:09:03,862
the xs would be the actual
pixel values of your images,
195
00:09:03,862 --> 00:09:06,207
and the ys will be the things
you want your algorithm
196
00:09:06,207 --> 00:09:09,730
to predict, we usually call
these the labels or the targets.
197
00:09:09,730 --> 00:09:11,782
So in the case of image classification,
198
00:09:11,782 --> 00:09:14,540
remember we're trying
to categorize each image
199
00:09:14,540 --> 00:09:17,597
for CIFAR-10 to one of 10 categories,
200
00:09:17,597 --> 00:09:19,801
so the label y here will be an integer
201
00:09:19,801 --> 00:09:22,948
between one and 10 or
maybe between zero and nine
202
00:09:22,948 --> 00:09:25,214
depending on what programming
language you're using,
203
00:09:25,214 --> 00:09:27,045
but it'll be an integer telling you
204
00:09:27,045 --> 00:09:31,070
what is the correct category
for each one of those images x.
205
00:09:31,070 --> 00:09:35,284
And now our loss function
which we'll denote L_i,
206
00:09:35,284 --> 00:09:37,693
so then we have this prediction function f
207
00:09:37,693 --> 00:09:41,769
which takes in our example
x and our weight matrix W
208
00:09:41,769 --> 00:09:43,638
and makes some prediction for y,
209
00:09:43,638 --> 00:09:45,235
in the case of image classification
210
00:09:45,235 --> 00:09:47,246
these will be our 10 numbers.
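Putting these pieces together, the setup looks roughly like this (an illustrative sketch; the margin-style L_i shown is only a placeholder example of a per-example loss, since the concrete loss functions are defined in what follows):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: N examples x_i with integer labels y_i in {0, ..., 9}.
N, D, C = 3, 3072, 10              # examples, pixels per image, classes
xs = rng.random((N, D))            # each row is one flattened image
ys = np.array([3, 8, 6])           # correct class index for each image

W = rng.standard_normal((C, D)) * 0.01

def f(x, W):
    """Prediction function: image x and weights W -> C class scores."""
    return W.dot(x)

def L_i(scores, y):
    """Per-example loss: a quantitative measure of how bad the scores
    are given the true label y (the margin form is just one choice)."""
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return margins.sum()

losses = [L_i(f(x, W), y) for x, y in zip(xs, ys)]
```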
211
00:09:47,246 --> 00:09:50,738
Then we'll define some loss function L_i
212
00:09:50,738 --> 00:09:53,400
which will take in the predicted scores
213
00:09:53,400 --> 00:09:54,983
coming out of the function f
214
00:09:54,983 --> 00:09:57,604
together with the true target or label Y
215
00:09:57,604 --> 00:10:00,112
and give us some quantitative
value for how bad