Lecture 3 _ Loss Functions and Optimization.srt
1
00:00:08,555 --> 00:00:12,482
- Okay so welcome to
CS 231N Lecture three.
2
00:00:12,482 --> 00:00:15,394
Today we're going to talk about
loss functions and optimization
3
00:00:15,394 --> 00:00:17,520
but as usual, before we
get to the main content
4
00:00:17,520 --> 00:00:19,762
of the lecture, there's a
couple administrative things
5
00:00:19,762 --> 00:00:20,929
to talk about.
6
00:00:21,894 --> 00:00:25,347
So the first thing is that
assignment one has been released.
7
00:00:25,347 --> 00:00:27,689
You can find the link up on the website.
8
00:00:27,689 --> 00:00:29,176
And since we were a little bit late
9
00:00:29,176 --> 00:00:30,886
in getting this assignment
out to you guys,
10
00:00:30,886 --> 00:00:33,981
we've decided to change
the due date to Thursday,
11
00:00:33,981 --> 00:00:36,064
April 20th at 11:59 p.m.,
12
00:00:37,174 --> 00:00:40,081
this will give you a full
two weeks from the assignment
13
00:00:40,081 --> 00:00:43,502
release date to go and
actually finish and work on it,
14
00:00:43,502 --> 00:00:47,299
so we'll update the syllabus
for this new due date
15
00:00:47,299 --> 00:00:49,887
in a little bit later today.
16
00:00:49,887 --> 00:00:51,825
And as a reminder, when you
complete the assignment,
17
00:00:51,825 --> 00:00:55,417
you should go turn in the
final zip file on Canvas
18
00:00:55,417 --> 00:00:57,579
so we can grade it and get
your grades back as quickly
19
00:00:57,579 --> 00:00:58,579
as possible.
20
00:00:59,599 --> 00:01:04,233
So the next thing is always
check out Piazza for interesting
21
00:01:04,233 --> 00:01:05,679
administrative stuff.
22
00:01:05,679 --> 00:01:08,588
So this week I wanted to
highlight that we have several
23
00:01:08,588 --> 00:01:12,232
example project ideas as
a pinned post on Piazza.
24
00:01:12,232 --> 00:01:15,730
So we went out and solicited
example project ideas
25
00:01:15,730 --> 00:01:18,020
from various people in the
Stanford community or affiliated
26
00:01:18,020 --> 00:01:20,951
with Stanford, and they came
up with some interesting
27
00:01:20,951 --> 00:01:23,713
suggestions for projects
that they might want students
28
00:01:23,713 --> 00:01:25,383
in the class to work on.
29
00:01:25,383 --> 00:01:27,786
So check out this pinned post
on Piazza and if you want
30
00:01:27,786 --> 00:01:31,384
to work on any of these projects,
then feel free to contact
31
00:01:31,384 --> 00:01:35,031
the project mentors
directly about these things.
32
00:01:35,031 --> 00:01:37,890
Additionally we posted office
hours on the course website,
33
00:01:37,890 --> 00:01:41,556
this is a Google calendar, so
this is something that people
34
00:01:41,556 --> 00:01:45,877
have been asking about
and now it's up there.
35
00:01:45,877 --> 00:01:49,107
The final administrative
note is about Google Cloud,
36
00:01:49,107 --> 00:01:52,545
as a reminder, because we're
supported by Google Cloud
37
00:01:52,545 --> 00:01:55,131
in this class, we're able to
give each of you an additional
38
00:01:55,131 --> 00:01:57,833
$100 credit for Google Cloud
to work on your assignments
39
00:01:57,833 --> 00:02:01,487
and projects, and the exact
details of how to redeem
40
00:02:01,487 --> 00:02:06,046
that credit will go out later
today, most likely on Piazza.
41
00:02:06,046 --> 00:02:08,610
So if there's, I guess if
there's no questions about
42
00:02:08,610 --> 00:02:12,777
administrative stuff then we'll
move on to course content.
43
00:02:14,240 --> 00:02:15,073
Okay cool.
44
00:02:16,359 --> 00:02:18,797
So recall from last time in lecture two,
45
00:02:18,797 --> 00:02:21,212
we were really talking about
the challenges of recognition
46
00:02:21,212 --> 00:02:23,002
and trying to hone in on this idea
47
00:02:23,002 --> 00:02:25,276
of a data-driven approach.
48
00:02:25,276 --> 00:02:27,721
We talked about this idea
of image classification,
49
00:02:27,721 --> 00:02:29,960
talked about why it's hard,
there's this semantic gap
50
00:02:29,960 --> 00:02:34,002
between the giant grid of
numbers that the computer sees
51
00:02:34,002 --> 00:02:36,612
and the actual image that you see.
52
00:02:36,612 --> 00:02:38,445
We talked about various
challenges regarding this
53
00:02:38,445 --> 00:02:40,757
around illumination,
deformation, et cetera,
54
00:02:40,757 --> 00:02:42,924
and why this is actually a
really, really hard problem
55
00:02:42,924 --> 00:02:44,986
even though it's super
easy for people to do
56
00:02:44,986 --> 00:02:48,712
with their human eyes
and human visual system.
57
00:02:48,712 --> 00:02:51,221
Then also recall last time
we talked about the k-nearest
58
00:02:51,221 --> 00:02:54,289
neighbor classifier as kind
of a simple introduction
59
00:02:54,289 --> 00:02:56,109
to this whole data-driven mindset.
60
00:02:56,109 --> 00:02:58,792
We talked about the CIFAR-10
data set where you can see
61
00:02:58,792 --> 00:03:01,624
an example of these images
on the upper left here,
62
00:03:01,624 --> 00:03:04,488
where CIFAR-10 gives you
these 10 different categories,
63
00:03:04,488 --> 00:03:06,587
airplane, automobile, whatnot,
64
00:03:06,587 --> 00:03:09,427
and we talked about how the
k-nearest neighbor classifier
65
00:03:09,427 --> 00:03:12,002
can be used to learn decision boundaries
66
00:03:12,002 --> 00:03:14,404
to separate these data points into classes
67
00:03:14,404 --> 00:03:16,546
based on the training data.
68
00:03:16,546 --> 00:03:19,399
This also led us to a
discussion of the idea of cross
69
00:03:19,399 --> 00:03:21,755
validation and setting
hyperparameters by dividing
70
00:03:21,755 --> 00:03:25,990
your data into train,
validation and test sets.
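The train/validation/test split recalled here can be sketched in a few lines (an illustrative sketch only; the dataset size and the 70/15/15 proportions are made up for the example, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.permutation(1000)   # stand-in for 1,000 shuffled examples

train = data[:700]             # fit the classifier on this portion
val = data[700:850]            # choose hyperparameters (e.g. k) here
test = data[850:]              # touch only once, for the final evaluation
```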
71
00:03:25,990 --> 00:03:28,008
Then also recall last time
we talked about linear
72
00:03:28,008 --> 00:03:30,857
classification as the first
sort of building block
73
00:03:30,857 --> 00:03:33,210
as we move toward neural networks.
74
00:03:33,210 --> 00:03:35,526
Recall that the linear
classifier is an example
75
00:03:35,526 --> 00:03:39,338
of a parametric classifier
where all of our knowledge
76
00:03:39,338 --> 00:03:41,328
about the training data gets summarized
77
00:03:41,328 --> 00:03:44,146
into this parameter matrix W that is set
78
00:03:44,146 --> 00:03:46,244
during the process of training.
79
00:03:46,244 --> 00:03:49,248
And this linear classifier
recall is super simple,
80
00:03:49,248 --> 00:03:51,115
where we're going to take
the image and stretch it out
81
00:03:51,115 --> 00:03:52,610
into a long vector.
82
00:03:52,610 --> 00:03:55,774
So here the image is x and
then we take that image
83
00:03:55,774 --> 00:03:59,095
which might be 32 by 32 by
3 pixels, stretch it out
84
00:03:59,095 --> 00:04:02,051
into a long column vector of 32 times 32
85
00:04:02,051 --> 00:04:03,718
times 3 entries,
86
00:04:05,144 --> 00:04:07,203
where the 32 and 32 are
the height and width,
87
00:04:07,203 --> 00:04:09,023
and the 3 give you
the three color channels,
88
00:04:09,023 --> 00:04:10,522
red, green, blue.
89
00:04:10,522 --> 00:04:14,361
Then there exists some parameter matrix, W
90
00:04:14,361 --> 00:04:16,481
which will take this long column vector
91
00:04:16,481 --> 00:04:19,317
representing the image
pixels, and convert this
92
00:04:19,317 --> 00:04:21,642
and give you 10 numbers giving scores
93
00:04:21,642 --> 00:04:25,187
for each of the 10 classes
in the case of CIFAR-10.
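The scoring step just described can be written out in a few lines of NumPy (a minimal sketch, with random numbers standing in for a real image and trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# A CIFAR-10 image: 32 x 32 pixels with 3 color channels (red, green, blue).
image = rng.random((32, 32, 3))

# Stretch it out into a long column vector of 32 * 32 * 3 = 3072 entries.
x = image.reshape(32 * 32 * 3)

# Parameter matrix W: one row of 3072 weights for each of the 10 classes.
W = rng.standard_normal((10, 32 * 32 * 3)) * 0.01

# W times x gives 10 numbers: one score per class.
scores = W.dot(x)              # shape (10,)
```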
94
00:04:25,187 --> 00:04:26,916
Where we kind of had this interpretation
95
00:04:26,916 --> 00:04:30,417
where larger values of those scores,
96
00:04:30,417 --> 00:04:33,147
so a larger value for the cat
class means the classifier
97
00:04:33,147 --> 00:04:35,681
thinks that the cat is
more likely for that image,
98
00:04:35,681 --> 00:04:38,350
and lower values for
maybe the dog or car class
99
00:04:38,350 --> 00:04:41,353
indicate lower probabilities
of those classes being present
100
00:04:41,353 --> 00:04:43,243
in the image.
101
00:04:43,243 --> 00:04:46,564
Also, so I think this point
was a little bit unclear
102
00:04:46,564 --> 00:04:50,209
last time that linear classification
has this interpretation
103
00:04:50,209 --> 00:04:52,425
as learning templates per class,
104
00:04:52,425 --> 00:04:55,128
where if you look at the
diagram on the lower left,
105
00:04:55,128 --> 00:04:58,299
you think that, so for
every pixel in the image,
106
00:04:58,299 --> 00:05:00,411
and for every one of our 10 classes,
107
00:05:00,411 --> 00:05:03,244
there exists some entry in this matrix W,
108
00:05:03,244 --> 00:05:07,354
telling us how much does that
pixel influence that class.
109
00:05:07,354 --> 00:05:10,416
So that means that each of
these rows in the matrix W
110
00:05:10,416 --> 00:05:13,212
ends up corresponding to
a template for the class.
111
00:05:13,212 --> 00:05:15,479
And if we take those rows and unravel,
112
00:05:15,479 --> 00:05:17,724
so each of those rows again corresponds
113
00:05:17,724 --> 00:05:20,540
to a weighting
114
00:05:20,540 --> 00:05:23,351
between the pixel values of
the image and that class,
115
00:05:23,351 --> 00:05:26,246
so if we take that row and
unravel it back into an image,
116
00:05:26,246 --> 00:05:28,787
then we can visualize the
learned template for each
117
00:05:28,787 --> 00:05:30,700
of these classes.
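Concretely, pulling a row out of W and unraveling it back into image shape might look like this (a sketch, assuming the same one-row-per-class layout as above; the class index 3 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# One row of 32 * 32 * 3 = 3072 weights per class, for 10 classes.
W = rng.standard_normal((10, 32 * 32 * 3))

# Take the row for one class and unravel it back into a 32 x 32 x 3
# array -- this is the learned "template" that can be visualized as
# an image (after rescaling its values into a displayable range).
template = W[3].reshape(32, 32, 3)
```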
118
00:05:30,700 --> 00:05:33,324
We also had this interpretation
of linear classification
119
00:05:33,324 --> 00:05:36,199
as learning linear decision
boundaries between pixels
120
00:05:36,199 --> 00:05:38,588
in some high dimensional
space where the dimensions
121
00:05:38,588 --> 00:05:41,611
of the space correspond
to the values of the pixel
122
00:05:41,611 --> 00:05:44,574
intensity values of the image.
123
00:05:44,574 --> 00:05:48,371
So this is kind of where
we left off last time.
124
00:05:48,371 --> 00:05:51,615
And so where we kind of
stopped, where we ended up last
125
00:05:51,615 --> 00:05:54,941
time is we got this idea
of a linear classifier,
126
00:05:54,941 --> 00:05:58,354
and we didn't talk about how
to actually choose the W.
127
00:05:58,354 --> 00:06:00,189
How to actually use the training data
128
00:06:00,189 --> 00:06:03,428
to determine which value
of W should be best.
129
00:06:03,428 --> 00:06:05,256
So kind of where we stopped off at
130
00:06:05,256 --> 00:06:09,092
is that for some setting
of W, we can use this W
131
00:06:09,092 --> 00:06:12,868
to come up with our 10
class scores for any image.
132
00:06:12,868 --> 00:06:16,397
So some of these class
scores might be better or worse.
133
00:06:16,397 --> 00:06:17,964
So here in this simple example,
134
00:06:17,964 --> 00:06:21,633
we've shown maybe just a
training data set of three images
135
00:06:21,633 --> 00:06:25,384
along with the 10 class scores
predicted for some value of W
136
00:06:25,384 --> 00:06:26,846
for those images.
137
00:06:26,846 --> 00:06:28,647
And you can see that some
of these scores are better
138
00:06:28,647 --> 00:06:30,306
or worse than others.
139
00:06:30,306 --> 00:06:33,144
So for example in the image
on the left, if you look up,
140
00:06:33,144 --> 00:06:35,042
it's actually a cat because you're a human
141
00:06:35,042 --> 00:06:36,724
and you can tell these things,
142
00:06:36,724 --> 00:06:39,752
but if we look at the
assigned probabilities, cat,
143
00:06:39,752 --> 00:06:41,868
well not probabilities but scores,
144
00:06:41,868 --> 00:06:44,236
then the classifier maybe
for this setting of W
145
00:06:44,236 --> 00:06:48,882
gave the cat class a score
of 2.9 for this image,
146
00:06:48,882 --> 00:06:51,818
whereas the frog class gave 3.78.
147
00:06:51,818 --> 00:06:53,909
So maybe the classifier
is not doing so well
148
00:06:53,909 --> 00:06:56,236
on this image, that's bad,
we wanted the true class
149
00:06:56,236 --> 00:06:58,720
to be actually the highest class score,
150
00:06:58,720 --> 00:07:00,909
whereas for some of these
other examples, like the car
151
00:07:00,909 --> 00:07:03,529
for example, you see
that the automobile class
152
00:07:03,529 --> 00:07:05,193
has a score of six which is much higher
153
00:07:05,193 --> 00:07:07,619
than any of the others, so that's good.
154
00:07:07,619 --> 00:07:11,433
And the frog, the predicted
score is maybe negative four,
155
00:07:11,433 --> 00:07:13,637
which is much lower
than all the other ones,
156
00:07:13,637 --> 00:07:15,157
so that's actually bad.
157
00:07:15,157 --> 00:07:17,331
So this is kind of a hand wavy approach,
158
00:07:17,331 --> 00:07:19,140
just kind of looking at
the scores and eyeballing
159
00:07:19,140 --> 00:07:21,454
which ones are good
and which ones are bad.
160
00:07:21,454 --> 00:07:23,610
But to actually write
algorithms about these things
161
00:07:23,610 --> 00:07:26,064
and to actually determine
automatically which W
162
00:07:26,064 --> 00:07:29,660
will be best, we need some
way to quantify the badness
163
00:07:29,660 --> 00:07:31,832
of any particular W.
164
00:07:31,832 --> 00:07:35,826
And that's this function
that takes in a W,
165
00:07:35,826 --> 00:07:39,283
looks at the scores and then
tells us how bad quantitatively
166
00:07:39,283 --> 00:07:42,787
is that W, is something that
we'll call a loss function.
167
00:07:42,787 --> 00:07:45,467
And in this lecture we'll
see a couple examples
168
00:07:45,467 --> 00:07:48,093
of different loss functions
that you can use for this image
169
00:07:48,093 --> 00:07:50,582
classification problem.
170
00:07:50,582 --> 00:07:53,483
So then once we've got this
idea of a loss function,
171
00:07:53,483 --> 00:07:57,532
this allows us to quantify
for any given value of W,
172
00:07:57,532 --> 00:07:59,298
how good or bad is it?
173
00:07:59,298 --> 00:08:00,834
But then we actually need to find
174
00:08:00,834 --> 00:08:02,730
and come up with an efficient procedure
175
00:08:02,730 --> 00:08:05,570
for searching through the
space of all possible Ws
176
00:08:05,570 --> 00:08:08,934
and actually come up with
what is the correct value
177
00:08:08,934 --> 00:08:11,488
of W that is the least bad,
178
00:08:11,488 --> 00:08:13,660
and this process will be
an optimization procedure
179
00:08:13,660 --> 00:08:17,076
and we'll talk more about
that in this lecture.
180
00:08:17,076 --> 00:08:19,091
So I'm going to shrink
this example a little bit
181
00:08:19,091 --> 00:08:21,803
because 10 classes is
a little bit unwieldy.
182
00:08:21,803 --> 00:08:24,731
So we'll kind of work with
this tiny toy data set
183
00:08:24,731 --> 00:08:27,551
of three examples and
three classes going forward
184
00:08:27,551 --> 00:08:29,686
in this lecture.
185
00:08:29,686 --> 00:08:33,639
So again, in this example, the
cat is maybe not so correctly
186
00:08:33,639 --> 00:08:38,407
classified, the car is correctly
classified, and the frog,
187
00:08:38,407 --> 00:08:41,320
this setting of W got this
frog image totally wrong,
188
00:08:41,320 --> 00:08:45,225
because the frog score is
much lower than others.
189
00:08:45,225 --> 00:08:47,764
So to formalize this a little
bit, usually when we talk
190
00:08:47,764 --> 00:08:49,617
about a loss function, we imagine
191
00:08:49,617 --> 00:08:53,670
that we have some training
data set of xs and ys,
192
00:08:53,670 --> 00:08:56,996
usually N examples of these
where the xs are the inputs
193
00:08:56,996 --> 00:09:00,004
to the algorithm; in the
image classification case,
194
00:09:00,004 --> 00:09:03,862
the xs would be the actual
pixel values of your images,
195
00:09:03,862 --> 00:09:06,207
and the ys will be the things
you want your algorithm
196
00:09:06,207 --> 00:09:09,730
to predict, we usually call
these the labels or the targets.
197
00:09:09,730 --> 00:09:11,782
So in the case of image classification,
198
00:09:11,782 --> 00:09:14,540
remember we're trying
to categorize each image
199
00:09:14,540 --> 00:09:17,597
for CIFAR-10 to one of 10 categories,
200
00:09:17,597 --> 00:09:19,801
so the label y here will be an integer
201
00:09:19,801 --> 00:09:22,948
between one and 10 or
maybe between zero and nine
202
00:09:22,948 --> 00:09:25,214
depending on what programming
language you're using,
203
00:09:25,214 --> 00:09:27,045
but it'll be an integer telling you
204
00:09:27,045 --> 00:09:31,070
what is the correct category
for each one of those images x.
205
00:09:31,070 --> 00:09:35,284
And now our loss function
which we'll denote L_i,
206
00:09:35,284 --> 00:09:37,693
so then we have this prediction function f
207
00:09:37,693 --> 00:09:41,769
which takes in our example
x and our weight matrix W
208
00:09:41,769 --> 00:09:43,638
and makes some prediction for y,
209
00:09:43,638 --> 00:09:45,235
in the case of image classification
210
00:09:45,235 --> 00:09:47,246
these will be our 10 numbers.
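Putting these pieces together, the setup looks roughly like this (an illustrative sketch; the margin-style L_i shown is only a placeholder example of a per-example loss, since the concrete loss functions are defined in what follows):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: N examples x_i with integer labels y_i in {0, ..., 9}.
N, D, C = 3, 3072, 10              # examples, pixels per image, classes
xs = rng.random((N, D))            # each row is one flattened image
ys = np.array([3, 8, 6])           # correct class index for each image

W = rng.standard_normal((C, D)) * 0.01

def f(x, W):
    """Prediction function: image x and weights W -> C class scores."""
    return W.dot(x)

def L_i(scores, y):
    """Per-example loss: a quantitative measure of how bad the scores
    are given the true label y (the margin form is just one choice)."""
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return margins.sum()

losses = [L_i(f(x, W), y) for x, y in zip(xs, ys)]
```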
211
00:09:47,246 --> 00:09:50,738
Then we'll define some loss function L_i
212
00:09:50,738 --> 00:09:53,400
which will take in the predicted scores
213
00:09:53,400 --> 00:09:54,983
coming out of the function f
214
00:09:54,983 --> 00:09:57,604
together with the true target or label Y
215
00:09:57,604 --> 00:10:00,112
and give us some quantitative
value for how bad