73_debugging-the-training-pipeline-(pytorch).srt
1
00:00:06,210 --> 00:00:08,760
- In this video, we will see how to debug an error
2
00:00:08,760 --> 00:00:11,896
you encounter when running Trainer.train.
3
00:00:11,896 --> 00:00:15,066
As an example, we will use this script that finetunes
4
00:00:15,066 --> 00:00:17,760
a BERT model on the GLUE MNLI dataset.
5
00:00:17,760 --> 00:00:19,470
Check out the videos linked below
6
00:00:19,470 --> 00:00:21,840
to see how we came to such a script.
7
00:00:21,840 --> 00:00:24,540
Here we want to learn how to debug the problems in it.
8
00:00:25,470 --> 00:00:28,110
Running the script gives us an error pretty quickly.
9
00:00:28,110 --> 00:00:29,040
It happens at the line
10
00:00:29,040 --> 00:00:30,990
where we feed the inputs to the model,
11
00:00:30,990 --> 00:00:32,850
according to the traceback.
12
00:00:32,850 --> 00:00:34,702
That tells us there is a problem there,
13
00:00:34,702 --> 00:00:37,881
but the problem could come from many different causes.
14
00:00:37,881 --> 00:00:39,330
To debug an error in training,
15
00:00:39,330 --> 00:00:41,760
you need to make sure each step of the training pipeline
16
00:00:41,760 --> 00:00:43,440
works as intended.
17
00:00:43,440 --> 00:00:45,780
This means checking that the inputs of your dataset
18
00:00:45,780 --> 00:00:47,040
are correct,
19
00:00:47,040 --> 00:00:48,720
that you can batch them together,
20
00:00:48,720 --> 00:00:50,790
feed them through the model to get a loss,
21
00:00:50,790 --> 00:00:52,500
then compute the gradients of that loss
22
00:00:52,500 --> 00:00:54,303
before performing an optimizer step.
23
00:00:55,470 --> 00:00:57,810
So let's start by looking at the training dataset
24
00:00:57,810 --> 00:00:59,043
this Trainer is using.
25
00:00:59,910 --> 00:01:02,190
There is definitely a problem here.
26
00:01:02,190 --> 00:01:04,293
We see texts and not numbers.
27
00:01:05,130 --> 00:01:06,660
The error message was telling us the model
28
00:01:06,660 --> 00:01:08,220
did not get input IDs
29
00:01:08,220 --> 00:01:11,100
and indeed we do not have those in the dataset.
30
00:01:11,100 --> 00:01:12,660
Looking back at our code,
31
00:01:12,660 --> 00:01:14,400
we can see we made a mistake
32
00:01:14,400 --> 00:01:17,400
and passed the wrong datasets to the Trainer.
33
00:01:17,400 --> 00:01:19,173
So let's fix that and run again.
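The fix itself is one argument in the Trainer call (a sketch; the variable names tokenized_datasets, training_args and compute_metrics are assumptions about the script):

    from transformers import Trainer

    # The bug was passing the raw splits; pass the tokenized ones instead.
    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation_matched"],  # MNLI matched split
        compute_metrics=compute_metrics,
    )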
34
00:01:20,490 --> 00:01:21,840
Now we have a new error.
35
00:01:21,840 --> 00:01:23,130
Inspecting the traceback
36
00:01:23,130 --> 00:01:25,860
tells us it happens when we try to create a batch,
37
00:01:25,860 --> 00:01:28,743
specifically when grouping the features into a tensor.
38
00:01:29,700 --> 00:01:32,610
We can confirm this by asking the Trainer to get us a batch
39
00:01:32,610 --> 00:01:34,230
from the training data loader,
40
00:01:34,230 --> 00:01:35,913
which reproduces the same error.
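Here is how that check looks in code (a minimal sketch against the usual Trainer API):

    # Build one batch exactly the way the Trainer does; if collation
    # is broken, this line reproduces the error outside of training.
    batch = next(iter(trainer.get_train_dataloader()))
    # When it succeeds, the tensor shapes reveal whether padding happened.
    print({k: v.shape for k, v in batch.items()})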
41
00:01:36,780 --> 00:01:39,064
Either by inspecting the inputs or debugging,
42
00:01:39,064 --> 00:01:42,870
we can then see they are not all of the same size.
43
00:01:42,870 --> 00:01:45,120
This is because we have not passed a data collator
44
00:01:45,120 --> 00:01:46,890
to the Trainer to do the padding
45
00:01:46,890 --> 00:01:49,443
and didn't pad when preprocessing the data either.
46
00:01:50,430 --> 00:01:52,710
Padding inside the Trainer is normally the default,
47
00:01:52,710 --> 00:01:55,380
but only if you provide your tokenizer to the Trainer,
48
00:01:55,380 --> 00:01:57,270
and we forgot to do that.
49
00:01:57,270 --> 00:01:59,120
So let's fix the issue and run again.
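The corrected Trainer call might look like this (a sketch; when a tokenizer is provided, the Trainer defaults to DataCollatorWithPadding):

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation_matched"],
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,  # enables dynamic padding per batch
    )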
50
00:02:00,510 --> 00:02:02,883
This time we get a nasty CUDA error.
51
00:02:03,765 --> 00:02:06,285
They are very difficult to debug because for one,
52
00:02:06,285 --> 00:02:10,530
they put your kernel in a state that is not recoverable
53
00:02:10,530 --> 00:02:13,260
so you have to restart your notebook from the beginning
54
00:02:13,260 --> 00:02:16,950
and two, the traceback is completely useless for those.
55
00:02:16,950 --> 00:02:19,230
Here the traceback tells us the error happens
56
00:02:19,230 --> 00:02:22,500
when we do the gradient computation with loss.backward,
57
00:02:22,500 --> 00:02:25,113
but as we will see later on, that is not the case.
58
00:02:26,520 --> 00:02:28,920
This is because everything that happens on the GPU
59
00:02:28,920 --> 00:02:30,720
is done asynchronously.
60
00:02:30,720 --> 00:02:32,880
When you execute the model call,
61
00:02:32,880 --> 00:02:34,457
what the program does is just stacking that
62
00:02:34,457 --> 00:02:36,600
in the queue of the GPU,
63
00:02:36,600 --> 00:02:39,856
then if the GPU didn't have any current job to do,
64
00:02:39,856 --> 00:02:41,850
the work will start on the GPU at the same time
65
00:02:41,850 --> 00:02:45,000
as the CPU moves to the next instruction.
66
00:02:45,000 --> 00:02:47,040
Continuing with the extraction of the loss,
67
00:02:47,040 --> 00:02:49,170
this is stacked into the GPU queue
68
00:02:49,170 --> 00:02:51,953
while the CPU moves to the instruction loss.backward.
69
00:02:51,953 --> 00:02:54,180
But the GPU still hasn't finished
70
00:02:54,180 --> 00:02:55,710
the forward pass of the model
71
00:02:55,710 --> 00:02:57,603
since all that took no time at all.
72
00:02:58,440 --> 00:03:00,210
The CPU stops moving forward,
73
00:03:00,210 --> 00:03:03,240
because loss.backward is an instruction telling it to wait
74
00:03:03,240 --> 00:03:04,830
for the GPU to be finished,
75
00:03:04,830 --> 00:03:06,780
to make sure the gradients are correct.
76
00:03:07,650 --> 00:03:09,570
When the GPU encounters an error,
77
00:03:09,570 --> 00:03:13,140
it gives it back to the CPU with a cryptic message
78
00:03:13,140 --> 00:03:15,423
which raises the error at the wrong place.
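(As an aside, not used in this video: CUDA also offers a synchronous launch mode that makes such tracebacks point at the real faulty operation, at the cost of slower execution.)

    import os
    # Must be set before CUDA is initialized, i.e. before the first GPU op.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"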
79
00:03:16,350 --> 00:03:18,720
So to debug this, we will need to execute
80
00:03:18,720 --> 00:03:21,211
the next steps of the training pipeline on the CPU.
81
00:03:21,211 --> 00:03:22,380
It is very easy to do,
82
00:03:22,380 --> 00:03:25,350
and we get a traceback we can trust this time.
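Concretely, moving one batch and the model to the CPU looks like this (a sketch, reusing the batch grabbed from the train dataloader above):

    # On the CPU every operation runs synchronously, so the traceback
    # now points at the operation that actually fails.
    batch = {k: v.cpu() for k, v in batch.items()}
    outputs = trainer.model.cpu()(**batch)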
83
00:03:25,350 --> 00:03:26,520
As we said before,
84
00:03:26,520 --> 00:03:28,620
the error actually happens during the forward pass
85
00:03:28,620 --> 00:03:29,453
of the model,
86
00:03:29,453 --> 00:03:30,993
and not during loss.backward.
87
00:03:31,920 --> 00:03:33,680
It's an index error.
88
00:03:33,680 --> 00:03:34,950
With a bit of debugging,
89
00:03:34,950 --> 00:03:37,410
we see we have labels ranging from 0 to 2,
90
00:03:37,410 --> 00:03:39,000
so three different values,
91
00:03:39,000 --> 00:03:42,191
but our outputs have a shape of batch size by 2.
92
00:03:42,191 --> 00:03:45,600
It looks like our model has the wrong number of labels.
93
00:03:45,600 --> 00:03:47,190
We can indeed confirm that,
94
00:03:47,190 --> 00:03:49,860
and now that we know, it's easy to fix it in the code
95
00:03:49,860 --> 00:03:53,969
by adding num_labels=3 when we create the model.
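The check and the fix, sketched (the checkpoint name here is an assumption about the script):

    from transformers import AutoModelForSequenceClassification

    print(set(trainer.train_dataset["label"]))  # expect {0, 1, 2}
    print(trainer.model.config.num_labels)      # 2 here: the mismatch
    # Recreate the model with the right number of output labels.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=3
    )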
96
00:03:53,969 --> 00:03:56,883
Now the training script will run to completion.
97
00:03:58,440 --> 00:03:59,430
We did not need it yet,
98
00:03:59,430 --> 00:04:00,960
but here is how we would debug the next step
99
00:04:00,960 --> 00:04:02,944
of the pipeline, gradient computation,
100
00:04:02,944 --> 00:04:05,850
as well as the optimizer step.
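Still on the CPU, those last two steps can be exercised like this (a minimal sketch; create_optimizer builds the optimizer the Trainer would use):

    loss = outputs.loss
    loss.backward()             # gradient computation
    trainer.create_optimizer()  # instantiate the Trainer's optimizer
    trainer.optimizer.step()    # one optimizer update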
101
00:04:05,850 --> 00:04:08,823
With all of this, good luck debugging your own training runs!