1
00:00:00,554 --> 00:00:03,304
(logo whooshing)
2
00:00:05,340 --> 00:00:07,563
- What happens inside the pipeline function?
3
00:00:08,760 --> 00:00:11,580
In this video, we will look at what actually happens
4
00:00:11,580 --> 00:00:13,080
when we use the pipeline function
5
00:00:13,080 --> 00:00:15,090
of the Transformers library.
6
00:00:15,090 --> 00:00:16,860
More specifically, we will look
7
00:00:16,860 --> 00:00:19,200
at the sentiment analysis pipeline,
8
00:00:19,200 --> 00:00:22,020
and how it went from the following two sentences
9
00:00:22,020 --> 00:00:23,970
to the positive and negative labels
10
00:00:23,970 --> 00:00:25,420
with their respective scores.
11
00:00:26,760 --> 00:00:29,190
As we have seen in the pipeline presentation,
12
00:00:29,190 --> 00:00:31,860
there are three stages in the pipeline.
13
00:00:31,860 --> 00:00:34,620
First, we convert the raw texts to numbers
14
00:00:34,620 --> 00:00:37,173
the model can make sense of, using a tokenizer.
15
00:00:38,010 --> 00:00:40,530
Then those numbers go through the model,
16
00:00:40,530 --> 00:00:41,943
which outputs logits.
17
00:00:42,780 --> 00:00:45,600
Finally, the post-processing step transforms
18
00:00:45,600 --> 00:00:48,150
those logits into labels and scores.
19
00:00:48,150 --> 00:00:50,700
Let's look in detail at those three steps
20
00:00:50,700 --> 00:00:53,640
and how to replicate them using the Transformers library,
21
00:00:53,640 --> 00:00:56,043
beginning with the first stage, tokenization.
22
00:00:57,915 --> 00:01:00,360
The tokenization process has several steps.
23
00:01:00,360 --> 00:01:04,950
First, the text is split into small chunks called tokens.
24
00:01:04,950 --> 00:01:08,550
They can be words, parts of words or punctuation symbols.
25
00:01:08,550 --> 00:01:11,580
Then the tokenizer will add some special tokens,
26
00:01:11,580 --> 00:01:13,500
if the model expects them.
27
00:01:13,500 --> 00:01:16,860
Here the model expects a CLS token at the beginning
28
00:01:16,860 --> 00:01:19,743
and a SEP token at the end of the sentence to classify.
29
00:01:20,580 --> 00:01:24,180
Lastly, the tokenizer matches each token to its unique ID
30
00:01:24,180 --> 00:01:27,000
in the vocabulary of the pretrained model.
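
A minimal sketch of these three tokenization steps, assuming the transformers library is installed and using the AutoTokenizer API that the video introduces next; the input sentence is a hypothetical stand-in for the ones shown on screen:

    from transformers import AutoTokenizer

    # Checkpoint named later in the video.
    checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    raw_text = "This course is amazing!"  # hypothetical example input

    # Step 1: split the text into small chunks (words, subwords, punctuation).
    tokens = tokenizer.tokenize(raw_text)
    # Step 3: match each token to its unique ID in the model's vocabulary.
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # Calling the tokenizer directly also performs step 2, adding the special
    # CLS and SEP tokens the model expects.
    inputs = tokenizer(raw_text)
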
31
00:01:27,000 --> 00:01:28,680
To load such a tokenizer,
32
00:01:28,680 --> 00:01:31,743
the Transformers library provides the AutoTokenizer API.
33
00:01:32,730 --> 00:01:36,120
The most important method of this class is from_pretrained,
34
00:01:36,120 --> 00:01:38,910
which will download and cache the configuration
35
00:01:38,910 --> 00:01:41,853
and the vocabulary associated with a given checkpoint.
36
00:01:43,200 --> 00:01:45,360
Here the checkpoint used by default
37
00:01:45,360 --> 00:01:47,280
for the sentiment analysis pipeline
38
00:01:47,280 --> 00:01:51,986
is distilbert-base-uncased-finetuned-sst-2-english.
39
00:01:51,986 --> 00:01:53,700
(indistinct)
40
00:01:53,700 --> 00:01:56,490
We instantiate a tokenizer associated with that checkpoint,
41
00:01:56,490 --> 00:01:59,490
then feed it the two sentences.
42
00:01:59,490 --> 00:02:02,100
Since those two sentences are not of the same size,
43
00:02:02,100 --> 00:02:03,930
we will need to pad the shortest one
44
00:02:03,930 --> 00:02:06,030
to be able to build an array.
45
00:02:06,030 --> 00:02:09,840
This is done by the tokenizer with the option padding=True.
46
00:02:09,840 --> 00:02:12,810
With truncation=True, we ensure that any sentence
47
00:02:12,810 --> 00:02:15,873
longer than the maximum the model can handle is truncated.
48
00:02:17,010 --> 00:02:19,620
Lastly, the return_tensors option
49
00:02:19,620 --> 00:02:22,323
tells the tokenizer to return a PyTorch tensor.
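
Putting the whole stage together, a sketch of the tokenizer call with those options; the two sentences are hypothetical stand-ins for the ones shown on screen:

    from transformers import AutoTokenizer

    checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    raw_inputs = [
        "This course is amazing!",
        "I am not sure how I feel about this.",
    ]
    # Pad the shorter sentence, truncate anything over the model's maximum
    # length, and return PyTorch tensors.
    inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
    print(inputs.keys())  # dict_keys(['input_ids', 'attention_mask'])
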
50
00:02:23,190 --> 00:02:25,590
Looking at the result, we see we have a dictionary
51
00:02:25,590 --> 00:02:26,670
with two keys.
52
00:02:26,670 --> 00:02:29,970
Input IDs contains the IDs of both sentences,
53
00:02:29,970 --> 00:02:32,550
with zero where the padding is applied.
54
00:02:32,550 --> 00:02:34,260
The second key, attention mask,
55
00:02:34,260 --> 00:02:36,150
indicates where padding has been applied,
56
00:02:36,150 --> 00:02:38,940
so the model does not pay attention to it.
57
00:02:38,940 --> 00:02:42,090
This is all that happens inside the tokenization step.
58
00:02:42,090 --> 00:02:46,289
Now, let's have a look at the second step, the model.
59
00:02:46,289 --> 00:02:47,952
As with the tokenizer,
60
00:02:47,952 --> 00:02:51,133
there is an AutoModel API with a from_pretrained method.
61
00:02:51,133 --> 00:02:53,954
It will download and cache the configuration of the model
62
00:02:53,954 --> 00:02:56,280
as well as the pretrained weights.
63
00:02:56,280 --> 00:02:58,200
However, the AutoModel API
64
00:02:58,200 --> 00:03:00,630
will only instantiate the body of the model,
65
00:03:00,630 --> 00:03:03,420
that is, the part of the model that is left
66
00:03:03,420 --> 00:03:06,090
once the pretraining head is removed.
67
00:03:06,090 --> 00:03:08,610
It will output a high-dimensional tensor
68
00:03:08,610 --> 00:03:11,220
that is a representation of the sentences passed,
69
00:03:11,220 --> 00:03:12,690
but which is not directly useful
70
00:03:12,690 --> 00:03:15,030
for our classification problem.
71
00:03:15,030 --> 00:03:19,230
Here the tensor has two sentences, each of 16 tokens,
72
00:03:19,230 --> 00:03:23,433
and the last dimension is the hidden size of our model, 768.
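
Continuing the tokenizer sketch above (which defined checkpoint and inputs), this is roughly how the bare model is loaded and what its output looks like; the exact sequence length depends on the sentences used:

    from transformers import AutoModel

    model = AutoModel.from_pretrained(checkpoint)
    outputs = model(**inputs)
    # Batch size x sequence length x hidden size,
    # e.g. torch.Size([2, 16, 768]) for the video's example.
    print(outputs.last_hidden_state.shape)
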
73
00:03:24,900 --> 00:03:27,510
To get an output linked to our classification problem,
74
00:03:27,510 --> 00:03:31,170
we need to use the AutoModelForSequenceClassification class.
75
00:03:31,170 --> 00:03:33,330
It works exactly like the AutoModel class,
76
00:03:33,330 --> 00:03:35,130
except that it will build a model
77
00:03:35,130 --> 00:03:36,543
with a classification head.
78
00:03:37,483 --> 00:03:39,560
There is one auto class for each common NLP task
79
00:03:39,560 --> 00:03:40,960
in the Transformers library.
80
00:03:42,150 --> 00:03:45,570
Here, after giving our model the two sentences,
81
00:03:45,570 --> 00:03:47,820
we get a tensor of size two by two,
82
00:03:47,820 --> 00:03:50,943
one result for each sentence and for each possible label.
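
The same step with the classification head included, continuing from the tokenizer sketch above:

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    outputs = model(**inputs)
    # One row per sentence, one column per label.
    print(outputs.logits.shape)  # torch.Size([2, 2])
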
83
00:03:51,840 --> 00:03:53,970
Those outputs are not probabilities yet;
84
00:03:53,970 --> 00:03:56,100
we can see they don't sum to 1.
85
00:03:56,100 --> 00:03:57,270
This is because each model
86
00:03:57,270 --> 00:04:00,810
of the Transformers library returns logits.
87
00:04:00,810 --> 00:04:02,250
To make sense of those logits,
88
00:04:02,250 --> 00:04:05,910
we need to dig into the third and last step of the pipeline:
89
00:04:05,910 --> 00:04:10,620
post-processing. To convert logits into probabilities,
90
00:04:10,620 --> 00:04:13,470
we need to apply a SoftMax layer to them.
91
00:04:13,470 --> 00:04:14,610
As we can see,
92
00:04:14,610 --> 00:04:17,267
this transforms them into positive numbers
93
00:04:17,267 --> 00:04:18,663
that sum up to one.
94
00:04:18,663 --> 00:04:21,360
The last step is to know which of those corresponds
95
00:04:21,360 --> 00:04:23,580
to the positive or the negative label.
96
00:04:23,580 --> 00:04:28,020
This is given by the id2label field of the model config.
97
00:04:28,020 --> 00:04:30,390
The first probabilities, at index zero,
98
00:04:30,390 --> 00:04:32,250
correspond to the negative label,
99
00:04:32,250 --> 00:04:34,140
and the second ones, at index one,
100
00:04:34,140 --> 00:04:36,480
correspond to the positive label.
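
A sketch of this post-processing step, continuing from the classification outputs above:

    import torch

    # Turn the logits into probabilities that sum to one for each sentence.
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # Map each index to its label name.
    print(model.config.id2label)  # {0: 'NEGATIVE', 1: 'POSITIVE'}
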
101
00:04:36,480 --> 00:04:37,950
This is how our classifier built
102
00:04:37,950 --> 00:04:40,230
with the pipeline function picked those labels
103
00:04:40,230 --> 00:04:42,240
and computed those scores.
104
00:04:42,240 --> 00:04:44,220
Now that you know how each step works,
105
00:04:44,220 --> 00:04:46,220
you can easily tweak them to your needs.
106
00:04:47,524 --> 00:04:50,274
(logo whooshing)