-
Notifications
You must be signed in to change notification settings - Fork 776
/
Copy path45_inside-the-token-classification-pipeline-(pytorch).srt
385 lines (308 loc) · 8.43 KB
/
45_inside-the-token-classification-pipeline-(pytorch).srt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
1
00:00:00,076 --> 00:00:01,462
(标题嘶嘶作响)
(title whooshes)
2
00:00:01,462 --> 00:00:02,382
(标志弹出)
(logo pops)
3
00:00:02,382 --> 00:00:05,340
(标题嘶嘶作响)
(title whooshes)
4
00:00:05,340 --> 00:00:06,210
- 我们来看一下
- Let's have a look
5
00:00:06,210 --> 00:00:08,283
在 token 分类管线( pipeline )内。
inside the token classification pipeline.
6
00:00:10,080 --> 00:00:11,580
在有关管线视频中,
In the pipeline video,
7
00:00:11,580 --> 00:00:13,320
我们学习了不同的应用
we looked at the different applications
8
00:00:13,320 --> 00:00:15,960
transformers 库支持开箱即用的,
the Transformers library supports out of the box,
9
00:00:15,960 --> 00:00:18,780
其中之一是 token 分类,
one of them being token classification,
10
00:00:18,780 --> 00:00:21,810
例如预测句子中的每个单词
for instance predicting for each word in a sentence
11
00:00:21,810 --> 00:00:24,510
是否对应于一个人,一个组织
whether they correspond to a person, an organization
12
00:00:24,510 --> 00:00:25,353
或一个位置。
or a location.
13
00:00:26,670 --> 00:00:28,920
我们甚至可以将相应的 token 组合在一起
We can even group together the tokens corresponding
14
00:00:28,920 --> 00:00:32,040
到同一个实体,例如所有 token
to the same entity, for instance all the tokens
15
00:00:32,040 --> 00:00:35,373
在这里形成了 Sylvain 这个词,或 Hugging 和 Face。
that formed the word Sylvain here, or Hugging and Face.
16
00:00:37,290 --> 00:00:40,230
token 分类管线的工作方式相同
The token classification pipeline works the same way
17
00:00:40,230 --> 00:00:42,630
和我们研究的文本分类的管线
as the text classification pipeline we studied
18
00:00:42,630 --> 00:00:44,430
在上一个视频中。
in the previous video.
19
00:00:44,430 --> 00:00:45,930
分为三个步骤。
There are three steps.
20
00:00:45,930 --> 00:00:49,623
分词化、模型和后处理。
The tokenization, the model, and the postprocessing.
21
00:00:50,940 --> 00:00:52,530
前两个步骤相同
The first two steps are identical
22
00:00:52,530 --> 00:00:54,630
到文本分类管线,
to the text classification pipeline,
23
00:00:54,630 --> 00:00:57,300
除了我们使用 auto (自动) 的 token 分类模型
except we use an auto token classification model
24
00:00:57,300 --> 00:01:00,150
而不是序列分类。
instead of a sequence classification one.
25
00:01:00,150 --> 00:01:03,720
我们分词化我们的文本,然后将其提供给模型。
We tokenize our text then feed it to the model.
26
00:01:03,720 --> 00:01:05,877
而不是为每个可能的标签获取一个数字
Instead of getting one number for each possible label
27
00:01:05,877 --> 00:01:08,700
对于整个句子,我们得到一个数字
for the whole sentence, we get one number
28
00:01:08,700 --> 00:01:10,770
对于可能的九个标签中的每一个
for each of the possible nine labels
29
00:01:10,770 --> 00:01:13,983
对于句子中的每个 token ,此处为 19。
for every token in the sentence, here 19.
30
00:01:15,300 --> 00:01:18,090
与 transformers 库的所有其他模型一样,
Like all the other models of the transformers library,
31
00:01:18,090 --> 00:01:19,830
我们的模型输出 logits,
our model outputs logits,
32
00:01:19,830 --> 00:01:23,073
我们使用 SoftMax 将其转化为预测值。
which we turn into predictions by using a SoftMax.
33
00:01:23,940 --> 00:01:26,190
我们还获得了每个 token 的预测标签
We also get the predicted label for each token
34
00:01:26,190 --> 00:01:27,990
通过最大预测,
by taking the maximum prediction,
35
00:01:27,990 --> 00:01:29,880
因为 SoftMax 函数保留了顺序,
since the SoftMax function preserves the orders,
36
00:01:29,880 --> 00:01:31,200
我们本可以在 logits 上完成
we could have done it on the logits
37
00:01:31,200 --> 00:01:33,050
如果我们不需要预测。
if we had no need of the predictions.
38
00:01:33,930 --> 00:01:35,880
模型配置包含标签映射
The model config contains the label mapping
39
00:01:35,880 --> 00:01:37,740
在其 id2label 字段中。
in its id2label field.
40
00:01:37,740 --> 00:01:41,430
使用它,我们可以将每个 token 映射到其相应的标签。
Using it, we can map every token to its corresponding label.
41
00:01:41,430 --> 00:01:43,950
标签 O 不对应任何实体,
The label, O, correspond to no entity,
42
00:01:43,950 --> 00:01:45,985
这就是为什么我们没有在结果中看到它
which is why we didn't see it in our results
43
00:01:45,985 --> 00:01:47,547
在第一张幻灯片中。
in the first slide.
44
00:01:47,547 --> 00:01:49,440
在标签和概率之上,
On top of the label and the probability,
45
00:01:49,440 --> 00:01:51,000
这些结果包括开始
those results included the start
46
00:01:51,000 --> 00:01:53,103
和句末字符。
and end character in the sentence.
47
00:01:54,120 --> 00:01:55,380
我们需要使用偏移映射
We'll need to use the offset mapping
48
00:01:55,380 --> 00:01:56,640
分词器得到那些。
of the tokenizer to get those.
49
00:01:56,640 --> 00:01:58,050
看看下面链接的视频
Look at the video linked below
50
00:01:58,050 --> 00:02:00,300
如果你还不知道它们。
if you don't know about them already.
51
00:02:00,300 --> 00:02:02,280
然后,遍历每个 token
Then, looping through each token
52
00:02:02,280 --> 00:02:04,080
具有不同于 O 的标签,
that has a label distinct from O,
53
00:02:04,080 --> 00:02:06,120
我们可以建立我们得到的结果列表
we can build the list of results we got
54
00:02:06,120 --> 00:02:07,320
用我们的第一条管线。
with our first pipeline.
55
00:02:08,460 --> 00:02:10,560
最后一步是将 token 组合在一起
The last step is to group together tokens
56
00:02:10,560 --> 00:02:12,310
对应于同一个实体。
that correspond to the same entity.
57
00:02:13,264 --> 00:02:16,140
这就是为什么我们为每种类型的实体设置了两个标签,
This is why we had two labels for each type of entity,
58
00:02:16,140 --> 00:02:18,450
例如,I-PER 和 B-PER。
I-PER and B-PER, for instance.
59
00:02:18,450 --> 00:02:20,100
它让我们知道一个 token 是否是
It allows us to know if a token is
60
00:02:20,100 --> 00:02:22,323
在与前一个相同的实体中。
in the same entity as the previous one.
61
00:02:23,310 --> 00:02:25,350
请注意,有两种标记方式
Note, that there are two ways of labeling used
62
00:02:25,350 --> 00:02:26,850
用于 token 分类。
for token classification.
63
00:02:26,850 --> 00:02:29,420
一个,这里是粉红色的,使用 B-PER 标签
One, in pink here, uses the B-PER label
64
00:02:29,420 --> 00:02:30,810
在每个新实体的开始,
at the beginning of each new entity,
65
00:02:30,810 --> 00:02:32,760
但另一个,蓝色的,
but the other, in blue,
66
00:02:32,760 --> 00:02:35,340
只用它来分隔两个相邻的实体
only uses it to separate two adjacent entities
67
00:02:35,340 --> 00:02:37,140
同类型的。
of the same type.
68
00:02:37,140 --> 00:02:39,690
在这两种情况下,我们都可以标记一个新实体
In both cases, we can flag a new entity
69
00:02:39,690 --> 00:02:41,940
每次我们看到一个新标签出现时,
each time we see a new label appearing,
70
00:02:41,940 --> 00:02:44,730
带有 I 或 B 前缀,
either with the I or B prefix,
71
00:02:44,730 --> 00:02:47,130
然后将以下所有标记为相同的标记,
then take all the following tokens labeled the same,
72
00:02:47,130 --> 00:02:48,870
带有 I 标志。
with an I-flag.
73
00:02:48,870 --> 00:02:51,330
这与偏移映射一起开始
This, coupled with the offset mapping to get the start
74
00:02:51,330 --> 00:02:54,210
和结束字符允许我们获得文本的跨度
and end characters allows us to get the span of texts
75
00:02:54,210 --> 00:02:55,233
对于每个实体。
for each entity.
76
00:02:56,569 --> 00:02:59,532
(标题嘶嘶作响)
(title whooshes)
77
00:02:59,532 --> 00:03:01,134
(标题失败)
(title fizzles)