subtitles/zh-CN/45_inside-the-token-classification-pipeline-(pytorch).srt

1
00:00:00,076 --> 00:00:01,462
（标题嘶嘶作响）
(title whooshes)

2
00:00:01,462 --> 00:00:02,382
（标志弹出）
(logo pops)

3
00:00:02,382 --> 00:00:05,340
（标题嘶嘶作响）
(title whooshes)

4
00:00:05,340 --> 00:00:06,210
- 我们来看一下
- Let's have a look

5
00:00:06,210 --> 00:00:08,283
在 token 分类管线( pipeline )内。
inside the token classification pipeline.

6
00:00:10,080 --> 00:00:11,580
在有关管线视频中，
In the pipeline video,

7
00:00:11,580 --> 00:00:13,320
我们学习了不同的应用
we looked at the different applications

8
00:00:13,320 --> 00:00:15,960
transformers 库支持开箱即用的，
the Transformers library supports out of the box,

9
00:00:15,960 --> 00:00:18,780
其中之一是 token 分类，
one of them being token classification,

10
00:00:18,780 --> 00:00:21,810
例如预测句子中的每个单词
for instance predicting for each word in a sentence

11
00:00:21,810 --> 00:00:24,510
是否对应于一个人，一个组织
whether it corresponds to a person, an organization

12
00:00:24,510 --> 00:00:25,353
或一个位置。
or a location.

13
00:00:26,670 --> 00:00:28,920
我们甚至可以将相应的 token 组合在一起
We can even group together the tokens corresponding

14
00:00:28,920 --> 00:00:32,040
到同一个实体，例如所有 token 
to the same entity, for instance all the tokens

15
00:00:32,040 --> 00:00:35,373
在这里形成了 Sylvain 这个词，或 Hugging 和 Face。
that form the word Sylvain here, or Hugging and Face.

16
00:00:37,290 --> 00:00:40,230
token 分类管线的工作方式相同
The token classification pipeline works the same way

17
00:00:40,230 --> 00:00:42,630
和我们研究的文本分类的管线
as the text classification pipeline we studied

18
00:00:42,630 --> 00:00:44,430
在上一个视频中。
in the previous video.

19
00:00:44,430 --> 00:00:45,930
分为三个步骤。
There are three steps.

20
00:00:45,930 --> 00:00:49,623
分词化、模型和后处理。
Tokenization, the model, and postprocessing.

21
00:00:50,940 --> 00:00:52,530
前两个步骤相同
The first two steps are identical

22
00:00:52,530 --> 00:00:54,630
到文本分类管线，
to the text classification pipeline,

23
00:00:54,630 --> 00:00:57,300
除了我们使用 auto (自动) 的 token 分类模型
except we use an auto token classification model

24
00:00:57,300 --> 00:01:00,150
而不是序列分类。
instead of a sequence classification one.

25
00:01:00,150 --> 00:01:03,720
我们分词化我们的文本，然后将其提供给模型。
We tokenize our text then feed it to the model.

26
00:01:03,720 --> 00:01:05,877
而不是为每个可能的标签获取一个数字
Instead of getting one number for each possible label

27
00:01:05,877 --> 00:01:08,700
对于整个句子，我们得到一个数字
for the whole sentence, we get one number

28
00:01:08,700 --> 00:01:10,770
对于可能的九个标签中的每一个
for each of the possible nine labels

29
00:01:10,770 --> 00:01:13,983
对于句子中的每个 token ，此处为 19。
for every token in the sentence, here 19.

30
00:01:15,300 --> 00:01:18,090
与 transformers 库的所有其他模型一样，
Like all the other models of the Transformers library,

31
00:01:18,090 --> 00:01:19,830
我们的模型输出 logits，
our model outputs logits,

32
00:01:19,830 --> 00:01:23,073
我们使用 SoftMax 将其转化为预测值。
which we turn into predictions by using a SoftMax.

33
00:01:23,940 --> 00:01:26,190
我们还获得了每个 token 的预测标签
We also get the predicted label for each token

34
00:01:26,190 --> 00:01:27,990
通过最大预测，
by taking the maximum prediction,

35
00:01:27,990 --> 00:01:29,880
因为 SoftMax 函数保留了顺序，
since the SoftMax function preserves the order,

36
00:01:29,880 --> 00:01:31,200
我们本可以在 logits 上完成
we could have done it on the logits

37
00:01:31,200 --> 00:01:33,050
如果我们不需要预测。
if we had no need of the predictions.
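
The point about SoftMax preserving the order can be checked with a minimal sketch. The logits below are made up for two tokens, and the `id2label` dictionary is a hypothetical nine-label mapping standing in for the one stored in the model config; nothing here calls the actual model.

```python
import math

# Hypothetical nine-label mapping; the real one is read from the
# model config's id2label field.
id2label = {0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
            5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}

# Made-up logits: one row per token, one score per label.
logits = [
    [6.2, -1.0, -0.5, -1.2, 0.3, -0.8, 0.1, -1.5, -0.9],
    [-0.7, -1.1, -0.4, 0.2, 5.8, -0.9, 0.6, -1.3, -0.2],
]

def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=lambda i: row[i])

probabilities = [softmax(row) for row in logits]

# Taking the argmax before or after the SoftMax gives the same labels.
labels_from_probs = [id2label[argmax(row)] for row in probabilities]
labels_from_logits = [id2label[argmax(row)] for row in logits]
```

The SoftMax is only needed when the probability itself is reported as a score; the label can be read directly from the logits.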

38
00:01:33,930 --> 00:01:35,880
模型配置包含标签映射
The model config contains the label mapping

39
00:01:35,880 --> 00:01:37,740
在其 id2label 字段中。
in its id2label field.

40
00:01:37,740 --> 00:01:41,430
使用它，我们可以将每个 token 映射到其相应的标签。
Using it, we can map every token to its corresponding label.

41
00:01:41,430 --> 00:01:43,950
标签 O 不对应任何实体，
The label O corresponds to no entity,

42
00:01:43,950 --> 00:01:45,985
这就是为什么我们没有在结果中看到它
which is why we didn't see it in our results

43
00:01:45,985 --> 00:01:47,547
在第一张幻灯片中。
in the first slide.

44
00:01:47,547 --> 00:01:49,440
在标签和概率之上，
On top of the label and the probability,

45
00:01:49,440 --> 00:01:51,000
这些结果包括开始
those results included the start

46
00:01:51,000 --> 00:01:53,103
和结束字符（在句子中的位置）。
and end character in the sentence.

47
00:01:54,120 --> 00:01:55,380
我们需要使用偏移映射
We'll need to use the offset mapping

48
00:01:55,380 --> 00:01:56,640
分词器得到那些。
of the tokenizer to get those.

49
00:01:56,640 --> 00:01:58,050
看看下面链接的视频
Look at the video linked below

50
00:01:58,050 --> 00:02:00,300
如果你还不知道它们。
if you don't know about them already.

51
00:02:00,300 --> 00:02:02,280
然后，遍历每个 token 
Then, looping through each token

52
00:02:02,280 --> 00:02:04,080
具有不同于 O 的标签，
that has a label distinct from O,

53
00:02:04,080 --> 00:02:06,120
我们可以建立我们得到的结果列表
we can build the list of results we got

54
00:02:06,120 --> 00:02:07,320
用我们的第一条管线。
with our first pipeline.

55
00:02:08,460 --> 00:02:10,560
最后一步是将 token 组合在一起
The last step is to group together tokens

56
00:02:10,560 --> 00:02:12,310
对应于同一个实体。
that correspond to the same entity.

57
00:02:13,264 --> 00:02:16,140
这就是为什么我们为每种类型的实体设置了两个标签，
This is why we had two labels for each type of entity,

58
00:02:16,140 --> 00:02:18,450
例如，I-PER 和 B-PER。
I-PER and B-PER, for instance.

59
00:02:18,450 --> 00:02:20,100
它让我们知道一个 token 是否是
It allows us to know if a token is

60
00:02:20,100 --> 00:02:22,323
在与前一个相同的实体中。
in the same entity as the previous one.

61
00:02:23,310 --> 00:02:25,350
请注意，有两种标记方式
Note that there are two ways of labeling used

62
00:02:25,350 --> 00:02:26,850
用于 token 分类。
for token classification.

63
00:02:26,850 --> 00:02:29,420
一个，这里是粉红色的，使用 B-PER 标签
One, in pink here, uses the B-PER label

64
00:02:29,420 --> 00:02:30,810
在每个新实体的开始，
at the beginning of each new entity,

65
00:02:30,810 --> 00:02:32,760
但另一个，蓝色的，
but the other, in blue,

66
00:02:32,760 --> 00:02:35,340
只用它来分隔两个相邻的实体
only uses it to separate two adjacent entities

67
00:02:35,340 --> 00:02:37,140
同类型的。
of the same type.

68
00:02:37,140 --> 00:02:39,690
在这两种情况下，我们都可以标记一个新实体
In both cases, we can flag a new entity

69
00:02:39,690 --> 00:02:41,940
每次我们看到一个新标签出现时，
each time we see a new label appearing,

70
00:02:41,940 --> 00:02:44,730
带有 I 或 B 前缀，
either with the I or B prefix,

71
00:02:44,730 --> 00:02:47,130
然后将以下所有标记为相同的标记，
then take all the following tokens labeled the same,

72
00:02:47,130 --> 00:02:48,870
带有 I 标志。
with an I- prefix.

73
00:02:48,870 --> 00:02:51,330
这与偏移映射一起开始
This, coupled with the offset mapping to get the start

74
00:02:51,330 --> 00:02:54,210
和结束字符允许我们获得文本的跨度
and end characters, allows us to get the span of text

75
00:02:54,210 --> 00:02:55,233
对于每个实体。
for each entity.
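
The grouping step described above can be sketched in plain Python. The labels and character offsets below are written by hand for illustration (they are not real model or tokenizer output), and `group_entities` is a hypothetical helper name, not a function of the Transformers library.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Hand-written per-token labels and (start, end) character offsets,
# assumed to come from the model and the tokenizer's offset mapping.
labels = ["O", "O", "O", "I-PER", "I-PER", "I-PER", "I-PER", "O", "O",
          "O", "O", "I-ORG", "I-ORG", "I-ORG", "O", "I-LOC", "O"]
offsets = [(0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16),
           (16, 18), (19, 22), (23, 24), (25, 29), (30, 32), (33, 35),
           (35, 40), (41, 45), (46, 48), (49, 57), (57, 58)]

def group_entities(text, labels, offsets):
    """Merge consecutive tokens of one entity into a single text span."""
    entities = []
    i = 0
    while i < len(labels):
        if labels[i] == "O":  # not part of any entity
            i += 1
            continue
        # A new label (B- or I- prefix) starts a new entity.
        entity_type = labels[i][2:]
        start, end = offsets[i]
        i += 1
        # Absorb the following tokens carrying the matching I- label.
        while i < len(labels) and labels[i] == f"I-{entity_type}":
            end = offsets[i][1]
            i += 1
        entities.append({"entity_group": entity_type,
                         "word": text[start:end],
                         "start": start, "end": end})
    return entities

entities = group_entities(text, labels, offsets)
```

Because a B- label always opens a new entity and only I- labels extend it, the same loop handles both labeling schemes mentioned above.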

76
00:02:56,569 --> 00:02:59,532
（标题嘶嘶作响）
(title whooshes)

77
00:02:59,532 --> 00:03:01,134
（标题淡出）
(title fizzles)