52_wordpiece-tokenization.srt
1
00:00:00,151 --> 00:00:02,818
(air whooshing)

2
00:00:05,520 --> 00:00:08,370
- Let's see together what the training strategy

3
00:00:08,370 --> 00:00:11,851
of the WordPiece algorithm is, and how it performs

4
00:00:11,851 --> 00:00:15,150
the tokenization of a text, once trained.

5
00:00:19,351 --> 00:00:23,580
WordPiece is a tokenization algorithm introduced by Google.

6
00:00:23,580 --> 00:00:25,653
It is used, for example, by BERT.

7
00:00:26,640 --> 00:00:28,020
To our knowledge,

8
00:00:28,020 --> 00:00:31,590
the code of WordPiece has not been open-sourced.

9
00:00:31,590 --> 00:00:33,510
So we base our explanations

10
00:00:33,510 --> 00:00:36,903
on our own interpretation of the published literature.

11
00:00:42,090 --> 00:00:44,883
So, what is the training strategy of WordPiece?

12
00:00:46,200 --> 00:00:48,663
Like the BPE algorithm,

13
00:00:48,663 --> 00:00:52,380
WordPiece starts by establishing an initial vocabulary

14
00:00:52,380 --> 00:00:54,660
composed of elementary units,

15
00:00:54,660 --> 00:00:58,773
and then increases this vocabulary to the desired size.

16
00:00:59,970 --> 00:01:01,950
To build the initial vocabulary,

17
00:01:01,950 --> 00:01:04,920
we divide each word in the training corpus

18
00:01:04,920 --> 00:01:07,443
into the sequence of letters that make it up.

19
00:01:08,430 --> 00:01:11,820
As you can see, there is a small subtlety.

20
00:01:11,820 --> 00:01:14,190
We add two hash signs (##) in front of the letters

21
00:01:14,190 --> 00:01:16,083
that do not start a word.

22
00:01:17,190 --> 00:01:20,430
By keeping only one occurrence per elementary unit,

23
00:01:20,430 --> 00:01:23,313
we now have our initial vocabulary.
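To make the ## convention concrete, here is a minimal Python sketch of building such an initial vocabulary. Google's actual code is not public, so this is only our own illustration; the toy corpus and its counts are invented for the example.

# Build the initial vocabulary from a toy word-frequency dictionary.
# The first letter of a word stays as-is; inner letters get a "##" prefix.
word_freqs = {"hug": 4, "pin": 2, "pig": 3}  # illustrative counts

alphabet = set()
for word in word_freqs:
    alphabet.add(word[0])                        # letter that starts the word
    alphabet.update("##" + c for c in word[1:])  # letters inside the word

print(sorted(alphabet))
# ['##g', '##i', '##n', '##u', 'h', 'p']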
24
00:01:26,580 --> 00:01:29,823
We will list all the existing pairs in our corpus.

25
00:01:30,990 --> 00:01:32,640
Once we have this list,

26
00:01:32,640 --> 00:01:35,253
we will calculate a score for each of these pairs.

27
00:01:36,630 --> 00:01:38,400
As with the BPE algorithm,

28
00:01:38,400 --> 00:01:40,750
we will select the pair with the highest score.

29
00:01:43,260 --> 00:01:44,340
Take, for example,

30
00:01:44,340 --> 00:01:47,343
the first pair, composed of the letters H and U.

31
00:01:48,510 --> 00:01:51,390
The score of a pair is simply equal to the frequency

32
00:01:51,390 --> 00:01:54,510
of appearance of the pair, divided by the product

33
00:01:54,510 --> 00:01:57,330
of the frequency of appearance of the first token

34
00:01:57,330 --> 00:02:00,063
and the frequency of appearance of the second token.

35
00:02:01,260 --> 00:02:05,550
Thus, at a fixed frequency of appearance of the pair,

36
00:02:05,550 --> 00:02:09,913
if the subparts of the pair are very frequent in the corpus,

37
00:02:09,913 --> 00:02:11,823
then this score will be decreased.

38
00:02:13,140 --> 00:02:17,460
In our example, the pair HU appears four times,

39
00:02:17,460 --> 00:02:22,460
the letter H four times, and the letter U four times.

40
00:02:24,030 --> 00:02:26,733
This gives us a score of 0.25.
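As a sanity check on that arithmetic, here is a small Python sketch of the score computation, following our interpretation above; the toy corpus is chosen so that the pair ("h", "##u") reproduces the 4 / (4 * 4) = 0.25 of the example.

from collections import defaultdict

# Toy split corpus: each word as its current list of tokens, plus frequencies.
word_freqs = {"hug": 4, "pin": 2, "pig": 3}
splits = {"hug": ["h", "##u", "##g"],
          "pin": ["p", "##i", "##n"],
          "pig": ["p", "##i", "##g"]}

pair_freqs, token_freqs = defaultdict(int), defaultdict(int)
for word, tokens in splits.items():
    for token in tokens:
        token_freqs[token] += word_freqs[word]
    for pair in zip(tokens, tokens[1:]):
        pair_freqs[pair] += word_freqs[word]

def score(pair):
    # freq(pair) / (freq(first token) * freq(second token))
    return pair_freqs[pair] / (token_freqs[pair[0]] * token_freqs[pair[1]])

print(score(("h", "##u")))  # 4 / (4 * 4) = 0.25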
41
00:02:28,410 --> 00:02:30,960
Now that we know how to calculate this score,

42
00:02:30,960 --> 00:02:33,360
we can do it for all pairs.

43
00:02:33,360 --> 00:02:35,217
We can now add to the vocabulary

44
00:02:35,217 --> 00:02:38,973
the pair with the highest score, after merging it of course.

45
00:02:40,140 --> 00:02:43,863
And now we can apply this same merge to our split corpus.

46
00:02:45,780 --> 00:02:47,490
As you can imagine,

47
00:02:47,490 --> 00:02:50,130
we just have to repeat the same operations

48
00:02:50,130 --> 00:02:53,013
until we have the vocabulary at the desired size.
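Continuing the two sketches above (and reusing their word_freqs, splits, pair_freqs, alphabet, and score), one training step could look like this; merge_pair is our own illustrative helper, not a documented API.

def merge_pair(first, second, splits):
    # Replace every (first, second) adjacency by the merged token.
    for word, tokens in splits.items():
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == first and tokens[i + 1] == second:
                merged = first + second.lstrip("#")  # drop the inner "##"
                tokens = tokens[:i] + [merged] + tokens[i + 2:]
            else:
                i += 1
        splits[word] = tokens
    return splits

vocab = sorted(alphabet)                       # initial vocabulary from earlier
best = max(pair_freqs, key=score)              # ('h', '##u'), score 0.25
vocab.append(best[0] + best[1].lstrip("#"))    # add the merged token "hu"
splits = merge_pair(best[0], best[1], splits)
# Repeat (recount the pairs, rescore, merge) until len(vocab) reaches the target.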
49
00:02:54,000 --> 00:02:55,800
Let's look at a few more steps

50
00:02:55,800 --> 00:02:58,113
to see the evolution of our vocabulary,

51
00:02:58,957 --> 00:03:01,773
and also the evolution of the length of the splits.

52
00:03:06,390 --> 00:03:09,180
And now that we are happy with our vocabulary,

53
00:03:09,180 --> 00:03:12,663
you are probably wondering how to use it to tokenize a text.

54
00:03:13,830 --> 00:03:17,640
Let's say we want to tokenize the word "huggingface".

55
00:03:17,640 --> 00:03:20,310
WordPiece follows these rules:

56
00:03:20,310 --> 00:03:22,530
we will look for the longest possible token

57
00:03:22,530 --> 00:03:24,960
at the beginning of the word.

58
00:03:24,960 --> 00:03:28,920
Then we start again on the remaining part of our word,

59
00:03:28,920 --> 00:03:31,143
and so on until we reach the end.

60
00:03:32,100 --> 00:03:35,973
And that's it: "huggingface" is divided into four sub-tokens.
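Here is a minimal Python sketch of that longest-match rule. The vocabulary below is assumed for illustration (not necessarily the one trained in the video), and, as in BERT's tokenizer, a word with no known prefix falls back to an [UNK] token.

def wordpiece_tokenize(word, vocab):
    # Greedy longest-prefix matching, restarting on the remainder.
    tokens = []
    while word:
        end = len(word)
        while end > 0 and word[:end] not in vocab:
            end -= 1                  # shrink until a known prefix is found
        if end == 0:
            return ["[UNK]"]          # no known prefix at all
        tokens.append(word[:end])
        word = word[end:]
        if word:
            word = "##" + word        # the rest no longer starts the word
    return tokens

vocab = {"hu", "##gging", "##fac", "##e"}  # assumed, for illustration
print(wordpiece_tokenize("huggingface", vocab))
# ['hu', '##gging', '##fac', '##e']  (four sub-tokens)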
61
00:03:37,200 --> 00:03:39,180
This video is coming to an end.

62
00:03:39,180 --> 00:03:41,370
I hope it helped you to better understand

63
00:03:41,370 --> 00:03:43,653
what is behind the word WordPiece.

64
00:03:45,114 --> 00:03:47,864
(air whooshing)