52_wordpiece-tokenization.srt
1
00:00:00,151 --> 00:00:02,818
(air whooshing)

2
00:00:05,520 --> 00:00:08,370
- Let's see together what the training strategy

3
00:00:08,370 --> 00:00:11,851
of the WordPiece algorithm is, and how it performs

4
00:00:11,851 --> 00:00:15,150
the tokenization of a text, once trained.

5
00:00:19,351 --> 00:00:23,580
WordPiece is a tokenization algorithm introduced by Google.

6
00:00:23,580 --> 00:00:25,653
It is used, for example, by BERT.

7
00:00:26,640 --> 00:00:28,020
To our knowledge,

8
00:00:28,020 --> 00:00:31,590
the code of WordPiece has not been open-sourced.

9
00:00:31,590 --> 00:00:33,510
So we base our explanations

10
00:00:33,510 --> 00:00:36,903
on our own interpretation of the published literature.

11
00:00:42,090 --> 00:00:44,883
So, what is the training strategy of WordPiece?

12
00:00:46,200 --> 00:00:48,663
Like the BPE algorithm,

13
00:00:48,663 --> 00:00:52,380
WordPiece starts by establishing an initial vocabulary

14
00:00:52,380 --> 00:00:54,660
composed of elementary units,

15
00:00:54,660 --> 00:00:58,773
and then increases this vocabulary to the desired size.

16
00:00:59,970 --> 00:01:01,950
To build the initial vocabulary,

17
00:01:01,950 --> 00:01:04,920
we divide each word in the training corpus

18
00:01:04,920 --> 00:01:07,443
into the sequence of letters that make it up.

19
00:01:08,430 --> 00:01:11,820
As you can see, there is a small subtlety.

20
00:01:11,820 --> 00:01:14,190
We add two hash signs (##) in front of the letters

21
00:01:14,190 --> 00:01:16,083
that do not start a word.

22
00:01:17,190 --> 00:01:20,430
By keeping only one occurrence per elementary unit,

23
00:01:20,430 --> 00:01:23,313
we now have our initial vocabulary.
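To make the ## convention concrete, here is a minimal Python sketch of building such an initial vocabulary. Google's actual code is not public, so this is only our own illustration; the toy corpus and its counts are invented for the example.

# Build the initial vocabulary from a toy word-frequency dictionary.
# The first letter of a word stays as-is; inner letters get a "##" prefix.
word_freqs = {"hug": 4, "pin": 2, "pig": 3}  # illustrative counts

alphabet = set()
for word in word_freqs:
    alphabet.add(word[0])                        # letter that starts the word
    alphabet.update("##" + c for c in word[1:])  # letters inside the word

print(sorted(alphabet))
# ['##g', '##i', '##n', '##u', 'h', 'p']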
24
00:01:26,580 --> 00:01:29,823
We will list all the existing pairs in our corpus.

25
00:01:30,990 --> 00:01:32,640
Once we have this list,

26
00:01:32,640 --> 00:01:35,253
we will calculate a score for each of these pairs.

27
00:01:36,630 --> 00:01:38,400
As with the BPE algorithm,

28
00:01:38,400 --> 00:01:40,750
we will select the pair with the highest score.

29
00:01:43,260 --> 00:01:44,340
Take, for example,

30
00:01:44,340 --> 00:01:47,343
the first pair, composed of the letters H and U.

31
00:01:48,510 --> 00:01:51,390
The score of a pair is simply equal to the frequency

32
00:01:51,390 --> 00:01:54,510
of appearance of the pair, divided by the product

33
00:01:54,510 --> 00:01:57,330
of the frequency of appearance of the first token

34
00:01:57,330 --> 00:02:00,063
and the frequency of appearance of the second token.

35
00:02:01,260 --> 00:02:05,550
Thus, at a fixed frequency of appearance of the pair,

36
00:02:05,550 --> 00:02:09,913
if the subparts of the pair are very frequent in the corpus,

37
00:02:09,913 --> 00:02:11,823
then this score will be decreased.

38
00:02:13,140 --> 00:02:17,460
In our example, the pair HU appears four times,

39
00:02:17,460 --> 00:02:22,460
the letter H four times, and the letter U four times.

40
00:02:24,030 --> 00:02:26,733
This gives us a score of 0.25.
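As a sanity check on that arithmetic, here is a small Python sketch of the score computation, following our interpretation above; the toy corpus is chosen so that the pair ("h", "##u") reproduces the 4 / (4 * 4) = 0.25 of the example.

from collections import defaultdict

# Toy split corpus: each word as its current list of tokens, plus frequencies.
word_freqs = {"hug": 4, "pin": 2, "pig": 3}
splits = {"hug": ["h", "##u", "##g"],
          "pin": ["p", "##i", "##n"],
          "pig": ["p", "##i", "##g"]}

pair_freqs, token_freqs = defaultdict(int), defaultdict(int)
for word, tokens in splits.items():
    for token in tokens:
        token_freqs[token] += word_freqs[word]
    for pair in zip(tokens, tokens[1:]):
        pair_freqs[pair] += word_freqs[word]

def score(pair):
    # freq(pair) / (freq(first token) * freq(second token))
    return pair_freqs[pair] / (token_freqs[pair[0]] * token_freqs[pair[1]])

print(score(("h", "##u")))  # 4 / (4 * 4) = 0.25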
41
00:02:28,410 --> 00:02:30,960
Now that we know how to calculate this score,

42
00:02:30,960 --> 00:02:33,360
we can do it for all pairs.

43
00:02:33,360 --> 00:02:35,217
We can now add to the vocabulary

44
00:02:35,217 --> 00:02:38,973
the pair with the highest score, after merging it of course.

45
00:02:40,140 --> 00:02:43,863
And now we can apply this same merge to our split corpus.

46
00:02:45,780 --> 00:02:47,490
As you can imagine,

47
00:02:47,490 --> 00:02:50,130
we just have to repeat the same operations

48
00:02:50,130 --> 00:02:53,013
until we have the vocabulary at the desired size.
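Continuing the two sketches above (and reusing their word_freqs, splits, pair_freqs, alphabet, and score), one training step could look like this; merge_pair is our own illustrative helper, not a documented API.

def merge_pair(first, second, splits):
    # Replace every (first, second) adjacency by the merged token.
    for word, tokens in splits.items():
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == first and tokens[i + 1] == second:
                merged = first + second.lstrip("#")  # drop the inner "##"
                tokens = tokens[:i] + [merged] + tokens[i + 2:]
            else:
                i += 1
        splits[word] = tokens
    return splits

vocab = sorted(alphabet)                       # initial vocabulary from earlier
best = max(pair_freqs, key=score)              # ('h', '##u'), score 0.25
vocab.append(best[0] + best[1].lstrip("#"))    # add the merged token "hu"
splits = merge_pair(best[0], best[1], splits)
# Repeat (recount the pairs, rescore, merge) until len(vocab) reaches the target.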
49
00:02:54,000 --> 00:02:55,800
Let's look at a few more steps

50
00:02:55,800 --> 00:02:58,113
to see the evolution of our vocabulary,

51
00:02:58,957 --> 00:03:01,773
and also the evolution of the length of the splits.

52
00:03:06,390 --> 00:03:09,180
And now that we are happy with our vocabulary,

53
00:03:09,180 --> 00:03:12,663
you are probably wondering how to use it to tokenize a text.

54
00:03:13,830 --> 00:03:17,640
Let's say we want to tokenize the word "huggingface".

55
00:03:17,640 --> 00:03:20,310
WordPiece follows these rules:

56
00:03:20,310 --> 00:03:22,530
we will look for the longest possible token

57
00:03:22,530 --> 00:03:24,960
at the beginning of the word.

58
00:03:24,960 --> 00:03:28,920
Then we start again on the remaining part of our word,

59
00:03:28,920 --> 00:03:31,143
and so on until we reach the end.

60
00:03:32,100 --> 00:03:35,973
And that's it: "huggingface" is divided into four sub-tokens.
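Here is a minimal Python sketch of that longest-match rule. The vocabulary below is assumed for illustration (not necessarily the one trained in the video), and, as in BERT's tokenizer, a word with no known prefix falls back to an [UNK] token.

def wordpiece_tokenize(word, vocab):
    # Greedy longest-prefix matching, restarting on the remainder.
    tokens = []
    while word:
        end = len(word)
        while end > 0 and word[:end] not in vocab:
            end -= 1                  # shrink until a known prefix is found
        if end == 0:
            return ["[UNK]"]          # no known prefix at all
        tokens.append(word[:end])
        word = word[end:]
        if word:
            word = "##" + word        # the rest no longer starts the word
    return tokens

vocab = {"hu", "##gging", "##fac", "##e"}  # assumed, for illustration
print(wordpiece_tokenize("huggingface", vocab))
# ['hu', '##gging', '##fac', '##e']  (four sub-tokens)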
61
00:03:37,200 --> 00:03:39,180
This video is coming to an end.

62
00:03:39,180 --> 00:03:41,370
I hope it helped you to better understand

63
00:03:41,370 --> 00:03:43,653
what is behind the word WordPiece.

64
00:03:45,114 --> 00:03:47,864
(air whooshing)