14_character-based-tokenizers.srt
1
00:00:00,234 --> 00:00:02,901
(page whirring)
2
00:00:04,260 --> 00:00:07,200
- Before diving into character-based tokenization,
3
00:00:07,200 --> 00:00:10,350
understanding why this kind of tokenization is interesting
4
00:00:10,350 --> 00:00:13,533
requires understanding the flaws of word-based tokenization.
5
00:00:14,640 --> 00:00:16,320
If you haven't seen the first video
6
00:00:16,320 --> 00:00:17,880
on word-based tokenization,
7
00:00:17,880 --> 00:00:21,450
we recommend you check it out before looking at this video.
8
00:00:21,450 --> 00:00:24,250
Okay, let's take a look at character-based tokenization.
9
00:00:25,650 --> 00:00:28,560
We now split our text into individual characters,
10
00:00:28,560 --> 00:00:29,673
rather than words.
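
[Aside, not part of the original subtitles: a minimal Python sketch of the difference, assuming whitespace splitting stands in for the word-based baseline.]

    text = "Let's do tokenization!"
    word_tokens = text.split()   # naive word-based split
    char_tokens = list(text)     # character-based split: one token per character
    print(word_tokens)           # ["Let's", 'do', 'tokenization!']
    print(char_tokens)           # ['L', 'e', 't', "'", 's', ' ', 'd', 'o', ...]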
11
00:00:32,850 --> 00:00:35,550
There are generally a lot of different words in languages,
12
00:00:35,550 --> 00:00:37,743
while the number of characters stays low.
13
00:00:38,610 --> 00:00:41,313
To begin, let's take a look at the English language,
14
00:00:42,210 --> 00:00:45,540
it has an estimated 170,000 different words,
15
00:00:45,540 --> 00:00:47,730
so we would need a very large vocabulary
16
00:00:47,730 --> 00:00:49,413
to encompass all words.
17
00:00:50,280 --> 00:00:52,200
With a character-based vocabulary,
18
00:00:52,200 --> 00:00:55,440
we can get by with only 256 characters,
19
00:00:55,440 --> 00:00:58,683
which includes letters, numbers and special characters.
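
[Aside, not part of the original subtitles: one way to picture such a 256-entry alphabet in Python is the set of all possible single-byte values decoded as Latin-1; this byte-level framing is an illustrative assumption, not necessarily the exact vocabulary the video has in mind.]

    # a toy 256-entry character vocabulary: every possible single-byte value
    byte_vocab = [bytes([i]).decode("latin-1") for i in range(256)]
    print(len(byte_vocab))    # 256
    print(byte_vocab[65:70])  # ['A', 'B', 'C', 'D', 'E']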
20
00:00:59,760 --> 00:01:02,190
Even languages with a lot of different characters
21
00:01:02,190 --> 00:01:04,800
like the Chinese languages can have dictionaries
22
00:01:04,800 --> 00:01:08,130
with up to 20,000 different characters
23
00:01:08,130 --> 00:01:11,523
but more than 375,000 different words.
24
00:01:12,480 --> 00:01:14,310
So character-based vocabularies
25
00:01:14,310 --> 00:01:16,293
let us use fewer different tokens
26
00:01:16,293 --> 00:01:19,050
than the word-based tokenization dictionaries
27
00:01:19,050 --> 00:01:20,523
we would otherwise use.
28
00:01:23,250 --> 00:01:25,830
These vocabularies are also more complete
29
00:01:25,830 --> 00:01:28,950
than their word-based counterparts.
30
00:01:28,950 --> 00:01:31,410
As our vocabulary contains all characters
31
00:01:31,410 --> 00:01:33,960
used in a language, even words unseen
32
00:01:33,960 --> 00:01:36,990
during the tokenizer training can still be tokenized,
33
00:01:36,990 --> 00:01:39,633
so out-of-vocabulary tokens will be less frequent.
34
00:01:40,680 --> 00:01:42,840
This includes the ability to correctly tokenize
35
00:01:42,840 --> 00:01:45,210
misspelled words, rather than discarding them
36
00:01:45,210 --> 00:01:46,623
as unknown straight away.
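
[Aside, not part of the original subtitles: a small Python sketch of why out-of-vocabulary tokens become rarer; the toy vocabularies and the "[UNK]" placeholder are made up for illustration.]

    word_vocab = {"let's", "do", "tokenization"}       # toy word-level vocabulary
    char_vocab = set("abcdefghijklmnopqrstuvwxyz'! ")  # toy character-level vocabulary

    def word_tokenize(text):
        # any word missing from the vocabulary collapses to an unknown token
        return [w if w in word_vocab else "[UNK]" for w in text.lower().split()]

    def char_tokenize(text):
        # a misspelled word still breaks into known characters
        return [c if c in char_vocab else "[UNK]" for c in text.lower()]

    print(word_tokenize("let's do tokenizaton"))  # ["let's", 'do', '[UNK]']
    print(char_tokenize("let's do tokenizaton"))  # every character is still covered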
37
00:01:48,240 --> 00:01:52,380
However, this algorithm isn't perfect either.
38
00:01:52,380 --> 00:01:54,360
Intuitively, characters do not hold
39
00:01:54,360 --> 00:01:57,990
as much information individually as a word would hold.
40
00:01:57,990 --> 00:02:00,930
For example, "Let's" holds more information
41
00:02:00,930 --> 00:02:03,570
than its first letter "l".
42
00:02:03,570 --> 00:02:05,880
Of course, this is not true for all languages,
43
00:02:05,880 --> 00:02:08,880
as some languages, like ideogram-based languages,
44
00:02:08,880 --> 00:02:11,523
have a lot of information held in single characters,
45
00:02:12,750 --> 00:02:15,360
but for others, like roman-based languages,
46
00:02:15,360 --> 00:02:17,760
the model will have to make sense of multiple tokens at a time
47
00:02:17,760 --> 00:02:20,670
to get the information otherwise held
48
00:02:20,670 --> 00:02:21,753
in a single word.
49
00:02:23,760 --> 00:02:27,000
This leads to another issue with character-based tokenizers,
50
00:02:27,000 --> 00:02:29,520
their sequences are translated into a very large amount
51
00:02:29,520 --> 00:02:31,593
of tokens to be processed by the model.
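
[Aside, not part of the original subtitles: a quick Python comparison of sequence lengths for the same sentence under the two approaches.]

    sentence = "Character-based tokenizers produce much longer sequences."
    print(len(sentence.split()))  # 6 tokens with a naive word split
    print(len(list(sentence)))    # 57 tokens with a character split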
52
00:02:33,090 --> 00:02:36,810
And this can have an impact on the size of the context
53
00:02:36,810 --> 00:02:40,020
the model will carry around, and will reduce the size
54
00:02:40,020 --> 00:02:42,030
of the text we can use as input for our model,
55
00:02:42,030 --> 00:02:43,233
which is often limited.
56
00:02:44,100 --> 00:02:46,650
This tokenization, while it has some issues,
57
00:02:46,650 --> 00:02:48,720
has seen some very good results in the past
58
00:02:48,720 --> 00:02:50,490
and so it should be considered
59
00:02:50,490 --> 00:02:52,680
when approaching a new problem, as it solves issues
60
00:02:52,680 --> 00:02:54,843
encountered in the word-based algorithm.
61
00:02:56,107 --> 00:02:58,774
(page whirring)