63_data-processing-for-causal-language-modeling.srt
1
00:00:00,000 --> 00:00:02,917
(transition music)

2
00:00:05,364 --> 00:00:08,310
- In this video, we take a look at the data processing

3
00:00:08,310 --> 00:00:10,803
necessary to train causal language models.

4
00:00:12,690 --> 00:00:14,400
Causal language modeling is the task

5
00:00:14,400 --> 00:00:17,820
of predicting the next token based on the previous ones.

6
00:00:17,820 --> 00:00:19,680
Another term for causal language modeling

7
00:00:19,680 --> 00:00:21,000
is autoregressive modeling.

8
00:00:21,000 --> 00:00:23,940
In the example that you can see here,

9
00:00:23,940 --> 00:00:25,560
the next token could, for example,

10
00:00:25,560 --> 00:00:28,263
be NLP or it could be machine learning.

11
00:00:29,460 --> 00:00:31,457
A popular example of causal language models

12
00:00:31,457 --> 00:00:33,693
is the GPT family of models.

13
00:00:35,561 --> 00:00:38,010
To train models such as GPT,

14
00:00:38,010 --> 00:00:41,460
we usually start with a large corpus of text files.

15
00:00:41,460 --> 00:00:43,890
These files can be webpages scraped from the internet,

16
00:00:43,890 --> 00:00:46,020
such as the Common Crawl dataset,

17
00:00:46,020 --> 00:00:47,940
or they can be Python files from GitHub,

18
00:00:47,940 --> 00:00:49,490
like the ones you can see here.

19
00:00:50,400 --> 00:00:52,680
As a first step, we need to tokenize these files

20
00:00:52,680 --> 00:00:55,380
such that we can feed them through the model.

21
00:00:55,380 --> 00:00:58,500
Here, we show the tokenized texts as bars of various lengths,

22
00:00:58,500 --> 00:01:02,188
illustrating that there are shorter and longer ones.

23
00:01:02,188 --> 00:01:05,910
This is very common when working with text.

24
00:01:05,910 --> 00:01:09,270
However, transformer models have a limited context window,

25
00:01:09,270 --> 00:01:10,770
and depending on the data source,

26
00:01:10,770 --> 00:01:13,140
it is possible that the tokenized texts

27
00:01:13,140 --> 00:01:15,183
are much longer than this window.

28
00:01:16,080 --> 00:01:18,870
In this case, we could just truncate the sequences

29
00:01:18,870 --> 00:01:20,182
to the context length,

30
00:01:20,182 --> 00:01:22,650
but this would mean that we lose everything

31
00:01:22,650 --> 00:01:24,513
after the first context window.

32
00:01:25,500 --> 00:01:28,410
Using the return_overflowing_tokens flag,

33
00:01:28,410 --> 00:01:30,960
we can use the tokenizer to create chunks,

34
00:01:30,960 --> 00:01:33,510
with each one being the size of the context length.

35
00:01:34,860 --> 00:01:36,180
Sometimes, it can still happen

36
00:01:36,180 --> 00:01:37,590
that the last chunk is too short

37
00:01:37,590 --> 00:01:39,900
if there aren't enough tokens to fill it.

38
00:01:39,900 --> 00:01:41,793
In this case, we can just remove it.

39
00:01:42,990 --> 00:01:45,960
With the return_length keyword,

40
00:01:45,960 --> 00:01:49,173
we also get the length of each chunk from the tokenizer.
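
A minimal sketch of what such a tokenizer call could look like, assuming a GPT-2 tokenizer and a context length of 128 (both are illustrative choices, not taken from the video):

from transformers import AutoTokenizer

# Illustrative choices: any causal LM tokenizer and context length would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128

# raw_samples is assumed to be a list of raw text strings.
outputs = tokenizer(
    raw_samples,
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,  # keep the chunks beyond the first window
    return_length=True,              # also report the length of every chunk
)
print(outputs["length"])  # e.g. [128, 128, 128, 57]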
41
00:01:51,960 --> 00:01:53,640
This function shows all the steps

42
00:01:53,640 --> 00:01:56,280
necessary to prepare the dataset.

43
00:01:56,280 --> 00:01:57,960
First, we tokenize the dataset

44
00:01:57,960 --> 00:02:00,330
with the flags I just mentioned.

45
00:02:00,330 --> 00:02:02,190
Then, we go through each chunk,

46
00:02:02,190 --> 00:02:04,680
and if its length matches the context length,

47
00:02:04,680 --> 00:02:06,663
we add it to the inputs we return.

48
00:02:07,590 --> 00:02:10,260
We can apply this function to the whole dataset.

49
00:02:10,260 --> 00:02:11,700
In addition, we make sure

50
00:02:11,700 --> 00:02:15,450
to use batches and remove the existing columns.

51
00:02:15,450 --> 00:02:17,670
We need to remove the existing columns

52
00:02:17,670 --> 00:02:21,330
because we can create multiple samples per text,

53
00:02:21,330 --> 00:02:22,890
and the shapes in the dataset

54
00:02:22,890 --> 00:02:24,753
would not match anymore in that case.
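
A sketch of how such a preparation function and the map call might look, assuming a datasets Dataset named raw_dataset with a "content" text column, plus the tokenizer and context_length from above (all names are illustrative):

def tokenize(examples):
    outputs = tokenizer(
        examples["content"],  # assumed name of the text column
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    # Keep only the chunks that exactly fill the context window.
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

# batched=True lets one text produce several samples; remove_columns drops
# the old columns whose shapes would no longer match the new ones.
tokenized_dataset = raw_dataset.map(
    tokenize, batched=True, remove_columns=raw_dataset.column_names
)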
55
00:02:26,832 --> 00:02:30,330
If the context length is similar to the length of the files,

56
00:02:30,330 --> 00:02:32,733
this approach doesn't work so well anymore.

57
00:02:33,660 --> 00:02:36,420
In this example, both samples 1 and 2

58
00:02:36,420 --> 00:02:38,400
are shorter than the context size

59
00:02:38,400 --> 00:02:41,610
and would be discarded with the previous approach.

60
00:02:41,610 --> 00:02:45,150
In this case, it is better to first tokenize each sample

61
00:02:45,150 --> 00:02:46,590
without truncation

62
00:02:46,590 --> 00:02:49,290
and then concatenate the tokenized samples

63
00:02:49,290 --> 00:02:52,353
with an end-of-string, or EOS, token in between.

64
00:02:53,546 --> 00:02:56,220
Finally, we can chunk this long sequence

65
00:02:56,220 --> 00:02:59,490
with the context length, and we don't lose too many sequences

66
00:02:59,490 --> 00:03:01,263
because they're too short anymore.
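
One possible sketch of this concatenate-then-chunk variant, reusing the assumed tokenizer, context_length, and "content" column from above; it can be applied with the same map call as before:

def concat_and_chunk(examples):
    # Tokenize without truncation, then join everything into one long stream
    # with an EOS token between samples.
    all_ids = []
    for ids in tokenizer(examples["content"])["input_ids"]:
        all_ids.extend(ids + [tokenizer.eos_token_id])
    # Slice the stream into context-length chunks, dropping a short remainder.
    chunks = [
        all_ids[i : i + context_length]
        for i in range(0, len(all_ids), context_length)
    ]
    if chunks and len(chunks[-1]) < context_length:
        chunks = chunks[:-1]
    return {"input_ids": chunks}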
67
00:03:04,170 --> 00:03:05,760
So far, we have only talked

68
00:03:05,760 --> 00:03:08,370
about the inputs for causal language modeling,

69
00:03:08,370 --> 00:03:11,850
but not the labels needed for supervised training.

70
00:03:11,850 --> 00:03:13,380
When we do causal language modeling,

71
00:03:13,380 --> 00:03:16,710
we don't require any extra labels for the input sequences,

72
00:03:16,710 --> 00:03:20,610
as the input sequences themselves are the labels.

73
00:03:20,610 --> 00:03:24,240
In this example, when we feed the token Trans to the model,

74
00:03:24,240 --> 00:03:27,510
the next token we want to predict is formers.

75
00:03:27,510 --> 00:03:30,780
In the next step, we feed Trans and formers to the model,

76
00:03:30,780 --> 00:03:33,903
and the label we want to predict is are.

77
00:03:35,460 --> 00:03:38,130
This pattern continues, and as you can see,

78
00:03:38,130 --> 00:03:41,220
the input sequence is the label sequence

79
00:03:41,220 --> 00:03:42,663
just shifted by one.

80
00:03:43,590 --> 00:03:47,310
Since the model only makes predictions after the first token,

81
00:03:47,310 --> 00:03:49,350
the first element of the input sequence,

82
00:03:49,350 --> 00:03:52,980
in this case, Trans, is not used as a label.

83
00:03:52,980 --> 00:03:55,530
Similarly, we don't have a label

84
00:03:55,530 --> 00:03:57,600
for the last token in the sequence,

85
00:03:57,600 --> 00:04:00,843
since there is no token after the sequence ends.

86
00:04:04,110 --> 00:04:06,300
Let's have a look at what we need to do

87
00:04:06,300 --> 00:04:10,200
to create the labels for causal language modeling in code.

88
00:04:10,200 --> 00:04:12,360
If we want to calculate a loss on a batch,

89
00:04:12,360 --> 00:04:15,120
we can just pass the input_ids as labels,

90
00:04:15,120 --> 00:04:18,933
and all the shifting is handled in the model internally.
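
For example, with a causal LM from transformers (the GPT-2 checkpoint and example sentence are just assumptions), passing the same tensor as labels is enough to get the loss; the shift by one happens inside the model:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative checkpoint
batch = tokenizer(["Transformers are awesome!"], return_tensors="pt")

# The labels are simply the input_ids; the model shifts them internally
# before computing the cross-entropy loss.
outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
print(outputs.loss)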
91
00:04:20,032 --> 00:04:22,170
So, you see, there's no matching involved

92
00:04:22,170 --> 00:04:24,870
in processing data for causal language modeling,

93
00:04:24,870 --> 00:04:27,723
and it only requires a few simple steps.

94
00:04:28,854 --> 00:04:31,771
(transition music)