docs(zh-cn): Reviewed 63_data-processing-for-causal-language-modeling.srt #421

Merged · 3 commits · Dec 23, 2022
110 changes: 55 additions & 55 deletions subtitles/zh-CN/63_data-processing-for-causal-language-modeling.srt
@@ -5,22 +5,22 @@

2
00:00:05,364 --> 00:00:08,310
- 在这个视频中,我们来看看数据处理
- 在这个视频中,我们来看看
- In this video, we take a look at the data processing

3
00:00:08,310 --> 00:00:10,803
训练因果语言模型所必需的
训练因果语言模型所必需的数据处理
necessary to train causal language models.

4
00:00:12,690 --> 00:00:14,400
因果语言建模是任务
因果语言建模是
Causal language modeling is the task

5
00:00:14,400 --> 00:00:17,820
基于先前的标记预测下一个标记
基于先前的词元预测下一个词元的任务
of predicting the next token based on the previous ones.
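
For readers following along in code, a minimal sketch of this next-token prediction, assuming the gpt2 checkpoint and the Transformers library (not part of the subtitle file itself):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (batch, sequence_length, vocab_size)
next_token_id = int(logits[0, -1].argmax())  # most probable next token given the previous ones
print(tokenizer.decode(next_token_id))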

6
@@ -35,12 +35,12 @@ is autoregressive modeling.

8
00:00:21,000 --> 00:00:23,940
在你可以在此处看到的示例中
在这里的示例中
In the example that you can see here,

9
00:00:23,940 --> 00:00:25,560
例如,下一个标记可以是
例如,下一个词元可以是
the next token could, for example,

10
@@ -65,7 +65,7 @@ To train models such as GPT,

14
00:00:38,010 --> 00:00:41,460
我们通常从大量文本文件开始
我们通常从大量文本文件组成的语料库开始
we usually start with a large corpus of text files.

15
@@ -90,22 +90,22 @@ like the ones you can see here.

19
00:00:50,400 --> 00:00:52,680
作为第一步,我们需要标记这些文件
作为第一步,我们需要词元化这些文件
As a first step, we need to tokenize these files

20
00:00:52,680 --> 00:00:55,380
这样我们就可以通过模型喂养他们
这样我们就可以将它们输入给模型
such that we can feed them through the model.

21
00:00:55,380 --> 00:00:58,500
在这里,我们将标记化的文本显示为不同长度的条
在这里,我们将词元化的文本显示为不同长度的条
Here, we show the tokenized texts as bars of various length,

22
00:00:58,500 --> 00:01:02,188
说明它们越来越短
表明它们有些长一些有些短一些
illustrating that they're shorter and longer ones.

23
@@ -120,12 +120,12 @@ However, transform models have a limited context window

25
00:01:09,270 --> 00:01:10,770
并根据数据源
并根据数据源的不同
and depending on the data source,

26
00:01:10,770 --> 00:01:13,140
标记化的文本可能
词元化的文本可能
it is possible that the tokenized texts

27
@@ -135,28 +135,28 @@ are much longer than this window.

28
00:01:16,080 --> 00:01:18,870
在这种情况下,我们可以截断序列
在这种情况下,我们可以将序列
In this case, we could just truncate the sequences

29
00:01:18,870 --> 00:01:20,182
上下文长度
截断为上下文长度
to the context length,

30
00:01:20,182 --> 00:01:22,650
但这意味着我们将失去一切
但这意味着在第一个上下文窗口之后
but this would mean that we lose everything

31
00:01:22,650 --> 00:01:24,513
在第一个上下文窗口之后
我们将失去一切
after the first context window.

32
00:01:25,500 --> 00:01:28,410
使用返回溢出令牌标志
Using the return overflowing token flag,
使用 return_overflowing_tokens 标志
Using the return overflowing token flag,

33
00:01:28,410 --> 00:01:30,960
@@ -165,22 +165,22 @@ we can use the tokenizer to create chunks

34
00:01:30,960 --> 00:01:33,510
每个都是上下文长度的大小
其中每个块都是上下文长度的大小
with each one being the size of the context length.

35
00:01:34,860 --> 00:01:36,180
有时,它仍然会发生
有时,如果没有足够的词元来填充它
Sometimes, it can still happen

36
00:01:36,180 --> 00:01:37,590
最后一块太短了
仍然会出现
that the last chunk is too short

37
00:01:37,590 --> 00:01:39,900
如果没有足够的令牌来填充它
最后一块太短的情况
if there aren't enough tokens to fill it.

38
@@ -200,22 +200,22 @@ we also get the length of each chunk from the tokenizer.
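
A minimal sketch of that tokenizer call, reusing the tokenizer from the sketch above; the context_length value and raw_texts variable are illustrative assumptions:

context_length = 128                      # illustrative; use your model's context size
outputs = tokenizer(
    raw_texts,                            # assumed: a list of raw text strings
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,       # keep the chunks beyond the first window
    return_length=True,                   # also report the token length of each chunk
)
# outputs["input_ids"] holds one entry per chunk,
# outputs["length"] gives each chunk's size in tokens.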

41
00:01:51,960 --> 00:01:53,640
此函数显示所有步骤
此函数显示准备数据集
This function shows all the steps

42
00:01:53,640 --> 00:01:56,280
准备数据集所必需的
所必需的所有步骤
necessary to prepare the dataset.

43
00:01:56,280 --> 00:01:57,960
首先,我们标记数据集
首先,我们用我刚才提到的标志
First, we tokenize the dataset

44
00:01:57,960 --> 00:02:00,330
用我刚才提到的标志
词元化数据集
with the flags I just mentioned.
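
A sketch of such a preparation function, in the spirit of the one shown in the video (assuming a datasets.Dataset named raw_dataset with a "text" column, plus the tokenizer and context_length from the sketch above):

def tokenize(batch):
    outputs = tokenizer(
        batch["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:      # keep only full-size chunks
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

tokenized_dataset = raw_dataset.map(
    tokenize,
    batched=True,                          # one text can produce several samples
    remove_columns=raw_dataset.column_names,
)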

45
@@ -250,12 +250,12 @@ that to use batches and remove the existing columns.

51
00:02:15,450 --> 00:02:17,670
我们需要删除现有的列
我们之所以需要删除现有的列
We need to remove the existing columns,

52
00:02:17,670 --> 00:02:21,330
因为我们可以为每个文本创建多个样本
是因为我们可以为每个文本创建多个样本
because we can create multiple samples per text,

53
@@ -290,37 +290,37 @@ are shorter than the context size

59
00:02:38,400 --> 00:02:41,610
并且将被以前的方法丢弃
并且按照之前的方法处理的话,将会丢弃它
and will be discarded with the previous approach.

60
00:02:41,610 --> 00:02:45,150
在这种情况下,最好先标记每个样本
在这种情况下,最好先词元化每个样本
In this case, it is better to first tokenize each sample

61
00:02:45,150 --> 00:02:46,590
没有截断
而不去截断
without truncation

62
00:02:46,590 --> 00:02:49,290
然后连接标记化样本
然后连接词元化后的样本
and then concatenate the tokenized samples

63
00:02:49,290 --> 00:02:52,353
中间有字符串结尾或 EOS 令牌
并且之间以字符串结尾或 EOS 词元结尾
with an end of string or EOS token in between.

64
00:02:53,546 --> 00:02:56,220
最后,我们可以分块这个长序列
最后,我们可以按照上下文长度分块这个长序列,
Finally, we can chunk this long sequence

65
00:02:56,220 --> 00:02:59,490
使用上下文长度,我们不会丢失太多序列
我们不会丢失太多序列
with the context length and we don't lose too many sequences

66
@@ -330,17 +330,17 @@ because they're too short anymore.
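
A minimal sketch of that concatenate-then-chunk approach (the helper below is an illustration, not code from the original file; it again assumes the tokenizer and context_length defined above, and can be passed to Dataset.map with batched=True like the function above):

def concatenate_and_chunk(batch):
    all_ids = []
    for text in batch["text"]:
        ids = tokenizer(text)["input_ids"]               # tokenize without truncation
        all_ids.extend(ids + [tokenizer.eos_token_id])   # EOS token between samples
    chunks = [
        all_ids[i : i + context_length]
        for i in range(0, len(all_ids), context_length)
    ]
    if chunks and len(chunks[-1]) < context_length:      # only the last chunk can be short
        chunks = chunks[:-1]
    return {"input_ids": chunks}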

67
00:03:04,170 --> 00:03:05,760
到目前为止,我们只谈过
到目前为止,我们只介绍了
So far, we have only talked

68
00:03:05,760 --> 00:03:08,370
关于因果语言建模的输入
因果语言建模的输入
about the inputs for causal language modeling,

69
00:03:08,370 --> 00:03:11,850
但不是监督培训所需的标签
但还没有提到监督训练所需的标签
but not the labels needed for supervised training.

70
@@ -360,17 +360,17 @@ as the input sequences themselves are the labels.

73
00:03:20,610 --> 00:03:24,240
在这个例子中,当我们将 token trans 提供给模型时,
In this example, when we feed the token trans to the model,
在这个例子中,当我们将词元 Trans 提供给模型时,
In this example, when we feed the token Trans to the model,

74
00:03:24,240 --> 00:03:27,510
我们要预测的下一个标记是前者
我们要预测的下一个词元是 formers
the next token we wanted to predict is formers.

75
00:03:27,510 --> 00:03:30,780
在下一步中,我们将 trans 和 formers 提供给模型
在下一步中,我们将 Trans 和 formers 提供给模型
In the next step, we feed trans and formers to the model

76
@@ -385,17 +385,17 @@ This pattern continues, and as you can see,

78
00:03:38,130 --> 00:03:41,220
输入序列是标签序列
输入序列是前移了一个位置的
the input sequence is the label sequence

79
00:03:41,220 --> 00:03:42,663
只是移动了一个
标签序列
just shifted by one.

80
00:03:43,590 --> 00:03:47,310
由于模型仅在第一个标记之后进行预测
由于模型仅在第一个词元之后进行预测
Since the model only makes prediction after the first token,

81
@@ -405,32 +405,32 @@ the first element of the input sequence,

82
00:03:49,350 --> 00:03:52,980
在这种情况下,反式不用作标签
in this case, trans, is not used as a label.
在本例中,就是 Trans,不会作为标签使用
in this case, Trans, is not used as a label.

83
00:03:52,980 --> 00:03:55,530
同样,我们没有标签
同样,对于序列中的最后一个词元
Similarly, we don't have a label

84
00:03:55,530 --> 00:03:57,600
对于序列中的最后一个标记
我们也没有标签
for the last token in the sequence

85
00:03:57,600 --> 00:04:00,843
因为序列结束后没有令牌
因为序列结束后没有词元
since there is no token after the sequence ends.

86
00:04:04,110 --> 00:04:06,300
让我们看看我们需要做什么
让我们看看当需要在代码中为因果语言建模创建标签
Let's have a look at what we need to do

87
00:04:06,300 --> 00:04:10,200
在代码中为因果语言建模创建标签
我们需要做什么操作
to create the labels for causal language modeling in code.

88
@@ -450,12 +450,12 @@ and all the shifting is handled in the model internally.
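
A minimal sketch of that last point (assuming model is a Transformers causal LM such as GPT-2 and tokenized_dataset comes from the sketches above):

import torch

input_ids = torch.tensor([tokenized_dataset[0]["input_ids"]])   # one sample with a batch dimension
outputs = model(input_ids=input_ids, labels=input_ids)           # labels are just the inputs
loss = outputs.loss        # the one-position shift is applied inside the model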

91
00:04:20,032 --> 00:04:22,170
所以,你看,不涉及任何匹配
所以,你看,在处理因果语言建模的数据时,
So, you see, there's no matching involved

92
00:04:22,170 --> 00:04:24,870
在处理因果语言建模的数据时
不涉及任何匹配
in processing data for causal language modeling,

93