docs(zh-cn): Reviewed 63_data-processing-for-causal-language-modeling.srt #421

Merged · 3 commits · Dec 23, 2022
110 changes: 55 additions & 55 deletions subtitles/zh-CN/63_data-processing-for-causal-language-modeling.srt
@@ -5,22 +5,22 @@

2
00:00:05,364 --> 00:00:08,310
- 在这个视频中,我们来看看数据处理
- 在这个视频中,我们来看看
- In this video, we take a look at the data processing

3
00:00:08,310 --> 00:00:10,803
训练因果语言模型所必需的
训练因果语言模型所必需的数据处理
necessary to train causal language models.

4
00:00:12,690 --> 00:00:14,400
因果语言建模是任务
因果语言建模是
Causal language modeling is the task

5
00:00:14,400 --> 00:00:17,820
基于先前的标记预测下一个标记
基于先前的词元预测下一个词元的任务
of predicting the next token based on the previous ones.
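
For readers following along in code, a minimal sketch of this next-token prediction, assuming the gpt2 checkpoint and the Transformers library (not part of the subtitle file itself):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (batch, sequence_length, vocab_size)
next_token_id = int(logits[0, -1].argmax())  # most probable next token given the previous ones
print(tokenizer.decode(next_token_id))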

6
@@ -35,12 +35,12 @@ is autoregressive modeling.

8
00:00:21,000 --> 00:00:23,940
在你可以在此处看到的示例中
在这里的示例中
In the example that you can see here,

9
00:00:23,940 --> 00:00:25,560
例如,下一个标记可以是
例如,下一个词元可以是
the next token could, for example,

10
@@ -65,7 +65,7 @@ To train models such as GPT,

14
00:00:38,010 --> 00:00:41,460
我们通常从大量文本文件开始
我们通常从大量文本文件组成的语料库开始
we usually start with a large corpus of text files.

15
@@ -90,22 +90,22 @@ like the ones you can see here.

19
00:00:50,400 --> 00:00:52,680
作为第一步,我们需要标记这些文件
作为第一步,我们需要词元化这些文件
As a first step, we need to tokenize these files

20
00:00:52,680 --> 00:00:55,380
这样我们就可以通过模型喂养他们
这样我们就可以将它们输入给模型
such that we can feed them through the model.

21
00:00:55,380 --> 00:00:58,500
在这里,我们将标记化的文本显示为不同长度的条
在这里,我们将词元化的文本显示为不同长度的条
Here, we show the tokenized texts as bars of various length,

22
00:00:58,500 --> 00:01:02,188
说明它们越来越短
表明它们有些长一些有些短一些
illustrating that they're shorter and longer ones.

23
@@ -120,12 +120,12 @@ However, transform models have a limited context window

25
00:01:09,270 --> 00:01:10,770
并根据数据源
并根据数据源的不同
and depending on the data source,

26
00:01:10,770 --> 00:01:13,140
标记化的文本可能
词元化的文本可能
it is possible that the tokenized texts

27
@@ -135,28 +135,28 @@ are much longer than this window.

28
00:01:16,080 --> 00:01:18,870
在这种情况下,我们可以截断序列
在这种情况下,我们可以将序列
In this case, we could just truncate the sequences

29
00:01:18,870 --> 00:01:20,182
上下文长度
截断为上下文长度
to the context length,

30
00:01:20,182 --> 00:01:22,650
但这意味着我们将失去一切
但这意味着在第一个上下文窗口之后
but this would mean that we lose everything

31
00:01:22,650 --> 00:01:24,513
在第一个上下文窗口之后
我们将失去一切
after the first context window.

32
00:01:25,500 --> 00:01:28,410
使用返回溢出令牌标志
Using the return overflowing token flag,
使用 return_overflowing_tokens 标志
Using the return overflowing token flag,

33
00:01:28,410 --> 00:01:30,960
@@ -165,22 +165,22 @@ we can use the tokenizer to create chunks

34
00:01:30,960 --> 00:01:33,510
每个都是上下文长度的大小
其中每个块都是上下文长度的大小
with each one being the size of the context length.

35
00:01:34,860 --> 00:01:36,180
有时,它仍然会发生
有时,如果没有足够的词元来填充它
Sometimes, it can still happen

36
00:01:36,180 --> 00:01:37,590
最后一块太短了
仍然会出现
that the last chunk is too short

37
00:01:37,590 --> 00:01:39,900
如果没有足够的令牌来填充它
最后一块太短的情况
if there aren't enough tokens to fill it.

38
@@ -200,22 +200,22 @@ we also get the length of each chunk from the tokenizer.
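
A minimal sketch of that tokenizer call, reusing the tokenizer from the sketch above; the context_length value and raw_texts variable are illustrative assumptions:

context_length = 128                      # illustrative; use your model's context size
outputs = tokenizer(
    raw_texts,                            # assumed: a list of raw text strings
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,       # keep the chunks beyond the first window
    return_length=True,                   # also report the token length of each chunk
)
# outputs["input_ids"] holds one entry per chunk,
# outputs["length"] gives each chunk's size in tokens.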

41
00:01:51,960 --> 00:01:53,640
此函数显示所有步骤
此函数显示准备数据集
This function shows all the steps

42
00:01:53,640 --> 00:01:56,280
准备数据集所必需的
所必需的所有步骤
necessary to prepare the dataset.

43
00:01:56,280 --> 00:01:57,960
首先,我们标记数据集
首先,我们用我刚才提到的标志
First, we tokenize the dataset

44
00:01:57,960 --> 00:02:00,330
用我刚才提到的标志
词元化数据集
with the flags I just mentioned.
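
A sketch of such a preparation function, in the spirit of the one shown in the video (assuming a datasets.Dataset named raw_dataset with a "text" column, plus the tokenizer and context_length from the sketch above):

def tokenize(batch):
    outputs = tokenizer(
        batch["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:      # keep only full-size chunks
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

tokenized_dataset = raw_dataset.map(
    tokenize,
    batched=True,                          # one text can produce several samples
    remove_columns=raw_dataset.column_names,
)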

45
@@ -250,12 +250,12 @@ that to use batches and remove the existing columns.

51
00:02:15,450 --> 00:02:17,670
我们需要删除现有的列
我们之所以需要删除现有的列
We need to remove the existing columns,

52
00:02:17,670 --> 00:02:21,330
因为我们可以为每个文本创建多个样本
是因为我们可以为每个文本创建多个样本
because we can create multiple samples per text,

53
@@ -290,37 +290,37 @@ are shorter than the context size

59
00:02:38,400 --> 00:02:41,610
并且将被以前的方法丢弃
并且按照之前的方法处理的话,将会丢弃它
and will be discarded with the previous approach.

60
00:02:41,610 --> 00:02:45,150
在这种情况下,最好先标记每个样本
在这种情况下,最好先词元化每个样本
In this case, it is better to first tokenize each sample

61
00:02:45,150 --> 00:02:46,590
没有截断
而不去截断
without truncation

62
00:02:46,590 --> 00:02:49,290
然后连接标记化样本
然后连接词元化后的样本
and then concatenate the tokenized samples

63
00:02:49,290 --> 00:02:52,353
中间有字符串结尾或 EOS 令牌
并且之间以字符串结尾或 EOS 词元结尾
with an end of string or EOS token in between.

64
00:02:53,546 --> 00:02:56,220
最后,我们可以分块这个长序列
最后,我们可以按照上下文长度分块这个长序列,
Finally, we can chunk this long sequence

65
00:02:56,220 --> 00:02:59,490
使用上下文长度,我们不会丢失太多序列
我们不会丢失太多序列
with the context length and we don't lose too many sequences

66
@@ -330,17 +330,17 @@ because they're too short anymore.
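
A minimal sketch of that concatenate-then-chunk approach (the helper below is an illustration, not code from the original file; it again assumes the tokenizer and context_length defined above, and can be passed to Dataset.map with batched=True like the function above):

def concatenate_and_chunk(batch):
    all_ids = []
    for text in batch["text"]:
        ids = tokenizer(text)["input_ids"]               # tokenize without truncation
        all_ids.extend(ids + [tokenizer.eos_token_id])   # EOS token between samples
    chunks = [
        all_ids[i : i + context_length]
        for i in range(0, len(all_ids), context_length)
    ]
    if chunks and len(chunks[-1]) < context_length:      # only the last chunk can be short
        chunks = chunks[:-1]
    return {"input_ids": chunks}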

67
00:03:04,170 --> 00:03:05,760
到目前为止,我们只谈过
到目前为止,我们只介绍了
So far, we have only talked

68
00:03:05,760 --> 00:03:08,370
关于因果语言建模的输入
因果语言建模的输入
about the inputs for causal language modeling,

69
00:03:08,370 --> 00:03:11,850
但不是监督培训所需的标签
但还没有提到监督训练所需的标签
but not the labels needed for supervised training.

70
@@ -360,17 +360,17 @@ as the input sequences themselves are the labels.

73
00:03:20,610 --> 00:03:24,240
在这个例子中,当我们将 token trans 提供给模型时,
In this example, when we feed the token trans to the model,
在这个例子中,当我们将词元 Trans 提供给模型时,
In this example, when we feed the token Trans to the model,

74
00:03:24,240 --> 00:03:27,510
我们要预测的下一个标记是前者
我们要预测的下一个词元是 formers
the next token we wanted to predict is formers.

75
00:03:27,510 --> 00:03:30,780
在下一步中,我们将 trans 和 formers 提供给模型
在下一步中,我们将 Trans 和 formers 提供给模型
In the next step, we feed trans and formers to the model

76
@@ -385,17 +385,17 @@ This pattern continues, and as you can see,

78
00:03:38,130 --> 00:03:41,220
输入序列是标签序列
输入序列是前移了一个位置的
the input sequence is the label sequence

79
00:03:41,220 --> 00:03:42,663
只是移动了一个
标签序列
just shifted by one.

80
00:03:43,590 --> 00:03:47,310
由于模型仅在第一个标记之后进行预测
由于模型仅在第一个词元之后进行预测
Since the model only makes prediction after the first token,

81
@@ -405,32 +405,32 @@ the first element of the input sequence,

82
00:03:49,350 --> 00:03:52,980
在这种情况下,反式不用作标签
in this case, trans, is not used as a label.
在本例中,就是 Trans,不会作为标签使用
in this case, Trans, is not used as a label.

83
00:03:52,980 --> 00:03:55,530
同样,我们没有标签
同样,对于序列中的最后一个词元
Similarly, we don't have a label

84
00:03:55,530 --> 00:03:57,600
对于序列中的最后一个标记
我们也没有标签
for the last token in the sequence

85
00:03:57,600 --> 00:04:00,843
因为序列结束后没有令牌
因为序列结束后没有词元
since there is no token after the sequence ends.

86
00:04:04,110 --> 00:04:06,300
让我们看看我们需要做什么
让我们看看当需要在代码中为因果语言建模创建标签
Let's have a look at what we need to do

87
00:04:06,300 --> 00:04:10,200
在代码中为因果语言建模创建标签
我们需要做什么操作
to create the labels for causal language modeling in code.

88
@@ -450,12 +450,12 @@ and all the shifting is handled in the model internally.
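
A minimal sketch of that last point (assuming model is a Transformers causal LM such as GPT-2 and tokenized_dataset comes from the sketches above):

import torch

input_ids = torch.tensor([tokenized_dataset[0]["input_ids"]])   # one sample with a batch dimension
outputs = model(input_ids=input_ids, labels=input_ids)           # labels are just the inputs
loss = outputs.loss        # the one-position shift is applied inside the model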

91
00:04:20,032 --> 00:04:22,170
所以,你看,不涉及任何匹配
所以,你看,在处理因果语言建模的数据时,
So, you see, there's no matching involved

92
00:04:22,170 --> 00:04:24,870
在处理因果语言建模的数据时
不涉及任何匹配
in processing data for causal language modeling,

93