21_preprocessing-sentence-pairs-(pytorch).srt
1
00:00:00,000 --> 00:00:03,083
(graphics whooshing)
2
00:00:05,370 --> 00:00:07,413
- How to pre-process pairs of sentences.
3
00:00:09,150 --> 00:00:11,340
We have seen how to tokenize single sentences
4
00:00:11,340 --> 00:00:12,877
and batch them together in the
5
00:00:12,877 --> 00:00:15,810
"Batching inputs together" video.
6
00:00:15,810 --> 00:00:18,330
If this code looks unfamiliar to you,
7
00:00:18,330 --> 00:00:20,030
be sure to check that video again.
8
00:00:21,330 --> 00:00:24,543
Here we will focus on tasks that classify pairs of sentences.
9
00:00:25,620 --> 00:00:28,470
For instance, we may want to classify whether two texts
10
00:00:28,470 --> 00:00:30,360
are paraphrases of each other or not.
11
00:00:30,360 --> 00:00:32,880
Here is an example taken from the Quora Question Pairs dataset,
12
00:00:32,880 --> 00:00:37,530
which focuses on identifying duplicate questions.
13
00:00:37,530 --> 00:00:40,650
In the first pair, the two questions are duplicates;
14
00:00:40,650 --> 00:00:42,000
in the second, they are not.
15
00:00:43,283 --> 00:00:45,540
Another pair classification problem is
16
00:00:45,540 --> 00:00:47,400
when we want to know if two sentences are
17
00:00:47,400 --> 00:00:49,590
logically related or not,
18
00:00:49,590 --> 00:00:53,970
a problem called natural language inference, or NLI.
19
00:00:53,970 --> 00:00:57,000
In this example, taken from the MultiNLI dataset,
20
00:00:57,000 --> 00:00:59,880
we have a pair of sentences for each possible label:
21
00:00:59,880 --> 00:01:02,490
contradiction, neutral, or entailment,
22
00:01:02,490 --> 00:01:04,680
which is a fancy way of saying the first sentence
23
00:01:04,680 --> 00:01:05,793
implies the second.
24
00:01:06,930 --> 00:01:08,820
So classifying pairs of sentences is a problem
25
00:01:08,820 --> 00:01:10,260
worth studying.
26
00:01:10,260 --> 00:01:12,630
In fact, in the GLUE benchmark,
27
00:01:12,630 --> 00:01:15,750
which is an academic benchmark for text classification,
28
00:01:15,750 --> 00:01:17,910
8 of the 10 datasets are focused
29
00:01:17,910 --> 00:01:19,953
on tasks using pairs of sentences.
30
00:01:20,910 --> 00:01:22,560
That's why models like BERT
31
00:01:22,560 --> 00:01:25,320
are often pre-trained with a dual objective.
32
00:01:25,320 --> 00:01:27,660
On top of the language modeling objective,
33
00:01:27,660 --> 00:01:31,230
they often have an objective related to sentence pairs.
34
00:01:31,230 --> 00:01:34,320
For instance, during pretraining, BERT is shown
35
00:01:34,320 --> 00:01:36,810
pairs of sentences and must predict both
36
00:01:36,810 --> 00:01:39,930
the value of randomly masked tokens and whether the second
37
00:01:39,930 --> 00:01:41,830
sentence follows from the first or not.
38
00:01:43,084 --> 00:01:45,930
Fortunately, the tokenizer from the Transformers library
39
00:01:45,930 --> 00:01:49,170
has a nice API to deal with pairs of sentences.
40
00:01:49,170 --> 00:01:51,270
You just have to pass them as two arguments
41
00:01:51,270 --> 00:01:52,120
to the tokenizer.
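As a rough sketch of that call (the checkpoint name and example sentences below are assumptions for illustration, not from the video):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Passing the two sentences as two positional arguments makes
# the tokenizer treat them as a single sentence pair.
inputs = tokenizer("My name is Sylvain.", "I work at Hugging Face.")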
42
00:01:53,430 --> 00:01:55,470
On top of the input IDs and the attention mask
43
00:01:55,470 --> 00:01:56,970
we studied already,
44
00:01:56,970 --> 00:01:59,910
it returns a new field called token type IDs,
45
00:01:59,910 --> 00:02:01,790
which tells the model which tokens belong
46
00:02:01,790 --> 00:02:03,630
to the first sentence
47
00:02:03,630 --> 00:02:05,943
and which ones belong to the second sentence.
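Continuing the sketch above, printing the output shows the extra field; for BERT-like tokenizers, token type ID 0 marks the first sentence and 1 the second:

print(inputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(inputs["token_type_ids"])
# e.g. [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] -- 0s for the first
# sentence (and its special tokens), 1s for the second.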
48
00:02:07,290 --> 00:02:09,840
Zooming in a little bit, here are the input IDs
49
00:02:09,840 --> 00:02:12,180
aligned with the tokens they correspond to,
50
00:02:12,180 --> 00:02:15,213
their respective token type IDs, and attention mask.
51
00:02:16,080 --> 00:02:19,260
We can see the tokenizer also added special tokens.
52
00:02:19,260 --> 00:02:22,620
So we have a CLS token, the tokens from the first sentence,
53
00:02:22,620 --> 00:02:25,770
a SEP token, the tokens from the second sentence,
54
00:02:25,770 --> 00:02:27,003
and a final SEP token.
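To see those special tokens, we can map the input IDs back to their tokens (still reusing the hypothetical inputs from above):

print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))
# ['[CLS]', ..., '[SEP]', ..., '[SEP]'] -- a CLS token first, a SEP
# between the two sentences, and a final SEP at the end.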
55
00:02:28,500 --> 00:02:30,570
If we have several pairs of sentences,
56
00:02:30,570 --> 00:02:32,840
we can tokenize them together by passing the list
57
00:02:32,840 --> 00:02:36,630
of first sentences, then the list of second sentences,
58
00:02:36,630 --> 00:02:39,300
and all the keyword arguments we studied already,
59
00:02:39,300 --> 00:02:40,353
like padding=True.
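A minimal sketch of that batched call (the sentence pairs are made up; padding=True pads every sequence to the longest one in the batch):

batch = tokenizer(
    ["My name is Sylvain.", "Going to the cinema."],
    ["I work at Hugging Face.", "This movie is great."],
    padding=True,
    return_tensors="pt",
)
# The i-th entry of the first list is paired with the i-th entry
# of the second list.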
60
00:02:41,940 --> 00:02:43,140
Zooming in on the result,
61
00:02:43,140 --> 00:02:45,030
we can see how the tokenizer added padding
62
00:02:45,030 --> 00:02:48,090
to the second pair of sentences to make the two outputs
63
00:02:48,090 --> 00:02:51,360
the same length, and properly dealt with the token type IDs
64
00:02:51,360 --> 00:02:53,643
and attention masks for the two sentences.
65
00:02:54,900 --> 00:02:57,573
This is then all ready to pass through our model.
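For instance, with a sequence classification head (the model class and checkpoint here are illustrative; a real pair classification task would use a checkpoint fine-tuned for it):

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
with torch.no_grad():
    outputs = model(**batch)  # batch from the previous step
print(outputs.logits.shape)  # torch.Size([2, 2]): two pairs, two labels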