1
00:00:00,621 --> 00:00:03,204
(upbeat music)
2
00:00:05,670 --> 00:00:08,520
- Text embeddings and semantic search.
3
00:00:08,520 --> 00:00:10,770
In this video, we'll explore how Transformer models
4
00:00:10,770 --> 00:00:12,810
represent text as embedding vectors
5
00:00:12,810 --> 00:00:15,420
and how these vectors can be used to find similar documents
6
00:00:15,420 --> 00:00:16,293
in a corpus.
7
00:00:17,730 --> 00:00:19,890
Text embeddings are just a fancy way of saying
8
00:00:19,890 --> 00:00:22,170
that we can represent text as an array of numbers
9
00:00:22,170 --> 00:00:23,640
called a vector.
10
00:00:23,640 --> 00:00:25,710
To create these embeddings we usually use
11
00:00:25,710 --> 00:00:27,393
an encoder-based model like BERT.
12
00:00:28,530 --> 00:00:31,290
In this example, you can see how we feed three sentences
13
00:00:31,290 --> 00:00:34,830
to the encoder and get three vectors as the output.
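
As a rough sketch of this step, here is how feeding a batch of sentences to an encoder might look with the transformers library. The video doesn't name its checkpoint or show the full sentence list, so both are assumptions here; all-MiniLM-L6-v2 is one sentence-transformers model that produces the 384-dimensional vectors mentioned later.

```python
from transformers import AutoTokenizer, AutoModel

# Hypothetical checkpoint (not named in the video); it matches the
# 384-dimensional embeddings discussed later on.
ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# Illustrative sentences; only the dog/cat pair appears in the video.
sentences = [
    "I took my dog for a walk",
    "Today is going to rain",
    "I took my cat for a walk",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
# One embedding per token, per sentence: [3, num_tokens, 384]
print(outputs.last_hidden_state.shape)
```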
14
00:00:34,830 --> 00:00:37,050
Reading the text, we can see that walking the dog
15
00:00:37,050 --> 00:00:39,450
seems to be most similar to walking the cat,
16
00:00:39,450 --> 00:00:41,350
but let's see if we can quantify this.
17
00:00:42,810 --> 00:00:44,040
The trick to do the comparison
18
00:00:44,040 --> 00:00:45,630
is to compute a similarity metric
19
00:00:45,630 --> 00:00:48,210
between each pair of embedding vectors.
20
00:00:48,210 --> 00:00:51,120
These vectors usually live in a very high-dimensional space,
21
00:00:51,120 --> 00:00:53,190
so a similarity metric can be anything that measures
22
00:00:53,190 --> 00:00:55,740
some sort of distance between vectors.
23
00:00:55,740 --> 00:00:58,560
One very popular metric is cosine similarity,
24
00:00:58,560 --> 00:01:00,390
which uses the angle between two vectors
25
00:01:00,390 --> 00:01:02,610
to measure how close they are.
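
Concretely, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal sketch, with made-up 3D vectors standing in for the ones in the figure:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = u . v / (||u|| * ||v||); ranges from -1 to 1
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors (values are made up, not taken from the video)
orange = np.array([0.9, 0.2, 0.1])
grey = np.array([0.8, 0.3, 0.2])
blue = np.array([0.1, 0.9, 0.4])

print(cosine_similarity(orange, grey))  # small angle -> close to 1
print(cosine_similarity(orange, blue))  # larger angle -> lower score
```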
26
00:01:02,610 --> 00:01:05,250
In this example, our embedding vectors live in 3D
27
00:01:05,250 --> 00:01:07,110
and we can see that the orange and grey vectors
28
00:01:07,110 --> 00:01:09,560
are close to each other and have a smaller angle.
29
00:01:11,130 --> 00:01:12,510
Now one problem we have to deal with
30
00:01:12,510 --> 00:01:15,180
is that Transformer models like BERT will actually return
31
00:01:15,180 --> 00:01:16,983
one embedding vector per token.
32
00:01:17,880 --> 00:01:20,700
For example, in the sentence "I took my dog for a walk,"
33
00:01:20,700 --> 00:01:23,853
we can expect several embedding vectors, one for each word.
34
00:01:25,110 --> 00:01:27,870
For example, here we can see the output of our model
35
00:01:27,870 --> 00:01:30,540
has produced 9 embedding vectors per sentence,
36
00:01:30,540 --> 00:01:33,750
and each vector has 384 dimensions.
37
00:01:33,750 --> 00:01:36,210
But what we really want is a single embedding vector
38
00:01:36,210 --> 00:01:37,353
for each sentence.
39
00:01:38,940 --> 00:01:42,060
To deal with this, we can use a technique called pooling.
40
00:01:42,060 --> 00:01:43,050
The simplest pooling method
41
00:01:43,050 --> 00:01:44,520
is to just take the token embedding
42
00:01:44,520 --> 00:01:46,203
of the special CLS token.
43
00:01:47,100 --> 00:01:49,650
Alternatively, we can average the token embeddings
44
00:01:49,650 --> 00:01:52,500
which is called mean pooling, and this is what we do here.
45
00:01:53,370 --> 00:01:55,800
With mean pooling, the only thing we need to make sure
46
00:01:55,800 --> 00:01:58,410
is that we don't include the padding tokens in the average,
47
00:01:58,410 --> 00:02:01,860
which is why you can see the attention_mask being used here.
48
00:02:01,860 --> 00:02:05,100
This gives us a 384-dimensional vector for each sentence,
49
00:02:05,100 --> 00:02:06,600
which is exactly what we want.
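
Here is a sketch of mean pooling along the lines the video describes, continuing from the earlier snippet (outputs and inputs come from there). The attention mask zeroes out the padding positions so they contribute nothing to the average:

```python
import torch

def mean_pooling(model_output, attention_mask):
    # token_embeddings: [batch_size, seq_len, hidden_size]
    token_embeddings = model_output.last_hidden_state
    # Broadcast the mask over the hidden dimension so padding tokens are zeroed
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, dim=1)
    # Divide by the number of real (non-padding) tokens in each sentence
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts  # [batch_size, hidden_size]

sentence_embeddings = mean_pooling(outputs, inputs["attention_mask"])
print(sentence_embeddings.shape)  # e.g. torch.Size([3, 384])
```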
50
00:02:07,920 --> 00:02:09,810
And once we have our sentence embeddings,
51
00:02:09,810 --> 00:02:11,730
we can compute the cosine similarity
52
00:02:11,730 --> 00:02:13,113
for each pair of vectors.
53
00:02:13,993 --> 00:02:16,350
In this example, we use the cosine_similarity function from scikit-learn
54
00:02:16,350 --> 00:02:19,140
and you can see that the sentence "I took my dog for a walk"
55
00:02:19,140 --> 00:02:22,140
indeed has a strong overlap with "I took my cat for a walk".
56
00:02:22,140 --> 00:02:23,240
Hooray! We've done it.
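
A sketch of that comparison, assuming the scikit-learn function in question is sklearn.metrics.pairwise.cosine_similarity and reusing sentence_embeddings from the pooling sketch above:

```python
from sklearn.metrics.pairwise import cosine_similarity

# scikit-learn works on NumPy arrays, so detach from the autograd graph first
emb = sentence_embeddings.detach().numpy()
scores = cosine_similarity(emb)  # 3 x 3 matrix of pairwise similarities
print(scores.round(2))
# The dog/cat walking sentences should score highest against each other
```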
57
00:02:25,110 --> 00:02:27,180
We can actually take this idea one step further
58
00:02:27,180 --> 00:02:29,220
by comparing the similarity between a question
59
00:02:29,220 --> 00:02:31,170
and a corpus of documents.
60
00:02:31,170 --> 00:02:33,810
For example, suppose we embed every post
61
00:02:33,810 --> 00:02:35,430
in the Hugging Face forums.
62
00:02:35,430 --> 00:02:37,800
We can then ask a question, embed it,
63
00:02:37,800 --> 00:02:40,590
and check which forum posts are most similar.
64
00:02:40,590 --> 00:02:42,750
This process is often called semantic search,
65
00:02:42,750 --> 00:02:45,423
because it allows us to compare queries with context.
66
00:02:47,040 --> 00:02:48,450
Creating a semantic search engine
67
00:02:48,450 --> 00:02:51,030
is actually quite simple with the datasets library.
68
00:02:51,030 --> 00:02:53,340
First we need to embed all the documents.
69
00:02:53,340 --> 00:02:56,070
And in this example, we take a small sample
70
00:02:56,070 --> 00:02:57,780
from the SQuAD dataset and apply
71
00:02:57,780 --> 00:03:00,180
the same embedding logic as before.
72
00:03:00,180 --> 00:03:02,280
This gives us a new column called embeddings,
73
00:03:02,280 --> 00:03:04,530
which stores the embeddings of every passage.
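
A sketch of this embedding step with the datasets library, reusing the tokenizer, model, and mean_pooling defined in the earlier snippets; the split and sample size are assumptions, since the video doesn't show them:

```python
from datasets import load_dataset
import torch

# A small slice of SQuAD (hypothetical size; the video just says "a small sample")
squad = load_dataset("squad", split="validation[:100]")

def embed(batch):
    enc = tokenizer(batch["context"], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Same mean pooling as before, stored as a new "embeddings" column
    return {"embeddings": mean_pooling(out, enc["attention_mask"]).numpy()}

squad = squad.map(embed, batched=True, batch_size=16)
```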
74
00:03:05,880 --> 00:03:07,260
Once we have our embeddings,
75
00:03:07,260 --> 00:03:10,200
we need a way to find nearest neighbors for a query.
76
00:03:10,200 --> 00:03:13,170
The datasets library provides a special FAISS index
77
00:03:13,170 --> 00:03:16,080
which allows you to quickly compare embedding vectors.
78
00:03:16,080 --> 00:03:19,950
So we add the FAISS index, embed a question, and voila,
79
00:03:19,950 --> 00:03:21,870
we've now found the 3 most similar articles
80
00:03:21,870 --> 00:03:23,320
which might contain the answer.
81
00:03:25,182 --> 00:03:27,849
(upbeat music)