1
00:00:00,621 --> 00:00:03,204
(upbeat music)
2
00:00:05,670 --> 00:00:08,520
- Text embeddings and semantic search.
3
00:00:08,520 --> 00:00:10,770
In this video, we'll explore how Transformer models
4
00:00:10,770 --> 00:00:12,810
represent text as embedding vectors
5
00:00:12,810 --> 00:00:15,420
and how these vectors can be used to find similar documents
6
00:00:15,420 --> 00:00:16,293
in a corpus.
7
00:00:17,730 --> 00:00:19,890
Text embeddings are just a fancy way of saying
8
00:00:19,890 --> 00:00:22,170
that we can represent text as an array of numbers
9
00:00:22,170 --> 00:00:23,640
called a vector.
10
00:00:23,640 --> 00:00:25,710
To create these embeddings we usually use
11
00:00:25,710 --> 00:00:27,393
an encoder-based model like BERT.
12
00:00:28,530 --> 00:00:31,290
In this example, you can see how we feed three sentences
13
00:00:31,290 --> 00:00:34,830
to the encoder and get three vectors as the output.
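
As a rough sketch of this step, here is how feeding a batch of sentences to an encoder might look with the transformers library. The video doesn't name its checkpoint or show the full sentence list, so both are assumptions here; all-MiniLM-L6-v2 is one sentence-transformers model that produces the 384-dimensional vectors mentioned later.

```python
from transformers import AutoTokenizer, AutoModel

# Hypothetical checkpoint (not named in the video); it matches the
# 384-dimensional embeddings discussed later on.
ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# Illustrative sentences; only the dog/cat pair appears in the video.
sentences = [
    "I took my dog for a walk",
    "Today is going to rain",
    "I took my cat for a walk",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
# One embedding per token, per sentence: [3, num_tokens, 384]
print(outputs.last_hidden_state.shape)
```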
14
00:00:34,830 --> 00:00:37,050
Reading the text, we can see that walking the dog
15
00:00:37,050 --> 00:00:39,450
seems to be most similar to walking the cat,
16
00:00:39,450 --> 00:00:41,350
but let's see if we can quantify this.
17
00:00:42,810 --> 00:00:44,040
The trick to do the comparison
18
00:00:44,040 --> 00:00:45,630
is to compute a similarity metric
19
00:00:45,630 --> 00:00:48,210
between each pair of embedding vectors.
20
00:00:48,210 --> 00:00:51,120
These vectors usually live in a very high-dimensional space,
21
00:00:51,120 --> 00:00:53,190
so a similarity metric can be anything that measures
22
00:00:53,190 --> 00:00:55,740
some sort of distance between vectors.
23
00:00:55,740 --> 00:00:58,560
One very popular metric is cosine similarity,
24
00:00:58,560 --> 00:01:00,390
which uses the angle between two vectors
25
00:01:00,390 --> 00:01:02,610
to measure how close they are.
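
Concretely, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal sketch, with made-up 3D vectors standing in for the ones in the figure:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = u . v / (||u|| * ||v||); ranges from -1 to 1
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors (values are made up, not taken from the video)
orange = np.array([0.9, 0.2, 0.1])
grey = np.array([0.8, 0.3, 0.2])
blue = np.array([0.1, 0.9, 0.4])

print(cosine_similarity(orange, grey))  # small angle -> close to 1
print(cosine_similarity(orange, blue))  # larger angle -> lower score
```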
26
00:01:02,610 --> 00:01:05,250
In this example, our embedding vectors live in 3D
27
00:01:05,250 --> 00:01:07,110
and we can see that the orange and grey vectors
28
00:01:07,110 --> 00:01:09,560
are close to each other and have a smaller angle.
29
00:01:11,130 --> 00:01:12,510
Now one problem we have to deal with
30
00:01:12,510 --> 00:01:15,180
is that Transformer models like BERT will actually return
31
00:01:15,180 --> 00:01:16,983
one embedding vector per token.
32
00:01:17,880 --> 00:01:20,700
For example, in the sentence "I took my dog for a walk,"
33
00:01:20,700 --> 00:01:23,853
we can expect several embedding vectors, one for each word.
34
00:01:25,110 --> 00:01:27,870
For example, here we can see the output of our model
35
00:01:27,870 --> 00:01:30,540
has produced 9 embedding vectors per sentence,
36
00:01:30,540 --> 00:01:33,750
and each vector has 384 dimensions.
37
00:01:33,750 --> 00:01:36,210
But what we really want is a single embedding vector
38
00:01:36,210 --> 00:01:37,353
for each sentence.
39
00:01:38,940 --> 00:01:42,060
To deal with this, we can use a technique called pooling.
40
00:01:42,060 --> 00:01:43,050
The simplest pooling method
41
00:01:43,050 --> 00:01:44,520
is to just take the token embedding
42
00:01:44,520 --> 00:01:46,203
of the special CLS token.
43
00:01:47,100 --> 00:01:49,650
Alternatively, we can average the token embeddings
44
00:01:49,650 --> 00:01:52,500
which is called mean pooling, and this is what we do here.
45
00:01:53,370 --> 00:01:55,800
With mean pooling, the only thing we need to make sure
46
00:01:55,800 --> 00:01:58,410
is that we don't include the padding tokens in the average,
47
00:01:58,410 --> 00:02:01,860
which is why you can see the attention_mask being used here.
48
00:02:01,860 --> 00:02:05,100
This gives us a 384-dimensional vector for each sentence,
49
00:02:05,100 --> 00:02:06,600
which is exactly what we want.
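
Here is a sketch of mean pooling along the lines the video describes, continuing from the earlier snippet (outputs and inputs come from there). The attention mask zeroes out the padding positions so they contribute nothing to the average:

```python
import torch

def mean_pooling(model_output, attention_mask):
    # token_embeddings: [batch_size, seq_len, hidden_size]
    token_embeddings = model_output.last_hidden_state
    # Broadcast the mask over the hidden dimension so padding tokens are zeroed
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, dim=1)
    # Divide by the number of real (non-padding) tokens in each sentence
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts  # [batch_size, hidden_size]

sentence_embeddings = mean_pooling(outputs, inputs["attention_mask"])
print(sentence_embeddings.shape)  # e.g. torch.Size([3, 384])
```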
50
00:02:07,920 --> 00:02:09,810
And once we have our sentence embeddings,
51
00:02:09,810 --> 00:02:11,730
we can compute the cosine similarity
52
00:02:11,730 --> 00:02:13,113
for each pair of vectors.
53
00:02:13,993 --> 00:02:16,350
In this example, we use the cosine_similarity function from scikit-learn
54
00:02:16,350 --> 00:02:19,140
and you can see that the sentence "I took my dog for a walk"
55
00:02:19,140 --> 00:02:22,140
indeed has a strong overlap with "I took my cat for a walk".
56
00:02:22,140 --> 00:02:23,240
Hooray! We've done it.
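
A sketch of that comparison, assuming the scikit-learn function in question is sklearn.metrics.pairwise.cosine_similarity and reusing sentence_embeddings from the pooling sketch above:

```python
from sklearn.metrics.pairwise import cosine_similarity

# scikit-learn works on NumPy arrays, so detach from the autograd graph first
emb = sentence_embeddings.detach().numpy()
scores = cosine_similarity(emb)  # 3 x 3 matrix of pairwise similarities
print(scores.round(2))
# The dog/cat walking sentences should score highest against each other
```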
57
00:02:25,110 --> 00:02:27,180
We can actually take this idea one step further
58
00:02:27,180 --> 00:02:29,220
by comparing the similarity between a question
59
00:02:29,220 --> 00:02:31,170
and a corpus of documents.
60
00:02:31,170 --> 00:02:33,810
For example, suppose we embed every post
61
00:02:33,810 --> 00:02:35,430
in the Hugging Face forums.
62
00:02:35,430 --> 00:02:37,800
We can then ask a question, embed it,
63
00:02:37,800 --> 00:02:40,590
and check which forum posts are most similar.
64
00:02:40,590 --> 00:02:42,750
This process is often called semantic search,
65
00:02:42,750 --> 00:02:45,423
because it allows us to compare queries with context.
66
00:02:47,040 --> 00:02:48,450
Creating a semantic search engine
67
00:02:48,450 --> 00:02:51,030
is actually quite simple with the datasets library.
68
00:02:51,030 --> 00:02:53,340
First we need to embed all the documents.
69
00:02:53,340 --> 00:02:56,070
And in this example, we take a small sample
70
00:02:56,070 --> 00:02:57,780
from the SQuAD dataset and apply
71
00:02:57,780 --> 00:03:00,180
the same embedding logic as before.
72
00:03:00,180 --> 00:03:02,280
This gives us a new column called embeddings,
73
00:03:02,280 --> 00:03:04,530
which stores the embeddings of every passage.
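
A sketch of this embedding step with the datasets library, reusing the tokenizer, model, and mean_pooling defined in the earlier snippets; the split and sample size are assumptions, since the video doesn't show them:

```python
from datasets import load_dataset
import torch

# A small slice of SQuAD (hypothetical size; the video just says "a small sample")
squad = load_dataset("squad", split="validation[:100]")

def embed(batch):
    enc = tokenizer(batch["context"], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Same mean pooling as before, stored as a new "embeddings" column
    return {"embeddings": mean_pooling(out, enc["attention_mask"]).numpy()}

squad = squad.map(embed, batched=True, batch_size=16)
```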
74
00:03:05,880 --> 00:03:07,260
Once we have our embeddings,
75
00:03:07,260 --> 00:03:10,200
we need a way to find nearest neighbors for a query.
76
00:03:10,200 --> 00:03:13,170
The datasets library provides a special FAISS index
77
00:03:13,170 --> 00:03:16,080
which allows you to quickly compare embedding vectors.
78
00:03:16,080 --> 00:03:19,950
So we add the FAISS index, embed a question, and voila,
79
00:03:19,950 --> 00:03:21,870
we've now found the 3 most similar articles
80
00:03:21,870 --> 00:03:23,320
which might contain the answer.
81
00:03:25,182 --> 00:03:27,849
(upbeat music)