subtitles/zh-CN/20_hugging-face-datasets-overview-(tensorflow).srt

1
00:00:00,170 --> 00:00:03,087
（屏幕呼啸）
(screen whooshing)

2
00:00:05,371 --> 00:00:09,690
- Hugging Face Datasets 库: 快速概览。
- The Hugging Face Datasets library: A Quick overview.

3
00:00:09,690 --> 00:00:10,917
Hugging Face Datasets 库
The Hugging Face Datasets library

4
00:00:10,917 --> 00:00:12,870
是一个库, 提供 API 
is a library that provides an API

5
00:00:12,870 --> 00:00:15,150
来快速下载许多公共数据集
to quickly download many public datasets

6
00:00:15,150 --> 00:00:16,200
并对它们进行预处理。
and pre-process them.

7
00:00:17,070 --> 00:00:19,473
在本视频中，我们将探索如何做到这一点。
In this video we will explore to do that.

8
00:00:20,520 --> 00:00:23,730
下载部分使用 load_dataset 函数很容易，
The downloading part is easy with the load_dataset function,

9
00:00:23,730 --> 00:00:26,010
你可以直接下载并缓存数据集
you can directly download and cache a dataset

10
00:00:26,010 --> 00:00:28,023
来自其在数据集 Hub 的 ID 。
from its identifier on the Dataset hub.

11
00:00:29,160 --> 00:00:33,690
这里我们从 GLUE benchmark 中获取 MRPC 数据集，
Here we fetch the MRPC dataset from the GLUE benchmark,

12
00:00:33,690 --> 00:00:36,030
是一个包含句子对的数据集
is a dataset containing pairs of sentences

13
00:00:36,030 --> 00:00:38,380
任务是确定释义。
where the task is to determine the paraphrases.

14
00:00:39,720 --> 00:00:42,120
load_dataset 函数返回的对象
The object returned by the load_dataset function

15
00:00:42,120 --> 00:00:45,090
是一个 DatasetDict，它是一种字典
is a DatasetDict, which is a sort of dictionary

16
00:00:45,090 --> 00:00:46,940
包含我们数据集的每个拆分。
containing each split of our dataset.

17
00:00:48,600 --> 00:00:51,780
我们可以通过使用其名称进行索引来访问每个拆分。
We can access each split by indexing with its name.

18
00:00:51,780 --> 00:00:54,540
这个拆分然后是 Dataset 类的一个实例，
This split is then an instance of the Dataset class,

19
00:00:54,540 --> 00:00:57,423
有列: sentence1，sentence2，
with columns, here sentence1, sentence2,

20
00:00:58,350 --> 00:01:00,813
label 和 idx，以及行。
label and idx, and rows.

21
00:01:02,160 --> 00:01:05,220
我们可以访问给定的元素, 通过索引。
We can access a given element by its index.

22
00:01:05,220 --> 00:01:08,220
Hugging Face Datasets 库的神奇之处
The amazing thing about the Hugging Face Datasets library

23
00:01:08,220 --> 00:01:11,700
是所有内容都使用 Apache Arrow 保存到磁盘，
is that everything is saved to disk using Apache Arrow,

24
00:01:11,700 --> 00:01:14,460
这意味着即使你的数据集很大
which means that even if your dataset is huge

25
00:01:14,460 --> 00:01:16,219
你不会内存溢出，
you won't get out of RAM,

26
00:01:16,219 --> 00:01:18,769
只有你请求的元素才会加载到内存中。
only the elements you request are loaded in memory.

27
00:01:19,920 --> 00:01:24,510
访问数据集的一个切片就像访问一个元素一样简单。
Accessing a slice of your dataset is as easy as one element.

28
00:01:24,510 --> 00:01:27,150
结果是一个包含值列表的字典
The result is then a dictionary with list of values

29
00:01:27,150 --> 00:01:30,630
对于每个键，这里是标签列表，
for each keys, here the list of labels,

30
00:01:30,630 --> 00:01:32,190
第一句话列表，
the list of first sentences,

31
00:01:32,190 --> 00:01:33,840
和第二句话的列表。
and the list of second sentences.

32
00:01:35,100 --> 00:01:37,080
数据集的特征属性
The features attribute of a Dataset

33
00:01:37,080 --> 00:01:39,840
为我们提供有关其列的更多信息。
gives us more information about its columns.

34
00:01:39,840 --> 00:01:42,150
特别是，我们可以在这里看到它给了我们
In particular, we can see here it gives us

35
00:01:42,150 --> 00:01:43,980
整数之间的对应关系
a correspondence between the integers

36
00:01:43,980 --> 00:01:46,110
和标签的名称。
and names for the labels.

37
00:01:46,110 --> 00:01:49,623
0 代表不相等，1 代表相等。
0 stands for not equivalent and 1 for equivalent.

38
00:01:51,630 --> 00:01:54,090
要预处理数据集的所有元素，
To pre-process all the elements of our dataset,

39
00:01:54,090 --> 00:01:55,980
我们需要将它们 token 化。
we need to tokenize them.

40
00:01:55,980 --> 00:01:58,470
看看视频 “Pre-process sentence pairs”
Have a look at the video "Pre-process sentence pairs"

41
00:01:58,470 --> 00:02:01,800
复习一下，但你只需要发送这两个句子
for a refresher, but you just have to send the two sentences

42
00:02:01,800 --> 00:02:04,833
给 tokenizer, 带有一些额外的关键字参数。
to the tokenizer with some additional keyword arguments.

43
00:02:05,880 --> 00:02:09,300
这里我们表示最大长度为 128
Here we indicate a maximum length of 128

44
00:02:09,300 --> 00:02:11,460
和垫全输入短于这个长度的，
and pad inputs shorter than this length,

45
00:02:11,460 --> 00:02:13,060
截断更长的输入。
truncate inputs that are longer.

46
00:02:14,040 --> 00:02:16,170
我们把所有这些都放在一个 tokenize_function 中
We put all of this in a tokenize_function

47
00:02:16,170 --> 00:02:18,510
我们可以直接应用于所有拆分
that we can directly apply to all the splits

48
00:02:18,510 --> 00:02:20,260
在我们的数据集中使用 map 方法。
in our dataset with the map method.

49
00:02:21,210 --> 00:02:24,120
只要函数返回一个类似字典的对象，
As long as the function returns a dictionary-like object,

50
00:02:24,120 --> 00:02:26,580
map 方法将根据需要添加新列
the map method will add new columns as needed

51
00:02:26,580 --> 00:02:28,113
或更新现有的。
or update existing ones.

52
00:02:30,060 --> 00:02:32,520
为加快预处理并利用
To speed up pre-processing and take advantage

53
00:02:32,520 --> 00:02:35,130
我们的分词器由 Rust 支持的事实
of the fact our tokenizer is backed by Rust

54
00:02:35,130 --> 00:02:38,160
感谢 Hugging Face Tokenizers 库，
thanks to the Hugging Face Tokenizers library,

55
00:02:38,160 --> 00:02:40,590
我们可以同时处理多个元素
we can process several elements at the same time

56
00:02:40,590 --> 00:02:43,923
在我们的 tokenize 函数中，使用 batched=True 参数。
in our tokenize function, using the batched=True argument.

57
00:02:45,300 --> 00:02:46,980
由于 tokenizer 可以处理列表
Since the tokenizer can handle a list

58
00:02:46,980 --> 00:02:50,280
第一句话或第二句话, tokenize_function
of first or second sentences, the tokenize_function

59
00:02:50,280 --> 00:02:52,740
不需要为此改变。
does not need to change for this.

60
00:02:52,740 --> 00:02:55,410
你还可以将多进程与 map 方法一起使用，
You can also use multiprocessing with the map method,

61
00:02:55,410 --> 00:02:57,460
查看下面链接的文档。
check out its documentation linked below.

62
00:02:58,740 --> 00:03:02,130
完成后，我们几乎准备好训练了，
Once this is done, we are almost ready for training,

63
00:03:02,130 --> 00:03:04,020
我们只是删除我们不再需要的列
we just remove the columns we don't need anymore

64
00:03:04,020 --> 00:03:06,120
使用 remove_columns 方法，
with the remove_columns method,

65
00:03:06,120 --> 00:03:08,580
将 label 重命名为 labels ，因为模型是
rename label to labels, since the models

66
00:03:08,580 --> 00:03:11,430
 transformers 库中的，期望如此
from the transformers library expect that,

67
00:03:11,430 --> 00:03:14,040
并将输出格式设置为我们想要的后端，
and set the output format to our desired backend,

68
00:03:14,040 --> 00:03:15,893
torch、tensorflow 或 numpy。
torch, tensorflow or numpy.

69
00:03:16,800 --> 00:03:19,050
如果需要，我们还可以生成一个简短的示例
If needed, we can also generate a short sample

70
00:03:19,050 --> 00:03:21,377
使用 select 方法的数据集。
of a dataset using the select method.

71
00:03:22,817 --> 00:03:25,734
（屏幕呼啸）
(screen whooshing)