39_memory-mapping-&-streaming.srt
1
00:00:00,511 --> 00:00:01,784
(air whooshing)
2
00:00:01,784 --> 00:00:02,964
(logo popping)
3
00:00:02,964 --> 00:00:05,640
(metal sliding)
4
00:00:05,640 --> 00:00:07,203
- Memory mapping and streaming.
5
00:00:08,040 --> 00:00:09,180
In this video, we'll take a look
6
00:00:09,180 --> 00:00:11,520
at two core features of the Datasets library
7
00:00:11,520 --> 00:00:14,220
that allow you to load and process huge datasets
8
00:00:14,220 --> 00:00:16,263
without blowing up your laptop's CPU.
9
00:00:18,300 --> 00:00:20,280
Nowadays, it's not uncommon to find yourself
10
00:00:20,280 --> 00:00:22,950
working with multi-GB sized datasets,
11
00:00:22,950 --> 00:00:24,420
especially if you're planning to pretrain
12
00:00:24,420 --> 00:00:28,110
a transformer like BERT or GPT-2 from scratch.
13
00:00:28,110 --> 00:00:31,260
In these cases, even loading the data can be a challenge.
14
00:00:31,260 --> 00:00:34,680
For example, the c4 corpus used to pretrain T5
15
00:00:34,680 --> 00:00:36,903
consists of over two terabytes of data.
16
00:00:38,400 --> 00:00:40,050
To handle these large datasets,
17
00:00:40,050 --> 00:00:42,990
the Datasets library is built on two core features:
18
00:00:42,990 --> 00:00:46,350
the Apache Arrow format and a streaming API.
19
00:00:46,350 --> 00:00:49,110
Arrow is designed for high-performance data processing
20
00:00:49,110 --> 00:00:51,360
and represents each table-like dataset
21
00:00:51,360 --> 00:00:52,773
with a column-based format.
22
00:00:53,730 --> 00:00:56,130
As you can see in this example, column-based formats
23
00:00:56,130 --> 00:00:59,280
group the elements of a table in consecutive blocks of RAM
24
00:00:59,280 --> 00:01:01,563
and this unlocks fast access and processing.
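For readers who want to see the column layout concretely, here is a minimal pyarrow sketch; the table contents are invented for illustration:

```python
import pyarrow as pa

# Arrow stores each column in its own contiguous buffer,
# rather than interleaving values row by row.
table = pa.table({"text": ["hello", "world"], "label": [0, 1]})

# Scanning one column therefore touches a single contiguous block of memory.
print(table.column("label"))
```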
25
00:01:02,760 --> 00:01:05,550
Arrow is great at processing data at any scale
26
00:01:05,550 --> 00:01:07,110
but some datasets are so large
27
00:01:07,110 --> 00:01:09,600
that you can't even fit them on your hard disk.
28
00:01:09,600 --> 00:01:11,730
So for these cases, the Datasets library provides
29
00:01:11,730 --> 00:01:14,820
a streaming API that allows you to progressively download
30
00:01:14,820 --> 00:01:17,700
the raw data one element at a time.
31
00:01:17,700 --> 00:01:20,430
The result is a special object called an IterableDataset
32
00:01:20,430 --> 00:01:22,180
that we'll see in more detail soon.
33
00:01:23,700 --> 00:01:26,670
Let's start by looking at why Arrow is so powerful.
34
00:01:26,670 --> 00:01:28,860
The first feature is that it treats every dataset
35
00:01:28,860 --> 00:01:30,153
as a memory-mapped file.
36
00:01:31,020 --> 00:01:32,430
Now, memory mapping is a mechanism
37
00:01:32,430 --> 00:01:35,400
that maps a portion of a file, or an entire file on disk,
38
00:01:35,400 --> 00:01:37,410
to a chunk of virtual memory.
39
00:01:37,410 --> 00:01:38,520
This allows applications
40
00:01:38,520 --> 00:01:41,280
to access segments of an extremely large file
41
00:01:41,280 --> 00:01:44,080
without having to read the whole file into memory first.
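A minimal sketch of the same idea with pyarrow (the file name is hypothetical):

```python
import pyarrow as pa

# Map the file into virtual memory; nothing is read from disk yet.
mmap = pa.memory_map("dataset.arrow")  # hypothetical Arrow file

# Record batches reference the mapped pages, which the OS loads
# lazily, so RAM usage stays far below the file size.
table = pa.ipc.open_stream(mmap).read_all()
```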
42
00:01:45,150 --> 00:01:48,120
Another cool feature of Arrow's memory mapping capabilities
43
00:01:48,120 --> 00:01:49,860
is that it allows multiple processes
44
00:01:49,860 --> 00:01:51,840
to work with the same large dataset
45
00:01:51,840 --> 00:01:54,333
without moving it or copying it in any way.
46
00:01:55,680 --> 00:01:57,570
This zero-copy feature of Arrow
47
00:01:57,570 --> 00:02:00,600
makes it extremely fast for iterating over a dataset.
48
00:02:00,600 --> 00:02:02,640
In this example, you can see that we iterate
49
00:02:02,640 --> 00:02:05,160
over 15 million rows in about a minute
50
00:02:05,160 --> 00:02:06,780
just using a standard laptop.
51
00:02:06,780 --> 00:02:08,080
That's not too bad at all.
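A sketch of that kind of benchmark (the dataset name and timing code are assumptions, in the spirit of the example):

```python
import timeit
from datasets import load_dataset

# Any large memory-mapped dataset works here; this one is an assumption.
dataset = load_dataset("wikipedia", "20220301.en", split="train")

def scan():
    # Iteration reads straight from the memory-mapped Arrow file.
    for _ in dataset:
        pass

print(f"{len(dataset)} rows scanned in {timeit.timeit(scan, number=1):.0f}s")
```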
52
00:02:09,750 --> 00:02:12,660
Let's now take a look at how we can stream a large dataset.
53
00:02:12,660 --> 00:02:14,520
The only change you need to make is to set
54
00:02:14,520 --> 00:02:17,910
the streaming=True argument in the load_dataset() function.
55
00:02:17,910 --> 00:02:20,580
This will return a special IterableDataset object
56
00:02:20,580 --> 00:02:22,260
which is a bit different from the Dataset objects
57
00:02:22,260 --> 00:02:24,330
we've seen in other videos.
58
00:02:24,330 --> 00:02:25,980
This object is an iterable,
59
00:02:25,980 --> 00:02:28,530
which means we can't index it to access elements,
60
00:02:28,530 --> 00:02:30,180
but instead we iterate over it
61
00:02:30,180 --> 00:02:32,850
using the iter() and next() functions.
62
00:02:32,850 --> 00:02:34,050
This will download and access
63
00:02:34,050 --> 00:02:35,850
a single example from the dataset,
64
00:02:35,850 --> 00:02:37,410
which means you can progressively iterate
65
00:02:37,410 --> 00:02:40,360
through a huge dataset without having to download it first.
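A minimal sketch, using the C4 corpus mentioned earlier (on the current Hub it lives under allenai/c4; the config name is an assumption):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset; no data is downloaded up front.
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Each call downloads and yields exactly one raw example.
print(next(iter(streamed)))
```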
66
00:02:42,150 --> 00:02:43,590
Tokenizing text with the map() method
67
00:02:43,590 --> 00:02:45,660
also works in a similar way.
68
00:02:45,660 --> 00:02:47,160
We first stream the dataset
69
00:02:47,160 --> 00:02:49,830
and then apply the map() method with the tokenizer.
70
00:02:49,830 --> 00:02:53,283
To get the first tokenized example, we apply iter() and next().
71
00:02:54,750 --> 00:02:57,210
The main difference with an IterableDataset is that
72
00:02:57,210 --> 00:02:59,970
instead of using the select() method to return examples,
73
00:02:59,970 --> 00:03:01,530
we use the take() and skip() methods
74
00:03:01,530 --> 00:03:03,573
because we can't index into the dataset.
75
00:03:04,470 --> 00:03:05,460
The take() method returns
76
00:03:05,460 --> 00:03:07,500
the first N examples in the dataset,
77
00:03:07,500 --> 00:03:09,270
while skip(), as you can imagine,
78
00:03:09,270 --> 00:03:12,480
skips the first N and returns the rest.
79
00:03:12,480 --> 00:03:15,300
You can see examples of both of these methods in action
80
00:03:15,300 --> 00:03:16,710
where we create a validation set
81
00:03:16,710 --> 00:03:18,660
from the first 1000 examples
82
00:03:18,660 --> 00:03:21,010
and then skip those to create the training set.
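A sketch of that split on a streamed dataset (dataset identifiers as assumed above):

```python
from datasets import load_dataset

streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)

# take() yields the first N streamed examples; skip() yields everything after.
validation_dataset = streamed.take(1000)
train_dataset = streamed.skip(1000)
```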
83
00:03:23,012 --> 00:03:25,762
(air whooshing)