
Crazy idea: parallel iterator on top of (buffered) files #529

Open
vorner opened this issue Feb 12, 2018 · 4 comments


vorner commented Feb 12, 2018

Hello

I was wondering about something like this: if I have an array (vec/slice/…), rayon is a pretty cool abstraction for writing parallel algorithms. But for that, the data must fit into memory. What about data that doesn't?

I haven't thought through all the details, but what if I had a file (or some directory structure, to let it create more temporary files as it goes) with fixed-sized records and a function to serialize/deserialize a record? Would it be possible to build a parallel iterator on top of that, considering the file can seek? I guess the parallel iterators operate in chunks, so whenever one would operate on a chunk, that chunk could be read into an in-memory buffer, or something along those lines.

I guess this would make sense in its own sub-crate, though.

This is something slightly different from #46 ‒ that would probably need to read the whole thing in linearly, because ordinary iterators don't seek ‒ but files can (and it is reasonably fast as long as every seek is followed by a long-enough read or write).

(I'm even thinking about files scattered over multiple computers, but I have no idea if that could work at least in theory)


djc commented Jun 19, 2020

Here's a file chunker based on line separators which is used with Rayon. Maybe it would make sense to move the generic part of this code (which is most of it) into Rayon?

https://github.com/djc/topfew-rs/blob/master/src/chunks.rs

I have a bunch of other use cases where something like this would be helpful.
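A minimal std-only sketch of the generic part of such a chunker (the name `chunk_ranges` and the slice-based API are hypothetical, not taken from topfew-rs): split a byte buffer into roughly equal pieces whose boundaries land on line separators, so each piece can be handed to its own parallel task.

```rust
// Split newline-delimited data into ~equal chunks whose boundaries fall
// on line breaks, so no line is torn across two workers.
fn chunk_ranges(data: &[u8], n_chunks: usize) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let target = (data.len() / n_chunks.max(1)).max(1);
    let mut start = 0;
    while start < data.len() {
        // Aim for `target` bytes, then extend to the next newline.
        let mut end = (start + target).min(data.len());
        while end < data.len() && data[end - 1] != b'\n' {
            end += 1;
        }
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    let data = b"alpha\nbeta\ngamma\ndelta\n";
    for (start, end) in chunk_ranges(data, 3) {
        // Each range starts right after a newline and ends on one.
        print!("chunk: {:?}\n", std::str::from_utf8(&data[start..end]).unwrap());
    }
}
```

In a real program each `(start, end)` range would then be mapped over with Rayon (e.g. `ranges.par_iter().map(..)`), with each worker parsing only its own slice.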


vorner commented Jun 24, 2020

Hello

I'm not sure what exact ideas I had at the time (it's been a while). Now there's par_bridge to take a non-parallel iterator and make it parallel, but it adds overhead with a mutex, and the actual reading happens single-threaded.
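To illustrate where the overhead comes from, here is a std-only sketch of what a mutex-based bridge amounts to (all names hypothetical, plain threads instead of Rayon): every worker has to take the lock to pull the next item, so fetching items is serialized even though the per-item work runs in parallel.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Workers share one iterator behind a Mutex; each item is pulled under
// the lock, so the "reading" side is effectively single-threaded.
fn bridged_sum(items: Vec<u64>, workers: usize) -> u64 {
    let source = Arc::new(Mutex::new(items.into_iter()));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let source = Arc::clone(&source);
            thread::spawn(move || {
                let mut local = 0u64;
                loop {
                    // The lock is the bottleneck: one worker pulls at a time.
                    let next = source.lock().unwrap().next();
                    match next {
                        Some(x) => local += x, // "processing" happens unlocked
                        None => return local,
                    }
                }
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let total = bridged_sum((0..1000).collect(), 4);
    println!("sum = {total}"); // 0 + 1 + … + 999 = 499500
}
```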

Now, let's say I have a TB-sized file of fixed-sized items (or some other way to know where things are in it) and I want to process it as fast as possible. It would probably be faster if each thread took its own buffer, the file were split into parts, and each thread processed its part on its own, buffering data in as needed, to avoid fighting over seeking in the file.

One could probably do it by „virtually" generating a bunch of, let's say, 100 MB chunks, putting them into a vector and doing par_iter on these ‒ where a chunk would be just a pointer into the file. But that seems like more work than needed; one would like something like ParFile::open(path, record_size, decode_fn, encode_fn).map(..).fold(..), or even ParFile::open_rw(path, record_size, decode_fn, encode_fn).par_sort_with(..).
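A std-only sketch of that "virtual chunks" idea (everything here, including `chunk_bounds` and the record format, is hypothetical, and plain threads stand in for Rayon): the file is never read as a whole; each worker gets an (offset, length) pair aligned to the record size, opens its own handle, seeks, and decodes just its own slice.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};
use std::thread;

const RECORD_SIZE: u64 = 8; // fixed-size records: one little-endian u64 each

// Describe the file as a list of record-aligned (offset, byte_len) chunks.
fn chunk_bounds(file_len: u64, chunk_records: u64) -> Vec<(u64, u64)> {
    let chunk_bytes = chunk_records * RECORD_SIZE;
    (0..file_len)
        .step_by(chunk_bytes as usize)
        .map(|off| (off, chunk_bytes.min(file_len - off)))
        .collect()
}

// Each worker opens its own handle, seeks, reads its slice, decodes records.
fn sum_chunk(path: &str, offset: u64, len: u64) -> u64 {
    let mut f = File::open(path).expect("open");
    f.seek(SeekFrom::Start(offset)).expect("seek");
    let mut buf = vec![0u8; len as usize];
    f.read_exact(&mut buf).expect("read");
    buf.chunks_exact(RECORD_SIZE as usize)
        .map(|r| u64::from_le_bytes(r.try_into().unwrap()))
        .sum()
}

fn main() {
    // Write 1000 records (the values 0..1000) to a temporary file.
    let path = std::env::temp_dir().join("parfile_demo.bin");
    let path = path.to_str().unwrap().to_string();
    let mut f = File::create(&path).expect("create");
    for i in 0..1000u64 {
        f.write_all(&i.to_le_bytes()).expect("write");
    }
    drop(f);

    // One thread per 128-record chunk; each reads only its own region.
    let handles: Vec<_> = chunk_bounds(1000 * RECORD_SIZE, 128)
        .into_iter()
        .map(|(off, len)| {
            let p = path.clone();
            thread::spawn(move || sum_chunk(&p, off, len))
        })
        .collect();
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("sum = {total}"); // 0 + 1 + … + 999 = 499500
}
```

With Rayon, the `chunk_bounds` vector is exactly the thing one would call `par_iter()` on; the hypothetical ParFile API above would just hide that plumbing.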

I guess what you link could be used for the manual part if you needed non-fixed-sized records (which is useful), but it's in a way a different thing from what I was thinking about.


v1gnesh commented Jun 24, 2020

@vorner Yaaaaass! This would be great for working with binary files.
Additionally, in the example snippet you've shown ‒ ParFile::open(path, record_size, decode_fn, encode_fn).map(..).fold(..) ‒ having the option of record_size being a Vec would also cater to cases where the file isn't full of same-sized records.
Oh, this would be so useful!

EDIT: Adding relevant link - https://users.rust-lang.org/t/iterate-over-varying-size-chunks-of-binary-data/40708
EDIT2: Not 100% sure, but I think this may be useful for parallelizing I/O - https://github.com/stjepang/blocking ... until a time when one of RAPIDS's repositories (I think it was a part of cuDF) for parallelizing I/O operations adds Rust support.

@Stock84-dev

You could use memory-mapped files with the memmap2 crate.
You can find an example in #885.
