
Crazy idea: parallel iterator on top of (buffered) files #529

Open
vorner opened this issue Feb 12, 2018 · 4 comments


vorner commented Feb 12, 2018

Hello

I was wondering about something like this: if I have an array (vec/slice/…), rayon is a pretty cool abstraction for writing parallel algorithms. But for that, the data must fit into memory. What about data that doesn't?

I haven't thought through all the details, but what if I had a file (or some directory structure, to let it create more temporary files as it goes) with fixed-sized records and a function to serialize/deserialize a record? Would it be possible to build a parallel iterator on top of that, considering the file can seek? I guess the parallel iterators operate in chunks, so whenever one would operate on a chunk, that chunk could be read into an in-memory buffer, or something along those lines.

I guess this would make sense in its own sub-crate, though.

This is something slightly different from #46 ‒ that would probably need to read the whole thing in linearly, because ordinary iterators don't seek ‒ but files can (and it is reasonably fast as long as every seek is followed by a long-enough read or write).

(I'm even thinking about files scattered over multiple computers, but I have no idea if that could work at least in theory)


djc commented Jun 19, 2020

Here's a file chunker based on line separators which is used with Rayon. Maybe it would make sense to move the generic part of this code (which is most of it) into Rayon?

https://github.com/djc/topfew-rs/blob/master/src/chunks.rs

I have a bunch of other use cases where something like this would be helpful.
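A minimal std-only sketch of the generic part of such a chunker (the name `chunk_ranges` and the slice-based API are hypothetical, not taken from topfew-rs): split a byte buffer into roughly equal pieces whose boundaries land on line separators, so each piece can be handed to its own parallel task.

```rust
// Split newline-delimited data into ~equal chunks whose boundaries fall
// on line breaks, so no line is torn across two workers.
fn chunk_ranges(data: &[u8], n_chunks: usize) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let target = (data.len() / n_chunks.max(1)).max(1);
    let mut start = 0;
    while start < data.len() {
        // Aim for `target` bytes, then extend to the next newline.
        let mut end = (start + target).min(data.len());
        while end < data.len() && data[end - 1] != b'\n' {
            end += 1;
        }
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    let data = b"alpha\nbeta\ngamma\ndelta\n";
    for (start, end) in chunk_ranges(data, 3) {
        // Each range starts right after a newline and ends on one.
        print!("chunk: {:?}\n", std::str::from_utf8(&data[start..end]).unwrap());
    }
}
```

In a real program each `(start, end)` range would then be mapped over with Rayon (e.g. `ranges.par_iter().map(..)`), with each worker parsing only its own slice.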


vorner commented Jun 24, 2020

Hello

I'm not sure what exact ideas I had at the time (it's been a while). Now there's par_bridge to take a non-parallel iterator and make it parallel, but it adds overhead with a mutex, and the actual reading happens single-threaded.
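To illustrate where the overhead comes from, here is a std-only sketch of what a mutex-based bridge amounts to (all names hypothetical, plain threads instead of Rayon): every worker has to take the lock to pull the next item, so fetching items is serialized even though the per-item work runs in parallel.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Workers share one iterator behind a Mutex; each item is pulled under
// the lock, so the "reading" side is effectively single-threaded.
fn bridged_sum(items: Vec<u64>, workers: usize) -> u64 {
    let source = Arc::new(Mutex::new(items.into_iter()));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let source = Arc::clone(&source);
            thread::spawn(move || {
                let mut local = 0u64;
                loop {
                    // The lock is the bottleneck: one worker pulls at a time.
                    let next = source.lock().unwrap().next();
                    match next {
                        Some(x) => local += x, // "processing" happens unlocked
                        None => return local,
                    }
                }
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let total = bridged_sum((0..1000).collect(), 4);
    println!("sum = {total}"); // 0 + 1 + … + 999 = 499500
}
```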

Now, let's say I have a TB-sized file of fixed-sized items (or some other way to know where things are in it) and I want to process it as fast as possible. It would probably be faster if each thread took its own buffer, the file were split into parts, and each thread processed its part on its own, buffering data in as needed, to avoid fighting over seeking in the file.

One could probably do it by „virtually" generating a bunch of, let's say, 100 MB chunks, putting them into a vector and doing par_iter on these ‒ where a chunk would be just a pointer into the file. But that seems like more work than needed; one would like something like ParFile::open(path, record_size, decode_fn, encode_fn).map(..).fold(..), or even ParFile::open_rw(path, record_size, decode_fn, encode_fn).par_sort_with(..).
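A std-only sketch of that "virtual chunks" idea (everything here, including `chunk_bounds` and the record format, is hypothetical, and plain threads stand in for Rayon): the file is never read as a whole; each worker gets an (offset, length) pair aligned to the record size, opens its own handle, seeks, and decodes just its own slice.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};
use std::thread;

const RECORD_SIZE: u64 = 8; // fixed-size records: one little-endian u64 each

// Describe the file as a list of record-aligned (offset, byte_len) chunks.
fn chunk_bounds(file_len: u64, chunk_records: u64) -> Vec<(u64, u64)> {
    let chunk_bytes = chunk_records * RECORD_SIZE;
    (0..file_len)
        .step_by(chunk_bytes as usize)
        .map(|off| (off, chunk_bytes.min(file_len - off)))
        .collect()
}

// Each worker opens its own handle, seeks, reads its slice, decodes records.
fn sum_chunk(path: &str, offset: u64, len: u64) -> u64 {
    let mut f = File::open(path).expect("open");
    f.seek(SeekFrom::Start(offset)).expect("seek");
    let mut buf = vec![0u8; len as usize];
    f.read_exact(&mut buf).expect("read");
    buf.chunks_exact(RECORD_SIZE as usize)
        .map(|r| u64::from_le_bytes(r.try_into().unwrap()))
        .sum()
}

fn main() {
    // Write 1000 records (the values 0..1000) to a temporary file.
    let path = std::env::temp_dir().join("parfile_demo.bin");
    let path = path.to_str().unwrap().to_string();
    let mut f = File::create(&path).expect("create");
    for i in 0..1000u64 {
        f.write_all(&i.to_le_bytes()).expect("write");
    }
    drop(f);

    // One thread per 128-record chunk; each reads only its own region.
    let handles: Vec<_> = chunk_bounds(1000 * RECORD_SIZE, 128)
        .into_iter()
        .map(|(off, len)| {
            let p = path.clone();
            thread::spawn(move || sum_chunk(&p, off, len))
        })
        .collect();
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("sum = {total}"); // 0 + 1 + … + 999 = 499500
}
```

With Rayon, the `chunk_bounds` vector is exactly the thing one would call `par_iter()` on; the hypothetical ParFile API above would just hide that plumbing.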

I guess what you link could be used for the manual part if you needed non-fixed-sized records (which is useful), but it's in a way a different thing from what I was thinking about.


v1gnesh commented Jun 24, 2020

@vorner Yaaaaass! This would be great for working with binary files.
Additionally, in the example snippet you've shown ‒ ParFile::open(path, record_size, decode_fn, encode_fn).map(..).fold(..) ‒ having the option of record_size being a Vec would also cater to cases where the file isn't full of same-sized records.
Oh, this would be so useful!

EDIT: Adding relevant link - https://users.rust-lang.org/t/iterate-over-varying-size-chunks-of-binary-data/40708
EDIT2: Not 100% sure, but I think this may be useful for parallelizing I/O - https://github.com/stjepang/blocking ... until a time when one of RAPIDS's repositories (I think it was a part of cuDF) for parallelizing I/O operations adds Rust support.

@Stock84-dev

You could use memory-mapped files with the memmap2 crate.
You can find an example in #885.
