Crazy idea: parallel iterator on top of (buffered) files #529
Comments
Here's a file chunker based on line separators which is used with Rayon. Maybe it would make sense to move the generic part of this code (which is most of it) into Rayon? https://github.com/djc/topfew-rs/blob/master/src/chunks.rs I have a bunch of other use cases where something like this would be helpful.
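For context, the core idea behind such a chunker could look like the following. This is a hypothetical sketch, not the linked code: seek to evenly spaced offsets and extend each boundary forward to the next newline, so no line is ever split between chunks.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Seek, SeekFrom};

/// Byte range of one chunk; boundaries are aligned to `\n`.
struct Chunk {
    start: u64,
    end: u64,
}

fn chunk_boundaries(path: &str, n_chunks: u64) -> io::Result<Vec<Chunk>> {
    let len = File::open(path)?.metadata()?.len();
    let approx = len / n_chunks.max(1);
    let mut reader = BufReader::new(File::open(path)?);
    let mut starts = vec![0u64];
    for i in 1..n_chunks {
        // Seek near the ideal split point, then scan to the next newline.
        reader.seek(SeekFrom::Start(i * approx))?;
        let mut scrap = Vec::new();
        reader.read_until(b'\n', &mut scrap)?;
        let pos = reader.stream_position()?;
        if pos < len && pos > *starts.last().unwrap() {
            starts.push(pos);
        }
    }
    starts.push(len);
    Ok(starts
        .windows(2)
        .map(|w| Chunk { start: w[0], end: w[1] })
        .collect())
}
```

Each resulting `Chunk` is just an offset range, so the vector of chunks can be handed to Rayon and each worker reads its own range independently.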
Hello

I'm not sure about the exact ideas I had at the time (it's been a while). There is now the bridge that takes a non-parallel iterator and makes it parallel, but it adds overhead with a mutex, and the actual reading happens single-threaded.

Now, let's say I have a TB-sized file of fixed-size items (or some other way to know where things are in it) and I want to process it as fast as possible. It would probably be faster if each thread took its own buffer, the threads split the file into parts, and each one processed its part independently, buffering data in as needed to avoid fighting over seeking in the file.

One can probably do this today by "virtually" generating a bunch of, let's say, 100MB chunks, putting them into a vector and doing par_iter on these ‒ where each chunk is just an offset into the file (a sketch of this approach follows below). But that seems like more work than should be needed; one would like something more direct built into the library.

I guess what you link could be used for the manual part if you needed non-fixed-size records (which is useful), but it is in a way a different thing than what I was thinking about.
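A minimal sketch of the chunk-vector workaround described above, assuming a Unix target (it uses `read_exact_at` from `std::os::unix::fs::FileExt` for positional reads, so threads never contend over a shared seek cursor; Windows has the analogous `seek_read`). The chunk size and processing step are placeholders.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at: positional read, no shared cursor
use rayon::prelude::*;

const CHUNK: u64 = 100 * 1024 * 1024; // the "100MB chunks" from the comment above

fn process_in_parallel(path: &str) -> io::Result<()> {
    let file = File::open(path)?;
    let len = file.metadata()?.len();
    // "Virtually" generate the chunks: each is just an (offset, size) pair.
    let chunks: Vec<(u64, usize)> = (0..len)
        .step_by(CHUNK as usize)
        .map(|off| (off, CHUNK.min(len - off) as usize))
        .collect();
    chunks.par_iter().try_for_each(|&(off, size)| {
        let mut buf = vec![0u8; size];
        // Each Rayon worker buffers its own part of the file.
        file.read_exact_at(&mut buf, off)?;
        // ... process `buf` here (hypothetical) ...
        Ok::<(), io::Error>(())
    })
}
```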
@vorner Yaaaaass! This would be great for working with binary files. EDIT: Adding relevant link - https://users.rust-lang.org/t/iterate-over-varying-size-chunks-of-binary-data/40708
You could use memory-mapped files with the memmap2 crate.
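This suggestion composes nicely with Rayon's existing slice support, since a mapping dereferences to `&[u8]`. A minimal sketch, assuming fixed-size records and the `memmap2` and `rayon` crates in Cargo.toml (`RECORD_SIZE` and the file name are placeholders):

```rust
use std::fs::File;
use memmap2::Mmap;
use rayon::prelude::*;

const RECORD_SIZE: usize = 64; // hypothetical fixed record size

fn main() -> std::io::Result<()> {
    let file = File::open("data.bin")?;
    // Safety: the file must not be truncated or modified while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    // The mapping derefs to &[u8], so Rayon can split it like any slice.
    mmap.par_chunks(RECORD_SIZE).for_each(|record| {
        // ... deserialize and process `record` (hypothetical) ...
        let _ = record;
    });
    Ok(())
}
```

The operating system's page cache then handles the buffering, which sidesteps the seek-contention problem entirely, at the cost of the usual mmap caveats (the `unsafe` contract above, and platform limits on very large mappings on 32-bit targets).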
Hello
I was wondering something like this: if I have an array (vec/slice/…), rayon is a pretty cool abstraction for writing parallel algorithms. But for that, the data must fit into memory. What about data that doesn't fit?
I haven't thought through all the details, but what if I had a file (or some directory structure, to let it create more temporary files as it goes) with fixed-size records and a function to serialize/deserialize a record? Would it be possible to build a parallel iterator on top of that, considering the file can seek? I guess the parallel iterators operate in chunks, so whenever one would operate on a chunk, that chunk could be read into a buffer in memory, or something along those lines (a rough sketch follows after this comment).
I guess this would make sense in its own sub-crate, though.
This is something slightly different from #46 ‒ that would probably need to read the whole thing in linearly, because ordinary iterators can't seek ‒ but files can (and seeking is reasonably fast, provided every seek is followed by a long-enough read or write).
(I'm even thinking about files scattered over multiple computers, but I have no idea whether that could work, even in theory.)
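To make the proposal concrete, here is a minimal sketch of what such a parallel iterator could reduce to, assuming Unix positional I/O and a hypothetical fixed-size record format. For clarity it reads one record per call; a real implementation would operate on whole chunks, as suggested above, to amortize the syscall cost.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use rayon::prelude::*;

const REC: u64 = 128; // hypothetical fixed record size

// Hypothetical deserializer: interpret the first 8 bytes as a little-endian u64.
fn deserialize(bytes: &[u8]) -> u64 {
    u64::from_le_bytes(bytes[..8].try_into().unwrap())
}

fn sum_records(file: &File) -> io::Result<u64> {
    let n = file.metadata()?.len() / REC;
    (0..n)
        .into_par_iter()
        .map(|i| {
            let mut buf = [0u8; REC as usize];
            // Positional read: no shared cursor, so threads don't fight over seek.
            file.read_exact_at(&mut buf, i * REC)?;
            Ok(deserialize(&buf))
        })
        .try_reduce(|| 0u64, |a, b| Ok(a + b))
}
```

Because each record's position is computable from its index, the range `0..n` splits exactly the way Rayon already splits slices, which is why seekable fixed-size-record files fit the parallel-iterator model so naturally.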