optimize filestore verify #5286
Conversation
Force-pushed from c81ec24 to f113ed6
Also:

* Remove sorting from the filestore functions. This feels like something that belongs in the commands.
* Deduplicate command logic.
* Switch to commands-lib 1.0.

License: MIT
Signed-off-by: Steven Allen <[email protected]>
Force-pushed from f113ed6 to 98d204b
This is mostly my code; please give me a chance to review it before accepting. Thank you.
I am not sure I agree with this. What is the reason for this, other than that it "feels like something that belongs in the commands"?
It's how we implement all of our other commands that sort output, as far as I know. That is, it feels like a display concern.
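For what it's worth, here is a minimal sketch of what treating sorting as a display concern could look like: the filestore returns results in whatever order is cheapest to produce, and the command layer sorts just before printing. The `VerifyResult` type and `sortForDisplay` helper are hypothetical names for illustration, not go-ipfs APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// VerifyResult is a hypothetical stand-in for a filestore verify record.
type VerifyResult struct {
	FilePath string
	Offset   uint64
	Status   string
}

// sortForDisplay orders results for human-readable output; the filestore
// itself is free to emit them in whatever order it produces them.
func sortForDisplay(rs []VerifyResult) {
	sort.Slice(rs, func(i, j int) bool {
		if rs[i].FilePath != rs[j].FilePath {
			return rs[i].FilePath < rs[j].FilePath
		}
		return rs[i].Offset < rs[j].Offset
	})
}

func main() {
	rs := []VerifyResult{
		{"b.txt", 0, "ok"},
		{"a.txt", 262144, "ok"},
		{"a.txt", 0, "changed"},
	}
	sortForDisplay(rs)
	for _, r := range rs {
		fmt.Printf("%-7s %s@%d\n", r.Status, r.FilePath, r.Offset)
	}
}
```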
we might as well work ahead a bit

License: MIT
Signed-off-by: Steven Allen <[email protected]>
Simple benchmarks indicate that this is about 2x faster on my machine (it only has two real cores, so it can probably only do 2 sha256 operations at a time).
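To make that claim concrete, here is a rough sketch of the kind of measurement involved: hashing a pile of 256 KiB chunks with one worker versus one worker per core. This is illustrative only, not the benchmark actually run for this PR.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"runtime"
	"sync"
	"time"
)

// hashAll hashes every chunk using the given number of worker goroutines
// and returns the wall-clock time taken.
func hashAll(chunks [][]byte, workers int) time.Duration {
	start := time.Now()
	ch := make(chan []byte)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range ch {
				sha256.Sum256(c) // digest discarded; we only time the work
			}
		}()
	}
	for _, c := range chunks {
		ch <- c
	}
	close(ch)
	wg.Wait()
	return time.Since(start)
}

func main() {
	// 512 chunks of 256 KiB, the filestore's chunk size.
	chunks := make([][]byte, 512)
	for i := range chunks {
		chunks[i] = make([]byte, 256*1024)
	}
	fmt.Println("serial:  ", hashAll(chunks, 1))
	fmt.Println("parallel:", hashAll(chunks, runtime.NumCPU()))
}
```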
As long as the output is the same, this looks good.
The main reason for doing the file-sort is to access files sequentially from disk, see #3922
I forgot about that. Blocking until this is addressed.
It is important that we keep the functionality of reading files sequentially to minimize seeking. Because of this, it feels wrong for the client to handle it. See #3922.
I see. I assumed that the filestore would internally take care of this (caching, etc.) but I guess that would fail for random access on large datasets. It would be nice if we could avoid opening files multiple times but I guess we can't do that :(.
I think that the most important thing with sequential access here is that it is done "one file at a time", rather than "one IPFS block at a time, all over the place", which is what I observed when running without file-sort.
Unfortunately, we don't store enough information to do that. We only index blocks by their hash, not by the underlying file.
But we can construct this by sorting. That's what I'm currently working on. However, the index will have to fit in memory, for now. We could also try adding an on-disk reverse index, but that feels expensive.
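As a sketch of what "construct this by sorting" could mean, assuming a hypothetical `blockRef` record: collect (hash, file, offset) references, sort them by file and offset so each file's blocks are contiguous and in on-disk order, then group them into an in-memory map keyed by file. The types here are illustrative, not the actual filestore schema.

```go
package main

import (
	"fmt"
	"sort"
)

type blockRef struct {
	Hash     string // block multihash (illustrative)
	FilePath string // backing file on disk
	Offset   uint64 // byte offset of the chunk within the file
}

// buildIndex sorts refs by (file, offset) and groups them per file, so
// verification can proceed one file at a time with sequential reads.
// The whole index lives in memory.
func buildIndex(refs []blockRef) map[string][]blockRef {
	sort.Slice(refs, func(i, j int) bool {
		if refs[i].FilePath != refs[j].FilePath {
			return refs[i].FilePath < refs[j].FilePath
		}
		return refs[i].Offset < refs[j].Offset
	})
	idx := make(map[string][]blockRef)
	for _, r := range refs {
		idx[r.FilePath] = append(idx[r.FilePath], r)
	}
	return idx
}

func main() {
	refs := []blockRef{
		{"Qm...a", "movie.mkv", 262144},
		{"Qm...b", "movie.mkv", 0},
		{"Qm...c", "notes.txt", 0},
	}
	for path, blocks := range buildIndex(refs) {
		fmt.Println(path, "->", len(blocks), "blocks in disk order")
	}
}
```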
And that is exactly what …
The missing pieces are:
My plan is to have one thread do a sequential read while a set of workers hash the file chunks in parallel. However, after thinking about this, maybe we should just build the reverse index. It could also save us a lot of work when calling …
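A minimal sketch of that plan, with illustrative names: a single goroutine performs sequential 256 KiB reads while a pool of workers hashes the chunks in parallel. A real verifier would compare each digest against the hash the filestore recorded for that (file, offset) pair; here we just print it.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"runtime"
	"sync"
)

type chunk struct {
	offset int64
	data   []byte
}

func verifyFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	chunks := make(chan chunk, runtime.NumCPU())
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range chunks {
				sum := sha256.Sum256(c.data)
				// A real verifier would compare sum against the hash the
				// filestore recorded for this (file, offset) pair.
				fmt.Printf("%s@%d -> %x\n", path, c.offset, sum[:4])
			}
		}()
	}

	// Single reader: sequential 256 KiB reads keep disk access in order
	// and minimize seeking.
	var off int64
	for {
		buf := make([]byte, 256*1024)
		n, err := f.ReadAt(buf, off)
		if n > 0 {
			chunks <- chunk{offset: off, data: buf[:n]}
			off += int64(n)
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			close(chunks)
			wg.Wait()
			return err
		}
	}
	close(chunks)
	wg.Wait()
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: verify <file>")
		os.Exit(1)
	}
	if err := verifyFile(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```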
We should measure whether this is really a problem. My guess is that the overhead should be negligible, but I could be wrong.
For the filestore, it may be (given the 256 KiB chunks); for the urlstore, I'm not so sure. My primary concern is really hashing in parallel.
So for me it is a slowdown. (My guess is that BTRFS is mainly to blame for this difference; I might be able to recreate this on EXT4 instead.)
Closing due to bitrot (and brainrot, mine specifically).
This works by spinning up ncores * 2 workers and having them iterate over the datastore in parallel. The extra workers help when the bottleneck is reading from the datastore, not hashing.
Also:

* Remove sorting from the filestore functions; this feels like something that belongs in the commands.
* Deduplicate command logic.
* Switch to commands-lib 1.0.
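Structurally, the parallel iteration described above could look something like the following sketch. `verifyBlock` and the hard-coded key source are placeholders, not the actual go-ipfs datastore API.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// verifyBlock is a placeholder: a real implementation would load the
// block from the datastore and re-hash it.
func verifyBlock(key string) string {
	return "ok: " + key
}

func main() {
	keys := make(chan string)
	results := make(chan string)

	// ncores * 2 workers pull keys from a shared channel, so some can
	// block on datastore reads while others hash.
	workers := runtime.NumCPU() * 2
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for k := range keys {
				results <- verifyBlock(k)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed the workers; in the PR this would be the datastore key query.
	go func() {
		for _, k := range []string{"Qm...1", "Qm...2", "Qm...3"} {
			keys <- k
		}
		close(keys)
	}()

	for r := range results {
		fmt.Println(r)
	}
}
```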