bug(buffers): disk buffer v2 has inconsistent buffer size after restarting #10436
Comments
(debugging notes) I've scattered a lot of print statements through the code; here's an example of the ledger state on the second run, from the writer's perspective:

This all looks reasonable: 1 record in the buffer (writer next ID - reader last ID == 1, which is right) with a size of ~4k bytes, around the peak of the record size we'd expect for `buffer_perf`. Then, when it starts to actually read, we eventually hit a well-placed check.

One possible next step would be to enable trace-level logging on the second run so we can actually see every read/write ledger change and understand how we diverge from the starting state of the ledger. This wouldn't help if the ledger itself was messed up by the first run stopping abruptly, though, so we may need to build something that can show us the buffer state, including reading the data files in a "strict" mode, to tell us whether the buffer as it exists on disk is valid... since it could be that loading the buffer the second time modifies the state in a way that obscures the root cause of the issue. That could just be a useful bit of code/a useful utility in general, to be honest.
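To make that last idea a little more concrete, here's a minimal sketch of what such a read-only diagnostic could look like. All names here (including the `stored_total_size` field) are hypothetical and not Vector's actual buffer internals, and it only cross-checks data file sizes against the ledger rather than doing the strict record decoding mentioned above:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical, read-only snapshot of what the ledger claims.
struct LedgerSnapshot {
    writer_next_record_id: u64,
    reader_last_record_id: u64,
    stored_total_size: u64,
}

/// Report whether the on-disk data files agree with the ledger, without
/// mutating anything (unlike loading the buffer normally, which can
/// obscure the original state).
fn check_buffer(ledger: &LedgerSnapshot, data_dir: &Path) -> io::Result<()> {
    let mut on_disk_size = 0u64;
    for entry in fs::read_dir(data_dir)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            on_disk_size += entry.metadata()?.len();
        }
    }

    let implied_records = ledger
        .writer_next_record_id
        .saturating_sub(ledger.reader_last_record_id);

    println!("records implied by IDs: {implied_records}");
    println!("ledger total size:      {}", ledger.stored_total_size);
    println!("data files total size:  {on_disk_size}");
    if on_disk_size != ledger.stored_total_size {
        println!("MISMATCH: ledger size does not match data files");
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical values and path, purely for illustration.
    let ledger = LedgerSnapshot {
        writer_next_record_id: 10,
        reader_last_record_id: 9,
        stored_total_size: 4096,
    };
    check_buffer(&ledger, Path::new("path/to/buffer/data-dir"))
}
```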
I'm pretty sure this is fixed as of #10479, but I need to double-check to make sure. Essentially, we eschewed the notion of trying to keep a valid record count/total buffer size in the ledger, because we didn't have a great way to atomically update them short of moving to a more transactional/write-ahead-log/journal-based structure where we could replay deltas to recreate our state. The new approach is a bit simpler: derive the record count from the writer/reader ID offsets, recalculate the total buffer size when we load a buffer from disk, and then keep a running tally at runtime.
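A rough sketch of that shape (field and method names are hypothetical, not Vector's actual internals): the record count is a pure function of the IDs, the size is seeded once at load time, and only a running tally is maintained afterwards.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical ledger view; names are illustrative only.
struct Ledger {
    writer_next_record_id: AtomicU64,
    reader_last_record_id: AtomicU64,
    /// Running tally, seeded from the data files at load time rather than
    /// persisted as an authoritative value.
    total_buffer_size: AtomicU64,
}

impl Ledger {
    /// `recalculated_size` is the sum of the data file sizes computed when
    /// the buffer was loaded from disk.
    fn new(writer_next: u64, reader_last: u64, recalculated_size: u64) -> Self {
        Self {
            writer_next_record_id: AtomicU64::new(writer_next),
            reader_last_record_id: AtomicU64::new(reader_last),
            total_buffer_size: AtomicU64::new(recalculated_size),
        }
    }

    /// Record count derived from the writer/reader ID offsets, so it can
    /// never drift out of sync with the IDs themselves.
    fn record_count(&self) -> u64 {
        self.writer_next_record_id
            .load(Ordering::Acquire)
            .saturating_sub(self.reader_last_record_id.load(Ordering::Acquire))
    }

    /// Runtime tally updates as records are written...
    fn track_write(&self, record_size: u64) {
        self.total_buffer_size.fetch_add(record_size, Ordering::AcqRel);
    }

    /// ...and read back out. This relies on the tally being seeded
    /// correctly; the wrapping behavior of `fetch_sub` is what the issue
    /// below is about.
    fn track_read(&self, record_size: u64) {
        self.total_buffer_size.fetch_sub(record_size, Ordering::AcqRel);
    }
}

fn main() {
    // Example: 1 record outstanding, 4 KiB recalculated from the data files.
    let ledger = Ledger::new(10, 9, 4096);
    assert_eq!(ledger.record_count(), 1);
    ledger.track_write(2048);
    ledger.track_read(4096);
    assert_eq!(ledger.total_buffer_size.load(Ordering::Acquire), 2048);
}
```

Since the count is a pure function of the IDs, there is nothing extra to keep atomically in sync; only the size tally changes at runtime.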
After going back over the changes mentioned above, this issue should indeed be resolved at this point.
There's an issue where sometimes the buffer size stored in the ledger appears to desynchronize from the actual state of the buffer as represented by the available data files.

Essentially, we run `buffer_perf` such that it's heavily writing to a buffer, and then stop the program by sending it `SIGINT`, aka "Ctrl-C". We then immediately rerun the program, which appears to simply stall right after logging that it's loading the configurations.

This is due to a wrapping `fetch_sub`, which incorrectly wraps the buffer size back past zero, leaving it at a value that is practically indistinguishable from 2^64. In turn, when the writer attempts to open the current data file and make sure it's up-to-date and ready to write, this triggers the logic that says the buffer is over its maximum allowed size and the writer must wait for the reader... which it can't, because we haven't created the reader yet, and thus we've stalled the process.

This obviously should not happen.
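For illustration, here is the failure mode in miniature (not Vector's actual code): an `AtomicU64::fetch_sub` underflows and wraps, while a `fetch_update` with a saturating subtraction clamps at zero instead.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    // The ledger thinks the buffer holds 100 bytes, but the reader accounts
    // for more than that when it decrements.
    let buffer_size = AtomicU64::new(100);

    // Subtracting more than the current value wraps past zero on an
    // unsigned atomic, leaving a value close to 2^64...
    buffer_size.fetch_sub(150, Ordering::AcqRel);
    assert_eq!(buffer_size.load(Ordering::Acquire), u64::MAX - 49);
    // ...which any "is the buffer over its maximum size?" check will treat
    // as a massively over-full buffer, stalling the writer.

    // One way to guard against underflow: clamp the subtraction at zero.
    let buffer_size = AtomicU64::new(100);
    let _ = buffer_size.fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
        Some(current.saturating_sub(150))
    });
    assert_eq!(buffer_size.load(Ordering::Acquire), 0);
}
```

Clamping only hides the accounting bug, of course; the resolution described above sidesteps it by not treating the persisted size as authoritative in the first place.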