
bug(buffers): disk buffer v2 has inconsistent buffer size after restarting #10436

Closed
Tracked by #9476
tobz opened this issue Dec 14, 2021 · 3 comments
Labels: domain: buffers, type: bug



tobz commented Dec 14, 2021

There's an issue where sometimes the buffer size, stored in the ledger, appears to desynchronize with the actual state of the buffer based on the available data files.

Essentially, we run buffer_perf so that it writes heavily to a buffer, and then stop the program by sending it SIGINT, aka "Ctrl-C". We then immediately rerun the program, which appears to simply stall right after logging that it's loading the configuration.

This is due to a wrapping fetch_sub, which incorrectly wraps the buffer size back past zero, leaving it at a value practically indistinguishable from 2^64. In turn, when the writer attempts to open the current data file and make sure it's up to date and ready for writing, this triggers the logic that decides the buffer is over its maximum allowed size, so the writer must wait for the reader... which it can't, because we haven't created the reader yet, and thus the process stalls.
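To make the failure mode concrete, here's a minimal standalone sketch of the underflow, using the values from the panic message quoted in the debugging notes below (an illustration, not Vector's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    // Stand-in for the ledger's total buffer size counter.
    let total_buffer_size = AtomicU64::new(1288);

    // `fetch_sub` on `AtomicU64` wraps around on overflow: subtracting more
    // than the current value leaves the counter just below 2^64.
    let previous = total_buffer_size.fetch_sub(3216, Ordering::AcqRel);
    assert_eq!(previous, 1288);

    // 1288 - 3216 == -1928, which wraps to 2^64 - 1928.
    let current = total_buffer_size.load(Ordering::Acquire);
    assert_eq!(current, u64::MAX - 1927);
    println!("counter wrapped to {current}");
}
```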

This obviously should not happen.

tobz added the type: bug and domain: buffers labels Dec 14, 2021

tobz commented Dec 14, 2021

(debugging notes)

I've scattered a lot of print statements throughout the code. Here's an example of the ledger state on the second run, from the writer's perspective:

ArchivedLedgerState {
    total_records: 1,
    total_buffer_size: 3936,
    writer_next_record_id: 8889687,
    writer_current_data_file_id: 156,
    reader_current_data_file_id: 156,
    reader_last_record_id: 8889685
}

This all looks reasonable: 1 record in the buffer (the writer's next record ID is two ahead of the reader's last record ID, which works out to one unread record since the writer ID is the next one to be assigned, so that's right) with a size of ~4k bytes, around the peak record size we'd expect from buffer_perf, so nothing too suspect. The reader and writer are on the same data file ID, which is, again, normal for a buffer where the reader has kept up with the writer.

Then when it starts to actually read, we eventually hit a well-placed panic!:

thread 'tokio-runtime-worker' panicked at 'unexpected buffer size regression putting it over 4GB; last=1288, total_record_size=3216', lib/vector-core/buffers/src/disk_v2/ledger.rs

This corresponds to a check that takes the return value of fetch_sub and panics if it was smaller than what we just subtracted from it. Obviously, we shouldn't be able to subtract 3216 from 1288, so something is up.
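For reference, a hypothetical reconstruction of that guard (the helper name and signature are illustrative, not Vector's exact code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Decrement the buffer-size counter, then panic if the previous value shows
// we just subtracted more than was accounted for, i.e. the counter underflowed
// and wrapped.
fn decrement_buffer_size(total_buffer_size: &AtomicU64, total_record_size: u64) {
    let last = total_buffer_size.fetch_sub(total_record_size, Ordering::AcqRel);
    if last < total_record_size {
        panic!(
            "unexpected buffer size regression putting it over 4GB; \
             last={last}, total_record_size={total_record_size}"
        );
    }
}
```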

One possible next step would be to enable trace-level logging on the second run so we can see every read/write change to the ledger and understand how we diverge from its starting state. That wouldn't help if the ledger itself was corrupted by the first run stopping abruptly, though, so we may need to build something that can show us the buffer state, including reading the data files in a "strict" mode, to tell us whether the buffer as it exists on disk is valid. It could be that loading the buffer the second time modifies the state in a way that obscures the root cause of the issue.

That could just be a useful bit of code/a useful utility in general, to be honest.


tobz commented Jan 4, 2022

I'm pretty sure this is fixed as of #10479, but I need to double-check.

Essentially, we eschewed the notion of keeping a valid record count/total buffer size in the ledger, because we didn't have a great way to update it atomically short of moving to a more transactional, write-ahead-log/journal-based structure where we could replay deltas to recreate our state.

The new approach is a bit simpler: derive the record count from the writer/reader ID offsets, recalculate the total buffer size when loading a buffer from disk, and then keep a running tally at runtime.
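A minimal sketch of that derivation, assuming illustrative names rather than Vector's actual API, and assuming (as the ledger dump above suggests) that reader_last_record_id is the last record the reader fully processed and writer_next_record_id is the next ID the writer will assign:

```rust
struct LedgerState {
    writer_next_record_id: u64,
    reader_last_record_id: u64,
}

impl LedgerState {
    // Unread records: everything written but not yet read. With
    // writer_next_record_id == 8889687 and reader_last_record_id == 8889685,
    // as in the dump above, this yields 1.
    fn record_count(&self) -> u64 {
        self.writer_next_record_id
            .saturating_sub(self.reader_last_record_id)
            .saturating_sub(1)
    }
}

// On load, recompute the total buffer size from the on-disk data files
// instead of trusting a persisted counter; afterwards, keep a running
// tally in memory.
fn initial_buffer_size(data_file_sizes: &[u64]) -> u64 {
    data_file_sizes.iter().sum()
}
```

The appeal of this design is that there's no persisted counter left to drift out of sync: the count always falls out of the IDs, and the size is rebuilt from the files themselves on load.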

tobz mentioned this issue Jan 10, 2022

tobz commented Jan 13, 2022

After going back over the changes mentioned above, this issue should indeed be resolved at this point.

tobz closed this as completed Jan 13, 2022