Force read repair via aae_fold #1793

Closed
martinsumner opened this issue Sep 2, 2021 · 1 comment
martinsumner commented Sep 2, 2021

The TictacAAE solution adds the aae_fold capability. However, when compared to legacy AAE it is slower to resolve deltas between vnodes - by default it is limited to fixing 256 deltas per exchange.

Generally, if a node goes down, most repair happens via hinted handoff, and most immediate issues are resolved through natural read repair. The max results limit can also be uplifted at run-time (though uplifting beyond 2048 tends to lead to skipped exchanges due to the time each exchange then takes).
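As a hedged sketch, such a run-time uplift might be applied from a riak attach session along these lines, assuming the limit is held in the riak_kv application environment under a key such as tictacaae_maxresults (the exact key name should be verified against the schema for the release in use):

```erlang
%% Sketch only: raise the per-exchange repair limit on every node.
%% The env key (tictacaae_maxresults) is an assumption here - check the
%% riak_kv schema for the release in use before relying on it.
rpc:multicall(application, set_env, [riak_kv, tictacaae_maxresults, 2048]).
```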

A bigger problem is when a node is recovered from a backup and then returned to the cluster - i.e. where there may be a big gap between the time of the backup and the time the node went down. Hinted handoff only covers the downtime itself (the period when fallbacks gathered PUTs to return), not that gap. The gap may take a long time to resolve via AAE, and there may be no natural application activity to resolve it through read repair.

The proposal is for a new aae_fold - repair_keys_range (similar in form to repl_keys_range). An administrator could then use this to accelerate recovery when there is a known gap. For example, if we know the time of the backup and the time of the node failure, an LMD (last-modified date) range between those times can be used in the aae_fold to prompt immediate read repair for everything within the range. As well as the usual filters, it would be beneficial to be able to specify a node (the recovering node), so that read repair is prompted only for objects where that node is in the preflist.
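As a minimal sketch of the shape this could take from a riak attach session - the repair_keys_range tuple, its argument order, and the node filter are all assumptions modelled on the existing repl_keys_range fold, not a settled API:

```erlang
%% Sketch only - query shape and node filter are assumptions modelled on
%% the existing repl_keys_range aae_fold.
{ok, C} = riak:local_client(),
BackupTS = 1630000000,                      %% epoch seconds of the backup
FailTS   = 1630500000,                      %% epoch seconds of the node failure
Query = {repair_keys_range,
         all,                               %% bucket filter (all buckets)
         all,                               %% key range
         {date, BackupTS, FailTS},          %% LMD range: backup -> failure
         {node, 'riak@recovered.example'}}, %% assumed filter: recovering node
riak_client:aae_fold(Query, C).
```

Everything returned within that range would then have read repair prompted immediately, rather than waiting for the deltas to surface 256 at a time through scheduled exchanges.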

Currently there is a test where it takes 24 hours for AAE to recover the gap between a backup timestamp and a node failure time (with > 10M PUTs between those times in an 8-node cluster). The aim here is to allow an administrator to accelerate this recovery to < 1 hour.

martinsumner commented
#1795
