
Mas i1805 hintedrepair #1806

Merged: 8 commits merged into develop-3.0 from mas-i1805-hintedrepair, Nov 29, 2021
Conversation

@martinsumner (Contributor) commented Nov 18, 2021

In Tictac AAE, repairs are slow relative to the previous (and still default) kv_index_hashtree solution. Comparisons to discover broken AAE segments are fast, but then a fetch_clocks query has to be run.

The fetch_clocks query is a full keystore scan - but each slot in the store has a segment index, so whole slots can be skipped when they contain no interesting segments. The skipping improves speed, but there are issues:

  • the higher the tictacaae_maxresults (the segment count), the less skipping can be done and the slower the query.
  • in order to address issues whereby a key is misrepresented in the cached AAE tree, extra work must be done to rebuild the segments in the AAE store as part of the query.
  • to avoid issues with items in the key-store but not in the Journal (when running native leveled) the query has a journal presence check enabled.
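The slot-skipping described above can be sketched as follows. This is an illustrative Erlang fragment, not the actual leveled internals; the module name, function and the representation of a slot's segment index as a plain list are assumptions for the sake of the example:

```erlang
%% Illustrative sketch of segment-based slot skipping: a whole slot can be
%% skipped by fetch_clocks when its segment index contains none of the
%% segments of interest. NOT the actual leveled implementation.
-module(slot_skip).
-export([skippable/2]).

%% SlotSegments: segment IDs recorded in the slot's index.
%% WantedSegments: broken segments identified by the tree exchange.
-spec skippable([non_neg_integer()], [non_neg_integer()]) -> boolean().
skippable(SlotSegments, WantedSegments) ->
    Present = ordsets:from_list(SlotSegments),
    Wanted = ordsets:from_list(WantedSegments),
    %% Skip the slot only when there is no overlap at all.
    ordsets:is_disjoint(Present, Wanted).
```

Note that the larger the wanted-segment set (i.e. the higher tictacaae_maxresults), the less likely any given slot is disjoint from it, which is why a higher segment count reduces skipping.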

What is implemented here is prompted repairs. There are, as before, scheduled exchanges; however, by default these now repair only 128 segments rather than 256 (to reduce overheads). These exchanges are otherwise unchanged.

If the exchanges highlight keys that need repairing, they are repaired as before. However, the riak_kv_tictacaae_repairs:analyse_repairs/2 function then examines the repaired keys, looking at the modified dates, buckets and (optionally) the key ranges. analyse_repairs/2 may then prompt a new exchange, but this time with a filter (either a bucket and modified-time range, or just a modified-time range) that accelerates the resulting fetch_clocks query.

These prompted exchanges should be able to find repairs (via fetch_clocks) an order of magnitude faster than standard exchanges, as:

  • the additional constraints should greatly reduce the amount of the key space to be covered (one would normally expect a large delta to have a small range of last_modified times in particular);
  • as normal exchanges are handling the outlier cases (no journal presence, broken cached tree), the extra activity (check presence and segment rebuild) is not required on these prompted exchanges.
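As a rough illustration of the analysis step, the following is a hedged sketch, not the actual riak_kv_tictacaae_repairs code; the tuple representation of a repaired key and the shape of the returned filter are assumptions made for this example:

```erlang
%% Illustrative sketch of deriving a narrowed exchange filter from a list
%% of repaired keys. Each repaired key is represented here (an assumption)
%% as {Bucket, Key, LastModified}. If all repairs fall in one bucket, the
%% filter constrains by bucket as well as last-modified range.
-module(repair_filter).
-export([analyse/1]).

analyse(RepairedKeys) ->
    Buckets = lists:usort([B || {B, _K, _LMD} <- RepairedKeys]),
    LMDs = [LMD || {_B, _K, LMD} <- RepairedKeys],
    Range = {lists:min(LMDs), lists:max(LMDs)},
    case Buckets of
        [Bucket] ->
            %% Single bucket: bucket plus modified-time range.
            {filter, Bucket, Range};
        _ ->
            %% Multiple buckets: modified-time range only.
            {filter, all_buckets, Range}
    end.
```

A large delta caused by a single event (such as a node outage) would normally cluster in a narrow last-modified window, which is what makes this filter effective.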

There will be up to tictacaae_repairloops (default 4) prompted exchanges run following an exchange, until the loop count is exhausted or a loop finds insufficient keys to warrant continuing (less than 50% of the requested max results). These loops have tictacaae_maxresults boosted by a tictacaae_rangeboost integer multiplier (default 2, i.e. 128 goes to 256).
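Gathered in one place, the tunables mentioned above might look like this in riak.conf style. The exact key names and defaults are defined in priv/riak_kv.schema, so treat this as a sketch rather than verified configuration:

```
## Sketch of the tunables described above (consult priv/riak_kv.schema
## for the authoritative key names and defaults).

## Max broken segments repaired per standard exchange (reduced from 256)
tictacaae_maxresults = 128

## Number of prompted (range-filtered) exchanges run after a standard
## exchange finds repairs
tictacaae_repairloops = 4

## Multiplier applied to tictacaae_maxresults on prompted exchanges
## (2 x 128 = 256)
tictacaae_rangeboost = 2
```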

Allow for range-based exchanges to be used when standard exchanges identify a demand for further AAE, and a localised AAE issue.
Should be able to track what is happening via both stats and logs.
If all nodes in the cluster are not yet at this version, then they may not support the necessary aae_folds required by the range-based repairs.
@martinsumner

basho/riak_test#1360

@martinsumner

#1805

As the pause system exists to space out any repairs, the proposal in the issue to have a new read-repair-manager process has been dropped.

@martinsumner

Test evidence of any performance/efficiency improvements still required before PR is merged

@martinsumner

For initial performance testing, this riak_test can be used for comparisons.

https://github.com/basho/riak_test/blob/171476caa2b2c2ba3a65906ca2c488089d4a0df6/tests/verify_tictac_aae_load.erl

With these settings (switching the MAX_RESULTS back to 256 when testing the previous code), approximately a 60-70% reduction in repair time was seen with this branch.

@martinsumner commented Nov 22, 2021

In a full environment volume test, the new code was able to repair deltas discovered via AAE at 3.5x the rate of the previous code. At this rate it was using more CPU, but only roughly 30-50% more, so the improvement is CPU-efficient.

It could be de-tuned to gain the extra benefit at the equivalent speed (such as by reducing tictacaae_maxresults further, down to 64).

The volume test was done with a ring-size of 512, on an 8-node cluster with about 180M keys. On one node a backup is taken, then more load is added, the node is stopped, and then more load is added - then the node is recovered from backup.

When the node is recovered, there is a hinted handoff of recent writes (since the stop), but then the previous delta (between the backup and the stop) needs to be recovered via AAE. In this case, this delta amounts to about 1.8M keys.

In Riak 3.0.9, recovering the delta completely takes just over 24 hours, although the majority is recovered within 16 hours. The peak rate of recovery is just over 3K objects per minute. During the recovery, about 10% of available CPU is consumed. Chart shows decline in mismatched segments over 24 hours:

[Chart: MismatchedSegments_Riak309]

With this branch, recovering the delta completely takes about 10 hours, and the majority is recovered within 4 hours. The peak rate of recovery is 11K objects per minute. During the recovery, about 15% of available CPU is consumed.

[Chart: MismatchedSegments_HintedRepair]

If tictacaae_maxresults is reduced further, halved to 64, with hinted repair the CPU utilisation reverts to that found in Riak 3.0.9, but the recovery time is still faster (though not as fast as with the higher tictacaae_maxresults):

[Chart: MismatchedSegments_HintedRepair_MR64]

With this setting, recovering the delta completely takes 17 hours, and the majority is recovered within 8 hours. The peak rate of recovery is 7K objects per minute. During the recovery, less than 10% of available CPU is consumed.

@martinsumner

Note that there appear to be two "stripes" in the recovery. There are two forms of delta created by the test:

  • genuine deltas, as a result of the node being restarted at a state prior to recent updates being received (updates that are not covered in handoffs);
  • resurrected keys, as a result of running Riak in a non-keep delete mode, and as a side effect of the journal-compaction/backup/recovery process within leveled. These resurrected keys are fewer in number, but concentrated in a smaller subset of vnodes (and hence take longer to clear).

Following performance testing of the new feature, AAE throughput can be improved to this level on current settings, without increasing CPU load.

Configuration option no longer hidden, as it may require tuning by users who wish for AAE deltas to be closed more rapidly.
@ThomasArts (Contributor) left a comment:


not done yet with review... but want to save the state

(Review threads on priv/riak_kv.schema: resolved.)
@martinsumner merged commit 58b3fc2 into develop-3.0 on Nov 29, 2021, and deleted the mas-i1805-hintedrepair branch.