
Mas i1805 hintedrepair #1806

Merged: 8 commits merged into develop-3.0 from mas-i1805-hintedrepair, Nov 29, 2021
Conversation

@martinsumner (Contributor) commented Nov 18, 2021

In Tictac AAE, repairs are slow relative to the previous (and still default) kv_index_hashtree solution. Comparisons to discover broken AAE segments are fast, but then a fetch_clocks query has to be run.

The fetch_clocks query is a full keystore scan - but each slot in the store has a segment index, so whole slots can be skipped when they contain no interesting segments. The skipping improves speed, but there are issues:

  • the higher the tictacaae_maxresults (the segment count), the less skipping can be done and the slower the query.
  • in order to address issues whereby a key is misrepresented in the cached AAE tree, extra work must be done to rebuild the segments in the AAE store as part of the query.
  • to avoid issues with items in the key-store but not in the Journal (when running native leveled) the query has a journal presence check enabled.
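The slot-skipping described above can be sketched as follows. This is an illustrative Erlang fragment, not the actual leveled internals; the module name, function and the representation of a slot's segment index as a plain list are assumptions for the sake of the example:

```erlang
%% Illustrative sketch of segment-based slot skipping: a whole slot can be
%% skipped by fetch_clocks when its segment index contains none of the
%% segments of interest. NOT the actual leveled implementation.
-module(slot_skip).
-export([skippable/2]).

%% SlotSegments: segment IDs recorded in the slot's index.
%% WantedSegments: broken segments identified by the tree exchange.
-spec skippable([non_neg_integer()], [non_neg_integer()]) -> boolean().
skippable(SlotSegments, WantedSegments) ->
    Present = ordsets:from_list(SlotSegments),
    Wanted = ordsets:from_list(WantedSegments),
    %% Skip the slot only when there is no overlap at all.
    ordsets:is_disjoint(Present, Wanted).
```

Note that the larger the wanted-segment set (i.e. the higher tictacaae_maxresults), the less likely any given slot is disjoint from it, which is why a higher segment count reduces skipping.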

What is implemented here is prompted repairs. There are, as before, scheduled exchanges; however, by default these now repair only 128 segments rather than 256 (to reduce overheads). These exchanges are otherwise unchanged.

If the exchanges highlight keys that need repairing, they are repaired as before. However, the riak_kv_tictacaae_repairs:analyse_repairs/2 function then examines the repaired keys, looking at the modified dates, buckets and (optionally) the key ranges. analyse_repairs/2 may then prompt a new exchange, but this time with a filter (either a bucket and modified-time range, or just a modified-time range) that accelerates the resulting fetch_clocks query.

These prompted exchanges should be able to find repairs (via fetch_clocks) an order of magnitude faster than standard exchanges, as:

  • the additional constraints should greatly reduce the amount of the key space to be covered (one would normally expect a large delta to have a small range of last_modified times in particular);
  • as normal exchanges are handling the outlier cases (no journal presence, broken cached tree), the extra activity (check presence and segment rebuild) is not required on these prompted exchanges.
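As a rough illustration of the analysis step, the following is a hedged sketch, not the actual riak_kv_tictacaae_repairs code; the tuple representation of a repaired key and the shape of the returned filter are assumptions made for this example:

```erlang
%% Illustrative sketch of deriving a narrowed exchange filter from a list
%% of repaired keys. Each repaired key is represented here (an assumption)
%% as {Bucket, Key, LastModified}. If all repairs fall in one bucket, the
%% filter constrains by bucket as well as last-modified range.
-module(repair_filter).
-export([analyse/1]).

analyse(RepairedKeys) ->
    Buckets = lists:usort([B || {B, _K, _LMD} <- RepairedKeys]),
    LMDs = [LMD || {_B, _K, LMD} <- RepairedKeys],
    Range = {lists:min(LMDs), lists:max(LMDs)},
    case Buckets of
        [Bucket] ->
            %% Single bucket: bucket plus modified-time range.
            {filter, Bucket, Range};
        _ ->
            %% Multiple buckets: modified-time range only.
            {filter, all_buckets, Range}
    end.
```

A large delta caused by a single event (such as a node outage) would normally cluster in a narrow last-modified window, which is what makes this filter effective.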

There will be up to tictacaae_repairloops (default 4) prompted exchanges run following an exchange, until the loop count is exhausted or a loop finds insufficient keys to warrant continuing (less than 50% of the requested max results). These loops have tictacaae_maxresults boosted by a tictacaae_rangeboost integer multiplier (default 2, i.e. 128 goes to 256).
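Gathered in one place, the tunables mentioned above might look like this in riak.conf style. The exact key names and defaults are defined in priv/riak_kv.schema, so treat this as a sketch rather than verified configuration:

```
## Sketch of the tunables described above (consult priv/riak_kv.schema
## for the authoritative key names and defaults).

## Max broken segments repaired per standard exchange (reduced from 256)
tictacaae_maxresults = 128

## Number of prompted (range-filtered) exchanges run after a standard
## exchange finds repairs
tictacaae_repairloops = 4

## Multiplier applied to tictacaae_maxresults on prompted exchanges
## (2 x 128 = 256)
tictacaae_rangeboost = 2
```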

Allow for range-based exchanges to be used when standard exchanges identify a demand for further AAE, and a localised AAE issue.
Should be able to track what is happening via both stats and logs.
If all nodes in the cluster are not yet at this version, then they may not support the necessary aae_folds required by the range-based repairs.
@martinsumner

basho/riak_test#1360

@martinsumner

#1805

As the pause system exists to space out any repairs, the proposal in the issue to have a new read-repair-manager process has been dropped.

@martinsumner

Test evidence of any performance/efficiency improvements still required before PR is merged

@martinsumner

For initial performance testing, this riak_test can be used for comparisons.

https://github.com/basho/riak_test/blob/171476caa2b2c2ba3a65906ca2c488089d4a0df6/tests/verify_tictac_aae_load.erl

With these settings (switching the MAX_RESULTS back to 256 when testing the previous code), approximately a 60-70% reduction in repair time was seen with this branch.

@martinsumner commented Nov 22, 2021

In a full environment volume test, the new code was able to repair deltas discovered via AAE at 3.5x the rate of the previous code. At this rate it was using more CPU, but only roughly 30-50% more, so the improvement is CPU-efficient.

It could be de-tuned to gain the extra benefit at the equivalent speed (such as by reducing tictacaae_maxresults further, down to 64).

The volume test was done with a ring-size of 512, on an 8-node cluster with about 180M keys. On one node a backup is taken, then more load is added, the node is stopped, and then more load is added - then the node is recovered from backup.

When the node is recovered, there is a hinted handoff of recent writes (since the stop), but then the previous delta (between the backup and the stop) needs to be recovered via AAE. In this case, this delta amounts to about 1.8M keys.

In Riak 3.0.9, recovering the delta completely takes just over 24 hours, although the majority is recovered within 16 hours. The peak rate of recovery is just over 3K objects per minute. During the recovery, about 10% of available CPU is consumed. Chart shows decline in mismatched segments over 24 hours:

[Chart: MismatchedSegments_Riak309]

With this branch, recovering the delta completely takes about 10 hours, and the majority is recovered within 4 hours. The peak rate of recovery is 11K objects per minute. During the recovery, about 15% of available CPU is consumed.

[Chart: MismatchedSegments_HintedRepair]

If tictacaae_maxresults is reduced further, halved to 64, with hinted repair the CPU utilisation reverts to that found in Riak 3.0.9, but the recovery time is still faster (though not as fast as with the higher tictacaae_maxresults):

[Chart: MismatchedSegments_HintedRepair_MR64]

With this setting, recovering the delta completely takes 17 hours, and the majority is recovered within 8 hours. The peak rate of recovery is 7K objects per minute. During the recovery, less than 10% of available CPU is consumed.

@martinsumner

Note that there appear to be two "stripes" in the recovery. There are two forms of delta created by the test:

  • genuine deltas, as a result of the node being restarted at a state prior to recent updates being received (updates that are not covered in handoffs);
  • resurrected keys, as a result of running Riak in a non-keep delete mode, and as a side effect of the journal-compaction/backup/recovery process within leveled. These resurrected keys are fewer in number, but concentrated in a smaller subset of vnodes (and hence take longer to clear).

Following performance testing of the new feature, AAE throughput can be improved to this level on current settings, without increasing CPU load.

Configuration option no longer hidden, as it may require tuning by users who wish for AAE deltas to be closed more rapidly.
@ThomasArts (Contributor) left a comment:


not done yet with review... but want to save the state

(Review threads on priv/riak_kv.schema: resolved.)
@martinsumner merged commit 58b3fc2 into develop-3.0 on Nov 29, 2021, and deleted the mas-i1805-hintedrepair branch.