[DBNode] - Make repairs actually repair data #1849

Merged
merged 15 commits into from
Aug 6, 2019

Conversation

richardartoul
Contributor

@richardartoul richardartoul commented Aug 2, 2019

What this PR does / why we need it:
This PR changes the background repairs feature from simply emitting metrics/logs about data mismatches between nodes to actually repairing the mismatches.

It also re-writes the repair scheduling logic to a "repairing all the time" model that is more congruent with the fact that M3DB now supports out of order writes.

Includes a combination of unit tests, several integration tests, and a docker integration test.
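
As a rough illustration of moving from "detect mismatches" to "repair mismatches", the sketch below compares per-block checksums reported by the local node and a peer, then lists the blocks the local node would need to fetch. This is not M3DB's actual implementation: `blockKey`, `blockChecksums`, and `blocksToRepair` are hypothetical names invented for this example, and it assumes a block is identified by series ID plus block start and compared by checksum alone.

```go
package main

import (
	"fmt"
	"sort"
)

// blockKey identifies one block of one series (hypothetical type for this sketch).
type blockKey struct {
	SeriesID   string
	BlockStart int64 // block start as Unix nanoseconds
}

// blockChecksums maps each block to the checksum a node reports for it.
type blockChecksums map[blockKey]uint32

// blocksToRepair returns the blocks the local node should fetch from the peer:
// blocks the peer has that are missing locally or whose checksum differs.
// The result is sorted so the output is deterministic.
func blocksToRepair(local, peer blockChecksums) []blockKey {
	var out []blockKey
	for key, peerSum := range peer {
		localSum, ok := local[key]
		if !ok || localSum != peerSum {
			out = append(out, key)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].SeriesID != out[j].SeriesID {
			return out[i].SeriesID < out[j].SeriesID
		}
		return out[i].BlockStart < out[j].BlockStart
	})
	return out
}

func main() {
	local := blockChecksums{
		{SeriesID: "foo", BlockStart: 0}: 0xAAAA,
		{SeriesID: "bar", BlockStart: 0}: 0x1111,
	}
	peer := blockChecksums{
		{SeriesID: "foo", BlockStart: 0}: 0xAAAA, // matches, nothing to do
		{SeriesID: "bar", BlockStart: 0}: 0x2222, // checksum mismatch
		{SeriesID: "baz", BlockStart: 0}: 0x3333, // missing locally
	}
	for _, key := range blocksToRepair(local, peer) {
		fmt.Printf("repair series=%s blockStart=%d\n", key.SeriesID, key.BlockStart)
	}
}
```

A metrics-only repair would stop after computing this list; an actual repair would go on to fetch those blocks from the peer and load them into the local node.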

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:
No.

Does this PR require updating code package or user-facing documentation?:
No.

@codecov

codecov bot commented Aug 5, 2019

Codecov Report

Merging #1849 into master will decrease coverage by 7.9%.
The diff coverage is 67%.

@@            Coverage Diff            @@
##           master   #1849      +/-   ##
=========================================
- Coverage    72.2%   64.2%      -8%     
=========================================
  Files         987     725     -262     
  Lines       83823   66579   -17244     
=========================================
- Hits        60559   42779   -17780     
- Misses      19236   20582    +1346     
+ Partials     4028    3218     -810
Flag          Coverage Δ
#aggregator   65.6% <ø> (-16.8%) ⬇️
#cluster      83% <ø> (-2.6%) ⬇️
#collector    47.9% <ø> (-15.8%) ⬇️
#dbnode       70% <67%> (-9.8%) ⬇️
#m3em         68.4% <ø> (-4.9%) ⬇️
#m3ninx       73.1% <ø> (-1.1%) ⬇️
#m3nsch       51.1% <ø> (ø) ⬆️
#metrics      17.5% <ø> (ø) ⬆️
#msg          74.7% <ø> (ø) ⬆️
#query        67.5% <ø> (-0.3%) ⬇️
#x            73% <ø> (-12.6%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d860d9c...f69e768.

@richardartoul richardartoul changed the title from "[DBNode][WIP] - Make repairs actually repair data" to "[DBNode] - Make repairs actually repair data" on Aug 5, 2019
blockStates BootstrappedBlockStateSnapshot,
) {
- for _, block := range bootstrappedBlocks.AllBlocks() {
+ for _, block := range blocksToLoad.AllBlocks() {
Collaborator

As mentioned offline, this is fine now, but in the future we should measure the performance of cold flushes when there is no cold data. The ColdFlushEnabled flag should probably just control whether cold writes are accepted or not - other functions like bootstrapping and flushes should be free to use the cold flush mechanism regardless.

Contributor Author

Agreed. Maybe let's add it as a cold writes follow-up issue, or track it as part of the plan to combine warm flush and cold flush into one process?

@@ -310,9 +310,26 @@ func (enc *encoder) LastEncoded() (ts.Datapoint, error) {
return result, nil
}

- // Len returns the length of the data stream.
+ // Len returns the length of the final data stream that would be generated
+ // by a call to Stream().
Collaborator

Hmm this is slightly redefining what Len() is returning, right? Are all other uses of this function okay with this new definition? Can you also change the comment for this function in its interface (in dbnode/encoding/types.go) if appropriate?

Contributor Author

Yeah, this isn't used for much, and all I've done is make it more accurate. Updated the comment.
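
To make the distinction in this thread concrete, here is a minimal sketch of an encoder whose `Len()` reports the length of what `Stream()` would return rather than the size of its internal buffer. It is not the M3DB encoder: `sketchEncoder` and its `tail` field are hypothetical, and the sketch assumes the final stream differs from the buffered bytes only by a trailing end-of-stream marker, which the PR does not state.

```go
package main

import "fmt"

// sketchEncoder is a hypothetical stand-in for an encoder that buffers bytes
// and appends an end-of-stream marker only when a stream is produced.
type sketchEncoder struct {
	data []byte // bytes written so far
	tail []byte // marker appended by Stream(); assumed for this sketch
}

// Write appends one byte to the internal buffer.
func (e *sketchEncoder) Write(b byte) {
	e.data = append(e.data, b)
}

// Stream returns a copy of the buffered bytes followed by the tail marker.
func (e *sketchEncoder) Stream() []byte {
	out := make([]byte, 0, e.Len())
	out = append(out, e.data...)
	return append(out, e.tail...)
}

// Len returns the length of the final stream that Stream() would generate,
// which includes the tail marker, not just len(e.data).
func (e *sketchEncoder) Len() int {
	return len(e.data) + len(e.tail)
}

func main() {
	enc := &sketchEncoder{tail: []byte{0xFF}}
	enc.Write(0x01)
	enc.Write(0x02)
	fmt.Println(enc.Len() == len(enc.Stream()))
}
```

With this definition, callers that size buffers from `Len()` get a value that matches the stream they will actually receive.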

Collaborator

@justinjc justinjc left a comment

LGTM

@richardartoul richardartoul merged commit 7f9a6be into master Aug 6, 2019
@richardartoul richardartoul deleted the ra/actual-repairs branch August 6, 2019 19:34

echo "Wait for the data to become available (via repairs) from dbnode03"
ATTEMPTS=10 MAX_TIMEOUT=4 TIMEOUT=1 retry_with_backoff \
read_all "coldWritesRepairAndNoIndex" "foo" 1 9022
Collaborator

Should this be port 9032 for the third DB node?
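
For readers unfamiliar with the retry pattern used in the test script above, the following Go sketch shows one plausible reading of `ATTEMPTS=10 MAX_TIMEOUT=4 TIMEOUT=1`: try up to 10 times, wait 1 second after the first failure, and double the wait after each failure up to a 4 second cap. The doubling-and-capping behaviour is an assumption; the thread does not show how `retry_with_backoff` is implemented. `retryWithBackoff` and `runDemo` are hypothetical names, and the sleep function is injected so the example runs without actually waiting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls fn up to attempts times. After each failure except
// the last it sleeps for the current wait, then doubles the wait, capped at
// maxWait. The sleep function is a parameter so callers can stub it out.
func retryWithBackoff(
	attempts int,
	initial, maxWait time.Duration,
	sleep func(time.Duration),
	fn func() error,
) error {
	if attempts <= 0 {
		return errors.New("attempts must be positive")
	}
	var err error
	wait := initial
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		sleep(wait)
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// runDemo simulates a read that only succeeds on the fifth call, recording
// the waits instead of sleeping.
func runDemo() (int, []time.Duration, error) {
	calls := 0
	var waits []time.Duration
	record := func(d time.Duration) { waits = append(waits, d) }
	err := retryWithBackoff(10, time.Second, 4*time.Second, record, func() error {
		calls++
		if calls < 5 {
			return errors.New("data not yet repaired")
		}
		return nil
	})
	return calls, waits, err
}

func main() {
	calls, waits, err := runDemo()
	fmt.Println(calls, waits, err)
}
```

Passing `time.Sleep` as the `sleep` argument gives real delays; the recording stub in `runDemo` is only there to keep the example fast.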

3 participants