Handle host not available scenario during peer bootstrap more gracefully #1677

Merged
richardartoul merged 10 commits into master from ra/fix-slow-peer-bootstrap
Jun 6, 2019

Conversation

richardartoul
Contributor

What this PR does / why we need it:
Fixes #1667

@codecov

codecov bot commented May 30, 2019

Codecov Report

Merging #1677 into master will decrease coverage by <.1%.
The diff coverage is 92.8%.

@@           Coverage Diff            @@
##           master   #1677     +/-   ##
========================================
- Coverage    71.8%   71.8%   -0.1%     
========================================
  Files         968     968             
  Lines       81169   81181     +12     
========================================
+ Hits        58345   58351      +6     
- Misses      18990   18992      +2     
- Partials     3834    3838      +4
Flag Coverage Δ
#aggregator 82.4% <ø> (ø) ⬆️
#cluster 85.7% <ø> (ø) ⬆️
#collector 63.9% <ø> (ø) ⬆️
#dbnode 79.9% <92.8%> (-0.1%) ⬇️
#m3em 73.2% <ø> (ø) ⬆️
#m3ninx 74% <ø> (ø) ⬆️
#m3nsch 51.1% <ø> (ø) ⬆️
#metrics 17.6% <ø> (ø) ⬆️
#msg 74.7% <ø> (ø) ⬆️
#query 66.3% <ø> (ø) ⬆️
#x 85.8% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c982091...c3f38c9.

@richardartoul richardartoul changed the title [WIP] - Handle host not available scenario during peer bootstrap more gracefull [WIP] - Handle host not available scenario during peer bootstrap more gracefully May 30, 2019
@m3db m3db deleted a comment from codecov bot Jun 2, 2019
@m3db m3db deleted a comment from codecov bot Jun 2, 2019
@robskillington
Collaborator

robskillington commented Jun 2, 2019

This will be a good improvement. It's unfortunate that this scenario wasn't considered when implementing bootstrap consistency (a flaw in my initial implementation).

@@ -2833,13 +2846,6 @@ func (s *session) streamBlocksBatchFromPeer(
result, attemptErr = client.FetchBlocksRaw(tctx, req)
})
err := xerrors.FirstError(borrowErr, attemptErr)
// Do not retry if cannot borrow the connection or
// if the connection pool has no connections
switch err {
Contributor Author

Deleted this, since these errors will be non-retryable by default.

@richardartoul richardartoul changed the title [WIP] - Handle host not available scenario during peer bootstrap more gracefully Handle host not available scenario during peer bootstrap more gracefully Jun 3, 2019
@richardartoul richardartoul force-pushed the ra/fix-slow-peer-bootstrap branch from c3f38c9 to 8533d75 Compare June 3, 2019 21:55
@richardartoul
Contributor Author

@robskillington Yeah, no worries, it's a pretty difficult thing to consider. We only caught it because the way Odin does node adds triggered it reproducibly. I've also addressed all your feedback and added a regression test, if you have a moment to review again.

// raw (retryable) hostNotAvailableError since the error is technically
// retryable but was wrapped to prevent the exponential backoff in the
// actual retrier.
if isHostNotAvailableError(err) {
Collaborator

Would it be better to get back a bool and the inner error from this function call? That avoids the extra function call, and I'd prefer the innerNonRetryableError handling to stay encapsulated behind the hostNotAvailableError utility method.

gaugeReportInterval = 500 * time.Millisecond
blockMetadataChBufSize = 4096
shardResultCapacity = 4096
streamBlocksMetadataFromPeerErrorSleepInterval = 1 * time.Millisecond
Collaborator

Question: why so low? Why not 10 or 100 ms?

@@ -2151,6 +2155,12 @@ func (s *session) streamBlocksMetadataFromPeers(
atomic.AddInt32(&success, 1)
return
}

// Prevent the loop from spinning too aggressively if
Collaborator

How about sleeping for a longer duration only if the specialised error condition is triggered?

@richardartoul richardartoul force-pushed the ra/fix-slow-peer-bootstrap branch from b31157f to 7525d5a Compare June 6, 2019 01:37
Collaborator

@robskillington robskillington left a comment

LGTM

@richardartoul richardartoul merged commit e249c30 into master Jun 6, 2019
@richardartoul richardartoul deleted the ra/fix-slow-peer-bootstrap branch June 6, 2019 02:24
Development

Successfully merging this pull request may close these issues.

Replacing a dead node can be very slow
3 participants