Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks #1287

Merged
merged 12 commits into from
Jan 11, 2019

Conversation

richardartoul
Copy link
Contributor

@richardartoul richardartoul commented Jan 9, 2019

The existing health checks use IsBootstrapped() to determine if a node is healthy and bootstrapped which is fine in the normal case, but in the situation where a topology change has occurred this is problematic.

The reason its problematic in that scenario is that after a topology change, as a result of #1183 the node will not mark its shards as available until a snapshot has occurred. This is an issue for our automated tooling which will assume that once a node is "healthy" it can proceed with adding more nodes because we don't want the tooling to proceed until the previous node has marked all its shards as available otherwise we can end up in weird shards state (I.E R.F = 3 but we have 2 shard states AVAILABLE, 1 LEAVING, and 2 Initializing and 2/5 will not achieve quorum so writes / reads will begin to fail)

@codecov
Copy link

codecov bot commented Jan 9, 2019

Codecov Report

Merging #1287 into master will decrease coverage by 1%.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #1287     +/-   ##
========================================
- Coverage    70.6%   69.6%   -1.1%     
========================================
  Files         761     756      -5     
  Lines       64108   63908    -200     
========================================
- Hits        45292   44482    -810     
- Misses      15885   16472    +587     
- Partials     2931    2954     +23
Flag Coverage Δ
#aggregator 81.6% <ø> (-0.1%) ⬇️
#cluster 85.6% <ø> (-0.2%) ⬇️
#collector 78.4% <ø> (ø) ⬆️
#dbnode 78.4% <100%> (-2.5%) ⬇️
#m3em 73.2% <ø> (ø) ⬆️
#m3ninx 74.6% <ø> (-0.1%) ⬇️
#m3nsch 51.1% <ø> (ø) ⬆️
#metrics 18.2% <ø> (ø) ⬆️
#msg 74.9% <ø> (ø) ⬆️
#query 60.1% <ø> (ø) ⬆️
#x 75.2% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d56e81...cdf1e8d. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 9, 2019

Codecov Report

Merging #1287 into master will decrease coverage by <.1%.
The diff coverage is 92.3%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #1287     +/-   ##
========================================
- Coverage    70.6%   70.6%   -0.1%     
========================================
  Files         761     761             
  Lines       64108   64125     +17     
========================================
- Hits        45292   45278     -14     
- Misses      15885   15903     +18     
- Partials     2931    2944     +13
Flag Coverage Δ
#aggregator 81.6% <ø> (-0.1%) ⬇️
#cluster 85.6% <ø> (-0.2%) ⬇️
#collector 78.4% <ø> (ø) ⬆️
#dbnode 80.8% <92.3%> (-0.1%) ⬇️
#m3em 73.2% <ø> (ø) ⬆️
#m3ninx 74.6% <ø> (ø) ⬆️
#m3nsch 51.1% <ø> (ø) ⬆️
#metrics 18.2% <ø> (ø) ⬆️
#msg 74.9% <ø> (ø) ⬆️
#query 60.1% <ø> (ø) ⬆️
#x 75.2% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d56e81...65835a2. Read the comment docs.

// If the topology is not dynamic, then the only thing we care
// about is whether the node is bootstrapped or not because the
// concept of durability as it relates to shard state doesn't
// make sense if we're using a static topology.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm is this true? Should we perhaps be calling d.Database.IsBootstrappedAndDurable() so that at least in the case where the topology is not dynamic it's had at least one snapshot, etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spoke offline, existing implementation should be good but I added more context in the comments.

Copy link
Collaborator

@robskillington robskillington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@richardartoul richardartoul merged commit d12f6fa into master Jan 11, 2019
cw9 added a commit that referenced this pull request Jan 11, 2019
* master:
  Change namespace URLs to be scoped by service (#1292)
  fix a typo (#1291)
  Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks (#1287)
  Fix Typo in m3dbnode.service. (#1288)
  [buildkite] set timeout properly (#1290)
  Add CompleteTags method for M3 storage and by default enable it (#1258)
  Refix flaky test. (#1285)
  Ignore autogenerated grammar peg file from metalinting (#1284)
  Deletes extraneous print line (#1282)
  Ignore copyright year changes in gen'd files (#1275)
  [query] Refactor encoded block conversion to use options (#1276)
  Update api query docs (#1279)
  Change query String() method to add any negated queries (#1262)
@justinjc justinjc deleted the ra/bootstrapped_and_durable_health branch January 22, 2019 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants