-
Notifications
You must be signed in to change notification settings - Fork 455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks #1287
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1287 +/- ##
========================================
- Coverage 70.6% 69.6% -1.1%
========================================
Files 761 756 -5
Lines 64108 63908 -200
========================================
- Hits 45292 44482 -810
- Misses 15885 16472 +587
- Partials 2931 2954 +23
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #1287 +/- ##
========================================
- Coverage 70.6% 70.6% -0.1%
========================================
Files 761 761
Lines 64108 64125 +17
========================================
- Hits 45292 45278 -14
- Misses 15885 15903 +18
- Partials 2931 2944 +13
Continue to review full report at Codecov.
|
// If the topology is not dynamic, then the only thing we care | ||
// about is whether the node is bootstrapped or not because the | ||
// concept of durability as it relates to shard state doesn't | ||
// make sense if we're using a static topology. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm is this true? Should we perhaps be calling d.Database.IsBootstrappedAndDurable()
so that at least in the case where the topology is not dynamic it's had at least one snapshot, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spoke offline, existing implementation should be good but I added more context in the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* master: Change namespace URLs to be scoped by service (#1292) fix a typo (#1291) Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks (#1287) Fix Typo in m3dbnode.service. (#1288) [buildkite] set timeout properly (#1290) Add CompleteTags method for M3 storage and by default enable it (#1258) Refix flaky test. (#1285) Ignore autogenerated grammar peg file from metalinting (#1284) Deletes extraneous print line (#1282) Ignore copyright year changes in gen'd files (#1275) [query] Refactor encoded block conversion to use options (#1276) Update api query docs (#1279) Change query String() method to add any negated queries (#1262)
The existing health checks use IsBootstrapped() to determine if a node is healthy and bootstrapped which is fine in the normal case, but in the situation where a topology change has occurred this is problematic.
The reason its problematic in that scenario is that after a topology change, as a result of #1183 the node will not mark its shards as available until a snapshot has occurred. This is an issue for our automated tooling which will assume that once a node is "healthy" it can proceed with adding more nodes because we don't want the tooling to proceed until the previous node has marked all its shards as available otherwise we can end up in weird shards state (I.E R.F = 3 but we have 2 shard states AVAILABLE, 1 LEAVING, and 2 Initializing and 2/5 will not achieve quorum so writes / reads will begin to fail)