Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks #1287

richardartoul · 2019-01-09T15:05:25Z

The existing health checks use IsBootstrapped() to determine if a node is healthy and bootstrapped which is fine in the normal case, but in the situation where a topology change has occurred this is problematic.

The reason its problematic in that scenario is that after a topology change, as a result of #1183 the node will not mark its shards as available until a snapshot has occurred. This is an issue for our automated tooling which will assume that once a node is "healthy" it can proceed with adding more nodes because we don't want the tooling to proceed until the previous node has marked all its shards as available otherwise we can end up in weird shards state (I.E R.F = 3 but we have 2 shard states AVAILABLE, 1 LEAVING, and 2 Initializing and 2/5 will not achieve quorum so writes / reads will begin to fail)

codecov · 2019-01-09T16:13:11Z

Codecov Report

Merging #1287 into master will decrease coverage by 1%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #1287     +/-   ##
========================================
- Coverage    70.6%   69.6%   -1.1%     
========================================
  Files         761     756      -5     
  Lines       64108   63908    -200     
========================================
- Hits        45292   44482    -810     
- Misses      15885   16472    +587     
- Partials     2931    2954     +23

Flag	Coverage Δ
#aggregator	`81.6% <ø> (-0.1%)`	⬇️
#cluster	`85.6% <ø> (-0.2%)`	⬇️
#collector	`78.4% <ø> (ø)`	⬆️
#dbnode	`78.4% <100%> (-2.5%)`	⬇️
#m3em	`73.2% <ø> (ø)`	⬆️
#m3ninx	`74.6% <ø> (-0.1%)`	⬇️
#m3nsch	`51.1% <ø> (ø)`	⬆️
#metrics	`18.2% <ø> (ø)`	⬆️
#msg	`74.9% <ø> (ø)`	⬆️
#query	`60.1% <ø> (ø)`	⬆️
#x	`75.2% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d56e81...cdf1e8d. Read the comment docs.

codecov · 2019-01-09T16:13:28Z

Codecov Report

Merging #1287 into master will decrease coverage by <.1%.
The diff coverage is 92.3%.

@@           Coverage Diff            @@
##           master   #1287     +/-   ##
========================================
- Coverage    70.6%   70.6%   -0.1%     
========================================
  Files         761     761             
  Lines       64108   64125     +17     
========================================
- Hits        45292   45278     -14     
- Misses      15885   15903     +18     
- Partials     2931    2944     +13

Flag	Coverage Δ
#aggregator	`81.6% <ø> (-0.1%)`	⬇️
#cluster	`85.6% <ø> (-0.2%)`	⬇️
#collector	`78.4% <ø> (ø)`	⬆️
#dbnode	`80.8% <92.3%> (-0.1%)`	⬇️
#m3em	`73.2% <ø> (ø)`	⬆️
#m3ninx	`74.6% <ø> (ø)`	⬆️
#m3nsch	`51.1% <ø> (ø)`	⬆️
#metrics	`18.2% <ø> (ø)`	⬆️
#msg	`74.9% <ø> (ø)`	⬆️
#query	`60.1% <ø> (ø)`	⬆️
#x	`75.2% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d56e81...65835a2. Read the comment docs.

robskillington · 2019-01-11T15:13:34Z

src/dbnode/storage/cluster/database.go

+		// If the topology is not dynamic, then the only thing we care
+		// about is whether the node is bootstrapped or not because the
+		// concept of durability as it relates to shard state doesn't
+		// make sense if we're using a static topology.


Hm is this true? Should we perhaps be calling d.Database.IsBootstrappedAndDurable() so that at least in the case where the topology is not dynamic it's had at least one snapshot, etc?

spoke offline, existing implementation should be good but I added more context in the comments.

robskillington

LGTM

* master: Change namespace URLs to be scoped by service (#1292) fix a typo (#1291) Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks (#1287) Fix Typo in m3dbnode.service. (#1288) [buildkite] set timeout properly (#1290) Add CompleteTags method for M3 storage and by default enable it (#1258) Refix flaky test. (#1285) Ignore autogenerated grammar peg file from metalinting (#1284) Deletes extraneous print line (#1282) Ignore copyright year changes in gen'd files (#1275) [query] Refactor encoded block conversion to use options (#1276) Update api query docs (#1279) Change query String() method to add any negated queries (#1262)

…ks (#1287)

use IsBootstrappedAndDurable in health checks

7030e68

richardartoul requested review from robskillington and schallert January 9, 2019 15:05

Richard Artoul added 4 commits January 9, 2019 10:21

fix ci

fc23837

Fix integration tests

f58265c

fix comment

34bafe1

format comment

cdf1e8d

Richard Artoul added 6 commits January 9, 2019 11:13

dont add space

0b1845c

Add IsBootstrappedAndDurable to cluster

6121853

fix integration tests

f209369

fix flaky test

ccf4dc8

cleanup servers in tests

6b5a5df

remove unnecessary code

79ca143

richardartoul added PR: Awaiting Review [db] PR: Review Priority High labels Jan 11, 2019

robskillington reviewed Jan 11, 2019

View reviewed changes

robskillington approved these changes Jan 11, 2019

View reviewed changes

improve IsBootstrappedAndDurable comment

65835a2

richardartoul merged commit d12f6fa into master Jan 11, 2019

richardartoul mentioned this pull request Jan 14, 2019

Map of buckets in buffer; setup for out of order writes #1083

Merged

justinjc deleted the ra/bootstrapped_and_durable_health branch January 22, 2019 15:50

robskillington pushed a commit that referenced this pull request Jan 28, 2019

Use IsBootstrappedAndDurable instead of IsBootstrapped in health chec…

4d8f768

…ks (#1287)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks #1287

Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks #1287

richardartoul commented Jan 9, 2019 •

edited

Loading

codecov bot commented Jan 9, 2019 •

edited

Loading

codecov bot commented Jan 9, 2019 •

edited

Loading

robskillington Jan 11, 2019

richardartoul Jan 11, 2019

robskillington left a comment

Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks #1287

Use IsBootstrappedAndDurable instead of IsBootstrapped in health checks #1287

Conversation

richardartoul commented Jan 9, 2019 • edited Loading

codecov bot commented Jan 9, 2019 • edited Loading

Codecov Report

codecov bot commented Jan 9, 2019 • edited Loading

Codecov Report

robskillington Jan 11, 2019

Choose a reason for hiding this comment

richardartoul Jan 11, 2019

Choose a reason for hiding this comment

robskillington left a comment

Choose a reason for hiding this comment

richardartoul commented Jan 9, 2019 •

edited

Loading

codecov bot commented Jan 9, 2019 •

edited

Loading

codecov bot commented Jan 9, 2019 •

edited

Loading