Prevent Race Conditions Between Tablet Deletes and Updates #9237
Merged: deepthi merged 5 commits into vitessio:main from planetscale:TabletHealthcheckCorrectness on Nov 21, 2021
mattlord force-pushed the TabletHealthcheckCorrectness branch 4 times, most recently from 04da7cc to 554aa72 (November 15, 2021 23:18)
There was a race condition between deleting a tablet's healthcheck record from the authoritative map (hc.healthByAlias) and updating the same tablet's health data record (hc.healthData). This could cause us to effectively re-add an updated copy of the tablet healthcheck record after it had been deleted. This then leads to "zombie" tablet records in the SHOW VITESS_TABLETS output, as it is based on what is in the hc.healthData map: https://github.com/vitessio/vitess/blob/693c5dbdeacdd7a705b46ebce6776a5256c8cfef/go/vt/discovery/healthcheck.go#L537-L557
Also purge all potential healthcheck records for a tablet alias by type on delete.
Signed-off-by: Matt Lord <[email protected]>
mattlord force-pushed the TabletHealthcheckCorrectness branch from 554aa72 to 17737ab (November 15, 2021 23:49)
mattlord changed the title from "Purge all potential healthcheck records for a tablet alias on delete" to "Prevent Race Conditions Between Tablet Healthcheck Deletes and Updates" (Nov 15, 2021)
mattlord changed the title from "Prevent Race Conditions Between Tablet Healthcheck Deletes and Updates" to "Prevent Race Condition Between Tablet Deletes and Updates" (Nov 15, 2021)
deepthi reviewed (Nov 16, 2021)
mattlord force-pushed the TabletHealthcheckCorrectness branch 2 times, most recently from f5ce43a to 2ff2eeb (November 16, 2021 00:38)
Signed-off-by: Matt Lord <[email protected]>
mattlord force-pushed the TabletHealthcheckCorrectness branch from 2ff2eeb to 175f400 (November 16, 2021 00:42)
Signed-off-by: Matt Lord <[email protected]>
mattlord force-pushed the TabletHealthcheckCorrectness branch from 135863d to a9e1b47 (November 16, 2021 02:44)
mattlord requested review from frouioui, harshit-gangal and systay as code owners (November 16, 2021 03:05)
Signed-off-by: Matt Lord <[email protected]>
mattlord force-pushed the TabletHealthcheckCorrectness branch from 411f9ee to 5106024 (November 19, 2021 19:57)
mattlord force-pushed the TabletHealthcheckCorrectness branch from 6f3bc97 to a6f3a52 (November 19, 2021 20:05)
mattlord changed the title from "Prevent Race Condition Between Tablet Deletes and Updates" to "Prevent Race Conditions Between Tablet Deletes and Updates" (Nov 19, 2021)
deepthi reviewed (Nov 19, 2021)
deepthi approved these changes (Nov 19, 2021)
Found a tiny typo in comments, otherwise LGTM.
Nice work!
This ensures we are not assuming any default value and the tests will run in less time. Signed-off-by: Matt Lord <[email protected]>
mattlord force-pushed the TabletHealthcheckCorrectness branch from a6f3a52 to cfbc6e8 (November 19, 2021 20:24)
sougou approved these changes (Nov 19, 2021)
Description
There was a race condition between deleting a tablet's healthcheck record from the authoritative map (hc.healthByAlias) and updating the same tablet's health data record (hc.healthData). This could cause us to effectively re-add an updated copy of the tablet healthcheck record after it had been deleted. The delete happens when the open stream with the tablet encounters an error, often a context being cancelled or a gRPC goaway error seen due to the tablet shutting down. This then leads to "zombie" tablet records in the SHOW VITESS_TABLETS output, as it is based on what is in the hc.healthData map, and that map had records for tablets that were no longer in hc.healthByAlias (so the authoritative map was no longer authoritative):
vitess/go/vt/vtgate/executor.go
Line 1108 in 693c5db
...
vitess/go/vt/discovery/healthcheck.go
Lines 537 to 557 in 693c5db
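To make the failure mode concrete, here is a minimal sketch of the interleaving described above. It is not the Vitess code: the types, the flat map keyed by alias, and the helper names (addTablet, unguardedUpdate, zombies) are hypothetical simplifications, and the real hc.healthData has a richer key structure. It only illustrates that a mutex serializes the two operations without deciding which one runs last.

```go
package main

import (
	"fmt"
	"sync"
)

// tabletHealth is a simplified, hypothetical stand-in for a healthcheck record.
type tabletHealth struct {
	Alias   string
	Serving bool
}

// healthCheck mirrors only the two-map layout described in the PR.
type healthCheck struct {
	mu            sync.Mutex
	healthByAlias map[string]*tabletHealth // authoritative map
	healthData    map[string]*tabletHealth // map that the tablet listing is built from
}

func newHealthCheck() *healthCheck {
	return &healthCheck{
		healthByAlias: map[string]*tabletHealth{},
		healthData:    map[string]*tabletHealth{},
	}
}

func (hc *healthCheck) addTablet(alias string) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	th := &tabletHealth{Alias: alias, Serving: true}
	hc.healthByAlias[alias] = th
	hc.healthData[alias] = th
}

func (hc *healthCheck) deleteTablet(alias string) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	delete(hc.healthByAlias, alias)
	delete(hc.healthData, alias)
}

// unguardedUpdate writes into healthData without consulting healthByAlias.
func (hc *healthCheck) unguardedUpdate(th *tabletHealth) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	hc.healthData[th.Alias] = th
}

// zombies returns aliases present in healthData but absent from healthByAlias.
func (hc *healthCheck) zombies() []string {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	var out []string
	for alias := range hc.healthData {
		if _, ok := hc.healthByAlias[alias]; !ok {
			out = append(out, alias)
		}
	}
	return out
}

func main() {
	hc := newHealthCheck()
	hc.addTablet("zone1-100")
	// The stream handler prepared an updated copy of the record before the delete...
	stale := &tabletHealth{Alias: "zone1-100", Serving: false}
	hc.deleteTablet("zone1-100")
	// ...and applies it afterwards. The mutex allows this ordering.
	hc.unguardedUpdate(stale)
	fmt.Println(hc.zombies()) // prints [zone1-100]
}
```

The ordering is forced sequentially here so the result is deterministic; in a running process the same interleaving would come from two goroutines acquiring the lock in that order.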
Changes
- Purge all potential healthcheck records for a tablet alias in hc.deleteTablet(), covering the case where there are mismatching endpoints for the tablet indicating that the healthcheck record is no longer valid (this was the case before Remove tablet healthcheck cache record on error #9106)
- Prevent the race condition between hc.deleteTablet() and hc.updateHealth() by confirming that the tablet record still exists in the authoritative hc.healthByAlias map before updating the health of the tablet in the sibling hc.healthData map, because the mutex does not enforce order of operations. If it does NOT exist there then the update is a no-op (the old tablet health data record copy made here or here will go out of scope and be GC'd).
- The vtgate/tablet_healthcheck_cache/correctness end2end test was updated to verify the correct behavior (unfortunately the test originally added in Remove tablet healthcheck cache record on error #9106 was simply verifying bad behavior)
- Add the record to the hc.healthData map, if it doesn't exist, when updating the health record for a tablet that we have confirmed exists in the authoritative hc.healthByAlias map

With these changes, I could no longer repeat the zombie tablet record using the steps shown here, and we see the expected log messages indicating that we've avoided the race condition.
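The following is an illustrative sketch of the changes listed above. It is not the log output and not the actual Vitess implementation. It assumes a simplified healthData keyed by a plain tablet-type string and then by alias. The struct fields and the helpers put and count are hypothetical names introduced for the example. It shows three things: purging every per-type entry on delete, treating an update for a deleted tablet as a no-op, and creating the healthData entry on demand for a tablet that is still in the authoritative map.

```go
package main

import (
	"fmt"
	"sync"
)

// tabletHealth is a simplified, hypothetical healthcheck record.
type tabletHealth struct {
	Alias      string
	TabletType string
	Serving    bool
}

type healthCheck struct {
	mu            sync.Mutex
	healthByAlias map[string]*tabletHealth            // authoritative map
	healthData    map[string]map[string]*tabletHealth // tablet type -> alias -> record (assumed layout)
}

func newHealthCheck() *healthCheck {
	return &healthCheck{
		healthByAlias: map[string]*tabletHealth{},
		healthData:    map[string]map[string]*tabletHealth{},
	}
}

// put stores th in healthData, creating the per-type map if needed.
// The caller must hold hc.mu.
func (hc *healthCheck) put(th *tabletHealth) {
	if hc.healthData[th.TabletType] == nil {
		hc.healthData[th.TabletType] = map[string]*tabletHealth{}
	}
	hc.healthData[th.TabletType][th.Alias] = th
}

func (hc *healthCheck) addTablet(th *tabletHealth) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	hc.healthByAlias[th.Alias] = th
	hc.put(th)
}

// deleteTablet removes the alias from the authoritative map and purges
// every record for it in healthData, whatever tablet type it is filed under.
func (hc *healthCheck) deleteTablet(alias string) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	delete(hc.healthByAlias, alias)
	for _, byAlias := range hc.healthData {
		delete(byAlias, alias)
	}
}

// updateHealth applies th only if the tablet still exists in the
// authoritative map. It reports whether the update was applied.
func (hc *healthCheck) updateHealth(th *tabletHealth) bool {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	if _, ok := hc.healthByAlias[th.Alias]; !ok {
		return false // tablet was deleted; the stale copy is dropped
	}
	hc.put(th)
	return true
}

// count returns how many healthData records exist for alias.
func (hc *healthCheck) count(alias string) int {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	n := 0
	for _, byAlias := range hc.healthData {
		if _, ok := byAlias[alias]; ok {
			n++
		}
	}
	return n
}

func main() {
	hc := newHealthCheck()
	hc.addTablet(&tabletHealth{Alias: "zone1-100", TabletType: "replica", Serving: true})
	stale := &tabletHealth{Alias: "zone1-100", TabletType: "replica"}
	hc.deleteTablet("zone1-100")
	fmt.Println(hc.updateHealth(stale), hc.count("zone1-100")) // prints: false 0
}
```

Checking the authoritative map inside the same critical section as the write is what makes the late update harmless: whichever order the lock is acquired in, healthData can never gain a record that healthByAlias lacks.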
Related Issue(s)
Fixes: #9238
Properly Fixes: #8465
Checklist