vtctld/vtorc: improve reparenting stats #13723
Merged: GuptaManan100 merged 9 commits into vitessio:main from timvaillancourt:stats-reparent_shard_operation_timings on Sep 28, 2023
Changes from 7 commits
Commits:
5d92fb6 vtctld/vtorc: improve reparenting stats (timvaillancourt)
594c88a Merge branch 'main' into stats-reparent_shard_operation_timings (timvaillancourt)
21eb2f9 Merge remote-tracking branch 'origin/main' into stats-reparent_shard_… (timvaillancourt)
51aa3aa Fix memorytopo unit test issue (timvaillancourt)
152d1bc Merge remote-tracking branch 'upstream/main' into stats-reparent_shar… (GuptaManan100)
df77a39 feat: address review comments and update release notes (GuptaManan100)
35c99f0 feat: augment e2e tests to also test for this metric (GuptaManan100)
1fbdf96 feat: move the failureResult and successResult to utils.go file since… (GuptaManan100)
ee99f03 test: fix conversion of value to an integer in the tests (GuptaManan100)
@@ -69,9 +69,14 @@ type EmergencyReparentOptions struct {

 // counters for Emergency Reparent Shard
 var (
-	ersCounter = stats.NewGauge("ers_counter", "Number of times Emergency Reparent Shard has been run")
-	ersSuccessCounter = stats.NewGauge("ers_success_counter", "Number of times Emergency Reparent Shard has succeeded")
-	ersFailureCounter = stats.NewGauge("ers_failure_counter", "Number of times Emergency Reparent Shard has failed")
+	// TODO(timvaillancourt): remove legacyERS* gauges in v19+.
+	legacyERSCounter = stats.NewGauge("ers_counter", "Number of times Emergency Reparent Shard has been run")
+	legacyERSSuccessCounter = stats.NewGauge("ers_success_counter", "Number of times Emergency Reparent Shard has succeeded")
+	legacyERSFailureCounter = stats.NewGauge("ers_failure_counter", "Number of times Emergency Reparent Shard has failed")
+
+	ersCounter = stats.NewCountersWithMultiLabels("emergency_reparent_counts", "Number of times Emergency Reparent Shard has been run",
+		[]string{"Keyspace", "Shard", "Result"},
+	)
 )

 // NewEmergencyReparenter returns a new EmergencyReparenter object, ready to
@@ -99,26 +104,33 @@ func NewEmergencyReparenter(ts *topo.Server, tmc tmclient.TabletManagerClient, l
 // keyspace and shard.
 func (erp *EmergencyReparenter) ReparentShard(ctx context.Context, keyspace string, shard string, opts EmergencyReparentOptions) (*events.Reparent, error) {
 	var err error
+	statsLabels := []string{keyspace, shard}
+
 	opts.lockAction = erp.getLockAction(opts.NewPrimaryAlias)
 	// First step is to lock the shard for the given operation, if not already locked
 	if err = topo.CheckShardLocked(ctx, keyspace, shard); err != nil {
 		var unlock func(*error)
 		ctx, unlock, err = erp.ts.LockShard(ctx, keyspace, shard, opts.lockAction)
 		if err != nil {
+			ersCounter.Add(append(statsLabels, failureResult), 1)
 			return nil, err
 		}
 		defer unlock(&err)
 	}

 	// dispatch success or failure of ERS
+	startTime := time.Now()
 	ev := &events.Reparent{}
 	defer func() {
+		reparentShardOpTimings.Add("EmergencyReparentShard", time.Since(startTime))

Review comment: Should we track the timings of all operations or success only? 🤔
Reply: May as well do all of them.

 		switch err {
 		case nil:
-			ersSuccessCounter.Add(1)
+			legacyERSSuccessCounter.Add(1)
+			ersCounter.Add(append(statsLabels, successResult), 1)
 			event.DispatchUpdate(ev, "finished EmergencyReparentShard")
 		default:
-			ersFailureCounter.Add(1)
+			legacyERSFailureCounter.Add(1)
+			ersCounter.Add(append(statsLabels, failureResult), 1)
 			event.DispatchUpdate(ev, "failed EmergencyReparentShard: "+err.Error())
 		}
 	}()
@@ -142,7 +154,7 @@ func (erp *EmergencyReparenter) getLockAction(newPrimaryAlias *topodatapb.Tablet
 func (erp *EmergencyReparenter) reparentShardLocked(ctx context.Context, ev *events.Reparent, keyspace, shard string, opts EmergencyReparentOptions) (err error) {
 	// log the starting of the operation and increment the counter
 	erp.logger.Infof("will initiate emergency reparent shard in keyspace - %s, shard - %s", keyspace, shard)
-	ersCounter.Add(1)
+	legacyERSCounter.Add(1)

 	var (
 		stoppedReplicationSnapshot *replicationSnapshot
Apparently the metric produces floats and not ints, so the comparison `successCount == countExpected` was always false. Only the `assert.EqualValues(t, countExpected, successCount)` check was passing. This meant that we always waited for 15 seconds whenever we entered this function! I have fixed this too as part of this PR.
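A minimal sketch of why such a comparison fails, assuming the metric value reaches the test as a JSON number decoded into an interface value (the PR does not show the test helper, so `decodeMetric`, `asInt` and the JSON payload below are hypothetical). Go's `encoding/json` decodes numbers into `float64` when the target is an interface, and an interface holding a `float64` never equals an `int`, even when the numeric values match.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeMetric parses a JSON object of metrics and returns the raw
// decoded value for name. Hypothetical helper for illustration.
func decodeMetric(body []byte, name string) (any, error) {
	var vars map[string]any
	if err := json.Unmarshal(body, &vars); err != nil {
		return nil, err
	}
	return vars[name], nil
}

// asInt converts a decoded JSON number to an int; ok is false when the
// value is not a float64.
func asInt(v any) (n int, ok bool) {
	f, ok := v.(float64)
	if !ok {
		return 0, false
	}
	return int(f), true
}

func main() {
	// Assumed payload shape; only the metric name comes from the diff.
	body := []byte(`{"ers_success_counter": 1}`)
	successCount, err := decodeMetric(body, "ers_success_counter")
	if err != nil {
		panic(err)
	}
	countExpected := 1

	// Dynamic type float64 vs int: the comparison is false.
	fmt.Println(successCount == countExpected)

	// Converting to int first makes the comparison meaningful.
	n, _ := asInt(successCount)
	fmt.Println(n == countExpected)
}
```

Converting once at the point where the metric is read keeps every later comparison in the test on plain integers.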