Feature Request: add PRS stats and reparent timings to vtctld #13725

Closed
timvaillancourt opened this issue Aug 5, 2023 · 1 comment · Fixed by #13723

Comments

@timvaillancourt
Contributor

timvaillancourt commented Aug 5, 2023

Feature Description

This FR is to add more visibility into reparenting operations in vtctld. Currently there are three counters covering Emergency Reparent Shard (ERS), but none for Planned Reparent Shard (PRS):

  1. ers_counter
  2. ers_failure_counter
  3. ers_success_counter

Also, these metrics don't explain the performance of the reparent operation, which has been important in some investigations we've run into in production.

I would like to add:

  1. Counters for PRS operations (same ones as ERS)
    • prs_counter
    • prs_failure_counter
    • prs_success_counter
  2. Add the Keyspace label to all ERS and PRS counters
    • Shard would be nice too but the metric cardinality would be an issue
  3. Timings for both ERS and PRS operations

Finally, an opinion/question: I don't think names prefixed with ers_ and prs_ are very friendly; something more verbose, such as emergency_reparent_shard_ and planned_reparent_shard_, would be easier to search for in graphing tools, etc. This may be a good time to rename if this is a concern to others, but it's totally optional.

Use Case(s)

  • Observing the frequency of ERS + PRS (by Keyspace)
  • Observing the timing/performance of ERS + PRS
    • This is useful during investigations where the timeline and performance of these operations need to be understood

timvaillancourt added the "Needs Triage" and "Type: Feature" labels Aug 5, 2023
rohit-nayak-ps added the "Component: Cluster management" label and removed the "Needs Triage" label Aug 7, 2023
@deepthi
Member

deepthi commented Aug 10, 2023

This is a good idea. We should add Shard as a label as well, or combine keyspace/shard into one label (cluster?) regardless of cardinality because that is the level at which the operation happens and having just keyspace may not be sufficient.
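
To make the cardinality trade-off concrete, here is a small sketch that compares how many distinct label values (and so metric series per counter) result from labelling by keyspace alone versus a combined keyspace/shard label. The `shardRef` type, the `clusterLabel` format, and the example keyspaces and shards are assumptions for illustration only.

```go
package main

import "fmt"

// shardRef identifies one shard; the type is hypothetical.
type shardRef struct {
	Keyspace string
	Shard    string
}

// clusterLabel combines keyspace and shard into a single label value.
// The "keyspace/shard" format is an assumption, not an agreed convention.
func clusterLabel(r shardRef) string {
	return r.Keyspace + "/" + r.Shard
}

// seriesCount returns the number of distinct label values produced for refs,
// which is the number of series a single counter would expose.
func seriesCount(refs []shardRef, label func(shardRef) string) int {
	seen := make(map[string]struct{})
	for _, r := range refs {
		seen[label(r)] = struct{}{}
	}
	return len(seen)
}

func main() {
	refs := []shardRef{
		{Keyspace: "commerce", Shard: "-80"},
		{Keyspace: "commerce", Shard: "80-"},
		{Keyspace: "customer", Shard: "0"},
	}
	byKeyspace := seriesCount(refs, func(r shardRef) string { return r.Keyspace })
	byCluster := seriesCount(refs, clusterLabel)
	fmt.Println(byKeyspace, byCluster) // prints: 2 3
}
```

With a combined label the series count grows with the total number of shards rather than the number of keyspaces, which is the cost being weighed against per-shard visibility.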
