ccl: support displaying database & table counts for serverless clusters #71573

Closed
matthewtodd opened this issue Oct 14, 2021 · 2 comments
Labels
A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@matthewtodd
Contributor

matthewtodd commented Oct 14, 2021

In the CockroachCloud console, the designs surface the number of databases and tables in a serverless cluster.

In implementing this (small!) feature, it is important that we

  • do not prevent the autoscaler from spinning down an otherwise idle SQL pod
  • do not consume our customers' RU (request unit) budget

In internal discussions, there was general agreement that SQL pods, when running, could publish these numbers as metrics in Prometheus to achieve these goals. (There was some discussion whether to use counters (create-drop) or gauges. We converged on gauges as being easier to keep true.) Alternatively, maintaining a cache in the cloud control plane could have worked but would have faced the usual invalidation issues.
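
To illustrate why gauges are "easier to keep true," here is a minimal sketch in plain Go. It does not use the real metrics package; the `event` type and both functions are hypothetical names invented for this example. A counter-style total built from create/drop deltas stays wrong if a pod misses an event, while a gauge set from the source of truth corrects itself on the next update.

```go
package main

import "fmt"

// event is a hypothetical schema-change notification:
// +1 for a created table, -1 for a dropped one.
type event int

// counterStyle derives the table count by summing create/drop deltas.
// Any event this pod never observes leaves the total wrong indefinitely.
func counterStyle(observed []event) int {
	total := 0
	for _, e := range observed {
		total += int(e)
	}
	return total
}

// gaugeStyle ignores history and sets the value from the source of truth.
// countTables stands in for a real catalog query (an assumption here).
func gaugeStyle(countTables func() int) int {
	return countTables()
}

func main() {
	// Three tables were created and one dropped (two remain),
	// but this pod missed one of the create events.
	observed := []event{+1, +1, -1}
	actual := func() int { return 2 }

	fmt.Println("counter-style:", counterStyle(observed))
	fmt.Println("gauge-style:", gaugeStyle(actual))
}
```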

When to update the metrics? We considered simply polling for changes, publishing new values every 1-5 minutes. (It is acceptable for the values to drift, so long as they are eventually true.) But we rejected polling because, though the implementation would be simple and easy to reason about, it seemed an inelegant use of resources, given that schema changes are relatively rare. So we decided to instead hook into the various schema change events that would create or drop databases or tables.
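
The event-hook approach can be sketched roughly as follows. This is not the code in the pull request: `catalog`, `gauges`, `trueUp`, `createDatabase`, and `createTable` are all hypothetical names, and an in-memory map stands in for the real schema. The point is only that the recount runs when a schema change happens, not on a timer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// catalog is a hypothetical stand-in for the SQL pod's view of its schema.
type catalog struct {
	databases map[string][]string // database name -> table names
}

// counts walks the catalog and returns the database and table totals.
func (c *catalog) counts() (dbs, tables int64) {
	for _, ts := range c.databases {
		dbs++
		tables += int64(len(ts))
	}
	return dbs, tables
}

// gauges holds the two published values; atomics allow a metrics
// scraper to read them while a schema change is updating them.
type gauges struct {
	databases int64
	tables    int64
}

// trueUp recomputes both gauges from the catalog, discarding old values.
func (g *gauges) trueUp(c *catalog) {
	dbs, tables := c.counts()
	atomic.StoreInt64(&g.databases, dbs)
	atomic.StoreInt64(&g.tables, tables)
}

func (g *gauges) databaseCount() int64 { return atomic.LoadInt64(&g.databases) }
func (g *gauges) tableCount() int64    { return atomic.LoadInt64(&g.tables) }

// createDatabase applies the schema change, then runs the hook.
func createDatabase(c *catalog, g *gauges, db string) {
	if _, exists := c.databases[db]; !exists {
		c.databases[db] = nil
	}
	g.trueUp(c)
}

// createTable applies the schema change, then runs the hook.
func createTable(c *catalog, g *gauges, db, table string) {
	c.databases[db] = append(c.databases[db], table)
	g.trueUp(c)
}

func main() {
	c := &catalog{databases: map[string][]string{}}
	g := &gauges{}

	createDatabase(c, g, "app")
	createTable(c, g, "app", "users")
	createTable(c, g, "app", "orders")

	fmt.Println("databases:", g.databaseCount(), "tables:", g.tableCount())
}
```

A polling variant would call `trueUp` from a ticker loop instead, trading the per-change hook calls for periodic work even when nothing has changed.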

#72938 takes this approach, truing up database and table count gauges on the relevant schema changes, but it is not without its problems:

  • Despite the initial convergence on Prometheus, it is unclear if these values are really "metrics." It is a stretch to see them as time series values.
  • It is unclear (to this author) whether we can accept the cost of a synchronous query to crdb_internal.databases and crdb_internal.tables during a schema change and, frankly, what that cost is.
  • Without further intervention, only the SQL pod that ran the most recent schema change (and any SQL pods started thereafter) will publish true values for these metrics. While we do have a resolution scheme, it is inelegant in the client, requiring a separate query to Prometheus to see which pods to trust. And reaching for pod-to-pod communication is unacceptably dangerous in the midst of these transactions.

Where to go from here?

  • Is #72938 (server: export metrics for database and table count) salvageable?
    • Would its API work?
      • We would probably want to true up the metrics across pods somehow.
    • Would we adjust things about its implementation?
      • Perhaps a more regular architectural hook, rather than these sprinkled calls?
  • Or do we wait and contribute our use case to the upcoming log-based telemetry work?
  • Or do we retreat back up to the cloud console?
    • Using some cache this time?
    • Exploring queries that don't cost RUs?
    • Making sure we're clear on the cost of keeping a pod running.

Epic: CC-5060

Jira issue: CRDB-10657

@matthewtodd matthewtodd added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-observability Related to observability of the SQL layer T-sql-observability labels Oct 14, 2021
@matthewtodd matthewtodd changed the title server: export metrics for database & table count ccl: support displaying database & table counts for serverless clusters Dec 1, 2021
@ajwerner
Contributor

ajwerner commented Dec 2, 2021

In #67679 we're adding a job that watches for all changes to descriptors and materializes a view of them into the host tenant as zone configs. That thing can rather trivially track and checkpoint the counters we need here. Let's just make the reconciler job populate the relevant gauges. That's a cheap fix to an independently hard problem.

@matthewtodd
Contributor Author

Dropping as won't fix. Can reopen if we change our minds.

4 participants