ccl: support displaying database & table counts for serverless clusters #71573

matthewtodd · 2021-10-14T16:17:38Z

Designs for the CockroachCloud console surface the number of databases and tables in a serverless cluster.

In implementing this (small!) feature, it is important that we:

- do not prevent the autoscaler from spinning down an otherwise idle SQL pod
- do not consume our customer's RU (request unit) budget

In internal discussions, there was general agreement that SQL pods, when running, could publish these numbers as metrics in Prometheus to achieve these goals. (There was some discussion whether to use counters (create-drop) or gauges. We converged on gauges as being easier to keep true.) Alternatively, maintaining a cache in the cloud control plane could have worked but would have faced the usual invalidation issues.
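To illustrate why gauges seemed easier to keep true, here is a minimal sketch in plain Go. It does not use the Prometheus client; `counterPair`, `gauge`, and `afterMissedEvents` are hypothetical names invented for this example.

```go
package main

// counterPair models the "counters (create-drop)" idea: the displayed value
// is created minus dropped, so it is only right if every event was observed.
type counterPair struct {
	created, dropped int64
}

func (c *counterPair) value() int64 { return c.created - c.dropped }

// gauge models the approach we converged on: the value is overwritten with a
// freshly computed count, so events missed earlier stop mattering.
type gauge struct {
	v int64
}

func (g *gauge) set(v int64)  { g.v = v }
func (g *gauge) value() int64 { return g.v }

// afterMissedEvents simulates a pod that starts when `existing` tables are
// already present and then observes a single CREATE TABLE.
func afterMissedEvents(existing int64) (counterView, gaugeView int64) {
	c := &counterPair{}
	c.created++ // the only event this pod saw

	g := &gauge{}
	g.set(existing + 1) // recomputed from the true current state

	return c.value(), g.value()
}
```

With two tables already present, the counter pair would report 1 while the gauge reports 3, which is the drift that setting a gauge from the current state avoids.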

When to update the metrics? We considered simply polling for changes, publishing new values every 1-5 minutes. (It is acceptable for the values to drift, so long as they are eventually true.) But we rejected polling because, though the implementation would be simple and easy to reason about, it seemed an inelegant use of resources, given that schema changes are relatively rare. So we decided to instead hook into the various schema change events that would create or drop databases or tables.
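A rough sketch of the event-driven alternative, assuming a hypothetical `countFn` that stands in for whatever query produces the current counts. The type and method names here are illustrative only and are not taken from #72938.

```go
package main

import "sync"

// countFn is a hypothetical stand-in for the query that returns the current
// number of databases and tables.
type countFn func() (databases, tables int64, err error)

// schemaGauges holds the two published values.
type schemaGauges struct {
	mu        sync.Mutex
	databases int64
	tables    int64
}

// trueUp recomputes both gauges. In this sketch it would be called from each
// schema-change path that creates or drops a database or table, instead of
// from a 1-5 minute timer.
func (g *schemaGauges) trueUp(count countFn) error {
	dbs, tbls, err := count()
	if err != nil {
		return err // keep the previously published values on failure
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	g.databases, g.tables = dbs, tbls
	return nil
}

// snapshot returns the values currently published.
func (g *schemaGauges) snapshot() (databases, tables int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.databases, g.tables
}
```

The trade-off is visible in the shape of the code: the work happens only when a schema change occurs, but every relevant code path has to remember to call `trueUp`.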

#72938 takes this approach, truing up database and table count gauges on the relevant schema changes, but it is not without its problems:

- Despite the initial convergence on Prometheus, it is unclear if these values are really "metrics." It is a stretch to see them as time series values.
- It is unclear (to this author) whether we can accept the cost of a synchronous query to crdb_internal.databases and .tables during a schema change -- and, frankly, what that cost is.
- Without further intervention, only the SQL pod that ran the most recent schema change (and any SQL pods started thereafter) will publish true values for these metrics. While we do have a resolution scheme, it is inelegant in the client, requiring a separate query to Prometheus to see which pods to trust. And reaching for pod-to-pod communication is unacceptably dangerous in the midst of these transactions.
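For concreteness, one possible shape for the client-side resolution is sketched below. It assumes, hypothetically, that each pod also publishes when it last trued up its gauges; `podSample`, `trueUpAt`, and `freshest` are invented for this example and do not describe the actual scheme.

```go
package main

// podSample is a hypothetical shape for one pod's reported values.
type podSample struct {
	pod      string
	tables   int64
	trueUpAt int64 // hypothetical: unix seconds of the pod's last true-up
}

// freshest returns the sample from the pod that trued up most recently.
// ok is false when there are no samples, e.g. when every pod is scaled down.
func freshest(samples []podSample) (best podSample, ok bool) {
	for _, s := range samples {
		if !ok || s.trueUpAt > best.trueUpAt {
			best, ok = s, true
		}
	}
	return best, ok
}
```

Even in this simplified form the client needs a second piece of data per pod before it can trust the first, which is the inelegance noted above.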

Where to go from here?

- Is #72938 (server: export metrics for database and table count) salvageable?
  - Would its API work?
    - We would probably want to true up the metrics across pods somehow.
  - Would we adjust things about its implementation?
    - Perhaps a more regular architectural hook, rather than these sprinkled calls?
- Or do we wait and contribute our use case to the upcoming log-based telemetry work?
- Or do we retreat back up to the cloud console?
  - Using some cache this time?
  - Exploring queries that don't cost RUs?
  - Making sure we're clear on the cost of keeping a pod running.

Epic: CC-5060

Jira issue: CRDB-10657

ajwerner · 2021-12-02T21:25:19Z

In #67679 we're adding a job that watches for all changes to descriptors and materializes a view of them into the host tenant as zone configs. That thing can rather trivially track and checkpoint the counters we need here. Let's just make the reconciler job populate the relevant gauges. That's a cheap fix to an independently hard problem.

matthewtodd · 2022-05-03T14:55:42Z

Dropping as won't fix. Can reopen if we change our minds.

matthewtodd added the C-enhancement, A-sql-observability, and T-sql-observability labels Oct 14, 2021

maryliag assigned matthewtodd and abarganier Oct 19, 2021

exalate-issue-sync bot unassigned abarganier Nov 5, 2021

matthewtodd mentioned this issue Nov 18, 2021

server: export metrics for database and table count #72938

Closed

matthewtodd changed the title ~~server: export metrics for database & table count~~ ccl: support displaying database & table counts for serverless clusters Dec 1, 2021

matthewtodd closed this as completed May 3, 2022

jlinder added the sync-me-3 label May 24, 2022
