-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve SQL query telemetry collection with an adaptive sampling rate #70553
Labels
A-sql-logging-and-telemetry
Issues related to slow query log, SQL audit log, SQL internal logging telemetry, etc.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
GA-blocker
Comments
kevin-v-ngo
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
A-sql-logging-and-telemetry
Issues related to slow query log, SQL audit log, SQL internal logging telemetry, etc.
labels
Sep 22, 2021
xinhaoz
added a commit
to xinhaoz/cockroach
that referenced
this issue
Sep 27, 2021
Resolves cockroachdb#70553 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, at sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at some fixed rate. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log the event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold` has been removed. The default sampling rate for telemtry logging, `sql.telemetry.query_sampling.sample_rate` has been set to 0.1, i.e. 10 events (log lines) per second.
xinhaoz
added a commit
to xinhaoz/cockroach
that referenced
this issue
Sep 28, 2021
Resolves cockroachdb#70553 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, at sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at some fixed rate. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log the event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold` has been removed. The default sampling rate for telemtry logging, `sql.telemetry.query_sampling.sample_rate` has been set to 0.1, i.e. 10 events (log lines) per second.
xinhaoz
added a commit
to xinhaoz/cockroach
that referenced
this issue
Sep 29, 2021
Resolves cockroachdb#70553 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, as sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at a maximum frequency. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log a given event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold`, and `sql.telemetry.query_sampling.sample_rate` have been removed. A new setting, `sql.telemetry.query_logging.max_event_frequency` has been introduced, with default value of 10 events per second. Additionally, `sql.telemetry.query_sampling.enabled` has been renamed to `sql.telemetry.query_logging.enabled`.
xinhaoz
added a commit
to xinhaoz/cockroach
that referenced
this issue
Sep 29, 2021
Resolves cockroachdb#70553 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, as sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at a maximum frequency. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log a given event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold`, and `sql.telemetry.query_sampling.sample_rate` have been removed. A new setting, `sql.telemetry.query_sampling.max_event_frequency` has been introduced, with default value of 10 events per second.
xinhaoz
added a commit
to xinhaoz/cockroach
that referenced
this issue
Sep 29, 2021
Resolves cockroachdb#70553 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, as sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at a maximum frequency. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log a given event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold`, and `sql.telemetry.query_sampling.sample_rate` have been removed. A new setting, `sql.telemetry.query_sampling.max_event_frequency` has been introduced, with default value of 10 events per second.
Hi @xinhaoz, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
craig bot
pushed a commit
that referenced
this issue
Sep 30, 2021
70651: sql: deflake TestPerfLogging in event_log_test r=michae2 a=michae2 Fixes: #70649 One of our testcases was checking for no SQL_PERF log events of any kind, which is a bit too strict. Add a logRe so that we're specifically checking for no txn_rows_read_limit messages. Release note: None 70786: sql: use adaptive sampling rate for telemetry logging r=knz,yuzefovich,maryliag a=xinhaoz Resolves #70553 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, as sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at a maximum frequency. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log a given event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold`, and `sql.telemetry.query_sampling.sample_rate` have been removed. A new setting, `sql.telemetry.query_sampling.max_event_frequency` has been introduced, with default value of 10 events per second. ---------------------------- ### roachrod results cluster created with: ``` roachprod create $CLUSTER -n 4 --gce-machine-type=n1-standard-16 ``` results of tpcc 1000 ### Telemetry OFF ``` _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 151075 21.0 53.9 48.2 67.1 109.1 5905.6 delivery _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 1509788 209.7 42.7 25.2 39.8 83.9 9663.7 newOrder _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 151161 21.0 7.8 5.0 7.6 13.6 5905.6 orderStatus _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 1511404 209.9 22.4 13.6 23.1 46.1 9663.7 payment _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 151079 21.0 14.2 10.5 17.8 27.3 7247.8 stockLevel _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result 7200.0s 0 3474507 482.6 31.6 19.9 46.1 75.5 9663.7 Audit check 9.2.1.7: PASS Audit check 9.2.2.5.1: PASS Audit check 9.2.2.5.2: PASS Audit check 9.2.2.5.5: PASS Audit check 9.2.2.5.6: PASS Audit check 9.2.2.5.3: PASS Audit check 9.2.2.5.4: PASS _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 7200.0s 12581.5 97.8% 42.7 25.2 33.6 39.8 83.9 9663.7 ``` #### Telemetry ON ``` _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 150953 21.0 61.9 54.5 71.3 96.5 5637.1 delivery _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 1509498 209.7 36.8 26.2 39.8 60.8 6174.0 newOrder _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 151110 21.0 9.2 5.5 7.9 14.2 4831.8 orderStatus _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 1510864 209.8 34.0 14.2 23.1 35.7 8321.5 payment _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 7200.0s 0 151026 21.0 17.0 11.5 18.9 29.4 5637.1 stockLevel _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result 7200.0s 0 3473451 482.4 34.6 21.0 46.1 67.1 8321.5 Audit check 9.2.1.7: PASS Audit check 9.2.2.5.1: PASS Audit check 9.2.2.5.2: PASS Audit check 9.2.2.5.3: PASS Audit check 9.2.2.5.4: PASS Audit check 9.2.2.5.5: PASS Audit check 9.2.2.5.6: PASS _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 7200.0s 12579.1 97.8% 36.8 26.2 33.6 39.8 60.8 6174.0 ``` Co-authored-by: Michael Erickson <[email protected]> Co-authored-by: Xin Hao Zhang <[email protected]>
xinhaoz
added a commit
to xinhaoz/cockroach
that referenced
this issue
Sep 30, 2021
Resolves cockroachdb#70553 Release justification: category 4 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, as sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at a maximum frequency. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log a given event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold`, and `sql.telemetry.query_sampling.sample_rate` have been removed. A new setting, `sql.telemetry.query_sampling.max_event_frequency` has been introduced, with default value of 10 events per second.
andy-kimball
pushed a commit
that referenced
this issue
Oct 6, 2021
Resolves #70553 Release justification: category 4 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, as sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at a maximum frequency. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log a given event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold`, and `sql.telemetry.query_sampling.sample_rate` have been removed. A new setting, `sql.telemetry.query_sampling.max_event_frequency` has been introduced, with default value of 10 events per second.
11 tasks
2 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-sql-logging-and-telemetry
Issues related to slow query log, SQL audit log, SQL internal logging telemetry, etc.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
GA-blocker
In the new telemetry logging channel, we should update the sampling rate to ensure we maximize representative data of the customer workload while limiting the volume of data sent through the pipeline (< 10 events per second per node).
The proposal is outlined in the following doc: https://docs.google.com/document/d/1evo7ZtZZpsk8hKC888A_8f6dhM_utFCBPoMsWxr5YyA/edit
The text was updated successfully, but these errors were encountered: