Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve SQL query telemetry collection with an adaptive sampling rate #70553

Closed
kevin-v-ngo opened this issue Sep 22, 2021 · 1 comment · Fixed by #70786
Closed

Improve SQL query telemetry collection with an adaptive sampling rate #70553

kevin-v-ngo opened this issue Sep 22, 2021 · 1 comment · Fixed by #70786
Assignees
Labels
A-sql-logging-and-telemetry Issues related to slow query log, SQL audit log, SQL internal logging telemetry, etc. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker

Comments

@kevin-v-ngo
Copy link

In the new telemetry logging channel, we should update the sampling rate to ensure we maximize representative data of the customer workload while limiting the volume of data sent through the pipeline (< 10 events per second per node).

The proposal is outlined in the following doc: https://docs.google.com/document/d/1evo7ZtZZpsk8hKC888A_8f6dhM_utFCBPoMsWxr5YyA/edit

@kevin-v-ngo kevin-v-ngo added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-logging-and-telemetry Issues related to slow query log, SQL audit log, SQL internal logging telemetry, etc. labels Sep 22, 2021
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Sep 27, 2021
Resolves cockroachdb#70553

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, at sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at some fixed rate.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log the event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold` has been removed.
The default sampling rate for telemtry logging,
`sql.telemetry.query_sampling.sample_rate` has been set to 0.1,
i.e. 10 events (log lines) per second.
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Sep 28, 2021
Resolves cockroachdb#70553

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, at sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at some fixed rate.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log the event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold` has been removed.
The default sampling rate for telemtry logging,
`sql.telemetry.query_sampling.sample_rate` has been set to 0.1,
i.e. 10 events (log lines) per second.
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Sep 29, 2021
Resolves cockroachdb#70553

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold`, and
`sql.telemetry.query_sampling.sample_rate` have been removed.
A new setting, `sql.telemetry.query_logging.max_event_frequency`
has been introduced, with default value of 10 events per second.
Additionally, `sql.telemetry.query_sampling.enabled` has been
renamed to `sql.telemetry.query_logging.enabled`.
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Sep 29, 2021
Resolves cockroachdb#70553

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold`, and
`sql.telemetry.query_sampling.sample_rate` have been removed.
A new setting, `sql.telemetry.query_sampling.max_event_frequency`
has been introduced, with default value of 10 events per second.
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Sep 29, 2021
Resolves cockroachdb#70553

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold`, and
`sql.telemetry.query_sampling.sample_rate` have been removed.
A new setting, `sql.telemetry.query_sampling.max_event_frequency`
has been introduced, with default value of 10 events per second.
@blathers-crl
Copy link

blathers-crl bot commented Sep 30, 2021

Hi @xinhaoz, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

craig bot pushed a commit that referenced this issue Sep 30, 2021
70651: sql: deflake TestPerfLogging in event_log_test r=michae2 a=michae2

Fixes: #70649

One of our testcases was checking for no SQL_PERF log events of any
kind, which is a bit too strict. Add a logRe so that we're specifically
checking for no txn_rows_read_limit messages.

Release note: None

70786: sql: use adaptive sampling rate for telemetry logging r=knz,yuzefovich,maryliag a=xinhaoz

Resolves #70553

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold`, and
`sql.telemetry.query_sampling.sample_rate` have been removed.
A new setting, `sql.telemetry.query_sampling.max_event_frequency`
has been introduced, with default value of 10 events per second.

----------------------------
### roachrod results
cluster created with:
```
roachprod create $CLUSTER -n 4 --gce-machine-type=n1-standard-16
```
results of tpcc 1000
### Telemetry OFF
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151075           21.0     53.9     48.2     67.1    109.1   5905.6  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1509788          209.7     42.7     25.2     39.8     83.9   9663.7  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151161           21.0      7.8      5.0      7.6     13.6   5905.6  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1511404          209.9     22.4     13.6     23.1     46.1   9663.7  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151079           21.0     14.2     10.5     17.8     27.3   7247.8  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 7200.0s        0        3474507          482.6     31.6     19.9     46.1     75.5   9663.7
Audit check 9.2.1.7: PASS
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    12581.5  97.8%     42.7     25.2     33.6     39.8     83.9   9663.7
```
#### Telemetry ON
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         150953           21.0     61.9     54.5     71.3     96.5   5637.1  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1509498          209.7     36.8     26.2     39.8     60.8   6174.0  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151110           21.0      9.2      5.5      7.9     14.2   4831.8  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1510864          209.8     34.0     14.2     23.1     35.7   8321.5  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151026           21.0     17.0     11.5     18.9     29.4   5637.1  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 7200.0s        0        3473451          482.4     34.6     21.0     46.1     67.1   8321.5
Audit check 9.2.1.7: PASS
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    12579.1  97.8%     36.8     26.2     33.6     39.8     60.8   6174.0
```


Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
@craig craig bot closed this as completed in ffaf876 Sep 30, 2021
xinhaoz added a commit to xinhaoz/cockroach that referenced this issue Sep 30, 2021
Resolves cockroachdb#70553

Release justification: category 4

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold`, and
`sql.telemetry.query_sampling.sample_rate` have been removed.
A new setting, `sql.telemetry.query_sampling.max_event_frequency`
has been introduced, with default value of 10 events per second.
andy-kimball pushed a commit that referenced this issue Oct 6, 2021
Resolves #70553

Release justification: category 4

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold`, and
`sql.telemetry.query_sampling.sample_rate` have been removed.
A new setting, `sql.telemetry.query_sampling.max_event_frequency`
has been introduced, with default value of 10 events per second.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-logging-and-telemetry Issues related to slow query log, SQL audit log, SQL internal logging telemetry, etc. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants