roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness failed #70110
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ ed81878bb4697c3a0a09e69b8a5053566297b142:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 0babf97f52ed9e44036851b2a9868e17eeee95ed:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 95c3da44a6c38aa1ae22fe9de5549632b7527aa5:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 82e8782453b1cf14460d93b6bf8328a7b2964575:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 3cf3b0ea3b08cb24e4d6b84c6c237b856ce6b411:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 2ef24098cc88bab3958ad905f46b0ed7e74bd7db:
Same failure on other branches
My bisect for this just completed, and is pointing pretty conclusively at 5519e6d, which landed on the release-21.2 branch in #70759, 15 days ago. That lines up with the timing of this failure. Interestingly, it also lines up with the timing of #70977. We left that test with an understanding of how to improve the situation but without an understanding of what had caused the regression in the first place. So what's the common thread between this commit and these two tests? Could they all be related? I think so. The common thread is that both of these tests enable tracing:
cockroach/pkg/cmd/roachtest/tests/kv.go Line 274 in 9968d34
And the commit in question modified tracing in a way that likely made it more expensive. @tbg do you have thoughts on why this commit would make tracing more expensive, and how much more expensive? As a small additional data point (which I'm surprised I still had lying around), while investigating #70977 I did grab the following CPU profile at one point, which shows a significant amount of time in
If we wanted to fix this (I think we have to, as some users are likely running with tracing on across the board): cockroach/pkg/util/tracing/tracingpb/recorded_span.pb.go Lines 32 to 42 in 0e20ab8
We need to introduce two Message fields here, one for redacted information and one for unredacted information. We also add a cluster setting that determines whether traces are to be redacted. Depending on the setting, we populate one field or the other (never both). The code that returns the contents of the verbose trace selects whichever field is nonzero.
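For illustration only, here is a minimal Go sketch of that idea, using hypothetical type, field, and setting names (the real message lives in `recorded_span.pb.go` and would be generated from protobuf): one field carries the redactable form of a log message, the other the plain form, a cluster setting decides which one is populated, and readers return whichever is set.

```
package tracingsketch

import (
	"fmt"

	"github.com/cockroachdb/redact"
)

// logMessage is a hypothetical stand-in for the proposed proto change:
// two mutually exclusive fields, only one of which is ever set.
type logMessage struct {
	Redacted   string // redactable form, produced via redact.Sprintf
	Unredacted string // plain form, cheaper to produce
}

// recordEvent populates one field or the other depending on the
// (hypothetical) cluster setting that controls trace redaction.
func recordEvent(redactTraces bool, format string, args ...interface{}) logMessage {
	if redactTraces {
		return logMessage{Redacted: string(redact.Sprintf(format, args...))}
	}
	return logMessage{Unredacted: fmt.Sprintf(format, args...)}
}

// contents mirrors "selects whichever field is nonzero" for code that
// renders the verbose trace.
func (m logMessage) contents() string {
	if m.Redacted != "" {
		return m.Redacted
	}
	return m.Unredacted
}
```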
Note that the cluster setting would be KV-level. Tenants don't get the choice - they get whatever the KV-side setting is and need to live with that. In the future, tenants may also redact their own traces and use the setting for that, but none of the infra exists. The only redaction that exists right now is at the BatchRequest boundary on the KV side: Lines 1025 to 1040 in a814577
If the cluster setting is disabled, tenants will get nothing from the KV layer except for messages that have been redacted in their entirety (i.e. contain no information except the timestamp). So in Serverless, we expect to run with redaction enabled.
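A rough sketch of that boundary rule, again with made-up names rather than the actual KV-side code, might look like this: with redaction enabled the tenant sees the redacted text, and with it disabled the entry is blanked down to its timestamp before crossing the KV-Tenant boundary.

```
package tracingsketch

import "time"

// kvLogEntry is an illustrative (not actual) shape for a KV-side trace
// log entry: a timestamp plus a message in redacted and/or plain form.
type kvLogEntry struct {
	Time       time.Time
	Redacted   string // set only when trace redaction is enabled
	Unredacted string // must never be shown to a tenant as-is
}

// forTenant sketches the behavior described above. If redaction is
// enabled, the tenant receives the redacted message; otherwise the entry
// is redacted in its entirety and only the timestamp survives.
func forTenant(e kvLogEntry, redactionEnabled bool) kvLogEntry {
	if redactionEnabled && e.Redacted != "" {
		return kvLogEntry{Time: e.Time, Redacted: e.Redacted}
	}
	return kvLogEntry{Time: e.Time}
}
```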
I am 👍 for reducing the frequency of tracing to address this. However, I need @knz to confirm that such a reduction would adequately solve the issue, as they are more familiar with this logic than I am.
@tbg, what's missing here is an explanation of why your change caused these regressions, as well as what the impact on customers would be if we do whatever it is you're suggesting. Can you give a more detailed analysis that doesn't assume knowledge of this code?
The regression here is for verbose tracing and is due to the newly added functionality to redact log messages. I'm unsure at what level Serverless uses tracing - ideally not at all, and ideally tenants also cannot turn it on for all of their transactions. Turning tracing on globally was never a good idea due to the overhead, but some customers get away with it if they have enough headroom.

If we turn off redactability (we can do that by reverting the PR that introduces it), we have to in turn completely blank out the KV side of traces (for tenants). I was originally trying to avoid this, but at this point it seems like the way to go, at least until we better understand how expensive redaction is. Even with that done, it remains useful to restrict how much tenants can enable verbose tracing (via cluster settings or even just `SET tracing = on` in each session). This has always been expensive, and while it's unlikely to overwhelm our system the way it could on a dedicated cluster since it's limited to a few tenants at a time, it's something to keep an eye on.

My suggestion is that if we're going to strip the KV-side traces anyway, we may as well make them non-verbose, so that the perf overhead is limited to the SQL pods, where it affects only the tenant. In summary, I'm planning to do the following tomorrow:
I think we should measure the performance impact of redaction when verbose tracing is on before reverting. From the profile that Nathan posted, it doesn't look that great... Some latency percentile in the follower-reads test has shifted - which sounds bad - but I think we should quantify by how much.
I think it might have to be the opposite. If we sit down now and take measurements on the regression and discuss what is acceptable and what isn't, all while we are afraid to release, we are stopping the train for no reason. (Note that we're not primarily worried about Serverless, but about anyone who will run 21.2 GA.) We should move the problem out of the way and then quantify the overhead of tracing.
When we discovered that tenants could "trace into" KV, we considered this a security problem since no mechanism ensured that tenants could only receive data they are allowed to see. The KV team reacted to this with cockroachdb#70562, which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary.

Unfortunately, making traces redactable had a [significant] performance overhead. On simple selects, this overhead is in the 10-20% range (when verbose tracing is enabled), and it does not seem likely that we can get it to the low single digits, at any rate not in time for 21.2.0. This presents a stability risk, since some customers run with verbose tracing enabled for all queries. Once they switch to 21.2, their system may be underprovisioned as a result of the newly grown tracing overhead in the `log.{Ve,E}vent{,f}` family of methods.

This PR makes trace redactability a choice. Because we needed to support mixed-version clusters in cockroachdb#70562, this is not as big a lift as might have been expected. We still slightly regress in the (fairly sensitive) "select single row" benchmark, but it's a lot more palatable now:

```
$ benchstat old.txt new.txt
name                                           old time/op    new time/op    delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16    1.06ms ± 1%    1.07ms ± 1%  +1.25%  (p=0.002 n=7+7)

name                                           old alloc/op   new alloc/op   delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     155kB ± 2%     164kB ± 1%  +5.50%  (p=0.001 n=7+6)

name                                           old allocs/op  new allocs/op  delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     2.27k ± 1%     2.33k ± 1%  +2.39%  (p=0.001 n=7+7)
```

I did take a quick glance at the CPU profiles but nothing obvious turned up. I might have to look at an allocation profile.

Touches cockroachdb#70562.

[significant]: cockroachdb#70110 (comment)

Release note: None
There are some perf measurements over in #71610 now. I'd say it's pretty bad. Anyway, we have moved to revert, so the rest of the action is going to be on
When we discovered that tenants could "trace into" KV, we considered this a security problem since no mechanism ensured that tenants could only receive data they are allowed to see. The KV team reacted to this with cockroachdb#70562, which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary. Serverless has shipped with this redactability in place in order to prevent data leakage across the boundary with KV and now needs to get back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance overhead. On simple selects, this overhead is in the 10-20% range (when verbose tracing is enabled). This presents a stability risk, since some customers run with verbose tracing enabled for all queries. Once they switch to 21.2, their system may be underprovisioned as a result of the newly grown tracing overhead in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes via the additional cost of redacting log strings via the `redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to support mixed-version clusters in cockroachdb#70562, this is not as big a lift as might have been expected. We still slightly regress in the (fairly sensitive) "select single row" benchmark, but it's a lot more palatable now:

```
$ benchstat old.txt new.txt
name                                           old time/op    new time/op    delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16    1.06ms ± 1%    1.07ms ± 1%  +1.25%  (p=0.002 n=7+7)

name                                           old alloc/op   new alloc/op   delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     155kB ± 2%     164kB ± 1%  +5.50%  (p=0.001 n=7+6)

name                                           old allocs/op  new allocs/op  delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     2.27k ± 1%     2.33k ± 1%  +2.39%  (p=0.001 n=7+7)
```

I did take a quick glance at the CPU profiles but nothing obvious turned up. I might have to look at an allocation profile.

Touches cockroachdb#70562.

Original issue investigation that kicked off concerns about the performance impact of redaction: cockroachdb#70110 (comment)

Release note: None
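As a way to see where that per-event cost comes from, here is a small, self-contained micro-benchmark sketch (not the benchmark quoted above) that compares plain formatting against `redact.Sprintf` for a representative trace event. It assumes the `github.com/cockroachdb/redact` module is available and lives in a `_test.go` file; the message and arguments are made up.

```
package tracingsketch_test

import (
	"fmt"
	"testing"

	"github.com/cockroachdb/redact"
)

// BenchmarkEventFormatting contrasts the cost of formatting a trace
// event with and without redaction. The point is only to isolate the
// per-event formatting step; it does not reproduce the full tracing
// path exercised by log.{Ve,E}vent{,f}.
func BenchmarkEventFormatting(b *testing.B) {
	b.Run("fmt.Sprintf", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = fmt.Sprintf("scan completed: %d keys, resume span %s", 42, "/Table/53/1/{10-20}")
		}
	})
	b.Run("redact.Sprintf", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = redact.Sprintf("scan completed: %d keys, resume span %s", 42, "/Table/53/1/{10-20}")
		}
	})
}
```

It can be run with the usual `go test -bench` flags (adding `-benchmem` to see the allocation side of the overhead).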
When we discovered that tenants could "trace into" KV, we considered this a security problem since no mechanism ensured that tenants could only receive data they are allowed to see. The KV team reacted to this with cockroachdb#70562, which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary. Serverless has shipped with this redactability in place in order to prevent data leakage across the boundary with KV and now needs to get back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance overhead. On simple selects, this overhead is in the 10-20% range (when verbose tracing is enabled). This presents a stability risk, since some customers run with verbose tracing enabled for all queries. Once they switch to 21.2, their system may be underprovisioned as a result of the newly grown tracing overhead in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes via the additional cost of redacting log strings via the `redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to support mixed-version clusters in cockroachdb#70562, this is not as big a lift as might have been expected. The approach taken is that tracers and the spans they generate can be marked as "redactable". If the flag is enabled, logging events to the span will use full redaction via `redact.Sprintf`. If the flag is not enabled, we implement a manual "coarse redaction" of the entire message by surrounding it with redaction markers. We take a "fast path" by checking whether there are any existing redaction markers to escape in the message, on the assumption that usually there aren't. This approach maintains redactability (to varying degrees) of traces while reducing the performance hit of tracing in the case where we don't want detailed redactability. A further commit will enable redactability on tenant-bound traces, since we do want to maintain that functionality in multi-tenant deployments.

Original issue investigation that kicked off concerns about the performance impact of redaction: cockroachdb#70110 (comment)

----

Benchmarks

The base for all three benchmarks below is the code on `release-21.2` with the "drop traces for tenants" commit reverted. This reproduces a reasonable "base" state where tracing redaction isn't really introduced at all. Additionally, all benchmarks are run with the modifications introduced in the first two commits in cockroachdb#71610, which force tracing on in benchmarks by running `SET CLUSTER SETTING trace.debug.enable = true` in the regular and multinode scenarios shown below.

Below are tracing benchmarks for the conditional redaction that is in **this** commit.
```
~/go/src/github.com/cockroachdb/cockroach make-traces-redactable-and-default-to-off--v2* ≡ 19:36:43
❯ benchstat 83fc452-no-tracing-bench.txt 2bea5a2ac5-conditional-redaction-simple.txt
name                                          old time/op    new time/op    delta
Tracing/Cockroach/tracing=f/Scan1-8            612µs ± 9%     591µs ±16%       ~     (p=0.393 n=10+10)
Tracing/Cockroach/tracing=f/Insert-8           785µs ± 4%     767µs ± 6%       ~     (p=0.360 n=8+10)
Tracing/Cockroach/tracing=t/Scan1-8            597µs ± 1%     542µs ± 1%    -9.23%   (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8           762µs ± 3%     739µs ± 8%       ~     (p=0.315 n=10+10)
Tracing/MultinodeCockroach/tracing=f/Scan1-8   1.05ms ±30%    0.92ms ± 5%   -12.57%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8  1.19ms ±15%    1.13ms ± 9%    -5.39%  (p=0.035 n=10+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    979µs ± 3%     945µs ± 6%    -3.42%  (p=0.024 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8  1.54ms ±48%    1.19ms ±16%   -22.76%  (p=0.011 n=10+10)

name                                          old alloc/op   new alloc/op   delta
Tracing/Cockroach/tracing=f/Scan1-8            86.5kB ± 1%    90.2kB ± 3%    +4.29%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8            109kB ± 3%     111kB ± 0%    +2.68%  (p=0.005 n=9+7)
Tracing/Cockroach/tracing=t/Scan1-8            86.0kB ± 1%    89.8kB ± 0%    +4.42%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8            108kB ± 3%     112kB ± 3%    +3.05%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8    156kB ±37%     149kB ± 1%       ~     (p=0.050 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8   154kB ±10%     154kB ± 3%       ~     (p=0.382 n=8+8)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    145kB ± 1%     150kB ± 5%    +3.61%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8   157kB ±13%     183kB ±40%       ~     (p=0.237 n=8+10)

name                                          old allocs/op  new allocs/op  delta
Tracing/Cockroach/tracing=f/Scan1-8             1.44k ± 0%     1.50k ± 0%    +4.22%  (p=0.000 n=9+8)
Tracing/Cockroach/tracing=f/Insert-8            1.64k ± 0%     1.72k ± 0%    +4.39%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=t/Scan1-8             1.44k ± 0%     1.50k ± 0%    +4.33%  (p=0.000 n=8+7)
Tracing/Cockroach/tracing=t/Insert-8            1.64k ± 0%     1.72k ± 0%    +4.34%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8    2.41k ±11%     2.38k ± 1%       ~     (p=0.150 n=10+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8   2.15k ± 1%     2.22k ± 1%    +3.35%  (p=0.000 n=8+8)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    2.31k ± 1%     2.40k ± 4%    +4.19%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8   2.17k ± 6%     2.33k ±12%    +7.43%  (p=0.012 n=8+10)
```

These statistics are the *base* hit we take on the redaction work **prior** to any conditional redaction being added. This is what we're trying to improve on.
```
~/go/src/github.com/cockroachdb/cockroach make-traces-redactable-and-default-to-off--v2* ≡ 19:36:49
❯ benchstat 83fc452-no-tracing-bench.txt dc42e37-after-redaction.txt
name                                          old time/op    new time/op    delta
Tracing/Cockroach/tracing=f/Scan1-8            612µs ± 9%     693µs ± 2%   +13.22%  (p=0.000 n=10+10)
Tracing/Cockroach/tracing=f/Insert-8           785µs ± 4%     878µs ± 1%   +11.86%  (p=0.000 n=8+9)
Tracing/Cockroach/tracing=t/Scan1-8            597µs ± 1%     687µs ± 3%   +15.05%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8           762µs ± 3%     878µs ± 1%   +15.35%  (p=0.000 n=10+8)
Tracing/MultinodeCockroach/tracing=f/Scan1-8   1.05ms ±30%    1.06ms ± 2%       ~    (p=0.050 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8  1.19ms ±15%    1.28ms ±11%    +7.85%  (p=0.035 n=10+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    979µs ± 3%    1066µs ± 1%    +8.87%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8  1.54ms ±48%    1.26ms ± 2%       ~    (p=0.515 n=10+8)

name                                          old alloc/op   new alloc/op   delta
Tracing/Cockroach/tracing=f/Scan1-8            86.5kB ± 1%    98.0kB ± 1%   +13.33%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8            109kB ± 3%     125kB ± 1%   +15.09%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=t/Scan1-8            86.0kB ± 1%    97.9kB ± 1%   +13.90%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8            108kB ± 3%     125kB ± 1%   +15.53%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8    156kB ±37%     162kB ± 1%       ~     (p=0.059 n=9+8)
Tracing/MultinodeCockroach/tracing=f/Insert-8   154kB ±10%     197kB ±43%   +28.54%  (p=0.000 n=8+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    145kB ± 1%     162kB ± 1%   +11.81%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8   157kB ±13%     172kB ± 5%    +9.84%  (p=0.007 n=8+8)

name                                          old allocs/op  new allocs/op  delta
Tracing/Cockroach/tracing=f/Scan1-8             1.44k ± 0%     1.66k ± 0%   +15.59%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8            1.64k ± 0%     1.95k ± 0%   +18.48%  (p=0.000 n=9+8)
Tracing/Cockroach/tracing=t/Scan1-8             1.44k ± 0%     1.66k ± 0%   +15.46%  (p=0.000 n=8+9)
Tracing/Cockroach/tracing=t/Insert-8            1.64k ± 0%     1.95k ± 1%   +18.52%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8    2.41k ±11%     2.58k ± 1%       ~     (p=0.105 n=10+8)
Tracing/MultinodeCockroach/tracing=f/Insert-8   2.15k ± 1%     2.58k ±12%   +20.04%  (p=0.000 n=8+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    2.31k ± 1%     2.59k ± 1%   +12.09%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8   2.17k ± 6%     2.47k ± 1%   +13.76%  (p=0.000 n=8+8)
```

Touches cockroachdb#70562.

Release note: None
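To make the "redactable flag plus coarse-redaction fast path" described in the commit message above concrete, here is a hedged sketch of what the per-event decision could look like. The marker constants and the escaping step are placeholders, and the real implementation in the tracing package will differ in detail.

```
package tracingsketch

import (
	"fmt"
	"strings"

	"github.com/cockroachdb/redact"
)

// Illustrative marker constants; the actual markers and escaping rules
// used by the redact package may differ.
const (
	startMarker = "‹"
	endMarker   = "›"
)

// formatEvent sketches the conditional behavior. With redactable=true,
// the event is formatted with full redaction via redact.Sprintf. With
// redactable=false, the whole message is coarsely redacted by wrapping
// it in a single pair of markers; pre-existing markers are escaped only
// if present (the "fast path" assumes they usually aren't).
func formatEvent(redactable bool, format string, args ...interface{}) string {
	if redactable {
		return string(redact.Sprintf(format, args...))
	}
	msg := fmt.Sprintf(format, args...)
	// Fast path: most messages contain no markers, so no escaping pass.
	if strings.Contains(msg, startMarker) || strings.Contains(msg, endMarker) {
		msg = strings.ReplaceAll(msg, startMarker, "?")
		msg = strings.ReplaceAll(msg, endMarker, "?")
	}
	return startMarker + msg + endMarker
}
```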
When we discovered that tenants could "trace into" KV, we considered this a security problem since no mechanism ensured that tenants could only receive data they are allowed to see. The KV team reacted to this with cockroachdb#70562, which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary. Serverless has shipped with this redactability in place in order to prevent data leakage across the boundary with KV and now needs to get back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance overhead. On simple selects, this overhead is in the 10-20% range (when verbose tracing is enabled). This presents a stability risk, since some customers run with verbose tracing enabled for all queries. Once they switch to 21.2, their system may be underprovisioned as a result of the newly grown tracing overhead in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes via the additional cost of redacting log strings via the `redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to support mixed-version clusters in cockroachdb#70562, this is not as big a lift as might have been expected. The approach taken is that tracers and the spans they generate can be marked as "redactable". If the flag is enabled, logging events to the span will use full redaction via `redact.Sprintf`. If the flag is not enabled, a coarse redaction of the entire message is done instead. A further commit will enable redactability on tenant-bound traces, since we do want to maintain that functionality in multi-tenant deployments.

Original issue investigation that kicked off concerns about the performance impact of redaction: cockroachdb#70110 (comment)

Touches cockroachdb#70562.

Release note: None
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 8d491ced731e13da2377cda9961f577d7487d6a0:
Reproduce
See: roachtest README
Same failure on other branches
Epic CRDB-10559