roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness failed #70110
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ ed81878bb4697c3a0a09e69b8a5053566297b142:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 0babf97f52ed9e44036851b2a9868e17eeee95ed:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 95c3da44a6c38aa1ae22fe9de5549632b7527aa5:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 82e8782453b1cf14460d93b6bf8328a7b2964575:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 3cf3b0ea3b08cb24e4d6b84c6c237b856ce6b411:
Same failure on other branches
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 2ef24098cc88bab3958ad905f46b0ed7e74bd7db:
Same failure on other branches
My bisect for this just completed, and is pointing pretty conclusively at 5519e6d, which landed on the release-21.2 branch in #70759, 15 days ago. That lines up with the timing of this failure. Interestingly, it also lines up with the timing of #70977. We left that test with an understanding of how to improve the situation but without an understanding of what had caused the regression in the first place. So what's the common thread between this commit and these two tests? Could they all be related? I think so. The common thread is that both of these tests enable tracing:
cockroach/pkg/cmd/roachtest/tests/kv.go Line 274 in 9968d34
And the commit in question modified tracing in a way that likely made it more expensive. @tbg do you have thoughts on why this commit would make tracing more expensive, and how much more expensive? As a small additional data point (which I'm surprised I still had lying around), while investigating #70977 I did grab the following CPU profile at one point, which shows a significant amount of time in
If we wanted to fix this (I think we have to, as some users are likely running with tracing on across the board): cockroach/pkg/util/tracing/tracingpb/recorded_span.pb.go Lines 32 to 42 in 0e20ab8
We need to introduce two Message fields here, one for redacted information and one for unredacted information. We also add a cluster setting that determines whether traces are to be redacted. Depending on the setting, we populate one field or the other (never both). The code that returns the contents of the verbose trace selects whichever field is nonzero.
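For illustration only, here is a minimal Go sketch of that idea, using hypothetical type, field, and setting names (the real message lives in `recorded_span.pb.go` and would be generated from protobuf): one field carries the redactable form of a log message, the other the plain form, a cluster setting decides which one is populated, and readers return whichever is set.

```
package tracingsketch

import (
	"fmt"

	"github.com/cockroachdb/redact"
)

// logMessage is a hypothetical stand-in for the proposed proto change:
// two mutually exclusive fields, only one of which is ever set.
type logMessage struct {
	Redacted   string // redactable form, produced via redact.Sprintf
	Unredacted string // plain form, cheaper to produce
}

// recordEvent populates one field or the other depending on the
// (hypothetical) cluster setting that controls trace redaction.
func recordEvent(redactTraces bool, format string, args ...interface{}) logMessage {
	if redactTraces {
		return logMessage{Redacted: string(redact.Sprintf(format, args...))}
	}
	return logMessage{Unredacted: fmt.Sprintf(format, args...)}
}

// contents mirrors "selects whichever field is nonzero" for code that
// renders the verbose trace.
func (m logMessage) contents() string {
	if m.Redacted != "" {
		return m.Redacted
	}
	return m.Unredacted
}
```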
Note that the cluster setting would be KV-level. Tenants don't get the choice - they get whatever the KV-side setting is and need to live with that. In the future, tenants may also redact their own traces and use the setting for that, but none of the infra exists. The only redaction that exists right now is at the BatchRequest boundary on the KV side: Lines 1025 to 1040 in a814577
If the cluster setting is disabled, tenants will get nothing from the KV layer except for messages that have been redacted in their entirety (i.e. contain no information except the timestamp). So in Serverless, we expect to run with redaction enabled.
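A rough sketch of that boundary rule, again with made-up names rather than the actual KV-side code, might look like this: with redaction enabled the tenant sees the redacted text, and with it disabled the entry is blanked down to its timestamp before crossing the KV-Tenant boundary.

```
package tracingsketch

import "time"

// kvLogEntry is an illustrative (not actual) shape for a KV-side trace
// log entry: a timestamp plus a message in redacted and/or plain form.
type kvLogEntry struct {
	Time       time.Time
	Redacted   string // set only when trace redaction is enabled
	Unredacted string // must never be shown to a tenant as-is
}

// forTenant sketches the behavior described above. If redaction is
// enabled, the tenant receives the redacted message; otherwise the entry
// is redacted in its entirety and only the timestamp survives.
func forTenant(e kvLogEntry, redactionEnabled bool) kvLogEntry {
	if redactionEnabled && e.Redacted != "" {
		return kvLogEntry{Time: e.Time, Redacted: e.Redacted}
	}
	return kvLogEntry{Time: e.Time}
}
```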
I am 👍 for reducing the frequency of tracing to address this. However, I need @knz to confirm that such a reduction would adequately solve the issue, as they are more familiar with this logic than I am.
@tbg, what's missing here is an explanation of why your change caused these regressions, as well as what the impact on customers would be if we do whatever it is you're suggesting. Can you give a more detailed analysis that doesn't assume knowledge of this code?
The regression here is for verbose tracing and is due to the newly added functionality to redact log messages. I'm unsure at what level Serverless uses tracing - ideally not at all, and ideally tenants also cannot turn it on for all of their transactions. Turning tracing on globally was never a good idea due to the overhead, but some customers get away with it if they have enough headroom.

If we turn off redactability (we can do that by reverting the PR that introduces it), we have to in turn completely blank out the KV side of traces (for tenants). I was originally trying to avoid this, but at this point it seems like the way to go, at least until we better understand how expensive redaction is. Even with that done, it remains useful to restrict how much tenants can enable verbose tracing (via cluster settings or even just `SET tracing = on` in each session). This has always been expensive, and while it's unlikely to overwhelm our system the way it could on a dedicated cluster since it's limited to a few tenants at a time, it's something to keep an eye on.

My suggestion is that if we're going to strip the KV-side traces anyway, we may as well make them non-verbose, so that the perf overhead is limited to the SQL pods, where it affects only the tenant. In summary, I'm planning to do the following tomorrow:
I think we should measure the performance impact of redaction when verbose tracing is on before reverting. From the profile that Nathan posted, it doesn't look that great... Some latency percentile in the follower-reads test has shifted - which sounds bad - but I think we should quantify by how much.
I think it might have to be the opposite. If we sit down now and take measurements on the regression and discuss what is acceptable and what isn't, all while we are afraid to release, we are stopping the train for no reason. (Note that we're not primarily worried about Serverless, but about anyone who will run 21.2 GA.) We should move the problem out of the way and then quantify the overhead of tracing.
When we discovered that tenants could "trace into" KV, we considered this a security problem since no mechanism ensured that tenants could only receive data they are allowed to see. The KV team reacted to this with cockroachdb#70562, which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary.

Unfortunately, making traces redactable had a [significant] performance overhead. On simple selects, this overhead is in the 10-20% range (when verbose tracing is enabled), and it does not seem likely that we can get it to the low single digits, at any rate not in time for 21.2.0. This presents a stability risk, since some customers run with verbose tracing enabled for all queries. Once they switch to 21.2, their system may be underprovisioned as a result of the newly grown tracing overhead in the `log.{Ve,E}vent{,f}` family of methods.

This PR makes trace redactability a choice. Because we needed to support mixed-version clusters in cockroachdb#70562, this is not as big a lift as might have been expected. We still slightly regress in the (fairly sensitive) "select single row" benchmark, but it's a lot more palatable now:

```
$ benchstat old.txt new.txt
name                                           old time/op    new time/op    delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16    1.06ms ± 1%    1.07ms ± 1%  +1.25%  (p=0.002 n=7+7)

name                                           old alloc/op   new alloc/op   delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     155kB ± 2%     164kB ± 1%  +5.50%  (p=0.001 n=7+6)

name                                           old allocs/op  new allocs/op  delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     2.27k ± 1%     2.33k ± 1%  +2.39%  (p=0.001 n=7+7)
```

I did take a quick glance at the CPU profiles but nothing obvious turned up. I might have to look at an allocation profile.

Touches cockroachdb#70562.

[significant]: cockroachdb#70110 (comment)

Release note: None
There are some perf measurements over in #71610 now. I'd say it's pretty bad. Anyway, we have moved to revert, so the rest of the action is going to be on
When we discovered that tenants could "trace into" KV, we considered this a security problem since no mechanism ensured that tenants could only receive data they are allowed to see. The KV team reacted to this with cockroachdb#70562, which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary. Serverless has shipped with this redactability in place in order to prevent data leakage across the boundary with KV and now needs to get back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance overhead. On simple selects, this overhead is in the 10-20% range (when verbose tracing is enabled). This presents a stability risk, since some customers run with verbose tracing enabled for all queries. Once they switch to 21.2, their system may be underprovisioned as a result of the newly grown tracing overhead in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes via the additional cost of redacting log strings via the `redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to support mixed-version clusters in cockroachdb#70562, this is not as big a lift as might have been expected. We still slightly regress in the (fairly sensitive) "select single row" benchmark, but it's a lot more palatable now:

```
$ benchstat old.txt new.txt
name                                           old time/op    new time/op    delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16    1.06ms ± 1%    1.07ms ± 1%  +1.25%  (p=0.002 n=7+7)

name                                           old alloc/op   new alloc/op   delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     155kB ± 2%     164kB ± 1%  +5.50%  (p=0.001 n=7+6)

name                                           old allocs/op  new allocs/op  delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     2.27k ± 1%     2.33k ± 1%  +2.39%  (p=0.001 n=7+7)
```

I did take a quick glance at the CPU profiles but nothing obvious turned up. I might have to look at an allocation profile.

Touches cockroachdb#70562.

Original issue investigation that kicked off concerns about the performance impact of redaction: cockroachdb#70110 (comment)

Release note: None
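As a way to see where that per-event cost comes from, here is a small, self-contained micro-benchmark sketch (not the benchmark quoted above) that compares plain formatting against `redact.Sprintf` for a representative trace event. It assumes the `github.com/cockroachdb/redact` module is available and lives in a `_test.go` file; the message and arguments are made up.

```
package tracingsketch_test

import (
	"fmt"
	"testing"

	"github.com/cockroachdb/redact"
)

// BenchmarkEventFormatting contrasts the cost of formatting a trace
// event with and without redaction. The point is only to isolate the
// per-event formatting step; it does not reproduce the full tracing
// path exercised by log.{Ve,E}vent{,f}.
func BenchmarkEventFormatting(b *testing.B) {
	b.Run("fmt.Sprintf", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = fmt.Sprintf("scan completed: %d keys, resume span %s", 42, "/Table/53/1/{10-20}")
		}
	})
	b.Run("redact.Sprintf", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = redact.Sprintf("scan completed: %d keys, resume span %s", 42, "/Table/53/1/{10-20}")
		}
	})
}
```

It can be run with the usual `go test -bench` flags (adding `-benchmem` to see the allocation side of the overhead).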
When we discovered that tenants could "trace into" KV, we considered this a security problem since no mechanism ensured that tenants could only receive data they are allowed to see. The KV team reacted to this with cockroachdb#70562, which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary. Serverless has shipped with this redactability in place in order to prevent data leakage across the boundary with KV and now needs to get back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance overhead. On simple selects, this overhead is in the 10-20% range (when verbose tracing is enabled). This presents a stability risk, since some customers run with verbose tracing enabled for all queries. Once they switch to 21.2, their system may be underprovisioned as a result of the newly grown tracing overhead in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes via the additional cost of redacting log strings via the `redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to support mixed-version clusters in cockroachdb#70562, this is not as big a lift as might have been expected. The approach taken is that tracers and the spans they generate can be marked as "redactable". If the flag is enabled, logging events to the span will use full redaction via `redact.Sprintf`. If the flag is not enabled, we implement a manual "coarse redaction" of the entire message by surrounding it with redaction markers. We take a "fast path" by checking whether there are any existing redaction markers to escape in the message, on the assumption that usually there aren't. This approach maintains redactability (to varying degrees) of traces while reducing the performance hit of tracing in the case where we don't want detailed redactability. A further commit will enable redactability on tenant-bound traces, since we do want to maintain that functionality in multi-tenant deployments.

Original issue investigation that kicked off concerns about the performance impact of redaction: cockroachdb#70110 (comment)

----

Benchmarks

The base for all three benchmarks below is the code on `release-21.2` with the "drop traces for tenants" commit reverted. This reproduces a reasonable "base" state where tracing redaction isn't really introduced at all. Additionally, all benchmarks are run with the modifications introduced in the first two commits in cockroachdb#71610, which force tracing on in benchmarks by running `SET CLUSTER SETTING trace.debug.enable = true` in the regular and multinode scenarios shown below.

Below are tracing benchmarks for the conditional redaction that is in **this** commit.
```
~/go/src/github.com/cockroachdb/cockroach make-traces-redactable-and-default-to-off--v2* ≡ 19:36:43
❯ benchstat 83fc452-no-tracing-bench.txt 2bea5a2ac5-conditional-redaction-simple.txt
name                                          old time/op    new time/op    delta
Tracing/Cockroach/tracing=f/Scan1-8            612µs ± 9%     591µs ±16%       ~     (p=0.393 n=10+10)
Tracing/Cockroach/tracing=f/Insert-8           785µs ± 4%     767µs ± 6%       ~     (p=0.360 n=8+10)
Tracing/Cockroach/tracing=t/Scan1-8            597µs ± 1%     542µs ± 1%    -9.23%   (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8           762µs ± 3%     739µs ± 8%       ~     (p=0.315 n=10+10)
Tracing/MultinodeCockroach/tracing=f/Scan1-8   1.05ms ±30%    0.92ms ± 5%   -12.57%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8  1.19ms ±15%    1.13ms ± 9%    -5.39%  (p=0.035 n=10+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    979µs ± 3%     945µs ± 6%    -3.42%  (p=0.024 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8  1.54ms ±48%    1.19ms ±16%   -22.76%  (p=0.011 n=10+10)

name                                          old alloc/op   new alloc/op   delta
Tracing/Cockroach/tracing=f/Scan1-8            86.5kB ± 1%    90.2kB ± 3%    +4.29%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8            109kB ± 3%     111kB ± 0%    +2.68%  (p=0.005 n=9+7)
Tracing/Cockroach/tracing=t/Scan1-8            86.0kB ± 1%    89.8kB ± 0%    +4.42%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8            108kB ± 3%     112kB ± 3%    +3.05%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8    156kB ±37%     149kB ± 1%       ~     (p=0.050 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8   154kB ±10%     154kB ± 3%       ~     (p=0.382 n=8+8)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    145kB ± 1%     150kB ± 5%    +3.61%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8   157kB ±13%     183kB ±40%       ~     (p=0.237 n=8+10)

name                                          old allocs/op  new allocs/op  delta
Tracing/Cockroach/tracing=f/Scan1-8             1.44k ± 0%     1.50k ± 0%    +4.22%  (p=0.000 n=9+8)
Tracing/Cockroach/tracing=f/Insert-8            1.64k ± 0%     1.72k ± 0%    +4.39%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=t/Scan1-8             1.44k ± 0%     1.50k ± 0%    +4.33%  (p=0.000 n=8+7)
Tracing/Cockroach/tracing=t/Insert-8            1.64k ± 0%     1.72k ± 0%    +4.34%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8    2.41k ±11%     2.38k ± 1%       ~     (p=0.150 n=10+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8   2.15k ± 1%     2.22k ± 1%    +3.35%  (p=0.000 n=8+8)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    2.31k ± 1%     2.40k ± 4%    +4.19%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8   2.17k ± 6%     2.33k ±12%    +7.43%  (p=0.012 n=8+10)
```

These statistics are the *base* hit we take on the redaction work **prior** to any conditional redaction being added. This is what we're trying to improve on.
```
~/go/src/github.com/cockroachdb/cockroach make-traces-redactable-and-default-to-off--v2* ≡ 19:36:49
❯ benchstat 83fc452-no-tracing-bench.txt dc42e37-after-redaction.txt
name                                          old time/op    new time/op    delta
Tracing/Cockroach/tracing=f/Scan1-8            612µs ± 9%     693µs ± 2%   +13.22%  (p=0.000 n=10+10)
Tracing/Cockroach/tracing=f/Insert-8           785µs ± 4%     878µs ± 1%   +11.86%  (p=0.000 n=8+9)
Tracing/Cockroach/tracing=t/Scan1-8            597µs ± 1%     687µs ± 3%   +15.05%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8           762µs ± 3%     878µs ± 1%   +15.35%  (p=0.000 n=10+8)
Tracing/MultinodeCockroach/tracing=f/Scan1-8   1.05ms ±30%    1.06ms ± 2%       ~    (p=0.050 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8  1.19ms ±15%    1.28ms ±11%    +7.85%  (p=0.035 n=10+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    979µs ± 3%    1066µs ± 1%    +8.87%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8  1.54ms ±48%    1.26ms ± 2%       ~    (p=0.515 n=10+8)

name                                          old alloc/op   new alloc/op   delta
Tracing/Cockroach/tracing=f/Scan1-8            86.5kB ± 1%    98.0kB ± 1%   +13.33%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8            109kB ± 3%     125kB ± 1%   +15.09%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=t/Scan1-8            86.0kB ± 1%    97.9kB ± 1%   +13.90%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8            108kB ± 3%     125kB ± 1%   +15.53%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8    156kB ±37%     162kB ± 1%       ~     (p=0.059 n=9+8)
Tracing/MultinodeCockroach/tracing=f/Insert-8   154kB ±10%     197kB ±43%   +28.54%  (p=0.000 n=8+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    145kB ± 1%     162kB ± 1%   +11.81%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8   157kB ±13%     172kB ± 5%    +9.84%  (p=0.007 n=8+8)

name                                          old allocs/op  new allocs/op  delta
Tracing/Cockroach/tracing=f/Scan1-8             1.44k ± 0%     1.66k ± 0%   +15.59%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8            1.64k ± 0%     1.95k ± 0%   +18.48%  (p=0.000 n=9+8)
Tracing/Cockroach/tracing=t/Scan1-8             1.44k ± 0%     1.66k ± 0%   +15.46%  (p=0.000 n=8+9)
Tracing/Cockroach/tracing=t/Insert-8            1.64k ± 0%     1.95k ± 1%   +18.52%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8    2.41k ±11%     2.58k ± 1%       ~     (p=0.105 n=10+8)
Tracing/MultinodeCockroach/tracing=f/Insert-8   2.15k ± 1%     2.58k ±12%   +20.04%  (p=0.000 n=8+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8    2.31k ± 1%     2.59k ± 1%   +12.09%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8   2.17k ± 6%     2.47k ± 1%   +13.76%  (p=0.000 n=8+8)
```

Touches cockroachdb#70562.

Release note: None
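To make the "redactable flag plus coarse-redaction fast path" described in the commit message above concrete, here is a hedged sketch of what the per-event decision could look like. The marker constants and the escaping step are placeholders, and the real implementation in the tracing package will differ in detail.

```
package tracingsketch

import (
	"fmt"
	"strings"

	"github.com/cockroachdb/redact"
)

// Illustrative marker constants; the actual markers and escaping rules
// used by the redact package may differ.
const (
	startMarker = "‹"
	endMarker   = "›"
)

// formatEvent sketches the conditional behavior. With redactable=true,
// the event is formatted with full redaction via redact.Sprintf. With
// redactable=false, the whole message is coarsely redacted by wrapping
// it in a single pair of markers; pre-existing markers are escaped only
// if present (the "fast path" assumes they usually aren't).
func formatEvent(redactable bool, format string, args ...interface{}) string {
	if redactable {
		return string(redact.Sprintf(format, args...))
	}
	msg := fmt.Sprintf(format, args...)
	// Fast path: most messages contain no markers, so no escaping pass.
	if strings.Contains(msg, startMarker) || strings.Contains(msg, endMarker) {
		msg = strings.ReplaceAll(msg, startMarker, "?")
		msg = strings.ReplaceAll(msg, endMarker, "?")
	}
	return startMarker + msg + endMarker
}
```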
When we discovered that tenants could "trace into" KV, we considered this a security problem since no mechanism ensured that tenants could only receive data they are allowed to see. The KV team reacted to this with cockroachdb#70562, which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary. Serverless has shipped with this redactability in place in order to prevent data leakage across the boundary with KV and now needs to get back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance overhead. On simple selects, this overhead is in the 10-20% range (when verbose tracing is enabled). This presents a stability risk, since some customers run with verbose tracing enabled for all queries. Once they switch to 21.2, their system may be underprovisioned as a result of the newly grown tracing overhead in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes via the additional cost of redacting log strings via the `redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to support mixed-version clusters in cockroachdb#70562, this is not as big a lift as might have been expected. The approach taken is that tracers and the spans they generate can be marked as "redactable". If the flag is enabled, logging events to the span will use full redaction via `redact.Sprintf`. If the flag is not enabled, a coarse redaction of the entire message is done instead. A further commit will enable redactability on tenant-bound traces, since we do want to maintain that functionality in multi-tenant deployments.

Original issue investigation that kicked off concerns about the performance impact of redaction: cockroachdb#70110 (comment)

Touches cockroachdb#70562.

Release note: None
roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 8d491ced731e13da2377cda9961f577d7487d6a0:
Reproduce
See: roachtest README
Same failure on other branches
Epic CRDB-10559