Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestStoreLeaseTransferTimestampCacheTxnRecord failed #117486

Closed
cockroach-teamcity opened this issue Jan 8, 2024 · 6 comments · Fixed by #118300
Closed

kv/kvserver: TestStoreLeaseTransferTimestampCacheTxnRecord failed #117486

cockroach-teamcity opened this issue Jan 8, 2024 · 6 comments · Fixed by #118300
Assignees
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jan 8, 2024

kv/kvserver.TestStoreLeaseTransferTimestampCacheTxnRecord failed on master @ 58502d1125b70dafeea34b96733967a387f9ef0d:

=== RUN   TestStoreLeaseTransferTimestampCacheTxnRecord
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestStoreLeaseTransferTimestampCacheTxnRecord3215970383
    test_log_scope.go:81: use -show-logs to present logs inline
    client_lease_test.go:684: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_lease_test.go:684
        	Error:      	Expect "<nil>" to match "TransactionAbortedError\(ABORT_REASON_NEW_LEASE_PREVENTS_TXN\)"
        	Test:       	TestStoreLeaseTransferTimestampCacheTxnRecord
    panic.go:523: -- test log scope end --
test logs left over in: outputs.zip/logTestStoreLeaseTransferTimestampCacheTxnRecord3215970383
--- FAIL: TestStoreLeaseTransferTimestampCacheTxnRecord (3.48s)

Parameters:

  • attempt=1
  • run=2
  • shard=22
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-35210

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jan 8, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Jan 8, 2024
@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestStoreLeaseTransferTimestampCacheTxnRecord failed with artifacts on master @ 5243cc37c02ba17cc53a70230298fbe1e30e651b:

=== RUN   TestStoreLeaseTransferTimestampCacheTxnRecord
I240108 17:15:42.101357 1 (gostd) rand.go:250  [-] 1  random seed: 3492486949698708590
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestStoreLeaseTransferTimestampCacheTxnRecord263603817
    test_log_scope.go:81: use -show-logs to present logs inline
    client_lease_test.go:684: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_lease_test.go:684
        	Error:      	Expect "<nil>" to match "TransactionAbortedError\(ABORT_REASON_NEW_LEASE_PREVENTS_TXN\)"
        	Test:       	TestStoreLeaseTransferTimestampCacheTxnRecord
    panic.go:523: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestStoreLeaseTransferTimestampCacheTxnRecord263603817
--- FAIL: TestStoreLeaseTransferTimestampCacheTxnRecord (5.62s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestStoreLeaseTransferTimestampCacheTxnRecord failed with artifacts on master @ 5243cc37c02ba17cc53a70230298fbe1e30e651b:

=== RUN   TestStoreLeaseTransferTimestampCacheTxnRecord
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestStoreLeaseTransferTimestampCacheTxnRecord3661833498
    test_log_scope.go:81: use -show-logs to present logs inline
    client_lease_test.go:684: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_lease_test.go:684
        	Error:      	Expect "<nil>" to match "TransactionAbortedError\(ABORT_REASON_NEW_LEASE_PREVENTS_TXN\)"
        	Test:       	TestStoreLeaseTransferTimestampCacheTxnRecord
    panic.go:523: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestStoreLeaseTransferTimestampCacheTxnRecord3661833498
--- FAIL: TestStoreLeaseTransferTimestampCacheTxnRecord (3.04s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@nvanbenschoten
Copy link
Member

Given the timing, this is probably fallout from #116880.

@nvanbenschoten nvanbenschoten self-assigned this Jan 8, 2024
@nvanbenschoten nvanbenschoten removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jan 8, 2024
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 8, 2024
Informs cockroachdb#117486.

Skip until I can fix the test.

Release note: None
craig bot pushed a commit that referenced this issue Jan 8, 2024
117480: roachtest: skip multitenant/distsql for now r=yuzefovich a=yuzefovich

This test is flaking with some infra issue.

Informs: #117150.
Fixes: #117461.
Fixes: #117462.
Fixes: #117463.
Fixes: #117464.

Release note: None

117506: kv: skip TestStoreLeaseTransferTimestampCacheTxnRecord r=nvanbenschoten a=nvanbenschoten

Informs #117486.

Skip until I can fix the test.

Release note: None

117510: rowenc: fix up a recent commit r=yuzefovich a=yuzefovich

This commit optimizes recently added code a bit (by using slightly more efficient `PeekValueLengthWithOffsetsAndType` method) as well as adds a missing word in a comment.

Epic: None

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestStoreLeaseTransferTimestampCacheTxnRecord failed with artifacts on master @ 5bd4ca0cce6bf2358f05ee0cb2a6cd2533ad435c:

=== RUN   TestStoreLeaseTransferTimestampCacheTxnRecord
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestStoreLeaseTransferTimestampCacheTxnRecord777471584
    test_log_scope.go:81: use -show-logs to present logs inline
    client_lease_test.go:684: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_lease_test.go:684
        	Error:      	Expect "<nil>" to match "TransactionAbortedError\(ABORT_REASON_NEW_LEASE_PREVENTS_TXN\)"
        	Test:       	TestStoreLeaseTransferTimestampCacheTxnRecord
    panic.go:523: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/ac3e4004160555c7c575e1264d1b8f5f/logTestStoreLeaseTransferTimestampCacheTxnRecord777471584
--- FAIL: TestStoreLeaseTransferTimestampCacheTxnRecord (2.87s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@nvanbenschoten nvanbenschoten added P-2 Issues/test failures with a fix SLA of 3 months C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Jan 16, 2024
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 29, 2024
… range merges

Fixes cockroachdb#61986.
Fixes cockroachdb#117486.
Unblocks cockroachdb#118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in cockroachdb#118299 to
ship high-resolution summaries of the timestamp cache during lease transfers and
ranges merges. In doing so, it eliminates the loss of precision that occurs in an
incoming leaseholder's timestamp cache when it receives a lease transfer or range
merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to
   TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN`
   reason.
2. txn push marker keys would have their timestamp advanced, leading to
   transactions having their commit timestamp pushed, which could lead to
   TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read,
   also leading to transactions having their commit timestamp pushed if they
   wrote to those keys, which could also lead to TransactionRetryError with the
   `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it can not be refreshed
away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of
these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to
0B. This default configuration should be sufficient to eliminate the first two
sources of retries described above. The third has not been observed as a serious
issue in practice, so we configure the global budget to 0 so that we can hit a
serialization fast-path.

Release note (ops change): Transaction replay protection state is now
passed between the outgoing and incoming leaseholder for a range during
a lease transfer. This avoids cases where lease transfers can cause
transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 31, 2024
… range merges

Fixes cockroachdb#61986.
Fixes cockroachdb#117486.
Unblocks cockroachdb#118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in cockroachdb#118299 to
ship high-resolution summaries of the timestamp cache during lease transfers and
ranges merges. In doing so, it eliminates the loss of precision that occurs in an
incoming leaseholder's timestamp cache when it receives a lease transfer or range
merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to
   TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN`
   reason.
2. txn push marker keys would have their timestamp advanced, leading to
   transactions having their commit timestamp pushed, which could lead to
   TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read,
   also leading to transactions having their commit timestamp pushed if they
   wrote to those keys, which could also lead to TransactionRetryError with the
   `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it can not be refreshed
away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of
these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to
0B. This default configuration should be sufficient to eliminate the first two
sources of retries described above. The third has not been observed as a serious
issue in practice, so we configure the global budget to 0 so that we can hit a
serialization fast-path.

Release note (ops change): Transaction replay protection state is now
passed between the outgoing and incoming leaseholder for a range during
a lease transfer. This avoids cases where lease transfers can cause
transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.
craig bot pushed a commit that referenced this issue Feb 1, 2024
115746: kv: log on excessive latch hold duration r=lyang24 a=lyang24

This commit aims to help observability by logging request holding latches over
threshold. long_latch_hold_duration is a new cluster setting that is introduced
to set the latch holding time threshold, latches held over the threshold will
be logged at most every second. To achieve logging spanlatch.manager now
contains a pointer to cluster setting.

Fixes: #114609

Release note: None

118300: kv: ship high-resolution tscache summaries during lease transfers and range merges r=nvanbenschoten a=nvanbenschoten

Fixes #61986.
Fixes #117486.
Unblocks #118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in #118299 to ship high-resolution summaries of the timestamp cache during lease transfers and ranges merges. In doing so, it eliminates the loss of precision that occurs in an incoming leaseholder's timestamp cache when it receives a lease transfer or range merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN` reason.
2. txn push marker keys would have their timestamp advanced, leading to transactions having their commit timestamp pushed, which could lead to TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read, also leading to transactions having their commit timestamp pushed if they wrote to those keys, which could also lead to TransactionRetryError with the `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it can not be refreshed away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to 0B. This default configuration should be sufficient to eliminate the first two sources of retries described above. The third has not been observed as a serious issue in practice, so we configure the global budget to 0 so that we can hit a serialization fast-path.

Release note (ops change): Transaction replay protection state is now passed between the outgoing and incoming leaseholder for a range during a lease transfer. This avoids cases where lease transfers can cause transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.

Co-authored-by: lyang24 <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in a4c578c Feb 2, 2024
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
… range merges

Fixes cockroachdb#61986.
Fixes cockroachdb#117486.
Unblocks cockroachdb#118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in cockroachdb#118299 to
ship high-resolution summaries of the timestamp cache during lease transfers and
ranges merges. In doing so, it eliminates the loss of precision that occurs in an
incoming leaseholder's timestamp cache when it receives a lease transfer or range
merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to
   TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN`
   reason.
2. txn push marker keys would have their timestamp advanced, leading to
   transactions having their commit timestamp pushed, which could lead to
   TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read,
   also leading to transactions having their commit timestamp pushed if they
   wrote to those keys, which could also lead to TransactionRetryError with the
   `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it can not be refreshed
away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of
these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to
0B. This default configuration should be sufficient to eliminate the first two
sources of retries described above. The third has not been observed as a serious
issue in practice, so we configure the global budget to 0 so that we can hit a
serialization fast-path.

Release note (ops change): Transaction replay protection state is now
passed between the outgoing and incoming leaseholder for a range during
a lease transfer. This avoids cases where lease transfers can cause
transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Projects
No open projects
Status: Closed
4 participants