
roachtest: restore2TB/nodes=10 failed #54118

Closed
cockroach-teamcity opened this issue Sep 9, 2020 · 5 comments · Fixed by #54468

@cockroach-teamcity
Member

(roachtest).restore2TB/nodes=10 failed on release-20.2@d9ff57ed3d5891ab93b7756be657ba868ecc87a6:

		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2284
		  | main.(*cluster).Run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2204
		  | main.registerRestore.func1.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:255
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2626
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) output in run_074800.287_n1_cockroach_sql
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2254937-1599631852-21-n10cpu4:1 -- ./cockroach sql --insecure -e "
		  | 				RESTORE csv.bank FROM
		  | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  | Flag --insecure has been deprecated, it will be removed in a subsequent release.
		  | For details, see: https://go.crdb.dev/issue-v/53404/v20.2
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2648,restore.go:262,test_runner.go:754: monitor failure: monitor task failed: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:262
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2692
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (4) monitor task failed
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString


Artifacts: /restore2TB/nodes=10
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 9, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Sep 9, 2020
@andreimatei
Contributor

timed out; must be the same as #53197

@cockroach-teamcity
Member Author

(roachtest).restore2TB/nodes=10 failed on release-20.2@a504fc7f26be39586c14c3fa22ad728ed8fa9b6d:

		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2292
		  | main.(*cluster).Run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2212
		  | main.registerRestore.func1.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:255
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2634
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) output in run_075013.439_n1_cockroach_sql
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2264915-1599891051-21-n10cpu4:1 -- ./cockroach sql --insecure -e "
		  | 				RESTORE csv.bank FROM
		  | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  | Flag --insecure has been deprecated, it will be removed in a subsequent release.
		  | For details, see: https://go.crdb.dev/issue-v/53404/v20.2
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2656,restore.go:262,test_runner.go:754: monitor failure: monitor task failed: dial tcp 34.66.219.179:26257: connect: connection timed out
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2652
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:262
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2700
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (4) monitor task failed
		Wraps: (5) dial tcp 34.66.219.179:26257
		Wraps: (6) connect
		Wraps: (7) connection timed out
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *net.OpError (6) *os.SyscallError (7) syscall.Errno


Artifacts: /restore2TB/nodes=10
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@andreimatei andreimatei self-assigned this Sep 14, 2020
@andreimatei
Contributor

I'm going crazy with this one. There's something misbehaving in the range cache, which I've tracked down to the entry pointer we return here

return entry, rawEntry

not being equal (on the machines where it happens) to what the caller gets here
if entry, _ := rdc.getCachedRLocked(ctx, key, useReverseScan); entry != nil {

I've got no idea what's going on... Still looking.

Relatedly, it has come to my attention that the way in which we're using the util.cache.OrderedCache for the range cache is incorrect - we're doing retrievals under a read lock, but the retrievals are mutating operations (they update LRU bookkeeping). This was recently fixed elsewhere in 9f653a6.
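
To make the locking hazard concrete, here is a minimal sketch using a toy LRU cache of my own (invented types, not the actual util/cache code): a retrieval that updates recency bookkeeping is effectively a write, so doing it under a sync.RWMutex read lock lets concurrent readers race on the shared list.

```go
// Toy sketch of the hazard: invented types, not the CockroachDB cache code.
package lrusketch

import (
	"container/list"
	"sync"
)

type lruCache struct {
	mu      sync.RWMutex
	order   *list.List               // most recently used element at the front
	entries map[string]*list.Element // key -> element in order
}

func newLRUCache() *lruCache {
	return &lruCache{order: list.New(), entries: map[string]*list.Element{}}
}

func (c *lruCache) put(key string, val interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = c.order.PushFront(val)
}

// getRLocked mimics a retrieval done under the read lock. It looks read-only,
// but MoveToFront mutates the shared recency list, so two goroutines holding
// RLock at the same time race on it - the bug class described above.
func (c *lruCache) getRLocked(key string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	el, ok := c.entries[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el) // mutation under a read lock: data race
	return el.Value, true
}

// get is the safe variant: because the retrieval updates LRU bookkeeping,
// it has to take the full write lock.
func (c *lruCache) get(key string) (interface{}, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.entries[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value, true
}
```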

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 16, 2020
This patch fixes a bug in the DistSender's interaction with the range
cache. The DistSender interacts with the cache through an EvictionToken;
this token is used to update the lease information for a cache entry and
to evict the entry completely if the descriptor is found to be too stale
to be useful. The problem was that the eviction did not work as intended
if a copy of the token had previously been used to perform a lease update.
Operations that update the cache take a token and return an updated
token, and only the most up to date such token can be used in order to
perform subsequent cache updates (in particular, to evict the cache
entry). The bug was that the DistSender sometimes lost track of this
most up to date token and tried to evict using an older copy, so the
eviction was a no-op.

Specifically, the bad scenario was the following:
- DistSender.sendPartialBatch made a copy of the token and handed it to
  sendToReplicas.
- sendToReplicas updated the cached lease through this token (through
  tok.UpdateLease). The updated token was further used by sendToReplicas
  but not returned to sendPartialBatch.
- No replica can serve the request because the descriptor is stale, and
  control flow gets back to sendPartialBatch.
- sendPartialBatch uses its own, stale, copy of the token to attempt a
  cache eviction. The cache ignores the eviction request, claiming that
  it has a more up to date entry than the one that sendPartialBatch is
  trying to evict (which is true).
- sendPartialBatch then asks the cache for a new descriptor, and the
  cache returns the same one as before. It also returns the lease that
  we have added before, but that doesn't help very much.
- All this can happen over and over again in some case that I don't
  fully understand. In the restore2TB test, what happens is that the
  node that's supposed to be the leaseholder (say, n1), perpetually
  doesn't know about the range, and another node perpetually thinks that
  n1 is the leaseholder. I'm not sure why the latter happens, but this
  range is probably in an unusual state because the request that's
  spinning in this way is an ImportRequest.

This patch fixes the problem by changing the token's interface - now
different token methods take the receiver by pointer and mutate it, and
the DistSender methods share a single token instance between them.

Fixes cockroachdb#54118
Fixes cockroachdb#53197

Release note: None
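
For readers following along, here is a stripped-down sketch of the scenario above. The types and function names are invented stand-ins (not the real EvictionToken or range cache API); the point is only how a value copy of the token goes stale after a lease update, and how sharing a single, pointer-mutated token avoids the no-op eviction.

```go
// Invented stand-ins for the EvictionToken / range cache interaction described
// above; this illustrates the failure mode, not the real API.
package tokensketch

type evictionToken struct {
	generation int // bumped whenever an update is made through the token
}

type rangeCache struct {
	generation int // generation of the entry currently cached
	evicted    bool
}

// updateLease models the old value-semantics API: it returns an updated
// token, and if the caller drops the return value its own copy goes stale.
func (c *rangeCache) updateLease(tok evictionToken) evictionToken {
	c.generation++
	tok.generation = c.generation
	return tok
}

// evict refuses to act on a token that doesn't match the current entry,
// mirroring the "you hold something older than what's cached" check.
func (c *rangeCache) evict(tok evictionToken) {
	if tok.generation != c.generation {
		return // stale token: eviction is silently a no-op
	}
	c.evicted = true
}

// buggyFlow mirrors sendPartialBatch/sendToReplicas before the fix: the
// updated token is not propagated back, so the later eviction does nothing.
func buggyFlow(c *rangeCache) bool {
	tok := evictionToken{generation: c.generation}
	_ = c.updateLease(tok) // the updated token is lost, as in sendToReplicas
	c.evict(tok)           // stale copy: no-op
	return c.evicted       // false
}

// updateLeaseShared models the fix: the token is taken by pointer and mutated
// in place, so everyone sharing the token sees the update.
func (c *rangeCache) updateLeaseShared(tok *evictionToken) {
	c.generation++
	tok.generation = c.generation
}

// fixedFlow shares one token instance across the calls, as the patch does.
func fixedFlow(c *rangeCache) bool {
	tok := &evictionToken{generation: c.generation}
	c.updateLeaseShared(tok)
	c.evict(*tok)    // token is up to date: eviction succeeds
	return c.evicted // true
}
```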
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 17, 2020
This patch fixes a bug in the DistSender's interaction with the range
cache. The DistSender interacts with the cache through an EvictionToken;
this token is used to update the lease information for a cache entry and
to evict the entry completely if the descriptor is found to be too stale
to be useful. The problem was that the eviction did not work as intended
if a copy of the token had previously been used to perform a lease update.
Operations that update the cache take a token and return an updated
token, and only the most up to date such token can be used in order to
perform subsequent cache updates (in particular, to evict the cache
entry). The bug was that the DistSender sometimes lost track of this
most up to date token and tried to evict using an older copy, so the
eviction was a no-op.

Specifically, the bad scenario was the following:
- DistSender.sendPartialBatch made a copy of the token and handed it to
  sendToReplicas.
- sendToReplicas updated the cached lease through this token (through
  tok.UpdateLease). The updated token was further used by sendToReplicas
  but not returned to sendPartialBatch.
- No replica can serve the request because the descriptor is stale, and
  control flow gets back to sendPartialBatch.
- sendPartialBatch uses its own, stale, copy of the token to attempt a
  cache eviction. The cache ignores the eviction request, claiming that
  it has a more up to date entry than the one that sendPartialBatch is
  trying to evict (which is true).
- sendPartialBatch then asks the cache for a new descriptor, and the
  cache returns the same one as before. It also returns the lease that
  we have added before, but that doesn't help very much.
- All this can happen over and over again in some case that I don't
  fully understand. In the restore2TB test, what happens is that the
  node that's supposed to be the leaseholder (say, n1), perpetually
  doesn't know about the range, and another node perpetually thinks that
  n1 is the leaseholder. I'm not sure why the latter happens, but this
  range is probably in an unusual state because the request that's
  spinning in this way is an ImportRequest.

This patch fixes the problem by changing how descriptor evictions work:
instead of comparing a token's pointer to the current cache entry and
refusing to do anything if they don't match, now evicting a descriptor
compares the descriptor's value with what's in the cache.
The cache code generally moves away from comparing pointers anywhere,
and an EvictionToken stops storing pointers to the cache's insides for
clarity.

Fixes cockroachdb#54118
Fixes cockroachdb#53197

Release note: None
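
As a companion to the sketch further up, here is the same toy model reworked to show the approach this revised version of the patch describes: eviction compares the descriptor value carried by the token against what is cached, instead of requiring pointer identity with the cache entry. The types are again invented for illustration.

```go
// Invented stand-ins again; this illustrates eviction by descriptor value
// rather than by pointer identity, per the revised fix described above.
package evictsketch

// rangeDescriptor is a toy stand-in for a range descriptor.
type rangeDescriptor struct {
	rangeID    int
	generation int
}

type cacheEntry struct {
	desc rangeDescriptor
}

type rangeCache struct {
	entry *cacheEntry
}

// evictByPointer models the old behavior: the eviction is ignored unless the
// caller still holds a pointer to the exact entry that is in the cache.
func (c *rangeCache) evictByPointer(e *cacheEntry) bool {
	if c.entry != e {
		return false // caller holds an older copy: nothing happens
	}
	c.entry = nil
	return true
}

// evictByValue models the new behavior: the descriptor carried by the token
// is compared by value, so a token built from an earlier (but equal) copy of
// the entry can still evict it; only a genuinely newer descriptor survives.
func (c *rangeCache) evictByValue(desc rangeDescriptor) bool {
	if c.entry == nil || c.entry.desc != desc {
		return false // cache already holds something newer (or nothing)
	}
	c.entry = nil
	return true
}
```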
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 17, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 17, 2020
craig bot pushed a commit that referenced this issue Sep 17, 2020
54468: kvclient: fix a request routing bug r=andreimatei a=andreimatei

Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in 84fa010 Sep 17, 2020
@andreimatei
Contributor

reopening until backported

@andreimatei andreimatei reopened this Sep 18, 2020
@andreimatei
Contributor

now backported

arulajmani pushed a commit that referenced this issue Sep 18, 2020