
roachtest: restore2TB/nodes=10 failed #54118

Closed
cockroach-teamcity opened this issue Sep 9, 2020 · 5 comments · Fixed by #54468

@cockroach-teamcity
Member

(roachtest).restore2TB/nodes=10 failed on release-20.2@d9ff57ed3d5891ab93b7756be657ba868ecc87a6:

		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2284
		  | main.(*cluster).Run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2204
		  | main.registerRestore.func1.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:255
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2626
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) output in run_074800.287_n1_cockroach_sql
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2254937-1599631852-21-n10cpu4:1 -- ./cockroach sql --insecure -e "
		  | 				RESTORE csv.bank FROM
		  | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  | Flag --insecure has been deprecated, it will be removed in a subsequent release.
		  | For details, see: https://go.crdb.dev/issue-v/53404/v20.2
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2648,restore.go:262,test_runner.go:754: monitor failure: monitor task failed: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:262
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2692
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (4) monitor task failed
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString


Artifacts: /restore2TB/nodes=10
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 9, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Sep 9, 2020
@andreimatei
Contributor

timed out; must be the same as #53197

@cockroach-teamcity
Member Author

(roachtest).restore2TB/nodes=10 failed on release-20.2@a504fc7f26be39586c14c3fa22ad728ed8fa9b6d:

		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2292
		  | main.(*cluster).Run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2212
		  | main.registerRestore.func1.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:255
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2634
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) output in run_075013.439_n1_cockroach_sql
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2264915-1599891051-21-n10cpu4:1 -- ./cockroach sql --insecure -e "
		  | 				RESTORE csv.bank FROM
		  | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  | Flag --insecure has been deprecated, it will be removed in a subsequent release.
		  | For details, see: https://go.crdb.dev/issue-v/53404/v20.2
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2656,restore.go:262,test_runner.go:754: monitor failure: monitor task failed: dial tcp 34.66.219.179:26257: connect: connection timed out
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2652
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:262
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2700
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (4) monitor task failed
		Wraps: (5) dial tcp 34.66.219.179:26257
		Wraps: (6) connect
		Wraps: (7) connection timed out
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *net.OpError (6) *os.SyscallError (7) syscall.Errno


Artifacts: /restore2TB/nodes=10
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@andreimatei andreimatei self-assigned this Sep 14, 2020
@andreimatei
Contributor

I'm going crazy with this one. There's something misbehaving in the range cache, which I've tracked down to the entry pointer we return here

return entry, rawEntry

not being equal (on the machines where it happens) to what the caller gets here
if entry, _ := rdc.getCachedRLocked(ctx, key, useReverseScan); entry != nil {

I've got no idea what's going on... Still looking.

Relatedly, it has come to my attention that the way in which we're using the util.cache.OrderedCache for the range cache is incorrect - we're doing retrievals under a read lock, but the retrievals are mutating operations (they update LRU bookkeeping). This was recently fixed elsewhere in 9f653a6.
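
To make the locking hazard concrete, here is a minimal sketch using a toy LRU cache of my own (invented types, not the actual util/cache code): a retrieval that updates recency bookkeeping is effectively a write, so doing it under a sync.RWMutex read lock lets concurrent readers race on the shared list.

```go
// Toy sketch of the hazard: invented types, not the CockroachDB cache code.
package lrusketch

import (
	"container/list"
	"sync"
)

type lruCache struct {
	mu      sync.RWMutex
	order   *list.List               // most recently used element at the front
	entries map[string]*list.Element // key -> element in order
}

func newLRUCache() *lruCache {
	return &lruCache{order: list.New(), entries: map[string]*list.Element{}}
}

func (c *lruCache) put(key string, val interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = c.order.PushFront(val)
}

// getRLocked mimics a retrieval done under the read lock. It looks read-only,
// but MoveToFront mutates the shared recency list, so two goroutines holding
// RLock at the same time race on it - the bug class described above.
func (c *lruCache) getRLocked(key string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	el, ok := c.entries[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el) // mutation under a read lock: data race
	return el.Value, true
}

// get is the safe variant: because the retrieval updates LRU bookkeeping,
// it has to take the full write lock.
func (c *lruCache) get(key string) (interface{}, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.entries[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value, true
}
```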

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 16, 2020
This patch fixes a bug in the DistSender's interaction with the range
cache. The DistSender interacts with the cache through an EvictionToken;
this token is used to update the lease information for a cache entry and
to evict the entry completely if the descriptor is found to be too stale
to be useful. The problem was that the eviction did not work as intended
if a copy of the token had previously been used to perform a lease update.
Operations that update the cache take a token and return an updated
token, and only the most up to date such token can be used in order to
perform subsequent cache updates (in particular, to evict the cache
entry). The bug was that the DistSender sometimes lost track of this
most up to date token and tried to evict using an older copy, so the
eviction was a no-op.

Specifically, the bad scenario was the following:
- DistSender.sendPartialBatch made a copy of the token and handed it to
  sendToReplicas.
- sendToReplicas updated the cached lease through this token (through
  tok.UpdateLease). The updated token was further used by sendToReplicas
  but not returned to sendPartialBatch.
- No replica can serve the request because the descriptor is stale, and
  control flow gets back to sendPartialBatch.
- sendPartialBatch uses its own, stale, copy of the token to attempt a
  cache eviction. The cache ignores the eviction request, claiming that
  it has a more up to date entry than the one that sendPartialBatch is
  trying to evict (which is true).
- sendPartialBatch then asks the cache for a new descriptor, and the
  cache returns the same one as before. It also returns the lease that
  we have added before, but that doesn't help very much.
- All this can happen over and over again in some case that I don't
  fully understand. In the restore2TB test, what happens is that the
  node that's supposed to be the leaseholder (say, n1), perpetually
  doesn't know about the range, and another node perpetually thinks that
  n1 is the leaseholder. I'm not sure why the latter happens, but this
  range is probably in an unusual state because the request that's
  spinning in this way is an ImportRequest.

This patch fixes the problem by changing the token's interface - now
different token methods take the receiver by pointer and mutate it, and
the DistSender methods share a single token instance between them.

Fixes cockroachdb#54118
Fixes cockroachdb#53197

Release note: None
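
For readers following along, here is a stripped-down sketch of the scenario above. The types and function names are invented stand-ins (not the real EvictionToken or range cache API); the point is only how a value copy of the token goes stale after a lease update, and how sharing a single, pointer-mutated token avoids the no-op eviction.

```go
// Invented stand-ins for the EvictionToken / range cache interaction described
// above; this illustrates the failure mode, not the real API.
package tokensketch

type evictionToken struct {
	generation int // bumped whenever an update is made through the token
}

type rangeCache struct {
	generation int // generation of the entry currently cached
	evicted    bool
}

// updateLease models the old value-semantics API: it returns an updated
// token, and if the caller drops the return value its own copy goes stale.
func (c *rangeCache) updateLease(tok evictionToken) evictionToken {
	c.generation++
	tok.generation = c.generation
	return tok
}

// evict refuses to act on a token that doesn't match the current entry,
// mirroring the "you hold something older than what's cached" check.
func (c *rangeCache) evict(tok evictionToken) {
	if tok.generation != c.generation {
		return // stale token: eviction is silently a no-op
	}
	c.evicted = true
}

// buggyFlow mirrors sendPartialBatch/sendToReplicas before the fix: the
// updated token is not propagated back, so the later eviction does nothing.
func buggyFlow(c *rangeCache) bool {
	tok := evictionToken{generation: c.generation}
	_ = c.updateLease(tok) // the updated token is lost, as in sendToReplicas
	c.evict(tok)           // stale copy: no-op
	return c.evicted       // false
}

// updateLeaseShared models the fix: the token is taken by pointer and mutated
// in place, so everyone sharing the token sees the update.
func (c *rangeCache) updateLeaseShared(tok *evictionToken) {
	c.generation++
	tok.generation = c.generation
}

// fixedFlow shares one token instance across the calls, as the patch does.
func fixedFlow(c *rangeCache) bool {
	tok := &evictionToken{generation: c.generation}
	c.updateLeaseShared(tok)
	c.evict(*tok)    // token is up to date: eviction succeeds
	return c.evicted // true
}
```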
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 17, 2020
This patch fixes a bug in the DistSender's interaction with the range
cache. The DistSender interacts with the cache through an EvictionToken;
this token is used to update the lease information for a cache entry and
to evict the entry completely if the descriptor is found to be too stale
to be useful. The problem was that the eviction did not work as intended
if a copy of the token had previously been used to perform a lease update.
Operations that update the cache take a token and return an updated
token, and only the most up to date such token can be used in order to
perform subsequent cache updates (in particular, to evict the cache
entry). The bug was that the DistSender sometimes lost track of this
most up to date token and tried to evict using an older copy, so the
eviction was a no-op.

Specifically, the bad scenario was the following:
- DistSender.sendPartialBatch made a copy of the token and handed it to
  sendToReplicas.
- sendToReplicas updated the cached lease through this token (through
  tok.UpdateLease). The updated token was further used by sendToReplicas
  but not returned to sendPartialBatch.
- No replica can serve the request because the descriptor is stale, and
  control flow gets back to sendPartialBatch.
- sendPartialBatch uses its own, stale, copy of the token to attempt a
  cache eviction. The cache ignores the eviction request, claiming that
  it has a more up to date entry than the one that sendPartialBatch is
  trying to evict (which is true).
- sendPartialBatch then asks the cache for a new descriptor, and the
  cache returns the same one as before. It also returns the lease that
  we have added before, but that doesn't help very much.
- All this can happen over and over again in some case that I don't
  fully understand. In the restore2TB test, what happens is that the
  node that's supposed to be the leaseholder (say, n1), perpetually
  doesn't know about the range, and another node perpetually thinks that
  n1 is the leaseholder. I'm not sure why the latter happens, but this
  range is probably in an unusual state because the request that's
  spinning in this way is an ImportRequest.

This patch fixes the problem by changing how descriptor evictions work:
instead of comparing a token's pointer to the current cache entry and
refusing to do anything if they don't match, now evicting a descriptor
compares the descriptor's value with what's in the cache.
The cache code generally moves away from comparing pointers anywhere,
and an EvictionToken stops storing pointers to the cache's insides for
clarity.

Fixes cockroachdb#54118
Fixes cockroachdb#53197

Release note: None
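
As a companion to the sketch further up, here is the same toy model reworked to show the approach this revised version of the patch describes: eviction compares the descriptor value carried by the token against what is cached, instead of requiring pointer identity with the cache entry. The types are again invented for illustration.

```go
// Invented stand-ins again; this illustrates eviction by descriptor value
// rather than by pointer identity, per the revised fix described above.
package evictsketch

// rangeDescriptor is a toy stand-in for a range descriptor.
type rangeDescriptor struct {
	rangeID    int
	generation int
}

type cacheEntry struct {
	desc rangeDescriptor
}

type rangeCache struct {
	entry *cacheEntry
}

// evictByPointer models the old behavior: the eviction is ignored unless the
// caller still holds a pointer to the exact entry that is in the cache.
func (c *rangeCache) evictByPointer(e *cacheEntry) bool {
	if c.entry != e {
		return false // caller holds an older copy: nothing happens
	}
	c.entry = nil
	return true
}

// evictByValue models the new behavior: the descriptor carried by the token
// is compared by value, so a token built from an earlier (but equal) copy of
// the entry can still evict it; only a genuinely newer descriptor survives.
func (c *rangeCache) evictByValue(desc rangeDescriptor) bool {
	if c.entry == nil || c.entry.desc != desc {
		return false // cache already holds something newer (or nothing)
	}
	c.entry = nil
	return true
}
```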
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 17, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 17, 2020
craig bot pushed a commit that referenced this issue Sep 17, 2020
54468: kvclient: fix a request routing bug r=andreimatei a=andreimatei

Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in 84fa010 Sep 17, 2020
@andreimatei
Contributor

reopening until backported

@andreimatei andreimatei reopened this Sep 18, 2020
@andreimatei
Contributor

now backported

arulajmani pushed a commit that referenced this issue Sep 18, 2020