kv/kvnemesis: TestKVNemesisSingleNode failed #118005

Closed
cockroach-teamcity opened this issue Jan 20, 2024 · 9 comments · Fixed by #118673
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

@cockroach-teamcity
Member

cockroach-teamcity commented Jan 20, 2024

kv/kvnemesis.TestKVNemesisSingleNode failed on master @ f740c076ec0d92972a47cf6633cf065c8c98678f:

Fatal error:

panic: test timed out after 4m57s
running tests:
	TestKVNemesisSingleNode (4m54s)

Stack:

goroutine 85585 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2259 +0x3b9
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x2d
Log preceding fatal error

=== RUN   TestKVNemesisSingleNode
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestKVNemesisSingleNode4120009455
    test_log_scope.go:81: use -show-logs to present logs inline
    kvnemesis_test.go:327: seed: 5131562799898371130
    kvnemesis_test.go:234: kvnemesis logging to /var/lib/engflow/worker/work/1/exec/_tmp/e2608b3e5de5271ebe05e919d261d295/kvnemesis4171750900

Parameters:

  • attempt=1
  • run=30
  • shard=1
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-35451

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jan 20, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Jan 20, 2024
@cockroach-teamcity
Member Author

kv/kvnemesis.TestKVNemesisSingleNode failed on master @ 3c837c2a86165188a26e8629b7e21b7ae0fb56ae:

Fatal error:

panic: test timed out after 4m57s
running tests:
	TestKVNemesisSingleNode (4m55s)

Stack:

goroutine 63935 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2259 +0x3b9
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x2d
Log preceding fatal error

=== RUN   TestKVNemesisSingleNode
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestKVNemesisSingleNode1351152789
    test_log_scope.go:81: use -show-logs to present logs inline
    kvnemesis_test.go:327: seed: 1459129335189392956
    kvnemesis_test.go:234: kvnemesis logging to /var/lib/engflow/worker/work/3/exec/_tmp/fb930ff13537ce55361891aaf7d31bf9/kvnemesis1960930861

Parameters:

  • attempt=1
  • run=18
  • shard=1
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@miraradeva
Contributor

Seems to be stuck waiting in the lock table for over 4 minutes:

goroutine 5374 [select, 4 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).WaitOn(0xc0000ce5a0, {0x719b5d0, 0xc004e78780}, {0xc0046feb40, {0x17abf6e5303ed07b, 0x1}, 0x0, 0x0, 0x0, 0x0, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:156 +0x2e5
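
For anyone scanning a full goroutine dump for waiters like this one, here is a minimal sketch in Go that picks out goroutine headers and their reported wait time. It assumes the standard Go runtime header format shown above (`goroutine N [state, M minutes]:`). The names `parseHeader` and `blockedGoroutine` are made up for illustration and are not part of kvnemesis or CockroachDB.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// headerRE matches goroutine dump headers such as
// "goroutine 5374 [select, 4 minutes]:". The wait time is optional.
var headerRE = regexp.MustCompile(`^goroutine (\d+) \[([^,\]]+)(?:, (\d+) minutes)?`)

// blockedGoroutine is a hypothetical record for one header line.
type blockedGoroutine struct {
	ID      int
	State   string
	Minutes int // 0 when the dump reports no wait time
}

// parseHeader extracts the ID, state, and wait time from a header line.
// The second return value is false if the line is not a goroutine header.
func parseHeader(line string) (blockedGoroutine, bool) {
	m := headerRE.FindStringSubmatch(line)
	if m == nil {
		return blockedGoroutine{}, false
	}
	id, err := strconv.Atoi(m[1])
	if err != nil {
		return blockedGoroutine{}, false
	}
	g := blockedGoroutine{ID: id, State: m[2]}
	if m[3] != "" {
		if mins, err := strconv.Atoi(m[3]); err == nil {
			g.Minutes = mins
		}
	}
	return g, true
}

func main() {
	lines := []string{
		"goroutine 85585 [running]:",
		"goroutine 5374 [select, 4 minutes]:",
	}
	for _, line := range lines {
		// Report only goroutines that have been waiting for 4+ minutes.
		if g, ok := parseHeader(line); ok && g.Minutes >= 4 {
			fmt.Printf("goroutine %d blocked in %q for %d minutes\n", g.ID, g.State, g.Minutes)
		}
	}
}
```

A long wait time in a header is only a hint. It shows that the goroutine has not been scheduled for that long, not why it is blocked.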

@arulajmani arulajmani self-assigned this Jan 22, 2024
@arulajmani
Collaborator

Looks like there's some sort of deadlock here, given we're waiting in the lockTableWaiter, waiting for a notification. It would be helpful to look at the last steps the test was performing to see if we can glean something from them. However, for some reason, the test didn't dump the steps it ran like we normally see -- I wonder if that's because of EngFlow.

@arulajmani
Collaborator

I've been running this on my GCE worker for more than an hour and have yet to see a failure.

@lyang24
Contributor

lyang24 commented Jan 23, 2024

I wonder if this has to do with the test failure. Purely guessing based on timing: 839c2a0 was merged the day before the test failure.

nvm, I misread: 839c2a0 was merged after the failure.

@arulajmani
Collaborator

@lyang24 that PR also only affects crdb_internal.cluster_locks, so that shouldn't be an issue here.

Another hour and a half on my GCE worker and nothing. I'm beginning to wonder if there's something specific to EngFlow going on here.

miraradeva added a commit to miraradeva/cockroach that referenced this issue Jan 25, 2024
Previously, kvnemesis used os.TempDir() to write the various debug
files (including the repro steps) to a temp dir. When the test failed
on EngFlow, the temp dir was not included in outputs.zip, which made
tests hard to investigate.

This patch uses datapathutils.DebuggableTempDir() instead. If the test
is running locally, the behavior is the same as os.TempDir(). If the
test is running remotely, it will write the debug files to
TEST_UNDECLARED_OUTPUTS_DIR and Bazel will package them up into
outputs.zip.

Informs: cockroachdb#118005

Release note: None
craig bot pushed a commit that referenced this issue Jan 26, 2024
118317: kvnemesis: write debug files using DebuggableTempDir r=rickystewart a=miraradeva

Previously, kvnemesis used os.TempDir() to write the various debug files (including the repro steps) to a temp dir. When the test failed on EngFlow, the temp dir was not included in outputs.zip, which made tests hard to investigate.

This patch uses datapathutils.DebuggableTempDir() instead. If the test is running locally, the behavior is the same as os.TempDir(). If the test is running remotely, it will write the debug files to TEST_UNDECLARED_OUTPUTS_DIR and Bazel will package them up into outputs.zip.

Informs: #118005

Release note: None

Co-authored-by: Mira Radeva <[email protected]>
@cockroach-teamcity
Member Author

kv/kvnemesis.TestKVNemesisSingleNode failed on master @ 0baf22a03d5f55e2611701bc723e3e0b713ab051:

Fatal error:

panic: test timed out after 4m57s
running tests:
	TestKVNemesisSingleNode (4m55s)

Stack:

goroutine 57818 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2259 +0x3b9
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x2d
Log preceding fatal error

=== RUN   TestKVNemesisSingleNode
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestKVNemesisSingleNode2899087022
    test_log_scope.go:81: use -show-logs to present logs inline
    kvnemesis_test.go:328: seed: 1605344677229518599
    kvnemesis_test.go:235: kvnemesis logging to /var/lib/engflow/worker/work/0/exec/bazel-out/k8-fastbuild/testlogs/pkg/kv/kvnemesis/kvnemesis_test/run_16_of_30/test.outputs/kvnemesis1012630944

Parameters:

  • attempt=1
  • run=16
  • shard=1
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@rickystewart
Collaborator

If need be, feel free to get a skip.UnderRemoteExecutionWithIssue() in there. You can use this issue.

craig bot pushed a commit that referenced this issue Feb 3, 2024
116958: rpc: move system ranges to system RPC class r=lunevalex a=lunevalex

Move all the ranges between /Min and /System/tsd to use the default RPC class. This will allow for
isolation from network congestion for all system
ranges, which is crucial for the stability of the system.

Fixes: #111239

Release note: None

118555: workflows: run UI lint and test in experimental github actions build r=rail a=rickystewart

Epic: [CRDB-8308](https://cockroachlabs.atlassian.net/browse/CRDB-8308)
Release note: None

118581: roachtest: stop ignoring activerecord failures r=rafiss a=rafiss

The adapter has been stabilized, so we should enable this test again.

fixes #108938
Release note: None

118659: tests: use `test.Pool` instead of `Pool` r=rail a=rickystewart

This tells Bazel to use the pool only for test actions instead of the compile action associated with each test.

Epic: CRDB-8308
Release note: None

118673: kv: mark kvnemesis tests as "large" sized r=nvanbenschoten a=arulajmani

We've recently seen these time out exclusively on EngFlow. In all those instances, we can see from the stack traces that the test is making some progress -- it's slow though. We mark KVNemesis tests as large, which in turn bumps their timeout in CI.

Closes #118624
Closes #118005

Release note: None

118675: ui: remove warning when auto refresh enable r=maryliag a=maryliag

The warning being displayed about old active executions was not being properly removed when the auto refresh was turned back on.
This commit fixes this for both Statements and Transactions pages, on Active Executions.

Fixes CRDB-35837

https://www.loom.com/share/76b57eba17ab44758fe81f178f07fecd

Release note (ui change): Properly remove warning of old date on Active Executions when auto refresh is enabled.

Co-authored-by: Alex Lunev <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: maryliag <[email protected]>
@craig craig bot closed this as completed in 174bb2b Feb 3, 2024
@craig craig bot closed this as completed in #118673 Feb 3, 2024
@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Feb 5, 2024
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
We've recently seen these time out exclusively on EngFlow. In all
those instances, we can see from the stack traces that the test is
making some progress -- it's slow though. We mark KVNemesis tests as
large, which in turn bumps their timeout in CI.

Closes cockroachdb#118624
Closes cockroachdb#118005

Release note: None
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024