Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: ycsb/E/nodes=3/cpu=32 failed #106474

Closed
cockroach-teamcity opened this issue Jul 8, 2023 · 7 comments
Closed

roachtest: ycsb/E/nodes=3/cpu=32 failed #106474

cockroach-teamcity opened this issue Jul 8, 2023 · 7 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team X-duplicate Closed as a duplicate of another issue.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 8, 2023

roachtest.ycsb/E/nodes=3/cpu=32 failed with artifacts on master @ 43c26aec0072f76e02e6d5ffc1b7079026b24630:

(cluster.go:2282).Run: output in run_080341.261360991_n4_workload-run-ycsb-in: ./workload run ycsb --init --insert-count=1000000 --workload=E --concurrency=144 --splits=3 --histograms=perf/stats.json --select-for-update=true --ramp=2m --duration=30m {pgurl:1-3} returned: COMMAND_PROBLEM: exit status 1
(monitor.go:137).Wait: monitor failure: monitor task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/ycsb/E/nodes=3/cpu=32/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-29579

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Jul 8, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 8, 2023
@renatolabs
Copy link
Contributor

Error: ERROR: result is ambiguous: error=ba: ConditionalPut [/Table/106/1/"user4881747055679970146"/0,/Min), EndTxn(parallel commit) [/Table/106/1/"user4881747055679970146"/0], [txn: c4db6824], [can-forward-ts] RPC error: grpc: grpc: the client connection is closing [code 1/Canceled] [exhausted] (SQLSTATE 40003)

@cockroachdb/kv should this error have bubbled up to the client (workload)? I'll let you decide what to do with this failure.

@andrewbaptist
Copy link
Contributor

@arulajmani can you take a quick look at this.

Looking more at this there are some confusing things:

  1. The test starts the workload at 08:03:41 with the command:
    ./workload run ycsb --init --insert-count=1000000 --workload=E --concurrency=144 --splits=3 --histograms=perf/stats.json --select-for-update=true --ramp=2m --duration=30m {pgurl:1-3}
  2. After 2:09 minutes, just after the ramp finishes, on node 3, we get a message that the clock jumped by almost 2 seconds
    W230708 08:05:50.108845 374 2@rpc/clock_offset.go:291 ⋮ [T1,n3,rnode=1,raddr=‹×›,class=system,rpc] 164 latency jump (prev avg 0.40ms, current 1904.04ms)
  3. Immediately after a bunch of requests on that node fail with context canceled.
 I230708 08:05:50.243888 4693 server/node.go:1335 ⋮ [T1,n3] 165  batch request ‹ConditionalPut [/Table/106/1/"user8186684413617255504"/0,/Min), EndTxn(parallel commit) [/Table/106/1/"user8186684413617255504"/0], [txn: c38eb787], [can-forward-ts]› failed with     error: context canceled
 I230708 08:05:50.243888 4693 server/node.go:1335 ⋮ [T1,n3] 165 +trace:
 I230708 08:05:50.243888 4693 server/node.go:1335 ⋮ [T1,n3] 165 +     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch
 I230708 08:05:50.243975 4712 server/node.go:1335 ⋮ [T1,n3] 166  batch request ‹Scan [/Table/106/1/"user8102713941260612647",/Table/106/1/"user9223372036854775806"), [txn: 9ed0c802]› failed with error: context canceled
 I230708 08:05:50.243975 4712 server/node.go:1335 ⋮ [T1,n3] 166 +trace:
 I230708 08:05:50.243975 4712 server/node.go:1335 ⋮ [T1,n3] 166 +     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch
 I230708 08:05:50.243975 4712 server/node.go:1335 ⋮ [T1,n3] 166 +     0.500ms      0.500ms    structured:‹{"@type":"type.googleapis.com/cockroach.roachpb.ScanStats","numInterfaceSeeks":"2","numInternalSeeks":"2","numInterfaceSteps":"544","numInternalSteps":      "544","blockBytes":"768790","blockBytesInCache":"768790","keyBytes":"21204","valueBytes":"558625","pointCount":"545","pointsCoveredByRangeTombstones":"0","rangeKeyCount":"0","rangeKeyContainedPoints":"0","rangeKeySkippedPoints":"0","separatedPointCount":"1",    "separatedPointValueBytes":"30","separatedPointValueBytesFetched":"0","numGets":"0","numScans":"1","numReverseScans":"0"}›

Nothing indicates that the node is having any problems other than the clock jump. I believe that jump is causing the request to fail, but I'm unsure how.

@tbg
Copy link
Member

tbg commented Jul 18, 2023

we get a message that the clock jumped by almost 2 seconds

This can also mean the node was overloaded, btw. That would be my default assumption here, especially in conjunction with the AmbiguousResultError, which is often the result of a timeout firing (and a proposer abandoning a proposal).

The workload here doesn't tolerate errors, which, you know, technically it should always have to tolerate this one. SQL users "always" have to be prepared to handle it, though admittedly there shouldn't be a reason to see it in a healthy system. Still, the system in these benchmarks is not "healthy" since we're driving it as close to the limit as possible.

@erikgrinaker
Copy link
Contributor

Yeah, this isn't a clock jump, it's request latency.

@andrewbaptist andrewbaptist removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 19, 2023
@renatolabs
Copy link
Contributor

Is there something to do about this and related failures?

The workload here doesn't tolerate errors, which, you know, technically it should always have to tolerate this one. SQL users "always" have to be prepared to handle it

Not sure we have a good way to handle this in the workload in the general case (i.e., not safe to always retry). What would we prefer workload did in this case? If there's something concrete to be done, I'd appreciate if someone could write a more direct issue describing it.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jul 25, 2023

We basically have to implement the same client-side error handling that we ask all of our users to implement in their applications that use CRDB:

https://www.cockroachlabs.com/docs/stable/transaction-retry-error-reference

https://www.cockroachlabs.com/docs/stable/common-errors#result-is-ambiguous

@renatolabs
Copy link
Contributor

Created #107571 for us to keep track of this. I'm going to close this issue in favour of that one, as it describes the issue more directly.

@irfansharif irfansharif added the X-duplicate Closed as a duplicate of another issue. label Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team X-duplicate Closed as a duplicate of another issue.
Projects
None yet
Development

No branches or pull requests

7 participants