roachtest: acceptance/bank/cluster-recovery failed #38785
SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog
@nvanbenschoten you mentioned you were stressing this one, right?

I got distracted by #39018. Could be the same root cause. I'll stress with and without the fix tomorrow.
SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1446993&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1447014&tab=buildLog
@irfansharif do you mind taking a look at this? The most recent failure looks like a flake, but the two
Taking a look at this now.

The recent failures in #35326 look similar. They probably have the same root cause.
This is possibly related to #39841. We trip circuit breakers around node startup, failing with the following:
The gRPC failures were a red herring: looking at the timings of when each circuit breaker for each gossip connection was tripped and when exactly each node was killed, they line up as we'd expect.
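To make the circuit-breaker behaviour concrete, here is a minimal Go sketch of the general pattern, assuming nothing about CockroachDB's actual breaker implementation: after a few consecutive failures (say, dialing a node the chaos monkey just killed), the breaker trips and rejects further attempts until a cool-off elapses, which is exactly the kind of tripping you'd expect to line up with node kills.

```go
// Illustrative circuit-breaker sketch only; not CockroachDB's implementation.
package main

import (
	"errors"
	"fmt"
	"time"
)

type breaker struct {
	failures  int
	threshold int
	trippedAt time.Time
	coolOff   time.Duration
}

var errBreakerOpen = errors.New("circuit breaker open")

func (b *breaker) call(op func() error) error {
	if !b.trippedAt.IsZero() && time.Since(b.trippedAt) < b.coolOff {
		return errBreakerOpen // reject fast while tripped
	}
	if err := op(); err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.trippedAt = time.Now() // trip after repeated failures
		}
		return err
	}
	b.failures = 0
	b.trippedAt = time.Time{} // a success resets the breaker
	return nil
}

func main() {
	b := &breaker{threshold: 3, coolOff: 5 * time.Second}
	dialDeadNode := func() error { return errors.New("connection refused") }
	for i := 0; i < 5; i++ {
		fmt.Println(b.call(dialDeadNode))
	}
}
```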
Ok, so here's what's going on (TL;DR roachtest bug, not CRDB). The "stalls" detected boil down to the time between chaos monkey iterations. Within each chaos monkey iteration we lock all clients in sequence, restart each node, unlock all clients, and then sleep until at least one client has made progress.

Given our stall timeout is only 30s, we have just that long to go through all of the above. Each client locks around its UPDATE query so as to not be interrupted, and every now and then those UPDATEs take much longer than a few milliseconds (txnwait procedures, waiting out a contending txn). Since the chaos monkey has to drain those requests while locking clients one by one, an unlucky convoy can push the iteration past the 30s budget and a "stall" is reported even though the clients are making progress.

We should really only be interested in how long it takes for the chaos monkey to restart a set of nodes, and in whether clients keep making progress afterwards; the UPDATE queries already carry statement timeouts that fail if they take too long.

All that being said, this failure mode is awfully similar to the recent failures in #35326.
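To make the convoy problem concrete, here is a hypothetical Go sketch of the iteration structure described above (the names, timings, and restart step are placeholders, not the actual roachtest code): locking every client in sequence means any slow in-flight UPDATE has to drain first, and that wait counts against the same 30s stall budget as the restarts themselves.

```go
// Hypothetical sketch of the chaos monkey iteration; placeholder names only.
package main

import (
	"fmt"
	"sync"
	"time"
)

type client struct {
	mu sync.Mutex // held around each UPDATE so a restart never interrupts it
}

func chaosIteration(clients []*client, restartNodes func()) time.Duration {
	start := time.Now()

	// 1. Lock all clients, in sequence. A client stuck waiting out a
	//    contending txn blocks us right here.
	for _, c := range clients {
		c.mu.Lock()
	}

	// 2. Restart a set of nodes while no client is mid-query.
	restartNodes()

	// 3. Unlock all clients so they resume issuing UPDATEs.
	for _, c := range clients {
		c.mu.Unlock()
	}

	// 4. (Elided) sleep until at least one client has made progress.
	return time.Since(start) // must stay under the stall timeout
}

func main() {
	clients := []*client{{}, {}}
	elapsed := chaosIteration(clients, func() { time.Sleep(100 * time.Millisecond) })
	fmt.Println("iteration took", elapsed)
}
```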
Informs cockroachdb#38785.

The "stalls" detected boil down to the time between chaos monkey iterations. Within each chaos monkey iteration we do the following:

- Lock all clients, in sequence
- Restart each node
- Unlock all clients
- Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all of the above. In each client we lock around the UPDATE query so as to not be interrupted. The problem is that every now and then these UPDATE queries take a lot longer than a few milliseconds. This is expected behaviour: it is primarily due to txnwait procedures and having to wait for the expiration of an extant contending txn. More importantly, it's not what we're testing here, as the clients are still making progress. Given that the chaos monkey first locks each client, it has to drain out these requests, which eats into the 30s or so we have for each chaos monkey iteration. This is made worse by the fact that we do this in sequence for each client. When we're unlucky, we run into this particular convoy situation, we're unable to finish the round in time, and a "stall" is detected.

We should really only be interested in how long it takes for the chaos monkey to restart a set of nodes, and in ensuring that after it does, the clients are still making progress. We already have statement timeouts for the UPDATE queries that fail if we take "too long". Removing the stopClients apparatus gives us what we need.

Release justification: Category 1: Non-production code changes
Release note: None
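A hedged sketch of the direction the commit describes: with the stopClients locking removed, each client's UPDATE simply runs under a statement timeout (`statement_timeout` is a real CockroachDB session setting), so a cluster that truly stops making progress still fails the test. The connection string, schema, and helper below are illustrative assumptions, not the actual workload code.

```go
// Illustrative client step under a statement timeout; not the real workload.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; works against CockroachDB
)

func clientStep(ctx context.Context, db *sql.DB, id int) error {
	// Use a single connection so the session setting applies to the UPDATE.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Fail the statement rather than stalling indefinitely behind a restart
	// or a contending transaction.
	if _, err := conn.ExecContext(ctx, `SET statement_timeout = '30s'`); err != nil {
		return err
	}
	// The real workload moves money between accounts in a transaction; a
	// single UPDATE stands in for that work here.
	_, err = conn.ExecContext(ctx,
		`UPDATE accounts SET balance = balance + 1 WHERE id = $1`, id)
	return err
}

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/bank?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := clientStep(ctx, db, 1); err != nil {
		log.Printf("client step failed: %v", err)
	}
}
```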
40874: delegate: Fix index resolution hack in show partitions r=rohany a=rohany

This fixes a hack that was introduced when writing show partitions in order to have a better error message when the user provided an invalid index.

Release justification: Low risk improvement to functionality.
Release note: None

40976: roachtest: fix bank/cluster-recovery r=irfansharif a=irfansharif

Informs #38785.

The "stalls" detected boil down to the time between chaos monkey iterations. Within each chaos monkey iteration we do the following:

- Lock all clients, in sequence
- Restart each node
- Unlock all clients
- Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all of the above. In each client we lock around the UPDATE query so as to not be interrupted. The problem is that every now and then these UPDATE queries take a lot longer than a few milliseconds. This is expected behaviour: it is primarily due to txnwait procedures and having to wait for the expiration of an extant contending txn. More importantly, it's not what we're testing here, as the clients are still making progress. Given that the chaos monkey first locks each client, it has to drain out these requests, which eats into the 30s or so we have for each chaos monkey iteration. This is made worse by the fact that we do this in sequence for each client. When we're unlucky, we run into this particular convoy situation, we're unable to finish the round in time, and a "stall" is detected.

We should really only be interested in how long it takes for the chaos monkey to restart a set of nodes, and in ensuring that after it does, the clients are still making progress. We already have statement timeouts for the UPDATE queries that fail if we take "too long". Removing the stopClients apparatus gives us what we need.

Release justification: Category 1: Non-production code changes
Release note: None

Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785.
Fixes #35326.

Because everything roachprod does, it does through SSH, we're particularly susceptible to network delays, packet drops, etc. We've seen this before, or at least pointed to this being a problem before, over at #37001. Setting timeouts around our calls to roachprod helps to better surface these kinds of errors.

The underlying issue in #38785 and in #35326 is the fact that we're running roachprod commands that may (reasonably) fail due to connection issues, and we're unable to retry them safely (the underlying commands are non-idempotent). Presently we simply fail the entire test, when really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes
Release note: None

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734.
Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:

- if the license acquisition was disabled by the env var config, the error message would not be clear.
- the licensing code would deadlock silently on OSS-only builds (because the license failure channel was not written in that control branch).
- the error/warning messages would be interleaved on the same line as the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the set up of the example database and workload would be performed asynchronously, with unclear signalling of when the user can expect to use them interactively.

After this patch:

- it's possible to override the license acquisition URL with COCKROACH_DEMO_LICENSE_URL; this is used in tests.
- setting up the example database, partitioning and workload is done before presenting the interactive prompt.
- partitioning the example database, if requested by --geo-partitioned-replicas, waits for license acquisition to complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported early (earlier). For example:

For OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server: Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with enterprise features in firewalled network environments, it is now possible to redirect the license acquisition with the environment variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
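As an illustration of the roachprod-timeout change in 40997 above, here is a minimal sketch (assumed names and subcommand, not the roachtest helper itself) of running a roachprod command under a context deadline so an SSH stall surfaces as an error instead of hanging the test; as the PR notes, retrying on failure is deliberately left out because the underlying commands are not idempotent.

```go
// Minimal sketch of running an external command under a timeout; the
// subcommand and cluster name are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"os/exec"
	"time"
)

func runWithTimeout(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		// Caution: if the command already did non-idempotent work (e.g.
		// started a node) before stalling, blindly retrying is not safe.
		return fmt.Errorf("%s %v: %w\n%s", name, args, err, out)
	}
	return nil
}

func main() {
	if err := runWithTimeout(time.Minute, "roachprod", "start", "local:1"); err != nil {
		log.Printf("roachprod start failed: %v", err)
	}
}
```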
SHA: https://github.com/cockroachdb/cockroach/commits/f74db5e81f8eaa190a41d708a9ccafb3eba9370a
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1380786&tab=buildLog