roachtest: acceptance/bank/node-restart failed #35326
SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/5b36cc6276340282cb333ff4a9cb4f1fbd6c3348 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189990&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/7f8a0969e8e9eb7e9fc0d2fe96e03849d30dd561 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199677&tab=buildLog
It's curious that every time this test fails, there's a pending […]

cc @vivekmenezes / @jordanlewis, could this error be responsible here? I'm thinking of the case in which the node getting killed holds the table lease and for some reason doesn't manage to release it after the restart, blocking writes to the table for north of 30s and thus failing the test.
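For reference, the failure mechanism being hypothesized maps onto how the test detects stalls: if no client makes write progress for longer than some threshold (the comment suggests ~30s), the test fails. Below is a minimal, hypothetical sketch of that kind of progress watchdog; it is not the actual bank.go code, and the names and thresholds are made up.

```go
// Hypothetical sketch of a write-progress watchdog, not the actual
// roachtest code. It fails if the shared write counter stops advancing
// for longer than stallLimit.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func watchProgress(writes *int64, stallLimit time.Duration, done <-chan struct{}) error {
	last, lastChange := atomic.LoadInt64(writes), time.Now()
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case <-done:
			return nil
		case <-tick.C:
			if cur := atomic.LoadInt64(writes); cur != last {
				last, lastChange = cur, time.Now()
			} else if time.Since(lastChange) > stallLimit {
				return fmt.Errorf("no write progress for more than %s", stallLimit)
			}
		}
	}
}

func main() {
	var writes int64
	done := make(chan struct{})
	go func() { // simulated client: makes progress briefly, then stalls
		for i := 0; i < 3; i++ {
			atomic.AddInt64(&writes, 1)
			time.Sleep(time.Second)
		}
	}()
	// With a 5s stall limit, the watchdog reports a failure once the
	// simulated client stops writing.
	if err := watchProgress(&writes, 5*time.Second, done); err != nil {
		fmt.Println("test would fail:", err)
	}
	close(done)
}
```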
Oh hey I found the same error in the logs two up:
@tbg this looks like a situation where the "read orphaned table leases" query (ROTL) was attempted a few times and eventually succeeded. That query returned leases that were orphaned, and those leases were released. But I don't see how this could have blocked writes to the table; there is no schema change happening in this test. Besides, all of this logic runs asynchronously and doesn't hold up the node from restarting.
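To make the shape of what's being described concrete, here's a hedged sketch of my own (with stand-in types and callbacks rather than actual CockroachDB internals) of an asynchronous "read orphaned leases, retry until the query succeeds, then release" loop:

```go
// Hypothetical sketch only: a background goroutine retries the
// "read orphaned table leases" query until it succeeds, then releases
// each lease it found. The lease type and both callbacks are stand-ins,
// not CockroachDB APIs; node startup is not blocked.
package main

import (
	"context"
	"log"
	"time"
)

type orphanedLease struct{ descID, nodeID int64 }

func releaseOrphanedLeasesAsync(
	ctx context.Context,
	readOrphaned func(context.Context) ([]orphanedLease, error),
	release func(context.Context, orphanedLease) error,
) {
	go func() {
		var leases []orphanedLease
		// Retry the read with simple exponential backoff until it succeeds.
		for backoff := time.Second; ; backoff *= 2 {
			var err error
			if leases, err = readOrphaned(ctx); err == nil {
				break
			}
			log.Printf("reading orphaned leases failed, retrying: %v", err)
			select {
			case <-ctx.Done():
				return
			case <-time.After(backoff):
			}
		}
		// Release whatever was found; failures are logged, not fatal.
		for _, l := range leases {
			if err := release(ctx, l); err != nil {
				log.Printf("releasing lease %+v failed: %v", l, err)
			}
		}
	}()
}

func main() {
	// Toy usage: the first read fails, the second succeeds.
	calls := 0
	releaseOrphanedLeasesAsync(context.Background(),
		func(context.Context) ([]orphanedLease, error) {
			if calls++; calls == 1 {
				return nil, context.DeadlineExceeded
			}
			return []orphanedLease{{descID: 53, nodeID: 2}}, nil
		},
		func(_ context.Context, l orphanedLease) error {
			log.Printf("released lease %+v", l)
			return nil
		},
	)
	time.Sleep(3 * time.Second) // let the background goroutine finish
}
```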
SHA: https://github.com/cockroachdb/cockroach/commits/c6df752eefe4609b8a5bbada0955f79a2cfb790e Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217763&tab=buildLog
cockroach/pkg/cmd/roachtest/bank.go, lines 177 to 190 in e830bb6

The node actually died when the […]
I don't see anything that would account for the stuck […]. The […]

I believe […] Hmm, this is curious in […]

Might be nothing. The […] Seems like the error from […]

I'm not seeing anything wedged in the goroutines on […]
SHA: https://github.com/cockroachdb/cockroach/commits/6da68d7fe2c9a29b85e2ec0c7e545a0d6bdc4c5c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1226521&tab=buildLog
The last one is a stall during […] I'm going to keep this open for now -- it seems to reproduce every 5-10 days on average, so we're due for another one soon.
@andreimatei could you keep an eye on this? The difficulty here, assuming it's not test infra, is observing what goes wrong during the start sequence. This is also potentially related to infra fix #37001.
SHA: https://github.com/cockroachdb/cockroach/commits/73765b6d168fb999466756b112fd590747a3a8c4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266059&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/24feca7a4106f08c73534e16ebb79d949a479f35 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268176&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/84dc682eca4b11e6abaf390fc8883f32afe81fb4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1283539&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/a65ec243adaa11ea951f02319dc2e02463461ab2 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290143&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/9671342fead0509bec0913bae4ae1f244660788e Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1298500&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/c280de40c2bcab93c41fe82bef8353a5ecd95ac4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1311970&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/f1c9693da739fa5fc2c94d4d978fadd6710d17da Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1371441&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/f74db5e81f8eaa190a41d708a9ccafb3eba9370a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1380786&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1446993&tab=buildLog
Running on bd27eb3, I ran into the following:

This then prompts the […]
Ok, I have a handle on what's happening here and in #38785 (TL;DR: flaky networking […]). Below I've included roachtest logs from all the failures I've been able to […]

Note the long stall between 16:42:17 and 16:42:57. This corresponds to our […]. Here's another:

The out-of-order timestamps are funky, but I think they're not a problem. There's […]. Another one:

Except this time I also saw this failure, which was unlike the ones above, but still related:

Here […] I was pretty sure at this point that this was some random infra flake. I then tried reproducing it with Network Link Conditioner […]
Very interesting about Network Link Conditioner helping to reproduce this. 100ms of delay really causes our ssh invocations to get in trouble?!?
There's a fair bit of tweaking and waiting around for it to happen, but I found that increasing the test duration to 5 minutes or so was more conducive to shaking these out. Dropped packets help, higher latencies help, but yes, even 100ms timed just right can make things go wonky. I haven't seen this ssh logging; I was working off an older branch. I'll check it out.
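For anyone trying this on Linux without Network Link Conditioner, here's a rough, hypothetical stand-in of my own: a tiny TCP proxy that adds a fixed delay in front of an SSH endpoint. The listen address, target address, and the 100ms figure are placeholders; this is not something the test itself uses.

```go
// delayproxy.go: a crude latency injector. It accepts TCP connections
// and forwards them to targetAddr, sleeping before each chunk of data
// to simulate network delay.
package main

import (
	"io"
	"log"
	"net"
	"time"
)

const (
	listenAddr = "localhost:2222" // point ssh at this
	targetAddr = "10.0.0.5:22"    // the real SSH endpoint (placeholder)
	delay      = 100 * time.Millisecond
)

// delayedCopy copies src to dst, sleeping before each write to add latency.
func delayedCopy(dst io.Writer, src io.Reader) {
	buf := make([]byte, 32*1024)
	for {
		n, err := src.Read(buf)
		if n > 0 {
			time.Sleep(delay)
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return
			}
		}
		if err != nil {
			return
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("forwarding %s -> %s with %s of added delay", listenAddr, targetAddr, delay)
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			upstream, err := net.Dial("tcp", targetAddr)
			if err != nil {
				log.Printf("dial %s: %v", targetAddr, err)
				return
			}
			defer upstream.Close()
			go delayedCopy(upstream, c) // client -> server
			delayedCopy(c, upstream)    // server -> client
		}(conn)
	}
}
```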
(cross-posting from #38785 (comment)) For some more confirmation bias that it's flaky infrastructure at play:
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785. Fixes #35326.

Because everything roachprod does, it does through SSH, we're particularly susceptible to network delays, packet drops, etc. We've seen this before, or at least pointed to this being a problem before, over at #37001. Setting timeouts around our calls to roachprod helps to better surface these kinds of errors.

The underlying issue in #38785 and #35326 is the fact that we're running roachprod commands that may (reasonably) fail due to connection issues, and we're unable to retry them safely (the underlying commands are non-idempotent). Presently we simply fail the entire test, when really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes

Release note: None

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734. Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:

- if the license acquisition was disabled by the env var config, the error message would not be clear.
- the licensing code would deadlock silently on OSS-only builds (because the license failure channel was not written to in that control branch).
- the error/warning messages would be interleaved on the same line as the input line (missing newline at the start of the message).
- the test code would fail when the license server is not available.
- the setup of the example database and workload would be performed asynchronously, with unclear signalling of when the user can expect to use them interactively.

After this patch:

- it's possible to override the license acquisition URL with COCKROACH_DEMO_LICENSE_URL; this is used in tests.
- setting up the example database, partitioning, and workload is done before presenting the interactive prompt.
- partitioning the example database, if requested by --geo-partitioned-replicas, waits for license acquisition to complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported early (or earlier). For example, on OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with enterprise features in firewalled network environments, it is now possible to redirect the license acquisition with the environment variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
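As a rough illustration of the "timeouts around roachprod calls" idea from the first PR above (a sketch under assumed names, commands, and timeouts, not the actual roachtest change):

```go
// Minimal sketch: run an external command with a deadline so that a
// hung SSH/network connection surfaces as a timeout error instead of
// stalling the test indefinitely. The command and 2-minute timeout are
// placeholders.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs a command and returns a descriptive error if it
// fails or runs past the deadline.
func runWithTimeout(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("%s %v timed out after %s (likely SSH/network flake)\n%s", name, args, timeout, out)
	}
	if err != nil {
		return fmt.Errorf("%s %v failed: %v\n%s", name, args, err, out)
	}
	return nil
}

func main() {
	// Example: restart node 1 of a hypothetical cluster, giving up after 2 minutes.
	if err := runWithTimeout(2*time.Minute, "roachprod", "start", "my-cluster:1"); err != nil {
		fmt.Println(err)
	}
}
```

Note that, as the PR description says, a timeout only surfaces the failure; because the underlying commands are non-idempotent, the caller still can't blindly retry them.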
SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog