teamcity: failed test: bank/node-restart #30064
#29678 also has a failure for this test, though with a different failure mode.
I'm unable to reproduce this failure locally. It seems to have happened several times on CI.
It's interesting that the stall occurs on round 1. Looks like the cluster never came up. The logs on the nodes don't show anything interesting:
Here's something odd. The test indicated that it started the cluster, then stopped node 3 and restarted it. Yet I see 2 log files on nodes 1, 2, and 4, while I only expect to see 1. A simple manual test of this scenario shows only 1 log produced on the nodes that were started once. Oh, it looks like one of the log files is from the invocation of cockroach with the …
While investigating another acceptance failure, I reproduced the problem seen in the original message:
Unfortunately, I didn't specify the …
Starting to repro this too.
I don't think a single write to the DB has happened (the DB is running happily and has a fresh accounts table). Going to add some stack traces and retry.
Hmm, looking at a stuck run in round 18. The cluster is happy, and the accounts table contains only zeroes as well. Wouldn't be surprised if the test is doing something silly.
I Ctrl-C'ed a test and can't be sure whether this is why it happened, but I got stuck with this stack trace:
Hrm, this test is a bit nonsensical; look at this query:
This is a no-op if the withdrawal account $1 doesn't have an amount of $3 on it, but how would it? We never initialize any account with a nonzero value in this test.
Did I cut&paste something incorrectly from the old acceptance test? Doesn't look like it.
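For context, the transfer statement is roughly of this shape (paraphrased; the exact SQL and placeholder numbering are assumptions, not the literal test code). With every balance initialized to zero and a positive amount in $3, the guard on the source account can never pass, so the UPDATE modifies zero rows:

```go
package main

import "fmt"

// Paraphrased shape of the bank test's transfer statement (an assumption,
// not the exact query): $1 = source account, $2 = destination account,
// $3 = amount to move.
const transferStmt = `
UPDATE bank.accounts
   SET balance = CASE id WHEN $1 THEN balance - $3
                         WHEN $2 THEN balance + $3 END
 WHERE id IN ($1, $2)
   AND (SELECT balance >= $3 FROM bank.accounts WHERE id = $1)`

func main() {
	// With all balances at zero and $3 > 0, the subquery guard is false for
	// every pair of accounts, so the statement never modifies a row: a no-op.
	fmt.Println(transferStmt)
}
```

In other words, the chaos portion of the test has been exercising a workload that can never actually write.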
Got the stall error too, on the first round. The cluster is healthy and happy to respond to queries.
It might've been nonsensical for a long time. We just need to initialize with, say, 100 instead of a zero balance and it'll actually do something. But first let's deflake it...
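Concretely, seeding the table with a nonzero starting balance would look something like this (a sketch only; the schema, the UPSERT loop, and the connection string are assumptions rather than the actual test change):

```go
package main

import (
	"database/sql"

	_ "github.com/lib/pq" // any Postgres-wire driver works against CockroachDB
)

// seedAccounts gives every account a nonzero starting balance so that the
// conditional transfer statement above can actually move money.
func seedAccounts(db *sql.DB, numAccounts, startingBalance int) error {
	if _, err := db.Exec(`CREATE DATABASE IF NOT EXISTS bank`); err != nil {
		return err
	}
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS bank.accounts (id INT PRIMARY KEY, balance INT NOT NULL)`,
	); err != nil {
		return err
	}
	for id := 0; id < numAccounts; id++ {
		// Previously the effective starting balance was 0, which made every
		// transfer a no-op; 100 gives the workload something to shuffle around.
		if _, err := db.Exec(
			`UPSERT INTO bank.accounts (id, balance) VALUES ($1, $2)`,
			id, startingBalance,
		); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Hypothetical connection string; a real run points this at the cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/bank?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if err := seedAccounts(db, 10, 100); err != nil {
		panic(err)
	}
}
```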
The "stall detected" message comes with this goroutine stack (the other goroutines are doing harmless stuff):
The server is happy. I'm afraid this means that what I want is a stack trace of the node the client is talking to. Let's see.
Hmm. I had what I thought was a failing test run:
The …
Yeah - the same thing happened again.
Next hang. This time, the cluster is not fine (connecting via the SQL shell times out).
Okay, here's what node4 (the one the client connects to) says about this connection:
Interesting. I can actually connect to the SQL shell and query the table on n4, but not on n1 and n2. Then I tried n1 again and it worked. n2 still doesn't. Hmm. I wonder if this has something to do with it:
Restricting to network file descriptors, there are only a few dozen, which seems reasonable, so back to the nodes themselves... n1 is clearly unhappy:
No, and that theory didn't pan out. I still see all followers slowly fall behind the leader under heavy load.
Everything I've written about so far is concerning and deserves continued investigation. It is pretty easy to reproduce with #31081 if you disable …
To recap, we've fixed three separate stalls so far, and the test now passes reliably for about 2-3 hours of stress. When it does fail, it doesn't stall indefinitely as it did with the other issues; instead, it stalls for only about 60 seconds. I suspect that the following log explains these 60-second stalls. We see these on …
The logs continue for exactly a minute before going away. These logs originate from …
We then establish a SQL connection to a node that is stuck trying to acquire this lease and has to wait for the abandoned lease to expire. I don't have a good solution here in the general case, but there are a few things we could do in the test to fix this and prevent the flake:
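To make that 60-second wait concrete, here is a minimal, self-contained sketch of the acquire-or-wait-for-expiration pattern described above (illustrative Go only, not CockroachDB's sqlmigrations or lease-manager code; the one-minute TTL is an assumption chosen to match the observed stall):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// lease is a toy stand-in for a migration lease with an expiration.
type lease struct {
	holder  string
	expires time.Time
}

var errLeaseHeld = errors.New("lease held and not yet expired")

// tryAcquire grants the lease unless another holder's lease has not expired.
func tryAcquire(cur *lease, who string, ttl time.Duration, now time.Time) (*lease, error) {
	if cur != nil && now.Before(cur.expires) {
		// The previous holder may be dead, but we cannot tell; all we can do
		// is wait for the expiration to pass.
		return nil, errLeaseHeld
	}
	return &lease{holder: who, expires: now.Add(ttl)}, nil
}

func main() {
	const ttl = time.Minute // assumed TTL, matching the ~60s stalls observed
	start := time.Now()

	// Node 3 grabs the migration lease and is then killed by chaos.
	abandoned, _ := tryAcquire(nil, "node3", ttl, start)

	// Node 4 restarts and retries. Every attempt fails until the TTL passes,
	// which a SQL client connected to node 4 experiences as a ~60s stall.
	for t := start; ; t = t.Add(5 * time.Second) {
		l, err := tryAcquire(abandoned, "node4", ttl, t)
		if err == nil {
			fmt.Printf("%s acquired the lease after %s\n", l.holder, t.Sub(start))
			return
		}
		fmt.Printf("%s after start: %v\n", t.Sub(start), err)
	}
}
```

The point is that nothing actively releases the lease when its holder is killed; the only recovery path is waiting out the expiration, which is exactly what the stalled SQL connection experiences.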
I think the second option is the way to go, but I want to make sure it's not undermining the purpose of the test. @petermattis, could you weigh in on that? Are we trying to test everything from the server startup process onwards while under chaos? Also, for what it's worth, I only noticed this because I searched for …
Interesting. Is it really an inability to get the migrations lease that is causing these 60s stalls? If we disable that code, do the stalls go away? Why are we even trying to get the migrations lease? Looks to me like …
Of the mitigation options, the second sounds the best.
I'm also confused about why this is happening. Are there really any outstanding migrations? If so, do you see this only just after the cluster got booted up? Something seems fishy here.
I'm testing that right now. I suspect that they will unless there's another class of stalls.
It doesn't look like there is any protection against a fresh cluster running these SQL migrations, and I can certainly see all …
It's intentional. I've never liked it much, but we use migrations as a convenient place to run cluster bootstrapping code (i.e., code that must be run at least once when a cluster is first created). For example: cockroach/pkg/sqlmigrations/migrations.go Lines 95 to 99 in 3960772
(which I think Tobi added actually 😀)
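As a rough illustration of that idea (made-up types and names, not the actual migrationDescriptor entries at those lines), a migration list that doubles as run-once bootstrapping code looks something like this:

```go
package main

import "fmt"

// Illustrative only: a paraphrase of the idea that the migration list's early
// entries are really run-once cluster bootstrapping hooks, not the actual
// entries in pkg/sqlmigrations.
type migration struct {
	name   string
	workFn func() error
}

var migrations = []migration{
	{
		// A "permanent" bootstrapping step: work that must happen exactly once
		// when a brand-new cluster is created (hypothetical example).
		name:   "seed initial system entries",
		workFn: func() error { return nil },
	},
	// ... ordinary schema/version migrations follow.
}

func main() {
	// Even a freshly bootstrapped cluster walks this list under the migration
	// lease, which is why a brand-new cluster acquires the lease at startup.
	for _, m := range migrations {
		if err := m.workFn(); err != nil {
			panic(err)
		}
		fmt.Println("ran migration:", m.name)
	}
}
```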
Also, even without these "permanent" migrations, fresh clusters still tend to have migrations: cockroach/pkg/sql/sqlbase/system.go Lines 866 to 872 in 3960772
Somehow I thought you said that a lease was acquired every time a node started, but I just verified (and it seems you're all implying this) that it only happens once, at the beginning of a new cluster (assuming all the migrations get applied before something gets killed). I think it's fine if we just work around that, then. Should be enough to …
Yeah exactly, and in doing so I've only had a single stall (which was persistent) in the last 36 hours of stressing the test. I'll make that change, fix the proposal forwarding issues now that etcd-io/etcd#10167 is merged, and rebase the test on master to continue the hunt. It's possible we've already fixed this stall, and if not, it appears to be exceedingly rare.
I think it would be sufficient to verify all nodes can serve a …
I was doing it at the beginning of …
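For reference, the readiness gate being discussed could look roughly like this (a sketch that assumes one *sql.DB handle per node and a Postgres-wire driver; it is not the actual roachtest change):

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // assumed driver, not necessarily what the test uses
)

// waitForSQL blocks until every node answers a trivial query, so chaos only
// starts once all nodes are past startup work such as SQL migrations.
func waitForSQL(ctx context.Context, dbs []*sql.DB, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for i, db := range dbs {
		for {
			var one int
			err := db.QueryRowContext(ctx, `SELECT 1`).Scan(&one)
			if err == nil && one == 1 {
				fmt.Printf("node %d is serving SQL\n", i+1)
				break
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("node %d not serving SQL after %s: %v", i+1, timeout, err)
			}
			time.Sleep(time.Second)
		}
	}
	return nil
}

func main() {
	// Hypothetical wiring: one connection per node, then gate chaos on readiness.
	urls := []string{
		"postgresql://root@localhost:26257?sslmode=disable",
		"postgresql://root@localhost:26258?sslmode=disable",
	}
	var dbs []*sql.DB
	for _, u := range urls {
		db, err := sql.Open("postgres", u)
		if err != nil {
			panic(err)
		}
		defer db.Close()
		dbs = append(dbs, db)
	}
	if err := waitForSQL(context.Background(), dbs, 2*time.Minute); err != nil {
		panic(err)
	}
	// ...only now start the chaos agent that kills and restarts nodes.
}
```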
Informs cockroachdb#30064. This prevents us from running into issues with SQL migrations. Release note: None
30989: roachtest/bank: don't start chaos until nodes are serving SQL r=nvanbenschoten a=nvanbenschoten
Informs #30064.
Release note: None
Co-authored-by: Nathan VanBenschoten <[email protected]>
31408: storage: replace remote proposal tracking with uncommitted log size protection r=nvanbenschoten a=nvanbenschoten
Closes #30064.
This change reverts most of the non-testing code from 03b116f and f2f3fd2 and replaces it with use of the MaxUncommittedEntriesSize config. This configuration was added in etcd-io/etcd#10167 and provides protection against unbounded Raft log growth when a Raft group stops being able to commit entries. It makes proposals into Raft safer because proposers don't need to verify before the fact that the proposal isn't a duplicate that might be blowing up the size of the Raft group.
By default, the configuration is set to double the Replica's proposal quota. The logic here is that the quotaPool should be responsible for throttling proposals in all cases except for unbounded Raft re-proposals, because it queues efficiently instead of dropping proposals on the floor indiscriminately.
Release note (bug fix): Fix a bug where Raft proposals could get stuck if forwarded to a leader who could not itself append a new entry to its log.
This will be backported, but not to 2.1.0. The plan is to get it into 2.1.1.
Co-authored-by: Nathan VanBenschoten <[email protected]>
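To show the etcd raft knob from that PR in isolation, here is a standalone sketch using go.etcd.io/etcd/raft/v3 (the quota constant and the surrounding wiring are assumptions, not CockroachDB's integration; only MaxUncommittedEntriesSize is the configuration added in etcd-io/etcd#10167):

```go
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft/v3"
)

func main() {
	// Stand-in for a Replica's proposal quota (an assumed value).
	const proposalQuota = 1 << 20 // 1 MiB

	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   16 << 10,
		MaxInflightMsgs: 128,
		// Cap the total size of uncommitted log entries. Proposals that would
		// exceed the cap are rejected (raft.ErrProposalDropped) instead of
		// letting the log grow without bound when the group cannot commit.
		MaxUncommittedEntriesSize: 2 * proposalQuota,
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()
	fmt.Println("uncommitted-entry cap:", cfg.MaxUncommittedEntriesSize)
}
```

Setting the cap to roughly double the proposal quota mirrors the reasoning in the PR text: the quota pool throttles proposals in the common case, and the raft-side cap only kicks in for unbounded re-proposals.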
I'm not confident I actually had all of the fix from #31408 in place when I hit this stall, and after making sure it was, I haven't been able to trigger a stall again in more than 24 hours. I'm closing this for now but will keep stressing it for the next few days. We'll see if it ever comes up again.
4 days of continuous stress without a failure. I'm declaring this stable.
🎉 🎉 🎉
This was forgotten, cockroachdb#30064 is long fixed :fingers_crossed: Release note: None
33549: acceptance: re-enable bank/zerosum-restart r=petermattis a=tbg
This was forgotten, #30064 is long fixed :fingers_crossed:
Release note: None
Co-authored-by: Tobias Schottdorf <[email protected]>
The following tests appear to have failed:
#897838:
Please assign, take a look and update the issue accordingly.