Nomad confused about how many CSI plugins should be running #9371
Another oddity I've just noticed: the CLI still shows the EBS controller as 1/1 as above, and in the UI at However at
(Also it's a little weird that for EFS under the volumes screen it shows controller health as
Forgot to mention, I tried
Hi @tsarna! We fixed a few bugs around plugin counts in 0.12.2-0.12.4 (also ref #8948), so the challenge here will be to figure out whether this is a problem caused by upgrading from a buggy version or whether we haven't really fixed the plugin count issue.
That definitely should work, and as you note if the count was wrong before you deployed the second system job that would appear to rule that out. But thank you for that context.
This is what makes me think we might have a bug where a plugin that somehow doesn't get deregistered properly isn't getting its count decremented as we expect. It's been a little bit since I last looked at it, but my suspicion is that we're relying on increment/decrement operations rather than deriving the expected count from the job specification(s). Would you be willing to run a
Definitely.
Another thing that could help in addition to that raft data is the output of
I think I've found where the issue is. While trying to replicate the circumstances in a test, I found that concurrently updating allocations could trigger a bug in the way we've set up the transactions around plugins. A more minimal test case that could be dropped into the state store tests:

```go
func TestStateStore_Nested(t *testing.T) {
	s := testStateStore(t)

	// new plugin
	index := uint64(0)
	plugin := structs.NewCSIPlugin("foo", index)
	err := s.UpsertCSIPlugin(index, plugin)
	require.NoError(t, err)

	txn := s.db.WriteTxn(index)
	defer txn.Abort()

	inc := func(index uint64) {
		// This doesn't work! Reading through the state store opens its own
		// transaction, which only sees the last committed value, so every
		// increment starts from the same snapshot.
		plugin, _ := s.CSIPluginByID(nil, "foo")
		plugin = plugin.Copy()

		// This does! Reading through the open write transaction sees the
		// increments buffered by the previous calls.
		// raw, _ := txn.First("csi_plugins", "id_prefix", "foo")
		// plugin := raw.(*structs.CSIPlugin)

		plugin.NodesExpected += 1
		err := txn.Insert("csi_plugins", plugin)
		require.NoError(t, err)
	}

	index++
	inc(index)
	index++
	inc(index)
	index++
	inc(index)

	txn.Commit()

	plugin, _ = s.CSIPluginByID(nil, "foo")
	require.Equal(t, 3, plugin.NodesExpected)
}
```

We have a helper method
The PR is going to have a lot of small fiddly bits to check, but I'm going to try to land it before the long holiday weekend starts here in the US.
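For readers outside the Nomad codebase, here is a standalone sketch of the same pitfall using hashicorp/go-memdb directly; the Plugin type and the "plugins" table are made up for illustration and are not Nomad's actual schema. It shows the pattern hinted at by the commented-out lines above: when a write transaction does read-copy-write updates, the reads have to go through that same transaction, otherwise each update starts from the last committed snapshot and the increments collapse into one.

```go
package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// Plugin is a stand-in for the stored plugin row; only NodesExpected matters here.
type Plugin struct {
	ID            string
	NodesExpected int
}

func main() {
	// A single table with a unique "id" index, the minimum go-memdb requires.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"plugins": {
				Name: "plugins",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "ID"},
					},
				},
			},
		},
	}
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Seed the committed state with a plugin expecting zero nodes.
	seed := db.Txn(true)
	if err := seed.Insert("plugins", &Plugin{ID: "foo"}); err != nil {
		panic(err)
	}
	seed.Commit()

	// One write transaction performing several increments, the way a batch
	// of allocation updates might.
	txn := db.Txn(true)
	defer txn.Abort()

	for i := 0; i < 3; i++ {
		// Read through the SAME write transaction so the earlier, not yet
		// committed increments are visible. Reading via a separate read
		// transaction here would keep returning the committed value and
		// the three increments would collapse into one.
		raw, err := txn.First("plugins", "id", "foo")
		if err != nil {
			panic(err)
		}
		p := *(raw.(*Plugin)) // copy before modifying
		p.NodesExpected++
		if err := txn.Insert("plugins", &p); err != nil {
			panic(err)
		}
	}
	txn.Commit()

	read := db.Txn(false)
	raw, _ := read.First("plugins", "id", "foo")
	fmt.Println(raw.(*Plugin).NodesExpected) // prints 3
}
```

Run as written, this prints 3; switching the read inside the loop to a separate db.Txn(false) read transaction makes it print 1, which is the kind of drifted count described in this issue.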
Well, I didn't have a chance to look at this for several days. When I got back to it just now, I found that one of my client nodes had died, and the count of both EBS and EFS plugins for the remaining nodes was correct at 3/3.
Ok, that also tells me that an update to the plugin seems to correct the counts, so I think we need to make sure we do a reconciliation of those during the periodic system GC as well (if for no other reason than to clean up pre-1.0 plugins that will have hit the bug I've described above).
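To make the reconciliation idea concrete, here is a minimal, self-contained sketch; the types and the reconcilePluginCounts function are hypothetical, not the actual Nomad state store or GC code. The idea is to derive each plugin's expected node count from the job specifications that register it, rather than trusting incrementally maintained counters, and to overwrite any drifted value during the periodic GC pass.

```go
package main

import "fmt"

// PluginJob stands in for a system job whose task registers a CSI plugin.
type PluginJob struct {
	PluginID      string
	EligibleNodes int // nodes the system job currently places an allocation on
}

// Plugin stands in for the stored plugin row with its expected node count.
type Plugin struct {
	ID            string
	NodesExpected int
}

// reconcilePluginCounts recomputes NodesExpected from the job specs and fixes
// any plugin whose stored counter has drifted (e.g. from pre-fix versions).
func reconcilePluginCounts(plugins map[string]*Plugin, jobs []PluginJob) {
	expected := make(map[string]int)
	for _, j := range jobs {
		expected[j.PluginID] += j.EligibleNodes
	}
	for id, p := range plugins {
		if want := expected[id]; p.NodesExpected != want {
			p.NodesExpected = want // corrected during the periodic GC pass
		}
	}
}

func main() {
	plugins := map[string]*Plugin{
		// Drifted counter: 6 expected even though only 3 nodes run the plugin.
		"efs.csi.aws.com": {ID: "efs.csi.aws.com", NodesExpected: 6},
	}
	jobs := []PluginJob{
		{PluginID: "efs.csi.aws.com", EligibleNodes: 2}, // amd64 system job
		{PluginID: "efs.csi.aws.com", EligibleNodes: 1}, // arm64 system job
	}
	reconcilePluginCounts(plugins, jobs)
	fmt.Println(plugins["efs.csi.aws.com"].NodesExpected) // prints 3
}
```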
@tgross As for #9248: I was trying this with no other cluster activity running. I honestly do not think there was another concurrent update. The interesting thing is that it should be easily reproducible: #9248 (comment) -- did you see different results there?
From your comment:
There were multiple allocations running and then you stopped the job, right? That will make at least one update per node (when each Nomad client updates the server that it's stopping the allocation), and there's a very good chance those land concurrently. I haven't been able to reproduce without having multiple node plugin instances running.
Ah sorry, the way I read your comment I assumed you meant concurrent updates to other jobs. Yes, I have a node-only plugin running as a system job over all three clients in the cluster. Sorry for the confusion.
Closed by #9438, which will ship in Nomad 1.0.
For reporting security vulnerabilities please refer to the website.
If you have a question, prepend your issue with [question] or preferably use the nomad mailing list. If filing a bug please include the following:
Nomad version
Currently Nomad v0.12.7 ('0.12.7'). I think the problem started with 0.11.3, but I'm not certain.
Operating system and Environment details
Alpine 3.12.1, currently 3 x AWS t3a.small and 1 x t4g.small clients, 3 x t4g.micro servers.
t4g.small client: CSI plugin chengpan/aws-efs-csi-driver:latest (a test aarch64 build of amazon/aws-efs-csi-driver)
t3a.small clients: amazon/aws-efs-csi-driver:v1.0.0
The servers and amd64 clients started at 0.11.3 and were upgraded at some point.
Issue
Let me say at the start that I don't know whether this issue has had any practical impact at all or is just a cosmetic problem with the displayed count, but I'm reporting it in case you find the information useful. I did see one bit of flakiness which I'm not sure is related or not.
I'm doing some testing with the AWS EFS CSI drivers in a mixed-architecture environment. Somewhere along the way, as I added and removed clients, Nomad has become confused about how many plugins are supposed to be running, and at /ui/csi/plugins it says:
aws-efs | Node Only | Healthy (4/7) | efs.csi.aws.com
Similarly from the CLI:
One thing I'm doing that may be unusual is that I am running two separate system jobs for the plugins, since they use different containers but register the same plugin, because I want the volume to be accessible to jobs running on clients of either architecture. However, the count was already screwed up (3/6) before I added the arm64 node and plugin.
I've had one occasion where Nomad was unable to schedule a job using an EFS volume (exhausted its available writer claims, which is odd since the volume had access_mode = "multi-node-multi-writer"). Force-deregistering and re-registering the volume fixed it. I've had other volumes with no issues at all. I don't know whether this is in some way related to confused bookkeeping about the plugins or not.
Reproduction steps
It's not clear exactly what sequence of events led to the issue. I think the count was off before I upgraded to 0.12.7, but I'm not 100% sure now. I have added and removed several client nodes and the count always remains 3 greater than the number of clients.
Job file (if appropriate)
The plugin job files are essentially both the same as what I submitted in #9366, except that one references v1.0.0 of the official image and constrains ${attr.cpu.arch} to amd64, and the other references chengpan/aws-efs-csi-driver:latest and constrains ${attr.cpu.arch} to arm64.
Nomad Client logs (if appropriate)
I found nothing relevant in the logs, but I'm not exactly sure what I'd be looking for either, nor during exactly what timeframe.
Nomad Server logs (if appropriate)