Seems that IPs must be unassigned before being reassigned #36
Comments
Hey, sorry for taking so long to answer, and thanks for your PR! I'm not yet convinced that this is correct. Are you sure that it is a race condition? I would read the docs as saying it should be possible to update the IP in place. Un- and re-assigning has some troublesome edge cases (e.g. the unassign works but the re-assign fails: this would leave the IP unassigned and in the worst case send the controller into a crashloop in that state). I just checked https://docs.hetzner.cloud/#floating-ip-actions-assign-a-floating-ip-to-a-server .. it seems the 201 I'm checking on the reply is sent anyway. I will try to look into this in the next few days/weeks, but for your proposed change I see too many possible complications. Before that, I could try to bring in a relatively small change so that the JSON body of the answer is also printed when updating the IP (possibly at debug level). This should make debugging the error case easier.
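For reference, a minimal sketch of what that debug logging could look like against the documented assign endpoint. This uses plain net/http rather than the controller's actual HTTP plumbing, and the function name and log call are illustrative only:

```go
package fip

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

// assignFloatingIP posts to the documented assign action and dumps the raw
// JSON reply, so a non-201 case can be inspected instead of guessed at.
func assignFloatingIP(token string, fipID, serverID int) error {
	url := fmt.Sprintf("https://api.hetzner.cloud/v1/floating_ips/%d/actions/assign", fipID)
	body := bytes.NewBufferString(fmt.Sprintf(`{"server": %d}`, serverID))

	req, err := http.NewRequest(http.MethodPost, url, body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	reply, _ := io.ReadAll(resp.Body)
	// Debug-level in the real controller; plain log here for the sketch.
	log.Printf("assign reply (status %d): %s", resp.StatusCode, reply)

	if resp.StatusCode != http.StatusCreated { // the 201 check mentioned above
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}
```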
Hi @cbeneke, I'm not sure it's a race condition or what to call it, but during my testing with the Hetzner Cloud CLI the only way I managed to get the reassignment of IPs to different nodes to work consistently was by unassigning the IPs first. Otherwise, if the IPs are simply "moved" from one node to another, sometimes something stops working for some reason and the servers are no longer reachable at the floating IPs even though they appear to be assigned correctly. I saw this with the fip controller too: reassignment works, but not always. Hetzner support said that they would pass this to the engineers, so it didn't seem like they were denying that this might be an issue. Thanks in advance if you can take a look!
Interesting... I have repeated exactly the same testing now with another test cluster/project etc., and cannot reproduce the issue 🤔 I wonder if Hetzner Cloud was having some issues when I was testing before? Anyway, I can't reproduce it now, so I'll close this.
@cbeneke Now that I think of it, the only difference between this cluster and the previous one is that in this cluster I have installed the cloud controller manager and have installed the fip controller with the node address type set to 'external'. But I guess that shouldn't make any difference that would explain the issue I was having? Thanks
Hmm.. the IP is only used to identify the node between Kubernetes and the Hetzner API. For the assignment itself it uses the (discovered) server ID.
I am using the same cloud-init config to set up all the floating IPs in the project on all the servers, so it's unlikely. Anyway, I wasn't able to reproduce it again, so... we're both not sure :D
Hi @cbeneke, I still believe this is an issue, and today HC support confirmed to me that there is a bug because of which floating IPs should be unassigned before being reassigned, to ensure the assignment works properly. They are aware of it. Yesterday I created a new cluster in HC (after having tried Scaleway's managed Kubernetes for a little while) and installed the fip controller with the latest Helm chart. Today one of the fip controller's pods was restarted a few times with an error. Would it be possible to make a change in the controller to unassign the IP before reassigning it? Also, do the controller pods periodically query the HC API? One pod was restarted because of these failed API requests. Is this supposed to happen? Thanks a lot in advance!
Hi @vitobotta, thanks for re-opening the issue. Could you maybe link the issue? Since they seem to be aware of it, I think it's feasible to implement a toggle which ensures un-assignment before re-assignment (I'm still not convinced this should be the default, since I have never seen that problem in any of my clusters). Would that work for you? :)
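A rough sketch of what such a toggle could look like on top of the hcloud-go client. The flag name and the surrounding wiring are made up for illustration; this is not a patch against the controller's actual code:

```go
package fip

import (
	"context"
	"fmt"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

// reassign moves a floating IP to the given server. With unassignFirst set,
// it detaches the IP before attaching it, as a workaround for the reported
// Hetzner-side bug. The flag name is hypothetical.
func reassign(ctx context.Context, client *hcloud.Client, fip *hcloud.FloatingIP, server *hcloud.Server, unassignFirst bool) error {
	if unassignFirst && fip.Server != nil {
		if _, _, err := client.FloatingIP.Unassign(ctx, fip); err != nil {
			return fmt.Errorf("unassign floating IP %d: %w", fip.ID, err)
		}
		// Note the edge case from above: if the Assign below now fails,
		// the IP is left unassigned.
	}
	if _, _, err := client.FloatingIP.Assign(ctx, fip, server); err != nil {
		return fmt.Errorf("assign floating IP %d to server %d: %w", fip.ID, server.ID, err)
	}
	return nil
}
```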
Regarding your second question: yes, the pods are querying the Hetzner API regularly (e.g. to get the list of servers). The whole project is written in a fail-fast manner, meaning any non-recoverable error should result in the pod dying (a single timeout is not per se non-recoverable, but a sequence of timeouts would be). It could be feasible to handle timeouts differently or have some form of failure backoff with a retry, but that should be another ticket^^
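Something along these lines could sit between the API calls and the fail-fast exit; this is a generic sketch, not the controller's actual code:

```go
package fip

import (
	"context"
	"time"
)

// withBackoff retries op with exponential backoff, so that a short run of
// timeouts does not immediately kill the pod, while persistent failure
// still surfaces as an error in the fail-fast spirit.
func withBackoff(ctx context.Context, attempts int, base time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		select {
		case <-time.After(base << uint(i)): // 1x, 2x, 4x, ... the base delay
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```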
Hi @cbeneke - what do you mean by "link the issue"? What they told me in a ticket is:
It's weird that you haven't hit this yet, perhaps you are lucky :D
Ah, unfortunate. I had hoped they had a public bug-tracker link for this which they had sent to you.
I just released v0.4.1, which supports retry with backoff.
Hi!
There seems to be a race condition or something like that when the floating IPs are reassigned to a different node. Sometimes e.g. ping keeps working, but sometimes the ping stops working even though the IPs have been reassigned. During a lot of testing I found that it's safer to unassign the IPs from the current nodes first, and then assign them to the other nodes. This way it works consistently.
Is it possible to implement this change? I am not familiar enough with Go to do it myself. :)
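Since "appears assigned" and "actually works" seemed to diverge in my tests, here is roughly what a sanity check after reassignment could look like with the hcloud-go client. The function name and flow are just a sketch of the idea, not an actual patch:

```go
package fip

import (
	"context"
	"fmt"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

// verifyAssignment re-reads the floating IP and checks which server the API
// now reports it on, catching the case where an assignment silently did not
// take effect.
func verifyAssignment(ctx context.Context, client *hcloud.Client, fipID, wantServerID int) error {
	fip, _, err := client.FloatingIP.GetByID(ctx, fipID)
	if err != nil {
		return err
	}
	if fip == nil {
		return fmt.Errorf("floating IP %d not found", fipID)
	}
	if fip.Server == nil || fip.Server.ID != wantServerID {
		return fmt.Errorf("floating IP %d is not assigned to server %d", fipID, wantServerID)
	}
	return nil
}
```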