
Seems that IPs must be unassigned before being reassigned #36

Open
vitobotta opened this issue Apr 9, 2020 · 12 comments

Comments

@vitobotta

Hi!

There seems to be a race condition or something similar when the floating IPs are reassigned to a different node. Sometimes e.g. ping keeps working, but sometimes it stops working even though the IPs have been reassigned. During a lot of testing I found that it's safer to unassign the IPs from the current nodes first, and then assign them to the other nodes. That way it works consistently.

Is it possible to implement this change? I am not familiar enough with Go to do it myself. :)
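For illustration, the unassign-then-assign workaround could be sketched in Go roughly like this. The `fipClient` interface and `fakeClient` are stand-ins for the real Hetzner Cloud client so the sketch runs without network access, and `moveFloatingIP` is a hypothetical helper, not the controller's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// fipClient abstracts the two floating-IP calls the workaround needs.
// This interface is a stand-in for the real Hetzner Cloud client so
// the sketch can run without network access.
type fipClient interface {
	Unassign(fipID int) error
	Assign(fipID, serverID int) error
}

// moveFloatingIP performs the unassign-before-assign workaround.
// Note the edge case discussed in this issue: if Unassign succeeds
// but Assign fails, the IP is left unassigned, so the error is
// surfaced for the caller to retry or alert on.
func moveFloatingIP(c fipClient, fipID, targetServerID int) error {
	if err := c.Unassign(fipID); err != nil {
		return fmt.Errorf("unassign floating IP %d: %w", fipID, err)
	}
	if err := c.Assign(fipID, targetServerID); err != nil {
		// The IP is now unassigned; the caller must handle this.
		return fmt.Errorf("assign floating IP %d to server %d: %w", fipID, targetServerID, err)
	}
	return nil
}

// fakeClient records calls so the sketch can be exercised locally.
type fakeClient struct {
	calls      []string
	failAssign bool
}

func (f *fakeClient) Unassign(fipID int) error {
	f.calls = append(f.calls, "unassign")
	return nil
}

func (f *fakeClient) Assign(fipID, serverID int) error {
	f.calls = append(f.calls, "assign")
	if f.failAssign {
		return errors.New("api timeout")
	}
	return nil
}

func main() {
	c := &fakeClient{}
	if err := moveFloatingIP(c, 42, 7); err != nil {
		panic(err)
	}
	fmt.Println(c.calls) // unassign happens strictly before assign
}
```

The helper deliberately does not try to roll back a failed assign, since re-assigning to the old node could fail for the same reason; surfacing the error keeps the failure visible.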

@cbeneke
Owner

cbeneke commented Apr 13, 2020

Hey, sorry for the long time to answer and thanks for your PR!

I'm not yet convinced that this is correct. Are you sure that it is a race condition? I read the docs as saying it should be possible to update the IP in place. Un- and re-assigning has some troublesome edge cases (e.g. the unassign works but the re-assign fails: this would leave the IP unassigned and, in the worst case, send the controller into a crash loop in that state).

I just checked https://docs.hetzner.cloud/#floating-ip-actions-assign-a-floating-ip-to-a-server .. it seems the 201 I'm checking for in the reply is sent anyhow. It might be that the status field of the response contains an error in the race-condition cases you have observed on your account (in my case this has actually never happened).

I will try to look into this in the next few days/weeks. But your proposed change has too many possible complications for my taste. Before that, I could try to bring in a relatively small change so that the JSON body of the response is also printed when updating the IP (possibly at debug level). This should make debugging the error case easier.

@vitobotta
Author

Hi @cbeneke

Not sure it's a race condition or what to call it, but during my testing with the Hetzner Cloud CLI the only way I managed to get the reassignment of IPs to different nodes to work consistently was by unassigning the IPs first. Otherwise, if the IPs are simply "moved" from one node to another, sometimes something stops working for some reason and the servers are no longer reachable at the floating IPs even though they appear to be assigned correctly. I saw this with the fip controller too. Reassignment works, but not always.

Hetzner support said that they would pass this on to the engineers, so it didn't seem like they were denying that this might be an issue.

Thanks in advance if you can take a look!

@vitobotta
Author

Interesting... I have repeated exactly the same testing now with another test cluster/project etc., and cannot reproduce the issue 🤔 I wonder if Hetzner Cloud was having some issues when I was testing before? Anyway, I can't reproduce it now, so I'll close this.

@vitobotta
Author

@cbeneke Now that I think of it, the only difference between this cluster and the previous one is that in this cluster I have installed the cloud controller manager, and I have installed the fip controller with the node address type set to 'external'. But I guess that shouldn't make any difference that would explain the issue I was having? Thanks

@cbeneke
Owner

cbeneke commented Apr 14, 2020

Hmm.. the IP is only used to identify the node between Kubernetes and the Hetzner API. For the assignment itself it uses the (discovered) server ID.
Thinking about it: could it be that one of the servers had the IP wrongly configured, so that - when assigning the IP to that specific server - the issue you described occurred? I can't really think of anything else that might trigger the behaviour.

@vitobotta
Author

I am using the same cloud-init config to set up all the floating IPs in the project on all the servers, so that's unlikely. Anyway, I wasn't able to reproduce it again, so... neither of us is sure :D
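For context, binding a floating IP on a server usually just means adding it as an extra /32 address on the public interface. A minimal netplan fragment, as cloud-init might write it, could look like this (the interface name and IP address are placeholders; the actual config used here is not shown in the thread):

```yaml
# /etc/netplan/60-floating-ip.yaml -- illustrative only; the interface
# name (eth0) and floating IP (203.0.113.10) are placeholders
network:
  version: 2
  ethernets:
    eth0:
      addresses:
        - 203.0.113.10/32   # floating IP, in addition to the server's main IP
```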

@cbeneke cbeneke added the noop Problem originating from different system label Apr 28, 2020
@vitobotta
Author

Hi @cbeneke

I still believe this is an issue, and today HC support confirmed to me that there is a bug because of which floating IPs should be unassigned before being reassigned, to ensure the assignment works properly. They are aware of it.

Yesterday I created a new cluster in HC (after having tried Scaleway's managed Kubernetes for a little while) and installed the fip controller with the latest helm chart.

Today one of the fip controller's pods was restarted a few times with the error "could not fetch servers", due to a timeout when querying the HC API. As a result, the IP was reassigned to another node by another pod. The problem is that after the IP was reassigned to the other node it became unreachable. This has happened twice today, causing 25 minutes of downtime for me.

Would it be possible to make a change in the controller to unassign the IP before reassigning it?

Also, do the controller pods periodically query the HC API? One pod was restarted because of these failed API requests. Is this supposed to happen?

Thanks a lot in advance!

@vitobotta vitobotta reopened this Jun 17, 2020
@cbeneke
Owner

cbeneke commented Jun 17, 2020

Hi @vitobotta

thanks for re-opening the issue. Could you maybe link the issue? Since they seem to be aware of it, I think it's feasible to implement a toggle which ensures un-assignment before re-assignment (I'm still not convinced this should be the default, since I have never seen this problem in any of my clusters). Would that work for you? :)

@cbeneke
Owner

cbeneke commented Jun 17, 2020

Regarding your second question: yes, the pods query the Hetzner API regularly (e.g. to get a list of servers etc.). The whole project is written in a fail-fast manner, meaning any non-recoverable error should result in the pod dying (which a timeout is not per se, but a sequence of timeouts would be). It could be feasible to handle timeouts differently or to add some form of failure backoff with retries, but that should be another ticket^^

@vitobotta
Author

Hi @cbeneke - what do you mean by "link the issue"?

What they told me in a ticket is:

"there was an issue with the attach/detach of floating IP. We are already aware of this bug and are working on it. The best way to workaround that is to detach the floating IP and attach it in order to move it."

It's weird that you haven't hit this yet, perhaps you are lucky :D

@cbeneke
Owner

cbeneke commented Jun 17, 2020

Ah, unfortunate. I had hoped they had a public bug-tracker link for this which they'd sent to you.

@phiphi282
Collaborator

I just released v0.4.1 that supports retry with backoff.
Can you test if that maybe solves this issue for you too? :)

@cbeneke cbeneke removed the noop Problem originating from different system label Jan 30, 2021