
Seems that IPs must be unassigned before being reassigned #36

Open
vitobotta opened this issue Apr 9, 2020 · 12 comments

Comments

@vitobotta

Hi!

There seems to be a race condition or something similar when the floating IPs are reassigned to a different node. Sometimes e.g. ping keeps working, but sometimes it stops working even though the IPs have been reassigned. During a lot of testing I found that it's safer to unassign the IPs from the current nodes first, and then assign them to the other nodes. That way it works consistently.

Is it possible to implement this change? I am not familiar enough with Go to do it myself. :)
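For illustration, the unassign-then-assign workaround could be sketched in Go roughly like this. The `fipClient` interface and `fakeClient` are stand-ins for the real Hetzner Cloud client so the sketch runs without network access, and `moveFloatingIP` is a hypothetical helper, not the controller's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// fipClient abstracts the two floating-IP calls the workaround needs.
// This interface is a stand-in for the real Hetzner Cloud client so
// the sketch can run without network access.
type fipClient interface {
	Unassign(fipID int) error
	Assign(fipID, serverID int) error
}

// moveFloatingIP performs the unassign-before-assign workaround.
// Note the edge case discussed in this issue: if Unassign succeeds
// but Assign fails, the IP is left unassigned, so the error is
// surfaced for the caller to retry or alert on.
func moveFloatingIP(c fipClient, fipID, targetServerID int) error {
	if err := c.Unassign(fipID); err != nil {
		return fmt.Errorf("unassign floating IP %d: %w", fipID, err)
	}
	if err := c.Assign(fipID, targetServerID); err != nil {
		// The IP is now unassigned; the caller must handle this.
		return fmt.Errorf("assign floating IP %d to server %d: %w", fipID, targetServerID, err)
	}
	return nil
}

// fakeClient records calls so the sketch can be exercised locally.
type fakeClient struct {
	calls      []string
	failAssign bool
}

func (f *fakeClient) Unassign(fipID int) error {
	f.calls = append(f.calls, "unassign")
	return nil
}

func (f *fakeClient) Assign(fipID, serverID int) error {
	f.calls = append(f.calls, "assign")
	if f.failAssign {
		return errors.New("api timeout")
	}
	return nil
}

func main() {
	c := &fakeClient{}
	if err := moveFloatingIP(c, 42, 7); err != nil {
		panic(err)
	}
	fmt.Println(c.calls) // unassign happens strictly before assign
}
```

The helper deliberately does not try to roll back a failed assign, since re-assigning to the old node could fail for the same reason; surfacing the error keeps the failure visible.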

@cbeneke
Owner

cbeneke commented Apr 13, 2020

Hey, sorry for the long time to answer and thanks for your PR!

I'm not yet convinced that this is correct. Are you sure that it is a race condition? I read the docs as saying it should be possible to update the IP in place. Un- and re-assigning has some troublesome edge cases (e.g. the unassign works but the re-assign fails: this would leave the IP unassigned and, in the worst case, send the controller into a crash loop in that state).

I just checked https://docs.hetzner.cloud/#floating-ip-actions-assign-a-floating-ip-to-a-server .. it seems the 201 I'm checking for in the reply is sent anyhow. It might be that the status field of the response contains an error in the race-condition cases you have observed on your account (in my case this has actually never happened).

I will try to look into this in the next few days/weeks. But your proposed change has too many possible complications for my taste. Before that, I could try to bring in a relatively small change so that the JSON body of the response is also printed when updating the IP (possibly at debug level). This should make debugging the error case easier.

@vitobotta
Author

Hi @cbeneke

Not sure it's a race condition or what to call it, but during my testing with the Hetzner Cloud CLI the only way I managed to get the reassignment of IPs to different nodes to work consistently was by unassigning the IPs first. Otherwise, if the IPs are simply "moved" from one node to another, sometimes something stops working for some reason and the servers are no longer reachable at the floating IPs even though they appear to be assigned correctly. I saw this with the fip controller too. Reassignment works, but not always.

Hetzner support said that they would pass this on to the engineers, so it didn't seem like they were denying that this might be an issue.

Thanks in advance if you can take a look!

@vitobotta
Author

Interesting... I have repeated exactly the same testing now with another test cluster/project etc., and cannot reproduce the issue 🤔 I wonder if Hetzner Cloud was having some issues when I was testing before? Anyway, I can't reproduce it now, so I'll close this.

@vitobotta
Author

@cbeneke Now that I think of it, the only difference between this cluster and the previous one is that in this cluster I have installed the cloud controller manager, and I have installed the fip controller with the node address type set to 'external'. But I guess that shouldn't make any difference that would explain the issue I was having? Thanks

@cbeneke
Owner

cbeneke commented Apr 14, 2020

Hmm.. the IP is only used to identify the node between Kubernetes and the Hetzner API. For the assignment itself it uses the (discovered) server ID.
Thinking about it: could it be that one of the servers had the IP wrongly configured, so that - when assigning the IP to that specific server - the issue you described occurred? I can't really think of anything else that might trigger the behaviour.

@vitobotta
Author

I am using the same cloud-init config to set up all the floating IPs in the project on all the servers, so that's unlikely. Anyway, I wasn't able to reproduce it again, so... neither of us is sure :D
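For context, binding a floating IP on a server usually just means adding it as an extra /32 address on the public interface. A minimal netplan fragment, as cloud-init might write it, could look like this (the interface name and IP address are placeholders; the actual config used here is not shown in the thread):

```yaml
# /etc/netplan/60-floating-ip.yaml -- illustrative only; the interface
# name (eth0) and floating IP (203.0.113.10) are placeholders
network:
  version: 2
  ethernets:
    eth0:
      addresses:
        - 203.0.113.10/32   # floating IP, in addition to the server's main IP
```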

@cbeneke cbeneke added the noop Problem originating from different system label Apr 28, 2020
@vitobotta
Author

Hi @cbeneke

I still believe this is an issue, and today HC support confirmed to me that there is a bug because of which floating IPs should be unassigned before being reassigned, to ensure the assignment works properly. They are aware of it.

Yesterday I created a new cluster in HC (after having tried Scaleway's managed Kubernetes for a little while) and installed the fip controller with the latest helm chart.

Today one of the fip controller's pods was restarted a few times with the error "could not fetch servers", due to a timeout when querying the HC API. As a result, the IP was reassigned to another node by another pod. The problem is that after the IP was reassigned to the other node it became unreachable. This has happened twice today, causing 25 minutes of downtime for me.

Would it be possible to make a change in the controller to unassign the IP before reassigning it?

Also, do the controller pods periodically query the HC API? One pod was restarted because of these failed API requests. Is this supposed to happen?

Thanks a lot in advance!

@vitobotta vitobotta reopened this Jun 17, 2020
@cbeneke
Owner

cbeneke commented Jun 17, 2020

Hi @vitobotta

thanks for re-opening the issue. Could you maybe link the issue? Since they seem to be aware of it, I think it's feasible to implement a toggle which ensures un-assignment before re-assignment (I'm still not convinced this should be the default, since I have never seen this problem in any of my clusters). Would that work for you? :)

@cbeneke
Owner

cbeneke commented Jun 17, 2020

Regarding your second question: yes, the pods query the Hetzner API regularly (e.g. to get a list of servers etc.). The whole project is written in a fail-fast manner, meaning any non-recoverable error should result in the pod dying (which a timeout is not per se, but a sequence of timeouts would be). It could be feasible to handle timeouts differently or to add some form of failure backoff with retries, but that should be another ticket^^

@vitobotta
Author

Hi @cbeneke - what do you mean by "link the issue"?

What they told me in a ticket is:

"there was an issue with the attach/detach of floating IP. We are already aware of this bug and are working on it. The best way to workaround that is to detach the floating IP and attach it in order to move it."

It's weird that you haven't hit this yet, perhaps you are lucky :D

@cbeneke
Owner

cbeneke commented Jun 17, 2020

Ah, unfortunate. I had hoped they had a public bug-tracker link for this which they'd sent to you.

@phiphi282
Collaborator

I just released v0.4.1 that supports retry with backoff.
Can you test if that maybe solves this issue for you too? :)

@cbeneke cbeneke removed the noop Problem originating from different system label Jan 30, 2021