Race condition causing unreleased lock and too many HAProxy processes #267
I'm definitely not seeing this in my own tests. Are you using the provided Docker image? Do you have long-lived connections? The recommendation from the HAProxy docs is to not use a value of `nbproc` greater than 1.
I've never tried to use the multiprocess feature of HAProxy, so I can't comment on how well it works.
I wonder if this is the same issue I've seen on the Bamboo project:
The fix there was to upgrade Go, which isn't used in our case. @brndnmtthws As I wrote, the problem also reproduces without using `nbproc`. The scenario in my head is as follows:
HAProxy handling the 1st reload spawns new processes (still leaving the old ones - X). While this is still in progress, the 2nd reload applies, and in https://github.com/mesosphere/marathon-lb/blob/master/service/haproxy/run#L41
This is pretty strange, because MLB shouldn't reload HAProxy in parallel. There's a check in MLB which blocks to make sure the reload actually occurs: https://github.com/mesosphere/marathon-lb/blob/master/marathon_lb.py#L599-L603 And the thread which executes the reloads cannot run in parallel. The code above, however, doesn't check which PIDs have changed, only if there's any change at all. That could be made a little more robust (i.e., check that a new process was spawned).
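A rough sketch of that idea (not marathon-lb's actual code; the pidfile path and timeout are assumptions for illustration): capture the PID set before triggering the reload, then wait until the set difference against the current pidfile is non-empty, i.e. at least one genuinely new process has appeared.

```python
import time

PIDFILE = "/var/run/haproxy.pid"  # assumed path, for illustration only

def read_pids(path=PIDFILE):
    """Return the set of PIDs currently listed in the HAProxy pidfile."""
    try:
        with open(path) as f:
            return {int(tok) for tok in f.read().split()}
    except (OSError, ValueError):
        return set()

def wait_for_new_pid(pids_before, timeout=10.0, interval=0.1):
    """Block until the pidfile contains a PID that wasn't there before the
    reload, i.e. new_pids - old_pids is non-empty (the 'set math' check)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        spawned = read_pids() - pids_before
        if spawned:
            return spawned
        time.sleep(interval)
    raise RuntimeError("reload did not spawn a new HAProxy process in time")
```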
Yeah, I noticed it, and as you mentioned it seems it could be improved a bit (theoretically it can still be vulnerable to my scenario - there are also 2 calls to
Not sure if we should check only that new processes have been added, but maybe also that old ones are gone, to avoid my scenario? It might be that
Checking that old processes are gone doesn't work because there's an assumption that old processes may stick around forever after a reload in the event of long-lived connections.
We might test on our side if this is it by using
I guess it would be good to know first if HAProxy also reports in the PIDFILE the processes that are still waiting for connections to be closed:
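One way to answer that question (a hedged sketch, assuming the default pidfile location): compare the pidfile contents against the output of `pidof haproxy`. Old instances that are still draining connections but no longer listed in the pidfile would show up in the first set difference.

```python
import subprocess

PIDFILE = "/var/run/haproxy.pid"  # assumed path

def pidfile_pids():
    """PIDs HAProxy wrote to its pidfile."""
    with open(PIDFILE) as f:
        return {int(tok) for tok in f.read().split()}

def running_pids():
    """All live haproxy processes, regardless of the pidfile."""
    out = subprocess.check_output(["pidof", "haproxy"], text=True)
    return {int(tok) for tok in out.split()}

if __name__ == "__main__":
    in_file, alive = pidfile_pids(), running_pids()
    # Instances a pidfile-based reload would never signal:
    print("running but not in pidfile:", sorted(alive - in_file))
    print("in pidfile but not running:", sorted(in_file - alive))
```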
When doing a graceful reload, use `pidof haproxy` rather than catting the pidfile to make sure that we send SIGUSR1 to _all_ existing HAProxy instances. This is to address #267.
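Roughly, the idea of that change expressed in Python (a sketch only; the actual run script is a shell script): enumerate every live HAProxy process via `pidof` instead of trusting the pidfile, and send the graceful-stop signal to all of them.

```python
import os
import signal
import subprocess

def all_haproxy_pids():
    """Every live haproxy PID, as reported by `pidof haproxy`."""
    try:
        out = subprocess.check_output(["pidof", "haproxy"], text=True)
    except subprocess.CalledProcessError:
        return set()  # pidof exits non-zero when nothing matches
    return {int(tok) for tok in out.split()}

def soft_stop_all():
    """Ask every running HAProxy instance to finish gracefully."""
    for pid in all_haproxy_pids():
        os.kill(pid, signal.SIGUSR1)  # SIGUSR1 triggers HAProxy's soft stop
```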
When reloading, make sure there's actually a new PID by doing some basic set math. To address #267.
Go ahead and test out the
I don't believe this will help; the issue appears to be an issue with HAProxy itself. Talk to @lloesche for context.
This is also affecting
Is this fixed in HAProxy
Sorry, the issue seems to reproduce after all :(
You can try adding the following info.
Hi,
Initially I asked on the Marathon mailing list, but I've since gathered more data, and since I've discovered that https://github.com/mesosphere/marathon-lb/blob/master/service/haproxy/run isn't provided by HAProxy but by Marathon-lb, the issue is probably caused by the latter.
It looks like the lock isn't released and we have some hanging calls to flock:
I'm running:
Initially we suspected it's related to the `nbproc` feature of HAProxy that we're using, but even without it we've reached a state where we had more than one HAProxy process (right after changing configuration, killing old containers and then Marathon starting new ones). It doesn't always reproduce though, so there is probably some kind of race condition.

List of HAProxy processes (there are 96 but there should be 48; we also had a box with almost 500 HAProxy processes) - https://gist.github.com/mlowicki/1b3672db46c2dcd26f1db7bba004ad99.
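A minimal sketch (Python, lock path assumed for illustration) of the flock pattern involved: a plain blocking LOCK_EX will wait forever if the holder never releases it, which matches hanging flock calls like the ones described above; a bounded, non-blocking retry would surface the stuck lock instead of silently piling up reload processes.

```python
import errno
import fcntl
import time

LOCKFILE = "/var/run/haproxy-reload.lock"  # assumed path, for illustration

def acquire_reload_lock(timeout=30.0, interval=0.5):
    """Try to take the reload lock without blocking forever.

    Returns the open file object; the lock is held as long as it stays open.
    Raises RuntimeError if the lock is still held after `timeout` seconds,
    instead of hanging the way a plain blocking LOCK_EX would."""
    fd = open(LOCKFILE, "w")
    deadline = time.time() + timeout
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as e:
            if e.errno not in (errno.EAGAIN, errno.EACCES):
                fd.close()
                raise
            if time.time() > deadline:
                fd.close()
                raise RuntimeError("reload lock still held after %.0fs" % timeout)
            time.sleep(interval)
```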