goroutine leaks when making nomad api calls #6620
Thanks for reporting this, @tantra35! I'm going to take a look through the logs you've provided and see if I can replicate this.
@tgross It's very hard to reproduce; for example, we can't simulate this on a test stand, only on the production system :-(
Understood... but we'll figure it out! 😀 Just so that I understand the scenario you have here: you're hitting the allocation stats endpoint to collect telemetry?
@tgross Yes.
@tantra35 I took a first pass through the logs you provided. The blocked yamux goroutines in those logs look related to #4604, so I'm going to assign myself to that issue as well (and I apologize that it was left to languish after you provided so much good info!). But in the meantime, do you also see lots of open sockets when this goroutine leak is happening?
@tgross Yes, the situation is the same.
In #4604 you reported that these were mostly sockets stuck in one state. I'm trying to see if I can reproduce the behavior but haven't yet done so successfully. Right now that environment looks like the following:

```bash
#!/bin/bash
while true
do
  curl -so /dev/null "$NOMAD_ADDR/v1/client/allocation/$ALLOC_ID/stats"
  echo -n "."
  sleep 1
done
```

On the client I'm watching for leaked goroutines and open sockets. I also think that the suggestion you had in the earlier issue to add the RPC timeout is probably right (but maybe not enough, as you noted), but I want to have a reproduction before I go changing things in Nomad. There's a deadline configuration in yamux we're not using.
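For context, a yamux stream satisfies net.Conn, so a per-stream deadline like the one discussed here can be applied around each RPC. The sketch below is illustrative only, not Nomad's actual RPC code; the `dial` and `doRPC` helpers are hypothetical.

```go
// Hypothetical sketch of bounding a yamux-based RPC with a deadline.
// dial and doRPC are illustrative helpers, not Nomad internals.
package yamuxsketch

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func dial(addr string) (*yamux.Session, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return nil, err
	}
	// yamux.Client multiplexes many logical streams over the one TCP connection.
	return yamux.Client(conn, yamux.DefaultConfig())
}

func doRPC(session *yamux.Session, request []byte) ([]byte, error) {
	stream, err := session.Open() // returns a net.Conn backed by a yamux stream
	if err != nil {
		return nil, err
	}
	defer stream.Close()

	// Bound both writing the request and reading the response. Blocking
	// queries would need a much larger value (up to the 5-minute maximum).
	if err := stream.SetDeadline(time.Now().Add(10 * time.Second)); err != nil {
		return nil, err
	}
	if _, err := stream.Write(request); err != nil {
		return nil, err
	}
	buf := make([]byte, 4096)
	n, err := stream.Read(buf)
	if err != nil {
		return nil, err
	}
	return buf[:n], nil
}
```

As the discussion below makes clear, a deadline like this only bounds the read/write side of a stream, not the stream-open phase.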
One of the complicating factors for my tests yesterday was that if the Nomad client loses connectivity with a server, it'll establish a connection with another server. I've corrected this with a new test setup where I'm running the following script on the client:

```bash
#!/bin/bash
# must be run as root
SERVER_IPS=($(curl -s localhost:4646/v1/status/peers | jq -r '.[] | split(":") | .[0]'))

echo "dropping traffic from ${SERVER_IPS[0]} ${SERVER_IPS[1]}"
echo "shaping traffic from ${SERVER_IPS[2]}"
iptables -I INPUT -s ${SERVER_IPS[0]} -j DROP
iptables -I INPUT -s ${SERVER_IPS[1]} -j DROP

# create 3 queue-discipline classes
tc qdisc add dev eth0 root handle 1: prio

# configure these classes, with the low-priority band being set to drop 25% of
# packets with 25% correlation. this should be enough to cause TCP connections
# that we send thru that queue to break and force the Nomad client to try to
# reconnect
tc qdisc add dev eth0 parent 1:1 handle 10: sfq
tc qdisc add dev eth0 parent 1:2 handle 20: sfq
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 25% 25%

# filter traffic from server 3
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
    match ip dst ${SERVER_IPS[2]}/32 flowid 1:3
```

After verifying that we're seeing roughly 25% packet loss from the last server and that this is the only server the client can connect to, I set my stats-polling loop running against the allocation stats endpoint. I've even ramped up to 50% and 75% packet loss to see if that could recreate the situation, but at higher packet loss levels the Nomad servers just mark the node as down and we get "no path to node". I eventually start getting some lingering sockets. I'm going to leave this to soak for a couple of hours to see if anything changes, but at this point I think my hypothesis that this has to do with simple network degradation is incorrect -- or at least more complex than that. @tantra35 it might help if you can answer a few questions:
@tgross We run inside the AWS cloud and don't see any network problems most of the time. We use Ubuntu 16.04 LTS as our main distro, with minimal tuning of network parameters.

Also, I think the real problem is that Nomad doesn't use any timeouts in yamux (we have a patched version where we implement them; it's not ideal, because Nomad has a blocking API and the timeouts sometimes have to be very large: 5 minutes, the maximum blocking time). But yamux doesn't allow setting a timeout on the connection phase, only on read/write (on the patched version we observe that goroutines begin to leak in the yamux connection phase, OpenStream). I can demonstrate this with the following goroutine list and some stack traces: OpenStreamHung.txt. I got these on Nomad version 0.8.6 (but didn't find time to post a bug report, and the patched version was also a deterrent).
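For illustration, one way to bound the OpenStream phase from the caller's side, given a yamux version that only exposes read/write deadlines, is to race the call against a timer. This is a rough sketch, not the patch referenced above; `openStreamWithTimeout` is a hypothetical helper, and the blocked inner goroutine is only detached from the caller, not actually unblocked.

```go
// Hypothetical helper: races OpenStream against a timer. Note that the inner
// goroutine can still block inside OpenStream; this only unblocks the caller,
// so it hides rather than fixes the underlying leak.
package yamuxsketch

import (
	"fmt"
	"time"

	"github.com/hashicorp/yamux"
)

func openStreamWithTimeout(session *yamux.Session, timeout time.Duration) (*yamux.Stream, error) {
	type result struct {
		stream *yamux.Stream
		err    error
	}
	ch := make(chan result, 1)

	go func() {
		s, err := session.OpenStream()
		ch <- result{s, err}
	}()

	select {
	case r := <-ch:
		return r.stream, r.err
	case <-time.After(timeout):
		// If OpenStream eventually succeeds, close the orphaned stream so it
		// doesn't linger on the session.
		go func() {
			if r := <-ch; r.stream != nil {
				r.stream.Close()
			}
		}()
		return nil, fmt.Errorf("timed out opening yamux stream after %s", timeout)
	}
}
```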
I agree that the lack of timeouts is one issue. The blocked goroutines should all continue once we have some timeouts in place, but I want to have a clear mechanism for the bug so that we're not just pushing the problem to some other place in the stack. When you're talking about a patched version, are you referring to the patch from #4604, or is there some other patch you're using here?

`sc.stream.SetDeadline(time.Now().Add(10 * time.Second))`

In #4604 you mentioned there was another region involved and that you weren't seeing the problem with same-region metrics collection, so the problem may be part of how we're handling RPC forwarding. If the diagram below describes this topology correctly, are you seeing the goroutine increase on Server A or Server B?
@tgross Oh, it's been a long time since then, and we abandoned that scheme. For now we don't do any cross-region queries for statistics (metrics); we have a collecting daemon in each region. That's why we removed the patch with timeouts: we hoped that cross-region queries were the real reason for the hung connections, but as practice shows, we were wrong.

As for the patch with timeouts in #4604, you are absolutely right, but we extended this timeout to 5 minutes because of the blocking queries that Nomad calls can involve (https://www.nomadproject.io/api/index.html#blocking-queries), whose maximum timeout is 5 minutes.
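For reference, a blocking query holds the HTTP request open on the server until either the relevant index changes or the wait time elapses, which is why any client-side timeout has to allow for the full 5-minute ceiling. Below is a minimal sketch with the official Go API client (github.com/hashicorp/nomad/api); the `watchJobs` loop is illustrative.

```go
// Illustrative long-poll loop against the Nomad API using blocking queries.
package blockingsketch

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func watchJobs() error {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return err
	}

	var index uint64
	for {
		// The request blocks server-side until LastIndex > index or the wait
		// time expires, so any client-side timeout has to be at least this
		// large (plus whatever jitter the server may add).
		jobs, meta, err := client.Jobs().List(&api.QueryOptions{
			WaitIndex: index,
			WaitTime:  5 * time.Minute,
		})
		if err != nil {
			return err
		}
		index = meta.LastIndex
		log.Printf("saw %d jobs at index %d", len(jobs), index)
	}
}
```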
@tantra35 I'd still like to see if I can reproduce this. One of the avenues I'm investigating is whether the number of allocations involved is very high and that's exposing some kind of deadlocking condition we'd normally never hit in development environments. You mentioned this only happens in production -- roughly how many allocations per node is this? Is the HTTP client you're using to hit the API querying all the allocations at once, or sequentially?
@tgross Hm, I don't think that this is the key to the problem, because in our case a node launches 5-10 allocations on average. Also I must say that the timing is very strange: there can be a month where we don't see any problems, but sometimes it can happen for a few days continuously.
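To make the "at once vs. sequentially" question concrete, here is a hedged sketch of a sequential collector for the handful of allocations on one node, using the official Go API client; the helper name and logging are illustrative, not the reporter's actual collector.

```go
// Illustrative sequential stats collection for the allocations on one node.
package statsketch

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func collectNodeAllocStats(client *api.Client, nodeID string) error {
	// List the allocations placed on this node (typically 5-10 in the
	// environment described in this issue).
	allocs, _, err := client.Nodes().Allocations(nodeID, nil)
	if err != nil {
		return err
	}

	// One stats call per allocation, issued sequentially; when this goes
	// through a server, the server forwards the RPC to the client node.
	for _, alloc := range allocs {
		usage, err := client.Allocations().Stats(alloc, nil)
		if err != nil {
			log.Printf("stats for alloc %s failed: %v", alloc.ID, err)
			continue
		}
		if usage.ResourceUsage != nil && usage.ResourceUsage.MemoryStats != nil {
			log.Printf("alloc %s RSS=%d bytes", alloc.ID,
				usage.ResourceUsage.MemoryStats.RSS)
		}
	}
	return nil
}
```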
@tgross Hello, we think that the real problem in our case is the leader switch itself, and the goroutine increase is a consequence. Look at the following screenshots: as you can see, at 15:44 Nomad switched leader, and at this point something went wrong and the old leader didn't return to the raft follower state, so at that point we had 2 leaders in the cluster, one of which was wrong. The increase can only be explained as connections to dead nodes (this is another bug in Nomad, as far as I can conclude), because of the yamux streaming model (only a few TCP connections); I haven't yet found the place in the Nomad code base where these TCP connections are made, but anyway it looks very strange that the TCP socket count begins to increase.

I also attached the eval broker stats at the buggy moment. There you can also see what happens at the moment of the switch. So there could be some race condition at the moment when a server node is asked for allocation stats and the leadership switch happens, or this may simply be a fast leadership loss which is not handled correctly by Nomad.

I also provide some statistical info from the logs at the buggy moment; there we can see the new leader, and the following log on the buggy node:

So the approach I suggested in #4749 could be applied here.
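As a diagnostic for the two-leaders hypothesis, each server can be asked individually who it believes the leader is, and the answers compared. A hedged sketch using the Go API client's /v1/status endpoints; the server address list is something you would supply for your own cluster.

```go
// Illustrative split-brain check: query each server's own view of the leader.
package leadersketch

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func reportLeaders(serverAddrs []string) error {
	seen := map[string]bool{}
	for _, addr := range serverAddrs { // e.g. "http://172.27.5.61:4646" (example address)
		cfg := api.DefaultConfig()
		cfg.Address = addr
		client, err := api.NewClient(cfg)
		if err != nil {
			return err
		}
		leader, err := client.Status().Leader()
		if err != nil {
			fmt.Printf("%s: error: %v\n", addr, err)
			continue
		}
		fmt.Printf("%s thinks the leader is %s\n", addr, leader)
		seen[leader] = true
	}
	if len(seen) > 1 {
		fmt.Println("servers disagree about the leader -- possible split brain")
	}
	return nil
}
```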
We just reproduced this again and got socket stats, so this is not a communication error with the worker nodes, but a problem in raft interoperation itself. So the problem described here is more complex than I thought before:
That's a pretty serious correctness issue; I'm going to re-open #4749 so we can follow up on that one. Looks like stalebot got overzealous.
I don't see an obvious way this would be related to multiple leaders,
I definitely understand your frustration with this. For what it's worth, the Nomad engineering team has been staffed up quite a bit in the last year (especially in the last 6 months). And my particular focus has been around reliability and stability issues. Thanks so much for how much you've been willing to contribute back in extensive bug reports! We'll get there!
Unfortunately it looks like we're missing the original election failover for node 172.27.5.61 in the logs you've provided -- we can see that the node is already in a flapping state and trying to execute leader operations (and failing to do so because it's not the raft leader). It probably doesn't matter how we got into the flapping state in the first place; there are plenty of networking or other reasons why a node might flap, and we need to account for that. Like in #4749, we can also see that the node has lost connectivity or has intermittent connectivity to the other two server nodes. It manages to win an election but then times out on heartbeats ~500ms later. From the output you've provided, that narrows down the places to look quite a bit in terms of trying to reproduce. While I work on that, if you can reproduce this while running with debug mode on, could you dump that output for us again?
I've been able to get similar log outputs from test clusters of 3 servers by running the script below on one of the machines. I've also tried adding extra load (by running CPU stress tests on the AWS instances).

@tantra35 you've mentioned a couple of times both 0.8.x and 0.9.x -- is this happening similarly on both clusters? If not, what's the exact version of the cluster involved? And is there any chance I can get debug-level logs? The raft system does some logging there that might be helpful to us.

```bash
#!/bin/bash
set -e

trap cleanup err exit

cleanup() {
    echo
    echo "removing queue disciplines!"
    sudo tc qdisc del dev eth0 root
}

MY_IP=$(ip addr | grep 'state UP' -A2 | tail -n1 | awk '{print $2}' | cut -f1 -d'/')
SERVER_IPS=($(curl -s localhost:4646/v1/status/peers | jq -r '.[] | split(":") | .[0]'))

# clear any old queue disciplines, if any
sudo tc qdisc del dev eth0 root > /dev/null 2>&1 || true

echo "Hit Ctrl-C to exit..."

while :
do
    # configure queue discipline classes
    sudo tc qdisc add dev eth0 root handle 1: prio
    sudo tc qdisc add dev eth0 parent 1:1 handle 10: sfq
    sudo tc qdisc add dev eth0 parent 1:2 handle 20: sfq
    sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 75% 75%

    # attach the filter to the peers
    for SERVER_IP in ${SERVER_IPS[@]}; do
        if [[ "$SERVER_IP" != "$MY_IP" ]]; then
            sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
                match ip dst "${SERVER_IP}/32" flowid 1:3
        fi
    done

    echo -n '.'
    sleep 2
    sudo tc qdisc del dev eth0 root
    sleep 5
done
```
As you don't see any leaked sockets, I am 100% sure that in your logs you can see this message pair, and that means it is not our case. At some point we stop seeing one of the messages in that pair.

This is because the leader generates a new (management) token every time it changes, and this token, as I understand it, is remembered in the eval, so we have some of these buggy evals that will never finish. How can we solve this?
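A hedged sketch of how the evaluations that never finish could be enumerated through the API; treating anything outside the terminal statuses as "stuck" is an assumption here, not a statement about the actual bug.

```go
// Illustrative listing of evaluations that haven't reached a terminal state.
package evalsketch

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func listNonTerminalEvals(client *api.Client) error {
	evals, _, err := client.Evaluations().List(nil)
	if err != nil {
		return err
	}
	for _, eval := range evals {
		// "complete", "failed", and "canceled" are terminal; anything else
		// (e.g. "pending" or "blocked") is still waiting in the broker.
		switch eval.Status {
		case "complete", "failed", "canceled":
			continue
		default:
			fmt.Printf("eval %s job=%s status=%s (%s)\n",
				eval.ID, eval.JobID, eval.Status, eval.StatusDescription)
		}
	}
	return nil
}
```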
I just want to make sure I understand very clearly:
We don't see that message. If you see the message pair, it means that in your test case Nomad works perfectly and doesn't have any problems; in our case Nomad doesn't fully leave the leader state, because the corresponding message never appears in the logs. For example, the buggy server doesn't update cluster state, and if we request an allocation that appeared after the bug happened, this server responds that the allocation is not found. Other strange things happen when we communicate with the cluster (requesting allocations, collecting allocation stats, launching jobs, etc.), so this looks like split brain.
Just wanted to add that we are experiencing the same issue. We're making calls to the API as well.
@filip-vt If you are a brave enough Jedi, you can try our binary: https://yadi.sk/d/QnKOvsBaY4fSog
…I think the real problem is in the Nomad code, which hangs (IMO in Accept) + add support for context
@tgross We saw all the changes that were made by the HashiCorp team. Good work, but for now we still use version 0.9.7 (with our custom patch), because we don't actually believe that 0.10.4 is ready for production (we would have to backport #6324 to 0.10.4). Also, the problem with leaked goroutines is only partially solved (the raft problems were eliminated, but not #4604; in that issue I provided more info on what we found).
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.8.x - latest nomad version
Operating system and Environment details
Issue
We make Nomad API calls to collect telemetry for jobs, and from time to time we see goroutine leaks appear on the control plane (we remove them by restarting the Nomad servers with leaked goroutines).
This happens in yamux code; look at our goroutines, which we captured with the Delve debugger:
glist.txt
We also got backtraces of some goroutines with the most typical behavior:
bt-1119376.txt
bt-821359.txt
bt-886686.txt
The backtraces were collected on nomad-0.9.6.