
Close open connections when changing primary #2526

Closed · wants to merge 1 commit

Conversation

@wsong (Contributor) commented Oct 20, 2016

Signed-off-by: Wayne Song [email protected]

@GordonTheTurtle commented

Please sign your commits following these rules:
https://github.com/docker/docker/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "close_replica_conns" [email protected]:wsong/swarm.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@wsong (Contributor, Author) commented Oct 20, 2016

Ping @dongluochen @nishanttotla

I'm not sure if this is the right way to accomplish this; at first glance, it seems like there could be a danger of closing connections while data is still being written to them.

The reason I sent this PR out in the first place is that I noticed /info requests were being blindly forwarded to the primary, rather than being handled by a replica as they should be. Theoretically, the localRoutes logic in replica.go should handle this case, but that didn't appear to be happening, and the problem went away when I disabled keep-alives in our client code. It looks like hijacked TCP connections are being incorrectly kept alive?
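
For reference, here's a minimal sketch of that client-side workaround, assuming the engine-api client (the host address below is a placeholder); disabling keep-alives on the transport is the relevant part:

package main

import (
    "net/http"

    "github.com/docker/engine-api/client"
)

func main() {
    // Force a fresh TCP connection per request so a connection the
    // primary may have hijacked can never be reused for /info.
    httpClient := &http.Client{
        Transport: &http.Transport{DisableKeepAlives: true},
    }
    // NewClient(host, version, httpClient, httpHeaders); the host is a placeholder.
    c, err := client.NewClient("tcp://127.0.0.1:3376", "", httpClient, nil)
    if err != nil {
        panic(err)
    }
    _ = c
}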

@wsong (Contributor, Author) commented Oct 24, 2016

Tests appear to be failing, with error messages like this showing up in the logs:

# time="2016-10-24T20:31:46Z" level=error msg="failed to read lock: Get http://127.0.0.1:8500/v1/kv/test/docker/swarm/leader?index=9&wait=15000ms: EOF"

I wonder if I'm accidentally killing connections to the KV store somehow?

@dongluochen (Contributor) commented

The problem is not clear here. We may need to clarify the problem first.

/info requests were being blindly forwarded to the primary, rather than being handled by a replica, as they should be.

Forwarding /info to the primary is by design. The primary has the latest node info, such as each node's health state.

@wsong (Contributor, Author) commented Oct 28, 2016

My understanding was that the /info endpoint was supposed to be handled by the local node and not get forwarded to the primary:

https://github.com/docker/swarm/blob/master/api/replica.go#L10
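
For context, here's a simplified sketch of that pattern; it is not the actual Swarm source, and the localRoutes contents and type names below are illustrative. The key point is that the check only runs when the replica's HTTP server gets to parse a request on the connection:

package main

import (
    "net/http"
    "net/http/httputil"
    "net/url"
    "strings"
)

// Illustrative subset; see api/replica.go for the real list.
var localRoutes = []string{"/info", "/_ping"}

type replicaHandler struct {
    local   http.Handler // the replica's own router
    primary *url.URL     // the current primary's API address
}

func (r *replicaHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
    for _, route := range localRoutes {
        if strings.HasSuffix(req.URL.Path, route) {
            r.local.ServeHTTP(w, req) // handled locally, never forwarded
            return
        }
    }
    // Everything else is proxied to the primary.
    httputil.NewSingleHostReverseProxy(r.primary).ServeHTTP(w, req)
}

func main() {
    primary, _ := url.Parse("http://primary.example:3376") // placeholder
    http.ListenAndServe(":3376", &replicaHandler{local: http.NewServeMux(), primary: primary})
}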

@dongluochen (Contributor) commented

You are right that /info is handled at each replica. Here is the call sequence. Why do you think it's forwarded to the primary?

(dlv) bt
 0  0x00000000006413bb in github.com/docker/swarm/cluster/swarm.(*Cluster).Info
    at /home/dchen/go/src/github.com/docker/swarm/cluster/swarm/cluster.go:862
 1  0x00000000004679b3 in github.com/docker/swarm/cli.(*statusHandler).Status
    at /home/dchen/go/src/github.com/docker/swarm/cli/manage.go:65
 2  0x00000000005f0e9c in github.com/docker/swarm/api.getInfo
    at /home/dchen/go/src/github.com/docker/swarm/api/handlers.go:62
 3  0x0000000000609353 in github.com/docker/swarm/api.setupPrimaryRouter.func1
    at /home/dchen/go/src/github.com/docker/swarm/api/primary.go:154
 4  0x000000000050fc1d in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:1726
 5  0x0000000000646a61 in github.com/docker/swarm/vendor/github.com/gorilla/mux.(*Router).ServeHTTP
    at /home/dchen/go/src/github.com/docker/swarm/vendor/github.com/gorilla/mux/mux.go:103
 6  0x0000000000603229 in github.com/docker/swarm/api.(*Replica).ServeHTTP
    at /home/dchen/go/src/github.com/docker/swarm/api/replica.go:38
 7  0x00000000006035e2 in github.com/docker/swarm/api.(*dispatcher).ServeHTTP
    at /home/dchen/go/src/github.com/docker/swarm/api/server.go:33
 8  0x0000000000511aa7 in net/http.serverHandler.ServeHTTP
    at /usr/local/go/src/net/http/server.go:2202
 9  0x000000000050e491 in net/http.(*conn).serve
    at /usr/local/go/src/net/http/server.go:1579
10  0x000000000045dff0 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:2086

@wsong (Contributor, Author) commented Oct 28, 2016

We turned on debug logging and it was clear that the /info request was being handled by the primary and not the replica. The specific problem we saw was that /info requests to replicas were returning results with "Role: Primary" in the output.

@dongluochen (Contributor) commented

The specific problem we saw was that /info requests to replicas were returning results with "Role: Primary" in the output.

That doesn't sound right. Can you show me the commands you used to start the (primary and) replicas?

Here is what I see. Docker /info to primary:

ubuntu@ip-172-19-71-145:~$ docker -H 172.19.71.145:3376 info
Containers: 9
 Running: 5
 Paused: 0
 Stopped: 4
Images: 17
Server Version: swarm/1.2.5
Role: primary

Docker /info to replica:

ubuntu@ip-172-19-71-145:~$ docker -H 172.19.29.201:3376 info
Containers: 9
 Running: 5
 Paused: 0
 Stopped: 4
Images: 17
Server Version: swarm/1.2.5
Role: replica
Primary: 172.19.71.145:3376

@wsong (Contributor, Author) commented Oct 28, 2016

This was within UCP, and it's difficult to reproduce on the command line because (assuming our theory is correct) the issue only occurs if you're talking to a replica, make a request that gets forwarded to the primary, keep the TCP connection open, and later reuse it for an /info request. We believe the TCP connection (which the primary has hijacked) then goes straight to the primary without the replica ever getting a chance to check localRoutes.
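
To illustrate the theory, here's a hedged sketch of a hijack-and-tunnel proxy; the function and addresses are hypothetical, not Swarm's actual proxy code. Once the handler hijacks the client's TCP connection, net/http stops parsing requests on it, so a later /info sent over the same keep-alive connection goes straight to the primary:

package main

import (
    "io"
    "net"
    "net/http"
)

// tunnelToPrimary is hypothetical; it shows the general shape of a
// hijack-and-tunnel proxy, not Swarm's actual implementation.
func tunnelToPrimary(w http.ResponseWriter, req *http.Request, primaryAddr string) {
    hj, ok := w.(http.Hijacker)
    if !ok {
        http.Error(w, "hijacking not supported", http.StatusInternalServerError)
        return
    }
    clientConn, _, err := hj.Hijack()
    if err != nil {
        return
    }
    primaryConn, err := net.Dial("tcp", primaryAddr)
    if err != nil {
        clientConn.Close()
        return
    }
    // Re-send the already-parsed request to the primary, then splice the
    // two connections. From this point on, bytes flow directly between
    // client and primary; the replica's router (and its localRoutes
    // check) never sees another request on this connection.
    if err := req.Write(primaryConn); err != nil {
        clientConn.Close()
        primaryConn.Close()
        return
    }
    go io.Copy(primaryConn, clientConn)
    io.Copy(clientConn, primaryConn)
}

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
        tunnelToPrimary(w, req, "primary.example:3376") // placeholder address
    })
    http.ListenAndServe(":3376", nil)
}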

@wsong (Contributor, Author) commented Oct 28, 2016

I managed to repro this with a small script:

package main

import (
    "fmt"

    "github.com/docker/engine-api/client"
    "golang.org/x/net/context"
)

func main() {
    c, err := client.NewEnvClient()
    if err != nil {
        panic(err)
    }
    // This request doesn't match localRoutes, so the replica forwards it
    // to the primary; the keep-alive TCP connection stays open afterward.
    c.ContainerInspect(context.TODO(), "asdf")
    // The client reuses the same connection, and /info comes back from
    // the primary with "Role: primary".
    info, err := c.Info(context.TODO())
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", info)
}

To test this, I set up a two-node cluster with two managers: node A and node B. Node B was a replica, and I set DOCKER_HOST to node B's Swarm manager address. docker info at the command line correctly showed "Role: replica", but running the above script with the same env var showed "Role: primary".

@dongluochen (Contributor) commented

It looks like hijacked TCP connections are being incorrectly kept alive?

The problem may need a clearer trace, or analysis of a network trace. If keepalive is the problem, I think disconnecting after the hijack is the solution, but I'd try not to add another channel.
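
A minimal sketch of that idea (hypothetical, not an actual patch to Swarm): close both ends of the tunnel as soon as either direction finishes, so the client can't reuse the hijacked connection for a later request:

package main

import (
    "io"
    "net"
)

// tunnelOnce splices two connections and closes both as soon as either
// direction finishes, instead of leaving them open for keep-alive reuse.
func tunnelOnce(clientConn, primaryConn net.Conn) {
    done := make(chan struct{}, 2)
    copyAndSignal := func(dst, src net.Conn) {
        io.Copy(dst, src)
        done <- struct{}{}
    }
    go copyAndSignal(primaryConn, clientConn)
    go copyAndSignal(clientConn, primaryConn)
    <-done // one direction finished: disconnect both ends
    clientConn.Close()
    primaryConn.Close()
}

func main() {} // sketch only; tunnelOnce would run on the proxy's hijack path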
