
Close open connections when changing primary #2526

Closed · wants to merge 1 commit

Conversation

@wsong (Contributor) commented Oct 20, 2016

Signed-off-by: Wayne Song [email protected]

@GordonTheTurtle commented

Please sign your commits following these rules:
https://github.com/docker/docker/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "close_replica_conns" [email protected]:wsong/swarm.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@wsong (Contributor, Author) commented Oct 20, 2016

Ping @dongluochen @nishanttotla

I'm not sure if this is the right way to accomplish this; at first glance, it seems like there could be a danger of closing connections while data is still being written to them.

The reason I sent this PR out in the first place is that I noticed /info requests were being blindly forwarded to the primary, rather than being handled by a replica as they should be. Theoretically, the localRoutes logic in replica.go should handle this case, but that didn't appear to be happening, and the problem went away when I disabled keep-alives in our client code. It looks like hijacked TCP connections are being incorrectly kept alive?
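
For reference, here's a minimal sketch of that client-side workaround, assuming the engine-api client (the host address below is a placeholder); disabling keep-alives on the transport is the relevant part:

package main

import (
    "net/http"

    "github.com/docker/engine-api/client"
)

func main() {
    // Force a fresh TCP connection per request so a connection the
    // primary may have hijacked can never be reused for /info.
    httpClient := &http.Client{
        Transport: &http.Transport{DisableKeepAlives: true},
    }
    // NewClient(host, version, httpClient, httpHeaders); the host is a placeholder.
    c, err := client.NewClient("tcp://127.0.0.1:3376", "", httpClient, nil)
    if err != nil {
        panic(err)
    }
    _ = c
}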

@wsong (Contributor, Author) commented Oct 24, 2016

Tests appear to be failing, with error messages like this showing up in the logs:

# time="2016-10-24T20:31:46Z" level=error msg="failed to read lock: Get http://127.0.0.1:8500/v1/kv/test/docker/swarm/leader?index=9&wait=15000ms: EOF"

I wonder if I'm accidentally killing connections to the KV store somehow?

@dongluochen (Contributor) commented

The problem is not clear here. We may need to clarify the problem first.

/info requests were being blindly forwarded to the primary, rather than being handled by a replica, as they should be.

Forwarding /info to the primary is by design. The primary has the latest node info, such as each node's health state.

@wsong (Contributor, Author) commented Oct 28, 2016

My understanding was that the /info endpoint was supposed to be handled by the local node and not get forwarded to the primary:

https://github.com/docker/swarm/blob/master/api/replica.go#L10
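
For context, here's a simplified sketch of that pattern; it is not the actual Swarm source, and the localRoutes contents and type names below are illustrative. The key point is that the check only runs when the replica's HTTP server gets to parse a request on the connection:

package main

import (
    "net/http"
    "net/http/httputil"
    "net/url"
    "strings"
)

// Illustrative subset; see api/replica.go for the real list.
var localRoutes = []string{"/info", "/_ping"}

type replicaHandler struct {
    local   http.Handler // the replica's own router
    primary *url.URL     // the current primary's API address
}

func (r *replicaHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
    for _, route := range localRoutes {
        if strings.HasSuffix(req.URL.Path, route) {
            r.local.ServeHTTP(w, req) // handled locally, never forwarded
            return
        }
    }
    // Everything else is proxied to the primary.
    httputil.NewSingleHostReverseProxy(r.primary).ServeHTTP(w, req)
}

func main() {
    primary, _ := url.Parse("http://primary.example:3376") // placeholder
    http.ListenAndServe(":3376", &replicaHandler{local: http.NewServeMux(), primary: primary})
}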

@dongluochen (Contributor) commented

You are right that /info is handled at each replica. Here is the call sequence. Why do you think it's forwarded to the primary?

(dlv) bt
 0  0x00000000006413bb in github.com/docker/swarm/cluster/swarm.(*Cluster).Info
    at /home/dchen/go/src/github.com/docker/swarm/cluster/swarm/cluster.go:862
 1  0x00000000004679b3 in github.com/docker/swarm/cli.(*statusHandler).Status
    at /home/dchen/go/src/github.com/docker/swarm/cli/manage.go:65
 2  0x00000000005f0e9c in github.com/docker/swarm/api.getInfo
    at /home/dchen/go/src/github.com/docker/swarm/api/handlers.go:62
 3  0x0000000000609353 in github.com/docker/swarm/api.setupPrimaryRouter.func1
    at /home/dchen/go/src/github.com/docker/swarm/api/primary.go:154
 4  0x000000000050fc1d in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:1726
 5  0x0000000000646a61 in github.com/docker/swarm/vendor/github.com/gorilla/mux.(*Router).ServeHTTP
    at /home/dchen/go/src/github.com/docker/swarm/vendor/github.com/gorilla/mux/mux.go:103
 6  0x0000000000603229 in github.com/docker/swarm/api.(*Replica).ServeHTTP
    at /home/dchen/go/src/github.com/docker/swarm/api/replica.go:38
 7  0x00000000006035e2 in github.com/docker/swarm/api.(*dispatcher).ServeHTTP
    at /home/dchen/go/src/github.com/docker/swarm/api/server.go:33
 8  0x0000000000511aa7 in net/http.serverHandler.ServeHTTP
    at /usr/local/go/src/net/http/server.go:2202
 9  0x000000000050e491 in net/http.(*conn).serve
    at /usr/local/go/src/net/http/server.go:1579
10  0x000000000045dff0 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:2086

@wsong (Contributor, Author) commented Oct 28, 2016

We turned on debug logging and it was clear that the /info request was being handled by the primary and not the replica. The specific problem we saw was that /info requests to replicas were returning results with "Role: Primary" in the output.

@dongluochen (Contributor) commented

The specific problem we saw was that /info requests to replicas were returning results with "Role: Primary" in the output.

That doesn't sound right. Can you show me the commands you used to start the (primary and) replicas?

Here is what I see. Docker /info to primary:

ubuntu@ip-172-19-71-145:~$ docker -H 172.19.71.145:3376 info
Containers: 9
 Running: 5
 Paused: 0
 Stopped: 4
Images: 17
Server Version: swarm/1.2.5
Role: primary

Docker /info to replica:

ubuntu@ip-172-19-71-145:~$ docker -H 172.19.29.201:3376 info
Containers: 9
 Running: 5
 Paused: 0
 Stopped: 4
Images: 17
Server Version: swarm/1.2.5
Role: replica
Primary: 172.19.71.145:3376

@wsong (Contributor, Author) commented Oct 28, 2016

This was within UCP, and it's difficult to reproduce on the command line because (assuming our theory is correct) the issue only occurs if you're talking to a replica, make a request that gets forwarded to the primary, keep the TCP connection open, and later reuse it for an /info request. We believe the TCP connection (which the primary has hijacked) then goes straight to the primary without the replica ever getting a chance to check localRoutes.
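
To illustrate the theory, here's a hedged sketch of a hijack-and-tunnel proxy; the function and addresses are hypothetical, not Swarm's actual proxy code. Once the handler hijacks the client's TCP connection, net/http stops parsing requests on it, so a later /info sent over the same keep-alive connection goes straight to the primary:

package main

import (
    "io"
    "net"
    "net/http"
)

// tunnelToPrimary is hypothetical; it shows the general shape of a
// hijack-and-tunnel proxy, not Swarm's actual implementation.
func tunnelToPrimary(w http.ResponseWriter, req *http.Request, primaryAddr string) {
    hj, ok := w.(http.Hijacker)
    if !ok {
        http.Error(w, "hijacking not supported", http.StatusInternalServerError)
        return
    }
    clientConn, _, err := hj.Hijack()
    if err != nil {
        return
    }
    primaryConn, err := net.Dial("tcp", primaryAddr)
    if err != nil {
        clientConn.Close()
        return
    }
    // Re-send the already-parsed request to the primary, then splice the
    // two connections. From this point on, bytes flow directly between
    // client and primary; the replica's router (and its localRoutes
    // check) never sees another request on this connection.
    if err := req.Write(primaryConn); err != nil {
        clientConn.Close()
        primaryConn.Close()
        return
    }
    go io.Copy(primaryConn, clientConn)
    io.Copy(clientConn, primaryConn)
}

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
        tunnelToPrimary(w, req, "primary.example:3376") // placeholder address
    })
    http.ListenAndServe(":3376", nil)
}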

@wsong (Contributor, Author) commented Oct 28, 2016

I managed to repro this with a small script:

package main

import (
    "fmt"

    "github.com/docker/engine-api/client"
    "golang.org/x/net/context"
)

func main() {
    c, err := client.NewEnvClient()
    if err != nil {
        panic(err)
    }
    // This request doesn't match localRoutes, so the replica forwards it
    // to the primary; the keep-alive TCP connection stays open afterward.
    c.ContainerInspect(context.TODO(), "asdf")
    // The client reuses the same connection, and /info comes back from
    // the primary with "Role: primary".
    info, err := c.Info(context.TODO())
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", info)
}

To test this, I set up a two-node cluster with two managers: node A and node B. Node B was a replica, and I set DOCKER_HOST to node B's Swarm manager address. docker info at the command line correctly showed "Role: replica", but running the above script with the same env var showed "Role: primary".

@dongluochen (Contributor) commented

It looks like hijacked TCP connections are being incorrectly kept alive?

The problem may need a clearer trace, or analysis of a network trace. If keepalive is the problem, I think disconnecting after the hijack is the solution, but I'd try not to add another channel.
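
A minimal sketch of that idea (hypothetical, not an actual patch to Swarm): close both ends of the tunnel as soon as either direction finishes, so the client can't reuse the hijacked connection for a later request:

package main

import (
    "io"
    "net"
)

// tunnelOnce splices two connections and closes both as soon as either
// direction finishes, instead of leaving them open for keep-alive reuse.
func tunnelOnce(clientConn, primaryConn net.Conn) {
    done := make(chan struct{}, 2)
    copyAndSignal := func(dst, src net.Conn) {
        io.Copy(dst, src)
        done <- struct{}{}
    }
    go copyAndSignal(primaryConn, clientConn)
    go copyAndSignal(clientConn, primaryConn)
    <-done // one direction finished: disconnect both ends
    clientConn.Close()
    primaryConn.Close()
}

func main() {} // sketch only; tunnelOnce would run on the proxy's hijack path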
