Skip to content
This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

#2488 Update vendoring to remove engine-api #2492

Conversation

michelvocks
Copy link
Contributor

@michelvocks michelvocks commented Sep 24, 2016

This PR is a huge one. Tell me if I should split it up somehow.

What I've done

  • Changed imports from engine-api to docker/api/type or docker/clients
  • Updated whole vendoring (not sure if this was a good idea)
  • docker/docker/pkg/version has been replaced upstream with docker/api/types/versions
  • Docker upstream docker/pkg/tlsconfig has been updated and does not work anymore with golang 1.5 (only 1.6 and higher is supported). Function "Clone" is only implemented for 1.6 or higher. Therefore, I upgraded godeps master version to golang 1.7.
  • cluster/event_monitor.go was changed to be compatible with the changed Events-API interface.

What needs to be done?

  • Some tests are currently failing. Working on this.
  • Golang needs to be changed to 1.7 in the docs.

This fixes #2488

ping @nishanttotla @dongluochen @allencloud

Signed-off-by: Michel Vocks [email protected]

@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch from 8d15522 to f315db7 Compare September 26, 2016 18:51
@michelvocks michelvocks changed the title [WIP] #2488 Update vendoring to remove engine-api #2488 Update vendoring to remove engine-api Sep 26, 2016
@michelvocks
Copy link
Contributor Author

Fixed all unit tests. Most CIs are failing because go 1.6 or higher is required (by upstream). PTAL @nishanttotla

@nishanttotla nishanttotla added this to the 1.2.6 milestone Sep 26, 2016
@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch from f315db7 to 0d6faa1 Compare September 28, 2016 14:44
@nishanttotla
Copy link
Contributor

Ping @michelvocks sorry for the delay on this. We'd like to pick it up again and get it ready soon. Could you please fix the merge conflict?

@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch from 0d6faa1 to 704dc98 Compare October 17, 2016 07:36
@michelvocks
Copy link
Contributor Author

@nishanttotla No issue. I've fixed the merge conflict.

@dongluochen
Copy link
Contributor

Build is failing. I'm tracking similar problem in #2522.

++ time go build -o /tmp/tmp.1rpt7MLm4W ../..
# github.com/docker/swarm/vendor/github.com/docker/docker/client
../../vendor/github.com/docker/docker/client/hijack.go:144: undefined: "github.com/docker/swarm/vendor/github.com/docker/docker/pkg/tlsconfig".Clone
Build step 'Execute shell' marked build as failure

@dongluochen
Copy link
Contributor

github.com/docker/docker/pkg/tlsconfig defines Clone() for go1.6+. Docker/swarm is using go1.5.4. #2522 bumps it to go1.7.1. After that we can use Clone() in github.com/docker/docker/pkg/tlsconfig.

@dongluochen dongluochen mentioned this pull request Oct 19, 2016
@nishanttotla
Copy link
Contributor

@michelvocks sorry for the delay on this, we're updating CI and Go version right now. This is needed to fix some failing tests. I'll ping you when that is done. You'll have to rebase one more time.

@dongluochen
Copy link
Contributor

@michelvocks We have updated Go to 1.7.1. Please rebase your PR.

@michelvocks
Copy link
Contributor Author

michelvocks commented Oct 20, 2016

Thanks. Will do that now.

@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch from 704dc98 to a9fd789 Compare October 20, 2016 09:48
@michelvocks
Copy link
Contributor Author

I've rebased my PR.

@dongluochen
Copy link
Contributor

Something is not right with godep. Godep is using go1.6 in your PR. But I think it's something else causes the failure.

I can run the following command locally with your PR to get the error.

dchen@vm2:~/go/src/github.com/docker/swarm$ ./test/integration/run.sh
Using default tag: latest
...
++ chmod +x /usr/local/bin/docker-compose
++ time go build -o /tmp/tmp.4eCkwdBxZb ../..
# github.com/docker/swarm/vendor/github.com/docker/docker/client
../../vendor/github.com/docker/docker/client/hijack.go:144: undefined: "github.com/docker/swarm/vendor/github.com/docker/docker/pkg/tlsconfig".Clone

@michelvocks
Copy link
Contributor Author

@dongluochen I've set godep to golang 1.6 (like described in my initial PR message). That was before you decided to switch golang to 1.7. Should I change godep to 1.7?

I'm also not able to reproduce the error you have mentioned. Are you sure that you use golang 1.6 or higher?

@dongluochen
Copy link
Contributor

@michelvocks Yes please upgrade it to go1.7.1 as docker/docker is already on it.

If you open the Jenkins Details links below you can see the same error.

@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch from a9fd789 to 07d1cec Compare October 22, 2016 09:37
@michelvocks
Copy link
Contributor Author

@dongluochen I've changed godeps to 1.7 golang version.

The CI jobs are still failing (like you said). I think this is related to the dockerswarm environment image. This image is used by the CI jobs and there is still golang 1.5 installed. I think we have to upgrade this.

@dongluochen
Copy link
Contributor

@michelvocks You are right. We have updated dockerswarm environment image.

@michelvocks
Copy link
Contributor Author

Great. Can you restart the CI jobs?

@dongluochen
Copy link
Contributor

@michelvocks There is something wrong with this change. I can reproduce the test failure in my machine. It looks like node-0 failure would not move the node to Unhealthy state.

# time="2016-10-24T19:08:09Z" level=debug msg="Engine refresh failed" id="QE24:GSA5:RVLO:VV2Z:2F6Y:OXYK:CLBH:HOQ2:WTNO:M5DQ:6BHP:PYQ4" name=node-0 
# time="2016-10-24T19:08:09Z" level=debug msg="Engine update succeeded" id="2AAU:NJNU:Q4NK:M7W4:BBHU:3MDW:TLGM:LAA4:M3RU:6LJQ:ZH7V:LJTD" name=node-1 
# time="2016-10-24T19:08:09Z" level=debug msg="Start monitoring events" id="QE24:GSA5:RVLO:VV2Z:2F6Y:OXYK:CLBH:HOQ2:WTNO:M5DQ:6BHP:PYQ4" name=node-0 
# time="2016-10-24T19:08:09Z" level=debug msg="Start monitoring events" id="QE24:GSA5:RVLO:VV2Z:2F6Y:OXYK:CLBH:HOQ2:WTNO:M5DQ:6BHP:PYQ4" name=node-0 
# Command "eval docker_swarm info | grep -q 'Unhealthy'" failed 5 times. Output: 

@michelvocks
Copy link
Contributor Author

michelvocks commented Oct 25, 2016

@dongluochen Thanks. I found the bug. It seems like the string comparison in the file cluster/engine.go was not correct anymore:

// dockerclient defines ErrConnectionRefused error. but if http client is from swarm, it's not using
    // dockerclient. We need string matching for these cases. Remove the first character to deal with
    // case sensitive issue.
    // docker/api returns ErrConnectionFailed error, so we check for that as long as dockerclient exists
    if err == dockerclient.ErrConnectionRefused ||
        err == engineapi.ErrConnectionFailed ||
        strings.Contains(err.Error(), "onnection refused") ||
        strings.Contains(err.Error(), "annot connect to the docker engine endpoint") {
        // each connection refused instance may increase failure count so
        // engine can fail fast. Short engine freeze or network failure may result
        // in engine marked as unhealthy. If this causes unnecessary failure, engine
        // can track last error time. Only increase failure count if last error is
        // not too recent, e.g., last error is at least 1 seconds ago.
        e.incFailureCount()
        // update engine error message
        e.setErrMsg(err.Error())
        return
    }

It compares the error message (e.g. cannot connect to the docker engine endpoint) with a hardcoded string. This string (the error message) has been changed within the last API change from "cannot connect to the docker engine endpoint" to "Cannot connect to the Docker daemon at...". Therefore, the compare was always wrong.

Edit: Still have to work for the master CI. Will try to investigate the issue.

@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch from 07d1cec to 149c072 Compare October 25, 2016 12:14
@michelvocks
Copy link
Contributor Author

michelvocks commented Nov 23, 2016

@nishanttotla Welcome! ☺️
Yes, correct. It seems like the master-branch has problems with the changes from the PR #2541.
If you want I can have a look at this but it will take time to figure out what's going on there.

I fixed the issue with the engine-1.9, engine-1.10 and engine-1.11. I also fixed an issue with one of the tests on the master-branch. A new line "Minimum API version" has been added to the output of docker version command which led to a failure for one of the tests. I'm not quite sure why this test is still failing. Manually reproducing this test within an docker development environment works now. Maybe the test environment/image needs an update?

@nishanttotla
Copy link
Contributor

@michelvocks actually it looks great! There was some issue with the engine-master tests, so I've restarted the CI.

@nishanttotla
Copy link
Contributor

@michelvocks in addition to the API version test and leader election, several other engine-master tests are failing as well. Do you have some idea why?

@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch from cd51b3e to a742008 Compare November 24, 2016 11:07
@michelvocks
Copy link
Contributor Author

michelvocks commented Nov 24, 2016

@nishanttotla I was able to fix the api version test. I had to remove whitespaces before doing the comparison.
Additionally, I was able to figure out why the other tests are failing. The image affinity scheduling does not work. I'm currently investigating this issue.

@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch 4 times, most recently from ee880c7 to 5bf3bd5 Compare November 25, 2016 21:02
@michelvocks michelvocks force-pushed the 2488-update-vendoring-to-remove-engine-api branch from 5bf3bd5 to 8c1f769 Compare November 25, 2016 22:45
@michelvocks
Copy link
Contributor Author

michelvocks commented Nov 25, 2016

@nishanttotla @dongluochen
After another full day of work on this PR I think it is now ready to merge (at least from my side).
I was able to solve all issues related to engine-master especially the leader-election tests.

leader-election tests
replica.go holds an array of localRoutes which should be handled locally. If a request arrives which is not in the list it will be hijacked and routed to the swarm primary manager. Due to the fact that the docker engine 1.13 client now uses first the /_ping endpoint before serving the /info endpoint, the /info request has been always routed to the primary swarm manager. The solution was quite easy, I just added the /_ping endpoint to the localRoutes array. Therefore, the test about checking if this daemon is a replica was always false because the request has been always hijacked and forwarded to the primary swarm manager.

image scheduling affinity
Some tests were failing because the scheduler couldn't apply the image affinity. I found out that the cluster.go file included an error message check which has been changed within docker 1.13 from image xyz not found to repository xyz not found. That check was hindering the scheduler to find the correct node where the image is already existing. The solution was to add another check for the new expected error message.

Docker run with --rm flag
There was one check where a container has been started with the --rm-flag. I found out that the docker daemon always throws the following error when running a simple container error removing container: Error response from daemon: Container 52839ba6b7f16d2ff8a92f19eeb2077205e7d3d2aa04fe1941836387c127ed42 not found. I think this error comes from the fact that the container execution is really fast and the container stops itself. This error is redirected to the client and with docker 1.13 it will be also printed to the user for some reason. The expected test output was before Hello Args and now with the new client it is Hello ArgsERR[0001] error removing container.... For now I just enhanced the test that it can handle this behavior because I'm not sure why error messages now are directly printed to the user.

Tests which are still failing
engine-master 38 strategy binpack
This is a flaky test. If you compare this test with the results from the last runs you will see that it is randomly failing.

@wsong
Copy link
Contributor

wsong commented Nov 28, 2016

@michelvocks Thanks for looking into the leader-election tests. I've spent a fair bit of time looking into those too, and I've seen precisely the HTTP hijacking issue you've seen with /info. I actually had a PR out to fix it: #2526

But that PR itself had some issues that were difficult to resolve. I'm glad that you've also seen this issue, though, since it gives us some more data points.

Anyways, the reason why /_ping isn't in localRoutes is because of #2539, where I changed /_ping to forward to the primary so that it could return an error code if the primary was down. I've sent out #2554 to maintain this behavior while avoiding the primary-forwarding behavior that's messing you up here.

@nishanttotla
Copy link
Contributor

@michelvocks 🎉 🎉 🎉 great job, thanks a lot! The binpack test has been flaky in the past, and I'm not surprised it's being flaky now.

@wsong we'll review your PR and merge it as well, thanks!

cc @dongluochen

@dongluochen
Copy link
Contributor

Thanks @michelvocks.

LGTM.

@wsong We will need to prioritize this PR. We can fix /_ping after this.

@michelvocks
Copy link
Contributor Author

@wsong Good point! Didn't noticed that. 😊
@dongluochen @nishanttotla Welcome! 😇

@nishanttotla
Copy link
Contributor

I kicked the tests one more time just to be sure, and it seems like there's a failure on engine-1.11

not ok 5 label affinity in parallel
# (from function `retry' in file ./helpers.bash, line 80,
#  from function `wait_until_reachable' in file ./helpers.bash, line 85,
#  from function `start_docker' in file ./helpers.bash, line 222,
#  from function `start_docker_with_busybox' in file ./helpers.bash, line 181,
#  in test file ./affinities.bats, line 148)
#   `start_docker_with_busybox 2' failed
# Command "docker -H 127.0.0.1:5014 info" failed 15 times. Output: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
# Stopping 9a8c01894077df74d3ba042b15685f1e61b40e42f8ddab3e10cd57ac96e7db5b
# Stopping ea9d6fd34cd76e40effa2cb34328687e71d8b23a88de9e6321f3c876aedb7a77

I can't remember if this was seen before?

It might be another error message mismatch, so we'll see, but otherwise LGTM.

@michelvocks
Copy link
Contributor Author

@nishanttotla This must be a flaky test. The test fails at the beginning with the helper function start_docker_with_busybox 2. This helper function is used in 80% of all tests (and also in the test before and after this test). If you restart the tests I think this one will not fail again.

@nishanttotla
Copy link
Contributor

@michelvocks you're right. Most of the failures are flaky at this point. I think we'll merge this PR.

@nishanttotla nishanttotla merged commit 5a254ec into docker-archive:master Nov 30, 2016
@nishanttotla
Copy link
Contributor

Merged 🎉 🎉 🎉

@michelvocks once again thanks a lot for your effort.

@wsong please rebase your other PRs on top of this one and we'll look at them.

@wsong
Copy link
Contributor

wsong commented Nov 30, 2016

@michelvocks Thanks for the hard work!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update vendoring to remove engine-api
6 participants