Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dead plugin+systemd = dead docker #813

Closed
bboreham opened this issue Dec 10, 2015 · 32 comments
Closed

Dead plugin+systemd = dead docker #813

bboreham opened this issue Dec 10, 2015 · 32 comments

Comments

@bboreham
Copy link
Contributor

I am developing a Docker Network plugin, so from time to time it crashes and/or I stop it running. If, during this time, I restart the host, Docker will often get completely hosed.

It seems this is a result of Docker waiting 15 seconds per endpoint to try to talk to my plugin and systemd only waiting a limited time for Docker to get started. Also it appears to be timing out three times for the same endpoint (at least in this example):

Docker should provide some way for installations to recover from this situation - maybe a --force-network-cleanup flag ? Or do the retrying in the background after getting the main daemon up and running.

Dec 10 10:16:35 systemd[1]: Starting Docker Application Container Engine...
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.392673377Z" level=info msg="API listen on /var/run/docker.sock"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.423531946Z" level=info msg="Firewalld running: false"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.463752856Z" level=error msg="getNetworkFromStore for nid d5a93b9cbfa323a53eae7808358d67ebd6470b801cf5de33b5991a7bcd5c79bf failed while trying to build sandbox for cleanup: network d5a93b9cbfa323a53eae7808358d67ebd6470b801cf5de33b5991a7bcd5c79bf not found"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.464265535Z" level=warning msg="Failed getting network for ep 72270e49642470d5676acd7f19ead85dd7e460030f71c7dc66436f1836107fa5 during sandbox cbbec073b80c886dc1f6dddbfd32f91ad5999638afdf4cf9a61bbdf29a1ba19a delete: network d5a93b9cbfa323a53eae7808358d67ebd6470b801cf5de33b5991a7bcd5c79bf not found"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.464430396Z" level=error msg="failed to delete sandbox cbbec073b80c886dc1f6dddbfd32f91ad5999638afdf4cf9a61bbdf29a1ba19a while trying to cleanup: could not cleanup all the endpoints in container 0317189ada28cb6a1f5c16c670c8ff744e7f6510db73035bcf991f02cf9ca531 / sandbox cbbec073b80c886dc1f6dddbfd32f91ad5999638afdf4cf9a61bbdf29a1ba19a"
Dec 10 10:16:35 docker[1274]: time="2015-12-10T10:16:35.484161832Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:16:36 docker[1274]: time="2015-12-10T10:16:36.485561219Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:16:38 docker[1274]: time="2015-12-10T10:16:38.486948113Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:16:42 docker[1274]: time="2015-12-10T10:16:42.487467103Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:16:50 docker[1274]: time="2015-12-10T10:16:50.487936016Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:16:51 docker[1274]: time="2015-12-10T10:16:51.488902480Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:16:53 docker[1274]: time="2015-12-10T10:16:53.490038312Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:16:57 docker[1274]: time="2015-12-10T10:16:57.491067983Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:17:05 docker[1274]: time="2015-12-10T10:17:05.491327232Z" level=error msg="getEndpointFromStore for eid 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c failed while trying to build sandbox for cleanup: could not find endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: datastore not found for scope "
Dec 10 10:17:05 docker[1274]: time="2015-12-10T10:17:05.491416646Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:17:06 docker[1274]: time="2015-12-10T10:17:06.491579074Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:17:08 docker[1274]: time="2015-12-10T10:17:08.491737246Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:17:12 docker[1274]: time="2015-12-10T10:17:12.491906061Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:17:20 docker[1274]: time="2015-12-10T10:17:20.492103311Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:17:21 docker[1274]: time="2015-12-10T10:17:21.493062542Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:17:23 docker[1274]: time="2015-12-10T10:17:23.493250821Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:17:27 docker[1274]: time="2015-12-10T10:17:27.493425280Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:17:35 docker[1274]: time="2015-12-10T10:17:35.493618436Z" level=warning msg="Failed detaching sandbox eacfbde4430e0e618097962072c92a1fba5b46f0613134b74bedf29114fbfcb7 from endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: failed to get endpoint from store during leave: could not find endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: datastore not found for scope \n"
Dec 10 10:17:35 docker[1274]: time="2015-12-10T10:17:35.493672618Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:17:36 docker[1274]: time="2015-12-10T10:17:36.493849667Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:17:38 docker[1274]: time="2015-12-10T10:17:38.494796847Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:17:42 docker[1274]: time="2015-12-10T10:17:42.494940516Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:17:50 docker[1274]: time="2015-12-10T10:17:50.495108328Z" level=warning msg="Unable to locate plugin: weave, retrying in 1s"
Dec 10 10:17:51 docker[1274]: time="2015-12-10T10:17:51.495295240Z" level=warning msg="Unable to locate plugin: weave, retrying in 2s"
Dec 10 10:17:53 docker[1274]: time="2015-12-10T10:17:53.496253373Z" level=warning msg="Unable to locate plugin: weave, retrying in 4s"
Dec 10 10:17:57 docker[1274]: time="2015-12-10T10:17:57.496416010Z" level=warning msg="Unable to locate plugin: weave, retrying in 8s"
Dec 10 10:18:05 docker[1274]: time="2015-12-10T10:18:05.497531014Z" level=warning msg="Failed deleting endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: failed to get endpoint from store during Delete: could not find endpoint 35b6ce7a721a46901c6f15828899801ce144c97da50758de83da3b7e756a7e5c: datastore not found for scope \n"
Dec 10 10:18:05 systemd[1]: docker.service start operation timed out. Terminating.
Dec 10 10:18:05 systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 10 10:18:05 systemd[1]: Failed to start Docker Application Container Engine.
@bboreham
Copy link
Contributor Author

Similar report at moby/moby#17960 (comment), but user there is not using a plugin.

@dave-tucker
Copy link
Contributor

I've seen this problem too while developing a new network plugin. I was on Debian 8 with systemd.
It happened when I stopped running my plugin and restarted the daemon.
As I recall, I had some networks created using my driver and one or two containers on each network - none of which were started.

The only way to recover from this was to restart my plugin, then restart docker 😭

retrying in the background after getting the main daemon up and running.

But what should happen to those containers if the plugin is still unreachable?

@bboreham
Copy link
Contributor Author

The only way to recover from this was to restart my plugin

This implies one must never attempt to implement a plugin in a container. If that is a recommendation, it should be in the docs.

But what should happen to those containers if the plugin is still unreachable?

Leave them unstarted.

@dave-tucker
Copy link
Contributor

This implies one must never attempt to implement a plugin in a container. If that is a recommendation, it should be in the docs.

Not sure it's a limitation - I haven't tried it with containers yet.
I do wonder, once the plugin container was started, whether the daemon might recover.

Leave them unstarted

Works for me

/cc @mrjana @mavenugo @aboch
Could one of you take a look at this?
If you can spell out what you think an appropriate fix might look like, I can see about putting a PR together.

@bboreham
Copy link
Contributor Author

once the plugin container was started

If the user has rm'd it, perhaps misguidedly, they now have no opportunity to restart it - the docker daemon is not listening to commands and won't do so until the plugin is available. Perhaps one should be able to tell Docker that a particular container implements a driver, so docker rm would fail if there are endpoints.

Also consider the case that the driver is crashing before handling any commands.

@mavenugo
Copy link
Contributor

@bboreham @dave-tucker there are many such cases and I have requested for a discussion with @calavera to get his inputs on docker plugin infra. To me, this issue is a manifestation of the way we treat plugin containers. instead of trying to band-aid this scenario, lets try and address it at an infra level.

@dave-tucker
Copy link
Contributor

Thanks @mavenugo

@bboreham
Copy link
Contributor Author

@mavenugo there was no intention of band-aiding on my side.

this issue is a manifestation of the way we treat plugin containers

Please note the problem I reported at the top will happen just the same if your plugin is not in a container.

@mavenugo
Copy link
Contributor

Please note the problem I reported at the top will happen just the same if your plugin is not in a container.

@bboreham fair enough. But, one must note that plugin is actually an extension of the docker engine.
IMO, if the plugin is down that essentially means the engine is down. If the plugin is a container, the engine must do its best to bring it up before processing the request. I think we should explore the limit of engine's responsibility to play nice with the user's intent. if the plugin is not a container, but it is still down, engine cannot do much and staying down seems an appropriate behavior. But again, these are my opinion and @calavera may have more to say on this topic.

@bboreham
Copy link
Contributor Author

OK, I see where you're coming from. To me, having no way to reconfigure this aspect of the engine when the engine is down is the fundamental problem.

I.e. you can get in to this state via some docker [...] create commands, but once you're in this state you cannot run a docker rm command to get out of it; you can only blow away your entire Docker installation.

@mavenugo
Copy link
Contributor

Yes. Any such deadlock scenario must be avoided and is certainly a topic of discussion.

@bboreham
Copy link
Contributor Author

Inspection of the code suggests the failure is happening on this code path:

NewDaemon()
  daemon.initNetworkController()
    libnetwork.New()
      controller.sandboxCleanup()
        "getEndpointFromStore for eid %s failed while trying to build sandbox for cleanup"

which all happens strictly before the call to daemon.restore() which would run the container that implements the plugin driver.

So it is impossible to have Docker survive a restart, when a plugin driver is implemented inside a container.

@mavenugo
Copy link
Contributor

@bboreham thanks a lot for the investigation and it seems valid. I will look into this more closely and work with @calavera on the plugin side to see what we can do to mitigate this.

@bboreham
Copy link
Contributor Author

To refine the problem, sandboxCleanup seems to only apply to local-scope drivers. This probably explains why I didn't always get the issue - it is very repeatable when you use a remote local-scope driver.

@bboreham
Copy link
Contributor Author

And, getEndpointFromStore is hanging trying to find the driver to ask it what its scope is. Can we refactor that part given we know the scope?

@mavenugo
Copy link
Contributor

@bboreham the only reason we do sandboxCleanup is to cleanup any stale endpoints left behind due to an ungraceful daemon restart/shutdown. If it is graceful, this will never happen.

Having said that, ungraceful restart can happen and we should try and do our best to cleanup the states.
The problem here is chicken and egg where the endpoint holds container resource, while the plugins are loaded as containers.

I think the right solution is to not load the plugin to cleanup the endpoint states during sandboxCleanup.
The plugin API Get performs the loading part internally and hence it results in timeout issue that you raised.

Fortunately, I think I have a workaround for this. Let me try a simpler solution to see if we can get this resolved.

@bboreham
Copy link
Contributor Author

I do shutdown -r now to recreate this; I recall it looks as if Docker shuts down normally, but am not in front of that computer right now.

@mavenugo
Copy link
Contributor

@bboreham the ungraceful restart of docker daemon can happen in multiple ways. Essentially if the network sandbox is not cleaned up (which is possible only in an ungraceful daemon shutdown), this can happen. PTAL sandboxCleanup and the logic is clear there.

@cpuguy83
Copy link
Member

But why is docker trying to initialize a plugin if it's not been specifically requested yet? Or better said, why is Docker requesting a plugin before a container is even attempted to be started that needs it?

@mavenugo
Copy link
Contributor

@cpuguy83 this happens because libnetwork tries to cleanup any stale endpoint resource that are left behind due to an ungraceful restart of the daemon. If we dont perform the cleanup, these resources will be held (such as ip-address) and cannot be reused after the restart. For this cleanup, we reuse the same code-path and hence it is trying to access the driver. But this is a force cleanup case and we dont have to invoke the plugin for this cleanup. But in order to retain the same code-path it would be good to get #19390. if we dont have that, then it has to be managed entirely in libnetwork.

@bboreham
Copy link
Contributor Author

It's just an incidental use of the function, while iterating though all local-scope endpoints.

I'll try to make a PR to take it out.

@mavenugo
Copy link
Contributor

@bboreham it is not an incidental use and this should not be removed. We could definitely bypass the call to the driver during this force cleanup scenario. But the real problem is none of these.

The problem is that plugins.Get is lazy and greedy and it automatically performs retry.

Maybe we need a simple API in plugins pkg to return just the available plugins (without trying to load it).
That will solve this problem as well.

@cpuguy83
Copy link
Member

@mavenugo
Copy link
Contributor

@cpuguy83 yes. GetAll makes use of pl, err := loadWithRetry(name, false) for all the plugins.
am trying to avoid that and tried to expose the API directly in moby/moby#19390.

I think I should also add a GetActivePlugins API to just return what is active instead of trying to load the plugin.

@mavenugo
Copy link
Contributor

@cpuguy83 actually, i could use the Scan() API. Though it sounds like a private API, it is actually Public. let me try using that instead. \o/

@bboreham
Copy link
Contributor Author

OK, I tried my idea of just fixing the .scope() call, but it failed straight afterwards when controller.sandboxCleanup() called Delete() which called endpoint.Leave(). All of this seems misplaced in a routine where no remote drivers are available and we are just cleaning up stale info.

I tried your #880; it crashed. But I can no longer see your comment asking me to try it, so maybe you don't expect it to work yet.

While doing all this, I noticed there is a similar problem during shutdown: if Docker shuts down the container running the plugin, then shuts down another container attached to that network, the call to Leave() will fail. Maybe we do need that "system container" ordering after all, but at shutdown.

@mavenugo
Copy link
Contributor

@bboreham i removed the comment after i figured the fix didn't work on certain scenarios.
But, I have resolved it now. Can you please try the latest #880 and let me know how it goes ?

It works fine in the scenarios that I tested.

@mavenugo
Copy link
Contributor

@bboreham I tried a valid remote plugin manually and I hit the panic that you encountered. Have resolved that as well. Will do more validations and push the final fix.

@bboreham
Copy link
Contributor Author

Thanks; I will test that later today. I raised the shutdown problem as a separate issue - #882

mavenugo added a commit to mavenugo/docker that referenced this issue Jan 19, 2016
- Fixes moby#19404
- Fixes ungraceful daemon restart issue in systemd with remote
  network plugin (moby/libnetwork#813)

Signed-off-by: Madhu Venugopal <[email protected]>
tiborvass pushed a commit to tiborvass/docker that referenced this issue Jan 26, 2016
- Fixes moby#19404
- Fixes ungraceful daemon restart issue in systemd with remote
  network plugin (moby/libnetwork#813)

Signed-off-by: Madhu Venugopal <[email protected]>
@thaJeztah
Copy link
Member

@mavenugo @bboreham can this be closed now that moby/moby#19465 was merged?

@mavenugo
Copy link
Contributor

Yes.

@bboreham
Copy link
Contributor Author

Yes, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants