Stale dnsmasq.pid can prevent IP allocation due to PID clash #1741

Closed
jocado opened this issue Sep 18, 2020 · 8 comments · Fixed by #1748
Labels
bug medium medium importance
Milestone

Comments

@jocado

jocado commented Sep 18, 2020

Describe the bug
If a stale dnsmasq.pid is left over from a previous run and an unrelated running process now has that PID, instances fail to boot properly with the error:

launch failed: The following errors occurred:
failed to determine IP address

On systems that don't reboot often, PID allocation makes this unlikely, but if you reboot daily, for instance, it becomes much more likely. I have seen it happen, so it is at least possible :)

To Reproduce
Reboot a system
If you are unlucky and one of the running PIDs matches the contents of dnsmasq.pid, dnsmasq will not start and instances fail to launch.

To reliably reproduce:
Stop multipassd
echo {PID_OF_A_RUNNING_PROCESS} > /var/snap/multipass/common/data/multipassd/network/dnsmasq.pid
Start multipassd
Try and launch an instance
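
A minimal Python sketch of why the clash bites, assuming (this is not confirmed from the multipassd source) that the startup check only asks whether the PID stored in the file belongs to a live process. `pid_is_alive` is a hypothetical helper, and the pid file here lives in a temporary directory rather than the real snap path.

```python
import os
import tempfile


def pid_is_alive(pid: int) -> bool:
    """Return True if any process with this PID exists.

    Signal 0 delivers nothing; it only performs the existence check.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


# Simulate a stale pid file that happens to hold the PID of an
# unrelated running process (this Python interpreter).
with tempfile.TemporaryDirectory() as tmp:
    pid_file = os.path.join(tmp, "dnsmasq.pid")
    with open(pid_file, "w") as f:
        f.write(f"{os.getpid()}\n")

    with open(pid_file) as f:
        stale_pid = int(f.read().strip())

    # A liveness-only check cannot tell this process apart from dnsmasq.
    naive_result = pid_is_alive(stale_pid)
```

Under that assumption the check passes for any live PID, which would match the "existing dnsmasq found" message in the service logs even though no dnsmasq is running.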

Expected behavior
dnsmasq.pid should either be purged when multipassd stops, or additional logic should be added to check for a process signature that matches dnsmasq before assuming dnsmasq is running.
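
Both options could look roughly like this in Python. This is only a sketch of the idea, not how multipassd implements it: `pid_file_points_at` and `purge_pid_file` are hypothetical names, and the "signature" check assumes a Linux host where `/proc/<pid>/comm` holds the process name.

```python
from pathlib import Path


def pid_file_points_at(pid_file: Path,
                       expected_name: str = "dnsmasq",
                       proc_root: Path = Path("/proc")) -> bool:
    """Return True only if the PID in pid_file belongs to a process
    whose name matches expected_name."""
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pid file.
        return False
    try:
        # On Linux, /proc/<pid>/comm contains the process name.
        comm = (proc_root / str(pid) / "comm").read_text().strip()
    except OSError:
        # No such process (or not readable): treat the pid file as stale.
        return False
    return comm == expected_name


def purge_pid_file(pid_file: Path) -> None:
    """Remove the pid file on daemon stop; a missing file is not an error."""
    pid_file.unlink(missing_ok=True)  # requires Python 3.8+
```

The `proc_root` parameter exists only so the check can be exercised against a fake directory tree. With either option in place, a PID that has been reused by an unrelated process is no longer mistaken for a running dnsmasq.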

Instances should still be able to launch. A non-technical user will find it hard to debug this issue, and it could be the cause of some other tickets I've seen:
#1653
#1584

Logs
Also see this in the service logs:

Sep 18 11:40:02 hostname multipassd[27492]: Looking for dnsmasq
Sep 18 11:40:02 hostname multipassd[27492]: Read pid "27405" from file "/var/snap/multipass/common/data/multipassd/network/dnsmasq.pid"
Sep 18 11:40:02 hostname multipassd[27492]: existing dnsmasq found with pid 27405

Additional info
Ubuntu 18.04.5

multipass 1.4.0
multipassd 1.4.0

@jocado jocado added the bug label Sep 18, 2020
@ricab
Collaborator

ricab commented Sep 18, 2020

Hi @jocado, thanks for the very useful report. Indeed something we overlooked and need to address!

@ricab ricab added high high importance medium medium importance and removed high high importance labels Sep 18, 2020
@ricab
Collaborator

ricab commented Sep 18, 2020

A couple of updates after internal discussion:

  • we're moving to lxd by default, where this won't be an issue, so reducing the priority to medium
  • I will still try a quick fix and see if it works with confinement. If so, great. Otherwise we may end up dropping daemonization of dnsmasq entirely and deal with effects outside of snaps.

@jocado
Author

jocado commented Sep 24, 2020

Thanks!

@dlbeck
Contributor

dlbeck commented Nov 28, 2020

@ricab Hi, I am running into this issue when trying to launch. I am wondering if you guys found a "quick fix" for the issue. I am trying to develop on Ubuntu 20.04 inside of VirtualBox. Thanks

@ricab
Collaborator

ricab commented Nov 30, 2020

Hi @dlbeck, yeah this got fixed in v1.5.0. Are you using an earlier version?

@erazemk

erazemk commented Dec 13, 2020

Hi, I'm also getting this problem, just installed multipass using snap, version info below:

snap    2.47.1-1.fc33
snapd   2.47.1-1.fc33
series  16
fedora  33
kernel  5.9.12-200.fc33.x86_64

multipass  1.5.0
multipassd 1.5.0

I do get the following warning, but I doubt it has anything to do with this:
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement

Tried both a normal launch and with cloud init, same result.

@ricab
Collaborator

ricab commented Dec 14, 2020

Hi @erazemk, Multipass no longer reuses earlier dnsmasq processes, so this issue could not be the one causing your trouble. Would #1448 be a better match for what you're experiencing?

@erazemk

erazemk commented Dec 14, 2020

> Would #1448 be a better match for what you're experiencing?

Looks like it might be that yes, thanks :)


5 participants