No resolv.conf being created in initramfs after pxe boot #393

Closed
ashak opened this issue Feb 17, 2020 · 14 comments

@ashak

ashak commented Feb 17, 2020

Hi,

I've been testing out migration to FCOS from Container Linux.

Initially my machines were pxe booting with kernel command line options:
ip=dhcp rd.neednet=1 console=tty0 console=ttyS0 ignition.firstboot ignition.platform.id=metal ignition.config.url=http://pxeserver/fedora-coreos/ignition/generic-pxe.ign

But it was failing to download the ignition file due to not being able to resolve DNS.

The system had an IP address on eth0. I could see dhclient processes running for eth0, eth1 and eth2 (oddly not eth3, but whatever)

/tmp/net.eth0.resolv.conf was populated with the correct nameserver etc. that had been received via DHCP (presumably by /sbin/dhclient-script). But /etc/resolv.conf doesn't exist. I was unable to work out what (if anything) should be taking the /tmp/net.*.resolv.conf files and creating /etc/resolv.conf.

@dustymabe helped me out on IRC, pointing me at some dracut code that takes the /tmp/net.*.resolv.conf file and populates /run/initramfs/state/etc/resolv.conf from them. That file (and the /run/initramfs/state directory that it would be in) don't exist on my system.
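For anyone trying to picture the missing step, here is a minimal Python sketch of what merging the per-interface files could look like. It is not the dracut code: the helper name `merge_resolv_confs` and the keep-first-occurrence policy are assumptions made purely for illustration.

```python
def merge_resolv_confs(contents):
    """Merge several resolv.conf-style texts into one (hypothetical helper).

    Keeps the first occurrence of every non-comment, non-empty line, so a
    nameserver offered on several interfaces is listed only once.
    """
    seen = set()
    merged = []
    for text in contents:
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if line not in seen:
                seen.add(line)
                merged.append(line)
    # An empty input yields an empty file rather than a lone newline.
    return "\n".join(merged) + ("\n" if merged else "")
```

In the initramfs, the inputs would be the contents of the /tmp/net.*.resolv.conf files described above; an interface that received no DNS information simply contributes nothing.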

We got no further in looking at what might turn that file into /etc/resolv.conf because it was suggested that I change my kernel command line from ip=dhcp to ip=eth0:dhcp. This change allowed the system to end up with working DNS and therefore download my ignition file.

The systems I'm using have 4 network cards. eth0 and eth2 are plugged into the same VLAN, on which there's a DHCP server running. eth1 and eth3 are plugged into the same networks as each other (lots of VLAN-tagged networks), but not the same ones as eth0 and eth2. Our DHCP server only hands out addresses if it finds a matching MAC address in its config. The MAC of eth0 is in our DHCP config; the MAC of eth2 isn't.

I don't have rpm-ostree status output from the system when it wasn't working. If it would be useful I can reproduce and try to provide it.

@arithx
Contributor

arithx commented Feb 17, 2020

Afaik that dracut module (write-ifcfg) is not enabled on Fedora CoreOS.

@dustymabe
Member

dustymabe commented Feb 17, 2020

@arithx I couldn't remember if it was that module or some other one. That was the only one that I could grep for that seemed to be writing to a resolv.conf file.

Either way I found some notes from last year where I had debugged a similar issue and my notes said:

- It turns out that there is an extra NIC interface on his system provided
  by the hardware platform (for IPMI) that gets DHCP from the local hardware. 

- With the `ip=dhcp rd.neednet=1` settings on the kernel command line the 
  system comes up and tries to get DHCP on all interfaces including this 
  special IPMI NIC.

- Apparently dhclient overwrites /etc/resolv.conf for each instance of dhclient
  running.

- This means the DNS settings of the last interface to get initialized will win

- The DHCP response for his "IPMI NIC" provided no DNS information so /etc/resolv.conf
  got populated as an empty file.

- His coreos-install worked fine because he provided IP addresses in his URL to
  the disk images.

- His first boot/ignition run failed because it was trying to resolve a hostname
  and the /etc/resolv.conf was empty

- After providing `ip=eno1:dhcp` dhclient was only run on one interface and resolv.conf
  settings were correct.

So "Apparently dhclient overwrites /etc/resolv.conf for each instance of dhclient running" is the key here.
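The last-writer-wins behaviour described in those notes can be modelled with a small Python sketch. This is only a toy model of the observation, not dhclient-script itself; the function name, the interface name `ipmi0`, and the address `192.0.2.53` (from the documentation range) are all made up for the example.

```python
def simulate_overwrites(leases):
    """Return the final resolv.conf text when every lease rewrites the file.

    `leases` is a list of (interface, [nameservers]) tuples in the order
    the DHCP clients finish. Hypothetical model of the behaviour above.
    """
    resolv_conf = ""
    for _iface, nameservers in leases:
        # Each client replaces the whole file, discarding earlier content.
        resolv_conf = "".join(f"nameserver {ns}\n" for ns in nameservers)
    return resolv_conf


# The interface that finishes last decides the outcome.
good_last = simulate_overwrites([("ipmi0", []), ("eno1", ["192.0.2.53"])])
bad_last = simulate_overwrites([("eno1", ["192.0.2.53"]), ("ipmi0", [])])
```

With this model, `bad_last` is an empty string, which matches the empty /etc/resolv.conf seen when the interface without DNS information came up last.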

The following workarounds on the kernel command line should work:

  • ip=eth0:dhcp where eth0 is the name of the interface
  • specifying nameserver=x.x.x.x
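If the kernel command line is generated by a script (for a PXE menu, say), the two workarounds amount to a small string rewrite. The following Python sketch assumes a hypothetical helper `apply_workaround`; the interface name and nameserver address are placeholders to replace with your own values.

```python
def apply_workaround(cmdline, interface=None, nameserver=None):
    """Rewrite a kernel command line with the workarounds listed above.

    - interface: turn a bare `ip=dhcp` into `ip=<interface>:dhcp`
    - nameserver: append `nameserver=<address>` unless one is already set
    Hypothetical helper for illustration only.
    """
    args = cmdline.split()
    if interface is not None:
        args = [f"ip={interface}:dhcp" if a == "ip=dhcp" else a for a in args]
    if nameserver is not None and not any(a.startswith("nameserver=") for a in args):
        args.append(f"nameserver={nameserver}")
    return " ".join(args)


pinned = apply_workaround("ip=dhcp rd.neednet=1 console=tty0", interface="eth0")
```

Here `pinned` becomes `ip=eth0:dhcp rd.neednet=1 console=tty0`; all other arguments are passed through unchanged.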

I think this should be handled appropriately when we move to NetworkManager in the initrd.

@darkmuggle
Contributor

What's the reason for disabling systemd-resolved? One of the strong points of systemd-resolved is scoped resolution.

https://github.com/coreos/fedora-coreos-config/blob/testing-devel/manifests/fedora-coreos-base.yaml#L46-L47

@bgilbert
Contributor

Looks like the original removal was here. @cgwalters, do you remember any additional context?

@cgwalters
Member

Sorry I don't 😦 - are other Fedora editions using resolved? I think Ubuntu switched and mostly worked through the bugs but I'm not sure Fedora did.

I think I'm vaguely recalling bugs around NetworkManager + resolved; IIRC Ubuntu server cases went with networkd or so?

@lump

lump commented Mar 4, 2020

I am running a KVM VM with only one interface, so the IPMI NIC explanation can't apply to me.

It works if I manually run:
nmcli connection up eth0

Ignition is able to resolve names to get the ignition JSON from a server, but after switching root, it's gone. Maybe NetworkManager might need to be restarted from scratch after finishing the ignition bootstrapping? Simply restarting NetworkManager from systemctl doesn't seem to be enough, though.

The following are some things I did to troubleshoot.

$ ssh [email protected]

node-1.stage.te:~$ uptime
 14:32:12 up 0 min,  1 user,  load average: 0.71, 0.17, 0.06

node-1.stage.te:~$ cat /etc/NetworkManager/conf.d/default.conf
[main]
dns=default
rc-manager=file
[logging]
level=TRACE
domains=ALL

node-1.stage.te:~$  journalctl -u NetworkManager | grep dns | grep 'Mar 04 14:32'
Mar 04 14:32:01 node-1.stage.te NetworkManager[1013]: <debug> [1583357521.8327] CONFIG:   dns=default
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <trace> [1583357522.2441] dns-mgr[0x5557669b6240]: creating...
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <info>  [1583357522.2442] dns-mgr[0x5557669b6240]: init: dns=default,systemd-resolved rc-manager=file
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <trace> [1583357522.2612] dns-mgr: current configuration: @aa{sv} []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3025] ++ ipv4.dns                  = []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3025] ++ ipv4.dns-search           = []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3027] ++ ipv6.dns                  = []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3028] ++ ipv6.dns-search           = []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3045] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3048] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3049] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3055] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3058] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3058] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3091] ++ ipv4.dns                  = []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3092] ++ ipv4.dns-priority         = 100
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3092] ++ ipv4.dns-search           = []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3097] ++ ipv6.dns                  = []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3098] ++ ipv6.dns-priority         = 100
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3099] ++ ipv6.dns-search           = []
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <trace> [1583357522.3308] dns-sd-resolved[62a1606309006c2a]: D-Bus name for systemd-resolved has no owner
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3371] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3373] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3374] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3380] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3382] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3383] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <trace> [1583357522.3444] auth: call[12]: CheckAuthorization(org.freedesktop.NetworkManager.settings.modify.global-dns), subject=unix-process[pid=1079, uid=0, start=1907] (succeeding for root)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3583] dns-mgr: (device_state_changed): queueing DNS updates (1)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3584] dns-mgr: (update_routing_and_dns): queueing DNS updates (2)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <trace> [1583357522.3584] policy: set-hostname: updating hostname (routing and dns)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3585] dns-mgr: (update_routing_and_dns): DNS configuration did not change
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3586] dns-mgr: (update_routing_and_dns): no DNS changes to commit (1)
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3586] dns-mgr: (device_state_changed): DNS configuration did not change
Mar 04 14:32:02 node-1.stage.te NetworkManager[1013]: <debug> [1583357522.3586] dns-mgr: (device_state_changed): no DNS changes to commit (0)

node-1.stage.te:~$ cat /etc/resolv.conf
cat: /etc/resolv.conf: No such file or directory

node-1.stage.te:~$ cat /run/NetworkManager/no-stub-resolv.conf
cat: /run/NetworkManager/no-stub-resolv.conf: No such file or directory

node-1.stage.te:~$ find /run/NetworkManager/
/run/NetworkManager/
/run/NetworkManager/system-connections
/run/NetworkManager/system-connections/eth0.nmconnection
/run/NetworkManager/system-connections/Wired connection 1.nmconnection
/run/NetworkManager/devices
/run/NetworkManager/devices/2
/run/NetworkManager/conf.d
/run/NetworkManager/conf.d/10-dracut-dhclient.conf

node-1.stage.te:~$ nmcli device show eth0 
GENERAL.DEVICE:                         eth0
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         52:54:00:B6:4D:60
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     eth0
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/1
WIRED-PROPERTIES.CARRIER:               on
IP4.ADDRESS[1]:                         172.22.22.161/23
IP4.GATEWAY:                            172.22.22.1
IP4.ROUTE[1]:                           dst = 0.0.0.0/0, nh = 172.22.22.1, mt = 0
IP4.ROUTE[2]:                           dst = 172.22.22.0/23, nh = 0.0.0.0, mt = 0
IP6.GATEWAY:                            --

node-1.stage.te:~$ sudo nmcli connection up eth0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/2)

node-1.stage.te:~$ cat /etc/resolv.conf 
# Generated by NetworkManager
search te taskeasy.com stage.te
nameserver 172.22.22.2
nameserver 172.22.22.22
nameserver 172.22.220.20

node-1.stage.te:~$  journalctl -u NetworkManager | grep dns | grep 'Mar 04 14:3[3-9]'
Mar 04 14:37:17 node-1.stage.te NetworkManager[1013]: <trace> [1583357837.6214] auth: call[28]: CheckAuthorization(org.freedesktop.NetworkManager.settings.modify.global-dns), subject=unix-process[pid=1326, uid=1000, start=33476]
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.2522] auth: call[44]: CheckAuthorization(org.freedesktop.NetworkManager.settings.modify.global-dns), subject=unix-process[pid=1342, uid=0, start=36940] (succeeding for root)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2741] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2741] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2741] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2742] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2742] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2742] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2747] dns-mgr: (update_routing_and_dns): queueing DNS updates (1)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.2747] policy: set-hostname: updating hostname (routing and dns)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2747] dns-mgr: (update_routing_and_dns): DNS configuration did not change
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2747] dns-mgr: (update_routing_and_dns): no DNS changes to commit (0)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2755] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2755] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2755] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2756] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2756] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.2756] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3139] dns-mgr: (device_ip_config_changed): queueing DNS updates (1)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3139] dns-mgr: (device_ip_config_changed): DNS configuration did not change
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3139] dns-mgr: (device_ip_config_changed): no DNS changes to commit (0)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3205] dns-mgr: (device_state_changed): queueing DNS updates (1)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3205] dns-mgr: (update_routing_and_dns): queueing DNS updates (2)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3211] policy: set-hostname: updating hostname (routing and dns)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3213] dns-mgr: (update_routing_and_dns): DNS configuration changed
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3214] dns-mgr: (update_routing_and_dns): no DNS changes to commit (1)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3214] dns-mgr: (device_state_changed): DNS configuration changed
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3214] dns-mgr: (device_state_changed): committing DNS changes (0)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <debug> [1583357872.3215] dns-mgr: update-dns: updating resolv.conf
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3215] dns-mgr: config:      100 best    v4 2     : 172.22.22.2 172.22.22.22 172.22.220.20
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3215] dns-mgr: plugin: add domain '~' (i=2, p=100)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3216] dns-mgr: plugin: add domain 'te' (i=2, p=100)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3216] dns-mgr: plugin: add domain 'taskeasy.com' (i=2, p=100)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3217] dns-sd-resolved[62a1606309006c2a]: send-updates: no name owner. Try start service...
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3218] dns-mgr: update-resolv-no-stub: '/run/NetworkManager/no-stub-resolv.conf' successfully written
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3221] dns-mgr: update-resolv-conf: write to /etc/resolv.conf succeeded (rc-manager=file)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3224] dns-mgr: update-resolv-conf: write internal file /run/NetworkManager/resolv.conf succeeded (rc-manager=file)
Mar 04 14:37:52 node-1.stage.te NetworkManager[1013]: <trace> [1583357872.3225] dns-mgr: current configuration: [{'nameservers': <['172.22.22.2', '172.22.22.22', '172.22.220.20']>, 'domains': <['te', 'taskeasy.com']>, 'interface': <'eth0'>, 'priority': <100>, 'vpn': <false>}]

node-1.stage.te:~$ nmcli device show eth0 
GENERAL.DEVICE:                         eth0
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         52:54:00:B6:4D:60
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     eth0
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/2
WIRED-PROPERTIES.CARRIER:               on
IP4.ADDRESS[1]:                         172.22.22.161/23
IP4.GATEWAY:                            172.22.22.1
IP4.ROUTE[1]:                           dst = 0.0.0.0/0, nh = 172.22.22.1, mt = 100
IP4.ROUTE[2]:                           dst = 172.22.22.0/23, nh = 0.0.0.0, mt = 100
IP4.DNS[1]:                             172.22.22.2
IP4.DNS[2]:                             172.22.22.22
IP4.DNS[3]:                             172.22.220.20
IP4.DOMAIN[1]:                          te
IP6.GATEWAY:                            --
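To find the moment resolv.conf was actually written in a long TRACE log like the one above, a small filter helps. This Python sketch matches the `update-resolv-conf: write to ... succeeded` message exactly as it appears in the output above; the function name is hypothetical, and the message text may differ between NetworkManager versions.

```python
import re

# Message format copied from the trace output above.
WRITE_RE = re.compile(r"update-resolv-conf: write to (\S+) succeeded")


def resolv_conf_writes(journal_lines):
    """Return (timestamp, path) for each logged resolv.conf write."""
    hits = []
    for line in journal_lines:
        match = WRITE_RE.search(line)
        if match:
            # The journalctl lines above start with "Mon DD HH:MM:SS".
            timestamp = " ".join(line.split()[:3])
            hits.append((timestamp, match.group(1)))
    return hits
```

Fed the lines of `journalctl -u NetworkManager`, an empty result means NetworkManager never logged a write to resolv.conf, which is what the first half of the session above shows.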

@lump

lump commented Mar 10, 2020

After a while, the machine also loses its DHCP address and consequently its connectivity too.

The ip=dhcp rd.neednet=1 kernel parameters are only really for the initrd/ignition bootstrapping environment, aren't they?

When we get to the final booting stage after ignition bootstrapped everything, NetworkManager is not actually running a DHCP client. I am not quite sure what NetworkManager thinks it is doing in the final environment.

Basically, something like the following needs to be run in the final environment:

nmcli device connect eth0

After that is run, then I can see nm-dhcp-helper running and watching eth0, and it successfully writes resolv.conf.

What is the right way to make this sort of thing happen with Fedora CoreOS? I don't want to have to write my own systemd service that kicks NetworkManager in the pants at boot if it should be pulling itself up by its own bootstraps. :)

@dustymabe dustymabe self-assigned this Mar 10, 2020
@dustymabe
Member

@lump, you're right that this is a problem and we're working to fix it. An ugly answer for now is to reboot the node after the first boot.

@lump

lump commented Mar 10, 2020

@dustymabe this is Live PXE. Rebooting won't help because it is always first-boot.

@dustymabe
Member

@lump good point, sorry for the noise.

@lump

lump commented Mar 10, 2020

@dustymabe, I totally appreciate that you're watching this and I graciously thank you for your help! :)

@lump

lump commented Mar 10, 2020

Since Live PXE FCOS is unusable until this is fixed, here is my super hacky workaround that I am using for now that brute-forces a solution for the problem.

{
 "ignition": { "version": "3.0.0" },
 "systemd": { "units": [{
    "contents": "[Unit]\nDescription=nmcli device connect %I\nBindsTo=sys-subsystem-net-devices-%i.device\nBefore=NetworkManager-wait-online.service\nAfter=sys-subsystem-net-devices-%i.device\nAfter=NetworkManager.service\n\n[Service]\nType=oneshot\nSlice=system.slice\nExecStart=/usr/bin/nmcli device connect %I\n\n[Install]\nWantedBy=network.target\n",
    "enabled": false,
    "name": "[email protected]"
 }]},
 "storage": { "links": [{
    "path": "/etc/systemd/system/network-online.target.wants/[email protected]",
    "target": "/etc/systemd/system/[email protected]",
    "overwrite": true
 }]}
}

Edit: The unit also needs After=NetworkManager.service
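Because the unit text is hard to read as a single escaped JSON string, one option is to generate the config from a readable template. The Python sketch below reproduces the same unit contents. Note that the unit file names did not survive in the config above, so `example-nm-connect@.service` is a hypothetical stand-in chosen for this example, not the original name, and `build_config` is likewise an illustrative helper.

```python
import json

# Hypothetical template unit name; the original name is not recoverable.
UNIT_TEMPLATE = "example-nm-connect@.service"

# Same unit text as in the config above, written out line by line.
UNIT_CONTENTS = """\
[Unit]
Description=nmcli device connect %I
BindsTo=sys-subsystem-net-devices-%i.device
Before=NetworkManager-wait-online.service
After=sys-subsystem-net-devices-%i.device
After=NetworkManager.service

[Service]
Type=oneshot
Slice=system.slice
ExecStart=/usr/bin/nmcli device connect %I

[Install]
WantedBy=network.target
"""


def build_config(interface):
    """Build the Ignition config as a dict for one network interface."""
    # Instantiate the template unit, e.g. example-nm-connect@eth0.service.
    instance = UNIT_TEMPLATE.replace("@", f"@{interface}")
    return {
        "ignition": {"version": "3.0.0"},
        "systemd": {"units": [{
            "contents": UNIT_CONTENTS,
            "enabled": False,
            "name": UNIT_TEMPLATE,
        }]},
        "storage": {"links": [{
            "path": f"/etc/systemd/system/network-online.target.wants/{instance}",
            "target": f"/etc/systemd/system/{UNIT_TEMPLATE}",
            "overwrite": True,
        }]},
    }


config_json = json.dumps(build_config("eth0"), indent=1)
```

`config_json` can then be served as the Ignition file; `json.dumps` takes care of escaping the newlines in the unit contents.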

@dustymabe
Member

We are now using NetworkManager in the initramfs and also propagating network information from the initramfs (kargs) when appropriate, which we think fixes this issue.

See #394 (comment) and the preceding discussion for more details.

@lump

lump commented Mar 27, 2020

As far as I can tell, this has indeed been fixed by #394.
