
system-dockerd segfault at 8 ip #2484

Open
niusmallnan opened this issue Sep 19, 2018 · 10 comments

@niusmallnan
Contributor

RancherOS Version: (ros os version)
v1.4.0/v1.4.1
Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
All

Check the output of dmesg:

[   40.136459] system-dockerd[1007]: segfault at 8 ip 0000000000542436 sp 000000c420d8f308 error 4 in system-dockerd[400000+1485000]

For now it seems to have no visible effect, but I hope to find the root cause.

@niusmallnan niusmallnan added this to the v1.5.0 milestone Sep 19, 2018
@niusmallnan niusmallnan self-assigned this Sep 19, 2018
@niusmallnan
Contributor Author

I tried this script to check the kernel config: https://raw.githubusercontent.com/docker/docker/master/contrib/check-config.sh

Nothing looks abnormal.
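
Roughly how to run it, for anyone repeating the check (a sketch; the script looks for the kernel config in /proc/config.gz or a path passed as an argument):

$ wget https://raw.githubusercontent.com/docker/docker/master/contrib/check-config.sh
$ chmod +x check-config.sh
$ sudo ./check-config.sh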

@niusmallnan
Contributor Author

niusmallnan commented Dec 7, 2018

I can reproduce this error when I run a container with a bridge network via system-docker.

$ system-docker run -it --rm alpine

# then check dmesg
$ dmesg

Every time I start a container on system-docker, I see this log in dmesg.
It is caused by the preload-user-images container; there is no problem if I change its network to host.
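
For comparison, starting the same container with host networking does not produce a new segfault line in dmesg (a quick sketch of the check):

$ system-docker run -it --rm --net=host alpine
$ dmesg | grep segfault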

There are two ways to avoid this problem:

  1. ros c set rancher.system_docker.bridge none
  2. ros c set rancher.system_docker.exec false
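
For reference, the same keys expressed in cloud-config (a sketch; the keys simply mirror the ros config commands above, and either one on its own avoids the segfault here):

rancher:
  system_docker:
    bridge: none
    exec: false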

@geauxvirtual

Trying to track down an issue I'm seeing with RancherOS on a physical server. I can't say whether I saw this same error on other physical servers, but those servers would actually download the cloud-config I passed in as part of my kernel parameters.

For a reason I haven't found yet, I have a server that boots RancherOS but does not download the cloud-config passed in as a kernel parameter. The only error I see on the running server is the system-dockerd: segfault at 8 ip... message. From the server itself, I can reach the host serving the cloud-config, so I'm not sure why the boot process seems to stop before loading the cloud-config.

This is occurring with a RancherOS 1.5 ISO.

@geauxvirtual

After making a few port configuration changes on the switch so the server connects faster, RancherOS now grabs the config file as it did on other hardware. It does seem that if it takes too long for RancherOS to get an IP, the cloud-config download fails and is not retried during the boot process.

@niusmallnan niusmallnan modified the milestones: v1.5.1, v1.6.0 Feb 5, 2019
@niusmallnan
Contributor Author

@geauxvirtual Please file a separate issue with more details; I think what you mentioned is unrelated to this issue.

@chappelleshow

@niusmallnan We are having this issue as well on ROS 1.4.2; system-docker seems to segfault randomly.

Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.352365] docker-sys: port 1(vethcbd2044) entered blocking state
Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.352370] docker-sys: port 1(vethcbd2044) entered disabled state
Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.352452] device vethcbd2044 entered promiscuous mode
Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.352616] IPv6: ADDRCONF(NETDEV_UP): vethcbd2044: link is not ready
Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.590013] system-dockerd[873]: segfault at 8 ip 0000000000542436 sp 000000c42115f308 error 4 in system-dockerd[400000+1485000]

Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.374372] docker-sys: port 1(vethc2c8d62) entered blocking state
Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.374375] docker-sys: port 1(vethc2c8d62) entered disabled state
Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.374441] device vethc2c8d62 entered promiscuous mode
Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.376833] IPv6: ADDRCONF(NETDEV_UP): vethc2c8d62: link is not ready
Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.708806] system-dockerd[1823]: segfault at 8 ip 0000000000542436 sp 000000c42108b308 error 4 in system-dockerd[400000+1485000]

We also have a ticket open at https://support.rancher.com/hc/en-us/requests/3546. It seems to be the same memory location every time; do we know what is at address 0000000000542436?
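
If the system-dockerd binary still has its symbol table, the faulting instruction pointer can be mapped to a function name; the mapping in the log starts at 400000, which matches the usual non-PIE link address, so the raw ip should be usable directly. A sketch (the ./system-dockerd path is a placeholder for wherever the binary is extracted from the OS image):

$ addr2line -f -e ./system-dockerd 0x542436
# or, with a Go toolchain available:
$ echo 0x542436 | go tool addr2line ./system-dockerd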

@laoshancun

Same issue on RancherOS v1.5.0, and the system is stuck.
(screenshot attached)

@k00p

k00p commented Apr 1, 2020

Still seeing this error in dmesg

$ dmesg | grep segfault
[   30.587148] system-dockerd[1232]: segfault at 8 ip 0000000000541d26 sp 000000c421421308 error 4 in system-dockerd[400000+1486000]
[   31.995174] system-dockerd[1633]: segfault at 8 ip 0000000000541d26 sp 000000c42141f308 error 4 in system-dockerd[400000+1486000]
$ sudo ros -v
version v1.5.5 from os image rancher/os:v1.5.5

Running on bare metal.

I am using bonded NICs attached to VLANs; not sure if that matters, but it provides context for the network config and dmesg output below:

rancher:
  network:
    interfaces:
      bond1:
        bond_opts:
          downdelay: "200"
          lacp_rate: "1"
          miimon: "100"
          mode: "4"
          updelay: "200"
          vlans: 100:vlan100,300:vlan300
          xmit_hash_policy: layer3+4
        vlans: 100:vlan100,300:vlan300
      vlan300:
        dhcp: true
        vlans: 300:vlan300
      eth*:
        bond: bond1
      vlan100:
        dhcp: true
        match: vlan100
        vlans: 100:vlan100

Here's some of the surrounding output from dmesg:

[   22.855348] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
[   22.855918] bond1: first active interface up!
[   22.856237] IPv6: ADDRCONF(NETDEV_CHANGE): bond1: link becomes ready
[   22.860680] IPv6: ADDRCONF(NETDEV_CHANGE): vlan100: link becomes ready
[   22.861088] IPv6: ADDRCONF(NETDEV_CHANGE): vlan300: link becomes ready
[   23.316600] ixgbe 0000:19:00.1 eth1: NIC Link is Up 10 Gbps, Flow Control: None
[   23.374456] bond1: link status up for interface eth1, enabling it in 200 ms
[   23.582474] bond1: link status definitely up for interface eth1, 10000 Mbps full duplex
[   23.781409] ixgbe 0000:1a:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: None
[   23.790452] bond1: link status up for interface eth2, enabling it in 200 ms
[   23.998471] bond1: link status definitely up for interface eth2, 10000 Mbps full duplex
[   24.134272] ixgbe 0000:1a:00.1 eth3: NIC Link is Up 10 Gbps, Flow Control: None
[   24.206457] bond1: link status up for interface eth3, enabling it in 200 ms
[   24.414475] bond1: link status definitely up for interface eth3, 10000 Mbps full duplex
[   30.288403] docker-sys: port 1(veth1a368c1) entered blocking state
[   30.288723] docker-sys: port 1(veth1a368c1) entered disabled state
[   30.289144] device veth1a368c1 entered promiscuous mode
[   30.289640] IPv6: ADDRCONF(NETDEV_UP): veth1a368c1: link is not ready
[   30.587148] system-dockerd[1232]: segfault at 8 ip 0000000000541d26 sp 000000c421421308 error 4 in system-dockerd[400000+1486000]
[   30.587822] eth0: renamed from veth8da9205
[   30.611068] IPv6: ADDRCONF(NETDEV_CHANGE): veth1a368c1: link becomes ready
[   30.611505] docker-sys: port 1(veth1a368c1) entered blocking state
[   30.611852] docker-sys: port 1(veth1a368c1) entered forwarding state
[   30.612357] IPv6: ADDRCONF(NETDEV_CHANGE): docker-sys: link becomes ready
[   30.828441] docker-sys: port 1(veth1a368c1) entered disabled state
[   30.828972] veth8da9205: renamed from eth0
[   30.892954] docker-sys: port 1(veth1a368c1) entered disabled state
[   30.897782] device veth1a368c1 left promiscuous mode
[   30.897785] docker-sys: port 1(veth1a368c1) entered disabled state
[   31.518233] Loading iSCSI transport class v2.0-870.
[   31.529628] iscsi: registered transport (tcp)
[   31.687542] docker-sys: port 1(veth14ddee3) entered blocking state
[   31.687544] docker-sys: port 1(veth14ddee3) entered disabled state
[   31.687651] device veth14ddee3 entered promiscuous mode
[   31.687836] IPv6: ADDRCONF(NETDEV_UP): veth14ddee3: link is not ready
[   31.687838] docker-sys: port 1(veth14ddee3) entered blocking state
[   31.687840] docker-sys: port 1(veth14ddee3) entered forwarding state
[   31.688147] docker-sys: port 1(veth14ddee3) entered disabled state
[   31.995174] system-dockerd[1633]: segfault at 8 ip 0000000000541d26 sp 000000c42141f308 error 4 in system-dockerd[400000+1486000]
[   31.995224] eth0: renamed from vethf95b0e8
[   32.019024] IPv6: ADDRCONF(NETDEV_CHANGE): veth14ddee3: link becomes ready
[   32.019117] docker-sys: port 1(veth14ddee3) entered blocking state
[   32.019120] docker-sys: port 1(veth14ddee3) entered forwarding state
[   32.261388] docker-sys: port 1(veth14ddee3) entered disabled state
[   32.261621] vethf95b0e8: renamed from eth0

I have not tried the suggestion above [https://github.com//issues/2484#issuecomment-445088750] - will RancherOS function normally with rancher.system_docker.bridge set to none or rancher.system_docker.exec set to false?

This reproduces on 5 hardware systems currently with the same configuration. @niusmallnan Let me know if I can help with anything on this.

@rouing

rouing commented Apr 7, 2020

(screenshot attached)

What we've got here:

  1. dhcpcd in System-Docker's network container is throwing &{/usr/sbin/dhcpcd [dhcpcd -x] [] <nil> 0xc42007c008 0xc42007c010 [] <nil> 0xc420170690 exit status 1 <nil> <nil> true [0xc42000e048 0xc42007c008 0xc42007c010] [0xc42000e048] [] [] 0xc42038c300 <nil>} or similar
  2. dmesg showing segfaults at 000000c421027308 or similar
  3. eth0, which is unused, having a stroke about 3 times
  4. the segfault being thrown 3 times
  5. "networking not avail to load resource" in the console container, 3 times (speculation at this point)

I'm starting to think we have a networking problem due to architecture and design flaws.

I have observed this across 3 systems on 1.5.5 with similar hardware.

@rouing

rouing commented Apr 7, 2020

After further digging, I keep coming back to Alpine and musl. It seems to be related either to how musl handles DNS resolution or to some form of hardening.
nodejs/docker-node#588
gliderlabs/docker-alpine#255
There are many other issues pointing at Docker for this kind of problem.
Big yikes.
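
A quick way to check whether musl's resolver behaves differently here would be to compare lookups from a musl-based and a glibc-based image (a sketch; assumes user Docker is running and outbound DNS works):

$ docker run --rm alpine nslookup github.com
$ docker run --rm debian:stable-slim getent hosts github.com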
