
system-dockerd segfault at 8 ip #2484

Open
niusmallnan opened this issue Sep 19, 2018 · 10 comments

@niusmallnan
Contributor

RancherOS Version: (ros os version)
v1.4.0/v1.4.1
Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
All

Check the output of dmesg:

[   40.136459] system-dockerd[1007]: segfault at 8 ip 0000000000542436 sp 000000c420d8f308 error 4 in system-dockerd[400000+1485000]

For now it seems to have no visible effect, but I hope to find the root cause.

@niusmallnan niusmallnan added this to the v1.5.0 milestone Sep 19, 2018
@niusmallnan niusmallnan self-assigned this Sep 19, 2018
@niusmallnan
Contributor Author

I tried this script to check the kernel config: https://raw.githubusercontent.com/docker/docker/master/contrib/check-config.sh

Nothing looks abnormal.
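
Roughly how to run it, for anyone repeating the check (a sketch; the script looks for the kernel config in /proc/config.gz or a path passed as an argument):

$ wget https://raw.githubusercontent.com/docker/docker/master/contrib/check-config.sh
$ chmod +x check-config.sh
$ sudo ./check-config.sh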

@niusmallnan
Contributor Author

niusmallnan commented Dec 7, 2018

I can reproduce this error when I run a container with a bridge network via system-docker.

$ system-docker run -it --rm alpine

# then check dmesg
$ dmesg

Every time I start a container on system-docker, I see this log in dmesg.
It is caused by the preload-user-images container; there is no problem if I change its network to host.
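
For comparison, starting the same container with host networking does not produce a new segfault line in dmesg (a quick sketch of the check):

$ system-docker run -it --rm --net=host alpine
$ dmesg | grep segfault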

There are two ways to avoid this problem:

  1. ros c set rancher.system_docker.bridge none
  2. ros c set rancher.system_docker.exec false
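
For reference, the same keys expressed in cloud-config (a sketch; the keys simply mirror the ros config commands above, and either one on its own avoids the segfault here):

rancher:
  system_docker:
    bridge: none
    exec: false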

@geauxvirtual

Trying to track down an issue I'm seeing with RancherOS on a physical server. I can't say whether I saw this same error on other physical servers, but those servers would actually download the cloud-config I passed in as part of my kernel parameters.

For a reason I haven't found yet, I have a server that boots RancherOS but does not download the cloud-config passed in as a kernel parameter. The only error I see on the running server is the system-dockerd: segfault at 8 ip... message. From the server itself, I can reach the host serving the cloud-config, so I'm not sure why the boot process seems to stop before loading the cloud-config.

This is occurring with a RancherOS 1.5 ISO.

@geauxvirtual

After making a few port configuration changes on the switch so the server connects faster, RancherOS now grabs the config file as it did on other hardware. It does seem that if it takes too long for RancherOS to get an IP, the cloud-config download fails and is not retried during the boot process.

@niusmallnan niusmallnan modified the milestones: v1.5.1, v1.6.0 Feb 5, 2019
@niusmallnan
Contributor Author

@geauxvirtual Please file a separate issue with more details; I think what you mentioned is unrelated to this issue.

@chappelleshow

@niusmallnan We are having this issue as well on ROS 1.4.2; system-docker seems to segfault randomly.

Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.352365] docker-sys: port 1(vethcbd2044) entered blocking state
Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.352370] docker-sys: port 1(vethcbd2044) entered disabled state
Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.352452] device vethcbd2044 entered promiscuous mode
Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.352616] IPv6: ADDRCONF(NETDEV_UP): vethcbd2044: link is not ready
Mar 5 18:12:38 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5255356.590013] system-dockerd[873]: segfault at 8 ip 0000000000542436 sp 000000c42115f308 error 4 in system-dockerd[400000+1485000]

Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.374372] docker-sys: port 1(vethc2c8d62) entered blocking state
Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.374375] docker-sys: port 1(vethc2c8d62) entered disabled state
Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.374441] device vethc2c8d62 entered promiscuous mode
Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.376833] IPv6: ADDRCONF(NETDEV_UP): vethc2c8d62: link is not ready
Mar 5 18:05:09 07140b4c0b2340be9ae2f07bac8fb551 kernel: [5254907.708806] system-dockerd[1823]: segfault at 8 ip 0000000000542436 sp 000000c42108b308 error 4 in system-dockerd[400000+1485000]

We also have a ticket open at https://support.rancher.com/hc/en-us/requests/3546. It seems to be the same memory location every time; do we know what is at address 0000000000542436?
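
If the system-dockerd binary still has its symbol table, the faulting instruction pointer can be mapped to a function name; the mapping in the log starts at 400000, which matches the usual non-PIE link address, so the raw ip should be usable directly. A sketch (the ./system-dockerd path is a placeholder for wherever the binary is extracted from the OS image):

$ addr2line -f -e ./system-dockerd 0x542436
# or, with a Go toolchain available:
$ echo 0x542436 | go tool addr2line ./system-dockerd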

@laoshancun

Same issue on RancherOS v1.5.0, and the system is stuck.
(screenshot attached)

@k00p

k00p commented Apr 1, 2020

Still seeing this error in dmesg

$ dmesg | grep segfault
[   30.587148] system-dockerd[1232]: segfault at 8 ip 0000000000541d26 sp 000000c421421308 error 4 in system-dockerd[400000+1486000]
[   31.995174] system-dockerd[1633]: segfault at 8 ip 0000000000541d26 sp 000000c42141f308 error 4 in system-dockerd[400000+1486000]
$ sudo ros -v
version v1.5.5 from os image rancher/os:v1.5.5

Running on bare metal.

I am using bonded NICs attached to VLANs; not sure if that matters, but it provides context for the network config and dmesg output below:

rancher:
  network:
    interfaces:
      bond1:
        bond_opts:
          downdelay: "200"
          lacp_rate: "1"
          miimon: "100"
          mode: "4"
          updelay: "200"
          vlans: 100:vlan100,300:vlan300
          xmit_hash_policy: layer3+4
        vlans: 100:vlan100,300:vlan300
      vlan300:
        dhcp: true
        vlans: 300:vlan300
      eth*:
        bond: bond1
      vlan100:
        dhcp: true
        match: vlan100
        vlans: 100:vlan100

Here's some of the surrounding output from dmesg:

[   22.855348] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
[   22.855918] bond1: first active interface up!
[   22.856237] IPv6: ADDRCONF(NETDEV_CHANGE): bond1: link becomes ready
[   22.860680] IPv6: ADDRCONF(NETDEV_CHANGE): vlan100: link becomes ready
[   22.861088] IPv6: ADDRCONF(NETDEV_CHANGE): vlan300: link becomes ready
[   23.316600] ixgbe 0000:19:00.1 eth1: NIC Link is Up 10 Gbps, Flow Control: None
[   23.374456] bond1: link status up for interface eth1, enabling it in 200 ms
[   23.582474] bond1: link status definitely up for interface eth1, 10000 Mbps full duplex
[   23.781409] ixgbe 0000:1a:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: None
[   23.790452] bond1: link status up for interface eth2, enabling it in 200 ms
[   23.998471] bond1: link status definitely up for interface eth2, 10000 Mbps full duplex
[   24.134272] ixgbe 0000:1a:00.1 eth3: NIC Link is Up 10 Gbps, Flow Control: None
[   24.206457] bond1: link status up for interface eth3, enabling it in 200 ms
[   24.414475] bond1: link status definitely up for interface eth3, 10000 Mbps full duplex
[   30.288403] docker-sys: port 1(veth1a368c1) entered blocking state
[   30.288723] docker-sys: port 1(veth1a368c1) entered disabled state
[   30.289144] device veth1a368c1 entered promiscuous mode
[   30.289640] IPv6: ADDRCONF(NETDEV_UP): veth1a368c1: link is not ready
[   30.587148] system-dockerd[1232]: segfault at 8 ip 0000000000541d26 sp 000000c421421308 error 4 in system-dockerd[400000+1486000]
[   30.587822] eth0: renamed from veth8da9205
[   30.611068] IPv6: ADDRCONF(NETDEV_CHANGE): veth1a368c1: link becomes ready
[   30.611505] docker-sys: port 1(veth1a368c1) entered blocking state
[   30.611852] docker-sys: port 1(veth1a368c1) entered forwarding state
[   30.612357] IPv6: ADDRCONF(NETDEV_CHANGE): docker-sys: link becomes ready
[   30.828441] docker-sys: port 1(veth1a368c1) entered disabled state
[   30.828972] veth8da9205: renamed from eth0
[   30.892954] docker-sys: port 1(veth1a368c1) entered disabled state
[   30.897782] device veth1a368c1 left promiscuous mode
[   30.897785] docker-sys: port 1(veth1a368c1) entered disabled state
[   31.518233] Loading iSCSI transport class v2.0-870.
[   31.529628] iscsi: registered transport (tcp)
[   31.687542] docker-sys: port 1(veth14ddee3) entered blocking state
[   31.687544] docker-sys: port 1(veth14ddee3) entered disabled state
[   31.687651] device veth14ddee3 entered promiscuous mode
[   31.687836] IPv6: ADDRCONF(NETDEV_UP): veth14ddee3: link is not ready
[   31.687838] docker-sys: port 1(veth14ddee3) entered blocking state
[   31.687840] docker-sys: port 1(veth14ddee3) entered forwarding state
[   31.688147] docker-sys: port 1(veth14ddee3) entered disabled state
[   31.995174] system-dockerd[1633]: segfault at 8 ip 0000000000541d26 sp 000000c42141f308 error 4 in system-dockerd[400000+1486000]
[   31.995224] eth0: renamed from vethf95b0e8
[   32.019024] IPv6: ADDRCONF(NETDEV_CHANGE): veth14ddee3: link becomes ready
[   32.019117] docker-sys: port 1(veth14ddee3) entered blocking state
[   32.019120] docker-sys: port 1(veth14ddee3) entered forwarding state
[   32.261388] docker-sys: port 1(veth14ddee3) entered disabled state
[   32.261621] vethf95b0e8: renamed from eth0

I have not tried the suggestion above [https://github.com//issues/2484#issuecomment-445088750] - will RancherOS function normally with rancher.system_docker.bridge set to none or rancher.system_docker.exec set to false?

This reproduces on 5 hardware systems currently with the same configuration. @niusmallnan Let me know if I can help with anything on this.

@rouing

rouing commented Apr 7, 2020

(screenshot attached)

What we've got here:

  1. dhcpcd in System-Docker's network container is throwing &{/usr/sbin/dhcpcd [dhcpcd -x] [] <nil> 0xc42007c008 0xc42007c010 [] <nil> 0xc420170690 exit status 1 <nil> <nil> true [0xc42000e048 0xc42007c008 0xc42007c010] [0xc42000e048] [] [] 0xc42038c300 <nil>} or similar
  2. dmesg showing segfaults at 000000c421027308 or similar
  3. eth0, which is unused, having a stroke about 3 times
  4. the segfault being thrown 3 times
  5. "networking not avail to load resource" in the console container, 3 times (speculation at this point)

I'm starting to think we have a networking problem due to architecture and design flaws.

I have observed this across 3 systems on 1.5.5 with similar hardware.

@rouing

rouing commented Apr 7, 2020

After further digging, I keep coming back to Alpine and musl. It seems to be related either to how musl handles DNS resolution or to some form of hardening.
nodejs/docker-node#588
gliderlabs/docker-alpine#255
There are many other issues pointing at Docker for this kind of problem.
Big yikes.
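
A quick way to check whether musl's resolver behaves differently here would be to compare lookups from a musl-based and a glibc-based image (a sketch; assumes user Docker is running and outbound DNS works):

$ docker run --rm alpine nslookup github.com
$ docker run --rm debian:stable-slim getent hosts github.com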
