No network at docker container level (whereas everything is OK inside flatcar server). #1552
Comments
Check the output of these on the host:
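The exact commands did not survive in this extract; a sketch of the kind of host-side checks meant here (IP forwarding and the iptables rules Docker installs) could be:

```sh
# Is IPv4 forwarding enabled? Should print: net.ipv4.ip_forward = 1
sysctl net.ipv4.ip_forward

# The NAT and FORWARD rules that Docker manages
sudo iptables -t nat -L -n -v
sudo iptables -L FORWARD -n -v
```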
Yeah, I found some info related to "ip_forward" yesterday and checked it at that time: it seems OK (= 1). I'm not able to interpret the firewall rules... (but how could they be wrong, as they've not been altered in any way = I've not configured or tried anything about this). Here are the results:
Hello, I would suggest testing a minimal scenario and seeing if it works with the new Flatcar image you have. Can you try to start a new nginx container with an exposed port (12301, chosen randomly), and then try to curl it:
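For reference, a minimal sketch of such a test (port 12301 is the one from the comment; the standard nginx image is an assumption here):

```sh
# Start a throwaway nginx container publishing host port 12301
docker run -d --name nginx-test -p 12301:80 nginx

# From the host, try to reach it
curl -v http://127.0.0.1:12301
```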
Thanks.
I have just tried, and the behavior is still the same:
Do you happen to know from which Flatcar version you upgraded, to help reproduce the issue?
I can post my Ignition details, but I don't think it's really related (however... maybe it was OK with that old revision and not anymore with the new versions? For example, I'm exposing the 2375 Docker port outside of the host, as this is local network / homelab usage). Ignition (a bit simplified here, non-related topics like SSH keys removed):
```yaml
passwd:
  users:
    - name: root
      password_hash: "REMOVED"
      ssh_authorized_keys: "REMOVED"
storage:
  links:
    # Set proper timezone
    - path: /etc/localtime
      filesystem: root
      overwrite: true
      target: /usr/share/zoneinfo/Europe/Paris
  files:
    # we force same SSH config all the time
    - path: /etc/hostname
      filesystem: root
      mode: 0644
      contents:
        inline: FLATCAR-SERVER-1
    - path: /etc/ssh/ssh_host_dsa_key
    # (extra sections REMOVED)
systemd:
  units:
    # Mount NVME to /var/lib/docker
    # Has to be partitioned & formatted manually beforehand!
    - name: var-lib-docker.mount
      enabled: true
      contents: |
        [Unit]
        Description=Mount NVME to /var/lib/docker
        Before=local-fs.target
        [Mount]
        What=/dev/nvme0n1p1
        Where=/var/lib/docker
        Type=ext4
        [Install]
        WantedBy=local-fs.target
    # We want docker (the service) to be started automatically instead of starting docker
    # on demand through its socket
    # - name: docker.socket
    #   enabled: false
    - name: docker.service
      enabled: true
      dropins:
        - name: 10-wait-docker.conf
          contents: |
            [Unit]
            After=var-lib-docker.mount
            Requires=var-lib-docker.mount
    - name: containerd.service
      enabled: true
    # Expose docker socket over tcp = over the network
    - name: docker-tcp.socket
      enabled: true
      contents: |
        [Unit]
        Description=Docker Socket for the API
        [Socket]
        ListenStream=2375
        BindIPv6Only=both
        Service=docker.service
        [Install]
        WantedBy=sockets.target
```

The "manifest" exposing the image (I just added at the end the disabling of IPv6):
Sadly, no... I was also wondering the same. I would have been willing to force the usage of that old revision (I think it's possible), but I don't know which one it was.
Maybe you can point to the approximate time of the first install; I see you were using the stable channel.
From the state of the iptables, veth* and port mappings, all looks correct; maybe this is an underlying issue with containerd (I can only guess). From inside the docker container, can you ping the Docker gateway? That is, if you have a network interface attached. Can you share, from inside the container, the output of the relevant commands? Also, if you can share the full journalctl output, there might be some error / failure log that could pinpoint the issue.
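A sketch of the kind of in-container checks meant here (assuming the container image ships iproute2 or busybox; the gateway address below is the usual default-bridge one and may differ):

```sh
# Inside the container: interfaces and routes
ip addr
ip route            # the "default via ..." entry is the Docker gateway

# Try to reach the gateway (172.17.0.1 is the usual default-bridge gateway)
ping -c 3 172.17.0.1
```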
From inside a container:
Pinging the Docker gateway also does not go through:
journalctl of the host (= Flatcar): https://gist.github.com/SR-G/41fb3d48d728b321d9c5b42967d87e4e
The USR-B partition should have more information on when the image was created, I think. Can you mount the 4th partition (USR-B) and look at the oldest file timestamps? https://www.flatcar.org/docs/latest/reference/developer-guides/sdk-disk-partitions/
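A sketch of how that check could look (the /dev/sda4 device name is an assumption; adjust to the actual disk):

```sh
# Mount the USR-B partition (4th partition) read-only and inspect timestamps
sudo mkdir -p /tmp/usr-b
sudo mount -o ro /dev/sda4 /tmp/usr-b
ls -lat /tmp/usr-b | tail
sudo umount /tmp/usr-b
```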
Mhh, I'm confused, how should I do that? In my situation I'm running full iPXE: I have nothing installed locally / no local Flatcar partitions. I just have one NVME disk, which is mounted on /var/lib/docker (so not really "Flatcar related" = it just persists volumes and container images), and nothing more; the Flatcar system is only in memory / served from iPXE without local installation.
Gotcha, in that case there is no leftover information. Meanwhile, I have tried to run Flatcar stable 3510.2.2 (from around May 2023), started the nginx docker container and verified it works, then upgraded to the latest stable 3975.2.1 and verified it still works as expected.
And by the way, that's why I'm even more surprised it's not working: after each reboot, as everything is taken "on the fly", I would have expected to be nearly in a "fresh start" mode... That's also why I tried to blank the /var/lib/docker partition (to also have Docker start from scratch, in case the Docker networks had been corrupted for any reason).
It seems I cannot reproduce the issue. I tried to remove /var/lib/docker/network/files/local-kv.db and then reboot, restart docker, restart the container, and the issue was not present. What you can try is to make sure that the environment is cleaned up at the Linux level; I can suggest the following:
Something similar to (these commands are examples, please be cautious and run them at your own risk):
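The exact commands did not survive in this extract; a sketch of the kind of cleanup meant (stop Docker, remove the stale bridge and its firewall rules, reset Docker's persisted network state), to be run at your own risk:

```sh
sudo systemctl stop docker docker.socket containerd

# Remove the Docker bridge and flush the rules Docker added, so they get recreated
sudo ip link delete docker0
sudo iptables -t nat -F
sudo iptables -F

# Reset Docker's persisted network state
sudo rm -f /var/lib/docker/network/files/local-kv.db

sudo systemctl start docker
```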
I applied all these commands and... still (nearly, see below) the same behavior. Also, as I'm starting to think that the problem may be in my Ignition definition (through Butane), I removed nearly everything, especially the exposure of the Docker TCP 2375 port over the network (just in case...), and re-disabled docker.socket.
What is now definitely different is that before, I was getting a "connection reset by peer" (as visible above) after a second or two:
Whereas now it takes much longer before failing:
(so now a 2-minute timeout...) So it's obvious that this is only happening in a very specific situation (mine), but I really struggle to understand where it could come from...
So I had many issues testing with older Flatcar versions (detailed at the end of this post, in case it helps, though I don't think it's really related). In the end, I've been able to revert successfully to an older one:
And guess what: with that older version, everything works perfectly out of the box (with the snippet you provided before, plus the "cleaning" of everything to start from scratch just before rebooting on that version):
And it's not working on:
I tried a few intermediate versions, but they are not booting, I don't know why yet (no HDMI monitor is plugged into the NUC), like:
So: I have no idea what, but "something" has changed in Flatcar between those two versions. Regarding the "downgrade" issues (even after having "blanked" everything):
Hello, I have tried to reproduce the issue using 3374.2.4 as the base and then 3975, but all worked fine. It is also possible that you do not need to disable docker.socket if you enable the docker service, as the docker service will start docker.socket anyway. It might also be the case that something happens at the hardware layer, which is at times impossible to troubleshoot. I would suggest booting a live Ubuntu on the NUC and doing a … The proxy certificate issue might be due to an outdated ca-certificates package on that proxy host.
I think I can get a NUC to test things out, as I need to also check #1306 and can do it at the same time. Can you share the NUC type, if possible? Oh, from the … Thanks.
So, a few updates. I tested on a second NUC (not exactly the same model, but still rather close; I'll put the models in a later post), and I have the EXACT same behavior / same problem. About the "won't boot" issue: I don't have it anymore.
About the exact versions not working: per the previous point, as it now seems I'm able to boot any older version, I've been able to confirm (on the two NUCs) that the problem definitely starts with:
This is:
Interestingly enough, the Docker daemon has also been updated between these two Flatcar versions (from …). Next steps:
About the NUCs I'm using: they are indeed not Intel "NUC" models.
Can you compare the output of networkctl in the working and non-working situations? Based on what one of the comments showed, I'm going to say the problem is that the networkd config we use for PXE is trying to manage the veth devices on the host, and that's what is causing your problem. If you want to confirm quickly, then create the following file through Ignition:
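The exact file from this comment did not survive in this extract; a sketch of what such a snippet could look like, assuming the intent is to tell systemd-networkd to leave veth devices unmanaged (the file name is an assumption):

```yaml
storage:
  files:
    - path: /etc/systemd/network/50-veth-unmanaged.network
      filesystem: root
      mode: 0644
      contents:
        inline: |
          # Assumption: keep networkd away from Docker's veth interfaces
          [Match]
          Driver=veth
          [Link]
          Unmanaged=yes
```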
If this fixes it, then I'm right and we need to include the fix.
If that is the case, it looks similar to systemd/systemd#28626 and #1515. I am linking these issues so that we have more insight in the future when testing systemd upgrades, maybe by adding some Mantle tests for these scenarios.
@jepio From the logs @SR-G shared: https://gist.github.com/SR-G/41fb3d48d728b321d9c5b42967d87e4e#file-gistfile1-txt-L2786 ->
So, today's updates. About networkctl in the working / non-working situations:
About the suggested solution: I have applied that Butane configuration as suggested. And after rebooting (I'm of course on the latest / current version, and just before, it was not working):
And then the good news: from there, everything works perfectly!
So I think we can close this, and that you'll (at some point) include the provided patch by default. Thanks for the valuable help.
Description
I have a NUC running Flatcar, installed one year ago and never rebooted,
with only a Docker installation inside it.
After today's reboot, I'm now on the latest Flatcar image, but I can no longer access the ports exposed by my Docker containers.
Impact
Running docker containers are unusable.
Environment and steps to reproduce
Not sure this will allow you to "reproduce" it, but here are the symptoms.
FLATCAR :
Linux FLATCAR-SERVER-1 6.6.48-flatcar #1 SMP PREEMPT_DYNAMIC Wed Sep 4 15:49:08 -00 2024 x86_64 AMD Ryzen 5 3450U with Radeon Vega Mobile Gfx AuthenticAMD GNU/Linux
CONTEXT :
SYMPTOMS:
3887f9e5c368 victoriametrics/victoria-metrics "/victoria-metrics-p…" 37 minutes ago Up 20 minutes 0.0.0.0:8428->8428/tcp victoria-metrics
(note that the 8428 port is exposed and should be accessible from outside the container)
# curl 127.0.0.1:8428
curl: (56) Recv failure: Connection reset by peer
Same with localhost:8428. If I enter the container with docker exec -it victoria-metrics /bin/ash, then I also have NO network there (every ping is stuck for a while, so I Ctrl-C):
What I tried (without luck):
- docker inspect, especially the network part: nothing wrong!
- --net=host (so no ports exposed): here, of course, everything works (the port is accessible from the host / from other servers, and from inside the container I can reach any outside IP) - but of course, this is not a proper / acceptable solution
As far as I can tell, I would really say it's somehow Flatcar-related and not Docker-related... but I'm really stuck / I can't find any idea of what may be wrong.
Expected behavior
Network to be working
Additional information
IFCONFIG (there are 2 RJ45 ports, only one is plugged in at this time):
DOCKER INSPECT :
(I compared everything from that JSON with other containers in a non-Flatcar environment, and found no special / structural differences)
ETC/HOSTS (nothing special)
NETSTAT seems fine ... :
and