Various issues in systemd.NotifyProxy #16515
So, the problem in the tests seems to be that the sd-notify write blocks. The sequence looks like this:
proxy waitForReady gofunc:
proxy WaitAndClose():
It seems like sd-notify was successful in sending the barrier packet, but then never gets the POLLHUP. I don't immediately see why this should happen. If the message was not sendmsg()ed at all, then it should have failed immediately, but if it got sent, then the message should be on the receiver-side queue, and if that socket is closed the queue should have been drained and the fd closed. The other option is if the server somehow read the second packet too (see read size == bufferSize and the ??? above), and we got the message out of the queue. However, surely a read() instead of a recvmsg() on a socket that doesn't extract the fds would close them too, no?
Thanks for your help and the great explanation, @alexlarsson. Tackling it requires more concentration than I can get on the train, but we should surely get it done by Podman v4.4 👍
If there's a race, we select the ready chan first, so I think we're good.
I have a PR in the pipeline to fix that.
So that means that kube-play needs to run until all services are down (and the "service container" exits). Only then are the notify proxies safe to close. Right?
In theory yes, but I think in practice you can probably shut it down after you have gotten a notify and a barrier from each expected container, or after some timeout following the notify.
@alexlarsson do you think it's fair to assume that the barrier comes right after ready in the same message? If so, #16709 does the trick but I am not 100 percent sure.
@vrothberg I don't think it comes in the same message. It's two separate messages, but probably sent directly after the notify.
I am not sure whether it's best to just require a non-blocking ready. That may help avoid other expectations about the protocol that Podman doesn't implement.
Interestingly, the systemd-notify in our CI (Fedora:31) is not sending barrier messages at all. I will continue digging on Monday.
As outlined in containers#16076, a subsequent BARRIER *may* follow the READY message sent by a container. To correctly imitate the behavior of systemd's NOTIFY_SOCKET, the notify proxies spun up by `kube play` must hence process messages for the entirety of the workload.

We know that the workload is done and that all containers and pods have exited when the service container exits. Hence, all proxies are closed at that time.

The above changes imply that Podman runs for the entirety of the workload and will henceforth act as the MAINPID when running inside of systemd. Prior to this change, the service container acted as the MAINPID, which is now not possible anymore; Podman would be killed immediately on exit of the service container and could not clean up.

The kube template now correctly transitions to inactive instead of failed in systemd.

Fixes: containers#16076
Fixes: containers#16515
Signed-off-by: Valentin Rothberg <[email protected]>
I've been reviewing this code, hoping to find the cause of #16076. I'm not 100% sure I've found the underlying reason, but I have found several issues.
First of all, there seems to be a general misconception of how datagram sockets work. A single write is never split; it is delivered as a separate read and never combined with any other write. This means that the loop in waitForReady() is wrong when it combines reads. The receiving end should perform a single read at a time (each read corresponds to exactly one packet that was sent), handle it immediately, and then handle any further reads separately.
If the sending side sends a larger message than the buffer you passed, then you will get a truncated read and the remaining data is thrown away. If the kernel's receive buffer for the socket is full, further incoming packets will be dropped.
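To make that concrete, here is a minimal Go sketch of a reader that treats each datagram as one message. The socket path, buffer size, and handling are placeholders of my own, not the actual notifyproxy code:

```go
// Hypothetical sketch: one datagram per read on the proxy's NOTIFY_SOCKET.
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	addr := &net.UnixAddr{Name: "/tmp/notify-proxy.sock", Net: "unixgram"} // placeholder path
	conn, err := net.ListenUnixgram("unixgram", addr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 8192)
	for {
		// Each ReadFromUnix returns exactly one datagram (one client sendmsg()),
		// so messages are handled one at a time and never concatenated.
		n, _, err := conn.ReadFromUnix(buf)
		if err != nil {
			panic(err)
		}
		msg := string(buf[:n])
		if strings.Contains(msg, "READY=1") {
			fmt.Println("container reported READY")
		}
		// other assignments (STATUS=..., BARRIER=1, ...) would be handled here
	}
}
```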
Here is how systemd handles incoming messages, which we should emulate (see manager_dispatch_notify_fd(), linked below):
Secondly, sd-notify handles something called barriers. What happens is that the client sends a packet containing only "BARRIER=1", which includes a file descriptor. The client sends this, then waits for the file descriptor to be closed. So when the server gets a message like this, it needs to extract and close all the passed file descriptors.
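For illustration, here is a rough Go approximation of what the client side of a barrier does (this mirrors my understanding of sd_notify_barrier() in libsystemd; the use of golang.org/x/sys/unix and the error handling are my own choices, not code from either project):

```go
package main

import (
	"net"
	"os"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	sock := os.Getenv("NOTIFY_SOCKET") // set by systemd or the proxy
	conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Create a pipe; the write end travels to the server, the read end stays here.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	// Send "BARRIER=1" with the pipe's write end attached via SCM_RIGHTS.
	oob := syscall.UnixRights(int(w.Fd()))
	if _, _, err := conn.WriteMsgUnix([]byte("BARRIER=1"), oob, nil); err != nil {
		panic(err)
	}
	w.Close() // drop our copy; only the server now holds a write end

	// Block until the server closes the received fd: the read end then reports POLLHUP.
	fds := []unix.PollFd{{Fd: int32(r.Fd()), Events: 0}}
	unix.Poll(fds, -1)
}
```

This is why a proxy that never closes the passed fd leaves the client blocked forever.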
The test case is using `systemd-notify --ready`, and that has this code:

In other words, unless you passed `--no-block`, the client will first send the notify message, then a separate barrier message, and then wait until the barrier is handled.

So, issues here:
I think the best way to handle the barrier messages is to just close the fds in the message and ignore the message. Alternatively, we could forward them to systemd and have it close them, but that seems unnecessary. I think the point of the barrier is to ensure that the remote side has seen the previous messages and that they are not lost in the send buffer of the client or something. Maybe we need to have the notify proxy similarly use a barrier when reporting the notify to systemd, though.
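A minimal sketch of that idea on the receiving side, assuming the proxy reads with ReadMsgUnix; the readOne helper, the package name, and the buffer sizes are hypothetical:

```go
package notifyproxy // hypothetical package name

import (
	"net"
	"syscall"
)

// readOne reads a single datagram plus its ancillary data and closes any
// SCM_RIGHTS fds it carried (e.g. the fd sent with a BARRIER=1 message).
func readOne(conn *net.UnixConn) (string, error) {
	buf := make([]byte, 8192)
	oob := make([]byte, 8192)
	n, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return "", err
	}
	if oobn > 0 {
		scms, err := syscall.ParseSocketControlMessage(oob[:oobn])
		if err != nil {
			return "", err
		}
		for _, scm := range scms {
			fds, err := syscall.ParseUnixRights(&scm)
			if err != nil {
				continue // not SCM_RIGHTS, nothing to close
			}
			for _, fd := range fds {
				syscall.Close(fd) // closing is what unblocks the client's barrier wait
			}
		}
	}
	return string(buf[:n]), nil
}
```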
For the continued handling of messages, I'm not sure what the best approach is. There is no real way to tell that the peer side of a datagram socket has exited or closed the fd. The way systemd handles this is that it has only a single notify receive socket in the entire pid 1, and it looks at the kernel-reported peer pid on received messages to see which pid sent the message. Then it keeps the socket open forever.
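For reference, a sketch of that single-socket approach, assuming SO_PASSCRED is enabled so the kernel attaches the sender's pid as SCM_CREDENTIALS to each datagram. The socket path is a placeholder, and this is not how Podman currently works:

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

func main() {
	addr := &net.UnixAddr{Name: "/run/notify/socket", Net: "unixgram"} // placeholder path
	conn, err := net.ListenUnixgram("unixgram", addr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ask the kernel to attach the sender's credentials to every datagram.
	raw, err := conn.SyscallConn()
	if err != nil {
		panic(err)
	}
	raw.Control(func(fd uintptr) {
		syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_PASSCRED, 1)
	})

	buf := make([]byte, 8192)
	oob := make([]byte, 8192)
	for {
		n, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
		if err != nil {
			panic(err)
		}
		scms, _ := syscall.ParseSocketControlMessage(oob[:oobn])
		for _, scm := range scms {
			if cred, err := syscall.ParseUnixCredentials(&scm); err == nil {
				// The pid tells us which container/unit the message belongs to.
				fmt.Printf("pid %d sent: %q\n", cred.Pid, buf[:n])
			}
		}
	}
}
```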
Thirdly, it feels like there is a race condition between errorChan and readyChan when we deal with a timeout. If the timeout races with the readyChan, then both could be sent. I don't think this is actually an issue right now, because WaitAndClose() will just randomly pick one of them and ignore the other. However, long-term, if we handle multiple messages, etc., it may become an issue.
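One way to make the outcome deterministic, sketched here with hypothetical channel and function names rather than the actual WaitAndClose() code, is to always check the ready channel before the error channel:

```go
// waitFor is a hypothetical sketch: if both channels fired (e.g. a timeout
// raced with READY), the ready signal wins because it is checked first.
func waitFor(readyChan <-chan struct{}, errorChan <-chan error) error {
	select {
	case <-readyChan:
		return nil
	default:
	}
	select {
	case <-readyChan:
		return nil
	case err := <-errorChan:
		return err
	}
}
```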
Other minor issues:
For details of how systemd works, see manager_dispatch_notify_fd()
https://github.com/systemd/systemd/blob/main/src/core/manager.c#L2444
This is called from the mainloop every time poll/select returns the notify socket fd as readable.