Various issues in systemd.NotifyProxy #16515
So, the problem in the tests seems to be that the sd-notify write blocks. The sequence looks like this:
proxy waitForReady gofunc:
proxy WaitAndClose():
It seems like sd-notify was successful in sending the barrier packet, but then never gets the POLLHUP. I don't immediately see why this should happen. If the message was not sendmsg()ed at all, then it should have failed immediately, but if it got sent, then the message should be on the receiver-side queue, and if that socket is closed the queue should have been drained and the fd closed. The other option is if the server somehow read the second packet too (see read size == bufferSize and the ??? above), and we got the message out of the queue. However, surely a read() instead of a recvmsg() on a socket that doesn't extract the fds would close them too, no?
Thanks for your help and the great explanation, @alexlarsson. Tackling it requires more concentration than I can get on the train, but we should surely get it done by Podman v4.4 👍
If there's a race, we select the ready chan first, so I think we're good.
I have a PR in the pipeline to fix that.
So that means that kube-play needs to run until all services are down (and the "service container" exits). Only then are the notify proxies safe to close. Right?
In theory yes, but I think in practice you can probably shut it down after you have gotten a notify and a barrier from each expected container, or after some timeout following the notify.
@alexlarsson do you think it's fair to assume that the barrier comes right after ready in the same message? If so, #16709 does the trick but I am not 100 percent sure.
@vrothberg I don't think it comes in the same message. It's two separate messages, but probably sent directly after the notify.
I am not sure whether it's best to just require a non-blocking ready. That may help avoid other expectations about the protocol that Podman doesn't implement.
Interestingly, the systemd-notify in our CI (Fedora:31) is not sending barrier messages at all. I will continue digging on Monday.
As outlined in containers#16076, a subsequent BARRIER *may* follow the READY message sent by a container. To correctly imitate the behavior of systemd's NOTIFY_SOCKET, the notify proxies spun up by `kube play` must hence process messages for the entirety of the workload.

We know that the workload is done and that all containers and pods have exited when the service container exits. Hence, all proxies are closed at that time.

The above changes imply that Podman runs for the entirety of the workload and will henceforth act as the MAINPID when running inside of systemd. Prior to this change, the service container acted as the MAINPID, which is now not possible anymore; Podman would be killed immediately on exit of the service container and could not clean up.

The kube template now correctly transitions to inactive instead of failed in systemd.

Fixes: containers#16076
Fixes: containers#16515
Signed-off-by: Valentin Rothberg <[email protected]>
I've been reviewing this code, hoping to find the cause of #16076. I'm not 100% sure I've found the underlying reason, but I have found several issues.
First of all, there seems to be a general misconception of how datagram sockets work. A single write is never split; it is delivered as a separate read and never combined with any other write. This means that the loop in waitForReady() is wrong when it combines reads. The receiving end should perform a single read at a time (each read corresponds to exactly one packet that was sent), handle it immediately, and then handle any further reads separately.
If the sending side sends a larger message than the buffer you passed, then you will get a truncated read and the remaining data is thrown away. If the kernel's receive buffer for the socket is full, further incoming packets will be dropped.
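To make that concrete, here is a minimal Go sketch of a reader that treats each datagram as one message. The socket path, buffer size, and handling are placeholders of my own, not the actual notifyproxy code:

```go
// Hypothetical sketch: one datagram per read on the proxy's NOTIFY_SOCKET.
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	addr := &net.UnixAddr{Name: "/tmp/notify-proxy.sock", Net: "unixgram"} // placeholder path
	conn, err := net.ListenUnixgram("unixgram", addr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 8192)
	for {
		// Each ReadFromUnix returns exactly one datagram (one client sendmsg()),
		// so messages are handled one at a time and never concatenated.
		n, _, err := conn.ReadFromUnix(buf)
		if err != nil {
			panic(err)
		}
		msg := string(buf[:n])
		if strings.Contains(msg, "READY=1") {
			fmt.Println("container reported READY")
		}
		// other assignments (STATUS=..., BARRIER=1, ...) would be handled here
	}
}
```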
Here is how systemd handles incoming messages, which we should emulate (see manager_dispatch_notify_fd(), linked below):
Secondly, sd-notify handles something called barriers. What happens is that the client sends a packet containing only "BARRIER=1", which includes a file descriptor. The client sends this, then waits for the file descriptor to be closed. So when the server gets a message like this, it needs to extract and close all the passed file descriptors.
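For illustration, here is a rough Go approximation of what the client side of a barrier does (this mirrors my understanding of sd_notify_barrier() in libsystemd; the use of golang.org/x/sys/unix and the error handling are my own choices, not code from either project):

```go
package main

import (
	"net"
	"os"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	sock := os.Getenv("NOTIFY_SOCKET") // set by systemd or the proxy
	conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Create a pipe; the write end travels to the server, the read end stays here.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	// Send "BARRIER=1" with the pipe's write end attached via SCM_RIGHTS.
	oob := syscall.UnixRights(int(w.Fd()))
	if _, _, err := conn.WriteMsgUnix([]byte("BARRIER=1"), oob, nil); err != nil {
		panic(err)
	}
	w.Close() // drop our copy; only the server now holds a write end

	// Block until the server closes the received fd: the read end then reports POLLHUP.
	fds := []unix.PollFd{{Fd: int32(r.Fd()), Events: 0}}
	unix.Poll(fds, -1)
}
```

This is why a proxy that never closes the passed fd leaves the client blocked forever.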
The test case is using `systemd-notify --ready`, and that has this code:

In other words, unless you passed `--no-block`, the client will first send the notify message, then a separate barrier message, and then wait until the barrier is handled.

So, issues here:
I think the best way to handle the barrier messages is to just close the fds in the message and ignore the message. Alternatively, we could forward them to systemd and have it close them, but that seems unnecessary. I think the point of the barrier is to ensure that the remote side has seen the previous messages and that they are not lost in the send buffer of the client or something. Maybe we need to have the notify proxy similarly use a barrier when reporting the notify to systemd, though.
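A minimal sketch of that idea on the receiving side, assuming the proxy reads with ReadMsgUnix; the readOne helper, the package name, and the buffer sizes are hypothetical:

```go
package notifyproxy // hypothetical package name

import (
	"net"
	"syscall"
)

// readOne reads a single datagram plus its ancillary data and closes any
// SCM_RIGHTS fds it carried (e.g. the fd sent with a BARRIER=1 message).
func readOne(conn *net.UnixConn) (string, error) {
	buf := make([]byte, 8192)
	oob := make([]byte, 8192)
	n, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return "", err
	}
	if oobn > 0 {
		scms, err := syscall.ParseSocketControlMessage(oob[:oobn])
		if err != nil {
			return "", err
		}
		for _, scm := range scms {
			fds, err := syscall.ParseUnixRights(&scm)
			if err != nil {
				continue // not SCM_RIGHTS, nothing to close
			}
			for _, fd := range fds {
				syscall.Close(fd) // closing is what unblocks the client's barrier wait
			}
		}
	}
	return string(buf[:n]), nil
}
```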
For the continued handling of messages, I'm not sure what the best approach is. There is no real way to tell that the peer side of a datagram socket has exited or closed the fd. The way systemd handles this is that it has only a single notify receive socket in the entire pid 1, and it looks at the kernel-reported peer pid on received messages to see which pid sent the message. Then it keeps the socket open forever.
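For reference, a sketch of that single-socket approach, assuming SO_PASSCRED is enabled so the kernel attaches the sender's pid as SCM_CREDENTIALS to each datagram. The socket path is a placeholder, and this is not how Podman currently works:

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

func main() {
	addr := &net.UnixAddr{Name: "/run/notify/socket", Net: "unixgram"} // placeholder path
	conn, err := net.ListenUnixgram("unixgram", addr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ask the kernel to attach the sender's credentials to every datagram.
	raw, err := conn.SyscallConn()
	if err != nil {
		panic(err)
	}
	raw.Control(func(fd uintptr) {
		syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_PASSCRED, 1)
	})

	buf := make([]byte, 8192)
	oob := make([]byte, 8192)
	for {
		n, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
		if err != nil {
			panic(err)
		}
		scms, _ := syscall.ParseSocketControlMessage(oob[:oobn])
		for _, scm := range scms {
			if cred, err := syscall.ParseUnixCredentials(&scm); err == nil {
				// The pid tells us which container/unit the message belongs to.
				fmt.Printf("pid %d sent: %q\n", cred.Pid, buf[:n])
			}
		}
	}
}
```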
Thirdly, it feels like there is a race condition between errorChan and readyChan when we deal with a timeout. If the timeout races with the readyChan, then both could be sent. I don't think this is actually an issue right now, because WaitAndClose() will just randomly pick one of them and ignore the other. However, long-term, if we handle multiple messages, etc., it may become an issue.
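One way to make the outcome deterministic, sketched here with hypothetical channel and function names rather than the actual WaitAndClose() code, is to always check the ready channel before the error channel:

```go
// waitFor is a hypothetical sketch: if both channels fired (e.g. a timeout
// raced with READY), the ready signal wins because it is checked first.
func waitFor(readyChan <-chan struct{}, errorChan <-chan error) error {
	select {
	case <-readyChan:
		return nil
	default:
	}
	select {
	case <-readyChan:
		return nil
	case err := <-errorChan:
		return err
	}
}
```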
Other minor issues:
For details of how systemd works, see manager_dispatch_notify_fd()
https://github.com/systemd/systemd/blob/main/src/core/manager.c#L2444
This is called from the mainloop every time poll/select returns the notify socket fd as readable.