
Various issues in systemd.NotifyProxy #16515

Closed
alexlarsson opened this issue Nov 15, 2022 · 8 comments · Fixed by #16709

@alexlarsson (Contributor) commented Nov 15, 2022

I've been reviewing this code, hoping to find the cause of #16076. I'm not 100% sure I've found the underlying reason, but I have found several issues.

First of all, there seems to be a general misconception of how datagram sockets work. A single write is never split; it is delivered as one separate read and is never combined with any other write. This means the loop in waitForReady() is wrong to combine reads. The receiving end should do one read at a time (each read corresponds to exactly one packet that was sent), handle it immediately, and then handle any further reads separately.

If the sending side sends a message larger than the buffer you passed, you will get a truncated read and the remaining data is thrown away. If the socket's receive buffer in the kernel is full, further incoming packets will be dropped.

Here is how systemd handles incoming messages, which we should emulate (a sketch follows the list):

  • setsockopt is used on the socket to set SO_RCVBUF to 8*1024*1024, which reduces the risk of packets being dropped.
  • recvmsg() is called with a buffer of PIPE_BUF == 4096 bytes.
  • Truncated reads (MSG_TRUNC) are detected and handled as an error.
  • Each non-truncated read is treated on its own as one full packet.
  • Reads are done with recvmsg(), which allows file descriptors passed by the client to be received.
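A minimal Go sketch of such a receive loop (hedged: serveNotify, the handle callback, and the package name are assumptions for illustration, not Podman's actual NotifyProxy code; it uses golang.org/x/sys/unix):

```go
// Sketch of a systemd-style notify receive loop: large SO_RCVBUF, fixed
// PIPE_BUF-sized buffer, MSG_TRUNC detection, and recvmsg() so that file
// descriptors passed via SCM_RIGHTS are actually received.
package notifyproxy // hypothetical package name

import (
    "fmt"
    "net"
    "os"

    "golang.org/x/sys/unix"
)

func serveNotify(conn *net.UnixConn, handle func(msg string, fds []int)) error {
    // Enlarge the kernel receive buffer to reduce the risk of dropped packets.
    if err := conn.SetReadBuffer(8 * 1024 * 1024); err != nil {
        return err
    }

    buf := make([]byte, 4096)                // PIPE_BUF-sized payload buffer
    oob := make([]byte, unix.CmsgSpace(8*4)) // room for a few passed fds

    for {
        n, oobn, flags, _, err := conn.ReadMsgUnix(buf, oob)
        if err != nil {
            return err
        }
        if flags&unix.MSG_TRUNC != 0 {
            // The datagram was larger than buf; systemd treats this as an error.
            fmt.Fprintln(os.Stderr, "notify message too long, ignoring")
            continue
        }

        // Collect any file descriptors passed along with this message.
        var fds []int
        if cmsgs, err := unix.ParseSocketControlMessage(oob[:oobn]); err == nil {
            for _, cmsg := range cmsgs {
                if f, err := unix.ParseUnixRights(&cmsg); err == nil {
                    fds = append(fds, f...)
                }
            }
        }

        // Each non-truncated read is exactly one complete notify message.
        handle(string(buf[:n]), fds)
    }
}
```

Note that SetReadBuffer sets SO_RCVBUF, whose effective value may be capped by net.core.rmem_max for unprivileged processes.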

Secondly, sd-notify handles something called barriers. The client sends a packet containing only "BARRIER=1" together with a file descriptor, and then waits for that file descriptor to be closed. So, when the server gets a message like this, it needs to extract and close all the passed file descriptors.

The test case uses systemd-notify --ready, which has this code:

        r = sd_pid_notify(source_pid, false, n);  // main notify
...
        if (!arg_no_block) {
                r = sd_notify_barrier(0, 5 * USEC_PER_SEC);

In other words, unless you passed --no-block, the client will first send the notify message, then a separate barrier message and then wait until the barrier is handled.
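For illustration, here is roughly what the barrier does on the client side, as a hedged Go sketch rather than systemd's actual C implementation (notifyBarrier and all other names here are assumptions):

```go
// Rough sketch of the client side of a barrier: create a pipe, send the
// write end inside a "BARRIER=1" datagram, close the local copy, and block
// until the receiver closes the passed fd (observed as POLLHUP), or time out.
package notifyproxy // hypothetical package name

import (
    "fmt"
    "net"
    "time"

    "golang.org/x/sys/unix"
)

func notifyBarrier(conn *net.UnixConn, timeout time.Duration) error {
    var pipeFds [2]int
    if err := unix.Pipe2(pipeFds[:], unix.O_CLOEXEC); err != nil {
        return err
    }
    defer unix.Close(pipeFds[0])

    // Send the write end of the pipe along with the BARRIER=1 message.
    oob := unix.UnixRights(pipeFds[1])
    _, _, err := conn.WriteMsgUnix([]byte("BARRIER=1"), oob, nil)
    // Our copy of the fd is no longer needed; the in-flight datagram keeps
    // the receiver's copy alive until it is read (or the socket is closed).
    unix.Close(pipeFds[1])
    if err != nil {
        return err
    }

    // Wait until the receiver closes its copy of the fd; POLLHUP is reported
    // even with an empty event mask.
    pfd := []unix.PollFd{{Fd: int32(pipeFds[0]), Events: 0}}
    n, err := unix.Poll(pfd, int(timeout.Milliseconds()))
    if err != nil {
        return err
    }
    if n == 0 {
        return fmt.Errorf("timed out waiting for barrier")
    }
    return nil
}
```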

So, issues here:

  • we must handle barrier messages correctly
  • we must continue to handle messages after the notify message

I think the best way to handle the barrier messages is to just close the fds in the message and ignore the message itself. Alternatively we could pass them on to systemd and have it close them, but that seems unnecessary. I think the point of the barrier is to ensure that the remote side has seen the previous messages, so we are sure they are not lost in the send buffer of the client or something. Maybe the notify proxy should similarly use a barrier when reporting the notify to systemd, though.
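Continuing the receive-loop sketch above (same hypothetical package; handleMessage is an assumed name), closing the fds and ignoring pure barriers could look roughly like this:

```go
// Close every fd carried by a message; that is what releases a client
// blocked in sd_notify_barrier(), since its poll() then sees POLLHUP on the
// pipe it sent us. A pure barrier message carries no other state to forward.
package notifyproxy // hypothetical package name

import (
    "strings"

    "golang.org/x/sys/unix"
)

func handleMessage(msg string, fds []int) {
    for _, fd := range fds {
        unix.Close(fd)
    }
    if strings.TrimSpace(msg) == "BARRIER=1" {
        return // nothing else to do for a barrier
    }
    for _, line := range strings.Split(msg, "\n") {
        if line == "READY=1" {
            // e.g. signal the ready channel and/or forward READY=1 to systemd
        }
    }
}
```

With the earlier sketch, this would be passed as the handler: serveNotify(conn, handleMessage).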

For the continued handling of messages, I'm not sure what the best approach is. There is no real way to tell that the peer side of a datagram socket has exited or closed the fd. systemd handles this by having only a single notify receive socket in the entire pid 1; it looks at the kernel-reported peer pid on each received message to see which pid sent it, and it keeps the socket open forever.
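For reference, that kernel-reported peer pid is delivered per datagram once SO_PASSCRED is enabled; here is a hedged sketch (helper names are assumptions) of how a single long-lived proxy socket could identify the sender of each message:

```go
// Enable SO_PASSCRED so every recvmsg() carries the sender's credentials,
// and extract the sending PID from the control data of one message.
package notifyproxy // hypothetical package name

import (
    "net"

    "golang.org/x/sys/unix"
)

func enablePassCred(conn *net.UnixConn) error {
    raw, err := conn.SyscallConn()
    if err != nil {
        return err
    }
    var sockErr error
    if err := raw.Control(func(fd uintptr) {
        sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_PASSCRED, 1)
    }); err != nil {
        return err
    }
    return sockErr
}

// senderPid returns the PID found in the SCM_CREDENTIALS control message of
// a single recvmsg(), if any.
func senderPid(oob []byte) (int32, bool) {
    cmsgs, err := unix.ParseSocketControlMessage(oob)
    if err != nil {
        return 0, false
    }
    for _, cmsg := range cmsgs {
        if cred, err := unix.ParseUnixCredentials(&cmsg); err == nil {
            return cred.Pid, true
        }
    }
    return 0, false
}
```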

Thirdly, it feels like there is a race condition between errorChan and readyChan when we deal with a timeout. If the timeout races with readyChan, then both could be sent. I don't think this is actually an issue right now, because WaitAndClose() will just randomly pick one of them and ignore the other. However, long-term, if we handle multiple messages etc., it may become an issue.

Other minor issues:

  • Systemd accepts (but warns about) the message text having a single trailing NUL

For details of how systemd works, see manager_dispatch_notify_fd():
https://github.com/systemd/systemd/blob/main/src/core/manager.c#L2444
This is called from the main loop every time poll/select reports the notify socket fd as readable.

@alexlarsson (Contributor, Author)
So, the problem in the tests seems to be that the sd-notify write blocks. The sequence is like this:
client:

  • sd-notify writes the READY=1 packet (on a new dgram socket)
  • sd-notify creates a socketpair
  • sd-notify writes the BARRIER=1 packet, with fd = socketpair[1] (on a new dgram socket)
  • sd-notify closes socketpair[1] (but the outstanding packet keeps it open)
  • sd-notify blocks on poll(socketpair[0]), which will return POLLHUP when socketpair[1] is closed

proxy waitForReady gofunc:

  • proxy reads the READY=1 packet
  • the read size is not bufferSize (or is it???), so we handle the message directly
  • the split lines contain READY=1, so we signal readyChan
  • waitForReady() exits, killing the gofunc

proxy WaitAndClose():

  • we block on readyChan
  • we return nil as no error has happened
  • in defer, the datagram socket is closed and removed

It seems like sd-notify was successful in sending the barrier packet, but then never gets the POLLHUP. I don't immediately see why this should happen. If the message was not sendmsg()ed at all, it should have failed immediately; but if it got sent, then the message should be in the receiver-side queue, and if that socket is closed, the queue should have been drained and the fd closed. The other option is that the server somehow read the second packet too (see read size == bufferSize and the ??? above) and we got the message out of the queue. However, surely a read() instead of a recvmsg() on the socket, which doesn't extract the fds, would close them too, no?

@vrothberg (Member)
Thanks for your help and the great explanation, @alexlarsson. Tackling it requires more concentration than I can muster on the train, but we should surely get it done by Podman v4.4 👍

@vrothberg (Member)

Thirdly, It feels like there is a race condition between errorChan and readyChan when we deal with a timeout. If the timeout races with the readyChan, then both could be sent.

If there's a race we select the ready chan first, so I think we're good.
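(For reference: a Go select with several ready cases picks one at random, so preferring readyChan requires an explicit re-check. A sketch of that pattern, not the actual WaitAndClose code:)

```go
// Prefer the ready channel when it races with an error/timeout: a plain
// select picks randomly among ready cases, so readyChan is re-checked in a
// non-blocking select before returning the error.
package notifyproxy // hypothetical package name

func waitAndClose(readyChan <-chan struct{}, errorChan <-chan error) error {
    select {
    case <-readyChan:
        return nil
    case err := <-errorChan:
        // An error (e.g. a timeout) arrived; still prefer a ready signal
        // that may have raced with it.
        select {
        case <-readyChan:
            return nil
        default:
        }
        return err
    }
}
```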

we must handle barrier messages correctly

I have a PR in the pipeline to fix that.

we must continue to handle messages after the notify message

So that means that kube-play needs to run until all services are down (and the "service container" exits). Only then the notify proxies are safe to be closed. Right?

@alexlarsson (Contributor, Author)
we must continue to handle messages after the notify message

So that means that kube-play needs to run until all services are down (and the "service container" exits). Only then the notify proxies are safe to be closed. Right?

In theory yes, but I think in practice you can probably shut it down after you have gotten a notify and a barrier from each expected container, or some timeout after notify.

@vrothberg (Member)
@alexlarsson do you think it's fair to assume that the barrier comes right after ready in the same message? If so, #16709 does the trick but I am not 100 percent sure.

@alexlarsson (Contributor, Author)
@vrothberg I don't think it comes in the same message. It's two separate messages, but the barrier is probably sent directly after the notify.

@vrothberg (Member)
I am not sure whether it's best to just require a non-blocking ready. That may help avoid other expectations about the protocol that Podman doesn't implement.

@vrothberg (Member)
Interestingly the systemd-notify in our CI (Fedora:31) is not sending barrier messages at all. Will continue digging on Monday.

vrothberg added a commit to vrothberg/libpod that referenced this issue Dec 6, 2022
As outlined in containers#16076, a subsequent BARRIER *may* follow the READY
message sent by a container.  To correctly imitate the behavior of
systemd's NOTIFY_SOCKET, the notify proxies spun up by `kube play` must
hence process messages for the entirety of the workload.

We know that the workload is done and that all containers and pods have
exited when the service container exits.  Hence, all proxies are closed
at that time.

The above changes imply that Podman runs for the entirety of the
workload and will henceforth act as the MAINPID when running inside of
systemd.  Prior to this change, the service container acted as the
MAINPID, which is not possible anymore; Podman would be killed
immediately on exit of the service container and could not clean up.

The kube template now correctly transitions to inactive instead of
failed in systemd.

Fixes: containers#16076
Fixes: containers#16515
Signed-off-by: Valentin Rothberg <[email protected]>