Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libimage: pull: increase timeout running under systemd #1616

Conversation

vrothberg
Copy link
Member

Set the EXTEND_TIMEOUT_USEC over DBUS when pulling an image from a registry and when running under systemd. This will prevent a frequent issue when running Quadlets and exceeding the default systemd start timeout of 90 seconds when pulling the image takes too long.

Fixes: containers/podman/issues/18353

Entirely untested but it may aid in deciding what to do with containers/podman/issues/18353.
@rhatdan @ygalblum WDYT?

@ygalblum
Copy link
Contributor

Look like this will work. I do have two comments though:

  1. Since the intention is to never let Systemd timeout during pull I think we can extend it to a larger number and have less loops.
  2. Not sure I love the fact that the extend mechanism is not controlled by the caller at all. That is, the decision is made according to the existence of the environment variable without allowing the caller to do anything about it.

libimage/pull.go Outdated Show resolved Hide resolved
libimage/pull.go Outdated Show resolved Hide resolved
libimage/pull.go Outdated Show resolved Hide resolved
@vrothberg
Copy link
Member Author

vrothberg commented Aug 16, 2023

Look like this will work. I do have two comments though:

1. Since the intention is to never let Systemd timeout during pull I think we can extend it to a larger number and have less loops.

What do you consider a reasonable number? 20 seconds?

2. Not sure I love the fact that the extend mechanism is not controlled by the caller at all. That is, the decision is made according to the existence of the environment variable without allowing the caller to do anything about it.

I was going back and forth on making it conditional. In a way it is already conditional via the env. It doesn't really make sense in podman system service but for that, the server should unset the SOCKET after using it.

@vrothberg vrothberg force-pushed the fix-containers/podman/issues/18353 branch from 393d0fb to b95e0be Compare August 16, 2023 13:03
libimage/pull.go Outdated Show resolved Hide resolved
@vrothberg vrothberg force-pushed the fix-containers/podman/issues/18353 branch from b95e0be to 7b49c21 Compare August 16, 2023 13:23
@ygalblum
Copy link
Contributor

What do you consider a reasonable number? 20 seconds?

In the first PR, @rhatdan proposed 30 seconds. Maybe go with that

the server should unset the SOCKET after using it

Do we know that it does?

@vrothberg
Copy link
Member Author

What do you consider a reasonable number? 20 seconds?

In the first PR, @rhatdan proposed 30 seconds. Maybe go with that

SGTM

the server should unset the SOCKET after using it

Do we know that it does?

With containers/podman#19644 it will. I was convinced it was doing it (but double checked).

@ygalblum
Copy link
Contributor

With containers/podman#19644 it will. I was convinced it was doing it (but double checked).

Then I think we're good

@rhatdan
Copy link
Member

rhatdan commented Aug 16, 2023

LGTM

@vrothberg vrothberg force-pushed the fix-containers/podman/issues/18353 branch from 7b49c21 to 72981d0 Compare August 17, 2023 09:02
@vrothberg
Copy link
Member Author

Now with 30 seconds. I'll keep it as a draft until I've figured out how to test it.

@Luap99
Copy link
Member

Luap99 commented Aug 17, 2023

The one problem with this approach is that it never actually checks for actual data progress. This works fine in the good case but if we now run into a deadlock in the pull code this goroutine will extend the time-out forever and the systemd unit will never time out. I don't think it is likely to happen but certainly a thing to consider and why I think the correct thing to do is to check for actual pull progress and only send a new message if we know new progress has been made.

libimage/pull.go Outdated Show resolved Hide resolved
@vrothberg
Copy link
Member Author

[...] I think the correct thing to do is to check for actual pull progress and only send a new message if we know new progress has been made

Dan asked the same thing in #1616 (comment) but I don't feel very comfortable doing it. The standard lib will take care of throwing an error in case the downloads starve. There may be a bug in our stack for sure that may cause it to block. But that would also be true if we'd check for pull progress where we some bug may indicate progress indefinitely. Schema 1 images don't indicate the size of blobs.

@ygalblum
Copy link
Contributor

The standard lib will take care of throwing an error in case the downloads starve

I agree with @vrothberg here with one caveat. Can a call to podman pull hang forever (assuming the user does not kill it)? If no, then we're good. If yes, maybe the fix should be there.

@vrothberg
Copy link
Member Author

Can a call to podman pull hang forever (assuming the user does not kill it)?

In theory any call to podman can hang forever if there's a bug. My main point is that using pull progress as a line of defense isn't much more robust either since the standard lib already times out. So at the moment we rely on that. If we'd also rely on pull progress being made (ie., listen on the channel), I don't think we add much robustness.

Maybe we should limit the number of timeout extensions to 10 or something like that? So something super high that should largely be sufficient but deterministic. If users have gigantic images or so low bandwidth that 5 minutes + 90 seconds are not enough, then I'd argue that they need to tweak the timeout in their Quadlet files themselves.

@Luap99
Copy link
Member

Luap99 commented Aug 17, 2023

Well I don't think there is any known deadlock in the pull code. This was more about being actually correct in all cases even when we introduce a deadlock in the future. Of course a normal podman pull would face the same thing.

I agree that this solution is simple enough and fixes a important problem that many users run into. My comments was only meant as maybe there is a even better way? If we agree that there is not without adding a lot of complexity and work I am favour of merging.

@Luap99
Copy link
Member

Luap99 commented Aug 17, 2023

If users have gigantic images or so low bandwidth that 5 minutes + 90 seconds are not enough

I don't think this is how it works, form the sd_notify(3):

 EXTEND_TIMEOUT_USEC=...
 Tells the service manager to extend the startup, runtime or shutdown service timeout corresponding the current state. The value specified is a time in microseconds during which the service
 must send a new message. A service timeout will occur if the message isn't received, but only if the runtime of the current state is beyond the original maximum times of TimeoutStartSec=,
 RuntimeMaxSec=, and TimeoutStopSec=. See systemd.service(5) for effects on the service timeouts.

So it will time-out when it does not revive a new EXTEND_TIMEOUT_USEC message in the given time and it is over the normal default configured timeout. It does not add the time to the default value.

@vrothberg
Copy link
Member Author

So it will time-out when it does not revive a new EXTEND_TIMEOUT_USEC message in the given time and it is over the normal default configured timeout. It does not add the time to the default value.

Thanks. Yes, once it exceeds beyond the specific timeout, the timeout will be ignored.

@vrothberg
Copy link
Member Author

My comments was only meant as maybe there is a even better way?

Absolutely. I think it's more than fair to reflect on how to prevent a potential deadlock. So far, I didn't like the progress-check proposal for the reasons mentioned above.

What do you think about a fixed limit of how often Podman would increase the timeout? I'd guess it's fair to help in most cases but the extreme ones.

libimage/pull.go Outdated Show resolved Hide resolved
@Luap99
Copy link
Member

Luap99 commented Aug 17, 2023

I think having a limit of 5min sounds good to me. If someone pull super large images then they likely should pre populate them and not have their service start-up delayed so long and if they truly want that I agree they can still set the normal TimeoutStartSec= in the unit.

@vrothberg
Copy link
Member Author

In that case, I think we can simplify the entire thing and try handling that in Quadlet?

If the default timeout is bigger than 5 minutes (not sure if/how Quadlet can access this value) or if a custom limit is set, Quadlet leaves things as they are. Otherwise, it'll add a TimeoutStartSec=5mins in the unit.

@Luap99
Copy link
Member

Luap99 commented Aug 17, 2023

You do not need to access anything if the default is bigger then systemd will use the bigger value and not time-out, again the docs are very explicit about the behaviour:

but only if the runtime of the current state is beyond the original maximum times of TimeoutStartSec=,
RuntimeMaxSec=, and TimeoutStopSec=. See systemd.service(5) for effects on the service timeouts.

And setting a default TimeoutStartSec=5min is also not a good idea because that would effect all of podman and and there we definitely have more deadlocks. To me the extend feature is great because we only extend for things we know should take longer, i.e. pulls. There might be other special cases in the future were we want to extend it for example chown or selinux label setting on large volumes.

@vrothberg vrothberg force-pushed the fix-containers/podman/issues/18353 branch from 72981d0 to 20a1a0f Compare August 17, 2023 14:49
@ygalblum
Copy link
Contributor

Now I'm a bit confused. @vrothberg do you want to go forward with it?

@Luap99 another advantage of this change is that it limits the time extension to the pull process while in the Quadlet solution the longer value will be set to subsequent runs as well (runs that will not require pulling the image).

Having said that, now I'm also thinking about the health checks and the new sd-notify option. IIUC, once the extend message is sent, the process is expected to send additional ones or ready within the extension period. Am I correct in my understanding? If yes, can this cause an issue?

@vrothberg
Copy link
Member Author

And another qualdet problem is kube support, if I have a yaml with 10 images 5 mins may not be enough but this EXTEND_TIMEOUT_USEC patch here would be per image so it would give us much more time.

Very good point!

Of course this be seen as negative point as well.

Yes, I find it difficult to decide. There's pros and cons on both alternatives. I am OK to move forward with this idea. We can always change in the future. The important part is that we don't see a big blocker.

@vrothberg
Copy link
Member Author

vrothberg commented Aug 21, 2023

Now I'm a bit confused. @vrothberg do you want to go forward with it?

I am OK with moving forward. I pinged you and Dan to have some more thoughts as I find it hard to balance the alternatives.

Having said that, now I'm also thinking about the health checks and the new sd-notify option. IIUC, once the extend message is sent, the process is expected to send additional ones or ready within the extension period.

AFAIUI: once the default timeout has been exceeded, the EXTEND_TIMEOUT_USEC one extends the timeout for the specified time. On each send it will be extended by the specified value. If 10 seconds are left after a successful pull, then Podman has 10 seconds to finish the job until the timeout kicks in.

Am I correct in my understanding? If yes, can this cause an issue?

At the moment, I am not worried about health checks etc. If users expect a delayed start I think it's fair to make them responsible for taking care of the timeout. Probably something to highlight in the docs?

@ygalblum
Copy link
Contributor

Podman has 10 seconds to finish the job until the timeout kicks in

If users expect a delayed start I think it's fair to make them responsible for taking care of the timeout. Probably something to highlight in the docs?

Combining these two comments, it means that the 10 seconds are not only for Podman, but also for the health check to pass. Now, if the checks take 15 seconds from the start of the container the service will fail. So, maybe the extension period should be longer that 30 seconds?

@vrothberg
Copy link
Member Author

Now, if the checks take 15 seconds from the start of the container the service will fail. So, maybe the extension period should be longer that 30 seconds?

In that case, I would expect the --sdnotify=wait code to have a similar extension mechanism than we add here for pull. But I advocate for doing that outside this PR and always at the specific code sites.

@vrothberg vrothberg force-pushed the fix-containers/podman/issues/18353 branch from 20a1a0f to b7e3379 Compare August 21, 2023 13:29
libimage/pull.go Outdated Show resolved Hide resolved
libimage/pull.go Outdated Show resolved Hide resolved
@vrothberg vrothberg force-pushed the fix-containers/podman/issues/18353 branch 3 times, most recently from 061adba to e1e02eb Compare August 23, 2023 11:31
@vrothberg vrothberg changed the title WIP - libimage: pull: increase timeout running under systemd libimage: pull: increase timeout running under systemd Aug 23, 2023
@vrothberg vrothberg force-pushed the fix-containers/podman/issues/18353 branch from e1e02eb to 81cb638 Compare August 23, 2023 12:33
Set the `EXTEND_TIMEOUT_USEC` over DBUS when pulling an image from a
registry and when running under systemd.  This will prevent a frequent
issue when running Quadlets and exceeding the default systemd start
timeout of 90 seconds when pulling the image takes too long.

Fixes: containers/podman/issues/18353
Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg vrothberg force-pushed the fix-containers/podman/issues/18353 branch from 81cb638 to 4488724 Compare August 23, 2023 12:38
@vrothberg vrothberg marked this pull request as ready for review August 24, 2023 11:28
@vrothberg
Copy link
Member Author

containers/podman#19644 s passing (minus the CI farts), so this PR is good to go from my end.

@ygalblum @Luap99 @rhatdan PTanotherL

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rhatdan
Copy link
Member

rhatdan commented Aug 24, 2023

/approve
/lgtm
Nice work.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99, rhatdan, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Luap99,rhatdan,vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit b27c2ba into containers:main Aug 24, 2023
@vrothberg vrothberg deleted the fix-containers/podman/issues/18353 branch August 24, 2023 14:11
@mtrmac
Copy link
Contributor

mtrmac commented Aug 24, 2023

The standard lib will take care of throwing an error in case the downloads starve.

It… will? In principle, a very slow HTTP response body is not unusual. I can’t see any default timeout on reading a HTTP body, neither when making progress nor when the body is idle (if there were one, I think we would already have received bug reports about layer downloads not fitting into the default value).

AFAICS Dialer.KeepAlive ensures the socket is live, but not that the download is making any progress. So checking that at least one byte has arrived in the last 25 seconds seems reasonable to me.

(Warning: I didn’t actually try any of this in practice, I have just briefly read the HTTP 1.x version of the net/http implementation and public API.)

@vrothberg
Copy link
Member Author

vrothberg commented Aug 24, 2023

AFAICS Dialer.KeepAlive ensures the socket is live, but not that the download is making any progress. So checking that at least one byte has arrived in the last 25 seconds seems reasonable to me.

TLS timeouts are common at least. I expected similar timeouts when the download stalls. Wouldn't it otherwise be just too easy to create a DOS server/registry?

@mtrmac
Copy link
Contributor

mtrmac commented Aug 24, 2023

It’s quite possible that the server has timeouts set up to protect itself.

I haven’t found any in the standard library (although I didn’t look that hard), and AFAIK c/image does nothing special other than respect the provided context.Context, if that is configured with a timeout. There is a separate timeout for setting up the connection, but not one for consuming the response body.

@vrothberg
Copy link
Member Author

@mtrmac would it make sense to move such a monitoring logic into c/image/copy directly? I see benefit for all consumers and outside the context of running under systemd.

@mtrmac
Copy link
Contributor

mtrmac commented Aug 24, 2023

Compare https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/ .


I think adding a c/image option to limit… something (e.g. time the connection can be idle while downloading the body) could make sense; it would have more sense if we had a user with a specific use case.

But that would, without other work, do nothing to extend the systemd-imposed timeout.

@mtrmac
Copy link
Contributor

mtrmac commented Aug 24, 2023

(Note that it is already possible to set a global timeout. I think that one is easy for users to understand; once we start explaining network idle times and various connections states, we have probably lost a lot of users’ attention.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

quadlet fails with a timeout when starting
6 participants