Randomly fails to become healthy #71
@frenkel, have you seen the following issues?
I wonder if some of the proposed workarounds could be applied to your situation. A lot of the time, this issue boils down to a configuration problem or something in the runtime environment rather than an actual bug in kamal-proxy or Kamal itself, so it's worth pursuing the avenue of adjusting the configuration first before trying other things.
Thanks for the reply, but it works fine in 14 out of 15 deploys. I use Thruster and port 80, so there's no need to change the app_port.
I am having a hard time reproducing this locally, even with a container that deliberately starts slowly and staggers its replies to the health check.

Can you share some more details, if possible, about your setup, application, deployment, etc.? Anything custom? Anything special or bespoke that we might not be handling correctly?

I also wonder if there would be a way for you to build (or come up with) a minimal reproducer that triggers this issue reliably. If not, then no worries, we can think of something...
Thanks for looking into this. I hadn't thought about a timeout, as the docs say it is 30 seconds and the first 200 comes after 17 seconds. Do you think I need to increase it above 30 seconds?

The application is nothing special: just Rails 7.2 with PostgreSQL, Puma and Thruster. The server is a staging server that doesn't receive much traffic, just one or two test users. We did experience it on a production server as well, though.

Is the "context canceled" error normal? And the "Error: null"?
This is my config: deploy.staging.yml
deploy.yml
@frenkel, thank you for the extra details! A few more questions...

That said, looking at the system-wide logs: you could try doubling the timeout as an experiment, to see what the outcome would be. As you have a staging environment handy, there should hopefully be no harm done in doing so. You know, for science!

Some of the other logs also come from Thruster and originate from its request-logging middleware. The upstream wasn't available for a short time, per:

2024-11-22T08:19:10.7616212Z 2024-11-22T08:18:41.619040000Z {"time":"2024-11-22T09:18:41.618760816+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"dial tcp 127.0.0.1:3000: connect: connection refused"}

Then the listener eventually became ready to accept an incoming connection, but it did not respond to the health check in a timely manner, hence the timeout. There are some 502s, too, but I am not sure about these; perhaps the upstream (backend) was not ready yet.

Where possible, you could enable the debug log level and then run deployments to your staging environment, perhaps several times, with the aim of triggering the issue. If you cannot reproduce this with repeated deployments of the same release/version, then perhaps the issue lies with a new version: possibly something that runs as part of a new release (e.g., an expensive migration, extra scripts and tasks, etc.) delays the service from starting up in a timely manner.

@kevinmcconnell, I am not sure this issue belongs to kamal-proxy; maybe it would be better to move it under the main Kamal project?
Thanks for the help, but I don't see how any of this explains two HTTP 200 responses being ignored. This is just Debian stable with Docker from the repositories. There is nothing of note in the logs. Debug-level logging won't help here, as it only adds SQL queries, which I won't be able to post.
@frenkel, I am trying to understand what happens on your machine. Without a reliable reproducer, we can only speculate, especially since I cannot reproduce this issue locally.
Of course, I understand. Some of those just seemed so unrelated. I will try to experiment and research it some more.
@frenkel could you share the logs from the proxy container, from the start of the deployment through to the failure? The proxy should be logging its activity, and will include the point where it finds the target to be healthy (if it does). An example of proxy logs for a successful deployment:
The "target failed to become healthy" error should only happen if the timeout elapses before the proxy receives a healthy response, and my guess is that's what's happening here. But the proxy logs should confirm if that's the case.

The "context cancelled" errors are, as @kwilczynski says, due to the target failing to respond to a healthcheck request in time, so those are expected in this case. The default timeout for each individual healthcheck request is 5s, so those healthcheck requests are being cancelled after 5s and retried. Retrying should continue until the overall deployment timeout elapses (30s by default).
Apparently the log setting of my Kamal config is not passed to kamal-proxy, so the logs of the kamal-proxy container use Docker's default JSON logging, which seems to have been cut off at the end of November 23rd. Earlier logs are removed, so I cannot see the lines you asked for. As soon as it occurs again, I'll get those logs and post them here. By the way: should kamal-proxy honor the log settings in the Kamal configuration file? If so, should I file a bug there?
Sounds good, thanks! I think it's likely that when you get the failure, it's because the app took too long to become healthy. But when we have those logs we'll be able to confirm. I've also found that the logs do indeed sometimes show some additional healthchecks after the timeout elapses. Those checks are fairly harmless (their result is ignored) but they make it confusing to debug situations like this one. I'll have a fix for that in place shortly.
It's worth checking with the Kamal project about this. I think it makes sense for the log settings to apply everywhere; however, it's possible for a single proxy container to be shared by multiple apps, so there may be some complications to consider. It's worth raising the question, I think.
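For context, Kamal does expose a top-level `logging` section for Docker log options; whether those settings reach the proxy container is exactly the open question here, so treat this fragment as a sketch to verify against the Kamal docs rather than a confirmed behavior:

```yaml
# deploy.yml (sketch; assumes Kamal's top-level logging options)
logging:
  driver: json-file
  options:
    max-size: 100m   # cap each JSON log file so lines rotate instead of growing unbounded
    max-file: "3"    # keep a few rotated files so older deploys stay inspectable
```

With bounded rotation like this, logs from a failed deploy a few days back are less likely to have been discarded before anyone looks at them.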
This just happened twice; these are the kamal-proxy logs, as requested by @kevinmcconnell:
Based on the "context deadline exceeded", it does sound like a timeout indeed. The weird thing is that it happened twice, while the third time it worked without problems. There were no database migrations to run and no visitors on the server. What could cause this? It feels wrong to just blindly increase a default timeout when the application has nothing special in it and is not that big (39k LOC).
@frenkel thanks for sending those along. From these logs, it does look like the application is taking more than 30s to reach a healthy state. The default deployment timeout is 30s, and you can see where it stops trying 30s into the deployment, at this line:
Kamal Proxy is behaving as expected here. If your application sometimes takes longer than 30s to start, then increasing the timeout is the right thing to do; it may simply need more time to be reliably ready than it's currently getting. However, if you feel the application should be starting more quickly, then you'll need to investigate that on the application side. Given that the deployments work sometimes, it could be worth comparing the logs when they succeed -- if the startup time is often fairly close to that 30s mark, maybe you're just seeing small fluctuations that sometimes push it over the limit.
Thanks. Normal deploys are healthy within 20 seconds, so I guess I'll have to do some more digging to fix this problem for me. |
We've been using Kamal with kamal-proxy for a few weeks now and overall it works great. We do have a random case (1 in 15?) where the deployment fails because the new version doesn't become healthy according to Kamal, although the logs do show a correct status code 200 for the health check. Attached are the logs of such a failure:
What could be the cause of this and how can we prevent it?