Randomly fails to become healthy #71

Closed
frenkel opened this issue Nov 22, 2024 · 15 comments

frenkel commented Nov 22, 2024

We've been using Kamal with kamal-proxy for a few weeks now and overall it works great. We do have a random case (1 in 15?) where the deployment fails because the new version doesn't become healthy according to Kamal, even though the logs show a correct 200 status code for the healthcheck. Attached are the logs of such a failure:

2024-11-22T08:19:10.7613515Z 2024-11-22T08:18:40.620061000Z {"time":"2024-11-22T09:18:40.619937363+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":0,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:52158","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7616212Z 2024-11-22T08:18:41.619040000Z {"time":"2024-11-22T09:18:41.618760816+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"dial tcp 127.0.0.1:3000: connect: connection refused"}
2024-11-22T08:19:10.7618864Z 2024-11-22T08:18:41.632961000Z {"time":"2024-11-22T09:18:41.629230642+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":10,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:52166","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7620828Z 2024-11-22T08:18:42.275092000Z [21] Puma starting in cluster mode...
2024-11-22T08:19:10.7621369Z 2024-11-22T08:18:42.276457000Z [21] * Puma version: 6.4.3 (ruby 3.1.6-p260) ("The Eagle of Durango")
2024-11-22T08:19:10.7621866Z 2024-11-22T08:18:42.277440000Z [21] *  Min threads: 3
2024-11-22T08:19:10.7622224Z 2024-11-22T08:18:42.277496000Z [21] *  Max threads: 3
2024-11-22T08:19:10.7622593Z 2024-11-22T08:18:42.277526000Z [21] *  Environment: staging
2024-11-22T08:19:10.7622974Z 2024-11-22T08:18:42.278793000Z [21] *   Master PID: 21
2024-11-22T08:19:10.7623347Z 2024-11-22T08:18:42.279629000Z [21] *      Workers: 2
2024-11-22T08:19:10.7623991Z 2024-11-22T08:18:42.280388000Z [21] *     Restarts: (✔) hot (✖) phased
2024-11-22T08:19:10.7624421Z 2024-11-22T08:18:42.281572000Z [21] * Preloading application
2024-11-22T08:19:10.7624867Z 2024-11-22T08:18:42.281619000Z [21] * Listening on http://0.0.0.0:3000
2024-11-22T08:19:10.7625292Z 2024-11-22T08:18:42.281651000Z [21] Use Ctrl-C to stop
2024-11-22T08:19:10.7625738Z 2024-11-22T08:18:42.312462000Z [21] - Worker 0 (PID: 36) booted in 0.02s, phase: 0
2024-11-22T08:19:10.7626247Z 2024-11-22T08:18:42.328429000Z [21] - Worker 1 (PID: 37) booted in 0.03s, phase: 0
2024-11-22T08:19:10.7626948Z 2024-11-22T08:18:47.622613000Z {"time":"2024-11-22T09:18:47.620309603+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"context canceled"}
2024-11-22T08:19:10.7628306Z 2024-11-22T08:18:47.623511000Z {"time":"2024-11-22T09:18:47.620907601+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":5001,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:52172","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7629858Z 2024-11-22T08:18:53.380502000Z {"time":"2024-11-22T09:18:53.379131964+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"context canceled"}
2024-11-22T08:19:10.7631211Z 2024-11-22T08:18:53.381686000Z {"time":"2024-11-22T09:18:53.379478313+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":5755,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:52184","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7632528Z 2024-11-22T08:18:58.388605000Z {"time":"2024-11-22T09:18:58.383575978+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"context canceled"}
2024-11-22T08:19:10.7633872Z 2024-11-22T08:18:58.389712000Z {"time":"2024-11-22T09:18:58.383848059+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":4999,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:46734","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7635860Z 2024-11-22T08:18:59.716690000Z {"time":"2024-11-22T09:18:59.716474635+01:00","level":"INFO","msg":"Request","path":"/up","status":200,"dur":1331,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":73,"resp_content_type":"text/html; charset=utf-8","remote_addr":"172.18.0.3:46748","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7637841Z 2024-11-22T08:18:59.729143000Z {"time":"2024-11-22T09:18:59.728924097+01:00","level":"INFO","msg":"Request","path":"/up","status":200,"dur":9,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":73,"resp_content_type":"text/html; charset=utf-8","remote_addr":"172.18.0.3:46750","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:19.7084400Z   INFO [fdc801ec] Running docker container ls --all --filter name=^platform-web-staging-18258ef278ae34862e4efeb855abf1960a3a0e37$ --quiet | xargs docker inspect --format '{{json .State.Health}}' on staging.toolsfactory.nl
2024-11-22T08:19:19.7086005Z   INFO [fdc801ec] Finished in 0.453 seconds with exit status 0 (successful).
2024-11-22T08:19:19.7086594Z  ERROR null
2024-11-22T08:19:19.7087661Z   INFO [bb6b6f8e] Running docker container ls --all --filter name=^platform-web-staging-18258ef278ae34862e4efeb855abf1960a3a0e37$ --quiet | xargs docker stop on staging.toolsfactory.nl
2024-11-22T08:19:19.7088948Z   INFO [bb6b6f8e] Finished in 8.501 seconds with exit status 0 (successful).
2024-11-22T08:19:19.7089717Z Releasing the deploy lock...
2024-11-22T08:19:20.0281473Z   Finished all in 68.9 seconds
2024-11-22T08:19:20.0305461Z   ERROR (SSHKit::Command::Failed): Exception while executing on host staging.toolsfactory.nl: docker exit status: 1
2024-11-22T08:19:20.0306650Z docker stdout: Nothing written
2024-11-22T08:19:20.0307308Z docker stderr: Error: target failed to become healthy

What could be the cause of this and how can we prevent it?

kwilczynski commented Nov 24, 2024

@frenkel, have you seen the following issues?

I wonder if some of the proposed workarounds could be applied to your situation.

This kind of issue, a lot of the time, boils down to configuration or something in the runtime environment rather than an actual bug in kamal-proxy or Kamal itself, so it's worth adjusting the configuration first before trying other things.

frenkel commented Nov 24, 2024

Thanks for the reply, but it works fine in the other 14 out of 15 deploys. I use Thruster and port 80, so there is no need to change the app_port.
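
For context, the setting in question is the proxy's app_port, which tells kamal-proxy which container port to send traffic and healthchecks to. A minimal sketch, assuming Kamal 2 option names; with Thruster listening on port 80 the default already matches, so it only needs changing when the app listens elsewhere (e.g. Puma directly on 3000):

# deploy.yml (sketch, option name assumed): only needed when the app
# container listens on a port other than the default of 80.
proxy:
  app_port: 3000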

@kwilczynski

I am having a hard time reproducing this locally, even with a container that is deliberately slow with its responses (starts slowly and staggers replies to the /up endpoint).

Can you share some more details, if possible, about your setup, application, deployment, etc.? Anything custom? Anything special or bespoke that we might not be handling correctly?

I also wonder if there would be a way for you to build (or come up with) a minimal reproducer that triggers this issue reliably. If not, then no worries. We can think of something...

frenkel commented Nov 24, 2024

Thanks for looking into this. I hadn't thought about a timeout, as the docs say it is 30 seconds and the first 200 comes after 17 seconds. Do you think I need to increase it above 30 seconds?

The application is nothing special: just Rails 7.2 with PostgreSQL, Puma and Thruster. The server is a staging server that doesn't receive much traffic, just one or two test users. We did experience it on a production server as well, though.

Is the “context canceled” error normal? And the “ERROR null”?

frenkel commented Nov 25, 2024

This is my config:

deploy.staging.yml

servers:
  web:
    hosts:
      - staging.example.com
  jobs:
    hosts:
      - staging.example.com
    cmd: "bin/jobs"

proxy:
  ssl: true
  hosts:
    - beheer.example.dev
    - inloggen.example.dev
    - development.example.dev
    - development0.example.dev
    - development1.example.dev
    - development2.example.dev
    - development3.example.dev
    - development4.example.dev
    - development5.example.dev
    - development6.example.dev
    - development7.example.dev
    - development8.example.dev
    - development9.example.dev
    - signup.example.dev
    - demo.example.dev

env:
  clear:
    RAILS_ENV: staging
    ASSUME_SSL: true
    RAILS_MAX_THREADS: 3
    WEB_CONCURRENCY: 2

ssh:
  user: staging

deploy.yml

service: platform
image: example-com/platform
logging:
  driver: journald

servers: [] # overridden in destination specific files

registry:
  server: example-container-registry.com
  username: platform
  password:
    - KAMAL_REGISTRY_PASSWORD

builder:
  arch: amd64
  cache:
    type: gha
    options: mode=max
    image: kamal-app-build-cache

# Inject ENV variables into containers (secrets come from .kamal/secrets).
#
env:
  clear:
    TZ: Europe/Amsterdam
  secret:
    - DATABASE_URL

aliases:
  console: app exec --interactive --reuse "bin/rails console"
  shell: app exec --interactive --reuse "bash"
  logs: app logs -f
  dbc: app exec --interactive --reuse "bin/rails dbconsole"

volumes:
  - "app_storage:/rails/storage"
  - "app_cache:/rails/tmp/cache" # for session storage
  - "app_credentials:/rails/config/credentials"

asset_path: /rails/public/assets

require_destination: true

accessories:
  db:
    image: postgres:15
    roles:
      - web
    port: "127.0.0.1:5432:5432"
    env:
      secret:
        - POSTGRES_PASSWORD
        - POSTGRES_USER
        - POSTGRES_DB
    volumes:
      - "db_data:/var/lib/postgresql/data"

@kwilczynski

@frenkel, thank you for more details! A few more questions...

  • What is the version of the OS you deploy to? (you can include the content of the /etc/os-release file here)
    • Adding output of cat /proc/version and /proc/cmdline would be useful, too
  • What is your Docker version? (you can include the output of the docker info and docker version commands here)

That said, looking at the system logs under the system-wide /var/log directory and the output of the dmesg command is also a good idea. Perhaps there is some signal there.

You could try doubling the timeout as an experiment, to see what the outcome would be. As you have a staging environment handy, there should hopefully be no harm in doing so. You know, for science!
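
As a concrete sketch of that experiment, assuming the root-level deploy_timeout option in deploy.yml is what controls how long kamal-proxy waits for the target to report healthy (default 30 seconds):

# deploy.yml (sketch, option name assumed): double the window the proxy
# is given to see a healthy response from the new container.
deploy_timeout: 60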

See for reference:

The context canceled error we are seeing is a timeout and comes from Thruster, per:

Some of the other logs also come from Thruster and originate from the request logging middleware:

The upstream wasn't available for a short time per:

2024-11-22T08:19:10.7616212Z 2024-11-22T08:18:41.619040000Z {"time":"2024-11-22T09:18:41.618760816+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"dial tcp 127.0.0.1:3000: connect: connection refused"}

Then, the listener eventually became ready to accept incoming connections, but it did not respond to the health check in a timely manner, hence the timeout.

There are some 502s, too, but I am not sure about these. Perhaps the upstream (backend) is not ready yet.

Where possible, you could enable the debug log level and then run deployments to your staging environment, perhaps several times, with the aim of triggering the issue.

If you aren't able to reproduce this with repeated deployments of the same release/version, then perhaps the issue is specific to deploying a new version? Possibly, something runs as part of a new release (e.g., an expensive migration, some extra scripts and tasks, etc.) that keeps the service from starting up in a timely manner.

@kevinmcconnell, I am not sure if this issue belongs in kamal-proxy; maybe it would be better to move it under the main Kamal project?

frenkel commented Nov 25, 2024

Thanks for the help, but I don't see how any of this explains why two HTTP 200 responses were ignored.

This is just Debian stable with Docker from the repositories. There is nothing of note in the logs. Debug-level logging won't help here, as it only adds SQL queries, which I won't be able to post.
dmesg output consists of kernel messages, so it has no useful content for a container that doesn't get marked as healthy.

@kwilczynski

@frenkel, I am trying to understand what happens on your machine.

Without a reliable reproducer, we can only speculate, especially since I cannot reproduce this issue locally.

Yes, I know what dmesg is for.

frenkel commented Nov 25, 2024

Of course, I understand. Some of those just seemed so unrelated. I will try to experiment and research it some more.
It seems a successful deploy doesn’t log the healthcheck requests. Any idea how to see them?

@kevinmcconnell (Collaborator)

@frenkel could you share the logs from the proxy container, from the start of the deployment through to the failure? The proxy should be logging its activity, and will include the point where it finds the target to be healthy (if it does). An example of proxy logs for a successful deployment:

{"time":"2024-11-26T08:47:33.832254938Z","level":"INFO","msg":"Deploying","service":"s1","hosts":null,"target":"example-web-1","tls":false}
{"time":"2024-11-26T08:47:33.833367109Z","level":"INFO","msg":"Healthcheck succeeded","status":200}
{"time":"2024-11-26T08:47:33.833382878Z","level":"INFO","msg":"Target health updated","target":"example-web-1","success":true,"state":"healthy"}
{"time":"2024-11-26T08:47:33.833398898Z","level":"INFO","msg":"Deployed","service":"s1","hosts":null,"target":"example-web-1"}

The "target failed to become healthy" error should only happen if the timeout elapses before the proxy receives a healthy response, and my guess is that's what's happening here. But the proxy logs should confirm if that's the case. (The successful /up responses are a bit odd, but I wonder if the healthcheck ran past the end of the timeout; I'll look into that.)

The "context cancelled" errors are, as @kwilczynski says, due to the target failing to respond to a healthcheck request in time. So those are expected in this case. The default timeout for each individual healthcheck request is 5s, so those healthcheck requests are being cancelled after 5s and retried. Retrying should continue until the overall deployment timeout elapses (30s by default).

frenkel commented Nov 26, 2024

Apparently the log setting in my Kamal config is not passed to kamal-proxy, so the kamal-proxy container uses the default Docker JSON logging, which seems to have been cut off at the end of November 23rd. Earlier logs have been removed and I cannot see the lines you asked for. As soon as it occurs again I'll get those logs and post them here.

By the way: should kamal-proxy honor the log settings in the Kamal configuration file? If so, should I file a bug there?

@kevinmcconnell (Collaborator)

> As soon as it occurs again I'll get those logs and post them here.

Sounds good, thanks! I think it's likely that when you get the failure, it's because the app took too long to become healthy. But when we have those logs we'll be able to confirm.

I've also found that the logs do indeed sometimes show some additional healthchecks after the timeout elapses. Those checks are fairly harmless (their result is ignored) but they make it confusing to debug situations like this one. I'll have a fix for that in place shortly.

> By the way: should kamal-proxy honor the log settings in the Kamal configuration file? If so, should I file a bug there?

It's worth checking with the Kamal project about this. I think it makes sense for the log settings to apply everywhere; however, it's possible for a single proxy container to be shared by multiple apps, so there may be some complications to consider. It's worth bringing up the question, I think.

frenkel commented Dec 2, 2024

This just happened twice; these are the kamal-proxy logs as requested by @kevinmcconnell:

{"time":"2024-12-02T10:41:00.399091035Z","level":"INFO","msg":"Deploying","service":"platform-web-staging","hosts":["development.example.com","development0.example.com","development1.example.com","development2.example.com","development3.example.com","development4.example.com","development5.example.com","development6.example.com","development7.example.com","development8.example.com","development9.example.com","signup.example.com","demo.example.com"],"target":"a7535f905149:80","tls":true}
{"time":"2024-12-02T10:41:00.40444214Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:00.404759145Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:01.403435727Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:01.403481459Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:02.401718355Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:02.401757396Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:03.40207107Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:03.402110765Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:04.401743346Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:04.401784087Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:05.403318895Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:05.403371174Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:06.401822548Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:06.401884483Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:07.401796005Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:07.402225342Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:08.401988234Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:08.402056927Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:09.403367853Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:09.403589898Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:10.407372314Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:10.407429313Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:11.402515147Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:11.402667581Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:12.401935481Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:12.40199967Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:13.402440644Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:13.402491164Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:14.401851909Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:14.401904792Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:15.402410838Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:15.402467005Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:16.401890471Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:16.401944788Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:17.401947231Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:17.402013948Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:18.403595159Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": dial tcp 172.18.0.6:80: connect: connection refused"}
{"time":"2024-12-02T10:41:18.403656Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:19.409178893Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:19.409233901Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:20.404535335Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:20.404569404Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:21.408519529Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:21.408547892Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:22.403725398Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:22.40374979Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:23.404734526Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:23.404765689Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:24.833441462Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:24.833465402Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:25.40361428Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:25.403636498Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:26.403818823Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:26.403886946Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:27.6138383Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:27.613875421Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:28.405168044Z","level":"INFO","msg":"Healthcheck failed","status":502}
{"time":"2024-12-02T10:41:28.405221554Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}
{"time":"2024-12-02T10:41:30.401027238Z","level":"INFO","msg":"Target failed to become healthy","hosts":["development.example.com","development0.example.com","development1.example.com","development2.example.com","development3.example.com","development4.example.com","development5.example.com","development6.example.com","development7.example.com","development8.example.com","development9.example.com","signup.example.com","demo.example.com"],"target":"a7535f905149:80"}
{"time":"2024-12-02T10:41:34.453484174Z","level":"INFO","msg":"Healthcheck failed","error":"Get \"http://a7535f905149:80/up\": context deadline exceeded"}
{"time":"2024-12-02T10:41:34.453533045Z","level":"INFO","msg":"Target health updated","target":"a7535f905149:80","success":false,"state":"adding"}

Based on the "context deadline exceeded" it does sound like a timeout indeed. Weird thing is it happened two times and the third time it worked without problems. There were no database migrations to run and no visitors on the server. What could cause this? It feels wrong to just blindly increase a default timeout when the application does not have anything special in it and is not that big (39k LOC).

@kevinmcconnell (Collaborator)

@frenkel thanks for sending those along. From these logs, it does look like the application is taking more than 30s to reach a healthy state. The default deployment timeout is 30s, and you can see where it stops trying 30s into the deployment, at this line:

{"time":"2024-12-02T10:41:30.401027238Z","level":"INFO","msg":"Target failed to become healthy","hosts":["development.example.com","development0.example.com","development1.example.com","development2.example.com","development3.example.com","development4.example.com","development5.example.com","development6.example.com","development7.example.com","development8.example.com","development9.example.com","signup.example.com","demo.example.com"],"target":"a7535f905149:80"}

Kamal Proxy is behaving as expected here.

If your application sometimes takes longer than 30s to start, then increasing the timeout is the right thing to do. It may simply need more time to reliably be ready than it's currently getting.

However if you feel the application should be starting more quickly, then I think you'll need to investigate that on the application side. Given that the deployments are working sometimes, it could be worth comparing the logs when it succeeds -- if it's often fairly close to that 30s mark, maybe you're just seeing small fluctuations in startup time that cause it to sometimes run over?
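
One rough way to make that comparison, as a sketch (assuming the proxy container uses Kamal's default kamal-proxy name): pull the deployment milestones out of the proxy's JSON logs and compare the timestamps.

# The gap between "Deploying" and "Deployed" (or "Target failed to become
# healthy") is the startup window to compare against the 30s deploy timeout.
docker logs kamal-proxy 2>&1 \
  | grep -E '"msg":"(Deploying|Deployed|Target failed to become healthy)"'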

frenkel commented Dec 2, 2024

Thanks. Normal deploys are healthy within 20 seconds, so I guess I'll have to do some more digging to fix this problem on my end.

frenkel closed this as completed Dec 2, 2024