
Health check not working with docker #19

Closed
filiphr opened this issue May 10, 2021 · 17 comments
Labels
bug Something isn't working


@filiphr

filiphr commented May 10, 2021

For some reason, the health check is not being picked up correctly by Docker.

When I do

docker inspect --format "{{json .Config.Healthcheck}}" <container>

I get null back. In addition, when I run docker container ls, no health status is displayed for the running container.

e.g.

CONTAINER ID   IMAGE                      COMMAND                  CREATED              STATUS              PORTS                     NAMES
925ac8dbba4b   gvenzl/oracle-xe:11-slim   "container-entrypoin…"   About a minute ago   Up About a minute   0.0.0.0:49161->1521/tcp   exciting_wu

When it works correctly, the status is shown in parentheses after the uptime (e.g. "Up About a minute (healthy)").

GitHub Actions calls docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" <container> to check the status of the container.

I tried passing a custom --health-cmd to get GitHub Actions to work, but unfortunately I still couldn't make it work. GitHub only reports the status and nothing more, so I can't even check the health check logs to see what is happening inside the container.
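The readiness check described above can be reproduced locally. Below is a minimal Python sketch, assuming a local Docker CLI. It reuses the Go template quoted from the GitHub Actions log, while the helper names and the verdict strings (`ready`, `wait`, `failed`, `no-healthcheck`) are hypothetical and not taken from the GitHub runner. Because the template only prints `.State.Health.Status` inside `{{if .Config.Healthcheck}}`, a container without a health check yields empty output, which matches the `null` seen above.

```python
import subprocess

# Go template quoted from the GitHub Actions log in this issue. When the
# container has no health check, the {{if}} branch is skipped and docker
# prints an empty line.
HEALTH_TEMPLATE = "{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}"


def classify(output: str) -> str:
    """Map raw `docker inspect` output to a hypothetical readiness verdict."""
    status = output.strip()
    if status == "":
        return "no-healthcheck"  # .Config.Healthcheck is null
    if status == "healthy":
        return "ready"
    if status == "starting":
        return "wait"
    return "failed"  # e.g. "unhealthy"


def read_status(container: str) -> str:
    """Run the same inspect call as the runner (needs a local Docker CLI)."""
    result = subprocess.run(
        ["docker", "inspect", f"--format={HEALTH_TEMPLATE}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return classify(result.stdout)
```

For example, `read_status("925ac8dbba4b")` against the container listed above would be expected to return `no-healthcheck` while the image's health check is not being picked up.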

@filiphr
Author

filiphr commented May 10, 2021

I did some investigation and compared it to how it is done for the Oracle Database images in the oracle/docker-images repo, and the only difference I could see is what the HEALTHCHECK configuration looks like.

I tried the following:

FROM gvenzl/oracle-xe:11-slim

HEALTHCHECK --interval=1m --start-period=5m \
   CMD "${ORACLE_BASE}"/healthcheck.sh >/dev/null || exit 1

and with this it works correctly.

e.g.

CONTAINER ID   IMAGE         COMMAND                  CREATED         STATUS                   PORTS                     NAMES
182d85cd92ee   oracle-test   "container-entrypoin…"   2 minutes ago   Up 2 minutes (healthy)   0.0.0.0:49161->1521/tcp   dreamy_kalam

I cannot explain why adding --interval=1m --start-period=5m (I guess the values themselves are not important) makes it work correctly with Docker.
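For reference, the Docker documentation describes how these options drive the reported status: failed probes during the start period are not counted, `retries` consecutive failures mark the container unhealthy, and a single passing probe marks it healthy. Below is a rough Python model of that state machine. It is a sketch, not Docker's implementation, and it assumes probes fire exactly every `interval` seconds and ignores `--timeout`; `HealthConfig` and `simulate` are hypothetical names. It only illustrates how the flags affect the status once a health check exists; it does not explain why the check was missing altogether.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthConfig:
    # Defaults match Docker's documented HEALTHCHECK defaults (seconds).
    interval: int = 30
    start_period: int = 0
    retries: int = 3


def simulate(results: List[bool], cfg: HealthConfig) -> List[str]:
    """Return the reported health status after each probe.

    Simplifying assumptions: probe i runs exactly (i + 1) * interval
    seconds after the container starts, and --timeout is ignored.
    """
    status = "starting"
    failing_streak = 0
    history = []
    for i, passed in enumerate(results):
        elapsed = (i + 1) * cfg.interval
        if passed:
            # A single successful probe resets the streak and marks it healthy.
            failing_streak = 0
            status = "healthy"
        elif status == "starting" and elapsed < cfg.start_period:
            # Failures inside the start period are not counted.
            pass
        else:
            failing_streak += 1
            if failing_streak >= cfg.retries:
                status = "unhealthy"
        history.append(status)
    return history
```

Under this model, with the options used later in this thread (`--health-interval 20s`, `--health-retries 10`), a probe that never passes keeps the container in `starting` for nine probes and flips it to `unhealthy` on the tenth, about 200 seconds in.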

@gvenzl gvenzl self-assigned this May 11, 2021
@gvenzl gvenzl added the bug (Something isn't working) label and removed the enhancement (New feature or request) label May 11, 2021
@gvenzl
Owner

gvenzl commented May 11, 2021

Thanks @filiphr, I'll have to take a look at that.

I remember something about buildah not picking up the Docker healthcheck or that it doesn't even exist for OCI based images, see: containers/buildah#2388

Since I'm building these images via buildah and running them with podman, I think I never came across that issue again.

I'll add --interval; perhaps that's the issue here. It would make sense if the health check were only issued once and never again. If that doesn't do the trick, I'll have to dig a bit deeper and see what the current status of health checks for OCI-based images is.

@filiphr
Author

filiphr commented May 11, 2021

It could indeed be the way the image was built. Anyway, I realized what the problem was: I had forgotten the ORACLE_PASSWORD 🤦. It doesn't help that the logs in GitHub Actions just say "unhealthy".

I managed to get GitHub Actions to understand the command by using the following options:

--health-cmd="\${ORACLE_BASE}/healthcheck.sh >/dev/null || exit 1"
--health-interval 20s
--health-timeout 10s
--health-retries 10

Perhaps this could be added to the docs as an option for the health command?

I am still investigating why GitHub Actions ends up in an unhealthy state 3 or 4 minutes after starting the container (it never reaches the healthy state).

@gvenzl
Owner

gvenzl commented May 16, 2021

Awesome, thanks a lot, @filiphr!
Happy to hear that you got it working again!

Yeah, absolutely. GitHub Actions is an outstanding action item for me anyway. I think it's worth writing up how to use these images with GitHub Actions in general. I'll definitely make sure to document this part as well!

Meanwhile, I'm trying to reproduce the issue with the health check. I seem to remember that when I added --interval back then, buildah would not build the image at all but threw an error instead.
I'm about to test that theory.

@filiphr
Author

filiphr commented May 16, 2021

Well, I only got it to the point where it recognizes the health command. However, the service still doesn't really start. I've tried some things out in https://github.com/filiphr/oracle-docker-test.

Thanks for documenting it.

@gvenzl
Owner

gvenzl commented May 16, 2021

Hm, they all die 32 seconds in, saying that the initialization didn't work. It still looks like some sort of timeout, but I'm not very well versed in GH Actions, so I need to read up on this. Thanks a lot for flagging this and for the tests you shared. I'll keep you posted, and please do the same if you find anything.

@filiphr
Author

filiphr commented May 16, 2021

It is more than one check. GitHub checks it several times with increasing backoff intervals. There are at least 5 checks 32 seconds apart. The container doesn't report that it is up and running.

Locally, with the same command, it works great. Perhaps we need to reach out to someone at GitHub to help us out.

@gvenzl
Owner

gvenzl commented May 16, 2021

Oh, so it's the accumulation of e.g. 5 x 32 sec. That explains why the overall run takes 4+ minutes in total.

I'm just reading up on the GitHub Actions documentation, but it will take a while...

@filiphr
Author

filiphr commented May 16, 2021

It actually does 12 checks.

Waiting for all services to be ready
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 2 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 4 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 7 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 17 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 30 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 32 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 32 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 32 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 32 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  starting
  oracle service is starting, waiting 32 seconds before checking again.
  /usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" db5ef0ca618092f031d8785ef1cc634f879299d0b331275df5208c0d91eea6f1
  unhealthy

Could this be due to the way the image is built? When I inspect the image with wagoodman/dive, I can only see two layers.

If you want to try out different builds, you could use the support for private registries.
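The wait times in the pasted log can be added up to see how long the runner keeps polling before it gives up. A small Python check; the numbers are copied from the "waiting N seconds before checking again" lines above, and the sum ignores the time spent in the `docker inspect` calls themselves.

```python
# Wait times (seconds) copied from the "waiting N seconds before checking
# again" lines of the log above, in order.
waits = [2, 4, 7, 17, 30, 32, 32, 32, 32, 32]

total = sum(waits)
minutes, seconds = divmod(total, 60)
print(f"total wait: {total} s ({minutes} min {seconds} s)")
# prints: total wait: 220 s (3 min 40 s)
```

Roughly 3 min 40 s of pure waiting is consistent with the "3 or 4 minutes" before the service is reported unhealthy.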

@gvenzl
Owner

gvenzl commented May 17, 2021

I think I got it.
There is actually an invalid = between --health-cmd and the command itself. However, removing it still didn't do the trick; it looks like the string itself was still failing or invalid.

Try this instead:

    services:
      oracle:
        image: gvenzl/oracle-xe:11-slim
        env:
          ORACLE_RANDOM_PASSWORD: true
        ports:
          - 1521:1521
        options: >-
          --health-cmd healthcheck.sh
          --health-interval 20s
          --health-timeout 10s
          --health-retries 10

Seems to work fine:

Action definition: https://github.com/gvenzl/github-actions-tests/blob/main/.github/workflows/learn-github-actions.yml

Job: https://github.com/gvenzl/github-actions-tests/runs/2597012866?check_suite_focus=true
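To try the same service settings outside GitHub Actions, they can be passed to `docker run` with the standard `--env`, `--publish` and `--health-*` flags. Below is a sketch in Python that only builds the argument list; `oracle_run_args` is a hypothetical helper, and this is not necessarily how the GitHub runner translates the `services` block.

```python
from typing import List


def oracle_run_args(image: str = "gvenzl/oracle-xe:11-slim") -> List[str]:
    """Build a `docker run` argument list mirroring the services block above."""
    return [
        "docker", "run", "--detach",
        "--env", "ORACLE_RANDOM_PASSWORD=true",
        "--publish", "1521:1521",
        # Health options copied from the `options:` entry of the workflow.
        "--health-cmd", "healthcheck.sh",
        "--health-interval", "20s",
        "--health-timeout", "10s",
        "--health-retries", "10",
        image,
    ]
```

With a local Docker daemon, `subprocess.run(oracle_run_args(), check=True)` starts the container. Passing a list instead of a single command string avoids shell quoting entirely, so each option value reaches Docker as exactly one argument.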

@filiphr
Author

filiphr commented May 17, 2021

This is embarrassing 🤦. I tried so many things but never tried plain healthcheck.sh. I tried it with ${ORACLE_BASE} in front, but that didn't work. Thanks a lot for giving it a go.

I think this was the last step, and now I should be able to migrate our GitHub Actions to use your image. Thanks a lot for all the help!

@gvenzl
Owner

gvenzl commented May 17, 2021

Don't be, I googled around and found many variations of that option and no easy way to tell which one actually works! :D

It certainly looks like ="... is not the correct syntax, but I also tried without the = and that didn't work either, so I assumed that perhaps the environment variable wasn't being resolved, but I'm not sure. 🤷🏻‍♂️

In any case, give that a go and see whether it works, please.

Meanwhile, I will start documenting the GitHub Actions steps once I've done some more tests and combinations.

@filiphr
Author

filiphr commented May 17, 2021

I just checked, and ="... should work. At least it does for our DB2, SQL Server and MySQL variants.

I think the problem is in the evaluation of the environment variable. I have no idea how that is being resolved, since GitHub Actions also resolves variables.

Anyway, your last example (just using healthcheck.sh) is the most straightforward approach and the easiest to use. It is similar to the one for Postgres.

@gvenzl
Owner

gvenzl commented May 19, 2021

Ha, interesting, at least that explains why I found references with and without = out there :D
That's good to know though, thank you!

Great, that's good to hear! I will keep this issue open as a reminder for myself to document the GitHub Action part.

One last question looking at your other database variants: do you know whether health-check then in general does not work without providing a health-check command? Or have just all database images moved on to no longer build Docker images but OCI buildah images that do not have the Docker style health-check any longer?

@filiphr
Author

filiphr commented May 19, 2021

One last question looking at your other database variants: do you know whether health-check then in general does not work without providing a health-check command? Or have just all database images moved on to no longer build Docker images but OCI buildah images that do not have the Docker style health-check any longer?

That's a good question. I didn't realise this earlier. I'm not sure whether the other DBs have a HEALTHCHECK in their definitions. It seems the only image for which we didn't need to define the health check was our Oracle Docker image, which we build with Docker from this Dockerfile (the official one from Oracle).

I've found the issues for some of the other DBs explaining why they have not added a HEALTHCHECK, or showing that there is only an open feature request.

I raised this issue because it worked for us with our previous Oracle image. However, after reading all of this and our previous discussion, just documenting it would be more than enough for me. I remember that it took me a while to figure out the health checks I needed for every single DB last time. Your GitHub Actions repo is also a great example.

@gvenzl
Owner

gvenzl commented Jun 5, 2021

Thanks a lot for that research, @filiphr!

I actually always found the HEALTHCHECK in Docker rather useful. How else would one be able to tell whether a container is in good shape without breaking into the black box and looking? (That's also why I put it into the official Docker image back then ;) )

While I understand the arguments that they bring up in these issues, and the general standpoint from Docker, I don't fully agree with them. But I don't think there is much that can be done. As they say, Kubernetes has its own interpretation of the readiness of a container, and Podman currently does not support it at all, and I sense that the future lies more in Podman than Docker. So even if it's fixed for Docker, it will be an issue for other environments and hence not reliable.

In any case, the healthcheck.sh script stays within the images here, just like Postgres has its pg_isready.

I will document the usage of the healthcheck with GitHub Actions directly in the ReadMe or an FAQ, which should hopefully help future users and give them exactly what they need.

Once again, thanks so much for your patience and efforts here, I really appreciate it!

@filiphr
Author

filiphr commented Jun 5, 2021

It was my pleasure @gvenzl, glad I could help in any way possible.

I really appreciate the work you are doing, it has made our life testing Oracle DB way easier.
