
Improve the hub health check in deployer #586

Closed
2 of 3 tasks
Tracked by #879
sgibson91 opened this issue Aug 6, 2021 · 16 comments
Labels: Enhancement (An improvement to something or creating something new.)

@sgibson91
Member

sgibson91 commented Aug 6, 2021

Summary

In CI/CD we are seeing a lot of failed deployments when we test a hub's health. The most common reason is that the tests triggered a node to be spun up and things timed out before that node was ready to accept the test pods. Can we add some delays to our testing framework that allow the nodes enough time to be spun up and/or do something like the mybinder.org tests where 3 attempts are made before reporting failure?

https://github.com/jupyterhub/mybinder.org-deploy/blob/4e818778a4d6980427c6a519f99508dcfdbbc8ea/.github/workflows/cd.yml#L138-L144
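Something like the following sketch captures the retry idea, assuming a hypothetical run_health_check() callable that raises on failure (the names and timings are illustrative, not the deployer's actual API):

```python
import time


def run_with_retries(run_health_check, max_attempts=3, delay=60):
    """Call `run_health_check` up to `max_attempts` times before giving up.

    Sleeping `delay` seconds between attempts gives an autoscaled node time
    to come up before the next try.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            run_health_check()
            return
        except Exception as exc:
            print(f"Health check attempt {attempt}/{max_attempts} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(delay)
```

The same effect can also be had at the workflow level, by re-running the test step a fixed number of times before the job is marked failed, which is the "3 attempts before reporting failure" behavior described for mybinder.org above.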

Tasks to complete

@sgibson91 added the Task and Enhancement labels and removed the Task label on Aug 6, 2021
@damianavila
Contributor

Can we add some delays to our testing framework that allow the nodes enough time to be spun up and/or do something like the mybinder.org tests where 3 attempts are made before reporting failure?

I like the idea!!

@yuvipanda
Member

I opened catalyst-cooperative/pudl-examples#4 to include dask-gateway in the catalyst coop image, so tests can pass for that.

@choldgraf
Member

@sgibson91 noted a weird error message we should investigate here: #819 (comment)

@sgibson91
Member Author

So I think there are 2 causes for node scale-up when we run our hub health checks:

  1. The addition of the deployment-service-check pod (which mimics a user) triggers a scale-up. We can try to make this more robust by not timing out tests while the pod is in any state other than Ready, retrying multiple times, and adding delays as described above (see the sketch after this list). But generally, we should eat this cost I think.
  2. When we test a daskhub, we also test that a dask cluster can be requested and scaled up. This causes another scale-up of the dask nodepools. Is it really efficient to be testing that feature every time we deploy a hub? Could we not run that kind of test as a cron job in the cluster instead? And run it manually when we know we've triggered a change in the dask helm chart?
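A rough sketch of the first idea, polling the test pod until it reports Ready instead of failing on a fixed timeout, using the official kubernetes Python client (the pod name, namespace, and wait times are placeholders, not the deployer's actual code):

```python
import time

from kubernetes import client, config


def wait_for_pod_ready(name, namespace, poll_interval=10, max_wait=600):
    """Poll a pod until its Ready condition is True rather than failing fast."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    deadline = time.time() + max_wait
    while time.time() < deadline:
        pod = v1.read_namespaced_pod(name=name, namespace=namespace)
        conditions = pod.status.conditions or []
        if any(c.type == "Ready" and c.status == "True" for c in conditions):
            return
        # The pod may sit in Pending for a while if the autoscaler is still
        # bringing up a new node, which is exactly the case we don't want to
        # treat as a failure.
        time.sleep(poll_interval)
    raise TimeoutError(f"Pod {namespace}/{name} was not Ready within {max_wait}s")
```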

@damianavila
Contributor

But generally, we should eat this cost I think.

👍

Could we not run that kind of test as a cron job in the cluster instead? And run it manually when we know we've triggered a change in the dask helm chart?

I would be OK with that, I think...
In fact, I would love to have automatic running of dask tests when we change things that are dask-related, but not sure how realistic that request would be...

@choldgraf
Member

I think @sgibson91's proposals sound good on both points!

@sgibson91
Member Author

Linking another issue in here too as I think it's a sub-issue of this one (if we were to turn this into a project, for instance)

@sgibson91
Member Author

Another avenue for testing is being discussed here #1024 (comment)

@sgibson91
Member Author

For posterity, I also just received this warning from pytest:

===================================================================== warnings summary ======================================================================
../../../../../../usr/local/Caskroom/miniconda/base/envs/pilot-hubs/lib/python3.9/site-packages/pytest_asyncio/plugin.py:191
  /usr/local/Caskroom/miniconda/base/envs/pilot-hubs/lib/python3.9/site-packages/pytest_asyncio/plugin.py:191: DeprecationWarning: The 'asyncio_mode' default value will change to 'strict' in future, please explicitly use 'asyncio_mode=strict' or 'asyncio_mode=auto' in pytest configuration file.
    config.issue_config_time_warning(LEGACY_MODE, stacklevel=2)
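For reference, the change the warning asks for is a single line in the pytest configuration, e.g. in pytest.ini (the same key works under [tool.pytest.ini_options] in pyproject.toml); whether strict or auto is the right choice depends on how the deployer's async tests are written:

```ini
[pytest]
asyncio_mode = auto
```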

@sgibson91
Member Author

PR #1185 separated the hub health check from the logic that deploys a hub. This separation will allow us to implement retry logic on the hub health check without needing to redeploy a hub with every retry.

@damianavila damianavila moved this to In Progress ⚡ in Sprint Board Apr 13, 2022
@sgibson91
Member Author

PR #1189 now means we retry the hub health check a maximum of 3 times, with a timeout of 10 minutes, before we declare a failure in CI.

@sgibson91
Member Author

PR #1190 separates the dask scaling test from the dask compute test. I suggest we only use the --check-dask-scaling flag locally, when we know we are working on dask-gateway-related features.
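For context, the scaling check exercises roughly the following dask-gateway flow; this is a hedged sketch rather than the deployer's actual test code, it assumes the gateway address is already configured in the hub environment, and the worker count is arbitrary:

```python
from dask_gateway import Gateway

# On a daskhub, Gateway() picks up its address from the hub's dask configuration
gateway = Gateway()
cluster = gateway.new_cluster()
try:
    # Requesting workers is what triggers the dask nodepool scale-up discussed above
    cluster.scale(2)
    client = cluster.get_client()
    # A trivial task confirms the workers actually came up and can compute
    assert client.submit(lambda x: x + 1, 1).result() == 2
finally:
    cluster.shutdown()
```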

@sgibson91
Member Author

I believe we can close this as well, given recent work. @consideRatio would you mind writing up the remaining todos we had from our little sprint as new issues please?

Repository owner moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog Apr 14, 2022
@damianavila
Contributor

@consideRatio would you mind writing up the remaining todos we had from our little sprint as new issues please?

Were those remaining todos already captured, @sgibson91 and @consideRatio?

@consideRatio
Contributor

I think we are good to move on, but I'm not sure - I didn't recognize the error linked from the third checkbox.

@sgibson91
Member Author

Were those remaining todos already captured

@damianavila the issues are #1206 and #1232
