Improve the hub health check in deployer #586
Comments
I like the idea!!
I opened catalyst-cooperative/pudl-examples#4 to include dask-gateway in the catalyst coop image, so tests can pass for that.
@sgibson91 noted a weird error message we should investigate here: #819 (comment) |
So I think there are 2 causes for node scale-up when we run our hub health checks:
👍
I would be OK with that, I think...
I think @sgibson91's proposals sound good on both points!
Linking another issue in here too as I think it's a sub-issue of this one (if we were to turn this into a project, for instance) |
Another avenue for testing is being discussed here #1024 (comment) |
For posterity, I also just received this warning from pytest:
PR #1185 separated the hub health check from the logic that deploys a hub. This separation will allow us to implement retry logic on the hub health check without needing to redeploy a hub with every retry.
PR #1189 now means we retry the hub health check a maximum of 3 times, with a timeout of 10 minutes, before we declare a failure in CI.
PR #1190 separates the dask scaling test from the dask compute test. I suggest we only use the
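The retry-with-timeout behaviour described above can be sketched in plain Python. This is an illustrative stand-in, not the deployer's actual API: `check` and all parameter names here are hypothetical, and the 3-attempt / 10-minute values simply mirror the numbers from PR #1189.

```python
import time


def run_with_retries(check, max_attempts=3, timeout_seconds=600, pause_seconds=30):
    """Run a health-check callable until it succeeds or attempts run out.

    `check` is a hypothetical callable returning True on success.
    Each attempt gets its own timeout budget, so a node that is still
    scaling up does not silently consume all remaining attempts.
    """
    for attempt in range(1, max_attempts + 1):
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            try:
                if check():
                    return True
            except Exception:
                pass  # treat an error as a failed probe and keep waiting
            time.sleep(pause_seconds)
        # This attempt timed out; fall through to the next one.
    return False
```

Because the check is now decoupled from deployment (PR #1185), a wrapper like this can re-probe the hub without triggering a redeploy on each retry.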
I believe we can close this as well, given the recent work. @consideRatio would you mind writing up the remaining todos we had from our little sprint as new issues please?
Were those remaining todos already captured, @sgibson91 and @consideRatio? |
I think we are good to move on, but I'm not sure - I didn't recognize the error linked from the third checkbox.
@damianavila the issues are #1206 and #1232 |
Summary
In CI/CD we are seeing a lot of failed deployments when we test a hub's health. The most common cause is that the tests triggered a node to be spun up, and the deployment timed out before that node was ready to accept the test pods. Can we add some delays to our testing framework that give the nodes enough time to spin up, and/or do something like the mybinder.org tests, where 3 attempts are made before reporting failure?
https://github.com/jupyterhub/mybinder.org-deploy/blob/4e818778a4d6980427c6a519f99508dcfdbbc8ea/.github/workflows/cd.yml#L138-L144
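A minimal shell sketch of that "retry before failing" pattern, in the spirit of the mybinder.org workflow linked above. The `run_health_check` function is a hypothetical stand-in for the real check command, and `retry_delay` is an assumed variable, not something the linked workflow defines.

```shell
max_attempts=3
attempt=1
run_health_check() { false; }  # stand-in: replace with the real check command

until run_health_check; do
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "Health check failed after $max_attempts attempts"
    break  # in a real workflow, exit 1 here to fail the job
  fi
  echo "Attempt $attempt failed; waiting before retrying..."
  sleep "${retry_delay:-0}"  # e.g. set retry_delay=60 in CI so nodes can scale up
  attempt=$((attempt + 1))
done
```

The delay between attempts is what gives an autoscaled node time to become ready, so the later attempts probe a cluster that has had a chance to catch up.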
Tasks to complete