Improve the hub health check in deployer #586
Comments
I like the idea!!
I opened catalyst-cooperative/pudl-examples#4 to include dask-gateway in the catalyst coop image, so tests can pass for that.
@sgibson91 noted a weird error message we should investigate here: #819 (comment) |
So I think there are 2 causes for node scale-up when we run our hub health checks:
👍
I would be OK with that, I think...
I think @sgibson91's proposals sound good on both points!
Linking another issue in here too as I think it's a sub-issue of this one (if we were to turn this into a project, for instance) |
Another avenue for testing is being discussed here #1024 (comment) |
For posterity, I also just received this warning from pytest:
PR #1185 separated the hub health check from the logic that deploys a hub. This separation will allow us to implement retry logic on the hub health check without needing to redeploy a hub with every retry.
PR #1189 now means we retry the hub health check a maximum of 3 times, with a timeout of 10 minutes, before we declare a failure in CI.
PR #1190 separates the dask scaling test from the dask compute test. I suggest we only use the
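The retry-with-timeout behaviour described above can be sketched in plain Python. This is an illustrative stand-in, not the deployer's actual API: `check` and all parameter names here are hypothetical, and the 3-attempt / 10-minute values simply mirror the numbers from PR #1189.

```python
import time


def run_with_retries(check, max_attempts=3, timeout_seconds=600, pause_seconds=30):
    """Run a health-check callable until it succeeds or attempts run out.

    `check` is a hypothetical callable returning True on success.
    Each attempt gets its own timeout budget, so a node that is still
    scaling up does not silently consume all remaining attempts.
    """
    for attempt in range(1, max_attempts + 1):
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            try:
                if check():
                    return True
            except Exception:
                pass  # treat an error as a failed probe and keep waiting
            time.sleep(pause_seconds)
        # This attempt timed out; fall through to the next one.
    return False
```

Because the check is now decoupled from deployment (PR #1185), a wrapper like this can re-probe the hub without triggering a redeploy on each retry.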
I believe we can close this as well, given the recent work. @consideRatio would you mind writing up the remaining todos we had from our little sprint as new issues please?
Were those remaining todos already captured, @sgibson91 and @consideRatio? |
I think we are good to move on, but I'm not sure - I didn't recognize the error linked from the third checkbox.
@damianavila the issues are #1206 and #1232 |
Summary
In CI/CD we are seeing a lot of failed deployments when we test a hub's health. The most common cause is that the tests triggered a node to be spun up, and the deployment timed out before that node was ready to accept the test pods. Can we add some delays to our testing framework that give the nodes enough time to spin up, and/or do something like the mybinder.org tests, where 3 attempts are made before reporting failure?
https://github.com/jupyterhub/mybinder.org-deploy/blob/4e818778a4d6980427c6a519f99508dcfdbbc8ea/.github/workflows/cd.yml#L138-L144
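A minimal shell sketch of that "retry before failing" pattern, in the spirit of the mybinder.org workflow linked above. The `run_health_check` function is a hypothetical stand-in for the real check command, and `retry_delay` is an assumed variable, not something the linked workflow defines.

```shell
max_attempts=3
attempt=1
run_health_check() { false; }  # stand-in: replace with the real check command

until run_health_check; do
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "Health check failed after $max_attempts attempts"
    break  # in a real workflow, exit 1 here to fail the job
  fi
  echo "Attempt $attempt failed; waiting before retrying..."
  sleep "${retry_delay:-0}"  # e.g. set retry_delay=60 in CI so nodes can scale up
  attempt=$((attempt + 1))
done
```

The delay between attempts is what gives an autoscaled node time to become ready, so the later attempts probe a cluster that has had a chance to catch up.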
Tasks to complete