
Enable openscapes (and re-enable farallon) deployment by the CI #691

Merged: 5 commits into master on Sep 23, 2021

Conversation

@damianavila (Contributor)

This is essentially using the work I have been doing in #673 to enable the automatic deployment of the openscapes hubs.

This also reverts #689 after the key rotation in #688.

Finally, it also cleans up old leftovers, such as the encrypted static kubeconfig files, and removes the function that authenticated with them.

@damianavila linked an issue on Sep 17, 2021 that may be closed by this pull request
@damianavila mentioned this pull request on Sep 17, 2021
@damianavila (Contributor Author)

It is worth mentioning that this is not using the usual "deployer" user; it is using an existing "2i2cAdministrator" user, as discussed here.

@damianavila (Contributor Author)

Manual deployment seems to work as expected:

$ python3 deployer deploy openscapes staging --skip-hub-health-test
  kops has set your kubectl context to openscapeshub.k8s.local
  Running helm upgrade --install --create-namespace --wait --namespace staging staging hub-templates/daskhub -f /var/folders/mn/8h0hm_395l31nrtn235h29900000gn/T/tmpuze7eh39 -f /var/folders/mn/8h0hm_395l31nrtn235h29900000gn/T/tmpg6iqwta7
  Release "staging" has been upgraded. Happy Helming!
  NAME: staging
  LAST DEPLOYED: Fri Sep 17 17:40:04 2021
  NAMESPACE: staging
  STATUS: deployed
  REVISION: 11
  TEST SUITE: None

but the tests seem to be failing:

$ python3 deployer deploy openscapes staging
  kops has set your kubectl context to openscapeshub.k8s.local
  Running helm upgrade --install --create-namespace --wait --namespace staging staging hub-templates/daskhub -f /var/folders/mn/8h0hm_395l31nrtn235h29900000gn/T/tmpwdvpeoos -f /var/folders/mn/8h0hm_395l31nrtn235h29900000gn/T/tmpu4earntb
  Release "staging" has been upgraded. Happy Helming!
  NAME: staging
  LAST DEPLOYED: Fri Sep 17 17:44:31 2021
  NAMESPACE: staging
  STATUS: deployed
  REVISION: 12
  TEST SUITE: None
  Running hub health check...
  Health check failed!

@damianavila (Contributor Author)

When I looked into the deployment service pod:

$ kubectl describe pod jupyter-deployment-2dservice-2dcheck --namespace staging
  ...
  Events
    Type     Reason            Age                From                Message
    ----     ------            ----               ----                -------
    Normal   TriggeredScaleUp  13m                cluster-autoscaler  pod triggered scale-up: [{notebook-m5-large.openscapeshub.k8s.local 0->1 (max: 20)}]
    Warning  FailedScheduling  10m (x4 over 13m)  default-scheduler   0/1 nodes are available: 1 Insufficient memory.
    Warning  FailedScheduling  10m (x2 over 10m)  default-scheduler   0/2 nodes are available: 1 Insufficient memory, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
    Normal   Scheduled         10m                default-scheduler   Successfully assigned staging/jupyter-deployment-2dservice-2dcheck to ip-172-20-44-51.us-west-2.compute.internal
    Normal   Pulling           9m58s              kubelet             Pulling image "busybox"
    Normal   Pulled            9m56s              kubelet             Successfully pulled image "busybox" in 1.676682658s
    Normal   Created           9m56s              kubelet             Created container volume-mount-ownership-fix
    Normal   Started           9m56s              kubelet             Started container volume-mount-ownership-fix
    Normal   Pulling           9m55s              kubelet             Pulling image "783616723547.dkr.ecr.us-west-2.amazonaws.com/user-image:d78bb6c"
    Normal   Pulled            8m42s              kubelet             Successfully pulled image "783616723547.dkr.ecr.us-west-2.amazonaws.com/user-image:d78bb6c" in 1m12.464977662s
    Normal   Created           8m35s              kubelet             Created container notebook
    Normal   Started           8m35s              kubelet             Started container notebook

I first thought this was a timeout because a new node needed to spin up, but when I ran it again I saw the same failure, and the logs say:

$ kubectl logs jupyter-deployment-2dservice-2dcheck --namespace staging
  [I 2021-09-17 21:03:22.815 LabApp] JupyterLab extension loaded from /srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab
  [I 2021-09-17 21:03:22.815 LabApp] JupyterLab application directory is /srv/conda/envs/notebook/share/jupyter/lab
  [I 2021-09-17 21:03:22.823 SingleUserNotebookApp extension:22] nteract extension loaded from /srv/conda/envs/notebook/lib/python3.7/site-packages/nteract_on_jupyter
  [I 2021-09-17 21:03:23.624 SingleUserNotebookApp mixins:557] Starting jupyterhub-singleuser server version 1.3.0
  [I 2021-09-17 21:03:23.628 SingleUserNotebookApp log:181] 302 GET /user/deployment-service-check/ -> /user/deployment-service-check/lab? (@100.112.53.116) 2.30ms
  [W 2021-09-17 21:03:23.633 SingleUserNotebookApp _version:73] jupyterhub version 1.4.2 != jupyterhub-singleuser version 1.3.0. This could cause failure to authenticate and result in redirect loops!
  [I 2021-09-17 21:03:23.633 SingleUserNotebookApp notebookapp:2257] Serving notebooks from local directory: /home/jovyan
  [I 2021-09-17 21:03:23.633 SingleUserNotebookApp notebookapp:2257] Jupyter Notebook 6.2.0 is running at:
  [I 2021-09-17 21:03:23.633 SingleUserNotebookApp notebookapp:2257] http://jupyter-deployment-2dservice-2dcheck:8888/user/deployment-service-check/
  [I 2021-09-17 21:03:23.633 SingleUserNotebookApp notebookapp:2258] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
  [I 2021-09-17 21:03:23.635 SingleUserNotebookApp mixins:538] Updating Hub with activity every 300 seconds
  [I 2021-09-17 21:03:25.630 SingleUserNotebookApp log:181] 200 GET /user/deployment-service-check/api/kernelspecs ([email protected]) 28.51ms
  [I 2021-09-17 21:03:25.937 SingleUserNotebookApp kernelmanager:179] Kernel started: d37f6219-b465-4a67-a656-65a412fd52f2, name: python3
  [I 2021-09-17 21:03:25.937 SingleUserNotebookApp kernelmanager:443] Culling kernels with idle durations > 3600 seconds at 300 second intervals ...
  [I 2021-09-17 21:03:25.937 SingleUserNotebookApp kernelmanager:447] Culling kernels even with connected clients
  [I 2021-09-17 21:03:25.939 SingleUserNotebookApp log:181] 201 POST /user/deployment-service-check/api/kernels ([email protected]) 27.22ms
  [W 2021-09-17 21:03:26.838 SingleUserNotebookApp zmqhandlers:281] No session ID specified
  [I 2021-09-17 21:03:26.844 SingleUserNotebookApp log:181] 101 GET /user/deployment-service-check/api/kernels/d37f6219-b465-4a67-a656-65a412fd52f2/channels ([email protected]) 7.49ms
  [I 2021-09-17 21:03:28.058 SingleUserNotebookApp kernelmanager:222] Starting buffering for d37f6219-b465-4a67-a656-65a412fd52f2:a8c071fb-70f71f292d43bcf6dee0fb65   

This feels like a flaky test to me, unrelated to the auth changes, which seem to be working as expected.
Btw, I also see a lot of test failures in the latest merges: https://github.com/2i2c-org/pilot-hubs/actions

@damianavila (Contributor Author)

@2i2c-org/tech-team, thoughts about the test failures?
Also, any feedback on the PR is welcome 😜

@damianavila (Contributor Author)

Well, it seems that redirecting the test output (to prevent the CI from leaking secrets) is preventing us from getting the actual error...

With the following diff:

diff --git a/deployer/hub.py b/deployer/hub.py
index 5d99f3d..5755d3b 100644
--- a/deployer/hub.py
+++ b/deployer/hub.py
@@ -458,7 +458,20 @@ class Hub:
                 # This can contain sensitive info - so we hide stderr
                 # FIXME: Don't use pytest - just call a function instead
                 print("Running hub health check...")
-                with open(os.devnull, 'w') as dn, redirect_stderr(dn), redirect_stdout(dn):
+                # Show errors locally but redirect on CI
+                gh_ci = os.environ.get('CI', "false")
+                if gh_ci == "true":
+                    print("Testing on CI, redirected output")
+                    with open(os.devnull, 'w') as dn, redirect_stderr(dn), redirect_stdout(dn):
+                        exit_code = pytest.main([
+                            "-q",
+                            "deployer/tests",
+                            "--hub-url", hub_url,
+                            "--api-token", service_api_token,
+                            "--hub-type", self.spec['template']
+                        ])
+                else:
+                    print("Testing locally, do not redirect output")

I was able to get more info:

>                                   raise ValueError(f'execution of cell={i} did not match expected result diff={diff}')
E                                   ValueError: execution of cell=2 did not match expected result diff=--- 
E                                   +++ 
E                                   @@ -0,0 +1,3 @@
E                                   +1+.+0

../../../../miniconda/envs/pilot-hubs/lib/python3.9/site-packages/jhub_client/execute.py:98: ValueError
------------------------------------------------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------------------------------------------------
Starting hub https://staging.openscapes.2i2c.cloud health validation...
Running dask_test_notebook.ipynb test notebook...
Hub https://staging.openscapes.2i2c.cloud not healthy! Stopping further deployments. Exception was execution of cell=2 did not match expected result diff=--- 
+++ 
@@ -0,0 +1,3 @@
+1+.+0.
------------------------------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------------------------------
ERROR    jhub_client.execute:execute.py:97 kernel result did not match expected result diff=--- 
+++ 
@@ -0,0 +1,3 @@
+1+.+0
========================================================================================================== short test summary info ===========================================================================================================
FAILED deployer/tests/test_hub_health.py::test_hub_healthy - ValueError: execution of cell=2 did not match expected result diff=--- 
1 failed in 304.45s (0:05:04)
Health check failed!
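
For reference, here is a minimal sketch of what the complete conditional could look like once the else branch (cut off in the diff above) is filled in; the assumption that the else branch simply repeats the same pytest call without redirection is mine, not something the diff confirms:

import os
import pytest
from contextlib import redirect_stderr, redirect_stdout

def run_health_check(hub_url, service_api_token, hub_type):
    # Hypothetical standalone version of the logic that lives in deployer/hub.py
    pytest_args = [
        "-q",
        "deployer/tests",
        "--hub-url", hub_url,
        "--api-token", service_api_token,
        "--hub-type", hub_type,
    ]
    print("Running hub health check...")
    if os.environ.get("CI", "false") == "true":
        # On CI, hide stdout/stderr so tokens and other secrets cannot leak into logs
        print("Testing on CI, redirected output")
        with open(os.devnull, "w") as dn, redirect_stderr(dn), redirect_stdout(dn):
            exit_code = pytest.main(pytest_args)
    else:
        # Locally, keep the full pytest output so failures like the one above stay visible
        print("Testing locally, do not redirect output")
        exit_code = pytest.main(pytest_args)
    return exit_code

Only where the pytest output goes differs between the two branches; the exit code is returned either way.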

@damianavila (Contributor Author) commented Sep 18, 2021

OK, I will continue on Monday, but the failure seems real...
And it seems we do not have dask installed in the image!
In a daskhub!!
I think I need to go back in time and check previous issues to understand why we ended up in this situation.
I guess this is probably a daskhub that was never used with dask... instead, it was used as a basehub, but that is just a quick guess...

@sgibson91 (Member)

diff --git a/deployer/hub.py b/deployer/hub.py
index 5d99f3d..5755d3b 100644
--- a/deployer/hub.py
+++ b/deployer/hub.py
@@ -458,7 +458,20 @@ class Hub:
                 # This can contain sensitive info - so we hide stderr
                 # FIXME: Don't use pytest - just call a function instead
                 print("Running hub health check...")
-                with open(os.devnull, 'w') as dn, redirect_stderr(dn), redirect_stdout(dn):
+                # Show errors locally but redirect on CI
+                gh_ci = os.environ.get('CI', "false")
+                if gh_ci == "true":
+                    print("Testing on CI, redirected output")
+                    with open(os.devnull, 'w') as dn, redirect_stderr(dn), redirect_stdout(dn):
+                        exit_code = pytest.main([
+                            "-q",
+                            "deployer/tests",
+                            "--hub-url", hub_url,
+                            "--api-token", service_api_token,
+                            "--hub-type", self.spec['template']
+                        ])
+                else:
+                    print("Testing locally, do not redirect output")

I like this. Can we commit this in a new PR?

@damianavila (Contributor Author)

I like this. Can we commit this in a new PR?

Sure, I was planning to do that 😉 !

@damianavila (Contributor Author)

OK, coming back to this one: I was able to add myself as an admin, start a notebook server, and run the dask test notebook, and the error was:

[Screenshot: error from running the dask test notebook, 2021-09-20 21:28]

I was able to track down the decision to provide a daskhub here: #363 (comment), but once I looked into the openscapes image repo, I did not see any dask references.

Some questions:
1- @choldgraf, do you have any recollection of a request from the openscapes folks to have a daskhub instead of a base one?
2- @yuvipanda, according to the info in the issues I linked before, you created the image repo from an existing tutorial. Could it be that the dask stuff was never added to the environment?

Interestingly, farallon and carbonplan inherited their images from the pangeo one, but the openscapes image repo was just using environment/requirements files. This is why the failure shows up in openscapes while it is not surfaced in the other AWS-based hubs. I guess we have some follow-up work to do to prevent this from happening again. Btw, we got lucky that the openscapes people did not try to use dask 😜.
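
As a quick sanity check (a hypothetical snippet, not part of the deployer), something like this run in a notebook cell on the hub confirms which dask packages are actually present in the image:

# Hypothetical check, run from a notebook cell or terminal inside the user image
import importlib.util

for pkg in ("dask", "distributed", "dask_gateway"):
    spec = importlib.util.find_spec(pkg)
    print(f"{pkg}: {'installed' if spec else 'MISSING'}")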

But... going back to this PR, I think the openscapes failure should not block merging it (we can fix the failure in a separate ticket/issue), but I am happy to hear if you disagree (btw, adding reviewers now 😉).

@choldgraf (Member)

If I recall, the openscapes folks said that they wanted Dask, but not Dask Gateway. So maybe it's enough to just use a base hub with Dask installed.

@damianavila (Contributor Author) commented Sep 21, 2021

So maybe it's enough to just use a base hub with Dask installed.

I would agree with that idea, although I think that is probably a discussion for another issue (I will create one later today).

@damianavila (Contributor Author)

Sure, I was planning to do that 😉 !

Opened #694 to track that one.

@damianavila (Contributor Author)

So maybe it's enough to just use a base hub with Dask installed.

I would agree with that idea, although I think that is probably a discussion for another issue (I will create one later today).

Opened #695 to track this one.

@sgibson91 (Member) left a comment


thank you @damianavila!

@choldgraf (Member) left a comment


I can't comment on the specific implementation here, but the general decision looks good to me, and I agree with @damianavila's assessment that they just need Dask, not DG (I've commented in #695 as well). I'm +1 on merging this and taking yet another important step towards AUTOMATE ALL THE THINGS :-)

@damianavila merged commit fed2891 into master on Sep 23, 2021
@damianavila deleted the enable_kops_hubs branch on September 23, 2021 at 21:07
@damianavila (Contributor Author)

Thanks @sgibson91 and @choldgraf for your approvals!
