
Enable openscapes (and re-enable farallon) deployment by the CI #691

Merged: 5 commits into master on Sep 23, 2021

Conversation

@damianavila (Contributor)

This is essentially using the work I have been doing in #673 to enable the automatic deployment of the openscapes hubs.

This also reverts #689 after the key rotation in #688.

Finally, it also cleans up old leftovers, such as the encrypted static kubeconfig files, and removes the function that authenticated with them.

@damianavila linked an issue on Sep 17, 2021 that may be closed by this pull request
@damianavila mentioned this pull request on Sep 17, 2021
@damianavila (Contributor Author)

It is worth mentioning that this is not using the usual "deployer" user; it is using an existing "2i2cAdministrator" user, as discussed here.

@damianavila (Contributor Author)

Manual deployment seems to work as expected:

$ python3 deployer deploy openscapes staging --skip-hub-health-test
  kops has set your kubectl context to openscapeshub.k8s.local
  Running helm upgrade --install --create-namespace --wait --namespace staging staging hub-templates/daskhub -f /var/folders/mn/8h0hm_395l31nrtn235h29900000gn/T/tmpuze7eh39 -f /var/folders/mn/8h0hm_395l31nrtn235h29900000gn/T/tmpg6iqwta7
  Release "staging" has been upgraded. Happy Helming!
  NAME: staging
  LAST DEPLOYED: Fri Sep 17 17:40:04 2021
  NAMESPACE: staging
  STATUS: deployed
  REVISION: 11
  TEST SUITE: None

but the tests seem to be failing:

$ python3 deployer deploy openscapes staging
  kops has set your kubectl context to openscapeshub.k8s.local
  Running helm upgrade --install --create-namespace --wait --namespace staging staging hub-templates/daskhub -f /var/folders/mn/8h0hm_395l31nrtn235h29900000gn/T/tmpwdvpeoos -f /var/folders/mn/8h0hm_395l31nrtn235h29900000gn/T/tmpu4earntb
  Release "staging" has been upgraded. Happy Helming!
  NAME: staging
  LAST DEPLOYED: Fri Sep 17 17:44:31 2021
  NAMESPACE: staging
  STATUS: deployed
  REVISION: 12
  TEST SUITE: None
  Running hub health check...
  Health check failed!

@damianavila (Contributor Author)

When I looked into the deployment service pod:

$ kubectl describe pod jupyter-deployment-2dservice-2dcheck --namespace staging
  ...
  Events
    Type     Reason            Age                From                Message
    ----     ------            ----               ----                -------
    Normal   TriggeredScaleUp  13m                cluster-autoscaler  pod triggered scale-up: [{notebook-m5-large.openscapeshub.k8s.local 0->1 (max: 20)}]
    Warning  FailedScheduling  10m (x4 over 13m)  default-scheduler   0/1 nodes are available: 1 Insufficient memory.
    Warning  FailedScheduling  10m (x2 over 10m)  default-scheduler   0/2 nodes are available: 1 Insufficient memory, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
    Normal   Scheduled         10m                default-scheduler   Successfully assigned staging/jupyter-deployment-2dservice-2dcheck to ip-172-20-44-51.us-west-2.compute.internal
    Normal   Pulling           9m58s              kubelet             Pulling image "busybox"
    Normal   Pulled            9m56s              kubelet             Successfully pulled image "busybox" in 1.676682658s
    Normal   Created           9m56s              kubelet             Created container volume-mount-ownership-fix
    Normal   Started           9m56s              kubelet             Started container volume-mount-ownership-fix
    Normal   Pulling           9m55s              kubelet             Pulling image "783616723547.dkr.ecr.us-west-2.amazonaws.com/user-image:d78bb6c"
    Normal   Pulled            8m42s              kubelet             Successfully pulled image "783616723547.dkr.ecr.us-west-2.amazonaws.com/user-image:d78bb6c" in 1m12.464977662s
    Normal   Created           8m35s              kubelet             Created container notebook
    Normal   Started           8m35s              kubelet             Started container notebook

I first thought this was a timeout because a new node needed to spin up, but when I ran it again I saw the same failure, and the logs say:

$ kubectl logs jupyter-deployment-2dservice-2dcheck --namespace staging
  [I 2021-09-17 21:03:22.815 LabApp] JupyterLab extension loaded from /srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterlab
  [I 2021-09-17 21:03:22.815 LabApp] JupyterLab application directory is /srv/conda/envs/notebook/share/jupyter/lab
  [I 2021-09-17 21:03:22.823 SingleUserNotebookApp extension:22] nteract extension loaded from /srv/conda/envs/notebook/lib/python3.7/site-packages/nteract_on_jupyter
  [I 2021-09-17 21:03:23.624 SingleUserNotebookApp mixins:557] Starting jupyterhub-singleuser server version 1.3.0
  [I 2021-09-17 21:03:23.628 SingleUserNotebookApp log:181] 302 GET /user/deployment-service-check/ -> /user/deployment-service-check/lab? (@100.112.53.116) 2.30ms
  [W 2021-09-17 21:03:23.633 SingleUserNotebookApp _version:73] jupyterhub version 1.4.2 != jupyterhub-singleuser version 1.3.0. This could cause failure to authenticate and result in redirect loops!
  [I 2021-09-17 21:03:23.633 SingleUserNotebookApp notebookapp:2257] Serving notebooks from local directory: /home/jovyan
  [I 2021-09-17 21:03:23.633 SingleUserNotebookApp notebookapp:2257] Jupyter Notebook 6.2.0 is running at:
  [I 2021-09-17 21:03:23.633 SingleUserNotebookApp notebookapp:2257] http://jupyter-deployment-2dservice-2dcheck:8888/user/deployment-service-check/
  [I 2021-09-17 21:03:23.633 SingleUserNotebookApp notebookapp:2258] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
  [I 2021-09-17 21:03:23.635 SingleUserNotebookApp mixins:538] Updating Hub with activity every 300 seconds
  [I 2021-09-17 21:03:25.630 SingleUserNotebookApp log:181] 200 GET /user/deployment-service-check/api/kernelspecs ([email protected]) 28.51ms
  [I 2021-09-17 21:03:25.937 SingleUserNotebookApp kernelmanager:179] Kernel started: d37f6219-b465-4a67-a656-65a412fd52f2, name: python3
  [I 2021-09-17 21:03:25.937 SingleUserNotebookApp kernelmanager:443] Culling kernels with idle durations > 3600 seconds at 300 second intervals ...
  [I 2021-09-17 21:03:25.937 SingleUserNotebookApp kernelmanager:447] Culling kernels even with connected clients
  [I 2021-09-17 21:03:25.939 SingleUserNotebookApp log:181] 201 POST /user/deployment-service-check/api/kernels ([email protected]) 27.22ms
  [W 2021-09-17 21:03:26.838 SingleUserNotebookApp zmqhandlers:281] No session ID specified
  [I 2021-09-17 21:03:26.844 SingleUserNotebookApp log:181] 101 GET /user/deployment-service-check/api/kernels/d37f6219-b465-4a67-a656-65a412fd52f2/channels ([email protected]) 7.49ms
  [I 2021-09-17 21:03:28.058 SingleUserNotebookApp kernelmanager:222] Starting buffering for d37f6219-b465-4a67-a656-65a412fd52f2:a8c071fb-70f71f292d43bcf6dee0fb65   

This feels like a flaky test to me, unrelated to the auth changes, which seem to be working as expected.
Btw, I also see a lot of test failures in the latest merges: https://github.com/2i2c-org/pilot-hubs/actions

@damianavila (Contributor Author)

@2i2c-org/tech-team, thoughts about the test failures?
Also, any feedback on the PR is welcome 😜

@damianavila (Contributor Author)

Well, it seems that redirecting the test output (to prevent the CI from leaking secrets) is preventing us from getting the actual error...

With the following diff:

diff --git a/deployer/hub.py b/deployer/hub.py
index 5d99f3d..5755d3b 100644
--- a/deployer/hub.py
+++ b/deployer/hub.py
@@ -458,7 +458,20 @@ class Hub:
                 # This can contain sensitive info - so we hide stderr
                 # FIXME: Don't use pytest - just call a function instead
                 print("Running hub health check...")
-                with open(os.devnull, 'w') as dn, redirect_stderr(dn), redirect_stdout(dn):
+                # Show errors locally but redirect on CI
+                gh_ci = os.environ.get('CI', "false")
+                if gh_ci == "true":
+                    print("Testing on CI, redirected output")
+                    with open(os.devnull, 'w') as dn, redirect_stderr(dn), redirect_stdout(dn):
+                        exit_code = pytest.main([
+                            "-q",
+                            "deployer/tests",
+                            "--hub-url", hub_url,
+                            "--api-token", service_api_token,
+                            "--hub-type", self.spec['template']
+                        ])
+                else:
+                    print("Testing locally, do not redirect output")

I was able to get more info:

>                                   raise ValueError(f'execution of cell={i} did not match expected result diff={diff}')
E                                   ValueError: execution of cell=2 did not match expected result diff=--- 
E                                   +++ 
E                                   @@ -0,0 +1,3 @@
E                                   +1+.+0

../../../../miniconda/envs/pilot-hubs/lib/python3.9/site-packages/jhub_client/execute.py:98: ValueError
------------------------------------------------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------------------------------------------------
Starting hub https://staging.openscapes.2i2c.cloud health validation...
Running dask_test_notebook.ipynb test notebook...
Hub https://staging.openscapes.2i2c.cloud not healthy! Stopping further deployments. Exception was execution of cell=2 did not match expected result diff=--- 
+++ 
@@ -0,0 +1,3 @@
+1+.+0.
------------------------------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------------------------------
ERROR    jhub_client.execute:execute.py:97 kernel result did not match expected result diff=--- 
+++ 
@@ -0,0 +1,3 @@
+1+.+0
========================================================================================================== short test summary info ===========================================================================================================
FAILED deployer/tests/test_hub_health.py::test_hub_healthy - ValueError: execution of cell=2 did not match expected result diff=--- 
1 failed in 304.45s (0:05:04)
Health check failed!
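
For reference, here is a minimal sketch of what the complete conditional could look like once the else branch (cut off in the diff above) is filled in; the assumption that the else branch simply repeats the same pytest call without redirection is mine, not something the diff confirms:

import os
import pytest
from contextlib import redirect_stderr, redirect_stdout

def run_health_check(hub_url, service_api_token, hub_type):
    # Hypothetical standalone version of the logic that lives in deployer/hub.py
    pytest_args = [
        "-q",
        "deployer/tests",
        "--hub-url", hub_url,
        "--api-token", service_api_token,
        "--hub-type", hub_type,
    ]
    print("Running hub health check...")
    if os.environ.get("CI", "false") == "true":
        # On CI, hide stdout/stderr so tokens and other secrets cannot leak into logs
        print("Testing on CI, redirected output")
        with open(os.devnull, "w") as dn, redirect_stderr(dn), redirect_stdout(dn):
            exit_code = pytest.main(pytest_args)
    else:
        # Locally, keep the full pytest output so failures like the one above stay visible
        print("Testing locally, do not redirect output")
        exit_code = pytest.main(pytest_args)
    return exit_code

Only where the pytest output goes differs between the two branches; the exit code is returned either way.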

@damianavila (Contributor Author) commented Sep 18, 2021

OK, I will continue on Monday, but the failure seems real...
And it seems we do not have dask installed in the image!
In a daskhub!!
I think I need to go back in time and check previous issues to understand why we ended up in this situation.
I guess this is probably a daskhub that was never used with dask... instead, it was used as a basehub, but that is just a quick guess...

@sgibson91 (Member)

diff --git a/deployer/hub.py b/deployer/hub.py
index 5d99f3d..5755d3b 100644
--- a/deployer/hub.py
+++ b/deployer/hub.py
@@ -458,7 +458,20 @@ class Hub:
                 # This can contain sensitive info - so we hide stderr
                 # FIXME: Don't use pytest - just call a function instead
                 print("Running hub health check...")
-                with open(os.devnull, 'w') as dn, redirect_stderr(dn), redirect_stdout(dn):
+                # Show errors locally but redirect on CI
+                gh_ci = os.environ.get('CI', "false")
+                if gh_ci == "true":
+                    print("Testing on CI, redirected output")
+                    with open(os.devnull, 'w') as dn, redirect_stderr(dn), redirect_stdout(dn):
+                        exit_code = pytest.main([
+                            "-q",
+                            "deployer/tests",
+                            "--hub-url", hub_url,
+                            "--api-token", service_api_token,
+                            "--hub-type", self.spec['template']
+                        ])
+                else:
+                    print("Testing locally, do not redirect output")

I like this. Can we commit this in a new PR?

@damianavila (Contributor Author)

I like this. Can we commit this in a new PR?

Sure, I was planning to do that 😉 !

@damianavila (Contributor Author)

OK, coming back to this one: I was able to add myself as an admin, start a notebook server, and run the dask test notebook, and the error was:

[Screenshot: error from running the dask test notebook, 2021-09-20 21:28]

I was able to track down the decision to provide a daskhub here: #363 (comment), but once I looked into the openscapes image repo, I did not see any dask references.

Some questions:
1- @choldgraf, do you have any recollection of a request from the openscapes folks to have a daskhub instead of a base one?
2- @yuvipanda, according to the info in the issues I linked before, you created the image repo from an existing tutorial. Could it be that the dask stuff was never added to the environment?

Interestingly, farallon and carbonplan inherited their images from the pangeo one, but the openscapes image repo was just using environment/requirements files. This is why the failure shows up in openscapes while it is not surfaced in the other AWS-based hubs. I guess we have some follow-up work to do to prevent this from happening again. Btw, we got lucky that the openscapes people did not try to use dask 😜.
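
As a quick sanity check (a hypothetical snippet, not part of the deployer), something like this run in a notebook cell on the hub confirms which dask packages are actually present in the image:

# Hypothetical check, run from a notebook cell or terminal inside the user image
import importlib.util

for pkg in ("dask", "distributed", "dask_gateway"):
    spec = importlib.util.find_spec(pkg)
    print(f"{pkg}: {'installed' if spec else 'MISSING'}")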

But... going back to this PR, I think the openscapes failure should not block merging it (we can fix the failure in a separate ticket/issue), but I am happy to hear if you disagree (btw, adding reviewers now 😉).

@choldgraf (Member)

If I recall, the openscapes folks said that they wanted Dask, but not Dask Gateway. So maybe it's enough to just use a base hub with Dask installed.

@damianavila (Contributor Author) commented Sep 21, 2021

So maybe it's enough to just use a base hub with Dask installed.

I would agree with that idea, although I think that is probably a discussion for another issue (I will create one later today).

@damianavila (Contributor Author)

Sure, I was planning to do that 😉 !

Opened #694 to track that one.

@damianavila (Contributor Author)

So maybe it's enough to just use a base hub with Dask installed.

I would agree with that idea, although I think that is probably a discussion for another issue (I will create one later today).

Opened #695 to track this one.

@sgibson91 (Member) left a comment


thank you @damianavila!

@choldgraf (Member) left a comment


I can't comment on the specific implementation here, but the general decision looks good to me, and I agree with @damianavila's assessment that they just need Dask, not DG (I've commented in #695 as well). I'm +1 on merging this and taking yet another important step towards AUTOMATE ALL THE THINGS :-)

@damianavila merged commit fed2891 into master on Sep 23, 2021
@damianavila deleted the enable_kops_hubs branch on September 23, 2021 at 21:07
@damianavila (Contributor Author)

Thanks @sgibson91 and @choldgraf for your approvals!
