[Incident] PaleoHack User sessions won't start #790

Closed · 4 tasks done
choldgraf opened this issue Oct 28, 2021 · 18 comments

@choldgraf
Member

choldgraf commented Oct 28, 2021

Summary

The PaleoHackWeek hub is not putting users on the new nodes that were pre-cooked for the event. Here are the hub logs:

2021-10-28T17:38:46Z [Warning] 0/23 nodes are available: 20 Insufficient cpu, 3 node(s) didn't match node selector.
2021-10-28T17:38:48Z [Normal] pod didn't trigger scale-up: 3 node(s) didn't match node selector, 1 max node group size reached

In addition, I'm not sure if this is relevant, but it seems like all of the CPU commit percentages are pinned at 62.6%, and I'm not sure why. I wonder if that's related to the "20 Insufficient cpu" error above. Here's an image of the Grafana plot:

[Grafana screenshot: CPU commit % pinned at 62.6% across nodes]

Timeline (if relevant)

See the comments below for the timeline details; everything happened within a few hours. Here is a summary:


After-action report

What went wrong

Two major things:

  • We had a sub-optimal helm config that set only resource limits for user pods, which caused Kubernetes to default each pod's resource requests to its limits.
  • The PaleoHack image didn't have nbgitpuller on it, so their nbgitpuller links didn't work.

Action items

These are only sample subheadings. Every action item should have a GitHub issue
(even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in infrastructure/; they can be in other repositories.

Process improvements

  1. Add steps to send to the community representative to make sure the hub is working properly before the event (# )

Documentation improvements

  1. Improve documentation and best practices around events docs#111

Technical improvements

  1. Enforce that guarantees are always set for userpods, not just limits: Set 0.05 CPU guarantee for all user pods #801
  2. Validate our config against helm schema in the PR via CI: Helm template validation in our CI/CD #279

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • Incident title and after-action report are cleaned up
  • All actionable items above have linked GitHub Issues
@choldgraf
Member Author

Assigning @damianavila and @yuvipanda for visibility since I think they are the ones that did the pre-cooking

@consideRatio
Contributor

This is because the pods request the same resources as they are limited to, so only a single user fits on each node.

    resources:
      limits:
        cpu: "2"
        memory: "4294967296"    # 4 GiB
      requests:
        cpu: "2"                # equal to the limit, so each user reserves 2 full CPUs
        memory: "268435456"     # 256 MiB

By setting only the limit, the requests defaulted to the limit. The solution is to set both requests and limits, with the requests far lower than the limits.
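
For reference, a minimal sketch of what the corrected pod resources could look like; the request values are illustrative (the actual numbers are set in #791), but the point is that the requests sit far below the limits:

    resources:
      limits:
        cpu: "2"                # users can still burst up to 2 CPUs
        memory: "4294967296"    # 4 GiB ceiling
      requests:
        cpu: "50m"              # tiny CPU guarantee (0.05 CPU), so many users pack onto each node
        memory: "268435456"     # 256 MiB guaranteed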

@consideRatio
Contributor

I've opened #791; this will make users fit onto the already-started nodes. If those all fill up, existing users can leave and rejoin to free up room, since on rejoining they won't request as much as they did initially.

@consideRatio
Contributor

[Screenshot: current status of the user pods]

This is the current status. The users that are stuck need to restart their sessions no matter what, because their pods still request a lot of memory and CPU.

If I could convey the fix to everyone, it would be: don't try to launch until it's ready, then everyone can go.

Hmm...

Instead, I can kill the pending pods and restart the hub; I'll do that. Then users will be able to launch more quickly.

@choldgraf
Member Author

Note that the deploy is still running! It is cycling through all of the hubs in the 2i2c pilot cluster, and is taking some time to get to the paleohack hub. I'm just writing this down here for future reference in the debrief, because it seems sub-optimal that we are waiting for the deploys on all hubs even though only the paleohack config has changed :-)

@choldgraf
Member Author

The deploy to paleohack failed; I'm seeing this error that seems relevant:

Error: UPGRADE FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
jupyterhub:
- singleuser.cpu: Additional property requests is not allowed
- singleuser.memory: Additional property requests is not allowed

and full logs: https://github.com/2i2c-org/infrastructure/runs/4038127331?check_suite_focus=true

@yuvipanda
Member

It should be singleuser.cpu.guarantee. I'm just doing a quick manual deploy
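
In other words, the z2jh chart schema spells the Kubernetes requests/limits as guarantee/limit under singleuser.cpu and singleuser.memory, so a requests key is rejected. A hedged sketch of the corrected values (numbers are illustrative, and the exact nesting depends on where the chart values live in this repo's config):

    jupyterhub:
      singleuser:
        cpu:
          guarantee: 0.05   # becomes the pod's CPU request
          limit: 2          # becomes the pod's CPU limit
        memory:
          guarantee: 256M   # becomes the pod's memory request
          limit: 4G         # becomes the pod's memory limit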

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Oct 28, 2021
I think z2jh should support the kubernetes native terminology
(requests) as well

Ref 2i2c-org#790
@consideRatio
Contributor

I think everything is in order now, and all pending user pods have been wiped. So a restart by users should now make them fit on the nodes.

@damianavila
Contributor

The paleohack image seems to be missing nbgitpuller somehow...

@consideRatio
Contributor

consideRatio commented Oct 28, 2021

:'(

I'm heading off for a climbing session at the gym before it closes.

Technical notes

  1. I realized that manually deploying one hub was far faster than waiting for the entire 2i2c cluster chain-deploy of hubs
  2. I had a failed upgrade, and when that happens you have to manually clean up the broken helm release state (see the sketch after this list)
    • If you run helm list, it should show an installed helm chart; if it shows nothing, you must manually delete the corrupted k8s secret representing the latest installation attempt.
    • helm list -> shows nothing, so...
    • kubectl get secrets, then kubectl delete <the helm release secret with the highest revision number>
    • helm list -> now shows an installation at a certain revision
    • helm upgrade will now work (and so will the deployer script)
  3. The value of a helm template validation workflow is large; we should have Helm template validation in our CI/CD #279, which is quick to run and could spot errors within ~1 minute of a PR being opened.
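
A rough sketch of that recovery sequence (the namespace and release name "paleohack" are assumptions for illustration; substitute the actual hub's values):

    # After the corrupted upgrade, helm may not list the release at all
    helm list --namespace paleohack

    # Helm 3 stores release state in secrets named sh.helm.release.v1.<release>.v<revision>;
    # find the secret for the latest (failed) revision
    kubectl get secrets --namespace paleohack | grep sh.helm.release

    # Delete the secret with the highest revision number (example name below)
    kubectl delete secret sh.helm.release.v1.paleohack.v42 --namespace paleohack

    # helm list should now show the previous healthy revision, and
    # helm upgrade (and the deployer script) will work again
    helm list --namespace paleohack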

@choldgraf
Member Author

choldgraf commented Oct 28, 2021

Update: paleohack image nbgitpuller

The nbgitpuller library was missing from the Paleohack environment, which is stored here:

https://github.com/LinkedEarth/paleoHackathon/blob/main/environment.yml

We have since added it back in, and the image is now being built by the repo2docker action. It had been removed in an earlier step, presumably because it didn't seem necessary for the users' workflow.
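
For illustration, the fix amounts to listing nbgitpuller among the dependencies in that environment.yml; the fragment below is a sketch, not the actual file contents (other entries, pins, and the environment name will differ):

    name: paleohack          # assumed name, for illustration only
    channels:
      - conda-forge
    dependencies:
      - python=3.9           # illustrative pin
      - nbgitpuller          # needed for nbgitpuller links on the hub to work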

Future reference thoughts:

  • For hubs based on events, we should send a reminder to the event organizer to test out the full workflow for their users before the event itself. This could help us catch issues like the wrong library being installed.
  • We should document when functionality that seems "baked in" to the hub requires certain packages to be installed, like nbgitpuller

@damianavila
Contributor

Opened #793, which was manually deployed so we don't need to wait for the whole CI run to succeed.

@choldgraf
Member Author

choldgraf commented Oct 28, 2021

Update: configurator image clash

We also discovered that the configurator had an image specified that pointed to an older tag, which we believe is why the manual update did not overwrite the image in use. We had to delete that field in the configurator in order to get it to work properly.

@choldgraf choldgraf moved this to In Progress ⚡ in Sprint Board Oct 28, 2021
@choldgraf
Member Author

Update: Incident resolved

We've confirmed that users are able to log on to the new nodes, and that the environment has been properly updated.

@yuvipanda
Member

I also asked for, and got, more disk quota. I tried to temporarily increase the number of nodes to 50 to unblock people, but ran into insufficient disk quota.

@yuvipanda
Member

Some quick action items:

  1. Add 'make sure nbgitpuller is installed in the image' to our pre-event checklist (if we have one yet? if not, we should make one)
  2. Validate our config against helm schema in the PR via CI
  3. Enforce that guarantees are always set for userpods, not just limits.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Nov 1, 2021
When limit is set but not guarantee, guarantee is set to match limit! For most
use cases, when we set limit but not guarantee, we want to offer no guarantee and
a limit. This doesn't seem to be possible at all (need to investigate why). In the
meantime, setting a super low guarantee here means we can guard against issues like
2i2c-org#790, where setting the limit
but not guarantee just gave users a huge guarantee, causing node spin ups to fail

Ref 2i2c-org#790
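
For context, issue #801 frames this as a tiny default CPU guarantee for every user pod; a minimal sketch of that setting, assuming it sits under the chart's singleuser values (the exact nesting in this repo's config is an assumption):

    singleuser:
      cpu:
        guarantee: 0.05   # tiny guarantee, so a limit-only config no longer inflates requests
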
@choldgraf
Member Author

I just had a quick debrief with the PaleoHack team, and we discussed a bit of this issue as well. Here are some raw notes. Some of them are relevant to this incident specifically, but not all; I'm just putting them here so we don't lose them.


  • Things that would help
    • Provide a checklist about what people should do before an event
      • Run the notebooks
      • Click the links etc
    • Documentation improvements
      • Environment generation
        • It was confusing to have to go to a separate repository to learn how to build a docker image
        • More context around the instructions would be helpful
        • Make it clear that people can re-use the quay.io images for other JupyterHubs etc
        • Improve explanations around what "quay.io" is in general
      • Tutorials for "what do you do once the hub is set up?"
        • E.g., step by step guide to sharing a notebook on a hub
      • Recommendations for how communities can filter the best participants
        • E.g., recommend that communities run a notebook with some basic functionality so that they know what skills they're expected to learn
      • How to define multiple user profiles for use on the hub
        • It wasn't obvious how to do this as an admin
    • Provide some best-practices for certain kinds of events and communities

@choldgraf
Member Author

OK, I've updated the top comment and included links to follow-up issues; closing this one now! Thanks everybody for your hard work :-)

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Nov 10, 2021