[Incident] PaleoHack User sessions won't start #790

Closed · 4 tasks done
choldgraf opened this issue Oct 28, 2021 · 18 comments

@choldgraf
Member

choldgraf commented Oct 28, 2021

Summary

The PaleoHackWeek hub is not putting users on the new nodes that were pre-cooked for the event. Here are the hub logs:

2021-10-28T17:38:46Z [Warning] 0/23 nodes are available: 20 Insufficient cpu, 3 node(s) didn't match node selector.
2021-10-28T17:38:48Z [Normal] pod didn't trigger scale-up: 3 node(s) didn't match node selector, 1 max node group size reached

In addition, I'm not sure if this is relevant, but it seems like all of the CPU commit percentages are pinned at 62.6%, and I'm not sure why. I wonder if that's related to the "20 Insufficient cpu" error above. Here's an image of the Grafana plot:

[Grafana screenshot: CPU commit % pinned at 62.6% across nodes]

Timeline (if relevant)

See the comments below for the timeline details; everything happened within a few hours. Here is a summary:


After-action report

What went wrong

Two major things:

  • We had a sub-optimal helm config that set only resource limits for user pods, which caused Kubernetes to default each pod's resource requests to its limits.
  • The PaleoHack image didn't have nbgitpuller on it, so their nbgitpuller links didn't work.

Action items

These are only sample subheadings. Every action item should have a GitHub issue
(even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in infrastructure/; they can be in other repositories.

Process improvements

  1. Add steps to send to the community representative to make sure the hub is working properly before the event (# )

Documentation improvements

  1. Improve documentation and best practices around events docs#111

Technical improvements

  1. Enforce that guarantees are always set for userpods, not just limits: Set 0.05 CPU guarantee for all user pods #801
  2. Validate our config against helm schema in the PR via CI: Helm template validation in our CI/CD #279

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • Incident title and after-action report are cleaned up
  • All actionable items above have linked GitHub Issues
@choldgraf
Member Author

Assigning @damianavila and @yuvipanda for visibility since I think they are the ones that did the pre-cooking

@consideRatio
Contributor

This is because the pods request the same resources as they are limited to, so only a single user fits on each node.

    resources:
      limits:
        cpu: "2"
        memory: "4294967296"    # 4 GiB
      requests:
        cpu: "2"                # equal to the limit, so each user reserves 2 full CPUs
        memory: "268435456"     # 256 MiB

By setting only the limit, the requests defaulted to the limit. The solution is to set both requests and limits, with the requests far lower than the limits.
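
For reference, a minimal sketch of what the corrected pod resources could look like; the request values are illustrative (the actual numbers are set in #791), but the point is that the requests sit far below the limits:

    resources:
      limits:
        cpu: "2"                # users can still burst up to 2 CPUs
        memory: "4294967296"    # 4 GiB ceiling
      requests:
        cpu: "50m"              # tiny CPU guarantee (0.05 CPU), so many users pack onto each node
        memory: "268435456"     # 256 MiB guaranteed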

@consideRatio
Contributor

I've opened #791; this will make users fit onto the already-started nodes. If those all fill up, existing users can leave and rejoin to free up room, since on rejoining they won't request as much as they did initially.

@consideRatio
Contributor

[Screenshot: current status of the user pods]

This is the current status. The users that are stuck need to restart their sessions no matter what, because their pods still request a lot of memory and CPU.

If I could convey the fix to everyone, it would be: don't try to launch until it's ready, then everyone can go.

Hmm...

Instead, I can kill the pending pods and restart the hub; I'll do that. Then users will be able to launch more quickly.

@choldgraf
Member Author

Note that the deploy is still running! It is cycling through all of the hubs in the 2i2c pilot cluster, and is taking some time to get to the paleohack hub. I'm just writing this down here for future reference in the debrief, because it seems sub-optimal that we are waiting for the deploys on all hubs even though only the paleohack config has changed :-)

@choldgraf
Member Author

The deploy to paleohack failed; I'm seeing this error that seems relevant:

Error: UPGRADE FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
jupyterhub:
- singleuser.cpu: Additional property requests is not allowed
- singleuser.memory: Additional property requests is not allowed

and full logs: https://github.com/2i2c-org/infrastructure/runs/4038127331?check_suite_focus=true

@yuvipanda
Member

It should be singleuser.cpu.guarantee. I'm just doing a quick manual deploy
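
In other words, the z2jh chart schema spells the Kubernetes requests/limits as guarantee/limit under singleuser.cpu and singleuser.memory, so a requests key is rejected. A hedged sketch of the corrected values (numbers are illustrative, and the exact nesting depends on where the chart values live in this repo's config):

    jupyterhub:
      singleuser:
        cpu:
          guarantee: 0.05   # becomes the pod's CPU request
          limit: 2          # becomes the pod's CPU limit
        memory:
          guarantee: 256M   # becomes the pod's memory request
          limit: 4G         # becomes the pod's memory limit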

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Oct 28, 2021
I think z2jh should support the kubernetes native terminology
(requests) as well

Ref 2i2c-org#790
@consideRatio
Contributor

I think everything is in order now, and all pending user pods have been wiped. So a restart by users should now make them fit on the nodes.

@damianavila
Contributor

The paleohack image seems to be missing nbgitpuller somehow...

@consideRatio
Contributor

consideRatio commented Oct 28, 2021

:'(

I'm heading off for a climbing session at the gym before it closes.

Technical notes

  1. I realized that manually deploying one hub was far faster than waiting for the entire 2i2c cluster chain-deploy of hubs
  2. I had a failed upgrade, and when that happens you have to manually clean up the broken helm release state (see the sketch after this list)
    • If you run helm list, it should show an installed helm chart; if it shows nothing, you must manually delete the corrupted k8s secret representing the latest installation attempt.
    • helm list -> shows nothing, so...
    • kubectl get secrets, then kubectl delete <the helm release secret with the highest revision number>
    • helm list -> now shows an installation at a certain revision
    • helm upgrade will now work (and so will the deployer script)
  3. The value of a helm template validation workflow is large; we should have Helm template validation in our CI/CD #279, which is quick to run and could spot errors within ~1 minute of a PR being opened.
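
A rough sketch of that recovery sequence (the namespace and release name "paleohack" are assumptions for illustration; substitute the actual hub's values):

    # After the corrupted upgrade, helm may not list the release at all
    helm list --namespace paleohack

    # Helm 3 stores release state in secrets named sh.helm.release.v1.<release>.v<revision>;
    # find the secret for the latest (failed) revision
    kubectl get secrets --namespace paleohack | grep sh.helm.release

    # Delete the secret with the highest revision number (example name below)
    kubectl delete secret sh.helm.release.v1.paleohack.v42 --namespace paleohack

    # helm list should now show the previous healthy revision, and
    # helm upgrade (and the deployer script) will work again
    helm list --namespace paleohack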

@choldgraf
Member Author

choldgraf commented Oct 28, 2021

Update: paleohack image nbgitpuller

The nbgitpuller library was missing from the Paleohack environment, which is stored here:

https://github.com/LinkedEarth/paleoHackathon/blob/main/environment.yml

We have since added it back in, and the image is now being built by the repo2docker action. It had been removed in an earlier step, presumably because it didn't seem necessary for the users' workflow.
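
For illustration, the fix amounts to listing nbgitpuller among the dependencies in that environment.yml; the fragment below is a sketch, not the actual file contents (other entries, pins, and the environment name will differ):

    name: paleohack          # assumed name, for illustration only
    channels:
      - conda-forge
    dependencies:
      - python=3.9           # illustrative pin
      - nbgitpuller          # needed for nbgitpuller links on the hub to work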

Future reference thoughts:

  • For hubs based on events, we should send a reminder to the event organizer to test out the full workflow for their users before the event itself. This could help us catch issues like the wrong library being installed.
  • We should document when functionality that seems "baked in" to the hub requires certain packages to be installed, like nbgitpuller

@damianavila
Contributor

Opened #793, which was manually deployed so we don't need to wait for the whole CI run to succeed.

@choldgraf
Member Author

choldgraf commented Oct 28, 2021

Update: configurator image clash

We also discovered that the configurator had an image specified that pointed to an older tag, which we believe is why the manual update did not overwrite the image in use. We had to delete that field in the configurator in order to get it to work properly.

@choldgraf choldgraf moved this to In Progress ⚡ in Sprint Board Oct 28, 2021
@choldgraf
Member Author

Update: Incident resolved

We've confirmed that users are able to log on to the new nodes, and that the environment has been properly updated.

@yuvipanda
Member

I also asked for, and got, more disk quota. I tried to temporarily increase the number of nodes to 50 to unblock people, but ran into insufficient disk quota.

@yuvipanda
Member

Some quick action items:

  1. Add 'make sure nbgitpuller is installed in the image' to our pre-event checklist (if we have one yet? if not, we should make one)
  2. Validate our config against helm schema in the PR via CI
  3. Enforce that guarantees are always set for userpods, not just limits.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Nov 1, 2021
When limit is set but not guarantee, guarantee is set to match limit! For most
use cases, when we set limit but not guarantee, we want to offer no guarantee and
a limit. This doesn't seem to be possible at all (need to investigate why). In the
meantime, setting a super low guarantee here means we can guard against issues like
2i2c-org#790, where setting the limit
but not guarantee just gave users a huge guarantee, causing node spin ups to fail

Ref 2i2c-org#790
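
For context, issue #801 frames this as a tiny default CPU guarantee for every user pod; a minimal sketch of that setting, assuming it sits under the chart's singleuser values (the exact nesting in this repo's config is an assumption):

    singleuser:
      cpu:
        guarantee: 0.05   # tiny guarantee, so a limit-only config no longer inflates requests
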
@choldgraf
Member Author

I just had a quick debrief with the PaleoHack team, and we discussed a bit of this issue as well. Here are some raw notes. Some of them are relevant to this incident specifically, but not all; I'm just putting them here so we don't lose them.


  • Things that would help
    • Provide a checklist about what people should do before an event
      • Run the notebooks
      • Click the links etc
    • Documentation improvements
      • Environment generation
        • It was confusing to have to go to a separate repository to learn how to build a docker image
        • More context around the instructions would be helpful
        • Make it clear that people can re-use the quay.io images for other JupyterHubs etc
        • Improve explanations around what "quay.io" is in general
      • Tutorials for "what do you do once the hub is set up?"
        • E.g., step by step guide to sharing a notebook on a hub
      • Recommendations for how communities can filter the best participants
        • E.g., recommend that communities run a notebook with some basic functionality so that they know what skills they're expected to learn
      • How to define multiple user profiles for use on the hub
        • It wasn't obvious how to do this as an admin
    • Provide some best-practices for certain kinds of events and communities

@choldgraf
Member Author

OK, I've updated the top comment and included links to follow-up issues; closing this one now! Thanks everybody for your hard work :-)

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Nov 10, 2021