[Incident] PaleoHack User sessions won't start #790
Comments
Assigning @damianavila and @yuvipanda for visibility, since I think they are the ones that did the pre-cooking.
This is because the pods request the same amount of resources as they are limited to, so only a single user fits on each node.
Because only the limit was set, the requests defaulted to the limit. The solution is to set both requests and limits, with the requests far lower than the limits.
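For reference, here is a minimal sketch of what that fix looks like in z2jh-style singleuser config; the numbers below are illustrative assumptions, not the values actually deployed for PaleoHack:

```yaml
# Illustrative z2jh singleuser resources: guarantees (Kubernetes requests)
# set far below the limits so many user pods can pack onto one node.
# The numbers are made up for this sketch, not the real PaleoHack values.
singleuser:
  memory:
    limit: 4G        # hard cap per user pod
    guarantee: 512M  # the scheduler only reserves this much per pod
  cpu:
    limit: 2
    guarantee: 0.1
```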
I've opened #791, which will make users fit onto the already-started nodes. If those all get full, existing users could leave and rejoin to make room, since they would no longer request as much as they did initially.
This is the current status: the users that are stuck need to restart their sessions no matter what, because their pods still request a lot of memory and CPU. If I could convey the fix to everyone, it would be: don't try to launch until it's ready, then everyone can go. Hmm... I can instead kill the pending pods and restart the hub, though; I'll do that. Then they will be able to launch more quickly.
Note that the deploy is still running! It is cycling through all of the hubs in the 2i2c pilot cluster, and is taking some time to get to the paleohack hub. I'm just writing this down here for future reference in the debrief, because it seems sub-optimal that we are waiting for the deploys on all hubs even though only the paleohack config has changed :-)
The deploy to paleohack failed; I'm seeing this error, which seems relevant:
and full logs: https://github.com/2i2c-org/infrastructure/runs/4038127331?check_suite_focus=true
It should be singleuser.cpu.guarantee. I'm just doing a quick manual deploy.
I think z2jh should support the Kubernetes-native terminology (requests) as well. Ref 2i2c-org#790
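To spell out the terminology mismatch (a sketch, not the exact change that was deployed): z2jh uses guarantee/limit in its values, and guarantee is what gets rendered as the Kubernetes-native requests field on each user pod, so a requests key under singleuser.cpu is not recognized by the chart:

```yaml
# z2jh values terminology (what the chart accepts):
singleuser:
  cpu:
    guarantee: 0.1   # rendered as resources.requests.cpu on each user pod
    limit: 2         # rendered as resources.limits.cpu
# Roughly the pod spec fragment this produces (sketch):
#   resources:
#     requests:
#       cpu: "100m"
#     limits:
#       cpu: "2"
```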
I think everything is in order now, and all pending user pods have been wiped. A restart by users should now let them fit onto the nodes.
The paleohack image seems to be missing nbgitpuller somehow...
:'( I'm headed away for a climbing session at the gym before closing hours.
Technical notes
Update: paleohack image nbgitpuller
The nbgitpuller library was missing from the PaleoHack environment, which is stored here: https://github.com/LinkedEarth/paleoHackathon/blob/main/environment.yml
We have since added it back in, and the image is now being built by the repo2docker action. It had been removed in an earlier step, presumably because it didn't seem necessary for the users' workflow. Future reference thoughts:
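For context on the environment fix mentioned above: restoring nbgitpuller amounts to a one-line addition to that environment.yml. The fragment below is a sketch with placeholder entries, not the repository's actual pinned environment:

```yaml
# Hypothetical fragment of environment.yml with nbgitpuller restored.
# The other entries here are placeholders for illustration only.
name: paleohack
channels:
  - conda-forge
dependencies:
  - python=3.9
  - jupyterlab
  - nbgitpuller  # needed so hub "git pull" links can fetch the hackathon notebooks
```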
Opened #793, which was manually deployed so we don't need to wait for the whole CI run to succeed.
Update: configurator image clash
We also discovered that an image specified in the configurator was pointing to an older tag, so we believe that a manual update did not overwrite it. We had to delete that field in the configurator in order to get it to work properly.
Update: Incident resolved
We've confirmed that users are able to log on to the new nodes, and that the environment has been properly updated.
I also asked for and got more disk quota. I tried to temporarily increase the number of nodes to 50 to unblock people, but ran into insufficient disk quota.
Some quick action items:
When limit is set but not guarantee, guarantee is set to match limit! For most use cases, when we set a limit but not a guarantee, we want to offer no guarantee and only a limit. This doesn't seem to be possible at all (need to investigate why). In the meantime, setting a super-low guarantee here means we can guard against issues like 2i2c-org#790, where setting the limit but not the guarantee just gave users a huge guarantee, causing node spin-ups to fail. Ref 2i2c-org#790
I just had a quick debrief with the PaleoHack team and we discussed a bit of this issue as well. Here are some raw notes. Some of them are relevant to this incident specifically, but not all of them; just putting them here so we don't lose them.
OK, I've updated the top comment and included links to follow-up issues, closing this one now! Thanks everybody for your hard work :-)
Summary
The PaleoHackWeek hub is not putting users on the new nodes that were pre-cooked for the event. Here are the hub logs:
In addition, I'm not sure if this is relevant, but it seems like all of the CPU commit percentages are pinned at 62.6% and I'm not sure why. I wonder if that's related to the "20 Insufficient cpu" error above. Here's an image of the Grafana plot:
Timeline (if relevant)
See the comments below for the timeline details; it all happened within a few hours. Here is a summary:
After-action report
What went wrong
Two major things:
Action items
These are only sample subheadings. Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in infrastructure/, they can be in other repositories.
Process improvements
Documentation improvements
Technical improvements
Actions