diff --git a/docs/incidents/2020-10-28-memory-overload.md b/docs/incidents/2020-10-28-memory-overload.md
deleted file mode 100644
index 197a1ea75f..0000000000
--- a/docs/incidents/2020-10-28-memory-overload.md
+++ /dev/null
@@ -1,59 +0,0 @@
-# 2020-08-28 - Memory overload on WER cluster
-
-## Summary
-
-On 2020-08-28, WER reported [stuck pages](https://github.com/2i2c-org/docs/issues/27) for students. A total outage, nothing usable.
-
-After investigation, we determined that the core pods didn't have appropriate resource guarantees set. There was also no dedicated core pool, so the WER students overloaded CPU & RAM of the nodes. This starved everything of resources, causing issues.
-
-This was resolved by:
-
-1. Giving core pods [more resource guarantees](https://github.com/2i2c-org/infrastructure/commit/88767d85c306784754560dedc1d5ac7abdb8a2a0)
-2. [Removing memory overcommit](https://github.com/2i2c-org/infrastructure/pull/88) for WER students, since they seem to be using a good chunk of their memory limit.
-
-## Timeline
-
-All times in IST
-
-### 08:52 PM
-
-Incoming report that many students can not access the hub, and it is [frozen](https://github.com/2i2c-org/docs/issues/27#issue-731543843)
-
-### 09:02 PM
-
-Activity bump [is noticed](https://github.com/2i2c-org/docs/issues/27#issuecomment-718014094) but regular
-fixes (incognito, restarting servers, etc) don't seem to fix things
-
-### 09:21 PM
-
-Looking at resource utilization on the nodes, resource exhaustion is clear
-
-
-```bash
-$ kubectl top node
-NAME                                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
-gke-low-touch-hubs-cluster-core-pool-b7edea69-00sc   220m         11%    6151Mi          58%
-gke-low-touch-hubs-cluster-core-pool-b7edea69-gwrg   1944m        100%   10432Mi         98%
-```
-
-There were only core nodes - no separate user nodes. The suspicion is that the user pods are using up just enough resources that the core pods are being starved.
-
-### 09:23 PM
-
-Based on [tests on how much RAM WER needs](https://github.com/2i2c-org/docs/issues/15), we had set a limit of 2G but guarantee of only 512M - a 4x overcommit as we often do. However, the tests revealed that users almost always use just under 1G of RAM, so our overcommit should've been just 2x. We just [remove overcommit](https://github.com/2i2c-org/infrastructure/pull/88) for now. This will also probably spawn another node, thus easing pressure on the other existing nodes.
-
-### 09:24 PM
-
-We [bump resource guarantees](https://github.com/2i2c-org/infrastructure/commit/88767d85c306784754560dedc1d5ac7abdb8a2a0) for all the core pods as well, so they will have enough to operate even if the nodes get full. This restarts the pods, and moves some to a new node - which also helps. Things seem to return to normal.
-
-### 09:46 PM
-
-The [issue is closed](https://github.com/2i2c-org/docs/issues/27#issuecomment-718044571) and everything seems fine
-
-## Action Items
-
-- Make sure user pods are in a separate pool, so they do not create pressure on the core pods
-- Set limits on the support infrastructure (prometheus, grafana, ingress) as well
-- Document and think about overcommit ratios for memory usage
-- Setup better Grafana dashboards to monitor resource usage
-- Document how folks can get `kubectl` access to the cluster, so others can look into issues too
diff --git a/docs/incidents/index.md b/docs/incidents/index.md
deleted file mode 100644
index a1d6fb42da..0000000000
--- a/docs/incidents/index.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Incident reports
-
-The 2i2c infrastructure is constantly evolving. As people use our hubs, different issues often arise that result in new needs for development and updates to the deployment. 2i2c follows a process of transparent and blameless post-mortems to address these issues now, and prevent them from happening in the future.
-
-Below is a list of issue reports for the 2i2c Hub infrastructure.
-
-```{toctree}
-:glob:
-:maxdepth: 1
-
-./*
-```
diff --git a/docs/index.md b/docs/index.md
index e1d18b6f72..c3c63751b9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -71,7 +71,6 @@ reference/hubs
 reference/ci-cd/index
 reference/terraform.md
 reference/tools
-incidents/index
 ```
 
 ## Contributing