🔍 Investigation Needed: Incident #386206 and Auto-Resolution #3193
Comments
Summary
Control Panel's RDS instance is …
The alarm is configured in cloudwatch-alarms.tf (https://github.com/ministryofjustice/data-platform/blob/4b9253cd26e7f494b3374abc1064ee543941dc6e/terraform/aws/analytical-platform-production/cluster/cloudwatch-alarms.tf#L223-L240) to alert when freeable memory drops below 128MB (threshold value: https://github.com/ministryofjustice/data-platform/blob/4b9253cd26e7f494b3374abc1064ee543941dc6e/terraform/aws/analytical-platform-production/cluster/terraform.tfvars#L102). 128MB is a sufficient threshold to alert on for a database with 1GB RAM.
Suggested Action
Migrate to …
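As a quick cross-check of that alarm outside of Terraform, something like the following boto3 sketch could read back the configured threshold and current state. This is illustrative only: the alarm name is taken from the incident below, and credentials/region for the production account are assumed.

```python
# Sketch only: read back the CloudWatch alarm's threshold and state with boto3.
# Assumes credentials and region for the production account are already configured.
import boto3

cloudwatch = boto3.client("cloudwatch")

alarm_name = "rds-eks-production-control-panel-psg-db-encrypted-low-freeable-memory"
alarms = cloudwatch.describe_alarms(AlarmNames=[alarm_name])["MetricAlarms"]

for alarm in alarms:
    print("Alarm:    ", alarm["AlarmName"])
    print("Metric:   ", alarm["Namespace"], alarm["MetricName"])
    print("Threshold:", alarm["Threshold"], "bytes")  # 128MB would be 134217728 bytes
    print("State:    ", alarm["StateValue"])          # OK / ALARM / INSUFFICIENT_DATA
```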
Looking at the documentation for an RDS PostgreSQL instance, there do not appear to be any memory structures that could be tuned to help with this alert when it occurs. If we did want to fix this (not currently occurring) issue, we would have to increase the instance size as suggested.
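If a resize were pursued, a sensible first step would be confirming the current instance class. A minimal sketch, assuming a DB instance identifier that is not stated in this issue:

```python
# Sketch only: confirm the current instance class before deciding on a resize.
# The DB instance identifier below is an assumption, not taken from this issue.
import boto3

rds = boto3.client("rds")

response = rds.describe_db_instances(
    DBInstanceIdentifier="control-panel-db"  # hypothetical identifier
)

for db in response["DBInstances"]:
    print("Instance class:", db["DBInstanceClass"])
    print("Engine:        ", db["Engine"], db["EngineVersion"])
    print("Multi-AZ:      ", db["MultiAZ"])
```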
Looking at the incidents raised in PagerDuty, I can see that this issue first alerted on 17/01/24.
As there is currently no issue, I would advise taking no action and investigating if it recurs in the future. I will move the ticket into review and, if the team agrees, I will close the ticket. If not, I will increase the instance size in the dev cluster as a test before implementing it in prod.
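To verify that freeable memory currently has headroom, and to re-check quickly if the alert fires again, a sketch along these lines could pull the recent minimum FreeableMemory for the instance. The DB instance identifier is again an assumption:

```python
# Sketch only: pull the last 7 days of minimum FreeableMemory for the instance.
# The DB instance identifier is an assumption, not taken from this issue.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="FreeableMemory",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "control-panel-db"}],  # hypothetical
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,              # hourly datapoints
    Statistics=["Minimum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Minimum"] / 1024 / 1024:.0f} MB free')
```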
Closing this ticket; a new issue has been raised to look at log retention here.
Description
When I was on support on 02/02/24, I consistently received emails from PagerDuty indicating that I had an open incident assigned to me:
INCIDENT: #386206
Cluster: rds-eks-production-control-panel-psg-db-encrypted-low-freeable-memory
Issue Details
However, upon checking the incident details, it appears to be resolved. I suspect there might have been a memory issue in one of the clusters, and the system may have automatically resolved it.
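One way to confirm whether the alarm briefly fired and then recovered on its own (which would explain the PagerDuty incident auto-resolving) is to look at the alarm's state-change history around that date. A sketch, again assuming production credentials are configured:

```python
# Sketch only: list the alarm's state changes around 02/02/24 to see whether it
# went ALARM -> OK on its own, which would auto-resolve the PagerDuty incident.
from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

history = cloudwatch.describe_alarm_history(
    AlarmName="rds-eks-production-control-panel-psg-db-encrypted-low-freeable-memory",
    HistoryItemType="StateUpdate",
    StartDate=datetime(2024, 2, 1, tzinfo=timezone.utc),
    EndDate=datetime(2024, 2, 3, tzinfo=timezone.utc),
)

for item in history["AlarmHistoryItems"]:
    print(item["Timestamp"], "-", item["HistorySummary"])
```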
Steps to Reproduce
N/A (incident was automatically resolved)
Expected Behavior
Provide insights or investigation on the potential memory issue in the mentioned cluster.
Additional Information