Improve Monitoring #4619

sambodeme · 2025-01-14T18:50:21Z

At a glance

We want to know about any issue as early as possible. It is also important to understand how the system is being used and to track overall performance.

Acceptance Criteria

Anyone can tell at a glance whether the system is working.
We can identify changes in performance and usage.

Considerations

We already have a New Relic instance, so this may be a good source.
Key metrics for any API/endpoint:
Volume (numbers of calls)
Errors (number of errors)
Latency (how long the calls are taking)
Availability (usually a combination of error rate and uptime, ideally as a percentage, e.g. 99.99%)
We had a database job that was silently failing, so be sure to include this and other asynchronous GitHub jobs.
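As a rough illustration of the four key metrics listed above, here is a minimal Python sketch that computes them from a batch of call records for one endpoint. The `Call` record shape, the choice of counting only 5xx responses as errors, and the nearest-rank p95 latency are all assumptions for the example, not something New Relic or the existing code defines. Availability here is derived from error rate only; uptime is left out for simplicity.

```python
import math
from dataclasses import dataclass


@dataclass
class Call:
    """One recorded API call (hypothetical record shape)."""
    status: int         # HTTP status code returned to the caller
    duration_ms: float  # time taken to serve the call, in milliseconds


def summarize(calls):
    """Return volume, errors, p95 latency, and availability for one endpoint."""
    volume = len(calls)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for c in calls if c.status >= 500)
    durations = sorted(c.duration_ms for c in calls)
    if volume:
        # Nearest-rank 95th percentile.
        p95_ms = durations[math.ceil(0.95 * volume) - 1]
        availability_pct = 100.0 * (volume - errors) / volume
    else:
        # No traffic: nothing failed, so report full availability.
        p95_ms = 0.0
        availability_pct = 100.0
    return {
        "volume": volume,
        "errors": errors,
        "p95_ms": p95_ms,
        "availability_pct": availability_pct,
    }
```

In practice these numbers would more likely come from New Relic queries than from application code; the sketch only pins down what each metric means.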

This task can easily balloon, so I suggest picking a few key endpoints (verify UEI, submit audit, database backup) to demonstrate how we can expand it to additional endpoints. We should then document what was done, how it was done, and which endpoints we should tackle next.

Aggregate the above into a single, easy-to-read metric that answers “is it healthy?”
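One possible way to collapse per-endpoint metrics into a single "is it healthy?" value is a worst-status-wins rule, sketched below in Python. The thresholds (1% error rate, 2000 ms p95), the three status names, and the endpoint keys are hypothetical placeholders; the real values would need to be agreed on by the team.

```python
# Statuses ordered from best to worst, so the worst one can be picked by rank.
_SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def endpoint_status(m, max_error_rate=0.01, max_p95_ms=2000.0):
    """Classify one endpoint from its volume, errors, and p95 latency."""
    error_rate = m["errors"] / m["volume"] if m["volume"] else 0.0
    if error_rate > max_error_rate:
        return "unhealthy"
    if m["p95_ms"] > max_p95_ms:
        return "degraded"
    return "healthy"


def overall_health(metrics_by_endpoint):
    """Return the worst status across all endpoints (healthy if there are none)."""
    statuses = [endpoint_status(m) for m in metrics_by_endpoint.values()]
    return max(statuses, key=_SEVERITY.get, default="healthy")


# Example with hypothetical numbers for two of the suggested endpoints.
example = {
    "verify_uei": {"volume": 100, "errors": 0, "p95_ms": 300.0},
    "submit_audit": {"volume": 50, "errors": 5, "p95_ms": 500.0},
}
print(overall_health(example))
```

A worst-status-wins rule is deliberately conservative: a single failing endpoint turns the whole indicator red, which matches the goal of knowing about problems as early as possible.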

The API currently lacks New Relic metrics.

Terraform (TF) allows configuration of New Relic dashboards. Ideally, all of this should be in code.

gsa-jrothacker · 2025-01-27T14:36:09Z

Done:

I have updated our New Relic dashboards through Terraform. The new dashboard currently has two pages: (i) a high-level overview showing aggregated metrics across the entire site, and (ii) a deep dive into each endpoint (currently configured for two endpoints as a proof of concept, with more to come). (PR)
Added --wait to all cf run-task commands in GitHub. This prevents GitHub Actions from showing success when the task actually failed. (PR)
Configured New Relic and Slack integration.
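To illustrate why `--wait` matters, here is a small Python sketch of a wrapper that builds a `cf run-task` command and fails loudly on a non-zero exit code. The app name, task name, and task command in the usage note are hypothetical, and the `--command` and `--name` flags assume cf CLI v7 or later syntax. The actual change was made directly in the GitHub workflow files, not through a wrapper like this.

```python
import subprocess


def build_run_task_cmd(app, command, name, wait=True):
    """Build the argument list for `cf run-task` (assumes cf CLI v7+ flags)."""
    cmd = ["cf", "run-task", app, "--command", command, "--name", name]
    if wait:
        # Without --wait, the CLI returns once the task is submitted, so a
        # later task failure never reaches the calling job's exit code.
        cmd.append("--wait")
    return cmd


def run_or_fail(cmd):
    """Run a command; raise subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)
```

A call might look like `run_or_fail(build_run_task_cmd("my-app", "python manage.py some_job", "some-job"))`, where all three arguments are placeholders.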

Remaining:

Add all endpoints to the Terraform dashboard. We may want to split endpoints across additional pages, for example one page per feature category, to keep each page from growing too large.
Rochelle is working on moving alerts (related to the Slack integration) into Terraform. It would be great to combine the metrics for those alerts with the dashboards.

github-project-automation bot added this to FAC Jan 14, 2025

github-project-automation bot moved this to Triage in FAC Jan 14, 2025

sambodeme added the eng label Jan 14, 2025

sambodeme moved this from Triage to In Progress in FAC Jan 14, 2025

Leighdiddy assigned sambodeme Jan 14, 2025

gsa-jrothacker self-assigned this Jan 15, 2025

sambodeme assigned jperson1 Jan 15, 2025

gsa-jrothacker mentioned this issue Jan 17, 2025

Creating a dashboard that displays details at the endpoint level. #4643

Merged

18 tasks

rocheller123 self-assigned this Jan 21, 2025

gsa-jrothacker mentioned this issue Jan 27, 2025

Adding --wait to all cf run-task commands #4651

Merged
