Improve Monitoring #4619

Open
sambodeme opened this issue Jan 14, 2025 · 1 comment

@sambodeme
Contributor

At a glance

We want to know about any issue as early as possible. It is also important to understand how the system is being used and to track overall performance.

Acceptance Criteria

Anyone can tell at a glance whether the system is working.
We can identify changes in performance and usage.

Considerations

We already have a New Relic instance, so this may be a good source.
Key metrics for any API/endpoint:
Volume (number of calls)
Errors (number of errors)
Latency (how long the calls are taking)
Availability (usually a combination of error rate and uptime, ideally expressed as a percentage, e.g. 99.99%)
We had a database job that was failing silently; be sure to include this and other async GitHub jobs.
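The four key metrics listed above can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the `Call` record, the `summarize` helper, the choice of HTTP 5xx as the definition of an error, and the nearest-rank 95th percentile as the latency figure. Availability is approximated from error rate only; real uptime data would have to come from the monitoring system.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Call:
    """One recorded request (hypothetical record shape)."""
    endpoint: str
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the call


def summarize(calls):
    """Compute volume, errors, latency, and availability for a list of calls."""
    volume = len(calls)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for c in calls if c.status >= 500)
    durations = sorted(c.duration_ms for c in calls)
    if volume:
        # Nearest-rank 95th percentile.
        p95_ms = durations[ceil(0.95 * volume) - 1]
        # Assumption: availability approximated as the share of non-error calls.
        availability_pct = 100.0 * (volume - errors) / volume
    else:
        p95_ms = None
        availability_pct = None
    return {
        "volume": volume,
        "errors": errors,
        "p95_ms": p95_ms,
        "availability_pct": availability_pct,
    }


sample = [
    Call("verify-uei", 200, 100.0),
    Call("verify-uei", 200, 200.0),
    Call("verify-uei", 500, 300.0),
    Call("verify-uei", 200, 400.0),
]
summary = summarize(sample)
```

Running `summarize` per endpoint over a fixed time window gives one row of numbers per endpoint, which is roughly what a per-endpoint dashboard page would display.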

This task can easily balloon, so I suggest picking a few key endpoints (verify UEI, submit audit, database backup) to demonstrate how we can expand monitoring to additional endpoints. We should then document what was done, how it was done, and which endpoints to tackle next.

Aggregate the above into a single, easy-to-read metric that answers “is it healthy?”
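One possible way to roll per-endpoint numbers up into a single status is to take the worst status across endpoints. The sketch below is only an assumption about how that could work: the thresholds, the status names, the endpoint keys, and the `endpoint_status` and `overall_status` helpers are all hypothetical and are not taken from the existing New Relic setup.

```python
# Higher number means worse; the overall status is the worst one observed.
SEVERITY = {"healthy": 0, "unknown": 1, "degraded": 2, "unhealthy": 3}


def endpoint_status(metrics, max_error_rate=0.01, max_p95_ms=2000.0):
    """Classify one endpoint from its volume, errors, and p95 latency.

    The default thresholds are placeholders, not agreed targets.
    """
    if metrics["volume"] == 0:
        return "unknown"  # no traffic, so nothing can be concluded
    error_rate = metrics["errors"] / metrics["volume"]
    if error_rate > max_error_rate:
        return "unhealthy"
    if metrics["p95_ms"] > max_p95_ms:
        return "degraded"
    return "healthy"


def overall_status(metrics_by_endpoint):
    """Return the worst status across all endpoints."""
    statuses = [endpoint_status(m) for m in metrics_by_endpoint.values()]
    return max(statuses, key=SEVERITY.__getitem__, default="unknown")


example = {
    "verify-uei": {"volume": 1000, "errors": 2, "p95_ms": 350.0},
    "submit-audit": {"volume": 200, "errors": 1, "p95_ms": 2500.0},
}
status = overall_status(example)
```

Taking the worst case keeps the top-level signal conservative: a single slow or failing endpoint is enough to change the answer, and the per-endpoint view then shows which one caused it.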

The API currently lacks New Relic metrics.

Terraform (TF) allows configuration of New Relic dashboards; ideally all of this should live in code.

@gsa-jrothacker
Contributor

gsa-jrothacker commented Jan 27, 2025

Done:

  1. I have updated our New Relic dashboards through Terraform. The new dashboard currently has two pages: (i) a high-level overview showing aggregated metrics across the entire site, and (ii) a deep dive into each endpoint (currently configured for two endpoints as a POC, with more to come). PR
  2. Added --wait to all cf run-task commands in GitHub. This prevents GitHub Actions from showing success when the task actually failed. PR
  3. Configured New Relic and Slack integration.
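The silent-failure problem addressed in item 2 comes down to whether a step's exit code reflects the real outcome of the work it started. As a rough illustration only, the Python sketch below uses a hypothetical `run_step` wrapper and stand-in commands (small Python one-liners, not real cf invocations) to show a step that waits for the command to finish and passes its exit code through, so a failure cannot be reported as success.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command, wait for it to finish, and return its exit code."""
    # subprocess.run blocks until the command exits; check=False means we
    # inspect the return code ourselves instead of raising an exception.
    result = subprocess.run(cmd, check=False)
    return result.returncode


# Stand-in commands: one that fails with exit code 3 and one that succeeds.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
passing_cmd = [sys.executable, "-c", "pass"]

failing_code = run_step(failing_cmd)
passing_code = run_step(passing_cmd)
```

A CI step built this way would end with `sys.exit(run_step(...))`, so that a non-zero code from the underlying command marks the whole job as failed.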

Remaining:

  1. Add all endpoints to the Terraform dashboard. We may want to split endpoints across additional pages, for example one page per feature category, to keep each page from growing too large.
  2. Rochelle is working on moving alerts (related to the Slack integration) into Terraform. It would be great to combine the metrics for those alerts with the dashboards.

Status: In Progress