At a glance
We want to know about any issues as early as possible. It is also important to understand how the system is being used and to track overall performance.
Acceptance Criteria
- Anyone can tell at a glance whether the system is working.
- We can identify changes in performance and usage.
Considerations
- We already have a New Relic instance, so this may be a good source.
- Key metrics for any API/endpoint:
  - Volume (number of calls)
  - Errors (number of errors)
  - Latency (how long calls take)
  - Availability (usually a combination of error rate and uptime, ideally as a percentage, e.g. 99.99%)
- We had a database job that was silently failing; be sure to include this and other async GitHub jobs.
- This task can easily balloon out, so I would suggest picking a few key endpoints (verify UEI, submit audit, database backup) to demonstrate how we can expand it to additional endpoints. We should then document what was done, how, and which endpoints we should tackle next.
- Aggregate the above into a single metric that makes it easy to see “is it healthy?”
- The API is currently lacking New Relic metrics.
- Terraform allows configuration of New Relic dashboards; ideally all of this should be in code (see the dashboard sketch below).
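As a rough illustration of "dashboards in code", here is a minimal sketch using the `newrelic_one_dashboard` resource from the New Relic Terraform provider. The dashboard name, the `appName = 'fac-api'` filter, and the NRQL queries are assumptions for illustration only (they assume standard APM `Transaction`/`TransactionError` events), not our actual configuration:

```hcl
# Minimal sketch of a dashboard-as-code setup, assuming the official
# newrelic Terraform provider and standard APM Transaction events.
# Names, NRQL, and layout are illustrative placeholders.
resource "newrelic_one_dashboard" "api_health" {
  name = "FAC API health (example)"

  page {
    name = "Overview"

    # Volume: number of calls over time
    widget_line {
      title  = "Request volume"
      row    = 1
      column = 1
      width  = 4
      height = 3

      nrql_query {
        query = "SELECT count(*) FROM Transaction WHERE appName = 'fac-api' TIMESERIES"
      }
    }

    # Errors: number of failed calls
    widget_line {
      title  = "Errors"
      row    = 1
      column = 5
      width  = 4
      height = 3

      nrql_query {
        query = "SELECT count(*) FROM TransactionError WHERE appName = 'fac-api' TIMESERIES"
      }
    }

    # Latency: how long calls take
    widget_line {
      title  = "Latency (avg / p95)"
      row    = 1
      column = 9
      width  = 4
      height = 3

      nrql_query {
        query = "SELECT average(duration), percentile(duration, 95) FROM Transaction WHERE appName = 'fac-api' TIMESERIES"
      }
    }

    # Availability: the single "is it healthy?" number
    widget_billboard {
      title  = "Availability (%)"
      row    = 4
      column = 1
      width  = 4
      height = 3

      nrql_query {
        query = "SELECT percentage(count(*), WHERE error IS false) FROM Transaction WHERE appName = 'fac-api' SINCE 1 day ago"
      }
    }
  }
}
```

Per-endpoint pages (e.g. verify UEI, submit audit) could then be added as additional `page` blocks whose queries filter on the request path.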
- I have updated our New Relic dashboards through Terraform. The new dashboard currently has two pages: i) a high-level overview showing aggregated metrics across the entire site, and ii) a deep dive into each endpoint (currently configured for two endpoints as a POC, more to come). PR
- Added --wait to all cf run-task commands in GitHub. This prevents GitHub Actions from reporting success when the task actually failed. PR
- Configured the New Relic and Slack integration.
Remaining:
- Add all endpoints to the Terraform dashboard. We may want to consider breaking endpoints up across additional pages, with each page covering a category of features, so that no single page grows too large.
- Rochelle is working on moving Alerts (related to the Slack integration) into Terraform; it would be great to combine the metrics for those alerts with the dashboards (see the alert sketch below).
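For reference, a hedged sketch of what one of those alerts might look like once it lives in Terraform, using the provider's `newrelic_alert_policy` and `newrelic_nrql_alert_condition` resources. The policy name, query, and thresholds are assumptions, and the policy would still need to be routed to the existing Slack notification destination/workflow:

```hcl
# Hypothetical alert-as-code sketch; the name, NRQL query, and thresholds
# are placeholders. Routing to Slack is assumed to go through the existing
# New Relic <-> Slack integration mentioned above.
resource "newrelic_alert_policy" "api_health" {
  name = "FAC API health (example)"
}

resource "newrelic_nrql_alert_condition" "error_rate" {
  policy_id = newrelic_alert_policy.api_health.id
  type      = "static"
  name      = "High error rate"

  nrql {
    query = "SELECT percentage(count(*), WHERE error IS true) FROM Transaction WHERE appName = 'fac-api'"
  }

  critical {
    operator              = "above"
    threshold             = 5   # percent of requests failing (placeholder)
    threshold_duration    = 300 # seconds the condition must hold
    threshold_occurrences = "all"
  }
}
```

Defining the alert conditions next to the dashboard widgets would keep both driven by the same NRQL, so the "is it healthy?" number and the thing that pages us can't drift apart.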