Don't retry flushing CloudWatch metrics #720
The API for flushing metrics is rate limited. In high-load environments this can cause performance issues while we wait for the call to finish, so if it fails we will not retry.
Brian Buchanan on the LWR team re-ran the performance test with Jen's changes. His findings:
Prior to the change we saw ~1.3s avg duration at scale. We are down to ~400ms with the change. 400ms is still 4x the SSR Render Time metric. (Granted this can be inaccurate with the change)
They have a dashboard, here, too.
What a first PR @jennmills – wowie! 🥳
These tests are unrelated to the changes in the PR, but dropping the retry logic in the metrics code caused our branch coverage to drop. I decided to add more tests somewhere else, instead of dropping the coverage requirement.
The metrics code itself has 100% test and branch coverage.
Thanks for keeping up the test coverage!
Description
The pwa-kit-runtime is a project that allows us to run Node applications on the Managed Runtime. Inside this project we build an Express server that interfaces with AWS Lambda. As our runtime is invoked, we gather various metrics and send them to CloudWatch, where they are displayed in dashboards. We make sure these metrics are sent before we call the Lambda's callback, which ultimately sends the response to the requesting client (the browser).
One caveat is that the CloudWatch API endpoint is rate limited, meaning that in times of extreme load a request might be denied. The way our application works, we retry sending the metrics multiple times (read below how that works). This causes slow response times.
We shouldn't do this, as it's more important to respond to client requests than it is to send off metrics that we hardly use.
Details
The API for flushing metrics is rate limited, and in high-load environments we have seen this cause performance issues: the response waits for the call to finish and, if it fails, for the retries as well.
Each retry waits 50 milliseconds, plus an additional 25 milliseconds for each subsequent retry, which adds up.
This PR reduces the number of retries to 0, effectively removing them so that the API call does not cause performance issues.
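To make the cost of the retries concrete, here is a minimal sketch of the wait time they can add. It assumes the delay grows linearly (50 ms before the first retry, then 25 ms more before each later one); the function names and constants are hypothetical and are not taken from the pwa-kit-runtime source. It counts only the time spent sleeping between attempts, not the duration of the API calls themselves.

```javascript
// Hypothetical constants, assumed from the description above:
// 50 ms before the first retry, plus 25 ms more for each later retry.
const BASE_DELAY_MS = 50
const DELAY_INCREMENT_MS = 25

// Delay before a given retry (retryIndex 0 is the first retry).
function retryDelayMs(retryIndex) {
    return BASE_DELAY_MS + DELAY_INCREMENT_MS * retryIndex
}

// Total time spent waiting between attempts if every attempt fails.
function worstCaseWaitMs(retries) {
    let total = 0
    for (let i = 0; i < retries; i++) {
        total += retryDelayMs(i)
    }
    return total
}

console.log(worstCaseWaitMs(3)) // 50 + 75 + 100 = 225
console.log(worstCaseWaitMs(0)) // 0
```

Under these assumptions, three retries add up to 225 ms of waiting per request on top of the failed calls, while zero retries add none.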
Future Considerations
We should look to solve this in a better way. If possible, we should send the metrics after invoking the callback. This may or may not be possible, as the Lambda might be destroyed by that time, but it's worth looking into how we can do this.
Changes
As mentioned above, we have reduced the number of retries from 3 to 0 when flushing CloudWatch metrics fails, meaning the API call is attempted only once.
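The effect of setting the retry count to 0 can be sketched as follows. This is an illustration only: flushMetrics and its send parameter are hypothetical stand-ins, not the actual functions in the metrics code, and the delay between attempts is omitted for brevity.

```javascript
// Hypothetical helper: calls send() once, then up to `retries` more
// times if it keeps failing. Never throws, so a failed flush cannot
// block or break the response to the client.
async function flushMetrics(send, retries = 0) {
    let attempts = 0
    for (;;) {
        attempts += 1
        try {
            await send()
            return {ok: true, attempts}
        } catch (error) {
            if (attempts > retries) {
                return {ok: false, attempts, error}
            }
        }
    }
}

// A sender that always fails, standing in for a rate-limited API call.
const alwaysThrottled = async () => {
    throw new Error('Throttled')
}

flushMetrics(alwaysThrottled, 0).then((result) => {
    console.log(result.ok, result.attempts) // false 1
})
```

With a retry count of 3 the same failing sender would be called four times; with 0 it is called exactly once and the failure is dropped.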
How to Test-Drive This PR
After merging this PR, the PWA Kit SDK team will need to do a Release Preview (FYI @bendvc). When the Release Preview is ready, we can reach out to the LWR team to test the changes.