Don't retry flushing CloudWatch metrics #720

jennmills · 2022-09-15T21:22:42Z

Description

The pwa-kit-runtime is a project that allows us to run Node applications on the Managed Runtime. Inside this project we build an Express server that interfaces with AWS lambda. As our runtime is invoked, we send various metrics to Cloudfront to be displayed in various dashboards before we call the lambdas callback to ultimately send a response to the requesting client (the browser). Before we call this callback we make sure that any metrics we gathered are sent to Cloudfront.

One caveat is that the Cloudfront api endpoint is rate limited, meaning that in times of extreme load a request might be denied. The way our application works, we will retry the sending of metrics multiple times (read below how that works). This causes slow response times.

We shouldn't do this as it's more important to respond to clients requests than it is to send off metrics that we hardly use.

Details

The API for flushing metrics is rate limited and in high load environments, we have seen this cause performance issues when waiting for the call to finish and retry if it fails.

Every retry it waits 50 milliseconds plus an additional 25 milliseconds for the next retry which can add up.

This PR includes changes to reduce the retries to 0 - effectively removing the retries so that the API does not cause performance issues.

Future Considerations

We should look to solve this in a better way. If possible perform the metric sending after making the callback call. This may or may no be possible as the lambda might be destroyed at this time. But its worth looking to how we can do this.

Types of Changes

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Documentation update
Breaking change (could cause existing functionality to not work as expected)
Other changes (non-breaking changes that does not fit any of the above)

Breaking changes include:

Removing a public function or component or prop

Adding a required argument to a function

Changing the data type of a function parameter or return value

Adding a new peer dependency to package.json

Changes

As mentioned above we have reduced retries to 0 instead of 3 when flushing CloudWatch Metrics fail meaning it will only attempt the API call once.

How to Test-Drive This PR

After merging this PR the PWA Kit SDK Team will need to do a Release Preview (FYI @bendvc). When the Release Preview is ready to go we can reach out to the LWR team to test the changes.

Checklists

General

Changes are covered by test cases
CHANGELOG.md updated with a short description of changes (not required for documentation updates)

The api for flushing metrics is rate limited, in high load environments this may cause performance issues as we wait for this call to finish so if it fails we will not retry.

olibrook

Brian Buchanan on the LWR team re-ran performance test with Jen's changes. His findings:

Prior to the change we saw ~1.3s avg duration at scale. We are down to ~400ms with the change. 400ms is still 4x the SSR Render Time metric. (Granted this can be inaccurate with the change)

They have a dashboard, here, too.

What a first PR @jennmills – wowie! 🥳

olibrook · 2022-10-05T17:34:16Z

packages/pwa-kit-runtime/src/utils/ssr-server/utils.test.js

@@ -0,0 +1,62 @@
+/*


These tests are unrelated to the changes in the PR, but dropping the retry logic in the metrics code caused our branch coverage to drop. I decided to add more tests somewhere else, instead of dropping the coverage requirement.

The metrics code itself has 100% test and branch coverage.

kevinxh

Thanks for keeping up the test coverage!

Don't retry flushing CloudWatch metrics

dee0af8

The api for flushing metrics is rate limited, in high load environments this may cause performance issues as we wait for this call to finish so if it fails we will not retry.

jennmills requested a review from a team as a code owner September 15, 2022 21:22

jennmills requested review from bendvc and olibrook September 15, 2022 21:23

Merge branch 'develop' into performance-improvement

5557227

olibrook force-pushed the performance-improvement branch from cc9e108 to 5557227 Compare October 4, 2022 23:28

Merge branch 'develop' into performance-improvement

1f42efb

olibrook approved these changes Oct 4, 2022

View reviewed changes

olibrook added 4 commits October 4, 2022 20:38

Remove retry logic for CloudWatch and boost branch coverage

4c658fa

Format

cb2b08d

Bump coverage

999a93b

Lint

f3aa9b1

olibrook reviewed Oct 5, 2022

View reviewed changes

kevinxh approved these changes Oct 5, 2022

View reviewed changes

Changelog

a579e28

olibrook merged commit 0a44fa7 into develop Oct 5, 2022

wjhsf deleted the performance-improvement branch March 17, 2023 16:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't retry flushing CloudWatch metrics #720

Don't retry flushing CloudWatch metrics #720

jennmills commented Sep 15, 2022 •

edited by bendvc

Loading

olibrook left a comment

olibrook Oct 5, 2022

kevinxh left a comment

Don't retry flushing CloudWatch metrics #720

Don't retry flushing CloudWatch metrics #720

Conversation

jennmills commented Sep 15, 2022 • edited by bendvc Loading

Description

Details

Future Considerations

Types of Changes

Changes

How to Test-Drive This PR

Checklists

General

olibrook left a comment

Choose a reason for hiding this comment

olibrook Oct 5, 2022

Choose a reason for hiding this comment

kevinxh left a comment

Choose a reason for hiding this comment

jennmills commented Sep 15, 2022 •

edited by bendvc

Loading