Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't retry flushing CloudWatch metrics #720

Merged
merged 8 commits into from
Oct 5, 2022
Merged

Conversation

jennmills
Copy link
Contributor

@jennmills jennmills commented Sep 15, 2022

Description

The pwa-kit-runtime is a project that allows us to run Node applications on the Managed Runtime. Inside this project we build an Express server that interfaces with AWS lambda. As our runtime is invoked, we send various metrics to Cloudfront to be displayed in various dashboards before we call the lambdas callback to ultimately send a response to the requesting client (the browser). Before we call this callback we make sure that any metrics we gathered are sent to Cloudfront.

One caveat is that the Cloudfront api endpoint is rate limited, meaning that in times of extreme load a request might be denied. The way our application works, we will retry the sending of metrics multiple times (read below how that works). This causes slow response times.

We shouldn't do this as it's more important to respond to clients requests than it is to send off metrics that we hardly use.

Details

The API for flushing metrics is rate limited and in high load environments, we have seen this cause performance issues when waiting for the call to finish and retry if it fails.

Every retry it waits 50 milliseconds plus an additional 25 milliseconds for the next retry which can add up.

This PR includes changes to reduce the retries to 0 - effectively removing the retries so that the API does not cause performance issues.

Future Considerations

We should look to solve this in a better way. If possible perform the metric sending after making the callback call. This may or may no be possible as the lambda might be destroyed at this time. But its worth looking to how we can do this.

Types of Changes

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Documentation update
  • Breaking change (could cause existing functionality to not work as expected)
  • Other changes (non-breaking changes that does not fit any of the above)

Breaking changes include:

  • Removing a public function or component or prop
  • Adding a required argument to a function
  • Changing the data type of a function parameter or return value
  • Adding a new peer dependency to package.json

Changes

As mentioned above we have reduced retries to 0 instead of 3 when flushing CloudWatch Metrics fail meaning it will only attempt the API call once.

How to Test-Drive This PR

After merging this PR the PWA Kit SDK Team will need to do a Release Preview (FYI @bendvc). When the Release Preview is ready to go we can reach out to the LWR team to test the changes.

Checklists

General

  • Changes are covered by test cases
  • CHANGELOG.md updated with a short description of changes (not required for documentation updates)

The api for flushing metrics is rate limited, in high load environments this may cause performance issues as we wait for this call to finish so if it fails we will not retry.
@jennmills jennmills requested a review from a team as a code owner September 15, 2022 21:22
@olibrook olibrook force-pushed the performance-improvement branch from cc9e108 to 5557227 Compare October 4, 2022 23:28
Copy link
Contributor

@olibrook olibrook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brian Buchanan on the LWR team re-ran performance test with Jen's changes. His findings:

Prior to the change we saw ~1.3s avg duration at scale. We are down to ~400ms with the change. 400ms is still 4x the SSR Render Time metric. (Granted this can be inaccurate with the change)

They have a dashboard, here, too.

What a first PR @jennmills – wowie! 🥳

@@ -0,0 +1,62 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are unrelated to the changes in the PR, but dropping the retry logic in the metrics code caused our branch coverage to drop. I decided to add more tests somewhere else, instead of dropping the coverage requirement.

The metrics code itself has 100% test and branch coverage.

Copy link
Collaborator

@kevinxh kevinxh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for keeping up the test coverage!

@olibrook olibrook merged commit 0a44fa7 into develop Oct 5, 2022
@wjhsf wjhsf deleted the performance-improvement branch March 17, 2023 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants