feat: log all critical bounces #288

Merged (30 commits) on Sep 10, 2020

mantariksh commented Sep 7, 2020

Problem

The current bounce collection and alarm system works as follows:

  1. When a bounce notification comes in, update hasBounced in the Bounce collection for each email recipient.
  2. If hasBounced is true for all the email recipients AND hasAlarmed is false, log a message containing CRITICAL BOUNCE and set hasAlarmed to true. If hasAlarmed is already true, do nothing.
  3. In CloudWatch, filter for messages containing CRITICAL BOUNCE. Every 5 minutes, activate the alarm if at least one message containing CRITICAL BOUNCE was logged in the preceding 5 minutes.
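
Steps 1 and 2 can be sketched roughly as follows. This is an illustration only, not the code in bounce.service.ts: `BounceRecord`, `LogSink` and `handleBounceOldFlow` are hypothetical names, and the Bounce document is modelled as a plain in-memory object rather than a database document.

```typescript
// Hypothetical in-memory stand-in for a document in the Bounce collection.
interface BounceRecord {
  formId: string
  hasAlarmed: boolean
  // One entry per email recipient of the form.
  bounces: { email: string; hasBounced: boolean }[]
}

// Receives log lines so the sketch stays self-contained. A real service
// would write to its logger instead.
type LogSink = (message: string) => void

// Steps 1 and 2 of the old flow: record the bounce, then log only the
// first critical bounce seen while hasAlarmed is still false.
function handleBounceOldFlow(
  record: BounceRecord,
  bouncedEmails: string[],
  log: LogSink,
): void {
  // Step 1: mark every recipient named in the notification as bounced.
  for (const entry of record.bounces) {
    if (bouncedEmails.indexOf(entry.email) !== -1) {
      entry.hasBounced = true
    }
  }
  // Step 2: a bounce is critical when all recipients have bounced.
  const isCritical = record.bounces.every((entry) => entry.hasBounced)
  if (isCritical && !record.hasAlarmed) {
    log(`CRITICAL BOUNCE formId=${record.formId}`)
    record.hasAlarmed = true
  }
  // If hasAlarmed is already true, nothing is logged.
}
```

Calling this repeatedly for the same form logs at most one message until hasAlarmed is reset.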

In other words, we do not log every single critical bounce. This is a mechanism to limit the rate of alarms: even if a form keeps producing critical bounces while lots of people are submitting, we will still get only ONE alarm per form every 3 hours. This is because the document in the Bounce collection expires every 3 hours, after which hasAlarmed reverts to its default of false. The next critical bounce logs a message and sets hasAlarmed to true. Subsequent critical bounces see that hasAlarmed is true and log nothing, so they do not activate the alarm again.

The issue with this system is that we are unable to track the number of critical bounces over time, since we log at most one critical bounce per form every 3 hours. This is problematic for two reasons:

  1. We cannot set an alarm based on the volume of critical bounces.
  2. We cannot track the rate of critical bounces over time. This is exacerbated by the fact that the logs are sent to a special CloudWatch log group with a 2-month TTL, so we cannot look up bounces older than 2 months.
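
To make the first point concrete: a volume-based alarm needs one log event per critical bounce so that events can be counted over a window. The sketch below shows the kind of check such an alarm performs. It is an assumption for illustration (the real alarm is configured in CloudWatch, not in application code), and `countInWindow` and `shouldAlarm` are hypothetical names.

```typescript
// Counts the log events whose timestamp falls inside the window
// (nowMs - windowMs, nowMs]. Timestamps are epoch milliseconds.
function countInWindow(
  timestampsMs: number[],
  nowMs: number,
  windowMs: number,
): number {
  const windowStartMs = nowMs - windowMs
  return timestampsMs.filter((t) => t > windowStartMs && t <= nowMs).length
}

// A volume-based alarm fires when the count reaches a threshold.
function shouldAlarm(
  timestampsMs: number[],
  nowMs: number,
  windowMs: number,
  threshold: number,
): boolean {
  return countInWindow(timestampsMs, nowMs, windowMs) >= threshold
}
```

With the current system the count for a form can never exceed one per 3 hours, so any threshold above 1 is meaningless.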

Solution

  1. Log ALL critical bounces, including those for which alarms have already been activated. We can then set an alarm based on the total number of critical bounces over a period of time.
  2. Log email notifications to the main log group, which allows us to search for a year's worth of bounce info for admins. The exception here is email notifications for Email Confirmations sent to form-fillers, since there are privacy concerns regarding keeping metadata on citizens for an extended period of time. Hence Email Confirmation notifications will still be logged to the short-term log group.
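
A minimal sketch of the two changes, under stated assumptions: `BounceDoc`, `CriticalBounceLog`, `NotificationType`, `selectLogGroup` and `logCriticalBounce` are hypothetical names, the log group labels are placeholders, and including hasAlarmed in the log entry is an assumption rather than something this PR is known to do.

```typescript
// Placeholder labels for the two log groups described above.
type LogGroup = 'main' | 'short-term'
// Hypothetical notification categories.
type NotificationType = 'ADMIN_RESPONSE' | 'EMAIL_CONFIRMATION'

// Email Confirmation notifications stay in the short-term group for
// privacy reasons; everything else goes to the main group.
function selectLogGroup(type: NotificationType): LogGroup {
  return type === 'EMAIL_CONFIRMATION' ? 'short-term' : 'main'
}

// Hypothetical in-memory stand-in for a Bounce document.
interface BounceDoc {
  formId: string
  hasAlarmed: boolean
  bounces: { email: string; hasBounced: boolean }[]
}

interface CriticalBounceLog {
  message: 'CRITICAL BOUNCE'
  formId: string
  // Assumption: record whether an alarm had already been raised.
  hasAlarmed: boolean
}

// Logs EVERY critical bounce, whether or not hasAlarmed is already true.
function logCriticalBounce(
  doc: BounceDoc,
  log: (entry: CriticalBounceLog) => void,
): void {
  const isCritical = doc.bounces.every((entry) => entry.hasBounced)
  if (!isCritical) return
  log({
    message: 'CRITICAL BOUNCE',
    formId: doc.formId,
    hasAlarmed: doc.hasAlarmed,
  })
  doc.hasAlarmed = true
}
```

Because each critical bounce now produces its own log entry, an alarm can count entries over a period instead of reacting to the first one.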

The following refactors and configuration changes were also made:

  • bounce.server.model was moved inside src/app/modules/bounce for better grouping of related code.
  • Tests for the bounce model were inlined in src/app/modules/bounce/__tests__. A couple of configuration changes were made to accommodate this:
    • tsconfig.build.json was updated so that __tests__ directories are not compiled.
    • jest.config.js was updated so that files in the __tests__ directories are not included when calculating test coverage.
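
For reference, one possible shape of these two configuration changes, written as TypeScript object literals for illustration. The real files are JSON and CommonJS respectively, and the exact patterns used in this PR are not shown here, so the values below are assumptions. `exclude` (tsconfig) and `coveragePathIgnorePatterns` (Jest) are the standard options for this purpose.

```typescript
// Mirrors a tsconfig.build.json fragment: `exclude` takes glob patterns,
// so this keeps every __tests__ directory out of the compiled output.
const tsconfigBuildFragment = {
  exclude: ['node_modules', '**/__tests__'],
}

// Mirrors a jest.config.js fragment: coveragePathIgnorePatterns takes
// regexp pattern strings that are matched against file paths.
const jestConfigFragment = {
  coveragePathIgnorePatterns: ['/node_modules/', '/__tests__/'],
}

// Illustrative helper (not part of Jest): reports whether a path would be
// skipped for coverage under the patterns above.
function isIgnoredForCoverage(filePath: string): boolean {
  return jestConfigFragment.coveragePathIgnorePatterns.some((pattern) =>
    new RegExp(pattern).test(filePath),
  )
}
```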

Tests

Pre-release

  • Staging critical bounces alarm has been updated to search for logs in the main log group.
  • Admin email notifications go to the main log group and not formsg-email-notifications-staging.

Post-release

  • Production critical bounces alarm has been updated to search for logs in the main log group.
  • Production dashboard widgets have been updated.

tshuli commented Sep 8, 2020

Think you left out something in the Solution write-up "Log all critical bounces, including those for which {?}"

karrui left a comment

LGTM, with the exception of the logging parameters: meta.action should always be the calling function.

src/app/modules/bounce/bounce.service.ts: 3 review threads (outdated, resolved)
mantariksh force-pushed the improve-bounce-logging branch from 4f003dd to e8a5fb2 September 10, 2020 08:22
mantariksh changed the base branch from develop to release-4.34.1 September 10, 2020 08:22
mantariksh changed the title from "feat: improve bounce logging" to "feat: log all critical bounces" Sep 10, 2020
mantariksh merged commit 22b8ab1 into release-4.34.1 Sep 10, 2020
karrui deleted the improve-bounce-logging branch November 18, 2020 07:25
5 participants