feat: add retries for enqueuing graphile jobs #11561

yakkomajuri · 2022-08-30T19:29:16Z

Problem

We're losing jobs when Aurora is briefly unavailable while it scales.

https://sentry.io/organizations/posthog2/issues/3477565970/?query=is%3Aunresolved+no+jobqueue&statsPeriod=14d

Changes

Retries enqueueing jobs into Graphile using an existing tested, traced, and instrumented retry mechanism
Refactors that mechanism so it can be used for retries more generally

How did you test this code?

Updated the tests for the retries mechanism
Added tests for the JobQueueManager
Tested manually

Not in scope

Tests for untouched bits of jobs functionality that are currently missing

yakkomajuri · 2022-08-30T19:57:41Z

I know what's up with the test errors, will address tomorrow. Just some tests that weren't updated

mariusandra · 2022-08-31T08:15:40Z

This is a disappointment with regard to Aurora. I remember we chose it precisely because it's a scalable cloud database that could scale up if we got a sudden spike of jobs.

If instead it's a scalable database, except when it scales (probably as we're coming into a spike), then what's the point 🤦

yakkomajuri · 2022-08-31T14:38:26Z

Come to think of it - I wonder if part of the problem is that the graphile-worker lib internally tries to reuse a connection and doesn't even try to establish a new one after a disconnect?

macobo · 2022-09-01T06:18:49Z

plugin-server/src/main/job-queues/job-queue-manager.ts

+        await instrument(
+            this.pluginsServer.statsd,
+            {
+                metricName: jobName === JobName.PLUGIN_JOB ? 'vm.enqueuePluginJob' : 'vm.enqueueBufferJob',


So this is confusing.

It makes an implicit assertion that there are only two types of jobs. That might be true for now, but will it stay true long-term?

It prefixes the metric with vm., which is just a lie

Suggestion:

-                metricName: jobName === JobName.PLUGIN_JOB ? 'vm.enqueuePluginJob' : 'vm.enqueueBufferJob',
+                metricName: `job_queues_enqueue`,

And add jobName as a tag instead.
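
A minimal sketch of that idea, assuming a hypothetical `enqueueMetric` helper and `MetricDescriptor` shape (neither is in the PR): the metric name stays constant and the job type travels only in the tags, so a new job type needs no new metric.

```typescript
// Hypothetical shape for illustration; the real instrument() options may differ.
interface MetricDescriptor {
    metricName: string
    tags: Record<string, string>
}

// Builds the same metric name for every job type and records the type as a tag.
function enqueueMetric(jobName: string): MetricDescriptor {
    return {
        metricName: 'job_queues_enqueue',
        tags: { jobName },
    }
}

// Usage: both job types share one metric and differ only by tag.
const pluginJobMetric = enqueueMetric('pluginJob')
const bufferJobMetric = enqueueMetric('bufferJob')
```

Dashboards can then group or filter by the `jobName` tag instead of tracking one metric per job type.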

I was making sure we kept the metric that was already established before, but can get rid of that in favor of this new metric

Let's break compatibility if it adds complexity to the code. Our metric retention period is relatively short anyways.

macobo · 2022-09-01T06:19:26Z

plugin-server/src/main/job-queues/job-queue-manager.ts

+                metricName: jobName === JobName.PLUGIN_JOB ? 'vm.enqueuePluginJob' : 'vm.enqueueBufferJob',
+                key: instrumentationContext?.key ?? '?',
+                tag: instrumentationContext?.tag ?? '?',
+                tags: { pluginServerMode, type: jobType },


There's no reason to add pluginServerMode tag - https://github.com/PostHog/posthog/blob/master/plugin-server/src/utils/db/hub.ts#L82-L84

macobo · 2022-09-01T06:21:35Z

plugin-server/src/main/job-queues/job-queue-manager.ts

+                        pluginServerMode,
+                    },
+                    tryFn: async () => this._enqueue(jobName, job),
+                    catchFn: () => status.error('🔴', 'Exhausted attempts to enqueue job.'),


Problems:

The log message isn't actionable or even clear. Suggestion: "Exhausted attempts to enqueue job, job was dropped."

We should log the jobName and the job object as well

Q:

This is a catchFn, not a finallyFn? Does this swallow all errors, or will Sentry receive an error as well?
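
Whether Sentry still sees the error depends on whether the retry helper rethrows after calling catchFn. A sketch of one possible behaviour, using a hypothetical `runWithRetries` (not the actual code in retries.ts, and with no delay between attempts to keep it short): catchFn only logs, and the last error still propagates to the caller.

```typescript
// Hypothetical options for illustration; the real helper takes more parameters.
interface RetryOptions<T> {
    tryFn: () => Promise<T>
    catchFn?: (error: unknown) => void
    maxAttempts?: number
}

async function runWithRetries<T>({ tryFn, catchFn, maxAttempts = 3 }: RetryOptions<T>): Promise<T> {
    let lastError: unknown
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await tryFn()
        } catch (error) {
            // Remember the failure and try again until attempts run out.
            lastError = error
        }
    }
    // Attempts exhausted: let the caller log context, then rethrow so the
    // error is not silently swallowed.
    catchFn?.(lastError)
    throw lastError
}

// Example catchFn that includes the job details in the log line.
function logDroppedJob(jobName: string, job: unknown, error: unknown): string {
    return `Exhausted attempts to enqueue job, job was dropped. jobName=${jobName} job=${JSON.stringify(job)} error=${String(error)}`
}
```

If the helper instead returned after catchFn, the caller would never see the failure, which is the behaviour the question above is probing.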

macobo · 2022-09-01T06:25:06Z

plugin-server/src/utils/retries.ts

+                    const nextRetryMs = getNextRetryMs(retryBaseMs, retryMultiplier, attempt)
+                    hub.statsd?.increment(`${metricPrefix}.${metricName}.RETRY`, {
+                        ...metricTags,
+                        attempt: attempt.toString(),


We should be really conservative with tags. What value does attempt tag give us?

I believe you set this up in the past, happy to remove now
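
For context on what `attempt` drives in this hunk: the body of `getNextRetryMs` is not shown here, but its arguments suggest exponential backoff. A sketch under that assumption, with a hypothetical `nextRetryMsSketch` that may differ from the real implementation:

```typescript
// Assumed formula: delay = base * multiplier^(attempt - 1), with attempts starting at 1.
function nextRetryMsSketch(retryBaseMs: number, retryMultiplier: number, attempt: number): number {
    if (attempt < 1) {
        throw new RangeError('attempt must be >= 1')
    }
    return retryBaseMs * retryMultiplier ** (attempt - 1)
}

// With a 1000 ms base and a multiplier of 2, attempts 1..3 wait 1000, 2000, 4000 ms.
const delays = [1, 2, 3].map((attempt) => nextRetryMsSketch(1000, 2, attempt))
```

Since `attempt` is bounded by `maxAttempts`, the tag adds at most that many series per metric, but the delay itself is derivable from the configuration without it.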

macobo · 2022-09-01T06:25:56Z

plugin-server/src/utils/retries.ts

+                }
+                if (error instanceof RetryError && attempt < maxAttempts) {
+                    const nextRetryMs = getNextRetryMs(retryBaseMs, retryMultiplier, attempt)
+                    hub.statsd?.increment(`${metricPrefix}.${metricName}.RETRY`, {


Note: Reverse-engineering actual metrics from code in cases of interpolation like this is hell. Do we actually need this? Could we just use metricName and kill the metricPrefix?

Also what's with the weird capitalization?

Understand it was hard to parse the new changes from old, but this is effectively following what was already there.

macobo · 2022-09-01T06:27:30Z

plugin-server/src/utils/retries.ts

+            description: metricTags.plugin || '?',
+            data: {
+                metricName,
+                payload,


Does nesting objects like this work nicely in sentry?

Unsure - again, this was here before, it was just named "event"

macobo · 2022-09-01T06:30:23Z