Mingle Card: 2441
Description
We have a few options: we could pick a specialized priority job queue library (such as Kue or monq), we could build a job queue abstraction over a message queue library/platform (such as Amazon SQS, RabbitMQ, ZeroMQ, etc.), or we could refine our own queue to handle multiple consumers (the queue used for processing sync POSTs).
Acceptance Criteria
The queue mechanism we choose to work with needs:
Consumers - to listen on the queue and grab messages as they come in. We'll want the consumers to run as multiple individual processes in order to process many sync messages concurrently.
Atomic - we don't want multiple consumers pulling down and processing the same message multiple times. Once a consumer acknowledges a message, no other consumer should be able to process it.
Speed - items need to be inserted into the queue quickly (so as not to hold up the sync). Equally, they need to be pulled off the queue quickly. The actual processing time may vary, but the queue needs to handle large numbers of reads/writes.
Error handling - if a consumer dies mid-process, the sync message it was holding should be returned to the queue for another consumer to pick up.
Analysis
Specialized library
Kue:
Kue is a priority job queue backed by redis (rather than our existing server-side data storage platform, mongodb). A usage sketch follows the library comparison below.
Pros:
Rich API for handling errors, including retries with delayed or exponential back-off support.
Various logging options baked in, allowing us to expose information at any point in the job's lifetime.
Graceful shutdown - sync workers can be told to stop processing once the active job completes, or, in the case of timeout or error, to mark the active job as failed.
Rich web UI to visualize the queue - can be mounted inside our express app (and secured with our existing OAuth tokens). See http://www.screenr.com/oyNs for a demo.
Allows each dyno to control how many jobs will be processed concurrently on a single worker (for instance, based on memory utilization, etc.).
Cons:
Backed by redis rather than mongodb, so we'd have to add and operate a second data store alongside our existing one.
monq:
Monq is a MongoDB-backed job queue.
Pros:
Backed by mongodb (our existing server-side data storage).
Allows for mongodb TTL (time-to-live) indexes - *should* allow us to expire queued jobs periodically (and potentially expire previous jobs for a given auth token so that a user only has one sync per session).
Basic support for error handling with retries and delayed retries.
Cons:
No UI.
Lacking documentation (except for test coverage, which is pretty good).
No explicit logging hooks (although it shouldn't be too hard to add our own logging).
No out-of-the-box support for handling a process terminating during the processing of a job.
Small user base.
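To make the comparison concrete, here is a minimal sketch of what a Kue-based sync worker might look like, based on Kue's documented API in recent versions. The queue name 'sync', the concurrency of 5, the retry/back-off values, and the processSync handler are illustrative assumptions, not settled decisions.

```js
const kue = require('kue');
const queue = kue.createQueue(); // connects to redis (localhost by default)

// Producer: enqueue a sync job with retries and exponential back-off.
queue
  .create('sync', { token: 'abc123', payload: { /* sync packet */ } })
  .attempts(3)
  .backoff({ type: 'exponential' })
  .save();

// Consumer: process up to 5 sync jobs concurrently in this worker process.
queue.process('sync', 5, function (job, done) {
  processSync(job.data, done); // hypothetical sync handler; calls done(err) when finished
});

// Graceful shutdown: give active jobs 5s to finish when heroku sends SIGTERM.
process.once('SIGTERM', function () {
  queue.shutdown(5000, function () {
    process.exit(0);
  });
});

// The web UI is an express app and can be mounted inside ours, e.g.:
// app.use('/queue', kue.app);
```

The .attempts/.backoff calls correspond to the retry pro above, the second argument to queue.process covers per-dyno concurrency, and queue.shutdown covers graceful shutdown.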
Refine our own queue implementation
To cover the acceptance criteria, we'd need to make a few changes to our queue to correctly handle multiple consumers.
Atomic operations:
We can handle the atomic requirement by using mongodb's findAndModify command in conjunction with adding a new "inProg" boolean to the message envelope. Since findAndModify obtains a write lock on the affected database (blocking other operations until it has completed), we can safely track in-progress jobs and prevent multiple consumers from picking up the same job.
To support findAndModify, we'd have to remove our current usage of tailable cursors and replace it with polling, as these features cannot currently be used together (see https://jira.mongodb.org/browse/SERVER-11753).
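A minimal sketch of the claim step, assuming the queue lives in a syncQueue collection, each envelope carries the inProg flag plus a startedAt timestamp, and a 4.x-era node driver is in use (its findOneAndUpdate helper issues the findAndModify command under the hood). Collection and field names are illustrative, not final.

```js
// Atomically claim the oldest unclaimed message. Because the filter and the
// update are applied as a single findAndModify operation, two consumers can
// never receive the same document.
async function claimNextMessage(db) {
  const result = await db.collection('syncQueue').findOneAndUpdate(
    { inProg: false },                                  // only unclaimed messages
    { $set: { inProg: true, startedAt: new Date() } },  // mark in progress, record when
    { sort: { _id: 1 }, returnDocument: 'after' }       // oldest first, return the updated doc
  );
  return result.value; // null when the queue is currently empty (newer drivers may return the doc directly)
}
```

Consumers would call this in a polling loop (replacing the tailable cursor), backing off briefly whenever it returns null.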
Error Handling
If/when the heroku process is unexpectedly terminated while a message is being processed, we can attempt to clean up (listening for SIGTERM, then setting inProg back to false - or marking the packet with error: shutdown); however, this isn't foolproof. In the case of unexpected termination, a job may be left in a stale state (inProg = true) with no corresponding worker.
As part of the findAndModify call, we can set a timestamp so that we know when the message was pulled down. Based on how long we expect consumers to take per message, we can then query for documents that are marked as in progress but whose timestamp is older than that threshold.
Using this technique, we can update those documents to reset the "inProg" flag, effectively returning them to the queue for another consumer to work on, or we can mark them with an error.
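A sketch of both halves, under the same assumptions as above (syncQueue collection, inProg/startedAt fields); the STALE_AFTER_MS threshold and the releaseCurrentMessage helper are hypothetical placeholders.

```js
// Best-effort cleanup when heroku sends SIGTERM before terminating the dyno.
process.once('SIGTERM', async () => {
  try {
    await releaseCurrentMessage(); // hypothetical: resets inProg on the message this worker holds
  } finally {
    process.exit(0);
  }
});

// Reaper for the not-so-graceful case: return stale in-progress messages to the
// queue (or alternatively mark them as errored) so another consumer can pick them up.
const STALE_AFTER_MS = 5 * 60 * 1000; // assumed upper bound on healthy processing time

async function requeueStaleMessages(db) {
  const cutoff = new Date(Date.now() - STALE_AFTER_MS);
  return db.collection('syncQueue').updateMany(
    { inProg: true, startedAt: { $lt: cutoff } },       // claimed long ago, never finished
    { $set: { inProg: false }, $unset: { startedAt: '' } }
  );
}
```

requeueStaleMessages would need to run periodically (for example, on worker start-up or from a scheduled job).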