Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: ADR for outbox pattern and production modes #292

Merged
merged 15 commits into from
Feb 20, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 94 additions & 0 deletions docs/decisions/0015-outbox-pattern-and-production-modes.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
15. Outbox pattern and production modes
#######################################

Status
******

**Provisional**

Context
*******

Some of the event types in the Event Bus might be more sensitive than others to dropped, duplicated, or reordered events. The message broker itself is partially responsible for ensuring that these problems do not occur in transit, but we also need to ensure that the handoff of events to the broker is reliable.

These are the properties we wish to ensure in the general case:

- **Atomicity**: Many events are related to data that is written to the database in the same request, but transactions can either commit or abort. This gives us two sub-properties:

- **Atomic success**: When a transaction successfully commits in the IDA, any produced events relating to that data are durably transmitted to the message broker. This is more important for events intended to keep services synchronized (sending "latest state of entity" events), and may be less important for some kinds of notification events (especially anything used for tracking or statistics).
- **Atomic failure**: When a transaction fails, due to a rollback, network interruption, or application crash, no events related to those database writes are sent to the message broker. Otherwise, these events would be "counterfactuals" that misrepresent the producing service's internal state. This could result in strange behavior such as incorrect notifications to users, and potentially could produce security issues.

- **Ordering**: If multiple events are produced to the same topic, their ordering is preserved. This raises the question of "ordered according to what metric", as concurrency is in play, so the nature of this property may vary by event.

This is only in the general case, as some events may not be connected to database transactions, some consumers might tolerate violations of either atomic success or failure, and not all events may have strict notions of ordering. Hoever, in the general case violations of any of these can result in consistency failures between services that might not be corrected over any time scale.
timmc-edx marked this conversation as resolved.
Show resolved Hide resolved

It's also worth noting a goal we don't have, that of avoiding duplication. At-least-once delivery is acceptible; exactly-once delivery is not required. Double-sends of events are permissible as long as this only happens occasionally (for performance reasons) and does not entail a violation of ordering.

As of 2023-11-09 we produce events in two different ways relative to transactions:

- **Pre-commit send**: The event is produced to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests.
- **Post-commit send**: The event is only sent from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the post-commit-produced events would be lost during that interval. Ordering is also not preserved here.

We currently use an ad-hoc mix of pre-commit and post-commit send in edx-platform, depending on how particular OpenEdxPublicSignals are emitted. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure a post-commit send is used. But most signals do not have any such call, and are likely sent pre-commit. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space.

Decision
********

We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to post-commit send, but openedx-events configuration will be enhanced to allow configuing each event to a production mode: Immediate, on_commit, or outbox.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there three modes or four here? Events default to post-commit, or you can configure immediate, on_commit, outbox. post-commit = on_commit?

from reading below it looks like post-commit is current adhoc behavior which all needs to be changed to on_commit future regulated behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, I should clean that up -- it's just three modes. Post-commit/on-commit are supposed to be the same thing. I wasn't sure whether or not to draw a distinction between the current ad-hoc behaviors and the intended modes via naming differences. Probably I shouldn't do that, and should use consistent names for the two. Maybe I'll go back and use the mode names in the Context section rather than using pre-commit and post-commit there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to consistently use names "immediate" and "on-commit".


In the outbox pattern, events are not produced immediately, but are appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.

openedx-events will change to support three producer modes for sending events:

- ``immediate``: Whether or not there's a transaction, just send to the event bus immediately. This is the "pre-commit send" described in the Context section and is the current behavior for ``send(...)``.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just drop this one? What I'm getting at is that if all events run through one commit hook, they're easier to think about. Is there ever an important reason to get the event out before the DB write?

Although maybe if all events are sent the same one having this feature is close to free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... yes, I think we could! All of the current events are about a change in application state, so if a transaction rolls back then we should not send the event. I can imagine future events that are more about a request, or an attempt having been made, but we can add that mode if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed immediate as a mode, and described it instead as something that could be added in the future.

- ``on_commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process).

This requires ensuring that any events that are currently being explicitly sent post-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this.
- ``outbox``: Prep the signal for sending, and save in an outbox table for sending as soon as possible. The outbox table will be managed by `django-jaiminho`_. Deployers using this mode will also need to run a jaiminho management command in a perpetual worker process in order to relay events from the outbox to the broker and mark them as successfully sent. Another management command would be needed to periodically purge old processed events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if django-jaiminho could either move to the consequence section, or be a lighter touch on this ADR. I feel like this ADR is more general, and doesn't need to be tied to a specific implementation, although we do need to choose one, and it is fine to document what we will be trying. That also could be its own ADR, but my point is that if we decide on a different implementation, it really shouldn't supersede this ADR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could break out an implementation plan subsection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been moved to a new subsection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: review this update.


(TBD: Format for the event data in the outbox. No further event-specific DB queries should be required for producing the bytes for the wire format, but it should not be serialized in a way that is specific to Kafka, Redis, etc.)

(TBD: Safeguards around inadvertently changing the save-to-outbox function's name and module, since those are included in jaiminho's outbox records.)

openedx-events will add a per event type configuration field specifying the event’s producer mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on_commit``.)

``django-jaiminho`` will be added as a dependency of openedx-events and to the ``INSTALLED_APPS`` of relying IDAs.
robrap marked this conversation as resolved.
Show resolved Hide resolved

TBD: Observability of outbox size and event send errors.

.. _django-jaiminho: https://github.com/loadsmart/django-jaiminho

Consequences
************

- The event bus becomes far more reliable, and able to handle events that require at-least-once delivery. The need for manual re-producing of events should become very rare.
- Open edX becomes more complicated to run. Adding a new worker process to every service that produces events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it? Can't I just configure all my events to be post-commit send in a config file?

or maybe I misunderstood and the three producer models are in code, not all the same in code and then at "send" do one of three things

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, it doesn't become more complicated if you opt not to use outbox. (I just... have opinions here.) I can dial that back.

- Duplication becomes possible, so we would need a way to avoid sending the same event over and over again to the broker if the broker is failing to send acknowledgements. We may need to revisit existing events and improve documentation around ensuring that consumers can tolerate duplication, either by ensuring that events are idempotent or by keeping track of which event IDs have already been processed.
- The database will be required to store an unbounded number of events during a broker outage, worker outage, or event bus misconfiguration.

Rejected and Unplanned Alternatives
***********************************

Change Data Capture
timmc-edx marked this conversation as resolved.
Show resolved Hide resolved
===================

Change data capture (CDC) is a method of directly streaming database changes from one place to another by following the DB's transaction log. This provides the same transactionality benefits as the outbox method. `Debezium <https://debezium.io/>`_ is an example of such a system and can read directly from the database and produce to Kafka, where the data can then be transformed and routed to other systems. While a CDC platform could send data to the Open edX event bus, it would also be redundant with the event bus. In the example of Debezium, a deployment would still need a Kafka cluster even if they wanted to put event data into Redis.

CDC systems also source their data at a lower level than we're targeting with the event bus; Django usually insulates us from schema details via an ORM layer, but CDC involves reading table data directly. We'd have tight coupling with our DB schemas. And the eventing system we've chosen to build operates at a higher abstraction layer than database writes, creating another conceptual mismatch. Theoretically, a CDC system could also be responsible for reading events from an outbox, allowing high-level eventing, but this is unlikely to be more palatable than just running a management command in a loop.

Non-worker event production
===========================

The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull *at least that many* events from the outbox and produce them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events produced by workers and other non-request-based processes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It seems like celery beat could be used as well. Same as it possibly could have been for event consumers.
  2. Is an alternative also to just another type (like immediate), or allow outbox to work like immediate if some part of it is not configured? Logging would just need to be used for produce errors, as we would need to do now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I would group celery-beat under the umbrella of worker processes. (Most of our IDAs don't have a celery worker, right?)
  2. I don't think there's any way for the adding-to-outbox side to know whether the removing-from-outbox side is properly configured.


Web responses that produce events would have higher latency, as they would have to finish an additional DB read, broker call, and DB write before returning the response to the user. Event latency would also increase and become more variable due to the opportunistic approach.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't current web calls that produce events already have the broker latency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might depend on the implementation. The Kafka implementation queues events for delivery to the broker, but that happens asynchronously to the request. Another implementation might do synchronous delivery (including raising an exception in the request thread if the broker communication fails!)


It's also conceivable that each Django server in the IDA could start a background process to act as an outbox-emptying worker.

We're not planning on implementating either of these, but they should be drop-in replacements for the long-running management command, and could be developed in the future by deployers who need such an arrangement.

References
**********

- Microservices.io on the transactional outbox pattern: https://microservices.io/patterns/data/transactional-outbox.html
- An introduction to jaiminho: https://engineering.loadsmart.com/blog/introducing-jaiminho
1 change: 1 addition & 0 deletions docs/decisions/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,3 +19,4 @@ Architectural Decision Records (ADRs)
0012-producing-to-event-bus-via-settings
0013-special-exam-submission-and-review-events
0014-new-event-bus-producer-config
0015-outbox-pattern-and-production-modes