docs: ADR for outbox pattern and production modes #292

timmc-edx · 2023-12-02T01:42:56Z

Adapted and expanded from
https://openedx.atlassian.net/wiki/spaces/AC/pages/3922952193/Event+Bus+Reliability (with some corrections made to the information on Debezium and CDC).

See the rendered version of this PR as well.

Adapted and expanded from https://openedx.atlassian.net/wiki/spaces/AC/pages/3922952193/Event+Bus+Reliability (with some corrections made to the information on Debezium and CDC).

ashultz0 · 2023-12-04T18:55:50Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+Decision
+********
+
+We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to post-commit send, but openedx-events configuration will be enhanced to allow configuing each event to a production mode: Immediate, on_commit, or outbox.


are there three modes or four here? Events default to post-commit, or you can configure immediate, on_commit, outbox. post-commit = on_commit?

from reading below it looks like post-commit is current adhoc behavior which all needs to be changed to on_commit future regulated behavior?

Ah yeah, I should clean that up -- it's just three modes. Post-commit/on-commit are supposed to be the same thing. I wasn't sure whether or not to draw a distinction between the current ad-hoc behaviors and the intended modes via naming differences. Probably I shouldn't do that, and should use consistent names for the two. Maybe I'll go back and use the mode names in the Context section rather than using pre-commit and post-commit there?

Updated to consistently use names "immediate" and "on-commit".

ashultz0 · 2023-12-04T18:59:39Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+************
+
+- The event bus becomes far more reliable, and able to handle events that require at-least-once delivery. The need for manual re-producing of events should become very rare.
+- Open edX becomes more complicated to run. Adding a new worker process to every service that produces events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.)


does it? Can't I just configure all my events to be post-commit send in a config file?

or maybe I misunderstood and the three producer models are in code, not all the same in code and then at "send" do one of three things

True, it doesn't become more complicated if you opt not to use outbox. (I just... have opinions here.) I can dial that back.

ashultz0 · 2023-12-04T19:12:15Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+
+The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull *at least that many* events from the outbox and produce them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events produced by workers and other non-request-based processes.
+
+Web responses that produce events would have higher latency, as they would have to finish an additional DB read, broker call, and DB write before returning the response to the user. Event latency would also increase and become more variable due to the opportunistic approach.


Don't current web calls that produce events already have the broker latency?

Might depend on the implementation. The Kafka implementation queues events for delivery to the broker, but that happens asynchronously to the request. Another implementation might do synchronous delivery (including raising an exception in the request thread if the broker communication fails!)

docs/decisions/0015-outbox-pattern-and-production-modes.rst

ashultz0 · 2023-12-04T19:39:56Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+
+openedx-events will change to support three producer modes for sending events:
+
+- ``immediate``: Whether or not there's a transaction, just send to the event bus immediately. This is the "pre-commit send" described in the Context section and is the current behavior for ``send(...)``.


Could we just drop this one? What I'm getting at is that if all events run through one commit hook, they're easier to think about. Is there ever an important reason to get the event out before the DB write?

Although maybe if all events are sent the same one having this feature is close to free.

Hmm... yes, I think we could! All of the current events are about a change in application state, so if a transaction rolls back then we should not send the event. I can imagine future events that are more about a request, or an attempt having been made, but we can add that mode if needed.

I've removed immediate as a mode, and described it instead as something that could be added in the future.

- Use "immediate" and "on-commit" names in the Context section to describe our current usage -- this will now match the mode names in the Decision section (which are just better names, too.) - Rename the "on_commit" mode to "on-commit" (underscore to hyphen) since it's easier to type that way, and to give a little distance from the implementation detail of `transaction.on_commit`. - Fix typo "configuing"

robrap

Nice job. Thank you for this.

docs/decisions/0015-outbox-pattern-and-production-modes.rst

robrap · 2023-12-05T17:51:47Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+- **Immediate send**: The event is produced to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests.
+- **On-commit send**: The event is only sent from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the on-commit-produced events would be lost during that interval. Ordering is also not preserved here.
+
+We currently use an ad-hoc mix of immediate and on-commit send in edx-platform, depending on how particular OpenEdxPublicSignals are emitted. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit send is used. But most signals do not have any such call, and are likely sent immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space.


This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues.

I'm not sure exactly how to document the following thought, but it may be useful to clarify that OpenEdxPublicSignals always result in in-process events, and sometimes result in cross-service events published to the event bus. Some of the concerns you are documenting relate to both in-process and cross-service events (e.g. sending an event immediately, which may get out of sync with transaction failures), and some are specific to the event bus (e.g. broker availability).

Those other in-process receivers would generally be operating within the request-level transaction though, right? Any which do should have the desired reliability properties already.

Not exactly. Any non-trivial in-process event processing would still likely happen async using celery. It just wouldn't require the event bus for communicating across services.

Hmm. I guess I'm not sure what you're trying to say.

Sorry, maybe I'm getting confused. There are two steps:

The call to OpenEdxPublicSignal's send_event, and

The call to send an event to a broker in this signal's receiver.

The very first step might be happening immediately, or on-commit, depending on how it was implemented correct? I'm now unclear whether when you are referring to "immediate send" and "on-commit send", whether you are referring to step 1 above, or step 2 above. Also, I'm not clear from an implementation standpoint (without looking more deeply), what combinations of immediate/on-commit are possible across steps 1 and 2.

Once we are on the same page with this, maybe I can make my earlier questions more clear. If this is also not clear, than we can just discuss in person. :)

Yeah, that's... definitely confusing. I'll rework it to use distinct language. :-)

Great. And I think adding something of the nature of this comment also helps clarify:

If the code that sends to a signal does so in transaction.on_commit, the only possible behavior is producing to the broker after transaction completion. If instead the code sends to a signal as soon as possible, then it depends on what mode is configured for that signal -- the event might get published to the broker immediately, or delayed to commit-time.

Although it is somewhat logical/obvious, it also helps differentiate between the two.

I think the overall language change I want to make is this:

"Send" is used only for Django signals

"Publish" is used for emitting events to the event bus broker

I want to avoid "produce" for the most part, as it leads to awkward phrasing (in my opinion). Produce/consume and publish/subscribe are largely used as synonyms in this space, but publish/subscribe seems to be more accurate. I still want a way to describe IDAs and requests that can create an event and try to publish it, with the distinction that the publishing may not actually occur in all situations (e.g. in an error situation). For those, "event-producing IDA" and "event-producing request" seem sufficient.

Given that there's no current plan to actually implement immediate mode, do you still want that language about immediate and on-commit behaviors?

Note that we mainly use produce/consume for our docs (e.g. https://openedx.atlassian.net/wiki/spaces/AC/pages/3508699151/How+to+start+using+the+Event+Bus#Producing). I'm not entirely against you using publish/subscribe, but then it feels like you should address this discrepancy. I would wonder how publish/subscribe differs from produce/consume, used everywhere else in the docs.

I say you update as you see fit, and then I can answer your question as to whether I think anything is still unclear.

robrap · 2023-12-05T17:53:50Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+
+We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a choice of production mode (on-commit or outbox).
+
+In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.


Can we make this more vague as we work out the details of clean-up later?

Suggested change

In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.

In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and eventually removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.

Hmm, how about something like this? Marking them as sent is required, but can be separated from deletion.

Suggested change

In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.

In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and marking the records as sent. (This or another process eventually removes them from the table.) This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.

robrap · 2023-12-05T17:57:16Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+
+We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a choice of production mode (on-commit or outbox).
+
+In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.


Should we address separately the benefits of at-least-once resiliency, even if the outbox pattern were used outside the scope of a DB transaction?

Yes, I'll add something.

robrap · 2023-12-06T22:52:34Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+
+- ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process).
+
+  This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this.


Maybe this belongs under consequences?

Either way, I'm having a tough time getting this. It may need more code or links to code or details.

Also, doesn't this affect more than just the event bus event? This is part of my confusion. I thought there was an OpenEdxPublicSignal send which is for the in-process or event bus events, and that sending events to the event bus was just based on config that uses this signal, wherever it was sent. But, I might be totally missing something.

Here's where we currently delay the sending of a specific event until after the transaction commits: https://github.com/openedx/edx-platform/blob/a74f510f71c19142aeb380facdc2cdfecf16c38a/cms/djangoapps/contentstore/signals/handlers.py#L153 (Note that it currently has an additional race condition, in that the data for the event is gathered after the transaction has committed.)

We'd be able to replace this transaction.on_commit call with something like COURSE_CATALOG_INFO_CHANGED.send_event(*_create_catalog_data_for_signal(course_key)) (computing the event body inside the transaction). The default configuration of COURSE_CATALOG_INFO_CHANGED would be to send it in on-commit mode, largely preserving the existing behavior -- but without that race condition.

I think this raises some important caveats that are worth documenting, maybe in "consequences". django-jaiminho stores the args/kwargs and only executes the function call to send the message after (maybe quite a while after) the transaction is committed, so it is important to be very mindful of both the size of what you're potentially sending in and the fact that you're not actually acting in the same transaction as the rest of the code, which may have some unintended consequences.

Things like passing in a lazy Django queryset can cause it to get evaluated, so if you're planning on doing further filtering in the sending function, that should get bubbled up to the caller. There's also a 65k limit to the dill'd stored args, so there's a chance of inadvertently overrunning that with an evaluated queryset and either losing the event due to error or saving an unserializable blob to the database.

Similarly we'd need to be on guard against pulling any new data in the sending function since it could be operating on data hours later than when it was queued, breaking the ordering guarantees.

With dill in the picture folks will also need to think about the usual pickle security problems, and potential serialization issues that they wouldn't have otherwise had to.

@timmc-edx: can we add to the ADR the example you just mentioned about the replacement?

Also, in these lines you mention:

This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call get_producer().send(...) directly, after appropriate per-event configuration. emit_catalog_info_changed_signal is a known example of this.

But the example is different, the transaction.commit is replaced by the usual send_event method, not the get_producer(). Maybe I'm just missing some context. Also, can we explicitly link which send method will handle the production modes?

@bmtcril Yes, I think what we want to do here is only use the jaiminho decorator on a very simple, very stable function that only accepts pre-serialized data -- maybe JSON strings. I don't think jaiminho's overall approach in that regard is very safe, but it looks like there's a safe way to use it. (We could also just call jaiminho.publish_strategies.publish directly.)

@mariajgrimaldi Sorry, yes, I was still thinking of the older version of openedx-events where we explicitly call send on the producer. Now that we have configuration-based event publishing, everything is handled by openedx-events' signal handlers.

Specifically, openedx_events.apps.general_signal_handler is currently the receiver for OpenEdxPublicSignals, and it makes a call to get_producer().send(...). general_signal_handler would change to look at the publishing mode for the event and either store the event in the outbox or schedule the get_producer().send(...) to be run on-commit.

If that sounds right to you, I can update the Implementation Plan section.

@timmc-edx: thank you! That would be ideal.

@mariajgrimaldi Should be taken care of by the recent commits -- let me know what you think.

@timmc-edx: Thanks again. Now it's crystal clear.

robrap · 2023-12-06T22:54:59Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+- ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process).
+
+  This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this.
+- ``outbox``: Prep the signal for sending, and save in an outbox table for sending as soon as possible. The outbox table will be managed by `django-jaiminho`_. Deployers using this mode will also need to run a jaiminho management command in a perpetual worker process in order to relay events from the outbox to the broker and mark them as successfully sent. Another management command would be needed to periodically purge old processed events.


I'm wondering if django-jaiminho could either move to the consequence section, or be a lighter touch on this ADR. I feel like this ADR is more general, and doesn't need to be tied to a specific implementation, although we do need to choose one, and it is fine to document what we will be trying. That also could be its own ADR, but my point is that if we decide on a different implementation, it really shouldn't supersede this ADR.

I could break out an implementation plan subsection.

This has been moved to a new subsection.

note to self: review this update.

docs/decisions/0015-outbox-pattern-and-production-modes.rst

robrap · 2023-12-06T23:03:35Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+Non-worker event production
+===========================
+
+The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull *at least that many* events from the outbox and produce them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events produced by workers and other non-request-based processes.


It seems like celery beat could be used as well. Same as it possibly could have been for event consumers.

Is an alternative also to just another type (like immediate), or allow outbox to work like immediate if some part of it is not configured? Logging would just need to be used for produce errors, as we would need to do now.

I would group celery-beat under the umbrella of worker processes. (Most of our IDAs don't have a celery worker, right?)

I don't think there's any way for the adding-to-outbox side to know whether the removing-from-outbox side is properly configured.

bmtcril · 2023-12-15T19:26:48Z

Overall I think this is great, most of my concerns are around the specifics of jaiminho implementation details and not the adoption of the pattern.

As we're working toward a place where tracking logs can optionally be sent over the bus I think it would be great to be able to send them on commit, which would fix several long standing issues with the data quality guarantees of tracking logs. However at that volume I wouldn't necessarily want jaiminho's behavior of writing the event to the outbox on failure. I think the current language around immediate mode gives us an out unless/until we can get a workaround upstream for that, so I'm just plugging for making sure that the option stays in.

timmc-edx · 2023-12-18T17:53:12Z

@bmtcril Can you clarify whether you think immediate-mode should be kept as a thing to implement as part of this ADR, or whether you think it's just something that the ADR should leave room for in the future?

bmtcril · 2023-12-18T18:57:33Z

@timmc-edx I just wanted to make sure that we don't rule it out at this point.

"Send" was ambiguous because you can send to a Django signal and you can send to the event bus, and one can lead to the other. New language: - "Send" is used only for Django signals - "Publish" is used for emitting events to the event bus broker This also changes "producer mode" to "publishing mode". I'm avoiding "produce" for the most part, as it leads to awkward phrasing (in my opinion). Produce/consume and publish/subscribe are largely used as synonyms in this space, but publish/subscribe seems to be more accurate. I still want a way to describe IDAs and requests that can create an event and try to publish it, with the distinction that the publishing may not actually occur in all situations (e.g. in an error situation). For those, "event-producing IDA" and "event-producing request" seem sufficient.

- I was describing the old get_producer().send() code that is no longer in use ever since we switched from explicit producer calls to configuration based production. Now the ADR describes OpenEdxPublicSignal.send_event() instead. - I've moved the note about needing to migrate a signal from explicit to config-based on-commit publishing from Decision to Consequences.

It would be great if we could unify our language across all of our code and docs, but for now I can just clarify the language used in the ADR.

timmc-edx · 2024-02-06T16:35:04Z

Any other changes people would like to see? (I think everything has been addressed at this point.)

robrap

Thanks for landing this!

robrap · 2024-02-06T17:07:17Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+- **Immediate publish**: The event is published to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests.
+- **On-commit publish**: The event is published from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the on-commit-published events would be lost during that interval. Ordering is also not preserved here.
+
+We currently use an ad-hoc mix of immediate and on-commit publish in edx-platform, depending on how code sends to particular OpenEdxPublicSignals. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit publish is used. But most signals sends do not have any such call, and are likely published immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space.


I'm not sure if instead of signals sends you want signals' sends, or something like that?

Thanks, changing to "signal sends".

robrap · 2024-02-06T17:09:48Z

docs/decisions/0015-outbox-pattern-and-production-modes.rst

+Decision
+********
+
+We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event publishing to database transactions. Events will default to on-commit publish, but openedx-events configuration will be enhanced to allow configuring each event to a choice of **publishing mode** (on-commit or outbox).


Nit: Wondering if language like "Implementation of the transactional outbox pattern ... will solve ...", or something like this? Basically, starting with "We will implement..." when we have no idea who will do this, nor when, may be a little strong. That said, this is just a non-blocking nit.

The template instructions say to use active voice and suggests starting with "we will". I agree that it's a little awkward in a provisional ADR with no immediate plan to implement, but I'd prefer to keep it this way.

docs: ADR for outbox pattern and production modes

c3a9c06

Adapted and expanded from https://openedx.atlassian.net/wiki/spaces/AC/pages/3922952193/Event+Bus+Reliability (with some corrections made to the information on Debezium and CDC).

timmc-edx requested a review from a team as a code owner December 2, 2023 01:42

timmc-edx added 2 commits December 2, 2023 01:49

fixup! Trailing whitespace lint

317b2dc

fixup! Include ADR in index

a4776d0

timmc-edx mentioned this pull request Dec 4, 2023

Discovery: Implement MVP of an outbox pattern for the event bus #251

Open

3 tasks

ashultz0 reviewed Dec 4, 2023

View reviewed changes

docs/decisions/0015-outbox-pattern-and-production-modes.rst Show resolved Hide resolved

ashultz0 reviewed Dec 4, 2023

View reviewed changes

timmc-edx added 3 commits December 4, 2023 21:52

fixup! Change language around operational complexity

a7434e1

fixup! Remove immediate mode

a49d859

robrap reviewed Dec 6, 2023

View reviewed changes

timmc-edx added 3 commits December 11, 2023 19:40

fixup! Fix various typos

55d4957

fixup! Add note about at-least-once benefits

b5c816f

fixup! Move jaiminho details to new Implementation Plan section

f932462

timmc-edx added 4 commits January 19, 2024 21:26

fixup! Be explicit about the properties of the modes

289e3f8

fixup! Add a note on language

4053629

It would be great if we could unify our language across all of our code and docs, but for now I can just clarify the language used in the ADR.

robrap approved these changes Feb 6, 2024

View reviewed changes

timmc-edx added 2 commits February 8, 2024 14:16

Merge branch 'main' into timmc/adr-outbox

5f9d24b

fixup! Fix typo "signals sends"

9fc5e16

timmc-edx merged commit 6427965 into main Feb 20, 2024
8 checks passed

timmc-edx deleted the timmc/adr-outbox branch February 20, 2024 16:11


		The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull at least that many events from the outbox and produce them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events produced by workers and other non-request-based processes.

		Web responses that produce events would have higher latency, as they would have to finish an additional DB read, broker call, and DB write before returning the response to the user. Event latency would also increase and become more variable due to the opportunistic approach.


		openedx-events will change to support three producer modes for sending events:

		- ``immediate``: Whether or not there's a transaction, just send to the event bus immediately. This is the "pre-commit send" described in the Context section and is the current behavior for ``send(...)``.


		We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a choice of production mode (on-commit or outbox).

		In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the intent to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.


		- ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process).

		This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this.

docs: ADR for outbox pattern and production modes #292

docs: ADR for outbox pattern and production modes #292

Conversation

timmc-edx commented Dec 2, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

robrap left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mariajgrimaldi Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bmtcril commented Dec 15, 2023

timmc-edx commented Dec 18, 2023

bmtcril commented Dec 18, 2023

timmc-edx commented Feb 6, 2024

robrap left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timmc-edx commented Dec 2, 2023 •

edited

Loading

mariajgrimaldi Dec 15, 2023 •

edited

Loading