Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: ADR for outbox pattern and production modes #292

Merged
merged 15 commits into from
Feb 20, 2024
Merged

Conversation

timmc-edx
Copy link
Contributor

@timmc-edx timmc-edx commented Dec 2, 2023

Adapted and expanded from
https://openedx.atlassian.net/wiki/spaces/AC/pages/3922952193/Event+Bus+Reliability (with some corrections made to the information on Debezium and CDC).

See the rendered version of this PR as well.

Adapted and expanded from
https://openedx.atlassian.net/wiki/spaces/AC/pages/3922952193/Event+Bus+Reliability
(with some corrections made to the information on Debezium and CDC).
@timmc-edx timmc-edx requested a review from a team as a code owner December 2, 2023 01:42
Decision
********

We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to post-commit send, but openedx-events configuration will be enhanced to allow configuing each event to a production mode: Immediate, on_commit, or outbox.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there three modes or four here? Events default to post-commit, or you can configure immediate, on_commit, outbox. post-commit = on_commit?

from reading below it looks like post-commit is current adhoc behavior which all needs to be changed to on_commit future regulated behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, I should clean that up -- it's just three modes. Post-commit/on-commit are supposed to be the same thing. I wasn't sure whether or not to draw a distinction between the current ad-hoc behaviors and the intended modes via naming differences. Probably I shouldn't do that, and should use consistent names for the two. Maybe I'll go back and use the mode names in the Context section rather than using pre-commit and post-commit there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to consistently use names "immediate" and "on-commit".

************

- The event bus becomes far more reliable, and able to handle events that require at-least-once delivery. The need for manual re-producing of events should become very rare.
- Open edX becomes more complicated to run. Adding a new worker process to every service that produces events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it? Can't I just configure all my events to be post-commit send in a config file?

or maybe I misunderstood and the three producer models are in code, not all the same in code and then at "send" do one of three things

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, it doesn't become more complicated if you opt not to use outbox. (I just... have opinions here.) I can dial that back.


The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull *at least that many* events from the outbox and produce them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events produced by workers and other non-request-based processes.

Web responses that produce events would have higher latency, as they would have to finish an additional DB read, broker call, and DB write before returning the response to the user. Event latency would also increase and become more variable due to the opportunistic approach.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't current web calls that produce events already have the broker latency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might depend on the implementation. The Kafka implementation queues events for delivery to the broker, but that happens asynchronously to the request. Another implementation might do synchronous delivery (including raising an exception in the request thread if the broker communication fails!)


openedx-events will change to support three producer modes for sending events:

- ``immediate``: Whether or not there's a transaction, just send to the event bus immediately. This is the "pre-commit send" described in the Context section and is the current behavior for ``send(...)``.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just drop this one? What I'm getting at is that if all events run through one commit hook, they're easier to think about. Is there ever an important reason to get the event out before the DB write?

Although maybe if all events are sent the same one having this feature is close to free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... yes, I think we could! All of the current events are about a change in application state, so if a transaction rolls back then we should not send the event. I can imagine future events that are more about a request, or an attempt having been made, but we can add that mode if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed immediate as a mode, and described it instead as something that could be added in the future.

- Use "immediate" and "on-commit" names in the Context section to describe
  our current usage -- this will now match the mode names in the Decision
  section (which are just better names, too.)
- Rename the "on_commit" mode to "on-commit" (underscore to hyphen) since
  it's easier to type that way, and to give a little distance from the
  implementation detail of `transaction.on_commit`.
- Fix typo "configuing"
Copy link
Contributor

@robrap robrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job. Thank you for this.

- **Immediate send**: The event is produced to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests.
- **On-commit send**: The event is only sent from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the on-commit-produced events would be lost during that interval. Ordering is also not preserved here.

We currently use an ad-hoc mix of immediate and on-commit send in edx-platform, depending on how particular OpenEdxPublicSignals are emitted. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit send is used. But most signals do not have any such call, and are likely sent immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues.

I'm not sure exactly how to document the following thought, but it may be useful to clarify that OpenEdxPublicSignals always result in in-process events, and sometimes result in cross-service events published to the event bus. Some of the concerns you are documenting relate to both in-process and cross-service events (e.g. sending an event immediately, which may get out of sync with transaction failures), and some are specific to the event bus (e.g. broker availability).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those other in-process receivers would generally be operating within the request-level transaction though, right? Any which do should have the desired reliability properties already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. Any non-trivial in-process event processing would still likely happen async using celery. It just wouldn't require the event bus for communicating across services.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I guess I'm not sure what you're trying to say.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, maybe I'm getting confused. There are two steps:

  1. The call to OpenEdxPublicSignal's send_event, and
  2. The call to send an event to a broker in this signal's receiver.

The very first step might be happening immediately, or on-commit, depending on how it was implemented correct? I'm now unclear whether when you are referring to "immediate send" and "on-commit send", whether you are referring to step 1 above, or step 2 above. Also, I'm not clear from an implementation standpoint (without looking more deeply), what combinations of immediate/on-commit are possible across steps 1 and 2.

Once we are on the same page with this, maybe I can make my earlier questions more clear. If this is also not clear, than we can just discuss in person. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's... definitely confusing. I'll rework it to use distinct language. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. And I think adding something of the nature of this comment also helps clarify:

If the code that sends to a signal does so in transaction.on_commit, the only possible behavior is producing to the broker after transaction completion. If instead the code sends to a signal as soon as possible, then it depends on what mode is configured for that signal -- the event might get published to the broker immediately, or delayed to commit-time.

Although it is somewhat logical/obvious, it also helps differentiate between the two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the overall language change I want to make is this:

  • "Send" is used only for Django signals
  • "Publish" is used for emitting events to the event bus broker

I want to avoid "produce" for the most part, as it leads to awkward phrasing (in my opinion). Produce/consume and publish/subscribe are largely used as synonyms in this space, but publish/subscribe seems to be more accurate. I still want a way to describe IDAs and requests that can create an event and try to publish it, with the distinction that the publishing may not actually occur in all situations (e.g. in an error situation). For those, "event-producing IDA" and "event-producing request" seem sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there's no current plan to actually implement immediate mode, do you still want that language about immediate and on-commit behaviors?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Note that we mainly use produce/consume for our docs (e.g. https://openedx.atlassian.net/wiki/spaces/AC/pages/3508699151/How+to+start+using+the+Event+Bus#Producing). I'm not entirely against you using publish/subscribe, but then it feels like you should address this discrepancy. I would wonder how publish/subscribe differs from produce/consume, used everywhere else in the docs.
  2. I say you update as you see fit, and then I can answer your question as to whether I think anything is still unclear.


We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a choice of production mode (on-commit or outbox).

In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this more vague as we work out the details of clean-up later?

Suggested change
In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.
In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and eventually removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how about something like this? Marking them as sent is required, but can be separated from deletion.

Suggested change
In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.
In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and marking the records as sent. (This or another process eventually removes them from the table.) This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.


We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a choice of production mode (on-commit or outbox).

In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we address separately the benefits of at-least-once resiliency, even if the outbox pattern were used outside the scope of a DB transaction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll add something.


- ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process).

This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Maybe this belongs under consequences?
  2. Either way, I'm having a tough time getting this. It may need more code or links to code or details.
  3. Also, doesn't this affect more than just the event bus event? This is part of my confusion. I thought there was an OpenEdxPublicSignal send which is for the in-process or event bus events, and that sending events to the event bus was just based on config that uses this signal, wherever it was sent. But, I might be totally missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's where we currently delay the sending of a specific event until after the transaction commits: https://github.com/openedx/edx-platform/blob/a74f510f71c19142aeb380facdc2cdfecf16c38a/cms/djangoapps/contentstore/signals/handlers.py#L153 (Note that it currently has an additional race condition, in that the data for the event is gathered after the transaction has committed.)

We'd be able to replace this transaction.on_commit call with something like COURSE_CATALOG_INFO_CHANGED.send_event(*_create_catalog_data_for_signal(course_key)) (computing the event body inside the transaction). The default configuration of COURSE_CATALOG_INFO_CHANGED would be to send it in on-commit mode, largely preserving the existing behavior -- but without that race condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this raises some important caveats that are worth documenting, maybe in "consequences". django-jaiminho stores the args/kwargs and only executes the function call to send the message after (maybe quite a while after) the transaction is committed, so it is important to be very mindful of both the size of what you're potentially sending in and the fact that you're not actually acting in the same transaction as the rest of the code, which may have some unintended consequences.

Things like passing in a lazy Django queryset can cause it to get evaluated, so if you're planning on doing further filtering in the sending function, that should get bubbled up to the caller. There's also a 65k limit to the dill'd stored args, so there's a chance of inadvertently overrunning that with an evaluated queryset and either losing the event due to error or saving an unserializable blob to the database.

Similarly we'd need to be on guard against pulling any new data in the sending function since it could be operating on data hours later than when it was queued, breaking the ordering guarantees.

With dill in the picture folks will also need to think about the usual pickle security problems, and potential serialization issues that they wouldn't have otherwise had to.

Copy link
Member

@mariajgrimaldi mariajgrimaldi Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timmc-edx: can we add to the ADR the example you just mentioned about the replacement?

Also, in these lines you mention:

This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call get_producer().send(...) directly, after appropriate per-event configuration. emit_catalog_info_changed_signal is a known example of this.

But the example is different, the transaction.commit is replaced by the usual send_event method, not the get_producer(). Maybe I'm just missing some context. Also, can we explicitly link which send method will handle the production modes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmtcril Yes, I think what we want to do here is only use the jaiminho decorator on a very simple, very stable function that only accepts pre-serialized data -- maybe JSON strings. I don't think jaiminho's overall approach in that regard is very safe, but it looks like there's a safe way to use it. (We could also just call jaiminho.publish_strategies.publish directly.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mariajgrimaldi Sorry, yes, I was still thinking of the older version of openedx-events where we explicitly call send on the producer. Now that we have configuration-based event publishing, everything is handled by openedx-events' signal handlers.

Specifically, openedx_events.apps.general_signal_handler is currently the receiver for OpenEdxPublicSignals, and it makes a call to get_producer().send(...). general_signal_handler would change to look at the publishing mode for the event and either store the event in the outbox or schedule the get_producer().send(...) to be run on-commit.

If that sounds right to you, I can update the Implementation Plan section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timmc-edx: thank you! That would be ideal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mariajgrimaldi Should be taken care of by the recent commits -- let me know what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timmc-edx: Thanks again. Now it's crystal clear.

- ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process).

This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this.
- ``outbox``: Prep the signal for sending, and save in an outbox table for sending as soon as possible. The outbox table will be managed by `django-jaiminho`_. Deployers using this mode will also need to run a jaiminho management command in a perpetual worker process in order to relay events from the outbox to the broker and mark them as successfully sent. Another management command would be needed to periodically purge old processed events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if django-jaiminho could either move to the consequence section, or be a lighter touch on this ADR. I feel like this ADR is more general, and doesn't need to be tied to a specific implementation, although we do need to choose one, and it is fine to document what we will be trying. That also could be its own ADR, but my point is that if we decide on a different implementation, it really shouldn't supersede this ADR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could break out an implementation plan subsection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been moved to a new subsection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: review this update.

Non-worker event production
===========================

The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull *at least that many* events from the outbox and produce them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events produced by workers and other non-request-based processes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It seems like celery beat could be used as well. Same as it possibly could have been for event consumers.
  2. Is an alternative also to just another type (like immediate), or allow outbox to work like immediate if some part of it is not configured? Logging would just need to be used for produce errors, as we would need to do now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I would group celery-beat under the umbrella of worker processes. (Most of our IDAs don't have a celery worker, right?)
  2. I don't think there's any way for the adding-to-outbox side to know whether the removing-from-outbox side is properly configured.

@bmtcril
Copy link
Contributor

bmtcril commented Dec 15, 2023

Overall I think this is great, most of my concerns are around the specifics of jaiminho implementation details and not the adoption of the pattern.

As we're working toward a place where tracking logs can optionally be sent over the bus I think it would be great to be able to send them on commit, which would fix several long standing issues with the data quality guarantees of tracking logs. However at that volume I wouldn't necessarily want jaiminho's behavior of writing the event to the outbox on failure. I think the current language around immediate mode gives us an out unless/until we can get a workaround upstream for that, so I'm just plugging for making sure that the option stays in.

@timmc-edx
Copy link
Contributor Author

@bmtcril Can you clarify whether you think immediate-mode should be kept as a thing to implement as part of this ADR, or whether you think it's just something that the ADR should leave room for in the future?

@bmtcril
Copy link
Contributor

bmtcril commented Dec 18, 2023

@timmc-edx I just wanted to make sure that we don't rule it out at this point.

"Send" was ambiguous because you can send to a Django signal and you can
send to the event bus, and one can lead to the other. New language:

- "Send" is used only for Django signals
- "Publish" is used for emitting events to the event bus broker

This also changes "producer mode" to "publishing mode".

I'm avoiding "produce" for the most part, as it leads to awkward phrasing
(in my opinion). Produce/consume and publish/subscribe are largely used as
synonyms in this space, but publish/subscribe seems to be more accurate. I
still want a way to describe IDAs and requests that can create an event
and try to publish it, with the distinction that the publishing may not
actually occur in all situations (e.g. in an error situation). For those,
"event-producing IDA" and "event-producing request" seem sufficient.
- I was describing the old get_producer().send() code that is no longer in
  use ever since we switched from explicit producer calls to configuration
  based production. Now the ADR describes OpenEdxPublicSignal.send_event()
  instead.
- I've moved the note about needing to migrate a signal from explicit to
  config-based on-commit publishing from Decision to Consequences.
It would be great if we could unify our language across all of our code
and docs, but for now I can just clarify the language used in the ADR.
@timmc-edx
Copy link
Contributor Author

Any other changes people would like to see? (I think everything has been addressed at this point.)

Copy link
Contributor

@robrap robrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for landing this!

- **Immediate publish**: The event is published to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests.
- **On-commit publish**: The event is published from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the on-commit-published events would be lost during that interval. Ordering is also not preserved here.

We currently use an ad-hoc mix of immediate and on-commit publish in edx-platform, depending on how code sends to particular OpenEdxPublicSignals. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit publish is used. But most signals sends do not have any such call, and are likely published immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if instead of signals sends you want signals' sends, or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changing to "signal sends".

Decision
********

We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event publishing to database transactions. Events will default to on-commit publish, but openedx-events configuration will be enhanced to allow configuring each event to a choice of **publishing mode** (on-commit or outbox).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Wondering if language like "Implementation of the transactional outbox pattern ... will solve ...", or something like this? Basically, starting with "We will implement..." when we have no idea who will do this, nor when, may be a little strong. That said, this is just a non-blocking nit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The template instructions say to use active voice and suggests starting with "we will". I agree that it's a little awkward in a provisional ADR with no immediate plan to implement, but I'd prefer to keep it this way.

@timmc-edx timmc-edx merged commit 6427965 into main Feb 20, 2024
8 checks passed
@timmc-edx timmc-edx deleted the timmc/adr-outbox branch February 20, 2024 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants