From c3a9c06dd7b84501f5c2c9b871ba53d77c0cbf34 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Sat, 2 Dec 2023 01:42:22 +0000 Subject: [PATCH 01/14] docs: ADR for outbox pattern and production modes Adapted and expanded from https://openedx.atlassian.net/wiki/spaces/AC/pages/3922952193/Event+Bus+Reliability (with some corrections made to the information on Debezium and CDC). --- ...15-outbox-pattern-and-production-modes.rst | 95 +++++++++++++++++++ 1 file changed, 95 insertions(+) create mode 100644 docs/decisions/0015-outbox-pattern-and-production-modes.rst diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst new file mode 100644 index 00000000..269f7bc5 --- /dev/null +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -0,0 +1,95 @@ +15. Outbox pattern and production modes +####################################### + +Status +****** + +**Provisional** + +Context +******* + +Some of the event types in the Event Bus might be more sensitive than others to dropped, duplicated, or reordered events. The message broker itself is partially responsible for ensuring that these problems do not occur in transit, but we also need to ensure that the handoff of events to the broker is reliable. + +These are the properties we wish to ensure in the general case: + +- **Atomicity**: Many events are related to data that is written to the database in the same request, but transactions can either commit or abort. This gives us two sub-properties: + + - **Atomic success**: When a transaction successfully commits in the IDA, any produced events relating to that data are durably transmitted to the message broker. This is more important for events intended to keep services synchronized (sending "latest state of entity" events), and may be less important for some kinds of notification events (especially anything used for tracking or statistics). 
+ - **Atomic failure**: When a transaction fails, due to a rollback, network interruption, or application crash, no events related to those database writes are sent to the message broker. Otherwise, these events would be "counterfactuals" that misrepresent the producing service's internal state. This could result in strange behavior such as incorrect notifications to users, and potentially could produce security issues. + +- **Ordering**: If multiple events are produced to the same topic, their ordering is preserved. This raises the question of "ordered according to what metric", as concurrency is in play, so the nature of this property may vary by event. + +This is only in the general case, as some events may not be connected to database transactions, some consumers might tolerate violations of either atomic success or failure, and not all events may have strict notions of ordering. Hoever, in the general case violations of any of these can result in consistency failures between services that might not be corrected over any time scale. + +It's also worth noting a goal we don't have, that of avoiding duplication. At-least-once delivery is acceptible; exactly-once delivery is not required. Double-sends of events are permissible as long as this only happens occasionally (for performance reasons) and does not entail a violation of ordering. + +As of 2023-11-09 we produce events in two different ways relative to transactions: + +- **Pre-commit send**: The event is produced to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests. +- **Post-commit send**: The event is only sent from a ``django.db.transaction.on_commit`` callback. 
This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the post-commit-produced events would be lost during that interval. Ordering is also not preserved here. + +We currently use an ad-hoc mix of pre-commit and post-commit send in edx-platform, depending on how particular OpenEdxPublicSignals are emitted. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure a post-commit send is used. But most signals do not have any such call, and are likely sent pre-commit. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space. + +Decision +******** + +We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to post-commit send, but openedx-events configuration will be enhanced to allow configuing each event to a production mode: Immediate, on_commit, or outbox. + +In the outbox pattern, events are not produced immediately, but are appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. 
Transaction commits also impose a meaningful ordering across all hosts using the same database. + +openedx-events will change to support three producer modes for sending events: + +- ``immediate``: Whether or not there's a transaction, just send to the event bus immediately. This is the "pre-commit send" described in the Context section and is the current behavior for ``send(...)``. +- ``on_commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). + + This requires ensuring that any events that are currently being explicitly sent post-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this. +- ``outbox``: Prep the signal for sending, and save in an outbox table for sending as soon as possible. The outbox table will be managed by `django-jaiminho`_. Deployers using this mode will also need to run a jaiminho management command in a perpetual worker process in order to relay events from the outbox to the broker and mark them as successfully sent. Another management command would be needed to periodically purge old processed events. + + (TBD: Format for the event data in the outbox. No further event-specific DB queries should be required for producing the bytes for the wire format, but it should not be serialized in a way that is specific to Kafka, Redis, etc.) + + (TBD: Safeguards around inadvertently changing the save-to-outbox function's name and module, since those are included in jaiminho's outbox records.) + +openedx-events will add a per event type configuration field specifying the event’s producer mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on_commit``.) 
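As a rough illustration of the configuration described above, the following sketch shows how a per-topic mode could override the global default. It reflects this revision of the ADR, where the modes are ``immediate``, ``on_commit``, and ``outbox``. The ``producer_mode`` key name, the ``resolve_producer_mode`` helper, and the ``org.example`` event type are hypothetical; the ADR only states that a new field will be added inside ``EVENT_BUS_PRODUCER_CONFIG`` and that ``EVENT_BUS_PRODUCER_MODE`` supplies the fallback. The nested event-type/topic layout is assumed from the existing producer config.

```python
# Modes named in this revision of the ADR.
VALID_MODES = {"immediate", "on_commit", "outbox"}

# Stand-ins for Django settings. The setting names come from the ADR;
# the values and the "producer_mode" key are illustrative assumptions.
EVENT_BUS_PRODUCER_MODE = "on_commit"

EVENT_BUS_PRODUCER_CONFIG = {
    "org.openedx.content_authoring.course.catalog_info.changed.v1": {
        "course-catalog-info-changed": {
            "event_key_field": "catalog_info.course_key",
            "enabled": True,
            # No "producer_mode" here, so the global default applies.
        },
    },
    # Hypothetical event type that needs the stronger guarantees.
    "org.example.order.fulfilled.v1": {
        "order-fulfillment": {
            "event_key_field": "order.id",
            "enabled": True,
            "producer_mode": "outbox",
        },
    },
}


def resolve_producer_mode(
    event_type,
    topic,
    config=EVENT_BUS_PRODUCER_CONFIG,
    default=EVENT_BUS_PRODUCER_MODE,
):
    """Return the mode for one event type and topic, falling back to the default."""
    topic_config = config.get(event_type, {}).get(topic, {})
    mode = topic_config.get("producer_mode", default)
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown producer mode {mode!r} for {event_type} / {topic}")
    return mode
```

Validating the mode at lookup time means a misspelled value fails loudly instead of silently falling back to a weaker delivery guarantee.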
+ +``django-jaiminho`` will be added as a dependency of openedx-events and to the ``INSTALLED_APPS`` of relying IDAs. + +TBD: Observability of outbox size and event send errors. + +.. _django-jaiminho: https://github.com/loadsmart/django-jaiminho + + +Consequences +************ + +- The event bus becomes far more reliable, and able to handle events that require at-least-once delivery. The need for manual re-producing of events should become very rare. +- Open edX becomes more complicated to run. Adding a new worker process to every service that produces events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.) +- Duplication becomes possible, so we would need a way to avoid sending the same event over and over again to the broker if the broker is failing to send acknowledgements. We may need to revisit existing events and improve documentation around ensuring that consumers can tolerate duplication, either by ensuring that events are idempotent or by keeping track of which event IDs have already been processed. +- The database will be required to store an unbounded number of events during a broker outage, worker outage, or event bus misconfiguration. + +Rejected and Unplanned Alternatives +*********************************** + +Change Data Capture +=================== + +Change data capture (CDC) is a method of directly streaming database changes from one place to another by following the DB's transaction log. This provides the same transactionality benefits as the outbox method. `Debezium `_ is an example of such a system and can read directly from the database and produce to Kafka, where the data can then be transformed and routed to other systems. While a CDC platform could send data to the Open edX event bus, it would also be redundant with the event bus. In the example of Debezium, a deployment would still need a Kafka cluster even if they wanted to put event data into Redis. 
+ +CDC systems also source their data at a lower level than we're targeting with the event bus; Django usually insulates us from schema details via an ORM layer, but CDC involves reading table data directly. We'd have tight coupling with our DB schemas. And the eventing system we've chosen to build operates at a higher abstraction layer than database writes, creating another conceptual mismatch. Theoretically, a CDC system could also be responsible for reading events from an outbox, allowing high-level eventing, but this is unlikely to be more palatable than just running a management command in a loop. + +Non-worker event production +=========================== + +The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull *at least that many* events from the outbox and produce them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events produced by workers and other non-request-based processes. + +Web responses that produce events would have higher latency, as they would have to finish an additional DB read, broker call, and DB write before returning the response to the user. Event latency would also increase and become more variable due to the opportunistic approach. + +It's also conceivable that each Django server in the IDA could start a background process to act as an outbox-emptying worker. 
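The drain step shared by the long-running worker and by both alternatives above (read events from the outbox in order, produce them, then remove them) can be sketched with the standard library alone. This is not how django-jaiminho implements it; the table layout and the ``init_db``, ``place_order``, and ``flush_outbox`` names are hypothetical, and ``send`` stands in for a broker call that raises when no acknowledgment arrives.

```python
import json
import sqlite3


def init_db(conn):
    """Create a toy business table and an outbox table (hypothetical schema)."""
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.commit()


def place_order(conn, order_id, fail=False):
    """Write the business row and the intent to send an event in one transaction."""
    with conn:  # commits on success, rolls back if the body raises
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, "placed"))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id})),
        )
        if fail:
            raise RuntimeError("simulated failure before commit")


def flush_outbox(conn, send, limit):
    """Relay up to `limit` events in insertion order.

    A row is deleted only after `send` returns. If `send` raises, the row and
    everything after it stay in the outbox for a later attempt.
    """
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        send(topic, json.loads(payload))
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

A rolled-back ``place_order`` leaves no outbox row, which is the atomic-failure property. A crash between ``send`` and the ``DELETE`` would cause the event to be sent again on the next flush, which is why consumers need to tolerate duplicates.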
+ +We're not planning on implementating either of these, but they should be drop-in replacements for the long-running management command, and could be developed in the future by deployers who need such an arrangement. + +References +********** + +- Microservices.io on the transactional outbox pattern: https://microservices.io/patterns/data/transactional-outbox.html +- An introduction to jaiminho: https://engineering.loadsmart.com/blog/introducing-jaiminho From 317b2dc2e33ba74069f54920db355e40fddc38d9 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Sat, 2 Dec 2023 01:49:15 +0000 Subject: [PATCH 02/14] fixup! Trailing whitespace lint --- docs/decisions/0015-outbox-pattern-and-production-modes.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index 269f7bc5..1d9deee7 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -57,7 +57,6 @@ openedx-events will add a per event type configuration field specifying the even TBD: Observability of outbox size and event send errors. .. _django-jaiminho: https://github.com/loadsmart/django-jaiminho - Consequences ************ From a4776d0f4d117e9a75bb927b8d8ad09181dbd17b Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Sat, 2 Dec 2023 01:51:49 +0000 Subject: [PATCH 03/14] fixup! 
Include ADR in index --- docs/decisions/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/decisions/index.rst b/docs/decisions/index.rst index 627c4c3d..1a1f053a 100644 --- a/docs/decisions/index.rst +++ b/docs/decisions/index.rst @@ -19,3 +19,4 @@ Architectural Decision Records (ADRs) 0012-producing-to-event-bus-via-settings 0013-special-exam-submission-and-review-events 0014-new-event-bus-producer-config + 0015-outbox-pattern-and-production-modes From 7dda7af140ea2b66ef571e51c6e01b87df6a58e0 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Mon, 4 Dec 2023 21:52:44 +0000 Subject: [PATCH 04/14] fixup! Use consistent names; rename on_commit; fix typo - Use "immediate" and "on-commit" names in the Context section to describe our current usage -- this will now match the mode names in the Decision section (which are just better names, too.) - Rename the "on_commit" mode to "on-commit" (underscore to hyphen) since it's easier to type that way, and to give a little distance from the implementation detail of `transaction.on_commit`. - Fix typo "configuing" --- .../0015-outbox-pattern-and-production-modes.rst | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index 1d9deee7..d0aa744a 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -26,31 +26,31 @@ It's also worth noting a goal we don't have, that of avoiding duplication. At-le As of 2023-11-09 we produce events in two different ways relative to transactions: -- **Pre-commit send**: The event is produced to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). 
This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests. -- **Post-commit send**: The event is only sent from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the post-commit-produced events would be lost during that interval. Ordering is also not preserved here. +- **Immediate send**: The event is produced to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests. +- **On-commit send**: The event is only sent from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the on-commit-produced events would be lost during that interval. Ordering is also not preserved here. -We currently use an ad-hoc mix of pre-commit and post-commit send in edx-platform, depending on how particular OpenEdxPublicSignals are emitted. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure a post-commit send is used. But most signals do not have any such call, and are likely sent pre-commit. 
This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space. +We currently use an ad-hoc mix of immediate and on-commit send in edx-platform, depending on how particular OpenEdxPublicSignals are emitted. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit send is used. But most signals do not have any such call, and are likely sent immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space. Decision ******** -We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to post-commit send, but openedx-events configuration will be enhanced to allow configuing each event to a production mode: Immediate, on_commit, or outbox. +We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a production mode: Immediate, on-commit, or outbox. In the outbox pattern, events are not produced immediately, but are appended to an "outbox" database table within the transaction. 
A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database. openedx-events will change to support three producer modes for sending events: -- ``immediate``: Whether or not there's a transaction, just send to the event bus immediately. This is the "pre-commit send" described in the Context section and is the current behavior for ``send(...)``. -- ``on_commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). +- ``immediate``: Whether or not there's a transaction, just send to the event bus immediately. This is the current behavior for ``send(...)``. +- ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). - This requires ensuring that any events that are currently being explicitly sent post-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this. + This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this. - ``outbox``: Prep the signal for sending, and save in an outbox table for sending as soon as possible. The outbox table will be managed by `django-jaiminho`_. 
Deployers using this mode will also need to run a jaiminho management command in a perpetual worker process in order to relay events from the outbox to the broker and mark them as successfully sent. Another management command would be needed to periodically purge old processed events. (TBD: Format for the event data in the outbox. No further event-specific DB queries should be required for producing the bytes for the wire format, but it should not be serialized in a way that is specific to Kafka, Redis, etc.) (TBD: Safeguards around inadvertently changing the save-to-outbox function's name and module, since those are included in jaiminho's outbox records.) -openedx-events will add a per event type configuration field specifying the event’s producer mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on_commit``.) +openedx-events will add a per event type configuration field specifying the event’s producer mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on-commit``.) ``django-jaiminho`` will be added as a dependency of openedx-events and to the ``INSTALLED_APPS`` of relying IDAs. From a7434e18378f8c7a136ae0a4b98934bebfbf7b45 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Tue, 5 Dec 2023 01:18:45 +0000 Subject: [PATCH 05/14] fixup! 
Change language around operational complexity --- docs/decisions/0015-outbox-pattern-and-production-modes.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index d0aa744a..0e41e28a 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -62,7 +62,7 @@ Consequences ************ - The event bus becomes far more reliable, and able to handle events that require at-least-once delivery. The need for manual re-producing of events should become very rare. -- Open edX becomes more complicated to run. Adding a new worker process to every service that produces events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.) +- The new outbox functionality, if used, comes with operational complexity. Adding a new worker process to every service that produces events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.) - Duplication becomes possible, so we would need a way to avoid sending the same event over and over again to the broker if the broker is failing to send acknowledgements. We may need to revisit existing events and improve documentation around ensuring that consumers can tolerate duplication, either by ensuring that events are idempotent or by keeping track of which event IDs have already been processed. - The database will be required to store an unbounded number of events during a broker outage, worker outage, or event bus misconfiguration. From a49d859f1374490fbc0c93582b354d0af45984fe Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Tue, 5 Dec 2023 01:32:11 +0000 Subject: [PATCH 06/14] fixup! 
Remove immediate mode --- .../0015-outbox-pattern-and-production-modes.rst | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index 0e41e28a..8c6cb5cb 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -34,13 +34,12 @@ We currently use an ad-hoc mix of immediate and on-commit send in edx-platform, Decision ******** -We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a production mode: Immediate, on-commit, or outbox. +We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a choice of production mode (on-commit or outbox). -In the outbox pattern, events are not produced immediately, but are appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database. +In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. 
A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database. -openedx-events will change to support three producer modes for sending events: +openedx-events will change to support two producer modes for sending events when ``send(...)`` is called: -- ``immediate``: Whether or not there's a transaction, just send to the event bus immediately. This is the current behavior for ``send(...)``. - ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this. @@ -52,6 +51,8 @@ openedx-events will change to support three producer modes for sending events: openedx-events will add a per event type configuration field specifying the event’s producer mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on-commit``.) +This will remove the ability to send an event immediately, as none of the currently implemented events would benefit from it. If in the future there is an event type that requires it, perhaps because it represents a request or attempt or even a failure, an ``immediate`` mode can be added. 
+ ``django-jaiminho`` will be added as a dependency of openedx-events and to the ``INSTALLED_APPS`` of relying IDAs. TBD: Observability of outbox size and event send errors. From 55d49573f194d0cfb55b1158f5cb92931fd028b2 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Mon, 11 Dec 2023 19:40:04 +0000 Subject: [PATCH 07/14] fixup! Fix various typos --- .../0015-outbox-pattern-and-production-modes.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index 8c6cb5cb..f0593bf4 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -20,9 +20,9 @@ These are the properties we wish to ensure in the general case: - **Ordering**: If multiple events are produced to the same topic, their ordering is preserved. This raises the question of "ordered according to what metric", as concurrency is in play, so the nature of this property may vary by event. -This is only in the general case, as some events may not be connected to database transactions, some consumers might tolerate violations of either atomic success or failure, and not all events may have strict notions of ordering. Hoever, in the general case violations of any of these can result in consistency failures between services that might not be corrected over any time scale. +This is only in the general case, as some events may not be connected to database transactions, some consumers might tolerate violations of either atomic success or failure, and not all events may have strict notions of ordering. However, in the general case violations of any of these can result in consistency failures between services that might not be corrected over any time scale. -It's also worth noting a goal we don't have, that of avoiding duplication. 
At-least-once delivery is acceptible; exactly-once delivery is not required. Double-sends of events are permissible as long as this only happens occasionally (for performance reasons) and does not entail a violation of ordering. +It's also worth noting a goal we don't have, that of avoiding duplication. At-least-once delivery is acceptable; exactly-once delivery is not required. Double-sends of events are permissible as long as this only happens occasionally (for performance reasons) and does not entail a violation of ordering. As of 2023-11-09 we produce events in two different ways relative to transactions: @@ -64,7 +64,7 @@ Consequences - The event bus becomes far more reliable, and able to handle events that require at-least-once delivery. The need for manual re-producing of events should become very rare. - The new outbox functionality, if used, comes with operational complexity. Adding a new worker process to every service that produces events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.) -- Duplication becomes possible, so we would need a way to avoid sending the same event over and over again to the broker if the broker is failing to send acknowledgements. We may need to revisit existing events and improve documentation around ensuring that consumers can tolerate duplication, either by ensuring that events are idempotent or by keeping track of which event IDs have already been processed. +- Duplication becomes possible, so we would need a way to avoid sending the same event over and over again to the broker if the broker is failing to send acknowledgments. We may need to revisit existing events and improve documentation around ensuring that consumers can tolerate duplication, either by ensuring that events are idempotent or by keeping track of which event IDs have already been processed. 
- The database will be required to store an unbounded number of events during a broker outage, worker outage, or event bus misconfiguration. Rejected and Unplanned Alternatives @@ -86,7 +86,7 @@ Web responses that produce events would have higher latency, as they would have It's also conceivable that each Django server in the IDA could start a background process to act as an outbox-emptying worker. -We're not planning on implementating either of these, but they should be drop-in replacements for the long-running management command, and could be developed in the future by deployers who need such an arrangement. +We're not planning on implementing either of these, but they should be drop-in replacements for the long-running management command, and could be developed in the future by deployers who need such an arrangement. References ********** From b5c816f2710a9fee5c2b68cfd8b2041397a23e20 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Mon, 11 Dec 2023 21:14:59 +0000 Subject: [PATCH 08/14] fixup! Add note about at-least-once benefits --- docs/decisions/0015-outbox-pattern-and-production-modes.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index f0593bf4..c67093c7 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -36,7 +36,7 @@ Decision We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a choice of production mode (on-commit or outbox). -In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. 
A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database. +In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database. Even events that are not otherwise sent in a transactional context will benefit from the at-least-once delivery semantics. openedx-events will change to support two producer modes for sending events when ``send(...)`` is called: From f93246209769154b665a9689081fc4d41315578c Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Tue, 12 Dec 2023 23:08:32 +0000 Subject: [PATCH 09/14] fixup! 
Move jaiminho details to new Implementation Plan section --- ...15-outbox-pattern-and-production-modes.rst | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index c67093c7..b7d6627f 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -43,19 +43,24 @@ openedx-events will change to support two producer modes for sending events when - ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this. -- ``outbox``: Prep the signal for sending, and save in an outbox table for sending as soon as possible. The outbox table will be managed by `django-jaiminho`_. Deployers using this mode will also need to run a jaiminho management command in a perpetual worker process in order to relay events from the outbox to the broker and mark them as successfully sent. Another management command would be needed to periodically purge old processed events. - - (TBD: Format for the event data in the outbox. No further event-specific DB queries should be required for producing the bytes for the wire format, but it should not be serialized in a way that is specific to Kafka, Redis, etc.) - - (TBD: Safeguards around inadvertently changing the save-to-outbox function's name and module, since those are included in jaiminho's outbox records.) +- ``outbox``: Prep the signal for sending, and save in an outbox table for sending as soon as possible. 
A worker process will then relay events from the outbox to the broker and mark them as successfully sent. Another management command will be needed to periodically purge old processed events. openedx-events will add a per event type configuration field specifying the event’s producer mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on-commit``.) This will remove the ability to send an event immediately, as none of the currently implemented events would benefit from it. If in the future there is an event type that requires it, perhaps because it represents a request or attempt or even a failure, an ``immediate`` mode can be added. -``django-jaiminho`` will be added as a dependency of openedx-events and to the ``INSTALLED_APPS`` of relying IDAs. +Implementation Plan +=================== + +(Details in this section are subject to change.) + +The most promising option for implementing the transactional outbox is `django-jaiminho`_, a Django app that manages adding to and emptying an outbox table. ``django-jaiminho`` would be added as a dependency of openedx-events and to the ``INSTALLED_APPS`` of event-producing IDAs, and several long-running workers running jaiminho management commands would be required for each event-producing IDA. + +Unknowns and future decisions: -TBD: Observability of outbox size and event send errors. +- Format for the event data in the outbox. No further event-specific DB queries should be required for producing the bytes for the wire format, but it should not be serialized in a way that is specific to Kafka, Redis, etc. +- Safeguards around inadvertently changing the save-to-outbox function's name and module, since those are included in jaiminho's outbox records. +- Observability of outbox size and event send errors. .. 
_django-jaiminho: https://github.com/loadsmart/django-jaiminho From 71dc150bbb4c9643efee75adf5caab56100fb555 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Fri, 19 Jan 2024 21:26:43 +0000 Subject: [PATCH 10/14] fixup! Clarify language around send/publish/produce "Send" was ambiguous because you can send to a Django signal and you can send to the event bus, and one can lead to the other. New language: - "Send" is used only for Django signals - "Publish" is used for emitting events to the event bus broker This also changes "producer mode" to "publishing mode". I'm avoiding "produce" for the most part, as it leads to awkward phrasing (in my opinion). Produce/consume and publish/subscribe are largely used as synonyms in this space, but publish/subscribe seems to be more accurate. I still want a way to describe IDAs and requests that can create an event and try to publish it, with the distinction that the publishing may not actually occur in all situations (e.g. in an error situation). For those, "event-producing IDA" and "event-producing request" seem sufficient. --- ...15-outbox-pattern-and-production-modes.rst | 50 +++++++++---------- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index b7d6627f..042c3099 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -15,39 +15,39 @@ These are the properties we wish to ensure in the general case: - **Atomicity**: Many events are related to data that is written to the database in the same request, but transactions can either commit or abort. This gives us two sub-properties: - - **Atomic success**: When a transaction successfully commits in the IDA, any produced events relating to that data are durably transmitted to the message broker. 
This is more important for events intended to keep services synchronized (sending "latest state of entity" events), and may be less important for some kinds of notification events (especially anything used for tracking or statistics). - - **Atomic failure**: When a transaction fails, due to a rollback, network interruption, or application crash, no events related to those database writes are sent to the message broker. Otherwise, these events would be "counterfactuals" that misrepresent the producing service's internal state. This could result in strange behavior such as incorrect notifications to users, and potentially could produce security issues. + - **Atomic success**: When a transaction successfully commits in the IDA, any created events relating to that data are durably published to the message broker. This is more important for events intended to keep services synchronized (publishing "latest state of entity" events), and may be less important for some kinds of notification events (especially anything used for tracking or statistics). + - **Atomic failure**: When a transaction fails, due to a rollback, network interruption, or application crash, no events related to those database writes are published to the message broker. Otherwise, these events would be "counterfactuals" that misrepresent the publishing service's internal state. This could result in strange behavior such as incorrect notifications to users, and potentially could produce security issues. -- **Ordering**: If multiple events are produced to the same topic, their ordering is preserved. This raises the question of "ordered according to what metric", as concurrency is in play, so the nature of this property may vary by event. +- **Ordering**: If multiple events are published to the same topic, their ordering is preserved. This raises the question of "ordered according to what metric", as concurrency is in play, so the nature of this property may vary by event. 
This is only in the general case, as some events may not be connected to database transactions, some consumers might tolerate violations of either atomic success or failure, and not all events may have strict notions of ordering. However, in the general case violations of any of these can result in consistency failures between services that might not be corrected over any time scale. -It's also worth noting a goal we don't have, that of avoiding duplication. At-least-once delivery is acceptable; exactly-once delivery is not required. Double-sends of events are permissible as long as this only happens occasionally (for performance reasons) and does not entail a violation of ordering. +It's also worth noting a goal we don't have, that of avoiding duplication. At-least-once delivery is acceptable; exactly-once delivery is not required. Double-publishing of events is permissible as long as this only happens occasionally (for performance reasons) and does not entail a violation of ordering. -As of 2023-11-09 we produce events in two different ways relative to transactions: +As of 2023-11-09 we publish events in two different ways relative to transactions: -- **Immediate send**: The event is produced to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests. -- **On-commit send**: The event is only sent from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the on-commit-produced events would be lost during that interval. 
Ordering is also not preserved here. +- **Immediate publish**: The event is published to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests. +- **On-commit publish**: The event is published from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the on-commit-published events would be lost during that interval. Ordering is also not preserved here. -We currently use an ad-hoc mix of immediate and on-commit send in edx-platform, depending on how particular OpenEdxPublicSignals are emitted. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit send is used. But most signals do not have any such call, and are likely sent immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space. +We currently use an ad-hoc mix of immediate and on-commit publish in edx-platform, depending on how code sends to particular OpenEdxPublicSignals. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit publish is used. 
But most signal sends do not have any such call, and are likely published immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space. Decision ******** -We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event production to database transactions. Events will default to on-commit send, but openedx-events configuration will be enhanced to allow configuring each event to a choice of production mode (on-commit or outbox). +We will implement the transactional outbox pattern (or just "outbox pattern") in order to allow binding event publishing to database transactions. Events will default to on-commit publish, but openedx-events configuration will be enhanced to allow configuring each event to a choice of **publishing mode** (on-commit or outbox). -In the outbox pattern, events are not produced as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, producing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to send an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database. Even events that are not otherwise sent in a transactional context will benefit from the at-least-once delivery semantics. 
+In the outbox pattern, events are not published as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, publishing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to publish an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database. Even events that are not otherwise published in a transactional context will benefit from the at-least-once delivery semantics. -openedx-events will change to support two producer modes for sending events when ``send(...)`` is called: +openedx-events will change to support two modes for publishing events when ``send(...)`` is called: -- ``on-commit``: Delay sending to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). +- ``on-commit``: Delay publishing to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). - This requires ensuring that any events that are currently being explicitly sent on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this. -- ``outbox``: Prep the signal for sending, and save in an outbox table for sending as soon as possible. A worker process will then relay events from the outbox to the broker and mark them as successfully sent. Another management command will be needed to periodically purge old processed events. 
+ This requires ensuring that any events that are currently being explicitly published on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this. +- ``outbox``: Prep the signal for publishing, and save in an outbox table for publishing as soon as possible. A worker process will then relay events from the outbox to the broker and mark them as successfully published. Another management command will be needed to periodically purge old processed events. -openedx-events will add a per event type configuration field specifying the event’s producer mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on-commit``.) +openedx-events will add a per event type configuration field specifying the event’s publishing mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on-commit``.) -This will remove the ability to send an event immediately, as none of the currently implemented events would benefit from it. If in the future there is an event type that requires it, perhaps because it represents a request or attempt or even a failure, an ``immediate`` mode can be added. +This will remove the ability to publish an event immediately, as none of the currently implemented events would benefit from it. If in the future there is an event type that requires it, perhaps because it represents a request or attempt or even a failure, an ``immediate`` mode can be added. 
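The per-topic mode lookup described above could look roughly like the following sketch. The ``publishing_mode`` key name, the ``resolve_publishing_mode`` helper, and the second topic are hypothetical and chosen only for illustration; the ADR has not settled on a field name, and the event type and topic shown are examples rather than a required configuration.

```python
# Sketch only: "publishing_mode" is a hypothetical key name, not a settled API.
VALID_MODES = {"on-commit", "outbox"}

# Proposed fallback setting named in this ADR.
EVENT_BUS_PRODUCER_MODE = "on-commit"

# Assumed shape: event type -> topic -> per-topic options.
EVENT_BUS_PRODUCER_CONFIG = {
    "org.openedx.content_authoring.course.catalog_info.changed.v1": {
        "course-catalog-info-changed": {
            "event_key_field": "catalog_info.course_key",
            "enabled": True,
            "publishing_mode": "outbox",  # hypothetical new field
        },
        "catalog-audit": {  # illustrative second topic with no explicit mode
            "event_key_field": "catalog_info.course_key",
            "enabled": True,
        },
    },
}


def resolve_publishing_mode(event_type, topic,
                            config=EVENT_BUS_PRODUCER_CONFIG,
                            default_mode=EVENT_BUS_PRODUCER_MODE):
    """Return the mode for one event type and topic, falling back to the default."""
    topic_options = config.get(event_type, {}).get(topic, {})
    mode = topic_options.get("publishing_mode", default_mode)
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown publishing mode: {mode!r}")
    return mode
```

Rejecting unknown values early is one possible design choice here: a misspelled mode would otherwise fall through silently and could downgrade an event's delivery guarantees without anyone noticing.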
Implementation Plan =================== @@ -58,18 +58,18 @@ The most promising option for implementing the transactional outbox is `django-j Unknowns and future decisions: -- Format for the event data in the outbox. No further event-specific DB queries should be required for producing the bytes for the wire format, but it should not be serialized in a way that is specific to Kafka, Redis, etc. +- Format for the event data in the outbox. No further event-specific DB queries should be required for creating the bytestring for the wire format, but it should not be serialized in a way that is specific to Kafka, Redis, etc. - Safeguards around inadvertently changing the save-to-outbox function's name and module, since those are included in jaiminho's outbox records. -- Observability of outbox size and event send errors. +- Observability of outbox size and event publish errors. .. _django-jaiminho: https://github.com/loadsmart/django-jaiminho Consequences ************ -- The event bus becomes far more reliable, and able to handle events that require at-least-once delivery. The need for manual re-producing of events should become very rare. -- The new outbox functionality, if used, comes with operational complexity. Adding a new worker process to every service that produces events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.) -- Duplication becomes possible, so we would need a way to avoid sending the same event over and over again to the broker if the broker is failing to send acknowledgments. We may need to revisit existing events and improve documentation around ensuring that consumers can tolerate duplication, either by ensuring that events are idempotent or by keeping track of which event IDs have already been processed. +- The event bus becomes far more reliable, and able to handle events that require at-least-once delivery. The need for manual re-publishing of events should become very rare. 
+- The new outbox functionality, if used, comes with operational complexity. Adding a new worker process to every service that publishes events will further increase the orchestration needs of Open edX. (See alternatives section for a possible workaround.) +- Duplication becomes possible, so we would need a way to avoid publishing the same event over and over again to the broker if the broker is failing to return acknowledgments. We may need to revisit existing events and improve documentation around ensuring that consumers can tolerate duplication, either by ensuring that events are idempotent or by keeping track of which event IDs have already been processed. - The database will be required to store an unbounded number of events during a broker outage, worker outage, or event bus misconfiguration. Rejected and Unplanned Alternatives @@ -78,16 +78,16 @@ Rejected and Unplanned Alternatives Change Data Capture =================== -Change data capture (CDC) is a method of directly streaming database changes from one place to another by following the DB's transaction log. This provides the same transactionality benefits as the outbox method. `Debezium `_ is an example of such a system and can read directly from the database and produce to Kafka, where the data can then be transformed and routed to other systems. While a CDC platform could send data to the Open edX event bus, it would also be redundant with the event bus. In the example of Debezium, a deployment would still need a Kafka cluster even if they wanted to put event data into Redis. +Change data capture (CDC) is a method of directly streaming database changes from one place to another by following the DB's transaction log. This provides the same transactionality benefits as the outbox method. `Debezium `_ is an example of such a system and can read directly from the database and publish to Kafka, where the data can then be transformed and routed to other systems. 
While a CDC platform could publish data to the Open edX event bus, it would also be redundant with the event bus. In the example of Debezium, a deployment would still need a Kafka cluster even if they wanted to put event data into Redis. CDC systems also source their data at a lower level than we're targeting with the event bus; Django usually insulates us from schema details via an ORM layer, but CDC involves reading table data directly. We'd have tight coupling with our DB schemas. And the eventing system we've chosen to build operates at a higher abstraction layer than database writes, creating another conceptual mismatch. Theoretically, a CDC system could also be responsible for reading events from an outbox, allowing high-level eventing, but this is unlikely to be more palatable than just running a management command in a loop. -Non-worker event production +Non-worker event publishing =========================== -The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull *at least that many* events from the outbox and produce them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events produced by workers and other non-request-based processes. +The outbox pattern usually involves running a worker process that handles moving data from the outbox to the broker. However, it may be possible for deployers to avoid this with the use of some alternative middleware. 
For example, a custom middleware could flush events to the broker at the end of each event-producing request. The middleware's ``post_response`` would run outside of the request's main transaction. It would check if the request had created events, and if so, it would pull *at least that many* events from the outbox and publish them to the broker, then remove them from the outbox. If the server crashed before this could complete, later requests would eventually complete the work. This would also cover events published by workers and other non-request-based processes. -Web responses that produce events would have higher latency, as they would have to finish an additional DB read, broker call, and DB write before returning the response to the user. Event latency would also increase and become more variable due to the opportunistic approach. +Web requests that result in events being published would have higher response latency, as they would have to finish an additional DB read, broker call, and DB write before returning the response to the user. Event latency would also increase and become more variable due to the opportunistic approach. It's also conceivable that each Django server in the IDA could start a background process to act as an outbox-emptying worker. From 09ff1597b5e5cad147c4d865d06f51e473458f4e Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Fri, 19 Jan 2024 21:38:45 +0000 Subject: [PATCH 11/14] fixup! Correct sending example; move migration note to consequences - I was describing the old get_producer().send() code that is no longer in use ever since we switched from explicit producer calls to configuration based production. Now the ADR describes OpenEdxPublicSignal.send_event() instead. - I've moved the note about needing to migrate a signal from explicit to config-based on-commit publishing from Decision to Consequences. 
--- docs/decisions/0015-outbox-pattern-and-production-modes.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index 042c3099..34ae48d4 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -38,11 +38,9 @@ We will implement the transactional outbox pattern (or just "outbox pattern") in In the outbox pattern, events are not published as part of the request/response cycle, but are instead appended to an "outbox" database table within the transaction. A worker process operating in a separate transaction works through the list in order, publishing them to the message broker and removing them once the broker has acknowledged them. This is the standard solution to the dual-write problem and is likely the only way to meet all of the criteria. Atomicity is ensured by bringing the *intent* to publish an event into the transaction's ACID guarantees. Transaction commits also impose a meaningful ordering across all hosts using the same database. Even events that are not otherwise published in a transactional context will benefit from the at-least-once delivery semantics. -openedx-events will change to support two modes for publishing events when ``send(...)`` is called: +openedx-events will change to support two modes for publishing events when an OpenEdxPublicSignal's ``send_event(...)`` is called: - ``on-commit``: Delay publishing to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). - - This requires ensuring that any events that are currently being explicitly published on-commit are changed to call ``get_producer().send(...)`` directly, after appropriate per-event configuration. ``emit_catalog_info_changed_signal`` is a known example of this. 
- ``outbox``: Prep the signal for publishing, and save in an outbox table for publishing as soon as possible. A worker process will then relay events from the outbox to the broker and mark them as successfully published. Another management command will be needed to periodically purge old processed events. openedx-events will add a per event type configuration field specifying the event’s publishing mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on-commit``.) @@ -72,6 +70,8 @@ Consequences - Duplication becomes possible, so we would need a way to avoid publishing the same event over and over again to the broker if the broker is failing to return acknowledgments. We may need to revisit existing events and improve documentation around ensuring that consumers can tolerate duplication, either by ensuring that events are idempotent or by keeping track of which event IDs have already been processed. - The database will be required to store an unbounded number of events during a broker outage, worker outage, or event bus misconfiguration. +Some events are currently published on-commit because the signal ``send_event()`` call is made in a ``transaction.on_commit()`` callback. ``emit_catalog_info_changed_signal`` is a known example of this. These would need to be migrated to use the new on-commit publishing mode and to lift the signal send out of the on_commit callback, calling send_event directly instead. + Rejected and Unplanned Alternatives *********************************** From 289e3f8f11b973e618c54eec7a126abe200c0391 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Tue, 23 Jan 2024 00:20:55 +0000 Subject: [PATCH 12/14] fixup! 
Be explicit about the properties of the modes --- .../decisions/0015-outbox-pattern-and-production-modes.rst | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index 34ae48d4..24e33209 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -41,8 +41,15 @@ In the outbox pattern, events are not published as part of the request/response openedx-events will change to support two modes for publishing events when an OpenEdxPublicSignal's ``send_event(...)`` is called: - ``on-commit``: Delay publishing to the event bus until after the current transaction commits, or immediately if there is no open transaction (as might occur in a worker process). + + - Atomicity is preserved in the success case, but not in the failure case. (Events published in this mode may occasionally be lost, but should never be sent when a transaction fails.) + - This does not necessarily preserve ordering of events across multiple hosts. + - ``outbox``: Prep the signal for publishing, and save in an outbox table for publishing as soon as possible. A worker process will then relay events from the outbox to the broker and mark them as successfully published. Another management command will be needed to periodically purge old processed events. + - Atomicity is fully preserved. + - As long as only a single worker per topic is emptying the outbox, ordering of events can be fully maintained. + openedx-events will add a per event type configuration field specifying the event’s publishing mode in the form of a new key-topic field inside ``EVENT_BUS_PRODUCER_CONFIG``. It will also add a new Django setting ``EVENT_BUS_PRODUCER_MODE`` that names a mode to use when not otherwise specified (defaulting to ``on-commit``.) 
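A minimal sketch of how the per-topic mode and the global fallback described above might be resolved. Everything named here is an assumption for illustration: the ADR does not name the new field (``mode`` is a placeholder), ``resolve_producer_mode`` is a hypothetical helper, and the event types, topics, and key fields are invented. The nesting of event type, then topic, then options is assumed to follow the existing ``EVENT_BUS_PRODUCER_CONFIG`` layout.

```python
# Hypothetical sketch only: the "mode" key and resolve_producer_mode() are
# illustrative assumptions, not part of the openedx-events API.

VALID_MODES = {"on-commit", "outbox"}

# Global fallback named in the ADR; the ADR states it defaults to on-commit.
EVENT_BUS_PRODUCER_MODE = "on-commit"

# Assumed layout: event type -> topic -> per-topic options.
# Event types, topics, and key fields below are invented examples.
EVENT_BUS_PRODUCER_CONFIG = {
    "org.example.course.updated.v1": {
        "course-updated": {
            "event_key_field": "course.course_key",
            "enabled": True,
            # No "mode" key here, so the global fallback applies.
        },
    },
    "org.example.order.fulfilled.v1": {
        "order-fulfilled": {
            "event_key_field": "order.order_id",
            "enabled": True,
            "mode": "outbox",  # hypothetical key name for the publishing mode
        },
    },
}


def resolve_producer_mode(
    event_type,
    topic,
    config=EVENT_BUS_PRODUCER_CONFIG,
    default_mode=EVENT_BUS_PRODUCER_MODE,
):
    """Return the publishing mode for one event type and topic pair.

    Uses the per-topic "mode" value when present, otherwise the global
    default. Raises ValueError for a mode that is not recognized.
    """
    topic_config = config.get(event_type, {}).get(topic, {})
    mode = topic_config.get("mode", default_mode)
    if mode not in VALID_MODES:
        raise ValueError(
            f"Unknown producer mode {mode!r} for {event_type} on topic {topic}"
        )
    return mode
```

With this layout, ``resolve_producer_mode("org.example.order.fulfilled.v1", "order-fulfilled")`` would select the outbox mode, while any event type or topic without an explicit entry would fall back to ``on-commit``. Rejecting unrecognized values early is one possible design choice; it would surface a misconfigured mode at lookup time rather than silently publishing in the wrong mode.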
This will remove the ability to publish an event immediately, as none of the currently implemented events would benefit from it. If in the future there is an event type that requires it, perhaps because it represents a request or attempt or even a failure, an ``immediate`` mode can be added. From 40536291a0d0104224d7c7bb640aca3e8fd60b92 Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Tue, 23 Jan 2024 15:22:21 +0000 Subject: [PATCH 13/14] fixup! Add a note on language It would be great if we could unify our language across all of our code and docs, but for now I can just clarify the language used in the ADR. --- docs/decisions/0015-outbox-pattern-and-production-modes.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index 24e33209..09bd4bcd 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -9,6 +9,9 @@ Status Context ******* +.. note:: + Clarification on language: We use "publish", "produce", and "send" somewhat interchangeably in various places in our code and documentation to refer to the transmission of an event from an IDA to the Event Bus message broker. The term "send" is also used in reference to the Django signal system. In this ADR, "send" will refer only to sending to an OpenEdxPublicSignal Django signal, and "publish" will be used for transmitting an event to the message broker. + Some of the event types in the Event Bus might be more sensitive than others to dropped, duplicated, or reordered events. The message broker itself is partially responsible for ensuring that these problems do not occur in transit, but we also need to ensure that the handoff of events to the broker is reliable. 
These are the properties we wish to ensure in the general case: From 9fc5e16dcd9bb3f25b0b6a0754a3045b1788d65f Mon Sep 17 00:00:00 2001 From: Tim McCormack Date: Thu, 8 Feb 2024 14:22:50 +0000 Subject: [PATCH 14/14] fixup! Fix typo "signals sends" --- docs/decisions/0015-outbox-pattern-and-production-modes.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/decisions/0015-outbox-pattern-and-production-modes.rst b/docs/decisions/0015-outbox-pattern-and-production-modes.rst index 09bd4bcd..4ab8b685 100644 --- a/docs/decisions/0015-outbox-pattern-and-production-modes.rst +++ b/docs/decisions/0015-outbox-pattern-and-production-modes.rst @@ -32,7 +32,7 @@ As of 2023-11-09 we publish events in two different ways relative to transaction - **Immediate publish**: The event is published to the event bus immediately upon the signal being sent, which will generally occur inside Django's request-level transaction (if using ``ATOMIC_REQUESTS``). This preserves atomicity in the success case as long as the broker is reachable, even if the IDA crashes -- but it does not preserve atomicity when the transaction fails. There is also no ordering guarantee in the case of concurrent requests. - **On-commit publish**: The event is published from a ``django.db.transaction.on_commit`` callback. This preserves atomicity in the failure case, but the IDA could crash after transaction commit but before calling the broker -- or more commonly, the broker could be down or unreachable, and all of the on-commit-published events would be lost during that interval. Ordering is also not preserved here. -We currently use an ad-hoc mix of immediate and on-commit publish in edx-platform, depending on how code sends to particular OpenEdxPublicSignals. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit publish is used. 
But most signals sends do not have any such call, and are likely published immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space. +We currently use an ad-hoc mix of immediate and on-commit publish in edx-platform, depending on how code sends to particular OpenEdxPublicSignals. For example, the code path for ``COURSE_CATALOG_INFO_CHANGED`` involves an explicit call to ``django.db.transaction.on_commit`` in order to ensure an on-commit publish is used. But most signal sends do not have any such call, and are likely published immediately. This uncontrolled state reflects our iterative approach to the event bus as well as our choice to start with events that are backed by other synchronization measures which can correct for consistency issues. However, we'd like to start handling events that require stronger reliability guarantees, such as those in the ecommerce space. Decision ********