Race condition in issue-credential v1.0 leads to credential exchange switching to abandoned state #2000

markuskreusch · 2022-11-02T15:26:50Z

We have a test system running that connects two ACA-Py instances and triggers issuance of credentials using the issue-credential v1.0 protocol. During load tests, around 1% of all credential exchanges fail.

Some more details:

We listen for the /issue_credential webhook and wait for the status REQUEST_RECEIVED of the credential exchange.
If we receive this status, we invoke /issue-credential/records/{credentialExchangeId}/issue to trigger the issuance.
This request sometimes fails with a 400 response and the error message "Credential exchange in offer_sent state (must be request_received).", although our request is the direct result of a REQUEST_RECEIVED status event.
Afterwards the webhook returns with an error and is probably invoked again. This time the error refers to the credential exchange being in the abandoned state.
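
A minimal sketch of the controller flow described above, for illustration only. The `post` callable is a hypothetical stand-in for whatever HTTP client talks to the ACA-Py admin API, and the handler name is made up. The payload fields `state` and `credential_exchange_id` are assumed to be the ones read from the v1.0 webhook body.

```python
from typing import Awaitable, Callable, Mapping, Optional

# Hypothetical: an async callable that POSTs to the given admin API path and
# returns the HTTP status code. Any HTTP client could sit behind it.
PostFn = Callable[[str], Awaitable[int]]


async def on_issue_credential_webhook(
    payload: Mapping[str, str], post: PostFn
) -> Optional[int]:
    """React to an issue_credential webhook and issue once the request arrived."""
    if payload.get("state") != "request_received":
        return None  # other states need no action in this flow
    cred_ex_id = payload["credential_exchange_id"]
    # This is the call that intermittently fails with a 400 under load.
    return await post(f"/issue-credential/records/{cred_ex_id}/issue")
```

The handler does nothing wrong from the controller's point of view: it only acts on a state the agent itself reported.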

We found that this problem can be mitigated by waiting a few seconds before making the issue HTTP call, and thus assumed that this is a race condition. Further analysis seems to confirm this:

In https://github.com/hyperledger/aries-cloudagent-python/blob/aaee62f197f982d3b8d476e9ace690992708076f/aries_cloudagent/protocols/issue_credential/v1_0/manager.py#L560-L562 the credential record is saved and the transaction is committed.
Notification through the webhook happens inside of save in the post_save method in an asynchronous call inside of https://github.com/hyperledger/aries-cloudagent-python/blob/aaee62f197f982d3b8d476e9ace690992708076f/aries_cloudagent/transport/outbound/manager.py#L312. This way the transaction can commit before the webhook is invoked and thus the webhook can see the updated status.
We assume that most of the time the transaction is already committed when we make the HTTP call to trigger the issuance, but sometimes it is not yet committed, which leads to the observed problem.
When receiving the callback the transaction should be already committed to prevent race conditions.
This might also be a problem for other webhooks that involve transactions on the updated entities
The presented "workaround" of waiting is not a solution but more an approach to locate the problem. It prolongs the issuance by a large amount of time and does not work in all cases. The system should be race condition free to have something predictable.
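
The suspected ordering can be reproduced with a toy model. This is not ACA-Py code: `FakeStore`, `receive_request` and `run_demo` are hypothetical names, and the explicit yield forces the bad interleaving that in the real system only happens occasionally.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class FakeStore:
    """Toy store with a committed view and a pending (uncommitted) view."""

    def __init__(self) -> None:
        self.committed: Dict[str, str] = {"state": "offer_sent"}
        self.pending: Dict[str, str] = {}

    def save(self, state: str) -> None:
        self.pending["state"] = state  # visible only after commit()

    def commit(self) -> None:
        self.committed.update(self.pending)
        self.pending.clear()


async def receive_request(
    store: FakeStore, on_webhook: Callable[[str], Awaitable[None]]
) -> None:
    store.save("request_received")
    # The notification is scheduled asynchronously before the commit.
    task = asyncio.create_task(on_webhook("request_received"))
    await asyncio.sleep(0)  # yield: the webhook handler gets to run here
    store.commit()
    await task


async def run_demo() -> List[Tuple[str, str]]:
    store = FakeStore()
    seen: List[Tuple[str, str]] = []

    async def on_webhook(state: str) -> None:
        # Record the announced state next to what a reader would find stored.
        seen.append((state, store.committed["state"]))

    await receive_request(store, on_webhook)
    return seen


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

In this model the handler is told `request_received` while the committed state is still `offer_sent`, which matches the wording of the 400 error above.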

swcurran · 2022-11-08T19:18:24Z

FYI - @ianco -- this sounds familiar.

ianco · 2022-11-08T20:14:58Z

@swcurran ... Yes, similar to the mediator issue we uncovered ... we need to look at where we're sending messages vs committing updates to the exchange records, and also what error checking we're doing on the state when we receive a message (or invoke one of the admin api endpoints) ...

ianco · 2022-11-16T21:10:24Z

OK not exactly like the mediator issue, but also looks like a general problem across basically all protocols.

In base_record (https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/messaging/models/base_record.py#L389), emit_event(), which I think is what eventually triggers the webhook, happens before the transaction is committed.

Ideally any events that are triggered within a protocol shouldn't actually be "emitted" and acted upon until the protocol step completes.

Also the specific error of "exchange is in wrong state" shouldn't trigger the exchange to be pushed into an abandoned state or issue a problem report.

ianco · 2022-11-16T21:11:20Z

Labelling this as High Priority because I think this needs to be addressed for a 1.0.0 release

swcurran · 2024-02-01T18:00:59Z

Need to relook at this issue.

swcurran · 2024-02-01T18:22:45Z

@ianco -- could you look at this again, please? As issue-credential-1.0 is deprecated, an issue local to that particular concern is not that big a deal. However, if it is a broader issue, we need to know that, and to characterize both the problem and what actions we should take. Not looking for it to be fixed (yet) -- just a definition of the problem, its potential impact and suggestions.

Note that I didn't read through this full issue, so the information might already be all here. If so, summarizing it in one comment, or in an ACA-Pug presentation might be sufficient for this request. Let me know.

ianco · 2024-02-01T19:51:17Z

I'll take a look. Reading through the comments I think the solution is pretty straightforward. I'll fix the 1.0 protocol and double check the other protocols ...

ianco · 2024-02-02T17:47:00Z

However, if it is a broader issue, we need to know that ...

I believe this is a bigger problem. The emit_event() function (which triggers the webhook) is called as each record is added, updated or deleted, so in all cases the webhook can be called before the transaction is committed. So we have potential race conditions pretty much everywhere.

My suggestion ... the problem is in the following code, in https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/messaging/models/base_record.py:

    async def post_save(
        self,
        session: ProfileSession,
        new_record: bool,
        last_state: Optional[str],
        event: bool = None,
    ):
        """Perform post-save actions.

        Args:
            session: The profile session to use
            new_record: Flag indicating if the record was just created
            last_state: The previous state value
            event: Flag to override whether the event is sent
        """

        if event is None:
            event = new_record or (last_state != self.state)
        if event:
            await self.emit_event(session, self.serialize())

    async def delete_record(self, session: ProfileSession):
        """Remove the stored record.

        Args:
            session: The profile session to use
        """

        if self._id:
            storage = session.inject(BaseStorage)
            if self.state:
                self._previous_state = self.state
                self.state = BaseRecord.STATE_DELETED
                await self.emit_event(session, self.serialize())
            await storage.delete_record(self.storage_record)

    async def emit_event(self, session: ProfileSession, payload: Any = None):
        """Emit an event.

        Args:
            session: The profile session to use
            payload: The event payload
        """

        if not self.RECORD_TOPIC:
            return

        if self.state:
            topic = f"{self.EVENT_NAMESPACE}::{self.RECORD_TOPIC}::{self.state}"
        else:
            topic = f"{self.EVENT_NAMESPACE}::{self.RECORD_TOPIC}"

        if not payload:
            payload = self.serialize()

        await session.profile.notify(topic, payload)

(There are other scenarios as well in this code.)

Note that in the delete_record() scenario emit_event() is called before the record is deleted, whereas in the post_save() scenario it is called after the record is saved. In either case, the commit() happens later on.

My suggestion is to move the emit_event() code into the commit() method in the ProfileSession class: https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/core/profile.py

This would ensure that all notifications happen after the transaction is committed (and database updates are applied); however, it may have other side effects. I can give this a try but wanted to get some feedback ...

@dbluhm @swcurran @shaangill025 any thoughts?

ianco · 2024-02-02T17:55:25Z

My suggestion is to move the emit_event() code into the commit() method in the ProfileSession class: https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/core/profile.py

This would ensure that all notifications happen after the transaction is committed (and database updates are applied); however, it may have other side effects. I can give this a try but wanted to get some feedback ...

The emit_event() method in the ProfileSession class would just cache the event in a local array.
When commit() is called, it would emit all the events (after the database commit).
In the case of rollback(), all events would be abandoned.
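
A rough sketch of that idea, under stated assumptions: `DeferredEventSession` and its `notify` callable are hypothetical stand-ins, not the actual ProfileSession API, and the `committed` flag merely stands in for the real database commit.

```python
from typing import Any, Awaitable, Callable, List, Tuple

# Hypothetical notifier signature: (topic, payload) -> None
Notify = Callable[[str, Any], Awaitable[None]]


class DeferredEventSession:
    """Toy session that queues events and only sends them after commit."""

    def __init__(self, notify: Notify) -> None:
        self._notify = notify
        self._pending: List[Tuple[str, Any]] = []
        self.committed = False

    async def emit_event(self, topic: str, payload: Any) -> None:
        # Cache the event instead of notifying immediately.
        self._pending.append((topic, payload))

    async def commit(self) -> None:
        self.committed = True  # stand-in for the database commit
        # Take the queue first so an event is never sent twice.
        events, self._pending = self._pending, []
        for topic, payload in events:
            await self._notify(topic, payload)

    async def rollback(self) -> None:
        # Nothing was persisted, so nothing should be announced.
        self._pending.clear()
```

With this shape, a webhook consumer can only ever observe a state that has already been committed. The trade-off is that events are delayed until the end of the transaction.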

swcurran · 2024-02-05T22:04:21Z

@ianco - could you prepare a short session on this for the ACA-Pug meeting tomorrow. I’d like this to have a higher profile, since you haven’t gotten feedback. This is out of my realm.

ianco · 2024-02-05T23:47:18Z

@ianco - could you prepare a short session on this for the ACA-Pug meeting tomorrow. I’d like this to have a higher profile, since you haven’t gotten feedback. This is out of my realm.

Yep will do. I have a fix so I'll open a PR. I'm having trouble duplicating the issue though so can't (yet) verify the fix.

PR #2760

swcurran · 2024-02-07T16:41:01Z

Fixed by #2760

ianco · 2024-02-07T16:42:15Z

@markuskreusch

shaangill025 self-assigned this Nov 2, 2022

ianco added the bug and High Priority labels Nov 16, 2022

shaangill025 removed their assignment Nov 16, 2022

swcurran added the 1.0.0 label Feb 1, 2024

swcurran assigned ianco Feb 1, 2024

ianco mentioned this issue Feb 5, 2024

Move emit events to profile and delay sending until after commit #2760

Merged

swcurran closed this as completed Feb 7, 2024

swcurran removed the bug, High Priority and 1.0.0 labels Feb 7, 2024
