Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in issue-credential v1.0 leads to credential exchange switching to abandoned state #2000

Closed
markuskreusch opened this issue Nov 2, 2022 · 13 comments
Assignees

Comments

@markuskreusch
Copy link

markuskreusch commented Nov 2, 2022

We have a test system running that connects two ACA-Py instances and triggers issuance of credentials using the issue-credential v1.0 protocol. During load tests we have around 1% of all credential exchanges fail.

Some more details:

  • We listen for the /issue_credential webhook and wait for the status REQUEST_RECEIVED of the credential exchange
  • If we receive this status we invoke /issue-credential/records/{credentialExchangeId}/issue to trigger the issuance
  • This request sometimes fails with a 400 response and the error message reads "Credential exchange in offer_sent state (must be request_received)." although our request is the direct result of a REQUEST_RECEIVED status event.
  • Afterwards the webhook returns with an error and is probably invoked again. This time the error involves the credential exchange state abandoned.

We found out that this problem can be mitigated by waiting some seconds before doing the issue http call and assumed thus, that this is a race condition. Further analysis seems to confirm this:

@shaangill025 shaangill025 self-assigned this Nov 2, 2022
@swcurran
Copy link
Contributor

swcurran commented Nov 8, 2022

FYI - @ianco -- this sounds familiar.

@ianco
Copy link
Contributor

ianco commented Nov 8, 2022

@swcurran ... Yes, similar to the mediator issue we uncovered ... we need to look at where we're sending messages vs committing updates to the exchange records, and also what error checking we're doing on the state when we receive a message (or invoke one of the admin api endpoints) ...

@ianco
Copy link
Contributor

ianco commented Nov 16, 2022

OK not exactly like the mediator issue, but also looks like a general problem across basically all protocols.

In the base_record (https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/messaging/models/base_record.py#L389) - the emit_event() happens before the transaction is committed (which I think is what eventually triggers the webhook).

Ideally any events that are triggered within a protocol shouldn't actually be "emitted" and acted upon until the protocol step completes.

Also the specific error of "exchange is in wrong state" shouldn't trigger the exchange to be pushed into an abandoned state or issue a problem report.

@ianco ianco added bug Something isn't working High Priority labels Nov 16, 2022
@ianco
Copy link
Contributor

ianco commented Nov 16, 2022

Labelling this as High Priority because I think this needs to be addressed for a 1.0.0 release

@shaangill025 shaangill025 removed their assignment Nov 16, 2022
@swcurran swcurran added the 1.0.0 To be addressed for the ACA-Py 1.0.0 release label Feb 1, 2024
@swcurran
Copy link
Contributor

swcurran commented Feb 1, 2024

Need to relook at this issue.

@swcurran
Copy link
Contributor

swcurran commented Feb 1, 2024

@ianco -- could you look at this again, please? As issue-credential-1.0 is deprecated, an issue local to that particular concern is not that big a deal. However, if it is a broader issue, we need to know that, and to characterize both the problem and what actions we should take. Not looking for it to be fixed (yet) -- just a definition of the problem, its potential impact and suggestions.

Note that I didn't read through this full issue, so the information might already be all here. If so, summarizing it in one comment, or in an ACA-Pug presentation might be sufficient for this request. Let me know.

@ianco
Copy link
Contributor

ianco commented Feb 1, 2024

I'll take a look. Reading through the comments I think the solution is pretty straightforward. I'll fix the 1.0 protocol and double check the other protocols ...

@ianco
Copy link
Contributor

ianco commented Feb 2, 2024

However, if it is a broader issue, we need to know that ...

I believe this is a bigger problem. The emit() function (which triggers the webhook) happens as each record is added, updated or deleted, so in all cases the webhook will be called before the record is saved. So we have potential race conditions pretty much everywhere.

My suggestion ... the problem is in the following code, in https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/messaging/models/base_record.py:

    async def post_save(
        self,
        session: ProfileSession,
        new_record: bool,
        last_state: Optional[str],
        event: bool = None,
    ):
        """Perform post-save actions.

        Args:
            session: The profile session to use
            new_record: Flag indicating if the record was just created
            last_state: The previous state value
            event: Flag to override whether the event is sent
        """

        if event is None:
            event = new_record or (last_state != self.state)
        if event:
            await self.emit_event(session, self.serialize())

    async def delete_record(self, session: ProfileSession):
        """Remove the stored record.

        Args:
            session: The profile session to use
        """

        if self._id:
            storage = session.inject(BaseStorage)
            if self.state:
                self._previous_state = self.state
                self.state = BaseRecord.STATE_DELETED
                await self.emit_event(session, self.serialize())
            await storage.delete_record(self.storage_record)

    async def emit_event(self, session: ProfileSession, payload: Any = None):
        """Emit an event.

        Args:
            session: The profile session to use
            payload: The event payload
        """

        if not self.RECORD_TOPIC:
            return

        if self.state:
            topic = f"{self.EVENT_NAMESPACE}::{self.RECORD_TOPIC}::{self.state}"
        else:
            topic = f"{self.EVENT_NAMESPACE}::{self.RECORD_TOPIC}"

        if not payload:
            payload = self.serialize()

        await session.profile.notify(topic, payload)

(There are other scenarios as well in this code.)

Note that in the delete() scenario the emit() is called before the record is saved, however in the post_save() scenario it happens before. In either case, the commit() happens later on.

My suggestion is to move the emit() code into the commit() method in the SessionProfile class: https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/core/profile.py

This would ensure that all notifications happen after the transaction is committed (and database updates are updated) however it may have other side effects. I can give this a try but wanted to get some feedback ...

@dbluhm @swcurran @shaangill025 any thoughts?

@ianco
Copy link
Contributor

ianco commented Feb 2, 2024

My suggestion is to move the emit() code into the commit() method in the SessionProfile class: https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/core/profile.py

This would ensure that all notifications happen after the transaction is committed (and database updates are updated) however it may have other side effects. I can give this a try but wanted to get some feedback ...

  • the emit_event() method in the SessionProfile class would just cache the event in a local array
  • when commit() is called, it would emit all the events (after the database commit)
  • in the case of rollback() all events would be abandoned

@swcurran
Copy link
Contributor

swcurran commented Feb 5, 2024

@ianco - could you prepare a short session on this for the ACA-Pug meeting tomorrow. I’d like this to have a higher profile, since you haven’t gotten feedback. This is out of my realm.

@ianco
Copy link
Contributor

ianco commented Feb 5, 2024

@ianco - could you prepare a short session on this for the ACA-Pug meeting tomorrow. I’d like this to have a higher profile, since you haven’t gotten feedback. This is out of my realm.

Yep will do. I have a fix so I'll open a PR. I'm having trouble duplicating the issue though so can't (yet) verify the fix.

PR #2760

@swcurran
Copy link
Contributor

swcurran commented Feb 7, 2024

Fixed by #2760

@swcurran swcurran closed this as completed Feb 7, 2024
@swcurran swcurran removed bug Something isn't working High Priority 1.0.0 To be addressed for the ACA-Py 1.0.0 release labels Feb 7, 2024
@ianco
Copy link
Contributor

ianco commented Feb 7, 2024

@markuskreusch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants