ftltest race condition when waiting for pubsub to complete #2266

Closed
matt2e opened this issue Aug 6, 2024 · 1 comment · Fixed by #2332

matt2e commented Aug 6, 2024

from @mistermoe
Steps to repro:

  • Have 2 topics, each with a subscription
  • Create a test in which:
    • the subscriber for the first topic publishes to the second topic
    • the test publishes to the first topic
    • the test calls ftltest.WaitForSubscriptionsToComplete(ctx)

The test fails intermittently with this error: panic: sync: WaitGroup is reused before previous Wait has returned
Running the test with --race gives more info.

See branch matt2e/ftltest-pubsub-race-condition which already has a failing test set up for this
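
The panic message comes from Go's sync package, which requires that an Add starting from a zero counter happens before Wait, not while a Wait is still returning. The following is a minimal, deterministic model of how that can arise in the topology above, assuming the test helper ticks a counter up on publish and down on completion. It uses a plain integer instead of a real sync.WaitGroup so the ordering is reproducible; naiveTracker and simulate are hypothetical names, not part of ftltest.

```go
package main

import "fmt"

// naiveTracker is a hypothetical stand-in for a WaitGroup-style counter.
// It records how often the count drops to zero, which is the moment a
// goroutine blocked in Wait would be woken.
type naiveTracker struct {
	pending int
	hitZero int
}

func (t *naiveTracker) up() { t.pending++ }

func (t *naiveTracker) down() {
	t.pending--
	if t.pending == 0 {
		t.hitZero++
	}
}

// simulate replays the chain topic1 -> subscriber1 -> topic2 -> subscriber2.
// With downEarly set, the tick down for the first event lands before
// subscriber1 has published to topic2.
func simulate(downEarly bool) int {
	t := &naiveTracker{}
	t.up() // test publishes to topic1
	if downEarly {
		t.down() // first event counted as finished too early
	}
	t.up() // subscriber1 publishes to topic2
	if !downEarly {
		t.down() // subscriber1 returns
	}
	t.down() // subscriber2 returns
	return t.hitZero
}

func main() {
	fmt.Println(simulate(true))  // 2: the counter hits zero once too early
	fmt.Println(simulate(false)) // 1: the counter hits zero only at the end
}
```

In the early case the counter reaches zero and is then ticked up again. With a real sync.WaitGroup, that second tick up can overlap with a Wait that is in the middle of returning, which is the situation the panic reports.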

@github-actions github-actions bot added the triage Issue needs triaging label Aug 6, 2024
@ftl-robot ftl-robot mentioned this issue Aug 6, 2024
@wesbillman wesbillman removed the triage Issue needs triaging label Aug 6, 2024

matt2e commented Aug 7, 2024

The partner project has skipped their test in the meantime.

jonathanj-square added a commit that referenced this issue Aug 12, 2024
github-merge-queue bot pushed a commit that referenced this issue Aug 15, 2024
…#2332)

fixes: #2266

The race condition was caused by two issues.

1. The WaitGroup down tick was not synchronized with the completion of
the subscriber verb execution. Without waiting for completion, a down
tick may arrive before the subscriber dispatches a new event (which ticks
the WaitGroup up), so the WaitGroup may reach zero too soon.
2. Non-linear pubsub networks are not supported by simply counting up
when an event is published and down after it is completed. The up count
needs to match the number of live subscriptions (i.e. subscriptions with
registered subscribers).
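
The two points above can be sketched with a small in-memory fake. This is an illustration under assumptions, not the code from the fix: bus, handler, and runFanOut are hypothetical names. Publishing ticks the WaitGroup up once per registered subscriber, and each tick down is deferred until the subscriber function has returned, so a publish made synchronously inside a subscriber always ticks up while the counter is still above zero.

```go
package main

import (
	"fmt"
	"sync"
)

// handler stands in for a subscriber verb (hypothetical type).
type handler func(b *bus, event string)

// bus is a hypothetical in-memory pubsub fake.
type bus struct {
	wg   sync.WaitGroup
	mu   sync.Mutex
	subs map[string][]handler // topic -> live subscribers
	seen []string
}

func newBus() *bus { return &bus{subs: map[string][]handler{}} }

func (b *bus) subscribe(topic string, h handler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs[topic] = append(b.subs[topic], h)
}

func (b *bus) record(s string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.seen = append(b.seen, s)
}

func (b *bus) publish(topic, event string) {
	b.mu.Lock()
	hs := append([]handler(nil), b.subs[topic]...)
	b.mu.Unlock()

	// Tick up once per live subscriber before any delivery starts.
	b.wg.Add(len(hs))
	for _, h := range hs {
		go func(h handler) {
			// Tick down only after the subscriber has fully returned.
			defer b.wg.Done()
			h(b, event)
		}(h)
	}
}

// runFanOut wires a non-linear network: topic1 has two subscribers and
// one of them publishes to topic2 from inside its handler.
func runFanOut() int {
	b := newBus()
	b.subscribe("topic1", func(b *bus, e string) {
		b.record("sub1a:" + e)
		b.publish("topic2", e) // synchronous publish inside the subscriber
	})
	b.subscribe("topic1", func(b *bus, e string) { b.record("sub1b:" + e) })
	b.subscribe("topic2", func(b *bus, e string) { b.record("sub2:" + e) })

	b.publish("topic1", "hello")
	b.wg.Wait()

	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.seen)
}

func main() {
	fmt.Println(runFanOut()) // 3
}
```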

This synchronization scheme does not cover asynchronous event dispatch
(e.g. events dispatched via goroutines within the subscriber verb). Such
scenarios introduce a race condition that cannot be resolved by black-box
external synchronization.
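
A small sketch of that limitation, again with hypothetical names (lateDispatch, deliver) and a plain sync.WaitGroup rather than ftltest itself: the subscriber hands its follow-up publish to a goroutine and returns, so the counter reaches zero and Wait returns while an event is still to come. The release channel only makes the ordering deterministic for the example; in real code the late tick up could land while Wait is still returning.

```go
package main

import (
	"fmt"
	"sync"
)

// lateDispatch returns how many events had been delivered when the first
// Wait returned, and how many were delivered in total.
func lateDispatch() (before, after int) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	delivered := 0
	release := make(chan struct{})
	lateQueued := make(chan struct{})

	count := func() {
		mu.Lock()
		delivered++
		mu.Unlock()
	}

	// deliver models dispatching one event to one subscriber, ticking
	// down only after the subscriber function returns.
	deliver := func(h func()) {
		wg.Add(1)
		go func() {
			defer wg.Done()
			h()
		}()
	}

	deliver(func() {
		count()
		// The follow-up publish is moved to a goroutine; the subscriber
		// returns immediately, so the tracker cannot see it coming.
		go func() {
			<-release // stands in for an arbitrary scheduling delay
			deliver(count)
			close(lateQueued)
		}()
	})

	wg.Wait() // returns although one event has not been dispatched yet
	mu.Lock()
	before = delivered
	mu.Unlock()

	close(release)
	<-lateQueued
	wg.Wait()
	mu.Lock()
	after = delivered
	mu.Unlock()
	return before, after
}

func main() {
	fmt.Println(lateDispatch()) // 1 2
}
```

A test asserting on the second event right after the first Wait would fail here, and nothing outside the subscriber can tell that more work is pending.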