Fix a bug caused by drop order of the Core struct #263

Merged
charypar merged 2 commits into master from viktor/fix-drop-bug
Aug 12, 2024

@charypar charypar commented Aug 9, 2024

This is a very subtle problem with the order of the fields of the Core type interacting with the async runtime when using channels (and possibly other coordination primitives).

The problem is this:

  1. A capability creates a long running future with a loop
  2. In order to communicate with this future, it creates a channel, gives the receiving end to the future and holds the sending end
  3. The future gets to the point of receiving from the channel and suspends on .await.
  4. If the core is now dropped, then due to the original ordering of the fields, the executor is dropped first, causing the task_sender channel to close.
  5. The A::Capabilities instance is dropped, dropping the capability instance, including the sending end of the channel
  6. The channel's Drop implementation closes the channel, which wakes up the future, so that the recv() method can return Err
  7. The ArcWake implementation on the task spawned by Crux's Spawner attempts to .send self to the task_sender, which fails because the receiving end was dropped in step 4, and we panic on the .expect
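The failing sequence in steps 4 to 7 can be modelled without an async runtime. The sketch below is a simplification, not Crux's code: `ToyExecutor`, `ToyCapability` and `wake_delivered` are hypothetical names, a `std::sync::mpsc` channel stands in for the task queue, and the capability's `Drop` stands in for the wake-up that re-queues the task. It only shows that a send fails once the receiving end has already been dropped.

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical stand-in for the executor: it owns the receiving end of the task queue.
#[allow(dead_code)]
struct ToyExecutor {
    task_receiver: Receiver<&'static str>,
}

// Hypothetical stand-in for a capability whose drop wakes a task,
// which in turn tries to re-queue itself on the task queue.
struct ToyCapability {
    task_sender: Sender<&'static str>,
    wake_outcome: Rc<Cell<Option<bool>>>,
}

impl Drop for ToyCapability {
    fn drop(&mut self) {
        // The real code panics on `.expect` here; this sketch records the result instead.
        let delivered = self.task_sender.send("task").is_ok();
        self.wake_outcome.set(Some(delivered));
    }
}

// Returns true if the wake-up triggered by dropping the capability reached the task queue.
pub fn wake_delivered(executor_dropped_first: bool) -> bool {
    let (task_sender, task_receiver) = channel();
    let wake_outcome = Rc::new(Cell::new(None));

    let executor = ToyExecutor { task_receiver };
    let capability = ToyCapability {
        task_sender,
        wake_outcome: Rc::clone(&wake_outcome),
    };

    if executor_dropped_first {
        drop(executor); // receiving end closes (step 4)
        drop(capability); // wake-up finds the queue closed (steps 5 to 7)
    } else {
        drop(capability); // wake-up is queued while the executor is still alive
        drop(executor);
    }

    wake_outcome
        .get()
        .expect("dropping the capability should record an outcome")
}
```

Calling `wake_delivered(true)` models the problematic order and reports that the send failed, while `wake_delivered(false)` models dropping the capability first and reports that the send succeeded.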

To fix, I've reordered the fields of Core to drop the user defined types first, so that any Crux infrastructure they rely on is still alive when any Drop implementations in them run.
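Reordering fields works because Rust drops a struct's fields in declaration order. A minimal sketch of that rule follows; the struct and field names are illustrative assumptions and do not reflect the real layout of `Core`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Appends its label to a shared log when dropped, so drop order can be observed.
struct Noisy {
    label: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.label);
    }
}

// Hypothetical layout with the infrastructure declared (and so dropped) first.
#[allow(dead_code)]
struct InfrastructureFirst {
    executor: Noisy,
    capabilities: Noisy,
}

// Hypothetical layout with the user-defined types declared (and so dropped) first.
#[allow(dead_code)]
struct UserTypesFirst {
    capabilities: Noisy,
    executor: Noisy,
}

pub fn drop_order_infrastructure_first() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let core = InfrastructureFirst {
        executor: Noisy { label: "executor", log: Rc::clone(&log) },
        capabilities: Noisy { label: "capabilities", log: Rc::clone(&log) },
    };
    drop(core); // fields are dropped in declaration order
    let order = log.borrow().clone();
    order
}

pub fn drop_order_user_types_first() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let core = UserTypesFirst {
        capabilities: Noisy { label: "capabilities", log: Rc::clone(&log) },
        executor: Noisy { label: "executor", log: Rc::clone(&log) },
    };
    drop(core);
    let order = log.borrow().clone();
    order
}
```

With the second layout, any `Drop` implementation in the user-defined part runs while the executor stand-in is still alive, which is the property the reordering relies on.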

@charypar charypar requested review from obmarg and StuartHarris August 9, 2024 16:55
@StuartHarris StuartHarris left a comment

Looks great. I'm guessing the pause/resume was to test the inverse? If so, do we need it now?

@charypar
It was to get the loop in the worker to be waiting on a channel receive. Previously it would only wait on one of the effects, so dropping a channel would not wake it.

Pause/resume was just a semi-realistic thing I could think of to get into the right situation to trigger the problem, even though the APIs are never used.

@charypar charypar merged commit 7f0b838 into master Aug 12, 2024
9 checks passed
@charypar charypar deleted the viktor/fix-drop-bug branch August 12, 2024 09:03