Fix race condition of timeout thread interrupt to stabilize multi level build tests #4254

Merged: 13 commits, Jan 6, 2025


lihaoyi (Member) commented Jan 6, 2025

The basic problem was an ordering race: `thread.interrupt()` ran before `interrupt = false`, so there was a chance the timeout thread would wake up and find its `if (interrupt) {` conditional still true. When that happened, the timeout thread timed out the open socket immediately and the Mill server process shut down.

The solution is to move `interrupt = false` to before `thread.interrupt()`.
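
The ordering issue can be illustrated with a minimal Java sketch. This is not Mill's actual server code: the class name, the latches, the `running` and `timedOut` fields, and the 60-second sleep are all assumptions made for illustration. Only the `interrupt` flag, the `if (interrupt)` check, and the `thread.interrupt()` call come from the description above. In the old-order branch, a latch forces the unlucky interleaving so that the race shows up deterministically instead of once every few tens of minutes.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the race; names and structure are assumptions.
class TimeoutRaceSketch {
    // true means "no activity since the last check", so the timeout may fire
    private volatile boolean interrupt = true;
    private volatile boolean running = true;
    private volatile boolean timedOut = false;
    private final CountDownLatch started = new CountDownLatch(1);
    private final CountDownLatch firstCheck = new CountDownLatch(1);

    private final Thread timeoutThread = new Thread(() -> {
        started.countDown();
        while (running) {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // Woken early; fall through to the flag check.
            }
            if (!running) break;
            if (interrupt) {
                // A real server would close its socket and shut down here.
                timedOut = true;
                running = false;
            } else {
                interrupt = true; // re-arm for the next sleep
            }
            firstCheck.countDown();
        }
    });

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simulates one incoming connection and reports whether the
    // timeout thread fired immediately.
    static boolean timesOutImmediately(boolean clearFlagFirst) {
        TimeoutRaceSketch s = new TimeoutRaceSketch();
        s.timeoutThread.setDaemon(true);
        s.timeoutThread.start();
        await(s.started);
        if (clearFlagFirst) {
            s.interrupt = false;         // fixed order: clear the flag first
            s.timeoutThread.interrupt();
            await(s.firstCheck);
        } else {
            s.timeoutThread.interrupt(); // old order: wake the thread first
            await(s.firstCheck);         // force the unlucky interleaving
            s.interrupt = false;         // too late, the check already ran
        }
        boolean result = s.timedOut;
        s.running = false;               // stop the sketch's thread
        s.timeoutThread.interrupt();
        return result;
    }

    public static void main(String[] args) {
        System.out.println("old order timed out immediately: " + timesOutImmediately(false));
        System.out.println("fixed order timed out immediately: " + timesOutImmediately(true));
    }
}
```

In this sketch, clearing the flag before the interrupt guarantees that whenever the woken thread reaches its check, it already sees `interrupt == false`, which is the property the fix relies on.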

This should hopefully fix the flakiness in the multi-level-build tests: the process shutting down caused the next command to spawn a new process, which re-created all classloaders and violated our assertions.

Also improved the test's error checking, so that the next time something similar happens we get a more precise report: "an unwanted process restart occurred" rather than just "classloader invalidation was unexpected".
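
One way such a check could be ordered is sketched below in Java. It is a hypothetical illustration, not the test code from this PR: the class, the method, and the idea of comparing process IDs and classloader identity hashes are assumptions. Only the two error messages come from the description above. The point is to test for a process restart first, because a restart also changes every classloader and would otherwise be reported as the less specific failure.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper; not part of Mill's test suite.
final class RestartDiagnosisSketch {
    // Returns an empty Optional when nothing looks wrong,
    // otherwise the most specific failure message available.
    static Optional<String> diagnose(long pidBefore, long pidAfter,
                                     List<Integer> loadersBefore,
                                     List<Integer> loadersAfter,
                                     boolean invalidationExpected) {
        // Check the process first: a restart explains any classloader change.
        if (pidBefore != pidAfter) {
            return Optional.of("an unwanted process restart occurred (pid "
                + pidBefore + " -> " + pidAfter + ")");
        }
        boolean invalidated = !loadersBefore.equals(loadersAfter);
        if (invalidated && !invalidationExpected) {
            return Optional.of("classloader invalidation was unexpected");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(diagnose(100, 200, List.of(1, 2), List.of(3, 4), false));
        System.out.println(diagnose(100, 100, List.of(1, 2), List.of(1, 5), false));
        System.out.println(diagnose(100, 100, List.of(1, 2), List.of(1, 2), false));
    }
}
```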

Tested manually with `while ./mill 'integration.invalidation[multi-level-editing].server.test'; do :; done`. Without this fix, the loop reproduces the problem on my laptop within a few tens of minutes; with the fix, I haven't managed to make it appear.

@lihaoyi lihaoyi marked this pull request as ready for review January 6, 2025 07:50
@lihaoyi lihaoyi merged commit f0aa010 into com-lihaoyi:main Jan 6, 2025
26 checks passed
@lefou lefou added this to the 0.12.6 milestone Jan 6, 2025
lihaoyi (Member, Author) commented Jan 6, 2025

I'm hoping this also fixes the flakiness in example.fundamentals.tasks[6-workers].local.test (e.g. https://github.com/com-lihaoyi/mill/actions/runs/12579809859/job/35060755776) and integration.failure[fatal-error].local.test (e.g. https://github.com/com-lihaoyi/mill/actions/runs/12591901838/job/35095817421), both of which could be explained by the Mill process from the previous invocation exiting unexpectedly before the subsequent invocation happens

lihaoyi added a commit that referenced this pull request Jan 7, 2025
This might have been the cause of a lot of flakiness that seems to have
gone away with #4254, as the server exiting caused the `runBackground`
calls to exit, which in turn caused the HTTP servers to exit and fail to
pick up requests.

The underlying cause might be com-lihaoyi/os-lib#324, which made
`destroyOnExit` the default for spawned subprocesses. This PR explicitly
disables `destroyOnExit` for the subprocesses where `background = true`.

Covered by a new `integration.invalidation` test that runs under both
`server` and `fork`, and that previously failed when run under `fork`.