-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a race condition in Bazel's Windows process management. #6860
Conversation
src/main/java/com/google/devtools/build/lib/windows/WindowsSubprocess.java
Show resolved
Hide resolved
Can we add a test to prevent #6853? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the incremental commits! It was much easier to review this PR in small pieces than it would've been as a whole.
src/main/java/com/google/devtools/build/lib/windows/WindowsSubprocess.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/windows/WindowsSubprocess.java
Show resolved
Hide resolved
Thanks so much for the good comments :) I've addressed them. |
I will add a test as requested by @meteorcloudy, too. |
src/main/java/com/google/devtools/build/lib/windows/WindowsSubprocess.java
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/windows/WindowsSubprocess.java
Outdated
Show resolved
Hide resolved
After discussing with @philwo in person: let's address the stylistic comments in a follow-up PR and only focus on the technical one. |
Move deletion logic into destructor.
This allows us to safely close the process handle early, while still allowing Bazel to ask for the exit code later.
If we don't have job objects, we terminate the process itself, but this could fail in case the process has already exited itself at this point. Check for this condition and also wait for the termination to succeed to guarantee that the exit code can be safely queried afterwards.
While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit. This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI: https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303 "Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied." This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing. Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743
…ng the error message.
I noticed that we already have the sleepy test in our tests: test_test_timeout in bazel_test_test.sh. |
Should we cherry-pick the fix into 0.22? |
I think it's too risky given the size of the change and also it's technically not a regression... :| |
OK, the good thing is we don't see the Windows flaky error as often as before with 0.21.0. So I'm fine waiting for 0.23 |
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.) While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit. This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI: https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303 "Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied." This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing. Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743 Closes #6860. PiperOrigin-RevId: 229371044
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.) While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit. This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI: https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303 "Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied." This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing. Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743 Closes #6860. PiperOrigin-RevId: 229371044
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.) While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit. This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI: https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303 "Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied." This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing. Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743 Closes #6860. PiperOrigin-RevId: 229371044
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.) While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit. This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI: https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303 "Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied." This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing. Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743 Closes bazelbuild#6860. PiperOrigin-RevId: 229371044
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.)
While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit.
This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI:
https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303
"Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied."
This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing.
Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743