[Bug]: FileIO: lack of timeouts may cause the pipeline to get stuck indefinitely #29926
Comments
cc @shunping
This was observed again recently. Looking at the thread stacks showed that the futures that should be running to flush files were not blocked; they were neither running nor scheduled. Full thread stacks are below. I believe this may be an issue with the use of ForkJoinPool in Beam's MoreFutures.runAsync. The common pool is not well suited to asynchronous, non-joining work, and the indirect notification of futures in this method may be what triggers the problem. I believe #33042 may fix the issue by ensuring that the futures we join on are exactly the ones submitted to the ForkJoinPool.
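For illustration only, here is a minimal sketch of the general concern (not Beam's actual code and not the change in #33042): blocking close/flush work is better run on an explicit executor than scheduled indirectly onto the common ForkJoinPool. The class and executor names below are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: run potentially blocking writer-close work on a dedicated pool,
// so the tasks cannot end up neither running nor scheduled if the shared
// common pool is saturated with blocking work.
public class DedicatedExecutorSketch {
  // Hypothetical executor; Beam's MoreFutures manages scheduling differently.
  private static final ExecutorService CLOSE_EXECUTOR = Executors.newCachedThreadPool();

  static CompletableFuture<Void> closeAsync(AutoCloseable writer) {
    return CompletableFuture.runAsync(
        () -> {
          try {
            writer.close(); // potentially blocking I/O, e.g. finalizing an upload
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        },
        CLOSE_EXECUTOR); // explicit pool rather than ForkJoinPool.commonPool()
  }
}
```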
What happened?
A Java pipeline running on Dataflow using Beam 2.47 (the latest release, 2.52, is affected by the same issue) went into an unrecoverable state, with workers being stuck for at least 4 days with the following stack trace:
The source is here: on FinishBundle, WriteFiles waits for all futures that close writers to complete, but they never do. writer.cleanup from here produces a log message, which we didn't observe in this case, so the futures seemed to be stuck on writer.close. It's unclear why (possibly network issues, or other issues with Cloud Storage, which was the target file system), but I think the Beam SDK should implement a timeout here: if a writer can't be closed within a reasonable time, it should fail and the bundle should be re-processed.
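A minimal sketch of what the suggested timeout could look like, assuming the close operations are surfaced as CompletableFutures; the timeout value, class, and method names are hypothetical and this is not the actual WriteFiles implementation.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: bound the wait on the writer-close futures so a wedged writer
// fails the bundle instead of hanging it indefinitely.
public class CloseWithTimeoutSketch {
  // Hypothetical timeout value, for illustration only.
  private static final Duration CLOSE_TIMEOUT = Duration.ofMinutes(10);

  static void awaitCloses(List<CompletableFuture<Void>> closeFutures) throws Exception {
    CompletableFuture<Void> all =
        CompletableFuture.allOf(closeFutures.toArray(new CompletableFuture[0]));
    try {
      all.get(CLOSE_TIMEOUT.toMillis(), TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Propagating the failure lets the runner retry the bundle, which is
      // the behavior requested above.
      throw new RuntimeException("Timed out closing writers; failing bundle for retry", e);
    }
  }
}
```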
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components