[Task]: Client-side throttling for BigQueryIO DIRECT_READ mode #30646
Comments
This is a known issue, and there was an earlier approach that tried to address it, namely #24260. The current status of that approach is a pipeline option. This task tries to resolve the issue via client-side throttling, as an alternative to the aforementioned approach.
After #31096, client-side throttling now works with the Storage Read API v2 stream (#28778) on the Dataflow legacy runner. There are still many caveats. For the default Read API v1 stream, it appears that an API call waiting on retry does not temporarily release the concurrent-stream quota, so a hasNext call can block for a very long time before the metrics get reported back to the work item thread. The pipeline does not upscale; it is stuck indefinitely (probably until retries are exhausted).

Update: the v1 stream issue is due to an (effective) deadlock between two synchronized blocks, at Line 238 in 673da54 and Line 376 in 673da54; both will call

If readNextRecord gets called first, splitAtFraction will take very long (probably causing the issue). Finally, metrics are not supported in the thread calling splitAtFraction (the worker status update thread), so reporting pending metrics there should be removed.

===== It also seems not effective on Dataflow runner v2.
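A minimal sketch of the contention described above, using hypothetical class and method names modeled on the reader (not Beam's actual implementation): two synchronized methods on the same object share one monitor, so while one thread sits in a long retry inside readNextRecord, a concurrent splitAtFraction call blocks until the monitor is released.

```java
public class SyncContentionDemo {
    // Hypothetical stand-in for a source reader with two synchronized methods.
    static class Reader {
        synchronized boolean readNextRecord() throws InterruptedException {
            Thread.sleep(400); // simulates an API call retrying under the lock
            return true;
        }

        synchronized double splitAtFraction() {
            return 0.5; // cannot run while readNextRecord holds the monitor
        }
    }

    public static void main(String[] args) throws Exception {
        Reader reader = new Reader();
        Thread readThread = new Thread(() -> {
            try {
                reader.readNextRecord();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        long start = System.nanoTime();
        readThread.start();
        Thread.sleep(50);          // let the read thread grab the monitor first
        reader.splitAtFraction();  // blocks until readNextRecord finishes
        long waitedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("blocked=" + (waitedMs >= 300));
        readThread.join();
    }
}
```

The same pattern explains why splitAtFraction latency grows with the retry duration: the wait is bounded by how long the other method holds the lock, not by any work splitAtFraction itself does.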
A Dataflow runner-side issue was identified and resolved. Closing this as done.
What needs to happen?
There are reports of BigQueryIO DIRECT_READ running out of quota in large Dataflow batch pipelines. Basically, there are both per-project and per-region quotas on the number of active read streams. When this quota is exhausted, the BigQuery backend starts to revoke older streams that may still be active, causing reads to fail.
On the other hand, Dataflow is not aware that the quota is exhausted; it keeps splitting streams and possibly adds workers because progress is slow, putting more pressure on the quota.
The task is to design and implement a mechanism to mitigate this issue in the short term, as an initial implementation of the generic client-side throttling approach suggested in #24743.
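One way client-side throttling could work, as a rough sketch only (class and method names here are hypothetical, not Beam's actual API): cap the number of concurrently open read streams below the backend quota with a semaphore, and record time spent blocked so it could be surfaced as a throttling metric, which would let the runner distinguish quota backpressure from genuinely slow progress instead of upscaling.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical client-side throttle for read-stream creation.
public class StreamQuotaThrottle {
    private final Semaphore permits;
    private volatile long throttledMillis = 0; // candidate throttling metric

    public StreamQuotaThrottle(int maxConcurrentStreams) {
        this.permits = new Semaphore(maxConcurrentStreams);
    }

    // Runs a read under a quota permit; blocks rather than exceeding quota.
    public <T> T withStream(Callable<T> openAndRead) throws Exception {
        long start = System.currentTimeMillis();
        permits.acquire(); // wait here instead of opening a stream over quota
        throttledMillis += System.currentTimeMillis() - start;
        try {
            return openAndRead.call();
        } finally {
            permits.release(); // stream done, free the quota slot
        }
    }

    public long getThrottledMillis() {
        return throttledMillis;
    }

    public static void main(String[] args) throws Exception {
        StreamQuotaThrottle throttle = new StreamQuotaThrottle(2);
        int rows = throttle.withStream(() -> 42); // stands in for a real read
        System.out.println("result=" + rows);
    }
}
```

The key design point is that the throttle blocks before a stream is opened, so the backend never sees requests beyond quota and has no reason to revoke older, still-active streams.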
Issue Priority
Priority: 2 (default / most normal work should be filed as P2)
Issue Components