[Task]: Client-side throttling for BigQueryIO DIRECT_READ mode #30646

Closed
16 tasks
Abacn opened this issue Mar 15, 2024 · 3 comments · Fixed by #31096

@Abacn
Contributor

Abacn commented Mar 15, 2024

What needs to happen?

There are reports of BigQueryIO DIRECT_READ running out of quota in large Dataflow batch pipelines. There are both per-project and per-region quotas on the number of active read streams. When a quota is exhausted, the BigQuery backend starts to revoke older streams that may still be active, causing reads to fail.

Dataflow, on the other hand, is not aware that the quota is exhausted. It keeps splitting streams and may add workers because progress is slow, which puts even more pressure on the quota.

The task is to design and implement a mechanism to mitigate this issue in the short term, as a pilot implementation of the generic client-side throttling approach suggested in #24743.
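
For illustration only, here is a minimal sketch of what client-side throttling can look like: a quota-limited call is retried with exponential backoff, and the time spent backing off is accumulated so that a runner could, in principle, take it into account before scaling up. This is not the implementation from #31096. The class ThrottlingSketch, the exception QuotaExceededException, and the method callWithThrottling are all hypothetical names, and how a runner would consume the accumulated time is not shown.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: retry a quota-limited call with backoff and record throttled time. */
public class ThrottlingSketch {
  /** Hypothetical stand-in for a "quota exhausted" error returned by a backend. */
  public static class QuotaExceededException extends Exception {}

  /** Total milliseconds spent backing off so far. */
  private long throttledMillis = 0;

  public long getThrottledMillis() {
    return throttledMillis;
  }

  /** Runs the call, sleeping with doubling backoff whenever the quota error is thrown. */
  public <T> T callWithThrottling(Callable<T> call, int maxAttempts, long initialBackoffMillis)
      throws Exception {
    long backoff = initialBackoffMillis;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (QuotaExceededException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up after the last allowed attempt
        }
        Thread.sleep(backoff);
        throttledMillis += backoff; // record the time spent waiting on quota
        backoff *= 2;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ThrottlingSketch throttler = new ThrottlingSketch();
    int[] calls = {0};
    // Simulated backend: the first two calls hit the quota, the third succeeds.
    String result =
        throttler.callWithThrottling(
            () -> {
              if (++calls[0] <= 2) {
                throw new QuotaExceededException();
              }
              return "stream-opened";
            },
            5,
            10);
    System.out.println(
        result + " after " + calls[0] + " calls, throttled "
            + throttler.getThrottledMillis() + " ms");
  }
}
```

The point of recording the throttled time separately is that slow progress caused by quota backoff looks the same as slow progress caused by too few workers, unless the client reports the difference.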

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Abacn
Contributor Author

Abacn commented Mar 15, 2024

This is a known issue, and an earlier approach tried to address it, namely #24260. The current status of that approach is a pipeline option, --enableStorageReadApiV2, which opts in to BigQuery Storage Read API v2. There, read streams are units of work rather than units of parallelism, and streams, once created, are not split further. This approach still needs stress testing to see whether, and to what extent, it mitigates the issue.

This task tries to resolve the issue through client-side throttling, as an alternative to the approach above.

@Abacn
Contributor Author

Abacn commented Apr 26, 2024

After #31096, client-side throttling now works with (Storage Read API v2 stream (#28778) + Dataflow legacy runner). There are still many caveats.

For the default Read API v1 stream, it appears that an API call waiting on retry does not temporarily release the concurrent stream quota, so a hasNext call can block for a very long time before the metrics are reported back to the work item thread. The pipeline does not upscale, but it gets stuck indefinitely (probably until retries are exhausted).

Update: the API v1 stream issue is due to an (effective) deadlock between two synchronized blocks, both of which call hasNext. If splitAtFraction enters its synchronized block first, the work item thread has to wait until splitAtFraction exits that block. The symptom is then:

Operation ongoing in step BigQueryIO.TypedRead/Read(BigQueryStorageTableSource) for at least 05m00s without outputting or completing in state process in thread pool-3-thread-2 with id 28
  at org.apache.beam.sdk.io.gcp.bigquery.BigQueryStorageStreamSource$BigQueryStorageStreamReader.advance(BigQueryStorageStreamSource.java:232)

If readNextRecord gets called first, splitAtFraction will take a very long time (which probably causes issues of its own?).
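
The contention described here can be reproduced in isolation. The sketch below is not Beam source code: it assumes only that readNextRecord and splitAtFraction synchronize on the same object and that hasNext can block for a long time. The class SynchronizedContentionSketch, the helper measureSplitWaitMillis, and all method bodies are hypothetical placeholders.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical sketch of two synchronized methods contending for one monitor. */
public class SynchronizedContentionSketch {
  private final long hasNextBlockMillis;
  private final CountDownLatch readerHoldsLock = new CountDownLatch(1);

  public SynchronizedContentionSketch(long hasNextBlockMillis) {
    this.hasNextBlockMillis = hasNextBlockMillis;
  }

  /** Stand-in for a hasNext() call that blocks while a backend call is retried. */
  private boolean hasNext() throws InterruptedException {
    Thread.sleep(hasNextBlockMillis); // sleeping does not release the monitor
    return true;
  }

  /** Work item thread: holds the monitor for the whole blocking hasNext() call. */
  public synchronized boolean readNextRecord() throws InterruptedException {
    readerHoldsLock.countDown(); // signal that the monitor is now held
    return hasNext();
  }

  /** Status thread: cannot enter until readNextRecord releases the monitor. */
  public synchronized boolean splitAtFraction(double fraction) {
    return fraction > 0 && fraction < 1;
  }

  /** Returns how long splitAtFraction waited, in ms, when the reader got the lock first. */
  public long measureSplitWaitMillis() throws InterruptedException {
    Thread reader =
        new Thread(
            () -> {
              try {
                readNextRecord();
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
            });
    reader.start();
    readerHoldsLock.await(); // the reader is now inside its synchronized method
    long start = System.nanoTime();
    splitAtFraction(0.5); // blocks until the reader leaves readNextRecord
    long waitedMillis = (System.nanoTime() - start) / 1_000_000;
    reader.join();
    return waitedMillis;
  }

  public static void main(String[] args) throws InterruptedException {
    long waited = new SynchronizedContentionSketch(500).measureSplitWaitMillis();
    System.out.println("splitAtFraction waited about " + waited + " ms");
  }
}
```

With the roles reversed (splitAtFraction blocking inside hasNext while holding the monitor), the reading thread would be the one that stalls, which matches the "Operation ongoing ... without outputting or completing" symptom quoted in this comment.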

Finally, metrics are not supported in the thread that calls splitAtFraction (the worker status update thread), so the reportingPendingMetrics call there should be removed.

=====

It also does not seem to be effective on Dataflow Runner v2.

@Abacn
Contributor Author

Abacn commented Jun 11, 2024

A Dataflow runner-side issue was identified and resolved. Closing this as done.
