
[Feature Request]: Add a DIRECT_READ method to bigqueryio go sdk using BigQuery Storage API #33268

Open
2 of 17 tasks
win845 opened this issue Dec 3, 2024 · 1 comment

Comments

@win845

win845 commented Dec 3, 2024

What would you like to happen?

The current implementation of bigqueryio in Go is rudimentary. It always falls back to running a query and emitting records sequentially, which has the downside that subsequent ParDo steps are not autoscaled.

Add a bigqueryio.UseDirectRead option, or the like, that consumes the table through multiple parallel read streams via the BigQuery Storage API, as the Java and Python SDKs already do.
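A rough sketch of how such an option could look from the pipeline author's side (the option name is hypothetical and does not exist in the Go SDK today; it simply mirrors how DIRECT_READ is exposed elsewhere):

rows := bigqueryio.Query(scope, config.Project, itemOptionsQuery,
    reflect.TypeOf(ItemOptionRow{}),
    bigqueryio.UseStandardSQL(),
    bigqueryio.UseDirectRead(), // hypothetical: read results via BigQuery Storage API read streams in parallel
)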

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@win845
Author

win845 commented Dec 3, 2024

While the implementation may take a while: what is the current strategy for dealing with a sequentially emitting source like the one linked above? bigqueryio reads records in a loop and emits them sequentially, without implementing a progress method.
This leads to pipelines on Dataflow never being autoscaled.

Can something be done in a subsequent step so that the downstream processing is redistributed across multiple workers?

Example code (the element type passed to Query and the ParDo input type have to match, so both use ItemOptionRow here):

bigqueryRows := bigqueryio.Query(scope, config.Project, itemOptionsQuery,
    reflect.TypeOf(ItemOptionRow{}), bigqueryio.UseStandardSQL())

mutations := beam.ParDo(scope, func(bigqueryRow ItemOptionRow, emit func(bigtableio.Mutation)) {
    rowKey := bigqueryRow.Id
    mutation := bigtableio.NewMutation(rowKey)
    // ...
    emit(*mutation)
}, bigqueryRows)

Can bigqueryRows here somehow be consumed in chunks that are distributed across multiple workers for the further transform?
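One possible workaround, sketched below: inserting beam.Reshuffle between the BigQuery read and the ParDo breaks fusion, so the runner can redistribute the emitted rows across workers for the downstream processing. The read itself still runs sequentially on a single worker, and whether Dataflow actually upscales depends on the runner's signals, so this is only a partial mitigation, not a substitute for a Storage API based source.

// Sketch of a possible workaround using beam.Reshuffle to break fusion
// between the sequential BigQuery read and the downstream ParDo.
bigqueryRows := bigqueryio.Query(scope, config.Project, itemOptionsQuery,
    reflect.TypeOf(ItemOptionRow{}), bigqueryio.UseStandardSQL())

// Reshuffle materializes and redistributes the rows, so the following
// ParDo can run on multiple workers instead of staying fused to the read.
redistributed := beam.Reshuffle(scope, bigqueryRows)

mutations := beam.ParDo(scope, func(bigqueryRow ItemOptionRow, emit func(bigtableio.Mutation)) {
    mutation := bigtableio.NewMutation(bigqueryRow.Id)
    // ... populate mutation columns ...
    emit(*mutation)
}, redistributed)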
