Use BigQuery Read API for reading external BigLake tables #22974

Conversation

marcinsbd
Contributor

@marcinsbd marcinsbd commented Aug 7, 2024

Description

Continuation of #21017.

The BigQuery storage APIs support reading BigLake external tables (i.e. external tables with a connection). However, the current implementation reads them through views, which can be expensive because Trino has to issue a SQL query against BigQuery. This PR adds support for reading BigLake tables directly using the storage API.

There are no behavior changes for external tables and BQ native tables - they use the view and storage APIs respectively. Added a new test for BigLake tables.
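
The routing described above can be sketched as follows. This is only an illustration of the stated behavior, not the connector's actual code: `ReadPathSketch`, `TableKind`, `ReadPath`, and `choose` are hypothetical names introduced for this example.

```java
// Illustrative sketch only; all names here are hypothetical, not connector classes.
final class ReadPathSketch {
    // The three kinds of tables discussed in this PR.
    enum TableKind { NATIVE, BIGLAKE_EXTERNAL, PLAIN_EXTERNAL }

    // The two ways of reading data mentioned in the description.
    enum ReadPath { STORAGE_API, VIEW_QUERY }

    // Native tables keep using the storage API, BigLake tables now use it too,
    // and external tables without a connection keep going through a view.
    static ReadPath choose(TableKind kind) {
        switch (kind) {
            case NATIVE:
            case BIGLAKE_EXTERNAL:
                return ReadPath.STORAGE_API;
            default:
                return ReadPath.VIEW_QUERY;
        }
    }

    public static void main(String[] args) {
        for (TableKind kind : TableKind.values()) {
            System.out.println(kind + " -> " + choose(kind));
        }
    }
}
```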

Additional context and related issues

Fixes #21016
https://cloud.google.com/bigquery/docs/biglake-intro

Release notes

# BigQuery
* Improve performance when reading external BigLake tables. ({issue}`21016`)

@cla-bot cla-bot bot added the cla-signed label Aug 7, 2024
@github-actions github-actions bot added the bigquery BigQuery connector label Aug 7, 2024
@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch from 2ac0a6d to b7c24e9 Compare August 8, 2024 07:32
@marcinsbd marcinsbd marked this pull request as ready for review August 8, 2024 11:24
@marcinsbd
Contributor Author

marcinsbd commented Aug 8, 2024

I will squash the commits once I have made sure that all changes have been properly cherry-picked, and then rebase onto master.

@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch 5 times, most recently from 99e1bce to 17b1787 Compare August 13, 2024 11:15
@marcinsbd marcinsbd changed the title Use BigQuery Read API for reading external BigLake tables Use BigQuery Read API for reading external BigLake tables [wip] Aug 14, 2024
@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch 7 times, most recently from b168391 to b2e9393 Compare September 5, 2024 10:45
@marcinsbd marcinsbd changed the title Use BigQuery Read API for reading external BigLake tables [wip] Use BigQuery Read API for reading external BigLake tables Sep 5, 2024
@Praveen2112
Member

@marcinsbd Can we update the PR description?

@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch 2 times, most recently from b1fec91 to a6f3884 Compare September 10, 2024 11:52
@ebyhr
Member

ebyhr commented Nov 22, 2024

I think we should handle the following limitation:

https://cloud.google.com/bigquery/docs/biglake-intro

The BigQuery Storage API is not available in other cloud environments, such as AWS and Azure.

@Praveen2112
Member

Praveen2112 commented Nov 26, 2024

@ebyhr But for Azure or AKS we could use BigLake Omni, right? Or should we use a flag to control them?

@ebyhr
Member

ebyhr commented Nov 26, 2024

@Praveen2112 I'm not sure how BigLake Omni works in this case. How about adding another condition to BigQueryMetadata#isBigLakeTable? e.g.

externalTableDefinition.getSourceUris().stream().allMatch(uri -> uri.startsWith("gs://"))

@marcinsbd
Contributor Author

Hi @anoopj,
What would be the correct way to check whether a TableDefinition describes a BigLake table (an external table with a connection ID)?
Is there a simple way, using the API, to distinguish a BigLake table from an object table and an Omni table?

@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch from f3da1b5 to e9abe1c Compare November 30, 2024 00:49
@anoopj
Member

anoopj commented Dec 2, 2024

> what would be the correct way to check if a TableDefinition describes BigLake Table (external table with connection id) or not?

Not sure if I understood the question, but a BigLake table will have the connection ID in the table's ExternalDataConfiguration.

> is there a simple way using api to distinguish BigLakeTable from ObjectTable and Omni Table ?

You can tell from the dataset region (preferred) or by looking at the sourceUris, as mentioned in the comment above.
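
A minimal sketch of how the two signals mentioned in this thread (a connection ID on the external table, and source URIs that all start with `gs://`) might be combined. It takes plain values rather than BigQuery client objects, so `BigLakeCheckSketch` and `isBigLakeOnGcs` are hypothetical names, and the empty-list guard is an assumption added here, since `allMatch` returns true for an empty stream. It uses the source URI approach rather than the dataset region check described as preferred above.

```java
import java.util.List;

// Illustrative sketch only; not the BigQueryMetadata#isBigLakeTable implementation.
final class BigLakeCheckSketch {
    // connectionId: the connection recorded on the external table, or null if none.
    // sourceUris: the external table's source URIs.
    static boolean isBigLakeOnGcs(String connectionId, List<String> sourceUris) {
        // A BigLake table is an external table that has a connection.
        boolean hasConnection = connectionId != null && !connectionId.isEmpty();
        // Guard against an empty list, because allMatch is vacuously true for it.
        boolean allOnGcs = !sourceUris.isEmpty()
                && sourceUris.stream().allMatch(uri -> uri.startsWith("gs://"));
        return hasConnection && allOnGcs;
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        System.out.println(isBigLakeOnGcs("example-connection", List.of("gs://bucket/data/*.parquet")));
        System.out.println(isBigLakeOnGcs("example-connection", List.of("s3://bucket/data/*.parquet")));
        System.out.println(isBigLakeOnGcs(null, List.of("gs://bucket/data/*.parquet")));
    }
}
```

With this shape, a table whose data lives outside Google Cloud Storage would keep using the existing view-based read path, which addresses the limitation that the storage API is not available in other cloud environments.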

@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch from e9abe1c to 5cad7c1 Compare December 4, 2024 11:32
@marcinsbd
Contributor Author

Thanks @ebyhr, @Praveen2112, @anoopj, @krvikash, @pajaks for the review and your help. AC

@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch 2 times, most recently from fd46be1 to ebeb942 Compare December 4, 2024 12:39
@ebyhr ebyhr self-requested a review December 5, 2024 01:55
@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch from ebeb942 to e62be36 Compare December 5, 2024 11:04
@marcinsbd
Contributor Author

Let's do another round of review, please: @ebyhr, @Praveen2112, @krvikash.

Member

@Praveen2112 Praveen2112 left a comment

% comments

@Praveen2112
Member

@marcinsbd Can we rebase the PR?

marcinsbd and others added 2 commits December 11, 2024 16:45
The storage APIs support reading BigLake external tables (i.e. external
tables with a connection). But the current implementation uses views,
which can be expensive because they require a query. This PR adds
support to read BigLake tables directly using the storage API.

There are no behavior changes for external tables and BQ native tables -
they use the view and storage APIs respectively.

Added a new test for BigLake tables.

Co-authored-by: Marcin Rusek <[email protected]>
@marcinsbd marcinsbd force-pushed the marcinsbd/use-bq-read-api-for-reading-external-bl-tables branch from e62be36 to bcad172 Compare December 11, 2024 15:46
@ebyhr ebyhr merged commit 4718011 into trinodb:master Dec 12, 2024
195 of 198 checks passed
@github-actions github-actions bot added this to the 468 milestone Dec 12, 2024
Labels
bigquery BigQuery connector cla-signed
Development

Successfully merging this pull request may close these issues.

Support for direct read of BigLake tables using BigQuery storage API
7 participants