-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dataset cache to BigQuery connector #14729
Add dataset cache to BigQuery connector #14729
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
% comments
{ | ||
this.bigQuery = requireNonNull(bigQuery, "bigQuery is null"); | ||
this.materializationCache = requireNonNull(materializationCache, "materializationCache is null"); | ||
this.caseInsensitiveNameMatching = caseInsensitiveNameMatching; | ||
this.dataSetCache = EvictableCacheBuilder.newBuilder() | ||
.expireAfterWrite(java.time.Duration.ofNanos(datasetCacheTtl.roundTo(NANOSECONDS))) | ||
.build(CacheLoader.from(projectId -> Streams.stream(bigQuery.listDatasets(projectId).iterateAll()).collect(toImmutableList()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract a private method for (non caching) listDataSets
{ | ||
this.bigQuery = requireNonNull(bigQuery, "bigQuery is null"); | ||
this.materializationCache = requireNonNull(materializationCache, "materializationCache is null"); | ||
this.caseInsensitiveNameMatching = caseInsensitiveNameMatching; | ||
this.dataSetCache = EvictableCacheBuilder.newBuilder() | ||
.expireAfterWrite(java.time.Duration.ofNanos(datasetCacheTtl.roundTo(NANOSECONDS))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass it as datasetCacheTtl.toMIllis(), MILLISECONDS
02cff27
to
ea8dddd
Compare
plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryClient.java
Outdated
Show resolved
Hide resolved
plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryClient.java
Outdated
Show resolved
Hide resolved
plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/BigQueryClient.java
Outdated
Show resolved
Hide resolved
65c72f2
to
f41751e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix CI failure.
|
||
public BigQueryClient(BigQuery bigQuery, boolean caseInsensitiveNameMatching, ViewMaterializationCache materializationCache) | ||
public BigQueryClient(BigQuery bigQuery, boolean caseInsensitiveNameMatching, ViewMaterializationCache materializationCache, Duration datasetCacheTtl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public BigQueryClient(BigQuery bigQuery, boolean caseInsensitiveNameMatching, ViewMaterializationCache materializationCache, Duration datasetCacheTtl) | |
public BigQueryClient(BigQuery bigQuery, boolean caseInsensitiveNameMatching, ViewMaterializationCache materializationCache, Duration metadataCacheTtl) |
|
||
private List<Dataset> listDatasetsFromBigQuery(String projectId) | ||
{ | ||
return Streams.stream(bigQuery.listDatasets(projectId).iterateAll()).collect(toImmutableList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: stream formatting
return Streams.stream(bigQuery.listDatasets(projectId).iterateAll()).collect(toImmutableList()); | |
return stream(bigQuery.listDatasets(projectId).iterateAll()) | |
.collect(toImmutableList()); |
@@ -51,6 +51,7 @@ | |||
private boolean caseInsensitiveNameMatching; | |||
private Duration viewsCacheTtl = new Duration(15, MINUTES); | |||
private Duration serviceCacheTtl = new Duration(3, MINUTES); | |||
private Duration metadataCacheTtl = new Duration(1, MINUTES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default should be disabled IMO. Metadata caching has a tradeoff regarding changes made in source not being immediately visible to users.
Also note that the performance penalty only happens when case-insensitive-name-matching is enabled - which is not enabled by default. So caching by default only makes sense when case-insensitive-name-matching is enabled.
f41751e
to
360e98f
Compare
@@ -133,6 +133,8 @@ Property Description | |||
``BIGNUMERIC`` and ``TIMESTAMP`` types are unsupported. ``false`` | |||
``bigquery.views-cache-ttl`` Duration for which the materialization of a view will be ``15m`` | |||
cached and reused. Set to ``0ms`` to disable the cache. | |||
``bigquery.metadata-cache-ttl`` Duration for which metadata retrieved from BigQuery ``0m`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
``bigquery.metadata-cache-ttl`` Duration for which metadata retrieved from BigQuery ``0m`` | |
``bigquery.metadata-cache-ttl`` Duration for which metadata retrieved from BigQuery ``0ms`` |
(To be consistent with the description saying "Set to 0ms
to disable the cache.")
@@ -222,6 +223,21 @@ public BigQueryConfig setServiceCacheTtl(Duration serviceCacheTtl) | |||
return this; | |||
} | |||
|
|||
@NotNull | |||
@MinDuration("0m") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does a MinDuration("0m")
also accept 0ms
? I think it would - important because docs say to set to 0ms
to disable cache.
@BeforeClass(alwaysRun = true) | ||
public void initBigQueryExecutor() | ||
{ | ||
this.bigQuerySqlExecutor = new BigQuerySqlExecutor(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a @BeforeClass
? Ordering of multiple @BeforeClass
members isn't deterministic AFAIK (createQueryRunner
itself is called from some @BeforeClass
).
Maybe move inside createQueryRunner
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this file/classs by copying TestBigQueryConnectorTest
and modifying it -- this is how it is written over there.
Moved this to createQueryRunner
.
...in/trino-bigquery/src/test/java/io/trino/plugin/bigquery/TestBigQueryConnectorSmokeTest.java
Outdated
Show resolved
Hide resolved
" name varchar NOT NULL,\n" + | ||
" comment varchar NOT NULL\n" + | ||
"\\)", | ||
Pattern.quote(getSession().getCatalog().orElseThrow()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, today I learnt about Pattern.quote
. Easier than \\Qsometext\\E
for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we use regular expression instead of isEqualTo()
.
Pattern.quote(getSession().getCatalog().orElseThrow()), | ||
Pattern.quote(getSession().getSchema().orElseThrow()))); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want an additional test method here that:
- Create schema using Trino (so that dataset gets cached)
- List schemas using Trino and verify created schema is visible
- Create a table in the schema/list tables in the schema using Trino
- Drop dataset using BigQuery
- List schemas using Trino - notice that schema is still visible
- Try to access table/list tables in schema using Trino - get an error
This would verify that the cache is indeed being used and would prevent reoccurence of #10725
import static org.assertj.core.api.Assertions.assertThat; | ||
|
||
public class TestBigQueryConnectorSmokeTest | ||
extends BaseConnectorSmokeTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This connector sometimes reaches API limit in CI. How about extending AbstractTestQueryFramework
instead and adding minimum tests described in https://github.com/trinodb/trino/pull/14729/files#r1009087323 if the test purpose is verifying only bigquery.metadata-cache-ttl
config property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea with smoke test was to make sure metadata caching doesn't break any functionality. But now I see that none of the JDBC connectors with metadata caching have a test like that.
I like your suggestion @ebyhr.
360e98f
to
41917b3
Compare
@@ -133,6 +133,8 @@ Property Description | |||
``BIGNUMERIC`` and ``TIMESTAMP`` types are unsupported. ``false`` | |||
``bigquery.views-cache-ttl`` Duration for which the materialization of a view will be ``15m`` | |||
cached and reused. Set to ``0ms`` to disable the cache. | |||
``bigquery.metadata-cache-ttl`` Duration for which metadata retrieved from BigQuery ``0ms`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to bigquery.metadata.cache-ttl
for consistency with other connectors.
It's unfortunate that all connectors except JDBC use a prefix of connector name even though properties are already namespaced per connector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good except for https://github.com/trinodb/trino/pull/14729/files#r1011219261
When caseInsensitiveNameMatching=true, the BigQueryClient makes a call to listDatasets for each call on toRemoteDataset. This is problematic when running queries like SHOW SCHEMAS since listDatasets is called once to get the initial list, then once again for each schema. This is expensive if the BigQuery project has a large number of datasets. This commit adds a cache to listing datasets to drastically reduce the query time for SHOW SCHEMAS.
41917b3
to
b69938f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI was green in #14812
Description
When
caseInsensitiveNameMatching=true
, theBigQueryClient
makes a call tolistDatasets
for each call ontoRemoteDataset
. This is problematic when running queries likeSHOW SCHEMAS
sincelistDatasets
is called once to get the initial list, then once again for each schema. This is expensive if the BigQuery project has a large number of datasets.This commit adds a cache to listing datasets to drastically reduce the query time for
SHOW SCHEMAS
.Non-technical explanation
This commit improves the runtime of
SHOW SCHEMAS
when BigQuery has a large number of datasets.Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text: