Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dataset cache to BigQuery connector #14729

Merged

Conversation

adamjshook
Copy link
Member

Description

When caseInsensitiveNameMatching=true, the BigQueryClient makes a call to listDatasets for each call on toRemoteDataset. This is problematic when running queries like SHOW SCHEMAS since listDatasets is called once to get the initial list, then once again for each schema. This is expensive if the BigQuery project has a large number of datasets.

This commit adds a cache to listing datasets to drastically reduce the query time for SHOW SCHEMAS.

Non-technical explanation

This commit improves the runtime of SHOW SCHEMAS when BigQuery has a large number of datasets.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:

# BigQuery Connector
* Improve the runtime of `SHOW SCHEMAS` by adding a dataset metadata cache. Set `bigquery.dataset-cache-ttl` in your catalog configuration to a duration, default is `1m`.

@cla-bot cla-bot bot added the cla-signed label Oct 24, 2022
@adamjshook adamjshook requested review from hashhar and wendigo October 24, 2022 16:13
@github-actions github-actions bot added the docs label Oct 24, 2022
Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

% comments

{
this.bigQuery = requireNonNull(bigQuery, "bigQuery is null");
this.materializationCache = requireNonNull(materializationCache, "materializationCache is null");
this.caseInsensitiveNameMatching = caseInsensitiveNameMatching;
this.dataSetCache = EvictableCacheBuilder.newBuilder()
.expireAfterWrite(java.time.Duration.ofNanos(datasetCacheTtl.roundTo(NANOSECONDS)))
.build(CacheLoader.from(projectId -> Streams.stream(bigQuery.listDatasets(projectId).iterateAll()).collect(toImmutableList())));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract a private method for (non caching) listDataSets

{
this.bigQuery = requireNonNull(bigQuery, "bigQuery is null");
this.materializationCache = requireNonNull(materializationCache, "materializationCache is null");
this.caseInsensitiveNameMatching = caseInsensitiveNameMatching;
this.dataSetCache = EvictableCacheBuilder.newBuilder()
.expireAfterWrite(java.time.Duration.ofNanos(datasetCacheTtl.roundTo(NANOSECONDS)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass it as datasetCacheTtl.toMIllis(), MILLISECONDS

@adamjshook adamjshook force-pushed the adamjshook/bigquery-list-schemas branch 2 times, most recently from 02cff27 to ea8dddd Compare October 26, 2022 18:43
@adamjshook adamjshook force-pushed the adamjshook/bigquery-list-schemas branch 2 times, most recently from 65c72f2 to f41751e Compare October 27, 2022 13:42
Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix CI failure.


public BigQueryClient(BigQuery bigQuery, boolean caseInsensitiveNameMatching, ViewMaterializationCache materializationCache)
public BigQueryClient(BigQuery bigQuery, boolean caseInsensitiveNameMatching, ViewMaterializationCache materializationCache, Duration datasetCacheTtl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public BigQueryClient(BigQuery bigQuery, boolean caseInsensitiveNameMatching, ViewMaterializationCache materializationCache, Duration datasetCacheTtl)
public BigQueryClient(BigQuery bigQuery, boolean caseInsensitiveNameMatching, ViewMaterializationCache materializationCache, Duration metadataCacheTtl)


private List<Dataset> listDatasetsFromBigQuery(String projectId)
{
return Streams.stream(bigQuery.listDatasets(projectId).iterateAll()).collect(toImmutableList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: stream formatting

Suggested change
return Streams.stream(bigQuery.listDatasets(projectId).iterateAll()).collect(toImmutableList());
return stream(bigQuery.listDatasets(projectId).iterateAll())
.collect(toImmutableList());

@@ -51,6 +51,7 @@
private boolean caseInsensitiveNameMatching;
private Duration viewsCacheTtl = new Duration(15, MINUTES);
private Duration serviceCacheTtl = new Duration(3, MINUTES);
private Duration metadataCacheTtl = new Duration(1, MINUTES);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default should be disabled IMO. Metadata caching has a tradeoff regarding changes made in source not being immediately visible to users.

Also note that the performance penalty only happens when case-insensitive-name-matching is enabled - which is not enabled by default. So caching by default only makes sense when case-insensitive-name-matching is enabled.

@adamjshook adamjshook force-pushed the adamjshook/bigquery-list-schemas branch from f41751e to 360e98f Compare October 28, 2022 18:14
@adamjshook adamjshook requested review from hashhar and ebyhr October 29, 2022 13:55
@@ -133,6 +133,8 @@ Property Description
``BIGNUMERIC`` and ``TIMESTAMP`` types are unsupported. ``false``
``bigquery.views-cache-ttl`` Duration for which the materialization of a view will be ``15m``
cached and reused. Set to ``0ms`` to disable the cache.
``bigquery.metadata-cache-ttl`` Duration for which metadata retrieved from BigQuery ``0m``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
``bigquery.metadata-cache-ttl`` Duration for which metadata retrieved from BigQuery ``0m``
``bigquery.metadata-cache-ttl`` Duration for which metadata retrieved from BigQuery ``0ms``

(To be consistent with the description saying "Set to 0ms to disable the cache.")

@@ -222,6 +223,21 @@ public BigQueryConfig setServiceCacheTtl(Duration serviceCacheTtl)
return this;
}

@NotNull
@MinDuration("0m")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does a MinDuration("0m") also accept 0ms? I think it would - important because docs say to set to 0ms to disable cache.

Comment on lines 33 to 37
@BeforeClass(alwaysRun = true)
public void initBigQueryExecutor()
{
this.bigQuerySqlExecutor = new BigQuerySqlExecutor();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a @BeforeClass? Ordering of multiple @BeforeClass members isn't deterministic AFAIK (createQueryRunner itself is called from some @BeforeClass).

Maybe move inside createQueryRunner instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this file/classs by copying TestBigQueryConnectorTest and modifying it -- this is how it is written over there.

Moved this to createQueryRunner.

" name varchar NOT NULL,\n" +
" comment varchar NOT NULL\n" +
"\\)",
Pattern.quote(getSession().getCatalog().orElseThrow()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, today I learnt about Pattern.quote. Easier than \\Qsometext\\E for sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we use regular expression instead of isEqualTo().

Pattern.quote(getSession().getCatalog().orElseThrow()),
Pattern.quote(getSession().getSchema().orElseThrow())));
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want an additional test method here that:

  • Create schema using Trino (so that dataset gets cached)
  • List schemas using Trino and verify created schema is visible
  • Create a table in the schema/list tables in the schema using Trino
  • Drop dataset using BigQuery
  • List schemas using Trino - notice that schema is still visible
  • Try to access table/list tables in schema using Trino - get an error

This would verify that the cache is indeed being used and would prevent reoccurence of #10725

import static org.assertj.core.api.Assertions.assertThat;

public class TestBigQueryConnectorSmokeTest
extends BaseConnectorSmokeTest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This connector sometimes reaches API limit in CI. How about extending AbstractTestQueryFramework instead and adding minimum tests described in https://github.com/trinodb/trino/pull/14729/files#r1009087323 if the test purpose is verifying only bigquery.metadata-cache-ttl config property?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea with smoke test was to make sure metadata caching doesn't break any functionality. But now I see that none of the JDBC connectors with metadata caching have a test like that.

I like your suggestion @ebyhr.

@adamjshook adamjshook force-pushed the adamjshook/bigquery-list-schemas branch from 360e98f to 41917b3 Compare November 1, 2022 15:16
@adamjshook adamjshook requested review from hashhar and ebyhr November 1, 2022 16:08
@@ -133,6 +133,8 @@ Property Description
``BIGNUMERIC`` and ``TIMESTAMP`` types are unsupported. ``false``
``bigquery.views-cache-ttl`` Duration for which the materialization of a view will be ``15m``
cached and reused. Set to ``0ms`` to disable the cache.
``bigquery.metadata-cache-ttl`` Duration for which metadata retrieved from BigQuery ``0ms``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to bigquery.metadata.cache-ttl for consistency with other connectors.

It's unfortunate that all connectors except JDBC use a prefix of connector name even though properties are already namespaced per connector.

Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When caseInsensitiveNameMatching=true, the BigQueryClient
makes a call to listDatasets for each call on toRemoteDataset.
This is problematic when running queries like SHOW SCHEMAS
since listDatasets is called once to get the initial list,
then once again for each schema. This is expensive if the
BigQuery project has a large number of datasets.

This commit adds a cache to listing datasets to drastically
reduce the query time for SHOW SCHEMAS.
@adamjshook adamjshook force-pushed the adamjshook/bigquery-list-schemas branch from 41917b3 to b69938f Compare November 2, 2022 13:57
Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI was green in #14812

@hashhar hashhar merged commit 6e29d60 into trinodb:master Nov 3, 2022
@github-actions github-actions bot added this to the 403 milestone Nov 3, 2022
@adamjshook adamjshook deleted the adamjshook/bigquery-list-schemas branch November 3, 2022 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants