Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System tables optimization #18515

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

atanasenko
Copy link
Member

@atanasenko atanasenko commented Aug 3, 2023

Description

This is a first draft pr which includes following optimizations for informations schema and system.jdbc tables:

  • Perform access control before filtering out data to prevent going over huge datasets
  • Use same approach for both info_schema and system.jdbc
  • Try to make better decisions before resorting to catalog-wide search

Additional context and related issues

I'm still reworking TestHiveMetastoreMetadataQueriesAccessOperations to validate more scenarios (access control, many tables-many filtered tables, many tables-few filtered tables, few tables in total, etc.)

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Aug 3, 2023
@@ -757,18 +757,23 @@ protected Object visitArithmeticUnary(ArithmeticUnaryExpression node, Object con
if (handle.type().parameterCount() > 0 && handle.type().parameterType(0) == ConnectorSession.class) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ignore this commit, I'll remove it later.
I'm using eclipse and ecj doesn't like yield within try for some reason. M2e also errors out on missing property.

@atanasenko atanasenko force-pushed the at/system-tables-optimization branch from 8057b4f to f6d3d47 Compare August 3, 2023 10:50
@github-actions github-actions bot added tests:hive hive Hive connector labels Aug 3, 2023
Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just skimmed. Since it is optimization, do you have any numbers?

@@ -119,25 +119,25 @@ public void testSelectTablesWithoutPredicate(boolean allTablesViewsImplemented)
assertMetastoreInvocations("SELECT * FROM information_schema.tables",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract this commit to separate PR so we can merge it fast

@@ -39,6 +39,7 @@
import io.trino.spi.connector.RowChangeParadigm;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move table access control deeper into column listing

Can you explain in commit message why?

@@ -171,7 +173,7 @@ Optional<TableExecuteHandle> getTableHandleForExecute(
* Gets the columns metadata for all tables that match the specified prefix.
* TODO: consider returning a stream for more efficient processing
*/
List<TableColumnsMetadata> listTableColumns(Session session, QualifiedTablePrefix prefix);
List<TableColumnsMetadata> listTableColumns(Session session, QualifiedTablePrefix prefix, UnaryOperator<Set<SchemaTableName>> tablesFilter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnaryOperator does not sound like filter. Did you mean Predicate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Predicate just returns a boolean value, but we need to perform batch accessControl checks and return filtered entries.

{
requireNonNull(prefix, "prefix is null");

Optional<CatalogMetadata> catalog = getOptionalCatalogMetadata(session, prefix.getCatalogName());

// Track column metadata for every object name to resolve ties between table and view
Map<SchemaTableName, Optional<List<ColumnMetadata>>> tableColumns = new HashMap<>();
List<Map<SchemaTableName, Optional<List<ColumnMetadata>>>> tableColumnMaps = new ArrayList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean Multimap here? Also List sounds like an easy object to be merged with other List (maybe Set is even better here).

@@ -630,9 +637,11 @@ public void testSelectColumnsWithLikeOverColumn(boolean allTablesViewsImplemente
@Test
public void testSelectColumnsFilterByTableAndSchema()
{
assertMetastoreInvocations("SELECT * FROM information_schema.columns WHERE table_schema = 'test_schema_0' AND table_name = 'test_table_0'", ImmutableMultiset.of(GET_TABLE));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please undo formatting changes

@findepi
Copy link
Member

findepi commented Aug 3, 2023

  • Perform access control before filtering out data to prevent going over huge datasets

That's a good point.
Please note however that listing tables is sometimes sufficient to get all information we need.
See this @ebyhr 's comment #18315 (review)
and see also #18517

I am not sure it's obvious which direction to go into.

.addCopies(GET_TABLE, TEST_ALL_TABLES_COUNT)
.build()
.add(GET_ALL_DATABASES)
.add(GET_ALL_TABLES)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to agree on intellij configuration.
The current ("before") code is like my intellij happens to format the code.
The new ("after") code at least my intellij would not accept, and would revert back to the indentation "before"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe my formatter was taken from the same repo. Let me update it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that even without formatter definition changes, different intellij versions may render the code slightly differently. For pragmatic reasons we should for now simply standardize on eg latest stable ide version.

for this PR, feel free to simply undo formatting changes.

Comment on lines +135 to +142
final long schemaPrefixLimit = Math.min(maxPrefetchedInformationSchemaPrefixes, schemaPrefixes.size());
final long tablePrefixLimit = Math.max(maxPrefetchedInformationSchemaPrefixes, schemaPrefixLimit * TABLE_COUNT_PER_SCHEMA_THRESHOLD);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: we don't use final for variables

Comment on lines +51 to +55
@VisibleForTesting
public static final int TABLE_COUNT_PER_SCHEMA_THRESHOLD = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • @VisibleForTesting is usually on package-private stuff
  • what is semantics of TABLE_COUNT_PER_SCHEMA_THRESHOLD constant? please define

More importantly, i am concerned about "Extract information schema data filtering into a separate class" not being pure refactor, despite the name suggesting so. At least, I couldn't find any "threshold" constant in the code being (re)moved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one of the other commits got accidentally merged after a series of interactive rebases. It was supposed to be a pure refactor.

import static java.util.Locale.ENGLISH;
import static java.util.Objects.requireNonNull;

public final class SystemTableFilter<T>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this new class have a javadoc hinting what is it for?

* Gets the metadata for all columns that match the specified table prefix. Redirected table names are included, but the column metadata for them is not. Additional
* <code>tableFilter</code> parameter can be used to prune tables from column search.
*/
default Iterator<TableColumnsMetadata> streamTableColumns(ConnectorSession session, SchemaTablePrefix prefix, UnaryOperator<Set<SchemaTableName>> tablesFilter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There already is a method with this name in this interface (streamTableColumns(ConnectorSession session, SchemaTablePrefix prefix))

  • delegate to the existing method
  • deprecate the old method, since we want to eventually remove it

moreover, following applyFilter as example, we could consider not requiring the connector to apply tablesFilter and treat this as advisory. If we do so, we would need the connector to be able to tell whether it applied the filter. We should not apply the filter twice, as it may be expensive (depending on number of tables). Not sure about this, but this at least is the direction i took in #18517.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moreover, following applyFilter as example, we could consider not requiring the connector to apply tablesFilter and treat this as advisory

i think it's simpler if we require implementations to apply the filter.
shouldn't be a big deal and simplifies the API. i reworked #18517 following this idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going further, we could allow the connector to fetch column information for MVs, views and tables in one call. This would let avoid listing relations 3 times, when compared to getMaterializedViews + getViews + streamTableColumns currently used to satisfy information_schema.columns.
I drafted a PR showing how this could look like: #18585

Comment on lines +590 to +595
catch (TrinoException e) {
if (e.getErrorCode().equals(NOT_SUPPORTED.toErrorCode())) {
columnsIterator = metadata.streamTableColumns(connectorSession, tablePrefix);
needsExplicitFiltering = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if new streamTableColumns delegates to the old streamTableColumns, then the logic here doesn't need to be exception driven

@@ -19,12 +19,11 @@
import io.airlift.slice.Slices;
import io.trino.FullConnectorSession;
import io.trino.Session;
import io.trino.connector.informationschema.SystemTableFilter;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change schema/table prefix behavior in system.jdbc and info_schema

Assuming "info_schema" is an abbreviation for information_schema, please spell it out.

Also, the commit seems not to touch information_schema.
So spell out "system.jdbc.columns"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema and table prefixes are treated equal when limiting their numbers
during column queries in system.jdbc and information_schema. But in
reality fetching tables from 100 schemas and then filtering them in
memory is way more expensive than fetching 100 tables directly.

This is especially true if schemas have lots of tables underneath that
we will have to go through.

Thanks for adding some color to the commit message. This introduces the problem it is attempting to solve. Can the commit message also summarize the solution?

@@ -118,7 +121,7 @@ public void testInformationSchemaPredicatePushdown()
Constraint constraint = new Constraint(TupleDomain.withColumnDomains(domains.buildOrThrow()));

ConnectorSession session = createNewSession(transactionId);
ConnectorMetadata metadata = new InformationSchemaMetadata("test_catalog", this.metadata, MAX_PREFIXES_COUNT);
ConnectorMetadata metadata = new InformationSchemaMetadata("test_catalog", this.metadata, this.accessControl, MAX_PREFIXES_COUNT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this. redundant?

@atanasenko atanasenko force-pushed the at/system-tables-optimization branch from f6d3d47 to b431c66 Compare August 9, 2023 11:22
@atanasenko
Copy link
Member Author

Commit with streamTableColumns filtering will be superseded by #18586

Schema and table prefixes are treated equal when limiting their numbers
during column queries in system.jdbc and information_schema. But in
reality fetching tables from 100 schemas and then filtering them in
memory is way more expensive than fetching 100 tables directly.

This is especially true if schemas have lots of tables underneath that
we will have to go through.
@atanasenko atanasenko force-pushed the at/system-tables-optimization branch from b431c66 to 0ba9d57 Compare August 9, 2023 11:32
@findepi
Copy link
Member

findepi commented Aug 9, 2023

Commit with streamTableColumns filtering will be superseded by #18586

Thanks for the info.
You may also rebase onto it, if you want.
Move table access control deeper into column listing will probably get obsoleted too.

BTW i see there are unapplied/unanswered comments (e.g. #18515 (comment)), which is understood as WIP.
Let me know when I should review again.

@github-actions github-actions bot added the jdbc Relates to Trino JDBC driver label Aug 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector jdbc Relates to Trino JDBC driver
Development

Successfully merging this pull request may close these issues.

3 participants