-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
System tables optimization #18515
base: master
Are you sure you want to change the base?
System tables optimization #18515
Conversation
@@ -757,18 +757,23 @@ protected Object visitArithmeticUnary(ArithmeticUnaryExpression node, Object con | |||
if (handle.type().parameterCount() > 0 && handle.type().parameterType(0) == ConnectorSession.class) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ignore this commit, I'll remove it later.
I'm using eclipse and ecj doesn't like yield within try for some reason. M2e also errors out on missing property.
8057b4f
to
f6d3d47
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just skimmed. Since it is optimization, do you have any numbers?
@@ -119,25 +119,25 @@ public void testSelectTablesWithoutPredicate(boolean allTablesViewsImplemented) | |||
assertMetastoreInvocations("SELECT * FROM information_schema.tables", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please extract this commit to separate PR so we can merge it fast
@@ -39,6 +39,7 @@ | |||
import io.trino.spi.connector.RowChangeParadigm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move table access control deeper into column listing
Can you explain in commit message why?
@@ -171,7 +173,7 @@ Optional<TableExecuteHandle> getTableHandleForExecute( | |||
* Gets the columns metadata for all tables that match the specified prefix. | |||
* TODO: consider returning a stream for more efficient processing | |||
*/ | |||
List<TableColumnsMetadata> listTableColumns(Session session, QualifiedTablePrefix prefix); | |||
List<TableColumnsMetadata> listTableColumns(Session session, QualifiedTablePrefix prefix, UnaryOperator<Set<SchemaTableName>> tablesFilter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnaryOperator
does not sound like filter. Did you mean Predicate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Predicate just returns a boolean value, but we need to perform batch accessControl checks and return filtered entries.
{ | ||
requireNonNull(prefix, "prefix is null"); | ||
|
||
Optional<CatalogMetadata> catalog = getOptionalCatalogMetadata(session, prefix.getCatalogName()); | ||
|
||
// Track column metadata for every object name to resolve ties between table and view | ||
Map<SchemaTableName, Optional<List<ColumnMetadata>>> tableColumns = new HashMap<>(); | ||
List<Map<SchemaTableName, Optional<List<ColumnMetadata>>>> tableColumnMaps = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean Multimap
here? Also List
sounds like an easy object to be merged with other List
(maybe Set
is even better here).
@@ -630,9 +637,11 @@ public void testSelectColumnsWithLikeOverColumn(boolean allTablesViewsImplemente | |||
@Test | |||
public void testSelectColumnsFilterByTableAndSchema() | |||
{ | |||
assertMetastoreInvocations("SELECT * FROM information_schema.columns WHERE table_schema = 'test_schema_0' AND table_name = 'test_table_0'", ImmutableMultiset.of(GET_TABLE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please undo formatting changes
That's a good point. I am not sure it's obvious which direction to go into. |
.addCopies(GET_TABLE, TEST_ALL_TABLES_COUNT) | ||
.build() | ||
.add(GET_ALL_DATABASES) | ||
.add(GET_ALL_TABLES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to agree on intellij configuration.
The current ("before") code is like my intellij happens to format the code.
The new ("after") code at least my intellij would not accept, and would revert back to the indentation "before"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe my formatter was taken from the same repo. Let me update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that even without formatter definition changes, different intellij versions may render the code slightly differently. For pragmatic reasons we should for now simply standardize on eg latest stable ide version.
for this PR, feel free to simply undo formatting changes.
final long schemaPrefixLimit = Math.min(maxPrefetchedInformationSchemaPrefixes, schemaPrefixes.size()); | ||
final long tablePrefixLimit = Math.max(maxPrefetchedInformationSchemaPrefixes, schemaPrefixLimit * TABLE_COUNT_PER_SCHEMA_THRESHOLD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: we don't use final
for variables
@VisibleForTesting | ||
public static final int TABLE_COUNT_PER_SCHEMA_THRESHOLD = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VisibleForTesting
is usually on package-private stuff- what is semantics of
TABLE_COUNT_PER_SCHEMA_THRESHOLD
constant? please define
More importantly, i am concerned about "Extract information schema data filtering into a separate class" not being pure refactor, despite the name suggesting so. At least, I couldn't find any "threshold" constant in the code being (re)moved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one of the other commits got accidentally merged after a series of interactive rebases. It was supposed to be a pure refactor.
import static java.util.Locale.ENGLISH; | ||
import static java.util.Objects.requireNonNull; | ||
|
||
public final class SystemTableFilter<T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this new class have a javadoc hinting what is it for?
* Gets the metadata for all columns that match the specified table prefix. Redirected table names are included, but the column metadata for them is not. Additional | ||
* <code>tableFilter</code> parameter can be used to prune tables from column search. | ||
*/ | ||
default Iterator<TableColumnsMetadata> streamTableColumns(ConnectorSession session, SchemaTablePrefix prefix, UnaryOperator<Set<SchemaTableName>> tablesFilter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There already is a method with this name in this interface (streamTableColumns(ConnectorSession session, SchemaTablePrefix prefix)
)
- delegate to the existing method
- deprecate the old method, since we want to eventually remove it
moreover, following applyFilter as example, we could consider not requiring the connector to apply tablesFilter
and treat this as advisory. If we do so, we would need the connector to be able to tell whether it applied the filter. We should not apply the filter twice, as it may be expensive (depending on number of tables). Not sure about this, but this at least is the direction i took in #18517.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moreover, following applyFilter as example, we could consider not requiring the connector to apply
tablesFilter
and treat this as advisory
i think it's simpler if we require implementations to apply the filter.
shouldn't be a big deal and simplifies the API. i reworked #18517 following this idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going further, we could allow the connector to fetch column information for MVs, views and tables in one call. This would let avoid listing relations 3 times, when compared to getMaterializedViews + getViews + streamTableColumns
currently used to satisfy information_schema.columns
.
I drafted a PR showing how this could look like: #18585
catch (TrinoException e) { | ||
if (e.getErrorCode().equals(NOT_SUPPORTED.toErrorCode())) { | ||
columnsIterator = metadata.streamTableColumns(connectorSession, tablePrefix); | ||
needsExplicitFiltering = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if new streamTableColumns delegates to the old streamTableColumns, then the logic here doesn't need to be exception driven
@@ -19,12 +19,11 @@ | |||
import io.airlift.slice.Slices; | |||
import io.trino.FullConnectorSession; | |||
import io.trino.Session; | |||
import io.trino.connector.informationschema.SystemTableFilter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change schema/table prefix behavior in system.jdbc and info_schema
Assuming "info_schema" is an abbreviation for information_schema, please spell it out.
Also, the commit seems not to touch information_schema.
So spell out "system.jdbc.columns"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema and table prefixes are treated equal when limiting their numbers
during column queries in system.jdbc and information_schema. But in
reality fetching tables from 100 schemas and then filtering them in
memory is way more expensive than fetching 100 tables directly.This is especially true if schemas have lots of tables underneath that
we will have to go through.
Thanks for adding some color to the commit message. This introduces the problem it is attempting to solve. Can the commit message also summarize the solution?
@@ -118,7 +121,7 @@ public void testInformationSchemaPredicatePushdown() | |||
Constraint constraint = new Constraint(TupleDomain.withColumnDomains(domains.buildOrThrow())); | |||
|
|||
ConnectorSession session = createNewSession(transactionId); | |||
ConnectorMetadata metadata = new InformationSchemaMetadata("test_catalog", this.metadata, MAX_PREFIXES_COUNT); | |||
ConnectorMetadata metadata = new InformationSchemaMetadata("test_catalog", this.metadata, this.accessControl, MAX_PREFIXES_COUNT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this.
redundant?
f6d3d47
to
b431c66
Compare
Commit with streamTableColumns filtering will be superseded by #18586 |
Schema and table prefixes are treated equal when limiting their numbers during column queries in system.jdbc and information_schema. But in reality fetching tables from 100 schemas and then filtering them in memory is way more expensive than fetching 100 tables directly. This is especially true if schemas have lots of tables underneath that we will have to go through.
b431c66
to
0ba9d57
Compare
Thanks for the info. BTW i see there are unapplied/unanswered comments (e.g. #18515 (comment)), which is understood as WIP. |
Description
This is a first draft pr which includes following optimizations for informations schema and system.jdbc tables:
Additional context and related issues
I'm still reworking
TestHiveMetastoreMetadataQueriesAccessOperations
to validate more scenarios (access control, many tables-many filtered tables, many tables-few filtered tables, few tables in total, etc.)Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: