Enable the Information Schema Connector to load data lazily #1543
Conversation
Looks good to me. Thanks!
@@ -195,6 +195,10 @@ public ConnectorTableProperties getTableProperties(ConnectorSession session, Con

    Set<QualifiedTablePrefix> prefixes = getPrefixes(session, table, constraint);

    if (prefixes.equals(defaultPrefixes())) {
Should this be `if (prefixes.equals(table.getPrefixes()))`?

Should we remove `|| !table.getPrefixes().equals(defaultPrefixes())` from line 192?
I moved the check of `isTablesEnumeratingTable` from `calculatePrefixesWithSchemaName` to line 192 to short-circuit it. As there is no `isTablesEnumeratingTable` check in `calculatePrefixesWithSchemaName` any more, we can't remove `|| !table.getPrefixes().equals(defaultPrefixes())` from line 192.
Ok. The problem here is that the Information Schema connector does not allow subsequent predicate pushdown. A predicate can be pushed down only once.
Sorry, I misread your suggestion and my reasoning was messed up. I see your point now.
- For the first suggestion, yes, I think it should be `if (prefixes.equals(table.getPrefixes()))`.
- For the second suggestion, I'd rather keep `|| !table.getPrefixes().equals(defaultPrefixes())` for now, because the current `applyFilter` doesn't perform predicate pushdown upon existing prefixes (each time, prefixes are calculated independently). We can address this in a future PR.
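The discussion above is about refining an already-pushed-down prefix set instead of recomputing it from scratch on each `applyFilter` call. As a minimal sketch of what such refinement could look like (the class and method names here are hypothetical, not actual Presto API; prefixes are represented as plain strings for brevity):

```java
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: supporting repeated predicate pushdown by refining the
// prefixes already stored on the table handle, rather than deriving a fresh
// set each time. Each call can only narrow the prefix set, never widen it.
public class PrefixRefinement {
    // Intersect the previously pushed-down prefixes with the newly derived ones.
    static Set<String> refine(Set<String> existing, Set<String> derived) {
        return existing.stream()
                .filter(derived::contains)
                .collect(Collectors.toSet());
    }
}
```

With refinement like this, a second pushdown of a narrower predicate would shrink the prefix set instead of replacing it, which is what the current independent recalculation prevents.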
@@ -167,9 +167,14 @@ public ConnectorTableMetadata getTableMetadata(ConnectorSession session, Connect

    @Override
> Make table listing in MockConnector emulate Hive Connector better

Can you explain what this means? "Better" is very ambiguous.
Sure, I've included the explanation in the commit message body.
@Test
public void testLimit()
{
    assertQuery("SELECT count(*) FROM (SELECT * from tpch.information_schema.columns LIMIT 1)", "VALUES 1");
Can we count metadata calls to prove that metadata is called fewer times with a lower limit?
Yes. I added two tests to `TestInformationSchemaConnector::testMetadataCalls`.

Also, through doing this, I found that we can make `InformationSchemaPageSource::addColumnsRecords` finer-grained to reduce the number of metadata calls for queries with a limit clause. Instead of calling `listTableColumns` directly with a schema prefix, we can call `listTables` first and then call `listTableColumns` for each table prefix. In this way, after a limit is hit, a schema's remaining tables won't be enumerated. But this approach makes the code messier and won't help a lot if the table distribution among schemas is not skewed. So I'd rather not do it. Please let me know if you feel it's better to do so.
> I added two tests to TestInformationSchemaConnector::testMetadataCalls.

I don't see them.

> Please let me know if you feel it's better to do so.

Are there any other disadvantages besides the code becoming more complicated? If we can optimize some real-life use case, then I think it is worth doing. Maybe it could be done after this PR, with some additional abstraction or refactoring to handle the added code complexity.
> I don't see them.

Added now.

> Are there any other disadvantages besides the code becoming more complicated?

I tried to implement this finer-grained approach and found that it does incur inefficiency in terms of the number of metastore calls. For the Hive Connector, `listTableColumns` performs better on schema prefixes than on table prefixes. In other words, unfolding a schema prefix and then performing `listTableColumns` on each table prefix increases the number of metastore calls. So I didn't use this approach in the updated PR. But as a basis for discussion, the finer-grained approach I'm talking about looks like:
private void addColumnsRecords(QualifiedTablePrefix prefix)
{
List<QualifiedTablePrefix> tablePrefixes;
if (prefix.getTableName().isPresent()) {
tablePrefixes = ImmutableList.of(prefix);
}
else {
tablePrefixes = Stream.concat(
metadata.listTables(session, prefix).stream(),
metadata.listViews(session, prefix).stream())
.map(QualifiedObjectName::asQualifiedTablePrefix)
.collect(toImmutableList());
}
for (QualifiedTablePrefix tablePrefix : tablePrefixes) {
for (Map.Entry<SchemaTableName, List<ColumnMetadata>> entry : listTableColumns(session, metadata, accessControl, tablePrefix).entrySet()) {
SchemaTableName tableName = entry.getKey();
int ordinalPosition = 1;
for (ColumnMetadata column : entry.getValue()) {
if (column.isHidden()) {
continue;
}
addRecord(
prefix.getCatalogName(),
tableName.getSchemaName(),
tableName.getTableName(),
column.getName(),
ordinalPosition,
null,
"YES",
column.getType().getDisplayName(),
column.getComment(),
column.getExtraInfo(),
column.getComment());
ordinalPosition++;
if (isLimitExhausted()) {
return;
}
}
}
}
}
    return null;
}

if (pages.isEmpty()) {
`|| !atLimit()`?
Good suggestion. I added such a check to `isFinished()` and removed

while (prefixIterator.hasNext()) {
    prefixIterator.next();
}

from `isLimitExhausted()`.
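The termination logic being discussed can be sketched with a toy class (names and types are simplified stand-ins, not the actual `InformationSchemaPageSource`): the source is finished when it is closed, or when the page buffer is drained and either the prefix iterator is exhausted or the row limit has been hit, so `isLimitExhausted()` no longer needs to drain the iterator.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;

// Toy sketch of lazy page production with a limit check folded into isFinished().
public class LazySource {
    private final Iterator<String> prefixes;
    private final long limit;
    private long produced;
    private boolean closed;
    private final ArrayDeque<String> pages = new ArrayDeque<>();

    LazySource(List<String> prefixList, long limit) {
        this.prefixes = prefixList.iterator();
        this.limit = limit;
    }

    boolean isLimitExhausted() {
        return produced >= limit;
    }

    // No need to drain the prefix iterator here: both termination conditions
    // are checked directly.
    boolean isFinished() {
        return closed || (pages.isEmpty() && (!prefixes.hasNext() || isLimitExhausted()));
    }

    // Build at most one page per call; stop as soon as the limit is hit.
    String nextPage() {
        while (pages.isEmpty() && prefixes.hasNext() && !closed && !isLimitExhausted()) {
            pages.add("page-for-" + prefixes.next());
            produced++;
        }
        return pages.poll();
    }
}
```

With a limit of 1 and three prefixes, one call to `nextPage()` produces a page and the source then reports finished even though prefixes remain, which is the short-circuit behavior the review is after.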
Force-pushed from ac4a2f4 to 369af25.
@kokosing Thanks for the review. I've updated the PR, but there is one test failure in Travis.
This connector does not support table privileges.

Try to reproduce the issue locally; I wonder what kind of query is issued to Presto from Tempto.

presto-master_1_419aac2a16f7 | at io.prestosql.plugin.hive.security.AccessControlMetadata.listTablePrivileges(AccessControlMetadata.java:123)
presto-master_1_419aac2a16f7 | at io.prestosql.plugin.hive.HiveMetadata.listTablePrivileges(HiveMetadata.java:2092)

This suggests legacy Hive access control is used in the product tests, but I don't know why it tries to enumerate table privileges.
while (pages.isEmpty() && prefixIterator.get().hasNext() && !closed && !isLimitExhausted()) {
    QualifiedTablePrefix prefix = prefixIterator.get().next();
    switch (table) {
        case COLUMNS:
Just a thought (no change requested): maybe instead of having this switch we could create a `PageProducer` interface, with a class per information schema table, that could iteratively produce pages. Then this class would be much easier to understand, and it would possibly be easier to handle more complicated logic for the columns table. What do you think?
I think it's a good proposal. The complication here is indeed that these tables are heterogeneous, but we've managed to handle them in a uniform way.
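The `PageProducer` idea suggested above could be sketched roughly like this (all names are illustrative, not actual Presto classes; rows are plain strings for brevity): one small implementation per information_schema table, so the page source picks a producer once instead of switching on the table kind inside the loop.

```java
import java.util.List;

// Hypothetical sketch: one producer per information_schema table.
interface PageProducer {
    // Produce the rows for one table prefix; called lazily, prefix by prefix.
    List<String> producePage(String prefix);
}

class ColumnsProducer implements PageProducer {
    @Override
    public List<String> producePage(String prefix) {
        // Real code would call metadata.listTableColumns(...) here.
        return List.of(prefix + ".col1", prefix + ".col2");
    }
}

class SchemataProducer implements PageProducer {
    @Override
    public List<String> producePage(String prefix) {
        // Real code would enumerate schemas here.
        return List.of(prefix);
    }
}

public class Demo {
    // The per-row switch collapses into choosing a producer once, up front.
    static PageProducer producerFor(String table) {
        return table.equals("columns") ? new ColumnsProducer() : new SchemataProducer();
    }
}
```

The payoff is that table-specific complexity (such as the finer-grained columns listing discussed earlier) stays inside one producer class instead of growing the shared switch.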
Force-pushed from ed935eb to 1b53ab8.
Please ping me when it is ready to review.
Force-pushed from 1b53ab8 to 9f5c036.
It looks like there is still some issue, see: https://travis-ci.com/prestosql/presto/jobs/241339526. The last logs seem to be related.
For tables that don't enumerate tables (schemata, roles, applicable_roles and enabled_roles), predicate pushdown cannot be applied.

When HiveMetadata::listTables is called, the Hive Connector first lists schemas and then lists tables for each schema separately. As a result, assuming there are N schemas and no filter on them, HiveMetadata::listTables will make 1 + N metastore calls. In MockConnector, we intend to use the function references listSchemaNames and listTables to emulate metastore calls, so we should modify MockConnectorMetadata::listTables to make its logic identical to HiveMetadata::listTables.

The numbers of tables in "test_schema1" and "test_schema2" are reduced because the added queries have full scans as sub-queries. The original numbers are too large and may crash workers during the product tests.
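The 1 + N call pattern described above can be sketched with a small stand-in (class and method names are illustrative, not the real connector API): a call counter shows one schema enumeration plus one listing call per schema.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Hive-style table listing: listTables() makes one call to
// enumerate schemas, then one call per schema, i.e. 1 + N "metastore" calls.
public class MockMetastore {
    int calls; // counts emulated metastore round-trips

    List<String> listSchemaNames() {
        calls++;
        return List.of("s1", "s2", "s3");
    }

    List<String> listTablesInSchema(String schema) {
        calls++;
        return List.of(schema + ".t1", schema + ".t2");
    }

    List<String> listTables() {
        List<String> tables = new ArrayList<>();
        for (String schema : listSchemaNames()) {
            tables.addAll(listTablesInSchema(schema));
        }
        return tables;
    }
}
```

With three schemas of two tables each, a full listing returns six tables at a cost of four calls (1 + 3), which is the behavior MockConnector is being changed to emulate.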
Force-pushed from 9f5c036 to e47e11c.
@kokosing Thanks, PR updated.

I guess it was because queries like `SELECT count(*) FROM (SELECT * from test_catalog.information_schema.columns LIMIT 100000)` crashed the workers, so I reduced the test data size in the updated PR. The goal of a large data size is to make sure
Checkstyle error
catalogName = tableHandle.getCatalogName();
table = tableHandle.getTable();
prefixIterator = Suppliers.memoize(new Supplier<Iterator<QualifiedTablePrefix>>() {
    private Iterator<QualifiedTablePrefix> iterator;
You don't need this field when you are using memoize.
Force-pushed from e47e11c to bbe96ad.
prefixIterator = new Supplier<Iterator<QualifiedTablePrefix>>() {
    private Iterator<QualifiedTablePrefix> iterator;

    @Override
    public Iterator<QualifiedTablePrefix> get()
    {
        if (iterator == null) {
            Set<QualifiedTablePrefix> prefixes = tableHandle.getPrefixes();
            if (isTablesEnumeratingTable(table)) {
                if (prefixes.equals(defaultPrefixes(catalogName))) {
                    prefixes = metadata.listSchemaNames(session, catalogName).stream()
                            .map(schema -> new QualifiedTablePrefix(catalogName, schema))
                            .collect(toImmutableSet());
                }
            }
            else {
                checkArgument(prefixes.equals(defaultPrefixes(catalogName)), "Catalog-wise tables have prefixes other than the default one");
            }
            iterator = prefixes.iterator();
        }
        return iterator;
    }
};
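The reviewer's point is that a memoizing supplier makes the hand-written `if (iterator == null)` caching field unnecessary. Guava's `Suppliers.memoize` provides this for the real code; a minimal self-contained stand-in of the same idea (hypothetical helper, not the Guava implementation):

```java
import java.util.function.Supplier;

// Minimal memoizing supplier: the delegate runs at most once, and every
// subsequent get() returns the cached value, so callers need no caching field.
public class Memoize {
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = delegate.get();
                    computed = true;
                }
                return value;
            }
        };
    }
}
```

With this, the supplier body collapses to the prefix computation alone, e.g. `prefixIterator = memoize(() -> computePrefixes().iterator());` (where `computePrefixes` stands for the logic in the anonymous class above), and the `iterator` field disappears.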
It enables the Information Schema Connector to build pages lazily so that downstream operators can process data as soon as it's ready.
Force-pushed from bbe96ad to bace50a.
Thanks, I understand how to write it now. Updated.

@kokosing This is ready. Do you want to take another pass?
-        .withListTablesCount(2)
-        .withGetColumnsCount(30000));
+        .withListTablesCount(0)
+        .withGetColumnsCount(8));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Merged, thanks! Well done. Thank you for working on this!
Supersedes #1121 and closes #1046 #998
Before: the Information Schema Connector builds all the pages beforehand and returns a `FixedPageSource`.

After: it builds pages lazily. Downstream operators can process data as soon as it's ready.
cc: @kokosing @wagnermarkd @findepi