Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Hive file system listing #18179

Merged
merged 2 commits into from
Jul 14, 2023

Conversation

alexjo2144
Copy link
Member

Description

Reduce the number of times Hive needs to call the file system listing API when using common file systems.

Additional context and related issues

Relates to: #17323

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Jul 7, 2023
@alexjo2144 alexjo2144 requested review from electrum and losipiuk July 7, 2023 15:40
Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just not sure how much does it buy us given we have cache in place anyway.

Also can the cache itself be keyed on scheme? Not on whole filesystem object?

@alexjo2144
Copy link
Member Author

just not sure how much does it buy us given we have cache in place anyway.

I was finding that cache hits were somewhat uncommon because TrinoFileSystem instances are fairly short lived in the Hive code.

Also can the cache itself be keyed on scheme? Not on whole filesystem object?

No, unfortunately not. ADLS Gen2 can be used in either a hierarchical or non-hierarchical mode and I don't think it is indicated by a different scheme.

@alexjo2144 alexjo2144 force-pushed the hive/reduce-list-calls branch 2 times, most recently from 40bbfd9 to 67a5c7b Compare July 13, 2023 15:55
@github-actions github-actions bot added tests:hive hive Hive connector labels Jul 13, 2023
@alexjo2144 alexjo2144 force-pushed the hive/reduce-list-calls branch from 67a5c7b to 7dcc4a4 Compare July 13, 2023 19:44
Reduce the number of times Hive needs to call the file
system listing API when using common file systems.

Additionally, avoid unnecessary directory exists checks when
file listing returns a non-empty result.
@alexjo2144 alexjo2144 force-pushed the hive/reduce-list-calls branch from 7dcc4a4 to f0ef639 Compare July 13, 2023 19:52
@alexjo2144 alexjo2144 requested review from losipiuk and findepi July 13, 2023 19:52
@alexjo2144
Copy link
Member Author

@findepi @losipiuk please take another look, added another suggestion Piotr had offline that non-empty partitions don't need to be checked explicitly.

try {
return ImmutableList.copyOf(new HiveFileIterator(table, location, fs, directoryLister, hdfsNamenodeStats, FAIL));
HiveFileIterator fileIterator = new HiveFileIterator(table, location, fs, directoryLister, hdfsNamenodeStats, FAIL);
Copy link
Member

@losipiuk losipiuk Jul 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will iterator creation report error if location does not exists? If so will this error be as nice as one generated by checkPartitionLocationExists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will iterator creation report error if location does not exists?

It will not, that's why the check is there.

@losipiuk losipiuk merged commit 703cc2a into trinodb:master Jul 14, 2023
@github-actions github-actions bot added this to the 423 milestone Jul 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

3 participants