Do not cache Hadoop LocatedFileStatus objects #14408

Merged 1 commit into trinodb:master on Oct 4, 2022

Conversation

@sopel39 (Member) commented Sep 30, 2022

LocatedFileStatus objects contain many more fields than Trino requires. It doesn't make sense
to cache them.
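
For illustration only, here is a minimal sketch of what "store only the information we need" could look like: copy just the fields the engine reads out of the Hadoop object and cache that instead. The class name, the exact set of retained fields, and the decision to keep Hadoop BlockLocation objects are assumptions for this sketch, not the code in this PR.

```java
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.LocatedFileStatus;

import java.util.List;

import static com.google.common.collect.ImmutableList.toImmutableList;
import static java.util.Arrays.stream;
import static java.util.Objects.requireNonNull;

// Illustrative only: a slim, immutable holder carrying just the fields Trino reads,
// so the listing cache no longer retains the full Hadoop LocatedFileStatus.
public class SlimFileStatus
{
    private final String path;
    private final long length;
    private final long modificationTime;
    private final boolean directory;
    private final List<BlockLocation> blockLocations;

    public SlimFileStatus(LocatedFileStatus status)
    {
        requireNonNull(status, "status is null");
        this.path = status.getPath().toString();
        this.length = status.getLen();
        this.modificationTime = status.getModificationTime();
        this.directory = status.isDirectory();
        // Block locations could be slimmed further (e.g. keep only offsets, lengths, hosts)
        this.blockLocations = stream(status.getBlockLocations()).collect(toImmutableList());
    }

    public String getPath()
    {
        return path;
    }

    public long getLength()
    {
        return length;
    }

    public long getModificationTime()
    {
        return modificationTime;
    }

    public boolean isDirectory()
    {
        return directory;
    }

    public List<BlockLocation> getBlockLocations()
    {
        return blockLocations;
    }
}
```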


@sopel39 (Member, Author) commented Sep 30, 2022

Helps with #14313

@findepi (Member) commented Sep 30, 2022

> Helps with #14313

Helps or fixes?

The io.trino.plugin.hive.fs.CachingDirectoryLister#cache is currently bounded by the total number of files, but the amount of information cached per file still varies (e.g. the list of blocks), and the limit is harder to tune.
What about making it based on memory footprint instead?
I.e. let the user configure the number of bytes the cache can take, and then weight entries by retained size, like in:

.weigher((Weigher<String, DeltaLakeDataFileCacheEntry>) (key, value) -> Ints.saturatedCast(estimatedSizeOf(key) + value.getRetainedSizeInBytes()))
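
As a hedged sketch of the memory-footprint idea above, a Guava cache can be bounded by a byte budget using maximumWeight plus a weigher; the MemoryBoundedCache name and the retainedSizeOf callback below are illustrative assumptions, not Trino's actual CachingDirectoryLister code.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;
import com.google.common.primitives.Ints;

import java.util.function.ToLongFunction;

// Sketch (not Trino's implementation): a cache bounded by an overall byte budget
// rather than by entry count. Each entry is weighed by a caller-supplied
// retained-size estimate, mirroring the Delta Lake weigher quoted above.
public class MemoryBoundedCache<V>
{
    private final Cache<String, V> cache;

    public MemoryBoundedCache(long maxBytes, ToLongFunction<V> retainedSizeOf)
    {
        this.cache = CacheBuilder.newBuilder()
                .maximumWeight(maxBytes)
                .weigher((Weigher<String, V>) (key, value) ->
                        // Guava weights are ints; saturate instead of overflowing
                        Ints.saturatedCast((long) key.length() * Character.BYTES
                                + retainedSizeOf.applyAsLong(value)))
                .build();
    }

    public Cache<String, V> getCache()
    {
        return cache;
    }
}
```

With this shape, the user-facing knob becomes a number of bytes rather than a number of files, which is the tuning model proposed in the comment above.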

@findepi (Member) commented Oct 1, 2022

> Do not cache Hadoop LocatedFileStatus objects

Let's maybe call it

Reduce CachingDirectoryLister memory footprint

Instead of caching Hadoop LocatedFileStatus objects which contain many 
fields we don't need, store only the information we need.

It's OK to address memory-aware caching configuration (#14408 (comment)) as a follow-up.

@sopel39 (Member, Author) commented Oct 3, 2022

> It's OK to address memory-aware caching configuration (#14408 (comment)) as a follow-up.

I was actually thinking about reducing the limits (if needed) rather than adding too much complexity.

TransactionScopeCachingDirectoryLister is still unbounded (it's per query)

@findepi (Member) commented Oct 4, 2022

> TransactionScopeCachingDirectoryLister is still unbounded (it's per query)

It's per transaction, and the Hive connector supports transactions spanning multiple queries.

A single table can perhaps comprise a large number of partitions (even more so after @arhimondr's #14225), and each partition a large number of files, so maybe it needs to be size-limited?

For context, the Delta connector has an active files cache (getting the list of files is more expensive there), and that cache has been shown to raise the connector's memory requirements significantly when the connector is in use.

@sopel39 (Member, Author) commented Oct 4, 2022

> It's per transaction, and the Hive connector supports transactions spanning multiple queries.

I mean concurrent queries or long-running queries (file listings will still be cached for the duration of the transaction).

> A single table can perhaps comprise a large number of partitions (even more so after @arhimondr's #14225), and each partition a large number of files, so maybe it needs to be size-limited?

I'm not sure the extra complexity of counting bytes is worth it compared to just dropping the listing limit to 10_000 (although I'm not sure how effective the cache would be then).

An alternative is to have some global size limit for the transactional cache (but then one has to manage object lifecycles).
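
For comparison, the simpler count-based alternative mentioned above (bounding the total number of cached file entries instead of counting bytes) could look roughly like this; the class name and the List<String> value type are assumptions made for the sketch, not the actual TransactionScopeCachingDirectoryLister.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;

import java.util.List;

// Sketch of a count-bounded listing cache: at most maxListedFiles file entries are
// retained in total, tracked by weighing each cached listing by its file count
// rather than by its retained bytes. Names are illustrative.
public final class TransactionListingCache
{
    private final Cache<String, List<String>> listingsByDirectory;

    public TransactionListingCache(long maxListedFiles)
    {
        this.listingsByDirectory = CacheBuilder.newBuilder()
                .maximumWeight(maxListedFiles)
                .weigher((Weigher<String, List<String>>) (directory, files) -> files.size())
                .build();
    }

    public Cache<String, List<String>> cache()
    {
        return listingsByDirectory;
    }
}
```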

@sopel39 merged commit c4c8717 into trinodb:master on Oct 4, 2022
@sopel39 deleted the ks/do_not_cache branch on October 4, 2022 at 10:01
The github-actions bot added this to the 399 milestone on Oct 4, 2022
@sopel39 mentioned this pull request on May 24, 2024