Do not cache Hadoop LocatedFileStatus objects #14408

Merged 1 commit into trinodb:master on Oct 4, 2022

Conversation

@sopel39 (Member) commented Sep 30, 2022

LocatedFileStatus objects contain many more fields than Trino requires. It doesn't make sense
to cache them.
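
For illustration only, here is a minimal sketch of what "store only the information we need" could look like: copy just the fields the engine reads out of the Hadoop object and cache that instead. The class name, the exact set of retained fields, and the decision to keep Hadoop BlockLocation objects are assumptions for this sketch, not the code in this PR.

```java
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.LocatedFileStatus;

import java.util.List;

import static com.google.common.collect.ImmutableList.toImmutableList;
import static java.util.Arrays.stream;
import static java.util.Objects.requireNonNull;

// Illustrative only: a slim, immutable holder carrying just the fields Trino reads,
// so the listing cache no longer retains the full Hadoop LocatedFileStatus.
public class SlimFileStatus
{
    private final String path;
    private final long length;
    private final long modificationTime;
    private final boolean directory;
    private final List<BlockLocation> blockLocations;

    public SlimFileStatus(LocatedFileStatus status)
    {
        requireNonNull(status, "status is null");
        this.path = status.getPath().toString();
        this.length = status.getLen();
        this.modificationTime = status.getModificationTime();
        this.directory = status.isDirectory();
        // Block locations could be slimmed further (e.g. keep only offsets, lengths, hosts)
        this.blockLocations = stream(status.getBlockLocations()).collect(toImmutableList());
    }

    public String getPath()
    {
        return path;
    }

    public long getLength()
    {
        return length;
    }

    public long getModificationTime()
    {
        return modificationTime;
    }

    public boolean isDirectory()
    {
        return directory;
    }

    public List<BlockLocation> getBlockLocations()
    {
        return blockLocations;
    }
}
```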


@sopel39 (Member, Author) commented Sep 30, 2022

Helps with #14313

@findepi (Member) commented Sep 30, 2022

> Helps with #14313

Helps or fixes?

The io.trino.plugin.hive.fs.CachingDirectoryLister#cache is currently bounded by the total number of files, but the amount of information cached per file still varies (e.g. the list of blocks), and the limit is harder to tune.
What about making it based on memory footprint instead?
I.e. let the user configure the number of bytes the cache can take, and then weight entries by retained size, like in:

.weigher((Weigher<String, DeltaLakeDataFileCacheEntry>) (key, value) -> Ints.saturatedCast(estimatedSizeOf(key) + value.getRetainedSizeInBytes()))
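
As a hedged sketch of the memory-footprint idea above, a Guava cache can be bounded by a byte budget using maximumWeight plus a weigher; the MemoryBoundedCache name and the retainedSizeOf callback below are illustrative assumptions, not Trino's actual CachingDirectoryLister code.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;
import com.google.common.primitives.Ints;

import java.util.function.ToLongFunction;

// Sketch (not Trino's implementation): a cache bounded by an overall byte budget
// rather than by entry count. Each entry is weighed by a caller-supplied
// retained-size estimate, mirroring the Delta Lake weigher quoted above.
public class MemoryBoundedCache<V>
{
    private final Cache<String, V> cache;

    public MemoryBoundedCache(long maxBytes, ToLongFunction<V> retainedSizeOf)
    {
        this.cache = CacheBuilder.newBuilder()
                .maximumWeight(maxBytes)
                .weigher((Weigher<String, V>) (key, value) ->
                        // Guava weights are ints; saturate instead of overflowing
                        Ints.saturatedCast((long) key.length() * Character.BYTES
                                + retainedSizeOf.applyAsLong(value)))
                .build();
    }

    public Cache<String, V> getCache()
    {
        return cache;
    }
}
```

With this shape, the user-facing knob becomes a number of bytes rather than a number of files, which is the tuning model proposed in the comment above.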

@findepi (Member) commented Oct 1, 2022

> Do not cache Hadoop LocatedFileStatus objects

Let's maybe call it

Reduce CachingDirectoryLister memory footprint

Instead of caching Hadoop LocatedFileStatus objects which contain many 
fields we don't need, store only the information we need.

It's OK to address memory-aware caching configuration (#14408 (comment)) as a follow-up.

@sopel39 (Member, Author) commented Oct 3, 2022

> It's OK to address memory-aware caching configuration (#14408 (comment)) as a follow-up.

I was actually thinking about reducing the limits (if needed) rather than adding too much complexity.

TransactionScopeCachingDirectoryLister is still unbounded (it's per query)

@findepi (Member) commented Oct 4, 2022

> TransactionScopeCachingDirectoryLister is still unbounded (it's per query)

It's per transaction, and the Hive connector supports transactions spanning multiple queries.

A single table can perhaps comprise a large number of partitions (even more so after @arhimondr's #14225), and each partition a large number of files, so maybe it needs to be size-limited?

For context, the Delta connector has an active files cache (getting the list of files is more expensive there), and that cache has been shown to raise the connector's memory requirements significantly when the connector is in use.

@sopel39 (Member, Author) commented Oct 4, 2022

> It's per transaction, and the Hive connector supports transactions spanning multiple queries.

I mean concurrent queries or long-running queries (file listings will still be cached for the duration of the transaction).

> A single table can perhaps comprise a large number of partitions (even more so after @arhimondr's #14225), and each partition a large number of files, so maybe it needs to be size-limited?

I'm not sure the extra complexity of counting bytes is worth it compared to just dropping the listing limit to 10_000 (although I'm not sure how effective the cache would be then).

An alternative is to have some global size limit for the transactional cache (but then one has to manage object lifecycles).
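
For comparison, the simpler count-based alternative mentioned above (bounding the total number of cached file entries instead of counting bytes) could look roughly like this; the class name and the List<String> value type are assumptions made for the sketch, not the actual TransactionScopeCachingDirectoryLister.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;

import java.util.List;

// Sketch of a count-bounded listing cache: at most maxListedFiles file entries are
// retained in total, tracked by weighing each cached listing by its file count
// rather than by its retained bytes. Names are illustrative.
public final class TransactionListingCache
{
    private final Cache<String, List<String>> listingsByDirectory;

    public TransactionListingCache(long maxListedFiles)
    {
        this.listingsByDirectory = CacheBuilder.newBuilder()
                .maximumWeight(maxListedFiles)
                .weigher((Weigher<String, List<String>>) (directory, files) -> files.size())
                .build();
    }

    public Cache<String, List<String>> cache()
    {
        return listingsByDirectory;
    }
}
```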

@sopel39 merged commit c4c8717 into trinodb:master on Oct 4, 2022
@sopel39 deleted the ks/do_not_cache branch on October 4, 2022 at 10:01
The github-actions bot added this to the 399 milestone on Oct 4, 2022
@sopel39 mentioned this pull request on May 24, 2024