Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid materializing all results in Iceberg $files system table at once #16112

Conversation

krvikash
Copy link
Contributor

@krvikash krvikash commented Feb 14, 2023

Description

Fixes #15991

The RR is an alternative approach to fix #15991. another approach #16063 to fix #15991

For iceberg.iceberg_small_files_tpcds_sf1000_parquet.store_sales$files earlier it used to take 826MB of memory (~32K records loaded) and after the change, it takes 32.3 MB (~72K records loaded). It also reduces the response to console time by 50%.

This example is when the table has ~227K files.

Before fix:
Screenshot 2023-02-17 at 1 29 18 PM

After fix:
Screenshot 2023-02-17 at 1 16 54 PM

Release notes

(X) Release notes are required, with the following suggested text:

# Iceberg
* Reduce memory usage when reading `$files` system tables. ({issue}`16112`)

@cla-bot cla-bot bot added the cla-signed label Feb 14, 2023
@krvikash krvikash self-assigned this Feb 14, 2023
@krvikash
Copy link
Contributor Author

CC: @alexjo2144 | @findinpath

@krvikash krvikash force-pushed the avoid-materializing-all-results-in-file-table-alternative branch from e81e50e to f1658ac Compare February 14, 2023 16:17
@krvikash krvikash changed the title [Alternative]: Avoid materializing all results in Iceberg system table at once [Alternative]: Avoid materializing all results in Iceberg $files system table at once Feb 14, 2023
Copy link
Member

@alexjo2144 alexjo2144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a better direction that the PageBuilder way

@krvikash krvikash force-pushed the avoid-materializing-all-results-in-file-table-alternative branch from f1658ac to 441c747 Compare February 14, 2023 19:37
@krvikash
Copy link
Contributor Author

Thanks @alexjo2144 | @findinpath for reviewing the PR.

This is definitely a better direction that the PageBuilder way

I hope we are going forward with this (RecordCursor) approach. I will make this as primary PR.

@krvikash krvikash changed the title [Alternative]: Avoid materializing all results in Iceberg $files system table at once Avoid materializing all results in Iceberg $files system table at once Feb 14, 2023
@krvikash krvikash marked this pull request as ready for review February 14, 2023 19:45
@krvikash krvikash requested review from ebyhr and homar February 14, 2023 19:45
@krvikash
Copy link
Contributor Author

Currently, we materialize all columns from the $files table irrespective of what column is asked in the query. Is it worth looking for if we can only materialize the column which is asked?

@krvikash
Copy link
Contributor Author

Currently, we materialize all columns from the $files table irrespective of what column is asked in the query. Is it worth looking for if we can only materialize the column which is asked?

Like, If we query SELECT file_path from "store_sales$files"; then we collect all columns (My question is here why not just collect the column which is queried for, file_path in this case) from $files table and pass it to the caller. and the caller will display according to what field we have in the SELECT query.

@krvikash krvikash force-pushed the avoid-materializing-all-results-in-file-table-alternative branch from 441c747 to 7e02045 Compare February 15, 2023 18:04
@krvikash
Copy link
Contributor Author

Addressed comments and rebased with master.

@krvikash
Copy link
Contributor Author

Hi @ebyhr, When you get time, could you please take a look at this PR?

Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just skimmed.

@krvikash krvikash force-pushed the avoid-materializing-all-results-in-file-table-alternative branch from 7e02045 to 7e1c6b1 Compare February 16, 2023 06:35
@krvikash
Copy link
Contributor Author

Thanks, @ebyhr | @findinpath | @homar for the reviews. I have addressed the comments.

@krvikash krvikash force-pushed the avoid-materializing-all-results-in-file-table-alternative branch from 7e1c6b1 to 71f693a Compare February 16, 2023 06:58
@krvikash krvikash force-pushed the avoid-materializing-all-results-in-file-table-alternative branch from 71f693a to b4b5d36 Compare February 16, 2023 09:01
@ebyhr ebyhr merged commit b8cbac2 into trinodb:master Feb 17, 2023
@ebyhr ebyhr mentioned this pull request Feb 17, 2023
@krvikash krvikash deleted the avoid-materializing-all-results-in-file-table-alternative branch February 17, 2023 08:20
@github-actions github-actions bot added this to the 408 milestone Feb 17, 2023
@@ -68,15 +68,15 @@ public RecordCursor cursor()
return new InMemoryRecordCursor(types, records.iterator());
}

private static class InMemoryRecordCursor
public static class InMemoryRecordCursor<T extends Iterator<? extends List<?>>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class became generic, but so there are now warnings wherever it's used as a raw class (eg 3 lines above).

BTW it seems the generics are not leveraged, since T shows up only in the constructor declaration.
#16295

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Avoid materializing all results in $files Iceberg system table
6 participants