Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering #197

Merged
merged 34 commits into from
Nov 14, 2023

Conversation

emkornfield
Copy link
Contributor

No description provided.

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
* When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_leve)
* where each element represens the count of the number of times that level occurs in the page/column chunk.
*/
8: optional list<i64> repetition_level_histogram;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it's a per-rowgroup statistics?

Copy link
Contributor Author

@emkornfield emkornfield Mar 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand, I thought statistics can be on either page or column chunk. the histograms would vary based on column chunk I think, but I might be misunderstanding your suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry for missunderstand this. How can we make full use of this histogram?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These levels are rather complicated for high-level applications, we may need to provide some utility to translate to norms (e.g. struct/map/array/null) that are much easier to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree there is complexity here but I think this is the simplest and most complete set of information that we can provide readers. I think the utility methods would likely be per system but we could likely provide some in core packages that give leaf value estimates that follow the rules above.

@mapleFU in terms of usage these would be used similar to how row level reconstruction is done with them. For instance total number of nulls at the leaf can be computed assuming no repeated fields as the cumulative sum of the first n-1 entries in definition level. Similarly number of nested lists assuming no null lists should be computable with a similar cumulative sum. When lists are nullable repetition and definition level need to be looked at together to determine null vs empty lists. Similarly the number of empty lists at level can be inferred by looking at definition and repetition levels together

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could provide utility methods that account for common permutations of memory models

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems It can help pushdown some filter on List/Map, and helping constructing the list. It's great, but I think maybe we need some samples? Because it's a bit hard to understand how to make full use of it. Like some rules in ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So some simple examples in from Arrow (my rep/def level calculations are rusty so please double check to make sure they make sense):

  1. If all you have are nested structures (say 3 nullable levels), then the number of null values at the leaf is the sum of the first 3 elements of the definition level histogram. Individual level nullability for Arrow isn't super important because it always leaves the space for nulls so it is pretty much square. You could however determine that there are no nulls at the second level of nesting by checking the appropriate histogram box (and potentially save the allocation of the bit vector in that case)
  2. If you have two nested lists and assuming lists and elements are not nullable. The number of inner nested lists is rep_hist[0] - def_hist[0] + {rep_hist[1] (outer list starts - empty outer lists + inner list starts).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds feasible. Then we are good to store level histograms separately.

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
@wgtmac
Copy link
Member

wgtmac commented Mar 27, 2023

@gszadovszky @shangxinli Have time take a look?

@emkornfield emkornfield changed the title PARQUET-2261: Initial proposal for unencoded/uncompressed statistics PARQUET-2261: Proposal for unencoded/uncompressed statistics Mar 27, 2023
@mapleFU
Copy link
Member

mapleFU commented Mar 27, 2023

Seems this patch doesn't need to consider backward-compatibility rules?

* Special calculations:
* - Enum: plain-encoded BYTE_ARRAY size
* - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2
* - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that small Decimal can be encoded as FLBA or BYTE_ARRAY, but big decimal cannot be stored as i32. Should we force use the physical type or related with physical type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had this. I think given the two different opinions expressed, I'm going to change this field to only record variable width bytes, and say all other calcutions can be performed by readers based on type and number of values

@emkornfield
Copy link
Contributor Author

Seems this patch doesn't need to consider backward-compatibility rules?

Since we are storing rep/def levels directly I don't think so since those rules are applied on top of this data to make the correct inferences.

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy on this if some common util for histogram can be described and added. Thanks!

@wgtmac wgtmac requested a review from shangxinli March 28, 2023 02:56
*
* This value is optional when max_definition_level is 0.
*/
3: optional list<i64> definition_level_histogram;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do we need to add an extra histogram for pair<def_level, rep_level> if both exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. But I might not be following what you are suggesting (I gave two examples from Arrow below on usage).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we store def_levels and ref_levels separately, how can we derive number of nulls in each level precisely?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking of supporting pushing down filters like IS_NULL or IS_NOT_NULL to nested fields. So I want to make sure if this can satisfy the use case. Maybe we don't need precise null_count of each level but it would be great to answer yes or no to the filters above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might pay to illustrate exact queries, but if this is just answering a question is there any null element at a particular nesting level I think definition level histogram by itself gives that information.

Take a nested lists where both lists and elements can be nullable at each level. IIRC, the definition levels would represent as follows:
0 - Null top level list.
1 - empty top level list
2 - null nested list
3 - empty nested list
4 - null leaf element
5 - present leaf element

So if the query is for top level list is null, one could prune when def_level[0] == 0. For is not null one could prune if def_level[0] == num_values from page (i.e. all values are null).

I believe similar logic holds for def_level[2] but could get more complicated depending on the semantics of whether a top level null element should imply a the nested list is also null or if an empty list implies the nested list should be considered null (but should still be derivable by using histogram indices 0,1 and 2).

One thing the joint histogram (pairs of rep/def level counts) could give you is the number first list elements that are null, but I'm not sure how useful that is. I would need to think about other queries the joint histogram would enable (or if you have more examples of supported queries we can figure out if one is needed).

@wgtmac wgtmac requested review from pitrou and gszadovszky March 29, 2023 01:24
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to include these statistics at both row group (column chunk) and page level? For the latter I am not sure it is the right approach. We implemented column indexes so one would not need to read the page header to get the related statistics. We even stopped writing Statistics into page headers in parquet-mr.
If we only want these for the column chunk level then I would suggest having it under ColumnMetaData directly.

@emkornfield
Copy link
Contributor Author

emkornfield commented Apr 1, 2023

Do we want to include these statistics at both row group (column chunk) and page level? For the latter I am not sure it is the right approach. We implemented column indexes so one would not need to read the page header to get the related statistics. We even stopped writing Statistics into page headers in parquet-mr. If we only want these for the column chunk level then I would suggest having it under ColumnMetaData directly.

@gszadovsky
Is there an argument against flexibility here? I believe parquet-cpp still writes page headers. One argument for page headers is it allows readers better incremental estimates of memory needed as they progress (although it is possible taking an average size per cell at column chunk is sufficient here), i.e. these stats accomplish a few things: 1. finer grain pushdown for nulls (mostly valuable at column chunk, and possibly adding to the page index). 2. estimating memory usage for reconstructing in memory avoid allocations during scans. 3. Query planning to get better estimates for things like broad-cast joins.

@gszadovszky
Copy link
Contributor

@emkornfield, I am not against re-adding such statistics to page headers if it have benefits. Meanwhile, I would only expect filling values that are really make sense and not redundant at the related level (page or column chunk). If a value make sense for filtering then it should be added to the column index instead of to the page header.
However from specification point of view it would not be strictly required to describe the actual use case, it would be nice to do so at some place. WDYT?

@emkornfield
Copy link
Contributor Author

However from specification point of view it would not be strictly required to describe the actual use case, it would be nice to do so at some place. WDYT?

This sounds reasonable, let me refactor this a little bit so we can add it to column index as well, with a little bit more description on each of the use-cases

@emkornfield
Copy link
Contributor Author

@gszadovszky updated the PR based on your feedback, let me know what you think.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I can accept having the old statistics saved in the page headers as well as column indexes (for backward compatibility), you also highlights the superiority of the column indexes. As so I would not extend the statistics saved in the page headers.
What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
@emkornfield
Copy link
Contributor Author

emkornfield commented Apr 5, 2023

What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?

this sounds fine to me. We can always add separately to page headers if people want them. @wgtmac concerns with this?

@wgtmac
Copy link
Member

wgtmac commented Apr 6, 2023

What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?

this sounds fine to me. We can always add separately to page headers if people want them. @wgtmac concerns with this?

I am good with this approach.

@@ -764,6 +810,14 @@ struct ColumnMetaData {
* in a single I/O.
*/
15: optional i32 bloom_filter_length;

/**
Copy link
Contributor

@JFinis JFinis Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not related to this line, but to ColumnMetaData in general)

For completeness-reasons, we might also want to add unencoded_byte_array_data_bytes and num_entries for the dictionary page (if existent) into the ColumnMetadata, i.e., dictionary_unencoded_byte_array_data_bytes and num_dictionary_entries.

This way, readers could plan how much memory the dictionary of a column chunk will take. This can help in decisions whether, e.g., to load the dictionary up-front to perform pre-filtering on the dictionary. It also helps to right-size the buffer that will hold the dictionary.

I'm not suggesting that this is a must-have for this commit or at all, so feel free to drop this issue. I just wanted to voice that if we already want to provide tools for size estimation, the dictionary is currently not really accounted for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this could be useful, my preference is to leave this out for now as it hasn't come up in discussion before we can always add this as follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields are already presented in the page header but it requires an in-efficient hop to read it.

@wgtmac
Copy link
Member

wgtmac commented Oct 28, 2023

Java PoC implementation: apache/parquet-java#1177

etseidl added a commit to etseidl/cudf that referenced this pull request Oct 28, 2023
@etseidl
Copy link
Contributor

etseidl commented Nov 1, 2023

C++ PoC as well rapidsai/cudf#14000. Do we still need arrow-cpp to proceed? If so, @mapleFU have you started on apache/arrow#37869 yet? If not, I have cycles to try a first pass at an implementation if that would be helpful.

@wgtmac
Copy link
Member

wgtmac commented Nov 2, 2023

C++ PoC as well rapidsai/cudf#14000. Do we still need arrow-cpp to proceed? If so, @mapleFU have you started on apache/arrow#37869 yet? If not, I have cycles to try a first pass at an implementation if that would be helpful.

IMHO, two PoC impls do not imply that they have to be parquet-mr and arrow-cpp. So should we proceed to a formal vote on this PR? @emkornfield @gszadovszky

Once this PR has been merged, I can proceed the release process of parquet format v2.10. After that, all pending PoCs can be formally reviewed and merged.

@emkornfield
Copy link
Contributor Author

I think it is reasonable for two different implementations that can be demonstrated to interop with each other. We should really document the exact policy thought. @gszadovszky @pitrou

@etseidl
Copy link
Contributor

etseidl commented Nov 3, 2023

I think it is reasonable for two different implementations that can be demonstrated to interop with each other. We should really document the exact policy thought.

I have @wgtmac's branch compiled locally, and can test the interoperability between the two existing PoC implementations. FWIW I've been using a modified parquet-cli in parquet-mr to print out the statistics I generate, so I don't foresee any big problems there.

Update: I've verified that spark with Gang's changes can read my files, and my implementation can read the spark generated files. Also verified the histograms and unencoded sizes match at the chunk level (pages are problematic because we choose different cutoff strategies, so even if the limits are set to the same values we get different page sizes).

Update 2: I set rows per page to 100 and then got congruent files. Can now confirm both PoCs produce the same histograms and unencoded sizes at the page level as well. This is for a file with a wide variety of nesting with and without null values. Let me know if there are any specific files from parquet-testing you'd like to see run through both PoCs.

@wgtmac
Copy link
Member

wgtmac commented Nov 5, 2023

Thanks @etseidl for verifying two implementations!

As they are still in the PoC state, I think the manual verification is sufficient and prefer delaying the work on interoperability by adding parquet files with SizeStatistics to the parquet-testing repo. We can add testing files after the implementations are formally reviewed and merged.

WDYT? @emkornfield

@emkornfield
Copy link
Contributor Author

Thanks @etseidl for verifying two implementations!

As they are still in the PoC state, I think the manual verification is sufficient and prefer delaying the work on interoperability by adding parquet files with SizeStatistics to the parquet-testing repo. We can add testing files after the implementations are formally reviewed and merged.

WDYT? @emkornfield

I think this is sufficient for now. @wgtmac if we think the Java PR is close enough I can start a vote for approval on this PR unless you want to do it?

@wgtmac
Copy link
Member

wgtmac commented Nov 5, 2023

I think it is good time to start a vote. @emkornfield

I will take care of the release process once the proposal passes the vote :)

@emkornfield
Copy link
Contributor Author

Started vote on mailing list.

@wgtmac
Copy link
Member

wgtmac commented Nov 8, 2023

Could you help the vote on mailing list: https://lists.apache.org/thread/wgobz41mfldbhqpg9q4mdwypghg2cxg2? Help is needed from the PMC members. @ggershinsky @gszadovszky @julienledem @rdblue @wesm @xhochy

@emkornfield
Copy link
Contributor Author

The vote passed. Is the standard way of merging her using a squash commit?

@Fokko
Copy link
Contributor

Fokko commented Nov 14, 2023

@emkornfield I don't think there is a formal agreement on it on the Parquet project. squash is my personal preference and that's what the pre-Github script is doing as well:

run_cmd(['git', 'merge', pr_branch_name, '--squash'])

@emkornfield emkornfield merged commit 40699d0 into apache:master Nov 14, 2023
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Dec 6, 2023
Adds Parquet size statistics introduced in apache/parquet-format#197.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #14000
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.