Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering #197

Merged
merged 34 commits into from
Nov 14, 2023
Merged
Changes from all commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
e311a8a
Draft: Initial proposal
emkornfield Mar 26, 2023
c60568b
Update src/main/thrift/parquet.thrift
emkornfield Mar 26, 2023
d341446
Separate out structure
emkornfield Mar 27, 2023
e49b908
revert cumulative, change decimal to minimum precisions
emkornfield Mar 27, 2023
f9fba05
Change field to variable width bytes
emkornfield Mar 27, 2023
97d22a5
update based on feedback
emkornfield Mar 28, 2023
c6e244b
more language smithing
emkornfield Mar 28, 2023
24fafcc
change name to unencoded
emkornfield Mar 28, 2023
0556bfb
Update src/main/thrift/parquet.thrift
emkornfield Apr 1, 2023
6fe15bb
Update src/main/thrift/parquet.thrift
emkornfield Apr 1, 2023
c5ca0e1
address some comment
emkornfield Apr 1, 2023
766e62d
separate histogram and add it to index
emkornfield Apr 1, 2023
a0d0d43
Update src/main/thrift/parquet.thrift
emkornfield Apr 5, 2023
5edbf34
Update src/main/thrift/parquet.thrift
emkornfield Apr 5, 2023
0dfc307
move stats
emkornfield Apr 5, 2023
f00f611
Update src/main/thrift/parquet.thrift
emkornfield Apr 12, 2023
cb68370
fix some formatting
emkornfield Apr 19, 2023
b7ebf1c
Apply suggestions from code review
emkornfield Aug 22, 2023
407443f
Update src/main/thrift/parquet.thrift
emkornfield Aug 22, 2023
ab3ef49
Update src/main/thrift/parquet.thrift
emkornfield Aug 22, 2023
a9ec32f
reflow comments, remove trailing whitespace
emkornfield Aug 22, 2023
d0b051c
update based on feedback on PR
emkornfield Sep 1, 2023
8591f23
rename size_estimate_statistics
emkornfield Sep 1, 2023
98d1881
clarify histogram comment
emkornfield Sep 1, 2023
0181c5e
rename field
emkornfield Sep 1, 2023
9fa9f9c
update language on omitted fields
emkornfield Sep 5, 2023
4f8dcf0
Update src/main/thrift/parquet.thrift
emkornfield Sep 6, 2023
ce7904d
Update src/main/thrift/parquet.thrift
emkornfield Sep 6, 2023
1752f2d
Update src/main/thrift/parquet.thrift
emkornfield Sep 6, 2023
5ee0864
flatten structures
emkornfield Sep 7, 2023
b4a703d
update indices
emkornfield Sep 7, 2023
7081735
Update src/main/thrift/parquet.thrift
emkornfield Sep 8, 2023
aee8c0e
remove redundant phrasing
emkornfield Sep 12, 2023
d7856b1
Update src/main/thrift/parquet.thrift
emkornfield Sep 12, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
102 changes: 91 additions & 11 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -191,6 +191,52 @@ enum FieldRepetitionType {
REPEATED = 2;
}

/**
* A structure for capturing metadata for estimating the unencoded,
* uncompressed size of data written. This is useful for readers to estimate
* how much memory is needed to reconstruct data in their memory model and for
* fine grained filter pushdown on nested structures (the histograms contained
* in this structure can help determine the number of nulls at a particular
* nesting level and maximum length of lists).
*/
struct SizeStatistics {
/**
* The number of physical bytes stored for BYTE_ARRAY data values assuming
* no encoding. This is exclusive of the bytes needed to store the length of
* each byte array. In other words, this field is equivalent to the `(size
* of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
* written)`. To determine unencoded sizes of other types readers can use
* schema information multiplied by the number of non-null and null values.
* The number of null/non-null values can be inferred from the histograms
* below.
*
* For example, if a column chunk is dictionary-encoded with dictionary
* ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
* then this value for that data page should be 7 (1 + 1 + 2 + 3).
*
* This field should only be set for types that use BYTE_ARRAY as their
* physical type.
*/
1: optional i64 unencoded_byte_array_data_bytes;
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
/**
* When present, there is expected to be one element corresponding to each
* repetition (i.e. size=max repetition_level+1) where each element
* represents the number of times the repetition level was observed in the
* data.
*
* This field may be omitted if max_repetition_level is 0 without loss
* of information.
**/
2: optional list<i64> repetition_level_histogram;
emkornfield marked this conversation as resolved.
Show resolved Hide resolved
/**
* Same as repetition_level_histogram except for definition levels.
*
* This field may be omitted if max_definition_level is 0 or 1 without
* loss of information.
**/
3: optional list<i64> definition_level_histogram;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do we need to add an extra histogram for pair<def_level, rep_level> if both exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. But I might not be following what you are suggesting (I gave two examples from Arrow below on usage).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we store def_levels and ref_levels separately, how can we derive number of nulls in each level precisely?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking of supporting pushing down filters like IS_NULL or IS_NOT_NULL to nested fields. So I want to make sure if this can satisfy the use case. Maybe we don't need precise null_count of each level but it would be great to answer yes or no to the filters above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might pay to illustrate exact queries, but if this is just answering a question is there any null element at a particular nesting level I think definition level histogram by itself gives that information.

Take a nested lists where both lists and elements can be nullable at each level. IIRC, the definition levels would represent as follows:
0 - Null top level list.
1 - empty top level list
2 - null nested list
3 - empty nested list
4 - null leaf element
5 - present leaf element

So if the query is for top level list is null, one could prune when def_level[0] == 0. For is not null one could prune if def_level[0] == num_values from page (i.e. all values are null).

I believe similar logic holds for def_level[2] but could get more complicated depending on the semantics of whether a top level null element should imply a the nested list is also null or if an empty list implies the nested list should be considered null (but should still be derivable by using histogram indices 0,1 and 2).

One thing the joint histogram (pairs of rep/def level counts) could give you is the number first list elements that are null, but I'm not sure how useful that is. I would need to think about other queries the joint histogram would enable (or if you have more examples of supported queries we can figure out if one is needed).

}

/**
* Statistics per row group and per page
* All fields are optional.
Expand Down Expand Up @@ -529,7 +575,7 @@ struct DataPageHeader {
/** Encoding used for repetition levels **/
4: required Encoding repetition_level_encoding;

/** Optional statistics for the data in this page**/
/** Optional statistics for the data in this page **/
5: optional Statistics statistics;
}

Expand Down Expand Up @@ -571,19 +617,19 @@ struct DataPageHeaderV2 {

// repetition levels and definition levels are always using RLE (without size in it)

/** length of the definition levels */
/** Length of the definition levels */
5: required i32 definition_levels_byte_length;
/** length of the repetition levels */
/** Length of the repetition levels */
6: required i32 repetition_levels_byte_length;

/** whether the values are compressed.
/** Whether the values are compressed.
Which means the section of the page between
definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included)
is compressed with the compression_codec.
If missing it is considered compressed */
7: optional bool is_compressed = true;

/** optional statistics for the data in this page **/
/** Optional statistics for the data in this page **/
8: optional Statistics statistics;
}

Expand All @@ -596,11 +642,11 @@ union BloomFilterAlgorithm {
}

/** Hash strategy type annotation. xxHash is an extremely fast non-cryptographic hash
* algorithm. It uses 64 bits version of xxHash.
* algorithm. It uses 64 bits version of xxHash.
**/
struct XxHash {}

/**
/**
* The hash function used in Bloom filter. This function takes the hash of a column value
* using plain encoding.
**/
Expand Down Expand Up @@ -764,6 +810,14 @@ struct ColumnMetaData {
* in a single I/O.
*/
15: optional i32 bloom_filter_length;

/**
Copy link
Contributor

@JFinis JFinis Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not related to this line, but to ColumnMetaData in general)

For completeness-reasons, we might also want to add unencoded_byte_array_data_bytes and num_entries for the dictionary page (if existent) into the ColumnMetadata, i.e., dictionary_unencoded_byte_array_data_bytes and num_dictionary_entries.

This way, readers could plan how much memory the dictionary of a column chunk will take. This can help in decisions whether, e.g., to load the dictionary up-front to perform pre-filtering on the dictionary. It also helps to right-size the buffer that will hold the dictionary.

I'm not suggesting that this is a must-have for this commit or at all, so feel free to drop this issue. I just wanted to voice that if we already want to provide tools for size estimation, the dictionary is currently not really accounted for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this could be useful, my preference is to leave this out for now as it hasn't come up in discussion before we can always add this as follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields are already presented in the page header but it requires an in-efficient hop to read it.

* Optional statistics to help estimate total memory when converted to in-memory
* representations. The histograms contained in these statistics can
* also be useful in some cases for more fine-grained nullability/list length
* filter pushdown.
*/
16: optional SizeStatistics size_statistics;
}

struct EncryptionWithFooterKey {
Expand All @@ -772,7 +826,7 @@ struct EncryptionWithFooterKey {
struct EncryptionWithColumnKey {
/** Column path in schema **/
1: required list<string> path_in_schema

/** Retrieval metadata of column encryption key **/
2: optional binary key_metadata
}
Expand Down Expand Up @@ -811,7 +865,7 @@ struct ColumnChunk {

/** Crypto metadata of encrypted columns **/
8: optional ColumnCryptoMetaData crypto_metadata

/** Encrypted column metadata for this chunk **/
9: optional binary encrypted_column_metadata
}
Expand Down Expand Up @@ -938,6 +992,13 @@ struct OffsetIndex {
* that page_locations[i].first_row_index < page_locations[i+1].first_row_index.
*/
1: required list<PageLocation> page_locations
/**
* Unencoded/uncompressed size for BYTE_ARRAY types.
*
* See documention for unencoded_byte_array_data_bytes in SizeStatistics for
* more details on this field.
*/
2: optional list<i64> unencoded_byte_array_data_bytes
}

/**
Expand Down Expand Up @@ -977,6 +1038,25 @@ struct ColumnIndex {

/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts

/**
* Contains repetition level histograms for each page
* concatenated together. The repetition_level_histogram field on
* SizeStatistics contains more details.
*
* When present the length should always be (number of pages *
* (max_repetition_level + 1)) elements.
*
* Element 0 is the first element of the histogram for the first page.
* Element (max_repetition_level + 1) is the first element of the histogram
* for the second page.
Comment on lines +1050 to +1052
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column-major vs row-major terminology confused me, so I documented one approach. If there isn't consensus on the ordering here, lets please create a new thread.

I'll take a stab (and likely be wrong 😉). I believe the discussion has assumed the histograms form a matrix with the row index being page number, the column index being the level. Assuming that, what you have defined would be the row major ordering, where elements of the same row are contiguous in memory, as in a C matrix. What @JFinis and @pitrou seem to prefer is the opposite, where elements of the same column are contiguous in memory, as in a Fortran matrix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think i got lost in whether column-major meant "page-major" or "histogram-major". I'll let @JFinis and @pitrou chime in if they want this reversed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I guess I'm confused by the terminology as well. Perhaps we can drop "row-major" and "column-major" entirely when discussing this? :-)

Personally, what I mean is that the levels of a given page should be contiguous in memory. I suppose that can be called "page-major" but I might be mistaken.

(if it was a 2D array or list, you would index first by page and then by level)

Copy link
Contributor Author

@emkornfield emkornfield Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on moving away from the terminology "major" terminology and just assess if the comment matches what you want (which I think it does).

The other "major" option would be something like Element 0 is the first element of the histogram for the first page. Element 1 is the first element of the histogram from the second page. The element at index ``num_pages`` is the second element of the histogram for the first page

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on moving away from the terminology "major" terminology and just assess if the comment matches what you want (which I think it does).

Yes, it does.

That said, I don't have a strong opinion on this anymore. I got confused by the "column-major" terminology and mistakenly assumed that Parquet columns were involved. Actually they're not :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm listing an option, I agree there is tension between size estimation use-case and filtering use-case, it really depends what we want to favor (the plus side of favoring size estimation is it a more obvious representation).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm listing an option,

Understood. And I'm just voicing a preference 😉 I don't need to squeeze every last microsecond out of size estimation, so am happy to yield to those with more pressing performance constraints.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the per level representation to be slightly superior, as it is more useful for filtering. Filtering is a process that might lead to most pages being skipped, so the overall query time might be super short in this case. The most extreme case would be a point look-up where only a single row in a single page survives the filters. In this case, the performance of actually performing the filtering on the page index might have a measurable impact.

In contrast, for the size-estimation case, we're estimating the size because we're planning to read the page. This reading will take orders of magnitude longer, so it is not too important to avoid every possible cache-miss in this case.

That being said, we're talking about micro optimizations here. Even though my gut feeling is that the other ordering would be superior, I don't mind this order. We're not creating hundreds of lists anymore, that's the most important point for performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will leave as is, unless we get strong data on the trade-offs here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of data, as will surprise no one, the current set of changes has a huge impact on the column index size (at least for the cases with thousands of pages with short histograms). I can post an update to the index size tables if there's interest.

**/
6: optional list<i64> repetition_level_histograms;
/**
* Same as repetition_level_histograms except for definitions levels.
**/
7: optional list<i64> definition_level_histograms;

}

struct AesGcmV1 {
Expand All @@ -985,7 +1065,7 @@ struct AesGcmV1 {

/** Unique file identifier part of AAD suffix **/
2: optional binary aad_file_unique

/** In files encrypted with AAD prefix without storing it,
* readers must supply the prefix **/
3: optional bool supply_aad_prefix
Expand All @@ -997,7 +1077,7 @@ struct AesGcmCtrV1 {

/** Unique file identifier part of AAD suffix **/
2: optional binary aad_file_unique

/** In files encrypted with AAD prefix without storing it,
* readers must supply the prefix **/
3: optional bool supply_aad_prefix
Expand Down