
Fix chunked reads of Parquet delta encoded pages #14921

Merged
33 commits merged into rapidsai:branch-24.04 from etseidl:delta_chunked on Mar 1, 2024

Conversation

etseidl
Contributor

@etseidl etseidl commented Jan 29, 2024

Description

The chunked Parquet reader currently does not properly estimate the sizes of string pages that are delta encoded. This PR modifies gpuDecodeTotalPageStringSize() to take into account the new encodings.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Jan 29, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Jan 29, 2024
@etseidl
Contributor Author

etseidl commented Jan 29, 2024

@nvdbaranec I'd like your opinion on how to handle the sizes. Do you want me to decode the delta lengths to get exact sizes or do you think the estimates are good enough?

@etseidl etseidl marked this pull request as ready for review January 29, 2024 23:29
@etseidl etseidl requested a review from a team as a code owner January 29, 2024 23:29
@nvdbaranec
Contributor

nvdbaranec commented Jan 30, 2024

> @nvdbaranec I'd like your opinion on how to handle the sizes. Do you want me to decode the delta lengths to get exact sizes or do you think the estimates are good enough?

For the output chunking, the granularity we care about right now is just at the page level. The individual row string sizes don't matter. Is it possible to know total page string size without the full decode? Also, how good would the estimate be?

Comment on lines 464 to 472
size_type str_bytes = 0;
if (s->page.encoding == Encoding::DELTA_BYTE_ARRAY) {
// this must be called by all threads
str_bytes = gpuDeltaPageStringSize(s, t);
} else if (t < warp_size) {
// single warp
str_bytes = gpuDecodeTotalPageStringSize(s, t);
}
if (t == 0) { s->page.str_bytes = str_bytes; }
Contributor

Suggest refactoring gpuDecodeTotalPageStringSize as the single entry point and doing this check inside of it.

Contributor Author

that was the other option :)

Another option is to break out delta_byte_array as its own kernel and only call it when necessary (like we do with the decoding kernels). Then we don't take the shared memory hit unless we have to.
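
For illustration, here is a minimal sketch of what that single-entry-point shape could look like, assuming the gpuDeltaPageStringSize helper from the diff above; the non-delta helper name and the exact signature are hypothetical, not the actual cudf code.

// Hypothetical sketch only: fold the encoding check into one entry point so
// the caller no longer branches on the encoding. gpuDeltaPageStringSize is the
// helper from the diff above; gpuNonDeltaPageStringSize is a made-up name for
// the existing warp-level path.
__device__ size_type gpuDecodeTotalPageStringSize(page_state_s* s, int t)
{
  if (s->page.encoding == Encoding::DELTA_BYTE_ARRAY) {
    // this path must be called by all threads in the block
    return gpuDeltaPageStringSize(s, t);
  }
  // the remaining string encodings only need a single warp
  return t < warp_size ? gpuNonDeltaPageStringSize(s, t) : 0;
}

// the call site then reduces to:
//   size_type const str_bytes = gpuDecodeTotalPageStringSize(s, t);
//   if (t == 0) { s->page.str_bytes = str_bytes; }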

@etseidl
Contributor Author

etseidl commented Jan 30, 2024

> For the output chunking, the granularity we care about right now is just at the page level. The individual row string sizes don't matter. Is it possible to know total page string size without the full decode?

For DELTA_LENGTH_BYTE_ARRAY, it's much like plain encoding, so if we know the size of the encoded lengths, we can subtract that from the page size. Finding the end of the encoded lengths isn't terribly expensive, so it's probably worthwhile to do that correctly.

For DELTA_BYTE_ARRAY it's another story. To get the string size we need to decode the prefix and suffix lengths and sum them; estimating a size any other way is pretty much impossible, since a column with a lot of repetition might explode in size when decoded. I'd have to do some timings, but I don't know yet whether decoding the lengths is ruinously expensive (probably the same order of magnitude as the current time to traverse a plain-encoded page). The decoder does add 2k to the shared memory budget, though.

Edit:
Did a quick check of PLAIN vs DICT vs DELTA_LENGTH_BYTE_ARRAY, and found that for a file with 76M rows of small strings, gpuComputePageSizes takes ~160ms for PLAIN, ~40ms for DICT, and about 5ms for DELTA_LENGTH_BYTE_ARRAY, so no concern there. Now to whip up a file with DELTA_BYTE_ARRAY...

Edit 2:
DELTA_BYTE_ARRAY is closer to DICT than to PLAIN: ~32ms.
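
For reference, a rough sketch of the two size computations described earlier in this comment; the helper and stream types below (find_end_of_lengths, length_stream and its has_next()/next() methods) are illustrative stand-ins, not the actual cudf delta decoder API.

// DELTA_LENGTH_BYTE_ARRAY: a delta-encoded block of lengths followed by the
// concatenated character data, so the total string bytes are simply the page
// size minus the size of the encoded-lengths block.
__device__ size_t delta_length_page_string_size(uint8_t const* page_start,
                                                uint8_t const* page_end)
{
  uint8_t const* char_data = find_end_of_lengths(page_start, page_end);
  return static_cast<size_t>(page_end - char_data);
}

// DELTA_BYTE_ARRAY: each string is stored as (prefix length, suffix length,
// suffix chars), so the decoded size is the sum of both length streams and
// cannot be known without decoding them.
__device__ size_t delta_byte_array_string_size(length_stream prefixes,
                                               length_stream suffixes)
{
  size_t total = 0;
  while (prefixes.has_next() && suffixes.has_next()) {
    total += prefixes.next() + suffixes.next();
  }
  return total;
}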

@nvdbaranec
Contributor

Sounds like doing it the accurate way is the way to go. We could live with an approximation if it were conservative (an overestimate), but if the estimate can be too small we risk actually producing too much output (which is only really an issue when output chunking is used to limit column lengths, but that's an important case).

Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Nothing but nits and a couple of questions for clarification.

cpp/src/io/parquet/decode_preprocess.cu (outdated review thread, resolved)
switch (s->page.encoding) {
case Encoding::PLAIN_DICTIONARY:
case Encoding::RLE_DICTIONARY:
// TODO: make this block-based instead of just 1 warp
Contributor

Is this TODO for this PR or do we need a tracking issue for it?

Contributor Author

It's an old TODO I moved from outside the function. @nvdbaranec do you think this TODO is still relevant?

Contributor

I think it's fine to delete. At this point any optimization effort needs to start at the top and look at the whole picture anyway.

Comment on lines 159 to 161
// TODO: since this is really just an estimate, we could just return
// s->dict_size (overestimate) or
// s->dict_size - sizeof(int) * s->page.num_input_values (underestimate)
Contributor

Am I mistaken, or was it discussed and decided that we'd find the real size?

Contributor Author

@etseidl etseidl Jan 31, 2024

In my mind the discussion was more about the delta encodings... this is more a note that there could be a faster way to do this (especially since the plain string size calculation is agonizingly slow). Unfortunately, at this point we don't yet know how many values are present, so we can't get an exact number of string bytes. But if an overestimate is OK, we could just use dict_size and save a lot of time in this step. Not sure if that's something to address in this PR.

Edit: actually, for V2 headers, we do know how many values there are. I'm going to change this and get rid of the TODO.
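
For concreteness, a hedged sketch of that exact computation: with a V2 page header the null count is known, so the 4-byte length prefixes of a PLAIN-encoded string page can be subtracted exactly. The field names mirror the snippet above, but the real cudf structs may differ.

// Sketch only, assuming dict_size is the size of the PLAIN-encoded page data
// and the V2 header reports num_nulls, so the number of values actually
// written to the page is known.
__device__ size_type plain_page_string_size_v2(page_state_s const* s)
{
  int const num_written = s->page.num_input_values - s->page.num_nulls;
  // PLAIN byte arrays are stored as a 4-byte length followed by the chars
  return s->dict_size - static_cast<size_type>(sizeof(int32_t)) * num_written;
}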

@etseidl etseidl requested a review from nvdbaranec January 31, 2024 17:48
@vuule
Contributor

vuule commented Feb 6, 2024

/ok to test

@vuule added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Feb 6, 2024
@hyperbolic2346
Contributor

/ok to test

@hyperbolic2346
Contributor

/ok to test

@hyperbolic2346
Contributor

/ok to test

@vuule
Contributor

vuule commented Feb 27, 2024

/ok to test

@vuule
Contributor

vuule commented Feb 28, 2024

/ok to test

@etseidl
Contributor Author

etseidl commented Feb 28, 2024

Hope springs eternal 🥲

@vuule
Contributor

vuule commented Feb 29, 2024

> Hope springs eternal 🥲

A fix has been merged, so there is hope. I'm just not sure if it's too early.

@vuule
Contributor

vuule commented Feb 29, 2024

/ok to test

@vuule
Contributor

vuule commented Mar 1, 2024

/merge

@rapids-bot rapids-bot bot merged commit 56a3b8f into rapidsai:branch-24.04 Mar 1, 2024
76 checks passed
@etseidl etseidl deleted the delta_chunked branch March 1, 2024 02:21
Labels
  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • bug: Something isn't working
  • cuIO: cuIO issue
  • libcudf: Affects libcudf (C++/CUDA) code
  • non-breaking: Non-breaking change