Fix chunked reads of Parquet delta encoded pages #14921
Conversation
@nvdbaranec I'd like your opinion on how to handle the sizes. Do you want me to decode the delta lengths to get exact sizes, or do you think the estimates are good enough?
For the output chunking, the granularity we care about right now is just at the page level. The individual row string sizes don't matter. Is it possible to know the total page string size without the full decode? Also, how good would the estimate be?
size_type str_bytes = 0;
if (s->page.encoding == Encoding::DELTA_BYTE_ARRAY) {
  // this must be called by all threads
  str_bytes = gpuDeltaPageStringSize(s, t);
} else if (t < warp_size) {
  // single warp
  str_bytes = gpuDecodeTotalPageStringSize(s, t);
}
if (t == 0) { s->page.str_bytes = str_bytes; }
Suggest refactoring gpuDecodeTotalPageStringSize as the single entry point and doing this check inside of it.
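For reference, a minimal sketch of what that refactor might look like, keeping the field and function names from the snippet above and splitting the existing non-delta path out under a hypothetical name (gpuDecodePlainPageStringSize); the real signatures in the codebase may differ:

```cpp
// Hypothetical sketch only: fold the encoding check into
// gpuDecodeTotalPageStringSize so callers no longer branch on the encoding.
// gpuDecodePlainPageStringSize is an assumed name for the existing
// single-warp (non-delta) size computation.
__device__ size_type gpuDecodeTotalPageStringSize(page_state_s* s, int t)
{
  if (s->page.encoding == Encoding::DELTA_BYTE_ARRAY) {
    // the delta path must be called by all threads in the block
    return gpuDeltaPageStringSize(s, t);
  }
  // the non-delta path runs on a single warp; other threads contribute nothing
  return (t < warp_size) ? gpuDecodePlainPageStringSize(s, t) : 0;
}
```

The caller would then only keep the `if (t == 0) { s->page.str_bytes = str_bytes; }` store.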
that was the other option :)
Another is to break out the delta_byte_array path as its own kernel and only call it if necessary (like we do with the decoding kernels). Then we don't take the shared memory hit if we don't have to.
For DELTA_LENGTH_BYTE_ARRAY, it's much like plain encoding, so if we know the size of the encoded lengths, we can subtract that from the page size. Finding the end of the encoded lengths isn't terribly expensive, so it's probably worthwhile to do that correctly. DELTA_BYTE_ARRAY is another story: to get the string size we need to decode the suffix and prefix lengths and sum them. Estimating a size otherwise is pretty much impossible; a column with a lot of repetition in it might explode in size when decoded. I'd have to do some timings, but I don't know if decoding the lengths is ruinously expensive (maybe the same order of magnitude as the current time to traverse a plain encoded page), though the decoder adds 2k to the shared memory budget.
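To make the arithmetic described above concrete, here is a hedged host-side sketch of the two computations. It assumes the delta-packed lengths have already been decoded into plain arrays (which is exactly the work the device code would have to do); all names here are illustrative, not the actual kernel helpers:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// DELTA_LENGTH_BYTE_ARRAY pages are laid out as
//   [delta-packed lengths][concatenated string bytes],
// so once the end of the encoded lengths is found, the string byte count is
// simply whatever remains of the page data.
std::size_t delta_length_string_bytes(std::size_t page_data_size,
                                      std::size_t encoded_lengths_size)
{
  return page_data_size - encoded_lengths_size;
}

// DELTA_BYTE_ARRAY stores each value as a prefix (shared with the previous
// value) plus a suffix, so the decoded size is the sum of all prefix and
// suffix lengths. Nothing in the page size bounds this: heavy prefix reuse
// means the decoded strings can be much larger than the encoded page.
std::size_t delta_byte_array_string_bytes(std::vector<int32_t> const& prefix_lengths,
                                          std::vector<int32_t> const& suffix_lengths)
{
  return std::accumulate(prefix_lengths.begin(), prefix_lengths.end(), std::size_t{0}) +
         std::accumulate(suffix_lengths.begin(), suffix_lengths.end(), std::size_t{0});
}
```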
Sounds like doing it the accurate way is the way to go. We could live with an approximation if it were a conservative overestimate, but if it can be too small we risk actually producing too much output (which is only really an issue when output chunking is used to limit column lengths, but that's an important case).
Nothing but nits and a couple of questions for clarification.
switch (s->page.encoding) {
  case Encoding::PLAIN_DICTIONARY:
  case Encoding::RLE_DICTIONARY:
    // TODO: make this block-based instead of just 1 warp
Is this TODO for this PR or do we need a tracking issue for it?
It's an old TODO I moved from outside the function. @nvdbaranec do you think this TODO is still relevant?
I think it's fine to delete. At this point any optimization effort needs to be starting at the top and looking at the whole picture anyway.
// TODO: since this is really just an estimate, we could just return
// s->dict_size (overestimate) or
// s->dict_size - sizeof(int) * s->page.num_input_values (underestimate)
Am I mistaken, or was this discussed and the decision made to find the real size?
In my mind the discussion was more about the delta encodings. This is more a note that there could be a faster way to do this (especially since the plain string size calculation is agonizingly slow). Unfortunately, at this point we don't yet know how many values are present, so we can't get an exact number of string bytes. But if an overestimate is ok, we could just use dict_size and save a lot of time in this step. Not sure if that's something to address with this PR.
Edit: actually, for V2 headers, we do know how many values there are. I'm going to change this and get rid of the TODO.
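For the plain-encoded case that TODO describes, each value is written as a 4-byte length prefix followed by its bytes, so once the stored value count is known (as it is from a V2 data page header) the exact string byte count is a single subtraction. A sketch under those assumptions (variable names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// PLAIN byte-array encoding: [int32 length][bytes] repeated once per stored value.
// With the value count known up front, the total string bytes are the data size
// minus the space taken by the per-value length prefixes.
std::size_t plain_page_string_bytes(std::size_t page_data_size,
                                    std::size_t num_stored_values)
{
  return page_data_size - sizeof(int32_t) * num_stored_values;
}
```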
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
Hope springs eternal 🥲
A fix has been merged, so there
/ok to test
/merge
Description
The chunked Parquet reader currently does not properly estimate the sizes of string pages that are delta encoded. This PR modifies gpuDecodeTotalPageStringSize() to take the new encodings into account.
Checklist