Bug(arrow-row): calling convert_raw function causes "offset overflow" panic
#6112
Comments
In newer versions of DataFusion I would expect that grouping on a string column would not use the row format, but would instead use the special GroupValuesBytes.
The bug you describe certainly can happen if there are large numbers of distinct large strings in a multi-column group 🤔
FWIW, in general offset overflows do yield panics in arrow-rs. The additional plumbing needed to handle what is almost always an unrecoverable error has been hard to justify, although I suspect in this case it could be made into an error. Edit: I've updated this to be an enhancement; panics are not a bug.
Yes, this problem was first discovered in a multi-column GROUP BY.
Do you mean we need to add a new function like the following?
pub fn try_decode_binary<I: OffsetSizeTrait>(
    rows: &mut [&[u8]],
    options: SortOptions,
) -> Result<GenericBinaryArray<I>, ArrowError> {
    ...
}
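To make the idea of a fallible variant concrete, here is a minimal sketch of the offset bookkeeping such a function would need. It uses only the standard library and is not the arrow-rs implementation: `OffsetOverflow` and `try_build_offsets` are hypothetical names chosen for illustration, and a real version would presumably return `ArrowError` as in the signature above.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Hypothetical error type for this sketch (arrow-rs would use `ArrowError`).
#[derive(Debug, PartialEq)]
pub struct OffsetOverflow {
    /// Running byte total that could not be represented as an `i32` offset.
    pub total_len: usize,
}

impl fmt::Display for OffsetOverflow {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "offset overflow: {} bytes does not fit in i32", self.total_len)
    }
}

impl std::error::Error for OffsetOverflow {}

/// Builds Arrow-style offsets (one more entry than there are values) from
/// the byte length of each value. Returns an error instead of panicking
/// when the running total exceeds `i32::MAX`.
pub fn try_build_offsets(lengths: &[usize]) -> Result<Vec<i32>, OffsetOverflow> {
    let mut offsets = Vec::with_capacity(lengths.len() + 1);
    offsets.push(0i32);
    let mut total: usize = 0;
    for &len in lengths {
        total = total.saturating_add(len);
        // The checked conversion is where a panic would become an `Err`.
        let offset = i32::try_from(total).map_err(|_| OffsetOverflow { total_len: total })?;
        offsets.push(offset);
    }
    Ok(offsets)
}
```

With this shape, a caller could match on the error and fall back to a large-offset array type or to smaller batches rather than aborting.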
Maybe you could add a check in your code (or in datafusion 🤔) on the size of the string buffer and make a new record batch if it exceeds 2GB or something. This might be related: apache/datafusion#9562. Making a single array with more than 2GB of string data is likely to be non-ideal in a bunch of ways.
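The size check suggested here could look roughly like the following. This is only a sketch under stated assumptions: it works on a plain slice of per-value byte lengths rather than on Arrow arrays, and `split_by_byte_budget` is a hypothetical helper, not an existing DataFusion or arrow-rs function.

```rust
use std::ops::Range;

/// Splits values into contiguous index ranges so that the total byte length
/// of each range stays within `max_bytes`. A single value larger than
/// `max_bytes` ends up in a range of its own, since it cannot be split here.
pub fn split_by_byte_budget(lengths: &[usize], max_bytes: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    let mut bytes_in_chunk: usize = 0;
    for (i, &len) in lengths.iter().enumerate() {
        // Close the current chunk if adding this value would exceed the budget.
        if i > start && bytes_in_chunk.saturating_add(len) > max_bytes {
            ranges.push(start..i);
            start = i;
            bytes_in_chunk = 0;
        }
        bytes_in_chunk = bytes_in_chunk.saturating_add(len);
    }
    // Emit the trailing chunk, if any values remain.
    if start < lengths.len() {
        ranges.push(start..lengths.len());
    }
    ranges
}
```

Each returned range could then be emitted as its own record batch, with `max_bytes` set to `i32::MAX as usize` or a smaller budget, so that no single string array needs offsets beyond what `i32` can hold.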
Describe the bug
Datafusion Table Info:
Having an http_url column whose DataType is Utf8, with a lot of distinct values.
Datafusion SQL
GroupValuesRows in Datafusion stores rows using Rows, and this bug may be triggered when calling the emit function.
arrow-rs/arrow-row/src/variable.rs
Lines 217 to 226 in 49e714d
In the extreme case, if I append two large Utf8 values (len1 + len2 > i32::MAX) to Rows in two appends and then call convert_rows, that should also trigger the bug. 🤔
To Reproduce
Expected behavior
Additional context