[BUG] Trino cannot read map column written by GPU #15473
jlowe added the bug, libcudf, and Spark labels on Apr 5, 2024
Potential fix posted.
Note, to generate a file that reproduces this:
rapids-bot bot pushed a commit that referenced this issue on Apr 9, 2024
#15474)

Fixes #15473

The issue is that in some cases, for example where we have all nulls, we can fail to update the size of the page output buffer, resulting in a missing byte expected by some readers. Specifically, we poke the value of dict_bits into the output buffer here:
https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1892

But, if we have no leaf values (for example, because everything in the page is null), `s->cur` never gets updated here, because we never enter the containing loop:
https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1948

The fix is to just always update `s->cur` after this if-else block:
https://github.com/rapidsai/cudf/blob/6319ab708f2dff9fd7a62a5c77fd3b387bde1bb8/cpp/src/io/parquet/page_enc.cu#L1891

Note that this was already handled by our reader. But some third-party readers (Trino) expect that data to be there and crash if it's not.

Authors:
- https://github.com/nvdbaranec

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #15474
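The failure mode described in the commit message can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the libcudf code: the function `encode_page_body`, its parameters, and the one-byte-per-index "encoding" are hypothetical stand-ins, and `cur` merely plays the role of `s->cur` (the end of the valid page bytes).

```python
def encode_page_body(levels: bytes, indices: list[int], dict_bits: int,
                     always_update_cur: bool) -> bytes:
    """Toy model of writing a dictionary-encoded page body.

    Hypothetical simplification: each dictionary index is stored as one raw
    byte instead of the real RLE/bit-packed hybrid encoding.
    """
    buf = bytearray(levels)
    cur = len(buf)            # models s->cur: end of the valid output so far
    buf.append(dict_bits)     # poke the bit width byte just past cur
    rle_out = cur + 1         # where index data would be written next

    if always_update_cur:
        cur = rle_out         # models the fix: always count the bit width byte

    for idx in indices:       # never entered when the page has no leaf values
        buf.append(idx)
        rle_out += 1
        cur = rle_out         # without the fix, cur only advances in here

    return bytes(buf[:cur])   # only bytes before cur make it into the page


# All-null page (no leaf values): the bit width byte is lost without the fix.
print(encode_page_body(b"LV", [], 3, always_update_cur=False))
print(encode_page_body(b"LV", [], 3, always_update_cur=True))
```

With at least one leaf value both variants produce the same bytes, which matches the report that only pages without leaf values were affected.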
Describe the bug
Parquet files written by libcudf that contain map columns (libcudf sees them as LIST<STRUCT<keytype,valtype>> columns) can end up being unreadable by Trino version 376. The error on the Trino CLI when it fails is:
The stacktrace on the Trino worker
Digging into the Trino error, it is complaining about a PLAIN_DICTIONARY V1 page where the page is exhausted after the repetition and definition levels when trying to read the byte for the dictionary bit width.
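To make the layout concrete, here is a minimal Python sketch of how a strict reader might locate the dictionary bit width byte in a V1 data page. It assumes RLE-encoded repetition and definition levels, each prefixed by a 4-byte little-endian length, followed by the one-byte bit width. The names `read_dict_bit_width` and `level_block` are hypothetical, and this is not Trino's or libcudf's actual reader code.

```python
import struct


def level_block(payload: bytes) -> bytes:
    """Wrap level data with the 4-byte little-endian length used in V1 pages."""
    return struct.pack("<I", len(payload)) + payload


def read_dict_bit_width(page: bytes, has_rep: bool = True,
                        has_def: bool = True) -> int:
    """Skip the level blocks and return the dictionary bit width byte.

    Raises EOFError if the page ends right after the levels, which is the
    condition a strict reader would trip over.
    """
    pos = 0
    for present in (has_rep, has_def):
        if not present:
            continue
        if pos + 4 > len(page):
            raise EOFError("page exhausted while reading a level length")
        (length,) = struct.unpack_from("<I", page, pos)
        pos += 4 + length
    if pos >= len(page):
        raise EOFError("page exhausted before the dictionary bit width byte")
    return page[pos]


# Placeholder level payloads; only their lengths matter for this sketch.
levels = level_block(b"\x02\x00") + level_block(b"\x02\x00")
print(read_dict_bit_width(levels + b"\x00"))  # bit width byte present
try:
    read_dict_bit_width(levels)               # bit width byte missing
except EOFError as err:
    print("error:", err)
```

A lenient reader could instead treat an exhausted page as "no values to decode", which is consistent with the note that the libcudf reader already handled such pages.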
Steps/Code to reproduce bug
It appears to occur when a page containing all nulls is written in a chunk that uses a dictionary: the page is emitted without the dictionary bit width byte. @nvdbaranec helped create code to write a file that will trigger this missing byte.
Parquet write code
The following patch to the libcudf reader prints whether the dictionary bit width byte is present or missing when decoding Parquet files. It can be used to detect the missing byte, which simulates the behavior of the Trino reader.
Expected behavior
libcudf does not write Parquet files that have PLAIN_DICTIONARY encoded pages missing the dictionary bit width byte.
Environment overview (please complete the following information)