[BUG] Protobuf error on Spark with cudf data #10755
Likely culprit: #10041
Got a minimal-ish repro where reading the file fails with cuDF: everything works when there are no more than 10K rows; otherwise, all values in the second row group are equal. All this points to issues with the row index (which Pandas does not use).
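A minimal sketch of the kind of repro described (a hypothetical reconstruction — the original snippet is not preserved in this thread), assuming cudf's default ORC row group stride of 10,000 rows:

```python
import cudf

# Two row groups: 10,000 or fewer rows does not trigger the bug.
n = 20_000
df = cudf.DataFrame({"v": list(range(n)), "s": ["x" * 200] * n})
df.to_orc("repro.orc")

roundtrip = cudf.read_orc("repro.orc")
# On affected versions, values in the second row group (rows 10,000+)
# come back wrong (all equal) instead of matching what was written.
assert roundtrip["v"].to_pandas().tolist() == list(range(n))
```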
Issue #10755

Fixes an issue in the protobuf writer where the length of a row index entry was being written into a single byte. This caused errors when the size was larger than 127. The issue was uncovered when row group statistics were added: string statistics contain copies of the min/max strings, so the entry size is unbounded. This PR changes the protobuf writer to write the entry size as a generic uint, allowing larger entries. Also fixes `start_row` in the row group info array in the reader (unrelated).

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
- David Wendt (https://github.com/davidwendt)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10989
Closed by #10989
Issue #10755 (backporting the fix to 22.06)

Fixes an issue in the protobuf writer where the length of a row index entry was being written into a single byte. This caused errors when the size was larger than 127. The issue was uncovered when row group statistics were added: string statistics contain copies of the min/max strings, so the entry size is unbounded. This PR changes the protobuf writer to write the entry size as a generic uint, allowing larger entries. Also fixes `start_row` in the row group info array in the reader (unrelated).

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- AJ Schmidt (https://github.com/ajschmidt8)
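For context, protobuf encodes unsigned integers as base-128 varints, where the high bit of each byte marks a continuation; a raw single-byte length therefore breaks as soon as an entry exceeds 127 bytes, and a reader misparses everything after it (hence the "invalid tag (zero)" error). A small standalone sketch of the encoding (illustrative only, not cudf code):

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits of the value
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)

assert encode_varint(127) == b"\x7f"      # still fits in one byte
assert encode_varint(300) == b"\xac\x02"  # lengths above 127 need more bytes
```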
Describe the bug
Hello, we are trying to use PySpark to query data generated by cudf, but we get the error `InvalidProtocolBufferException: Protocol message contained an invalid tag (zero)`. This error occurs with RAPIDS 22.02 and later versions. I can reproduce it with the code below. We would appreciate any help.

Steps/Code to reproduce bug
Cudf code
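The original snippet is not preserved here; a hypothetical equivalent (assuming long string values and more than 10,000 rows, which produce a second row group and oversized string statistics) could be:

```python
import cudf

n = 20_000
df = cudf.DataFrame({"v": list(range(n)), "s": ["x" * 200] * n})
df.to_orc("cudf_data.orc")  # the file that PySpark then fails to read
```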
PySpark 2.4.7
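Again a hypothetical reconstruction of the read side, showing where the exception surfaces:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cudf-orc-repro").getOrCreate()
# On affected RAPIDS versions this raises:
#   InvalidProtocolBufferException: Protocol message contained an invalid tag (zero)
df = spark.read.orc("cudf_data.orc")
df.show()
```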
RAPIDS version