[C++] Order of row index streams does not match the order of streams in the file footer #1475
Comments
Logs that point to incorrect order: rapidsai/cudf#11890 (comment)
Thanks for reporting the issue! @vuule The order of data streams is NOT FIXED, meaning that:
However, the order of positions in an index stream is FIXED. So for a direct-encoded string column, its
I checked the spec and it does not state this clearly. This would be a good time to document it as well. @deshanxiao
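To illustrate the distinction made above, here is a minimal reader-side sketch that slices a row index entry's flat position list in a fixed order, using the footer only to learn which streams exist. The order `PRESENT, DATA, LENGTH`, the function name `split_positions`, and the position counts per stream are assumptions for illustration; the real counts depend on the stream's encoding and on whether compression is enabled, and this is not the ORC library's API.

```python
from typing import Dict, List, Sequence

# Assumed fixed order of position groups for a direct-encoded string column.
FIXED_POSITION_ORDER = ("PRESENT", "DATA", "LENGTH")


def split_positions(
    positions: Sequence[int],
    footer_stream_kinds: Sequence[str],
    positions_per_kind: Dict[str, int],
) -> Dict[str, List[int]]:
    """Slice one row index entry's positions into per-stream groups.

    The footer order only tells us which streams exist. The slicing order
    always follows FIXED_POSITION_ORDER, never the footer order.
    """
    existing = set(footer_stream_kinds)
    groups: Dict[str, List[int]] = {}
    cursor = 0
    for kind in FIXED_POSITION_ORDER:
        if kind not in existing:
            continue
        count = positions_per_kind[kind]  # hypothetical, caller-supplied
        groups[kind] = list(positions[cursor:cursor + count])
        cursor += count
    if cursor != len(positions):
        raise ValueError("position count does not match the expected layout")
    return groups


# The footer lists LENGTH before DATA, as in the report; the positions do not.
groups = split_positions(
    positions=[100, 7, 3],
    footer_stream_kinds=["LENGTH", "DATA"],
    positions_per_kind={"DATA": 1, "LENGTH": 2},
)
print(groups)  # {'DATA': [100], 'LENGTH': [7, 3]}
```

A reader that instead walked the positions in footer order would hand the first value to LENGTH, which matches the symptom described in the report.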
Thanks @deshanxiao! Could you also verify that the Java implementation matches this behavior?
Thank you for the clarification @wgtmac!
Yes, the order is fixed. This is implemented in orc/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java (lines 369 to 377 in 792c3f8).
And then in orc/java/core/src/java/org/apache/orc/impl/writer/StringBaseTreeWriter.java (lines 265 to 270 in 9dbf833).
I followed the same order when I was implementing the C++ writer, so they should be consistent.
IIRC, the order is the same as in the table in the spec doc.
Thank you for sharing the Java code. I double-checked it and you are right, @wgtmac.
In fact, the implementations in different languages currently order the streams differently. In Java, the order in which streams are flushed to disk depends on the compareTo method.
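As a rough illustration of how a comparator, rather than creation order, can decide the on-disk stream order, here is a small sketch. The `StreamKey` class and its `(column, kind)` sort key are hypothetical; the fields that the Java `compareTo` actually compares are not reproduced here. The kind values 1 (DATA) and 2 (LENGTH) are the ones quoted in the report.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True, order=True)
class StreamKey:
    """Hypothetical stream identifier, ordered by (column, kind)."""
    column: int
    kind: int  # 1 = DATA, 2 = LENGTH, as quoted in the report


def flush_order(streams: List[StreamKey]) -> List[StreamKey]:
    """Return streams in the order a sorted container would flush them."""
    return sorted(streams)


# Streams created in an arbitrary order by the column writers.
created = [StreamKey(1, 2), StreamKey(1, 1), StreamKey(0, 1)]
print([(s.column, s.kind) for s in flush_order(created)])
# [(0, 1), (1, 1), (1, 2)]
```

A writer that flushes in creation order would produce a different footer order from the same streams, which is why readers should not infer the position order from the footer.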
Do you mean that the streams will cross for different columns like:
I notice that the streams of the same column appear together, but the order of the streams across different columns is uncertain even if they have the same data type.
BTW, is it necessary for us to add a type list in IndexEntry to describe the type of each position? @wgtmac @dongjoon-hyun @guiyanakuang
Yes, that would help a lot.
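One way to picture the proposal above is an index entry that carries a parallel list naming the stream each position belongs to. The thread does not define what "type" would mean, so the sketch assumes it is the stream kind. The `TypedIndexEntry` class and its `position_kinds` field are hypothetical and not part of the ORC format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TypedIndexEntry:
    """Hypothetical self-describing row index entry."""
    positions: List[int]
    position_kinds: List[str]  # assumed: one stream kind per position

    def by_kind(self) -> Dict[str, List[int]]:
        """Group positions by stream kind, preserving their relative order."""
        if len(self.positions) != len(self.position_kinds):
            raise ValueError("each position needs exactly one kind")
        grouped: Dict[str, List[int]] = {}
        for position, kind in zip(self.positions, self.position_kinds):
            grouped.setdefault(kind, []).append(position)
        return grouped


entry = TypedIndexEntry(
    positions=[100, 7, 3],
    position_kinds=["DATA", "LENGTH", "LENGTH"],
)
print(entry.by_kind())  # {'DATA': [100], 'LENGTH': [7, 3]}
```

With such a list, a reader would no longer need to know the writer's position order or the number of positions per stream in advance.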
When writing a file with a string column and multiple row groups, the resulting file has incorrect row index streams.
The string column is encoded using direct encoding. The file footer lists the LENGTH (kind 2) stream before the DATA (kind 1) stream. However, the row index seems to contain the index data for the DATA stream before that of the LENGTH stream. Swapping the order in which we read the row index streams fixes the issue, and the file can then be read correctly.
Isolation:
Only observing this behavior with string columns. Other types with multiple streams look correct in this regard.
Behavior looks unrelated to string content in the column.
No info on dictionary-encoded string columns - the writer seemingly defaults to direct encoding.
See attached repro file. The file contains a single string column, with
["*"] * 10001
10001_strings.zip