-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Enable proper Index
round-tripping in orc
reader and writer
#10170
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10170 +/- ##
================================================
+ Coverage 10.42% 10.62% +0.19%
================================================
Files 119 122 +3
Lines 20603 20977 +374
================================================
+ Hits 2148 2228 +80
- Misses 18455 18749 +294
Continue to review full report at Codecov.
|
How come? Are the old files invalid in some way (with respect to ORC specs)? |
Because there is no metadata being written previously. The only way we can correctly "know" the index column is by writing it in metadata and reading the same metadata. The old files will not be invalid, they will still be read in the old way i.e., the index column will become a column of the actual dataframe rather than the index of the dataframe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional suggestions
python/cudf/cudf/_lib/orc.pyx
Outdated
@@ -123,8 +129,59 @@ cpdef read_orc(object filepaths_or_buffers, | |||
c_result = move(libcudf_read_orc(c_orc_reader_options)) | |||
|
|||
names = [name.decode() for name in c_result.metadata.column_names] | |||
cdef map[string, string] user_data = c_result.metadata.user_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be good to place this code in a separate function, not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay, moved this to a separate function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just have small question
Thanks for reviewing this @rgsl888prabhu ! |
@gpucibot merge |
Fixes: #10010
This PR:
to_orc
by enabling writing of dataframe metadata to ORC file being created.read_orc
to correctly read and assignIndex
objects that exist inmetadata
cudf
.