[REVIEW] Enable proper `Index` round-tripping in `orc` reader and writer #10170

galipremsagar · 2022-01-31T16:19:21Z

This PR:

- Fixes `to_orc` by writing the dataframe metadata to the ORC file being created.
- Fixes `read_orc` to correctly read and assign the `Index` objects recorded in that metadata.

Note: This change is not backward compatible with files already written in previous versions of cudf.

codecov · 2022-01-31T17:56:51Z

Codecov Report

Merging #10170 (7441ff5) into branch-22.04 (a7d88cd) will increase coverage by 0.19%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.04   #10170      +/-   ##
================================================
+ Coverage         10.42%   10.62%   +0.19%     
================================================
  Files               119      122       +3     
  Lines             20603    20977     +374     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18749     +294

Impacted Files	Coverage Δ
...ython/custreamz/custreamz/tests/test_dataframes.py	`99.39% <0.00%> (-0.01%)`	⬇️
python/cudf/cudf/errors.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/orc.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_version.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/datasets.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/parquet.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/series.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/utils.py	`0.00% <0.00%> (ø)`
... and 35 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 496f452...7441ff5.

vuule · 2022-02-01T02:10:38Z

> This change is not backward compatible with files already written in previous versions of cudf.

How come? Are the old files invalid in some way (with respect to ORC specs)?

galipremsagar · 2022-02-01T02:14:23Z

> How come? Are the old files invalid in some way (with respect to ORC specs)?

Because no metadata was written previously. The only way we can correctly "know" which column is the index is to write that information into the file metadata and read the same metadata back.

The old files are not invalid; they will still be read the old way, i.e., the index column becomes a regular column of the dataframe rather than its index.
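To make the point above concrete, here is a minimal pure-Python sketch of the idea, not the code from this PR. The metadata key `"pandas"`, the `"index_columns"` field, and the helper names `build_user_data` and `split_index_columns` are all assumptions chosen for illustration; the thread only establishes that the reader gets a string-to-string `user_data` map and the list of column names.

```python
import json

# Hypothetical metadata key and layout, modeled on pandas-style schema
# metadata; the key actually used by cuDF is not stated in this thread.
METADATA_KEY = "pandas"


def build_user_data(index_names):
    """Writer side: record which columns hold the index."""
    return {METADATA_KEY: json.dumps({"index_columns": list(index_names)})}


def split_index_columns(column_names, user_data):
    """Reader side: return (index_columns, data_columns).

    Files without the metadata entry (for example, files written before
    this change) yield no index columns, so every column stays a data column.
    """
    raw = user_data.get(METADATA_KEY)
    if raw is None:
        return [], list(column_names)
    index_columns = [
        name for name in json.loads(raw).get("index_columns", [])
        if name in column_names
    ]
    data_columns = [n for n in column_names if n not in index_columns]
    return index_columns, data_columns


columns = ["idx", "a", "b"]
new_file = split_index_columns(columns, build_user_data(["idx"]))
old_file = split_index_columns(columns, {})
print(new_file)  # (['idx'], ['a', 'b'])
print(old_file)  # ([], ['idx', 'a', 'b'])
```

With the metadata present, `idx` is restored as the index; without it, `idx` is returned as an ordinary data column, which matches the behavior described for older files.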

vuule

optional suggestions

python/cudf/cudf/_lib/orc.pyx

vuule · 2022-02-01T02:28:30Z

python/cudf/cudf/_lib/orc.pyx

@@ -123,8 +129,59 @@ cpdef read_orc(object filepaths_or_buffers,
        c_result = move(libcudf_read_orc(c_orc_reader_options))

    names = [name.decode() for name in c_result.metadata.column_names]
+    cdef map[string, string] user_data = c_result.metadata.user_data


Maybe it would be good to place this code in a separate function, not sure.

Apologies for the delay, moved this to a separate function.

Co-authored-by: Vukasin Milovanovic <[email protected]>

rgsl888prabhu

Just have a small question.

python/cudf/cudf/_lib/orc.pyx

galipremsagar · 2022-02-24T16:20:39Z

Thanks for reviewing this @rgsl888prabhu !

galipremsagar · 2022-02-24T16:20:56Z

@gpucibot merge

galipremsagar added 3 commits January 28, 2022 22:35

fix index handling in orc reader and writer

b485447

Merge remote-tracking branch 'upstream/branch-22.04' into 10010

e3a5073

fix tests

b90c9ed

galipremsagar added the labels bug (Something isn't working), Python (Affects Python cuDF API), 4 - Needs cuDF (Python) Reviewer, cuIO (cuIO issue), and breaking (Breaking change) Jan 31, 2022

galipremsagar requested review from vuule and rgsl888prabhu January 31, 2022 16:19

galipremsagar self-assigned this Jan 31, 2022

galipremsagar requested a review from a team as a code owner January 31, 2022 16:19

galipremsagar requested a review from brandon-b-miller January 31, 2022 16:19

vuule reviewed Feb 1, 2022

galipremsagar and others added 4 commits February 23, 2022 10:18

Merge remote-tracking branch 'upstream/branch-22.04' into 10010

d0bd784

Update python/cudf/cudf/_lib/orc.pyx

ed77047

Co-authored-by: Vukasin Milovanovic <[email protected]>

Merge branch '10010' of https://github.com/galipremsagar/cudf into 10010

da04667

move to a function

7441ff5

rgsl888prabhu reviewed Feb 24, 2022

rgsl888prabhu approved these changes Feb 24, 2022

galipremsagar removed 4 - Needs cuDF (Python) Reviewer labels Feb 24, 2022

galipremsagar added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Feb 24, 2022

rapids-bot bot merged commit 3a1dbe8 into rapidsai:branch-22.04 Feb 24, 2022
