update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior #10749

karthikeyann · 2022-04-27T17:16:57Z

Depends on #10584

karthikeyann · 2022-05-11T17:05:27Z

rerun tests

codecov · 2022-05-11T19:09:22Z

Codecov Report

Merging #10749 (0564c25) into branch-22.06 (4ad1e51) will increase coverage by 0.02%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10749      +/-   ##
================================================
+ Coverage         86.29%   86.32%   +0.02%     
================================================
  Files               144      144              
  Lines             22656    22656              
================================================
+ Hits              19552    19557       +5     
+ Misses             3104     3099       -5

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/numerical.py | `95.88% <0.00%> (-0.30%)` | ⬇️ |
| python/cudf/cudf/core/dataframe.py | `93.78% <0.00%> (+0.04%)` | ⬆️ |
| python/cudf/cudf/core/column/string.py | `88.78% <0.00%> (+0.12%)` | ⬆️ |
| python/cudf/cudf/core/groupby/groupby.py | `91.79% <0.00%> (+0.22%)` | ⬆️ |
| python/cudf/cudf/core/tools/datetimes.py | `84.49% <0.00%> (+0.30%)` | ⬆️ |
| python/cudf/cudf/core/column/lists.py | `91.70% <0.00%> (+0.97%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0802451...0564c25.

bdice · 2022-05-11T19:19:48Z

cpp/src/io/csv/reader_impl.cu

+          // Rename duplicates of column X as X.1, X.2, ...; First appearance stays as X
+          while (cur_count > 0) {
+            col_names_counts[old_col] = cur_count + 1;
+            col                       = old_col + "." + std::to_string(cur_count);


I'm generally wary of injecting pandas-specific behavior into libcudf. I can imagine this would affect other users of the read_csv API that do not expect pandas conventions (or changes in conventions) here. Is there a way we could achieve similar functionality without hard-coding pandas implementation details into C++?

(While it seems that we're already injecting pandas implementation details into libcudf, maybe we should reconsider the design rather than double down on it.)

karthikeyann · 2022-05-11

Right. We could move the mangle_dupe_cols code to the Python/Cython layer.
I would like to hear from the Spark side, @rapidsai/cudf-java-codeowners: how do you handle duplicate columns, and how would this change affect you?

Either way, libcudf still needs a defined behavior for duplicate column names; otherwise the result is undefined.
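
For illustration only, here is a minimal Python sketch of the renaming rule quoted in the diff comment above (duplicates of X become X.1, X.2, ...; the first appearance stays as X). The function name `mangle_dupe_cols_sketch` and the way it resolves collisions with names already present in the header are assumptions made for this example. It is not the code from this PR, nor a copy of the pandas implementation.

```python
from collections import defaultdict


def mangle_dupe_cols_sketch(names):
    """Return a new list where repeats of X are renamed X.1, X.2, ...

    Hypothetical helper for illustration; not part of cuDF or pandas.
    """
    counts = defaultdict(int)  # times each emitted name has been seen so far
    result = []
    for col in names:
        cur_count = counts[col]
        # Keep appending a suffix until the candidate name is unused.
        while cur_count > 0:
            counts[col] = cur_count + 1
            col = f"{col}.{cur_count}"
            cur_count = counts[col]
        result.append(col)
        counts[col] = cur_count + 1
    return result


print(mangle_dupe_cols_sketch(["a", "b", "a", "a"]))  # ['a', 'b', 'a.1', 'a.2']
```

Because the loop re-checks each candidate against the names already emitted, a header such as `a,a,a.1` still yields unique names in this sketch (`a`, `a.1`, `a.1.1`).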

revans2 · 2022-05-12

Spark does odd things with CSV and should not be impacted by this. It has a schema discovery phase that looks at the files, sub-samples the lines, and figures out the schema it wants to use for that data if one is not provided. It dedupes column names and so on in the way that Spark wants; we currently let the CPU do this. After that, everything is based on column order, not the name of the column in the file. This is mostly done so that, when reading the data, there is no need to go back to the start of the file and look at the first row to know the order that is needed.
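
To make the "column order, not column name" point concrete, the sketch below binds an externally supplied schema to CSV columns purely by position, so duplicate names in the file header never matter. It uses only the Python standard library. The helper `read_rows_by_position` and the schema names `x_0`, `x_1`, `y` are hypothetical, and this is not how Spark or the RAPIDS plugin is implemented.

```python
import csv
import io


def read_rows_by_position(csv_text, schema_names):
    """Parse CSV text, ignoring the header and naming columns by position.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # discard the header row; its names are never used
    return [dict(zip(schema_names, row)) for row in reader]


# The header repeats "x", but the schema decides the final names.
sample = "x,x,y\n1,2,3\n4,5,6\n"
rows = read_rows_by_position(sample, ["x_0", "x_1", "y"])
print(rows[0])  # {'x_0': '1', 'x_1': '2', 'y': '3'}
```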

karthikeyann · 2022-05-16T05:03:06Z

@gpucibot merge

karthikeyann added 2 commits April 27, 2022 22:41

update duplicate cols naming to pandas 1.4.0 logic

672ee0a

add pytest test cases for mangle_dupe_cols

c593f62

karthikeyann added the bug, 3 - Ready for Review, pandas, libcudf, Python, cuIO, and breaking labels Apr 27, 2022

karthikeyann requested review from a team as code owners April 27, 2022 17:16

karthikeyann requested review from vyasr, skirui-source, PointKernel and galipremsagar April 27, 2022 17:16

karthikeyann marked this pull request as draft April 27, 2022 17:17

karthikeyann added the 5 - Merge After Dependencies label Apr 27, 2022

fix test and duplicate disable code

ffa28cc

karthikeyann marked this pull request as ready for review May 11, 2022 19:10

karthikeyann removed the 5 - Merge After Dependencies label May 11, 2022

karthikeyann requested a review from a team May 11, 2022 19:13

bdice reviewed May 11, 2022

galipremsagar requested changes May 11, 2022

python/cudf/cudf/tests/test_csv.py (outdated, resolved)

address review comments (galipremsagar)

8705ffe

galipremsagar reviewed May 13, 2022

python/cudf/cudf/tests/test_csv.py (outdated, resolved)

karthikeyann added 3 commits May 13, 2022 22:35

Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …

8a8b0c6

…bug-demangled_csv_col_names2

Revert "address review comments (galipremsagar)"

d824ed6

This reverts commit 8705ffe.

update mangle_dupe_cols=False test case since pandas does not support it

e8db11c

karthikeyann requested review from galipremsagar and bdice May 13, 2022 18:09

PointKernel reviewed May 13, 2022

cpp/src/io/csv/reader_impl.cu (outdated, resolved)

PointKernel reviewed May 13, 2022

cpp/src/io/csv/reader_impl.cu (outdated, resolved)

PointKernel reviewed May 13, 2022

cpp/src/io/csv/reader_impl.cu (resolved)

galipremsagar approved these changes May 13, 2022

PointKernel approved these changes May 13, 2022

address review comments (PointKernel)

0564c25

rapids-bot bot merged commit e58d049 into rapidsai:branch-22.06 May 16, 2022

vyasr added the 4 - Needs Review label and removed the 4 - Needs cuIO Reviewer label Feb 23, 2024

vyasr removed the pandas label May 16, 2024
