Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior #10749

Merged

Conversation

karthikeyann
Copy link
Contributor

Fixes #10618

Depends on #10584

@karthikeyann karthikeyann added bug Something isn't working 3 - Ready for Review Ready for review by team pandas libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. cuIO cuIO issue breaking Breaking change labels Apr 27, 2022
@karthikeyann karthikeyann requested review from a team as code owners April 27, 2022 17:16
@karthikeyann karthikeyann marked this pull request as draft April 27, 2022 17:17
@karthikeyann
Copy link
Contributor Author

rerun tests

@codecov
Copy link

codecov bot commented May 11, 2022

Codecov Report

Merging #10749 (0564c25) into branch-22.06 (4ad1e51) will increase coverage by 0.02%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10749      +/-   ##
================================================
+ Coverage         86.29%   86.32%   +0.02%     
================================================
  Files               144      144              
  Lines             22656    22656              
================================================
+ Hits              19552    19557       +5     
+ Misses             3104     3099       -5     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/numerical.py 95.88% <0.00%> (-0.30%) ⬇️
python/cudf/cudf/core/dataframe.py 93.78% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 88.78% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py 91.70% <0.00%> (+0.97%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0802451...0564c25. Read the comment docs.

@karthikeyann karthikeyann marked this pull request as ready for review May 11, 2022 19:10
@karthikeyann karthikeyann requested a review from a team May 11, 2022 19:13
// Rename duplicates of column X as X.1, X.2, ...; First appearance stays as X
while (cur_count > 0) {
col_names_counts[old_col] = cur_count + 1;
col = old_col + "." + std::to_string(cur_count);
Copy link
Contributor

@bdice bdice May 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally wary of injecting pandas-specific behavior into libcudf. I can imagine this would affect other users of the read_csv API that do not expect pandas conventions (or changes in conventions) here. Is there a way we could achieve similar functionality without hard-coding pandas implementation details into C++?

(While it seems that we're already injecting pandas implementation details into libcudf, maybe we should reconsider the design rather than double down on it.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. We could move the mangle_dupe_cols code to python/Cython layer.
I would like to get opinion from Spark @rapidsai/cudf-java-codeowners how they handle duplicate columns? how this will affect them?

Anyway, we still need a behavior in libcudf to handle duplicate columns, else, it will be an undefined behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark does odd things with CSV and should not be impacted by this. It has a schema discovery phase that looks at the files, sub-samples the lines and figures out the schema that it wants to use for that data if one is not provided. It handles deduping column names/etc in the way that Spark wants. We let the CPU do this currently. But after that everything is based off of its column order, not the name of the column in the file. This is mostly done so that when reading the data there is no need to go back to the start of the file and look at the first row to know the order that is needed.

python/cudf/cudf/tests/test_csv.py Outdated Show resolved Hide resolved
@karthikeyann
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e58d049 into rapidsai:branch-22.06 May 16, 2022
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024
@vyasr vyasr removed the pandas label May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond breaking Breaking change bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] mangle_dupe_cols in CSV reader requires a change in behavior
6 participants