-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior #10749
update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior #10749
Conversation
rerun tests |
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10749 +/- ##
================================================
+ Coverage 86.29% 86.32% +0.02%
================================================
Files 144 144
Lines 22656 22656
================================================
+ Hits 19552 19557 +5
+ Misses 3104 3099 -5
Continue to review full report at Codecov.
|
// Rename duplicates of column X as X.1, X.2, ...; First appearance stays as X | ||
while (cur_count > 0) { | ||
col_names_counts[old_col] = cur_count + 1; | ||
col = old_col + "." + std::to_string(cur_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally wary of injecting pandas-specific behavior into libcudf. I can imagine this would affect other users of the read_csv
API that do not expect pandas conventions (or changes in conventions) here. Is there a way we could achieve similar functionality without hard-coding pandas implementation details into C++?
(While it seems that we're already injecting pandas implementation details into libcudf, maybe we should reconsider the design rather than double down on it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We could move the mangle_dupe_cols code to python/Cython layer.
I would like to get opinion from Spark @rapidsai/cudf-java-codeowners how they handle duplicate columns? how this will affect them?
Anyway, we still need a behavior in libcudf to handle duplicate columns, else, it will be an undefined behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark does odd things with CSV and should not be impacted by this. It has a schema discovery phase that looks at the files, sub-samples the lines and figures out the schema that it wants to use for that data if one is not provided. It handles deduping column names/etc in the way that Spark wants. We let the CPU do this currently. But after that everything is based off of its column order, not the name of the column in the file. This is mostly done so that when reading the data there is no need to go back to the start of the file and look at the first row to know the order that is needed.
…bug-demangled_csv_col_names2
This reverts commit 8705ffe.
@gpucibot merge |
Fixes #10618
Depends on #10584