Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior #10749
update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior #10749
Changes from 7 commits
672ee0a
c593f62
ffa28cc
8705ffe
8a8b0c6
d824ed6
e8db11c
0564c25
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally wary of injecting pandas-specific behavior into libcudf. I can imagine this would affect other users of the
read_csv
API that do not expect pandas conventions (or changes in conventions) here. Is there a way we could achieve similar functionality without hard-coding pandas implementation details into C++?(While it seems that we're already injecting pandas implementation details into libcudf, maybe we should reconsider the design rather than double down on it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We could move the mangle_dupe_cols code to python/Cython layer.
I would like to get opinion from Spark @rapidsai/cudf-java-codeowners how they handle duplicate columns? how this will affect them?
Anyway, we still need a behavior in libcudf to handle duplicate columns, else, it will be an undefined behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark does odd things with CSV and should not be impacted by this. It has a schema discovery phase that looks at the files, sub-samples the lines and figures out the schema that it wants to use for that data if one is not provided. It handles deduping column names/etc in the way that Spark wants. We let the CPU do this currently. But after that everything is based off of its column order, not the name of the column in the file. This is mostly done so that when reading the data there is no need to go back to the start of the file and look at the first row to know the order that is needed.