-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issues when both usecols
and names
options are used in read_csv
#12018
Fix issues when both usecols
and names
options are used in read_csv
#12018
Conversation
…bug-read_csv-usecols-names
Codecov ReportBase: 87.47% // Head: 88.27% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #12018 +/- ##
================================================
+ Coverage 87.47% 88.27% +0.79%
================================================
Files 133 137 +4
Lines 21826 22607 +781
================================================
+ Hits 19093 19957 +864
+ Misses 2733 2650 -83
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
usecols
and names
options in read_csv
usecols
and names
options are used in read_csv
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 Awesome
…bug-read_csv-usecols-names
…bug-read_csv-usecols-names
CC @jlowe in case you want to check if the new behavior matches the expectations |
I started trying to give this a quick review but I don't understand the existing logic well enough. It might be best if we had an additional reviewer familiar with this part of the code; I could figure it out, but it may not be the most time-effective method here and I don't want to hold up progress. If there isn't a better person though let me know and I can give this another shot, I'll just need to familiarize myself much more with how the reader works. |
The behavior validated in the new tests seems OK to me, but I'm wondering what happens when the specified schema is not a subset of the file? For example, if the user says the file schema is |
python/cudf/cudf/utils/ioutils.py
Outdated
@@ -948,7 +948,8 @@ | |||
the column names: if no names are passed, header=0; | |||
if column names are passed explicitly, header=None. | |||
names : list of str, default None | |||
List of column names to be used. | |||
List of column names to be used. Needs to include names of all column in | |||
the file, or names of all columns selected using `usecols` (indices only). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indices only? pytest example has column names too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the requirement to have the same number of elements as usecols
is limited to the case when usecols
contains indices.
I included the failing cases in the C++ test UseColsValidation
, the options object names indicate the reason why it should fail.
I can add comments there to make it clearer. Also, let me know if this comment should be worded differently to be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(indices only)
is only confusing. does indices only
mean usecols
uses indices, but not column names?
Rest all looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expanded the related comments, I hope they make sense now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Minor nitpicks. Look good.
…bug-read_csv-usecols-names
…/cudf into bug-read_csv-usecols-names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One general comment. There's a number of blocks of mildly complex logic that would be well-served with some simple comments on what they do. Examples: Lines, 702, 773, 789. Descriptions of what each block is intended for would help with the overall flow of reading.
reader_opts.get_names().size() == detected_column_names.size() or | ||
// Columns are not selected by indices; read first reader_opts.get_names().size() columns | ||
unique_use_cols_indexes.empty()); | ||
auto column_names = opts_have_all_col_names ? reader_opts.get_names() : detected_column_names; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't, names are potentially modified in a few places later in the code (empty names -> "Unnamed: col_index", mangle duplicates, apply names
to selected column when index-based column selection is used).
…bug-read_csv-usecols-names
Co-authored-by: Vyas Ramasubramani <[email protected]>
…/cudf into bug-read_csv-usecols-names
rerun tests |
@gpucibot merge |
Description
closes #8973
CSV reader has a few gaps in the logic for column selection and user specified column names:
This PR fixes the issues above. Users can now specify column names (can be lower than the actual number of columns) or names of columns selected via their indices (must match the number of indices). If selection via indices is used, the number of column names has to match either the actual number of columns, or the number of selected columns.
Also fixed test an error that went unnoticed due to issues above.
Checklist