Fix issues when both usecols and names options are used in read_csv #12018

Merged
merged 27 commits into rapidsai:branch-22.12 from bug-read_csv-usecols-names
Nov 17, 2022

Conversation

vuule
Contributor

@vuule vuule commented Oct 27, 2022

Description

closes #8973
The CSV reader has a few gaps in its logic for column selection and user-specified column names:

  1. Users cannot specify names for only the selected columns;
  2. The reader fails in unpredictable ways when names are passed for only a subset of the columns (without column selection).

This PR fixes the issues above. Users can now pass fewer column names than the file has columns, or pass names only for the columns selected via their indices. If selection via indices is used, the number of column names has to match either the actual number of columns or the number of selected columns.

Also fixed an error in a test that went unnoticed due to the issues above.
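
The rules in the description can be modeled in a few lines of plain Python. This is only a sketch of the stated behavior, not the libcudf implementation: `resolve_names` and its parameters are hypothetical, indices are assumed to be unique and in range, and names given for selected columns are assumed to pair positionally with the indices. The case of more names than file columns is not covered by the description, so the sketch refuses to guess.

```python
from typing import List, Optional, Sequence


def resolve_names(
    detected: Sequence[str],
    names: Optional[Sequence[str]] = None,
    usecols_idx: Optional[Sequence[int]] = None,
) -> List[str]:
    """Return output column names under the rules in the PR description.

    Hypothetical helper for illustration; not the libcudf code.
    """
    detected = list(detected)
    names = list(names or [])
    idx = list(usecols_idx or [])

    if not idx:
        if not names:
            return detected
        if len(names) > len(detected):
            # Not covered by the PR description, so this sketch does not guess.
            raise NotImplementedError("more names than columns in the file")
        # Fewer names than columns: only the first len(names) columns are read.
        return names

    if not names:
        return [detected[i] for i in idx]
    if len(names) == len(detected):
        # Names cover the whole file; keep the ones for the selected columns.
        return [names[i] for i in idx]
    if len(names) == len(idx):
        # Names cover only the selected columns (assumed positional pairing).
        return names
    raise ValueError(
        "with index-based selection, the number of names must match "
        "the number of file columns or the number of selected columns"
    )


# Names for the selected columns only.
print(resolve_names(["a", "b", "c"], ["x", "z"], [0, 2]))       # ['x', 'z']
# Names for every column, then selection by index.
print(resolve_names(["a", "b", "c"], ["x", "y", "z"], [0, 2]))  # ['x', 'z']
```

A names list whose length matches neither count (for example three names for a four-column file with two selected indices) raises `ValueError` in this sketch, mirroring the validation the PR describes.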

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added the bug, cuIO, and non-breaking labels Oct 27, 2022
@vuule vuule self-assigned this Oct 27, 2022
@github-actions github-actions bot added the Python and libcudf labels Oct 27, 2022
@codecov

codecov bot commented Oct 28, 2022

Codecov Report

Base: 87.47% // Head: 88.27% // Increases project coverage by +0.79% 🎉

Coverage data is based on head (b8d16e7) compared to base (f817d96).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12018      +/-   ##
================================================
+ Coverage         87.47%   88.27%   +0.79%     
================================================
  Files               133      137       +4     
  Lines             21826    22607     +781     
================================================
+ Hits              19093    19957     +864     
+ Misses             2733     2650      -83     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/interval.py 85.45% <0.00%> (-9.10%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/cudf/cudf/core/column/timedelta.py 90.17% <0.00%> (-0.58%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.21% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/column.py 87.96% <0.00%> (-0.46%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
python/cudf/cudf/io/parquet.py 90.45% <0.00%> (-0.39%) ⬇️
... and 48 more

@vuule vuule changed the title from "Fix interactions between usecols and names options in read_csv" to "Fix issues when both usecols and names options are used in read_csv" Oct 28, 2022
@vuule
Contributor Author

vuule commented Oct 28, 2022

rerun tests

Contributor

@galipremsagar galipremsagar left a comment

🔥 Awesome

@vuule vuule marked this pull request as ready for review November 1, 2022 17:35
@vuule vuule requested review from a team as code owners November 1, 2022 17:35
@vuule
Contributor Author

vuule commented Nov 1, 2022

CC @jlowe in case you want to check if the new behavior matches the expectations

@vyasr
Contributor

vyasr commented Nov 2, 2022

I started trying to give this a quick review but I don't understand the existing logic well enough. It might be best if we had an additional reviewer familiar with this part of the code; I could figure it out, but it may not be the most time-effective method here and I don't want to hold up progress. If there isn't a better person, though, let me know and I can give this another shot; I'll just need to familiarize myself much more with how the reader works.

@jlowe
Member

jlowe commented Nov 3, 2022

The behavior validated in the new tests seems OK to me, but I'm wondering what happens when the specified schema is not a subset of the file. For example, if the user says the file schema is a: string, b: string, c: string, d: string and asks to load columns b and d, what happens if there are only two columns in the file data? Spark would load the second column's data for b and nulls for d in that case.
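
The Spark behavior described in this comment can be pictured with a small plain-Python model. It is only an illustration of the pad-with-nulls idea: `project_row` and its arguments are hypothetical names, it is not Spark code, and it says nothing about what the cuDF reader does in this situation.

```python
from typing import Dict, List, Optional, Sequence


def project_row(
    schema: Sequence[str],
    selected: Sequence[str],
    row: Sequence[str],
) -> Dict[str, Optional[str]]:
    """Pick the selected columns from one parsed row.

    Columns that the declared schema places beyond the end of the row
    are filled with None (illustrative model only).
    """
    positions: List[str] = list(schema)
    out: Dict[str, Optional[str]] = {}
    for name in selected:
        pos = positions.index(name)
        out[name] = row[pos] if pos < len(row) else None
    return out


# Declared schema has four columns, but the data row has only two fields.
result = project_row(["a", "b", "c", "d"], ["b", "d"], ["1", "2"])
print(result)  # {'b': '2', 'd': None}
```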

@@ -948,7 +948,8 @@
 the column names: if no names are passed, header=0;
 if column names are passed explicitly, header=None.
 names : list of str, default None
-List of column names to be used.
+List of column names to be used. Needs to include names of all column in
+the file, or names of all columns selected using `usecols` (indices only).
Contributor

Indices only? The pytest example has column names too.

Contributor Author

The requirement that names have the same number of elements as usecols applies only when usecols contains indices.
I included the failing cases in the C++ test UseColsValidation; the names of the options objects indicate why each case should fail.
I can add comments there to make it clearer. Also, let me know if this comment should be worded differently to be clearer.

Contributor

The "(indices only)" wording is confusing. Does "indices only" mean that usecols uses indices rather than column names?

Everything else looks good.

Contributor Author

Expanded the related comments; I hope they make sense now.

cpp/tests/io/csv_test.cpp (outdated, resolved)
cpp/tests/io/csv_test.cpp (outdated, resolved)
cpp/tests/io/csv_test.cpp (outdated, resolved)
cpp/tests/io/csv_test.cpp (outdated, resolved)
@vuule vuule requested a review from karthikeyann November 8, 2022 21:41
Contributor

@karthikeyann karthikeyann left a comment

LGTM 👍

Minor nitpicks; looks good.

cpp/src/io/csv/reader_impl.cu (outdated, resolved)
cpp/src/io/csv/reader_impl.cu (outdated, resolved)
Contributor

@nvdbaranec nvdbaranec left a comment

One general comment. There are a number of blocks of mildly complex logic that would be well served by some simple comments on what they do. Examples: lines 702, 773, 789. Descriptions of what each block is intended for would help with the overall flow of reading.

cpp/src/io/csv/reader_impl.cu (outdated, resolved)
cpp/src/io/csv/reader_impl.cu (resolved)
cpp/src/io/csv/reader_impl.cu (outdated, resolved)
cpp/src/io/csv/reader_impl.cu (resolved)
cpp/src/io/csv/reader_impl.cu (outdated, resolved)
reader_opts.get_names().size() == detected_column_names.size() or
// Columns are not selected by indices; read first reader_opts.get_names().size() columns
unique_use_cols_indexes.empty());
auto column_names = opts_have_all_col_names ? reader_opts.get_names() : detected_column_names;
Contributor

const?

Contributor Author

Can't; names are potentially modified in a few places later in the code (empty names -> "Unnamed: col_index", duplicates are mangled, and names are applied to the selected columns when index-based column selection is used).
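
To picture why the name list has to stay mutable, here is a small Python sketch of two of the later fix-ups mentioned in this reply. It is an assumption-laden illustration, not the reader's code: `finalize_names` is a hypothetical helper, "Unnamed: col_index" is taken to mean the column's position is appended, and the ".N" suffix for duplicates follows the pandas convention rather than anything stated in this thread. It also does not handle a mangled name colliding with an existing one.

```python
from typing import Dict, List


def finalize_names(names: List[str]) -> List[str]:
    """Fill in empty names and mangle duplicates (illustrative only)."""
    out = list(names)  # work on a copy so the caller's list is untouched
    seen: Dict[str, int] = {}
    for i, name in enumerate(out):
        if name == "":
            # Empty name: derive one from the column position.
            name = f"Unnamed: {i}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        if count:
            # Repeated name: append a pandas-style ".N" suffix (assumed format).
            name = f"{name}.{count}"
        out[i] = name
    return out


print(finalize_names(["a", "", "a"]))  # ['a', 'Unnamed: 1', 'a.1']
```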

@vuule vuule requested review from ttnghia and vyasr November 16, 2022 20:21
@vuule vuule added the 5 - Ready to Merge label Nov 16, 2022
@vuule
Contributor Author

vuule commented Nov 17, 2022

rerun tests

@vuule
Contributor Author

vuule commented Nov 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 6de2c4e into rapidsai:branch-22.12 Nov 17, 2022
@vuule vuule deleted the bug-read_csv-usecols-names branch November 17, 2022 06:08
Labels
5 - Ready to Merge, bug, cuIO, libcudf, non-breaking, Python
Development

Successfully merging this pull request may close these issues.

[BUG] Can't name subset of columns with read_csv
7 participants