Fix issues when both `usecols` and `names` options are used in `read_csv` #12018

vuule · 2022-10-27T23:49:17Z

Description

closes #8973
The CSV reader has a few gaps in the logic for column selection and user-specified column names:

Users cannot specify names for only the selected columns;
The reader fails in unpredictable ways when only a subset of column names is passed (without column selection).

This PR fixes the issues above. Users can now specify column names (fewer than the actual number of columns is allowed) or the names of columns selected via their indices (the count must match the number of indices). If selection via indices is used, the number of column names has to match either the actual number of columns or the number of selected columns.
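The rules in the description can be sketched in plain Python. This is only an illustration of the stated constraints, not the libcudf implementation: the function name `resolve_selected_names` is hypothetical, treating more names than file columns as an error is an assumption (the description does not cover that case), and pairing names with indices in ascending index order is also an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_selected_names(
    num_file_columns: int,
    names: Sequence[str],
    usecols_indices: Optional[Sequence[int]] = None,
) -> List[Tuple[int, str]]:
    """Return (column index, name) pairs for the columns that would be read."""
    if not usecols_indices:
        # No index-based selection: names may cover only the first
        # len(names) columns of the file.
        if len(names) > num_file_columns:
            # Assumption of this sketch; the PR text does not specify this case.
            raise ValueError("more names than columns in the file")
        return list(enumerate(names))

    # Assumption: duplicates are ignored and indices are handled in ascending order.
    selected = sorted(set(usecols_indices))
    if any(i < 0 or i >= num_file_columns for i in selected):
        raise ValueError("column index out of range")

    if len(names) == num_file_columns:
        # Names describe every column in the file; pick the selected ones.
        return [(i, names[i]) for i in selected]
    if len(names) == len(selected):
        # Names describe only the selected columns.
        return list(zip(selected, names))
    raise ValueError(
        "number of names must match the number of file columns "
        "or the number of selected columns"
    )
```

For a four-column file, `resolve_selected_names(4, ["a", "b", "c", "d"], [1, 3])` and `resolve_selected_names(4, ["x", "y"], [1, 3])` are both accepted, while three names with two selected indices are rejected.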

Also fixed a test error that went unnoticed due to the issues above.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

codecov · 2022-10-28T02:03:50Z

Codecov Report

Base: 87.47% // Head: 88.27% // Increases project coverage by +0.79% 🎉

Coverage data is based on head (b8d16e7) compared to base (f817d96).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.12   #12018      +/-   ##
================================================
+ Coverage         87.47%   88.27%   +0.79%     
================================================
  Files               133      137       +4     
  Lines             21826    22607     +781     
================================================
+ Hits              19093    19957     +864     
+ Misses             2733     2650      -83

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/interval.py | `85.45% <0.00%> (-9.10%)` | ⬇️ |
| python/cudf/cudf/io/text.py | `91.66% <0.00%> (-8.34%)` | ⬇️ |
| python/cudf/cudf/core/_base_index.py | `81.28% <0.00%> (-4.27%)` | ⬇️ |
| python/cudf/cudf/io/json.py | `92.06% <0.00%> (-2.68%)` | ⬇️ |
| python/cudf/cudf/utils/utils.py | `89.91% <0.00%> (-0.69%)` | ⬇️ |
| python/cudf/cudf/core/column/timedelta.py | `90.17% <0.00%> (-0.58%)` | ⬇️ |
| python/cudf/cudf/core/column/datetime.py | `89.21% <0.00%> (-0.51%)` | ⬇️ |
| python/cudf/cudf/core/column/column.py | `87.96% <0.00%> (-0.46%)` | ⬇️ |
| python/dask_cudf/dask_cudf/core.py | `73.72% <0.00%> (-0.41%)` | ⬇️ |
| python/cudf/cudf/io/parquet.py | `90.45% <0.00%> (-0.39%)` | ⬇️ |
| ... and 48 more | | |

vuule · 2022-10-28T05:45:16Z

rerun tests

galipremsagar

🔥 Awesome

vuule · 2022-11-01T20:16:35Z

CC @jlowe in case you want to check if the new behavior matches the expectations

vyasr · 2022-11-02T23:34:19Z

I started trying to give this a quick review but I don't understand the existing logic well enough. It might be best if we had an additional reviewer familiar with this part of the code; I could figure it out, but it may not be the most time-effective method here and I don't want to hold up progress. If there isn't a better person though let me know and I can give this another shot, I'll just need to familiarize myself much more with how the reader works.

jlowe · 2022-11-03T15:16:16Z

The behavior validated in the new tests seems OK to me, but I'm wondering what happens when the specified schema is not a subset of the file? For example, if the user says the file schema is a: string, b: string, c: string, d: string and asks to load columns b and d, what happens if there's only two columns in the file data? Spark would load the second column's data for b and nulls for d in that case.
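The Spark behavior described in this comment can be illustrated with a small pure-Python sketch. It is neither Spark nor cuDF code; the helper name `select_with_null_fill` is hypothetical, and the thread does not say whether the CSV reader in this PR behaves the same way.

```python
from typing import Dict, List, Optional, Sequence


def select_with_null_fill(
    rows: Sequence[Sequence[str]],
    schema: Sequence[str],
    wanted: Sequence[str],
) -> Dict[str, List[Optional[str]]]:
    """Select columns by name; positions missing from the data become None."""
    out: Dict[str, List[Optional[str]]] = {}
    for name in wanted:
        idx = schema.index(name)  # position of the column in the declared schema
        out[name] = [row[idx] if idx < len(row) else None for row in rows]
    return out


# The declared schema has four columns, but the data has only two.
rows = [["1", "2"], ["3", "4"]]
result = select_with_null_fill(rows, ["a", "b", "c", "d"], ["b", "d"])
# "b" takes the second column's data; "d" has no data and is filled with None.
```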

karthikeyann · 2022-11-04T19:11:28Z

python/cudf/cudf/utils/ioutils.py

@@ -948,7 +948,8 @@
    the column names: if no names are passed, header=0;
    if column names are passed explicitly, header=None.
 names : list of str, default None
-    List of column names to be used.
+    List of column names to be used. Needs to include names of all column in
+    the file, or names of all columns selected using `usecols` (indices only).


Indices only? The pytest example has column names too.

vuule · 2022-11-05

The requirement to have the same number of elements as `usecols` applies only when `usecols` contains indices.
I included the failing cases in the C++ test `UseColsValidation`; the names of the options objects indicate why each case should fail.
I can add comments there to make it clearer. Also, let me know if this comment should be worded differently to be clearer.

karthikeyann · 2022-11-07

"(indices only)" on its own is confusing. Does it mean that `usecols` uses indices, but not column names?

The rest looks good.

vuule · 2022-11-08

Expanded the related comments; I hope they make sense now.

cpp/tests/io/csv_test.cpp

karthikeyann

LGTM 👍

Minor nitpicks. Look good.

cpp/src/io/csv/reader_impl.cu

nvdbaranec

One general comment. There are a number of blocks of mildly complex logic that would be well served by some simple comments on what they do. Examples: lines 702, 773, and 789. A description of what each block is intended for would help with the overall flow of reading.

cpp/src/io/csv/reader_impl.cu

ttnghia · 2022-11-16T19:03:15Z

cpp/src/io/csv/reader_impl.cu

+      reader_opts.get_names().size() == detected_column_names.size() or
+      // Columns are not selected by indices; read first reader_opts.get_names().size() columns
+      unique_use_cols_indexes.empty());
+  auto column_names = opts_have_all_col_names ? reader_opts.get_names() : detected_column_names;


vuule · 2022-11-16

Can't; the names are potentially modified in a few places later in the code (empty names -> "Unnamed: col_index", mangling of duplicates, applying names to selected columns when index-based column selection is used).
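As a rough illustration of why the names cannot stay read-only, here is a sketch of two of the later modifications mentioned in this reply. It is not the libcudf code: the function name `normalize_names` is hypothetical, and the `.N` suffix used for duplicates is an assumption borrowed from pandas-style mangling. Only the "Unnamed: col_index" pattern comes from the comment itself.

```python
from typing import Dict, List, Sequence


def normalize_names(names: Sequence[str]) -> List[str]:
    """Replace empty names and make duplicate names unique."""
    seen: Dict[str, int] = {}
    out: List[str] = []
    for col_index, name in enumerate(names):
        if name == "":
            # Empty names get a placeholder based on the column index.
            name = f"Unnamed: {col_index}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Assumption: later duplicates get a ".N" suffix.
        out.append(name if count == 0 else f"{name}.{count}")
    return out
```

This sketch does not handle the case where a mangled name such as `a.1` collides with a name already present in the input.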

cpp/src/io/csv/reader_impl.cu

vuule · 2022-11-17T00:07:06Z

rerun tests

vuule · 2022-11-17T05:58:13Z

@gpucibot merge

vuule added 4 commits October 27, 2022 16:35

fixes

2a7b02d

C++ tests fixes

a079b30

Python test

20dd06a

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …

f1f738b

…bug-read_csv-usecols-names

vuule added labels Oct 27, 2022: bug (Something isn't working), cuIO (cuIO issue), non-breaking (Non-breaking change)

vuule self-assigned this Oct 27, 2022

github-actions bot added labels Oct 27, 2022: Python (Affects Python cuDF API.), libcudf (Affects libcudf (C++/CUDA) code.)

docs

053e9e8

vuule changed the title ~~Fix interactions between usecols and names options in read_csv~~ Fix issues when both usecols and names options are used in read_csv Oct 28, 2022

galipremsagar approved these changes Oct 28, 2022

vuule added 4 commits October 28, 2022 16:42

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …

c3b8df8

…bug-read_csv-usecols-names

allow partial column names w/o selection

6fd14a5

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …

2a2c3a3

…bug-read_csv-usecols-names

Spark corner case

f04e251

vuule marked this pull request as ready for review November 1, 2022 17:35

vuule requested review from a team as code owners November 1, 2022 17:35

vuule requested review from vyasr, isVoid and karthikeyann November 1, 2022 17:35

karthikeyann reviewed Nov 4, 2022

vuule requested a review from karthikeyann November 8, 2022 21:41

vuule added 2 commits November 8, 2022 17:10

revert unrelated fix

fb7a3e4

Merge branch 'branch-22.12' into bug-read_csv-usecols-names

40c3a7a

karthikeyann approved these changes Nov 15, 2022

vuule added 4 commits November 15, 2022 22:21

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …

1833bf1

…bug-read_csv-usecols-names

review suggestions

905ba70

update tests to new empty column type

ec848a2

Merge branch 'bug-read_csv-usecols-names' of https://github.com/vuule…

fef4dd4

…/cudf into bug-read_csv-usecols-names

nvdbaranec reviewed Nov 16, 2022

vyasr reviewed Nov 16, 2022

ttnghia reviewed Nov 16, 2022

ttnghia reviewed Nov 16, 2022

vuule and others added 4 commits November 16, 2022 11:33

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …

008464d

…bug-read_csv-usecols-names

CTAD

ec9fe6a

Co-authored-by: Vyas Ramasubramani <[email protected]>

Merge branch 'bug-read_csv-usecols-names' of https://github.com/vuule…

b9d80a3

…/cudf into bug-read_csv-usecols-names

review suggestions

7039f22

vuule requested review from ttnghia and vyasr November 16, 2022 20:21

add comments

b8d16e7

vyasr approved these changes Nov 16, 2022

ttnghia approved these changes Nov 16, 2022

vuule added the label 5 - Ready to Merge (Testing and reviews complete, ready to merge) Nov 16, 2022

nvdbaranec approved these changes Nov 16, 2022

rapids-bot bot merged commit 6de2c4e into rapidsai:branch-22.12 Nov 17, 2022

vuule deleted the bug-read_csv-usecols-names branch November 17, 2022 06:08
