Allow users to specify data types for a subset of columns in `read_csv` #10484

vuule · 2022-03-22T19:10:19Z

The CSV reader previously assumed that the user specifies data types either for all columns or for none.
This PR changes the logic so that users can pass a map/dictionary to specify types for any subset of columns, and the reader infers the types of the remaining columns.
When passing types as an array, users still need to specify all columns' types, because the array becomes ambiguous when reading only a subset of the columns in the file.
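
The rule described above can be sketched in plain Python. This is only an illustration of the behavior, not the libcudf implementation: `resolve_dtypes`, `infer_from_values`, and the type names are hypothetical, and the choice to ignore map entries for columns that are not being read is an assumption of this sketch.

```python
from collections.abc import Mapping


def infer_from_values(values):
    """Toy type inference (hypothetical): int64, then float64, else str."""
    for type_name, parse in (("int64", int), ("float64", float)):
        try:
            for value in values:
                parse(value)
            return type_name
        except ValueError:
            continue
    return "str"


def resolve_dtypes(column_names, user_dtypes, samples):
    """Combine user-specified and inferred types for the selected columns.

    column_names: names of the columns being read, in order.
    user_dtypes:  None, a mapping {name: dtype}, or a sequence of dtypes.
    samples:      {name: list of raw string values} used for inference.
    """
    if user_dtypes is None:
        # No types given: infer everything.
        return {name: infer_from_values(samples[name]) for name in column_names}
    if isinstance(user_dtypes, Mapping):
        # Map: use the given type where present, infer the rest.
        # Assumption: entries for columns that are not read are ignored.
        return {
            name: user_dtypes[name]
            if name in user_dtypes
            else infer_from_values(samples[name])
            for name in column_names
        }
    # Array: positions carry no names, so every column needs a type.
    if len(user_dtypes) != len(column_names):
        raise ValueError("a dtype array must specify a type for every column")
    return dict(zip(column_names, user_dtypes))


samples = {"a": ["1", "2"], "b": ["3", "4"], "c": ["1.5", "2"]}
partial = resolve_dtypes(["a", "b", "c"], {"b": "str"}, samples)
```

Here `partial` keeps the user's `str` for column `b` and infers the other two columns, while passing a one-element list for three columns raises `ValueError`.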

bdice

Looks great overall! Only one minor change request.

cpp/src/io/csv/reader_impl.cu

galipremsagar · 2022-03-22T20:40:22Z

Is this intended for 22.04 or 22.06? Asking because the PR and the issue are on two different boards.

vuule · 2022-03-22T21:52:32Z

> is this intended for 22.04 or 22.06? - Asking this because the PR and issue are on two different boards.

I think it's a bit late for 22.04; I would prefer to leave this for 22.06 unless the issue is critical. I did not get a comment from @shwina about the severity.

shwina · 2022-03-23T17:26:52Z

Apologies for missing the pings about this one. 22.06 should be fine.

codecov · 2022-03-24T21:17:48Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.06@8d86ae8).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.06   #10484   +/-   ##
===============================================
  Coverage                ?   86.31%           
===============================================
  Files                   ?      140           
  Lines                   ?    22312           
  Branches                ?        0           
===============================================
  Hits                    ?    19259           
  Misses                  ?     3053           
  Partials                ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8d86ae8...3d0809d.

vuule · 2022-03-24T21:46:56Z

@gpucibot merge

vuule added 6 commits March 21, 2022 18:36

passing old tests

9963dff

Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …

fb9c670

…fea-csv-allow-partial-dtype

add test; fix for empty dataframes

960d21b

bit more testing

a55f6db

param order; host_span

2d3fa2a

clean up

a10481e

vuule added the bug, cuIO, and non-breaking labels Mar 22, 2022

vuule requested a review from a team as a code owner March 22, 2022 19:10

vuule self-assigned this Mar 22, 2022

vuule requested review from bdice and ttnghia March 22, 2022 19:10

github-actions bot added the libcudf label Mar 22, 2022

rename

34819d0

bdice approved these changes Mar 22, 2022

cpp/src/io/csv/reader_impl.cu (outdated, resolved)

bdice reviewed Mar 22, 2022

cpp/src/io/csv/reader_impl.cu (outdated, resolved)

Apply suggestions from code review

6bb668e

Co-authored-by: Bradley Dice <[email protected]>

ttnghia approved these changes Mar 23, 2022

vuule added 3 commits March 24, 2022 12:03

Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …

5296765

…fea-csv-allow-partial-dtype

Merge branch 'branch-22.06' of https://github.com/rapidsai/cudf into …

58cf16c

…fea-csv-allow-partial-dtype

as_const to match the pointer type

3d0809d

rapids-bot bot merged commit 3a16a7f into rapidsai:branch-22.06 Mar 24, 2022

vuule deleted the fea-csv-allow-partial-dtype branch March 24, 2022 21:47
