Fix reading of CSV files with blank second row #12098

vuule · 2022-11-09T02:12:02Z

Description

There are two ways to get the column names of a CSV file: from the header, or from the first row. When the first row is used, the names are generated, and the only information taken from that row is the number of detected columns.

This PR fixes the corner case where a blank line after the first (non-header) row causes the reader to detect an additional column (and return an additional column of nulls).
The fix is to stop counting at the first terminator character within the first row; a terminator only shows up there when blank row(s) follow the first data row. The reader already does this when reading column names from a header; this PR removes the difference in behavior that caused the bug.
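
To make the corner case concrete, here is a minimal Python sketch of counting columns in a first-row buffer, with and without stopping at the terminator. It is not the libcudf implementation: the function name `count_columns`, the `stop_at_terminator` flag, and the assumption that the first-row buffer can include the trailing blank line are all hypothetical and used only for illustration.

```python
def count_columns(first_row, delimiter=",", terminator="\n",
                  quotechar='"', stop_at_terminator=True):
    """Count fields in a first-row buffer (illustrative sketch only).

    `stop_at_terminator` is a hypothetical switch: True models the
    behavior described in this PR, False models the old behavior.
    """
    count = 0
    in_quotes = False
    last = len(first_row) - 1
    for pos, ch in enumerate(first_row):
        # Toggle quoting; separators inside quotes do not end a field.
        if ch == quotechar:
            in_quotes = not in_quotes
        at_terminator = not in_quotes and ch == terminator
        at_delimiter = not in_quotes and ch == delimiter
        # A field ends at a delimiter, a terminator, or the end of the buffer.
        if at_terminator or at_delimiter or pos == last:
            count += 1
        # Stop at the first row terminator so blank rows are not counted.
        if stop_at_terminator and at_terminator:
            break
    return count


# Assumed buffer contents: the first data row followed by one blank line.
row = "1,2,3\n\n"
print(count_columns(row, stop_at_terminator=False))  # 4 (extra column)
print(count_columns(row))                            # 3
```

With the early exit, the second terminator is never visited, so the blank line no longer contributes a spurious column.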

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

codecov · 2022-11-09T04:34:36Z

Codecov Report

Base: 87.47% // Head: 88.10% // Increases project coverage by +0.62% 🎉

Coverage data is based on head (2ba86a6) compared to base (f817d96).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.12   #12098      +/-   ##
================================================
+ Coverage         87.47%   88.10%   +0.62%     
================================================
  Files               133      135       +2     
  Lines             21826    22057     +231     
================================================
+ Hits              19093    19433     +340     
+ Misses             2733     2624     -109

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/interval.py | `85.45% <0.00%> (-9.10%)` | ⬇️ |
| python/cudf/cudf/io/text.py | `91.66% <0.00%> (-8.34%)` | ⬇️ |
| python/cudf/cudf/core/_base_index.py | `81.28% <0.00%> (-4.27%)` | ⬇️ |
| python/cudf/cudf/io/json.py | `92.06% <0.00%> (-2.68%)` | ⬇️ |
| python/cudf/cudf/utils/utils.py | `89.91% <0.00%> (-0.69%)` | ⬇️ |
| python/cudf/cudf/core/column/timedelta.py | `90.17% <0.00%> (-0.58%)` | ⬇️ |
| python/cudf/cudf/core/column/datetime.py | `89.21% <0.00%> (-0.51%)` | ⬇️ |
| python/cudf/cudf/core/column/column.py | `87.96% <0.00%> (-0.46%)` | ⬇️ |
| python/dask_cudf/dask_cudf/core.py | `73.72% <0.00%> (-0.41%)` | ⬇️ |
| python/cudf/cudf/io/parquet.py | `90.45% <0.00%> (-0.39%)` | ⬇️ |
| ... and 42 more | | |


davidwendt

LGTM

vuule · 2022-11-09T22:40:11Z

@gpucibot merge

vuule added 2 commits November 8, 2022 17:59

stop at /n when getting column names w/o header

1ac14b1

tests

2ba86a6

vuule added the bug, cuIO, and non-breaking labels Nov 9, 2022

vuule self-assigned this Nov 9, 2022

github-actions bot added the libcudf label Nov 9, 2022

vuule marked this pull request as ready for review November 9, 2022 19:31

vuule requested a review from a team as a code owner November 9, 2022 19:31

vuule requested review from elstehle and davidwendt November 9, 2022 19:31

davidwendt approved these changes Nov 9, 2022


ttnghia approved these changes Nov 9, 2022


rapids-bot bot merged commit 4de279d into rapidsai:branch-22.12 Nov 9, 2022

vuule deleted the bug-read_csv-blank-after-first-row branch November 9, 2022 22:58
