Fixes bug in csv_reader_options construction in cython #12021

karthikeyann
Contributor

Description

Fixes bug in csv_reader_options construction in cython
The user-supplied `false_values` for CSV were not passed to `csv_reader_options` during construction in the Cython code. This PR fixes that and adds a unit test.
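
As a rough illustration of what such a unit test guards against, the sketch below models an options object in plain Python. `ReaderOptions` and `build_options` are hypothetical stand-ins invented for this example; they are not the cuDF Cython code or the libcudf `csv_reader_options` API. The only point is that `false_values` has to reach the options object from its own argument, independently of `true_values`.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class ReaderOptions:
    """Hypothetical stand-in for a CSV reader options object."""

    true_values: List[str] = field(default_factory=list)
    false_values: List[str] = field(default_factory=list)


def build_options(
    true_values: Optional[Iterable[str]] = None,
    false_values: Optional[Iterable[str]] = None,
) -> ReaderOptions:
    """Copy the user-supplied literals into a fresh options object."""
    options = ReaderOptions()
    if true_values is not None:
        options.true_values = [str(v) for v in true_values]
    # The false literals must be forwarded as well, from their own list.
    if false_values is not None:
        options.false_values = [str(v) for v in false_values]
    return options


opts = build_options(true_values=["yes"], false_values=["no"])
```

A regression test along these lines would assert that `opts.false_values` contains exactly what the caller passed, rather than being empty or a copy of the true values.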

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added the bug, 3 - Ready for Review, Python, 4 - Needs cuDF (Python) Reviewer, and non-breaking labels Oct 28, 2022
@karthikeyann karthikeyann requested a review from a team as a code owner October 28, 2022 09:53
@karthikeyann karthikeyann self-assigned this Oct 28, 2022
@karthikeyann karthikeyann changed the title fix bug in csv_reader_options construction in cython Fixes bug in csv_reader_options construction in cython Oct 28, 2022
@codecov
codecov bot commented Oct 28, 2022

Codecov Report

Base: 87.47% // Head: 88.11% // Increases project coverage by +0.63% 🎉

Coverage data is based on head (3341468) compared to base (f817d96).
Patch has no changes to coverable lines.

❗ Current head 3341468 differs from pull request most recent head 72a78be. Consider uploading reports for the commit 72a78be to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12021      +/-   ##
================================================
+ Coverage         87.47%   88.11%   +0.63%     
================================================
  Files               133      133              
  Lines             21826    22003     +177     
================================================
+ Hits              19093    19388     +295     
+ Misses             2733     2615     -118     
| Impacted Files | Coverage Δ |
| --- | --- |
| python/cudf/cudf/io/text.py | 91.66% <0.00%> (-8.34%) ⬇️ |
| python/cudf/cudf/core/_base_index.py | 81.28% <0.00%> (-4.27%) ⬇️ |
| python/cudf/cudf/io/json.py | 92.06% <0.00%> (-2.68%) ⬇️ |
| python/cudf/cudf/utils/utils.py | 89.91% <0.00%> (-0.69%) ⬇️ |
| python/dask_cudf/dask_cudf/core.py | 73.72% <0.00%> (-0.41%) ⬇️ |
| python/cudf/cudf/io/parquet.py | 90.45% <0.00%> (-0.39%) ⬇️ |
| python/dask_cudf/dask_cudf/backends.py | 84.90% <0.00%> (-0.37%) ⬇️ |
| python/cudf/cudf/core/column/numerical.py | 95.21% <0.00%> (-0.29%) ⬇️ |
| python/cudf/cudf/io/orc.py | 92.94% <0.00%> (-0.09%) ⬇️ |
| python/cudf/cudf/core/dataframe.py | 93.67% <0.00%> (-0.06%) ⬇️ |

... and 25 more

Contributor

@vyasr vyasr left a comment

The fix here looks fine. I'll let the other reviewers follow up on the question of testing.

bool literals give parsing errors as int
"0" and "1" give parsing errors as bool in pandas
@karthikeyann
Contributor Author

Pandas behaviour:
The bool literals `true` and `false`, and any user-passed `true_values` and `false_values`, are not considered valid literals when parsing as integer.
Also, "0" is not considered false when parsing as bool, and likewise "1" is not considered true.

Our CSV parser:
Bool literals are converted to int.
"0" is considered false and "1" is considered true.

These are the differences between our parser and the pandas parser. We should decide where we want to deviate from pandas.
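
To make the two policies concrete, here is a toy model in plain Python. It is a sketch of the behaviour described in this comment only, not the pandas or libcudf implementation: the literal sets are assumptions, and a failed parse is modelled as `None` rather than as an exception or a null.

```python
from typing import Optional

# Assumed literal sets, for illustration only.
TRUE_LITERALS = {"true", "True", "TRUE"}
FALSE_LITERALS = {"false", "False", "FALSE"}


def parse_bool_strict(text: str) -> Optional[bool]:
    """Strict policy: only bool literals are bools; "0" and "1" are not."""
    if text in TRUE_LITERALS:
        return True
    if text in FALSE_LITERALS:
        return False
    return None  # parse failure


def parse_bool_lenient(text: str) -> Optional[bool]:
    """Lenient policy: "0" and "1" are also accepted as bools."""
    if text == "1":
        return True
    if text == "0":
        return False
    return parse_bool_strict(text)


def parse_int_strict(text: str) -> Optional[int]:
    """Strict policy: bool literals are not valid integers."""
    try:
        return int(text)
    except ValueError:
        return None  # parse failure


def parse_int_lenient(text: str) -> Optional[int]:
    """Lenient policy: bool literals are converted to 0 or 1."""
    as_bool = parse_bool_strict(text)
    if as_bool is not None:
        return int(as_bool)
    return parse_int_strict(text)
```

Under this model the strict functions reject "0" as a bool and "true" as an integer, while the lenient functions accept both.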

@karthikeyann karthikeyann requested review from vyasr and bdice November 2, 2022 08:06
python/cudf/cudf/_lib/csv.pyx (resolved)
@galipremsagar
Contributor

Pandas behaviour:
The bool literals `true` and `false`, and any user-passed `true_values` and `false_values`, are not considered valid literals when parsing as integer.
Also, "0" is not considered false when parsing as bool, and likewise "1" is not considered true.

The pandas behaviour seems too restrictive here. I'm okay with leaving our behaviour as is; I think it's fine for our CSV reader to have some flexibility.

@karthikeyann
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a3d2276 into rapidsai:branch-22.12 Nov 2, 2022
rapids-bot bot pushed a commit that referenced this pull request Nov 15, 2022
This PR cleans up the parsing code shared by the nested JSON reader and the CSV reader.
- Uses `std::optional` to indicate parsing failure in `parse_numeric`
- Cleanup
  - Removed `decode_value`, since it only provides specializations for timestamp and duration types; all other types pass through unchanged.
  - Unified `decode_digit`
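
The idea of signalling failure through an optional return value can be sketched in Python, where `Optional[int]` plays the role of `std::optional`. This is an analogue written for illustration; the function body below is an assumption and does not reproduce libcudf's `parse_numeric`.

```python
from typing import Optional


def parse_numeric(text: str) -> Optional[int]:
    """Return the parsed integer, or None if the text is not a valid integer."""
    digits = text.strip()
    sign = 1
    if digits[:1] in ("+", "-"):
        sign = -1 if digits[0] == "-" else 1
        digits = digits[1:]
    # Reject empty input and anything containing a non-digit character.
    if not digits or not all("0" <= ch <= "9" for ch in digits):
        return None
    value = 0
    for ch in digits:
        value = value * 10 + (ord(ch) - ord("0"))
    return sign * value
```

Because failure is `None` rather than a sentinel such as 0, a field that legitimately contains "0" stays distinguishable from one that could not be parsed.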

Depends on #11898 and #12021

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #12022
@vyasr vyasr added the 4 - Needs Review label and removed the 4 - Needs cuDF (Python) Reviewer label Feb 23, 2024
Labels
3 - Ready for Review, 4 - Needs Review, bug, non-breaking, Python
5 participants