[REVIEW] Add support for category dtypes in CSV reader #12571

Merged
merged 8 commits into from
Jan 21, 2023

Conversation

galipremsagar
Contributor

Description

Fixes: #11977, #3960

This PR enables support for category dtypes in the dtype parameter of the CSV reader. It contains a workaround that enables reading columns as categorical dtypes; we can remove this workaround once libcudf has native support for mapping its dictionary type to categorical columns.
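
The "read first, convert to categorical afterwards" shape of such a workaround can be illustrated without a GPU. The following is a minimal standard-library sketch, not the cuDF implementation: `read_column_as_category` is a hypothetical helper, and sorting the unique labels to assign codes is an assumption made only for this example.

```python
import csv
from io import StringIO


def read_column_as_category(csv_text, column):
    """Read one CSV column as plain strings, then dictionary-encode it.

    Hypothetical helper for illustration only; it is not part of cuDF.
    """
    rows = csv.DictReader(StringIO(csv_text))
    values = [row[column] for row in rows]       # step 1: read as strings
    categories = sorted(set(values))             # step 2: unique labels (sorted by assumption)
    code_of = {label: i for i, label in enumerate(categories)}
    codes = [code_of[v] for v in values]         # step 3: one integer code per row
    return categories, codes


csv_buf = "a,b\n1,x\n2,y\n3,x\n"
categories, codes = read_column_as_category(csv_buf, "b")
print(categories)  # ['x', 'y']
print(codes)       # [0, 1, 0]
```

A categorical column is essentially this pair: a small list of unique labels plus one integer code per row.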

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar added the labels 3 - Ready for Review (Ready for review by team), Python (Affects Python cuDF API), 4 - Needs cuDF (Python) Reviewer, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change) on Jan 18, 2023
@galipremsagar requested a review from a team as a code owner January 18, 2023 19:54
@galipremsagar self-assigned this Jan 18, 2023
@galipremsagar requested a review from vuule January 18, 2023 19:55

codecov bot commented Jan 18, 2023

Codecov Report

Base: 86.58% // Head: 85.71% // Decreases project coverage by -0.87% ⚠️

Coverage data is based on head (ef8ed19) compared to base (b6dccb3).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-23.02   #12571      +/-   ##
================================================
- Coverage         86.58%   85.71%   -0.87%     
================================================
  Files               155      155              
  Lines             24368    24865     +497     
================================================
+ Hits              21098    21312     +214     
- Misses             3270     3553     +283     
| Impacted Files | Coverage Δ |
| --- | --- |
| python/cudf/cudf/_version.py | 1.41% <0.00%> (-98.59%) ⬇️ |
| python/cudf/cudf/core/buffer/spill_manager.py | 72.50% <0.00%> (-7.50%) ⬇️ |
| python/cudf/cudf/core/buffer/spillable_buffer.py | 91.07% <0.00%> (-1.78%) ⬇️ |
| python/cudf/cudf/utils/dtypes.py | 77.85% <0.00%> (-1.61%) ⬇️ |
| python/cudf/cudf/options.py | 86.11% <0.00%> (-1.59%) ⬇️ |
| python/cudf/cudf/core/single_column_frame.py | 94.30% <0.00%> (-1.27%) ⬇️ |
| ...ython/custreamz/custreamz/tests/test_dataframes.py | 98.38% <0.00%> (-1.01%) ⬇️ |
| python/dask_cudf/dask_cudf/io/csv.py | 96.34% <0.00%> (-1.00%) ⬇️ |
| python/dask_cudf/dask_cudf/io/parquet.py | 91.81% <0.00%> (-0.59%) ⬇️ |
| python/cudf/cudf/core/multiindex.py | 91.66% <0.00%> (-0.51%) ⬇️ |
| ... and 45 more | |


Contributor

@vuule vuule left a comment

FWIW this seems fine to me

actual = cudf.read_csv(StringIO(csv_buf), dtype=dtype)
expected = pd.read_csv(StringIO(csv_buf), dtype=dtype)

assert_eq(expected, actual)
Contributor

happy to see that direct comparison works here.

python/cudf/cudf/tests/test_csv.py (resolved)
@galipremsagar requested a review from vuule January 19, 2023 02:38
Contributor

@bdice bdice left a comment

Changes look good - just wondering if our type checks can be a little more tightly constrained.

@@ -429,6 +431,25 @@ def read_csv(
        column_names=meta_names
    ))

    if dtype is not None:
        if isinstance(dtype, abc.Mapping):
Contributor

Is it possible for this to be a generic Mapping, or is it guaranteed to be a dict? If you only care about dicts and not dict-like objects, then avoid the abstract class check in favor of isinstance(dtype, dict). It can be considerably faster.
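
The speed difference mentioned here can be measured locally. `isinstance` against an abstract base class goes through `ABCMeta.__instancecheck__`, while a check against the concrete `dict` type does not. The snippet below is a rough micro-benchmark sketch; the `dtype` value and the helper names `check_dict` and `check_mapping` are made up for the example, and no particular ratio is claimed since timings vary by machine and Python version.

```python
import timeit
from collections import abc

# Example argument; any plain dict would do.
dtype = {"a": "category", "b": "int64"}


def check_dict():
    # Concrete type check.
    return isinstance(dtype, dict)


def check_mapping():
    # Abstract base class check, dispatched through ABCMeta.
    return isinstance(dtype, abc.Mapping)


n = 100_000
t_dict = timeit.timeit(check_dict, number=n)
t_mapping = timeit.timeit(check_mapping, number=n)
print(f"isinstance(dtype, dict):        {t_dict:.4f} s for {n} calls")
print(f"isinstance(dtype, abc.Mapping): {t_mapping:.4f} s for {n} calls")
```

Since this check runs once per `read_csv` call rather than once per row, the absolute cost is likely small next to the parsing work either way.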

Contributor

I suppose this might come from unsanitized user input with random dict-like classes as the dtype...? Maybe this is the best we can do (here and in the other thread below).

Contributor Author

Our checks aren't tightened to check only for dict because in pandas dtype is a heavily overloaded parameter; for example, pandas supports defaultdict too: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
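
For reference, a quick standard-library check of which dict-like classes each test accepts. One detail worth noting: `defaultdict` and `OrderedDict` are subclasses of `dict`, so they pass both checks; the objects that only `abc.Mapping` accepts are ones such as `UserDict`, `ChainMap`, and `MappingProxyType`. Whether callers actually pass such objects as `dtype` is not established in this thread; the `candidates` below are purely illustrative.

```python
from collections import ChainMap, OrderedDict, UserDict, abc, defaultdict
from types import MappingProxyType

# Illustrative dict-like values a caller might pass as `dtype`.
candidates = {
    "dict": {"a": "category"},
    "defaultdict": defaultdict(lambda: "category"),
    "OrderedDict": OrderedDict(a="category"),
    "UserDict": UserDict(a="category"),
    "ChainMap": ChainMap({"a": "category"}),
    "MappingProxyType": MappingProxyType({"a": "category"}),
}

# For each candidate: (passes the dict check, passes the abc.Mapping check).
results = {
    name: (isinstance(obj, dict), isinstance(obj, abc.Mapping))
    for name, obj in candidates.items()
}

for name, (is_dict, is_mapping) in results.items():
    print(f"{name:18} dict={is_dict!s:5} Mapping={is_mapping}")
```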

        ):
            if cudf.api.types.is_categorical_dtype(dtype):
                df = df.astype(dtype)
        elif isinstance(dtype, abc.Collection):
Contributor

Can dtypes be this broad? Is there a tighter constraint like "list or tuple" that excludes the cases handled above? Just trying to avoid the abstract type check as above.

Contributor Author

Answered above
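
One reason the order of these branches matters: `abc.Collection` is broad enough to match `dict` and `str` as well as `list` and `tuple`, so the mapping and scalar cases have to be handled before the `Collection` branch. The sketch below shows that ordering in isolation. `classify_dtype_argument` is a hypothetical helper, and the explicit `str`/`bytes` test stands in for the scalar branch of the diff, whose full condition is not shown in the hunk above.

```python
from collections import abc


def classify_dtype_argument(dtype):
    """Classify a `dtype` argument by shape (hypothetical helper, not cuDF code)."""
    if dtype is None:
        return "none"
    if isinstance(dtype, abc.Mapping):
        # Must come first: every Mapping is also a Collection.
        return "mapping"
    if isinstance(dtype, (str, bytes)):
        # Assumption for this sketch: strings are single dtypes, even though
        # they also satisfy abc.Collection.
        return "scalar"
    if isinstance(dtype, abc.Collection):
        return "collection"
    return "scalar"


print(classify_dtype_argument({"a": "category"}))      # mapping
print(classify_dtype_argument("category"))             # scalar
print(classify_dtype_argument(["int64", "category"]))  # collection
print(isinstance("category", abc.Collection))          # True
```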

@galipremsagar added the label 5 - Ready to Merge (Testing and reviews complete, ready to merge) and removed the labels 3 - Ready for Review (Ready for review by team), 4 - Needs cuDF (Python) Reviewer on Jan 20, 2023
@galipremsagar
Contributor Author

/merge

Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Python (Affects Python cuDF API)
Development

Successfully merging this pull request may close these issues.

[FEA] CategoricalDtype as dtype in CSV reader
3 participants