[REVIEW] Add support for `category` dtypes in CSV reader #12571

galipremsagar · 2023-01-18T19:54:58Z

Description

Fixes: #11977, #3960

This PR enables support for `category` dtypes in the `dtype` parameter of the CSV reader. It contains a workaround that enables reading columns as categorical dtypes; we can remove this workaround once libcudf has native support for mapping dictionary types to categorical columns.
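As a conceptual illustration only, the sketch below shows what "dictionary type mapping to categorical columns" amounts to: a column is stored as a list of unique categories plus integer codes pointing into it. It uses only the Python standard library and is not the libcudf or cuDF implementation. The helper name `read_column_as_categories` and the choice of `-1` as the code for empty fields are assumptions made for this example.

```python
import csv
from io import StringIO


def read_column_as_categories(csv_text, column):
    """Dictionary-encode one CSV column: return (categories, codes).

    Hypothetical helper for illustration. Empty fields are treated as
    nulls and receive the code -1 (an assumption of this sketch).
    """
    rows = list(csv.DictReader(StringIO(csv_text)))
    values = [row[column] for row in rows]
    # The sorted unique non-null values form the "dictionary".
    categories = sorted({v for v in values if v != ""})
    code_of = {value: code for code, value in enumerate(categories)}
    # Each row stores only a small integer code instead of the value.
    codes = [code_of.get(v, -1) for v in values]
    return categories, codes


csv_buf = "a,b\nx,1\ny,2\nx,3\n,4\n"
categories, codes = read_column_as_categories(csv_buf, "a")
print(categories, codes)
```

In this picture, the workaround described above would correspond to reading the column with a non-categorical type first and encoding it afterwards, rather than having the reader emit the encoded form directly.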

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

codecov · 2023-01-18T21:20:26Z

Codecov Report

Base: 86.58% // Head: 85.71% // Decreases project coverage by -0.87% ⚠️

Coverage data is based on head (ef8ed19) compared to base (b6dccb3).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-23.02   #12571      +/-   ##
================================================
- Coverage         86.58%   85.71%   -0.87%     
================================================
  Files               155      155              
  Lines             24368    24865     +497     
================================================
+ Hits              21098    21312     +214     
- Misses             3270     3553     +283

Impacted Files	Coverage Δ
python/cudf/cudf/_version.py	`1.41% <0.00%> (-98.59%)`	⬇️
python/cudf/cudf/core/buffer/spill_manager.py	`72.50% <0.00%> (-7.50%)`	⬇️
python/cudf/cudf/core/buffer/spillable_buffer.py	`91.07% <0.00%> (-1.78%)`	⬇️
python/cudf/cudf/utils/dtypes.py	`77.85% <0.00%> (-1.61%)`	⬇️
python/cudf/cudf/options.py	`86.11% <0.00%> (-1.59%)`	⬇️
python/cudf/cudf/core/single_column_frame.py	`94.30% <0.00%> (-1.27%)`	⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py	`98.38% <0.00%> (-1.01%)`	⬇️
python/dask_cudf/dask_cudf/io/csv.py	`96.34% <0.00%> (-1.00%)`	⬇️
python/dask_cudf/dask_cudf/io/parquet.py	`91.81% <0.00%> (-0.59%)`	⬇️
python/cudf/cudf/core/multiindex.py	`91.66% <0.00%> (-0.51%)`	⬇️
... and 45 more

vuule

FWIW this seems fine to me

vuule · 2023-01-19T01:04:41Z

python/cudf/cudf/tests/test_csv.py

+    actual = cudf.read_csv(StringIO(csv_buf), dtype=dtype)
+    expected = pd.read_csv(StringIO(csv_buf), dtype=dtype)
+
+    assert_eq(expected, actual)


happy to see that direct comparison works here.

bdice

Changes look good - just wondering if our type checks can be a little more tightly constrained.

bdice · 2023-01-20T22:04:54Z

python/cudf/cudf/_lib/csv.pyx

@@ -429,6 +431,25 @@ def read_csv(
        column_names=meta_names
    ))

+    if dtype is not None:
+        if isinstance(dtype, abc.Mapping):


Is it possible for this to be a generic Mapping, or is it guaranteed to be a dict? If you only care about dicts and not dict-like objects, then avoid the abstract class check in favor of isinstance(dtype, dict). It can be considerably faster.

I suppose this might come from unsanitized user input with random dict-like classes as the dtype...? Maybe this is the best we can do (here and in the other thread below).

Our checks wouldn't be much tighter if we checked for `dict`, because in pandas `dtype` is a heavily overloaded parameter. For example, it supports `defaultdict` too: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
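For context on the trade-off discussed in this thread, here is a small standard-library sketch (not code from the PR) comparing which objects pass `isinstance(obj, dict)` and which pass `isinstance(obj, abc.Mapping)`. The sample objects are chosen for illustration; which of them cuDF or pandas users actually pass as `dtype` is not something this thread establishes.

```python
from collections import UserDict, defaultdict
from collections.abc import Mapping
from types import MappingProxyType

# Illustrative dict-like objects a caller might pass as `dtype`.
candidates = {
    "dict": {"a": "category"},
    "defaultdict": defaultdict(lambda: "category"),
    "MappingProxyType": MappingProxyType({"a": "category"}),
    "UserDict": UserDict({"a": "category"}),
}

# name -> (passes the dict check, passes the abstract Mapping check)
results = {
    name: (isinstance(obj, dict), isinstance(obj, Mapping))
    for name, obj in candidates.items()
}

for name, (is_dict, is_mapping) in results.items():
    print(f"{name:17} dict={is_dict!s:5} Mapping={is_mapping}")
```

A `defaultdict` is a `dict` subclass, so it passes both checks. The abstract check additionally admits dict-like objects that do not inherit from `dict`, such as `MappingProxyType` and `UserDict`.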

bdice · 2023-01-20T22:07:58Z

python/cudf/cudf/_lib/csv.pyx

+        ):
+            if cudf.api.types.is_categorical_dtype(dtype):
+                df = df.astype(dtype)
+        elif isinstance(dtype, abc.Collection):


Can dtypes be this broad? Is there a tighter constraint like "list or tuple" that excludes the cases handled above? Just trying to avoid the abstract type check as above.

Answered above
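One detail relevant to the `abc.Collection` branch is that a plain string is itself a `Collection`, so a scalar dtype such as `"category"` has to be tested before the collection case. The sketch below illustrates that ordering with a hypothetical helper, `classify_dtype_arg`. The scalar condition in the PR is elided in the diff excerpt above, so the plain `str` check used here is an assumption, not the PR's code.

```python
from collections import abc


def classify_dtype_arg(dtype):
    """Hypothetical helper: name the branch a `dtype` argument would take."""
    if dtype is None:
        return "none"
    if isinstance(dtype, abc.Mapping):
        # Per-column dtypes, e.g. {"a": "category"}.
        return "mapping"
    if isinstance(dtype, str):
        # str is a Collection too, so it must be handled first.
        return "scalar"
    if isinstance(dtype, abc.Collection):
        # e.g. a list or tuple with one dtype per column.
        return "collection"
    # Any other single dtype-like object.
    return "scalar"


for arg in ({"a": "category"}, "category", ["category", "int32"], None):
    print(repr(arg), "->", classify_dtype_arg(arg))
```

Swapping the `str` and `Collection` checks would send `"category"` down the collection branch, which is why the order of the abstract checks matters here.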

galipremsagar · 2023-01-21T07:53:47Z

/merge

galipremsagar added 2 commits January 18, 2023 11:17

enable category dtype in read_csv

8c804d4

enable more tests

3647c15

galipremsagar added the 3 - Ready for Review, Python, 4 - Needs cuDF (Python) Reviewer, improvement, and non-breaking labels Jan 18, 2023

galipremsagar requested a review from a team as a code owner January 18, 2023 19:54

galipremsagar self-assigned this Jan 18, 2023

galipremsagar requested review from bdice and brandon-b-miller January 18, 2023 19:55

galipremsagar requested a review from vuule January 18, 2023 19:55

vuule reviewed Jan 19, 2023

galipremsagar added 2 commits January 18, 2023 18:36

add nulls

157326f

update

ba6b161

galipremsagar requested a review from vuule January 19, 2023 02:38

galipremsagar added 2 commits January 18, 2023 20:38

Merge branch 'branch-23.02' into 11977

780e078

Merge branch 'branch-23.02' into 11977

d049605

vuule approved these changes Jan 20, 2023

galipremsagar removed the 4 - Needs cuIO Reviewer label Jan 20, 2023

bdice approved these changes Jan 20, 2023

galipremsagar added the 5 - Ready to Merge label and removed the 3 - Ready for Review and 4 - Needs cuDF (Python) Reviewer labels Jan 20, 2023

Merge remote-tracking branch 'upstream/branch-23.02' into 11977

0ee22cf

Merge branch '11977' of https://github.com/galipremsagar/cudf into 11977

ef8ed19

rapids-bot bot merged commit 11f90d1 into rapidsai:branch-23.02 Jan 21, 2023

galipremsagar mentioned this pull request Jan 21, 2023

[BUG] Category dtype gives unexpected hashed values for int32 when reading from CSV #3960

Closed

mroeschke mentioned this pull request Feb 27, 2024

Use less _is_categorical_dtype #15148

Merged

3 tasks

mroeschke mentioned this pull request Dec 5, 2024

Remove cudf._lib.csv in favor in inlining pylibcudf #17485

Merged

3 tasks
