-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add support for category
dtypes in CSV reader
#12571
Conversation
Codecov ReportBase: 86.58% // Head: 85.71% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-23.02 #12571 +/- ##
================================================
- Coverage 86.58% 85.71% -0.87%
================================================
Files 155 155
Lines 24368 24865 +497
================================================
+ Hits 21098 21312 +214
- Misses 3270 3553 +283
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this seems fine to me
python/cudf/cudf/tests/test_csv.py
Outdated
actual = cudf.read_csv(StringIO(csv_buf), dtype=dtype) | ||
expected = pd.read_csv(StringIO(csv_buf), dtype=dtype) | ||
|
||
assert_eq(expected, actual) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
happy to see that direct comparison works here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good - just wondering if our type checks can be a little more tightly constrained.
@@ -429,6 +431,25 @@ def read_csv( | |||
column_names=meta_names | |||
)) | |||
|
|||
if dtype is not None: | |||
if isinstance(dtype, abc.Mapping): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for this to be a generic Mapping
, or is it guaranteed to be a dict
? If you only care about dict
s and not dict
-like objects, then avoid the abstract class check in favor of isinstance(dtype, dict)
. It can be considerably faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this might come from unsanitized user input with random dict-like classes as the dtype
...? Maybe this is the best we can do (here and in the other thread below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our check's aren't much tighter checking for dict
because in pandas dtype
is super overloaded param, where for one example they support defaultdict
too: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
): | ||
if cudf.api.types.is_categorical_dtype(dtype): | ||
df = df.astype(dtype) | ||
elif isinstance(dtype, abc.Collection): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can dtypes be this broad? Is there a tighter constraint like "list
or tuple
" that excludes the cases handled above? Just trying to avoid the abstract type check as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answered above
/merge |
Description
Fixes: #11977, #3960
This PR enables support for
category
dtypes indtype
parameter. This PR contains a workaround that enables reading columns as categorical dtypes, we can remove this workaround oncelibcudf
has native support for dictionary type mapping to categorical columns.Checklist