First check for BaseDtype when inferring the data type of an arbitrary object (#13295)

We have an internal utility called `dtype()` that attempts to infer the data type of an arbitrary object. One of the first things `dtype()` does is call `np.dtype(obj)`. That can be slow for categorical data types with extremely large cardinality, because it copies data to host (in particular, it attempts to call the object's `__repr__`):

Before this PR:

```python
import cudf

dtype = cudf.CategoricalDtype(categories=range(100_000_000))
%time x = cudf.core.dtypes.dtype(dtype)
CPU times: user 3.75 s, sys: 885 ms, total: 4.64 s
Wall time: 4.63 s
```

This PR ensures we attempt to do far less expensive inference first, before calling `np.dtype(...)`.

After this PR: 

```python
%time x = cudf.core.dtypes.dtype(dtype)
CPU times: user 13 µs, sys: 1 µs, total: 14 µs
Wall time: 19.1 µs
```
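
The ordering change can be illustrated without cuDF. The following is a minimal sketch, not the library's code: `BaseDtype`, `BigCategoricalDtype`, `expensive_fallback`, and `infer_dtype` are hypothetical stand-ins, and the `__repr__` counter only simulates the costly step described above. It shows that putting the cheap `isinstance` check first means the expensive path is never reached for extension dtypes.

```python
# Illustrative stand-ins only; none of these are cuDF classes or functions.
class BaseDtype:
    """Hypothetical base class for extension dtypes."""


class BigCategoricalDtype(BaseDtype):
    """Hypothetical dtype whose repr is costly to build."""

    def __init__(self, n_categories):
        self.n_categories = n_categories
        self.repr_calls = 0

    def __repr__(self):
        # Stand-in for an expensive operation, such as a device-to-host copy.
        self.repr_calls += 1
        return f"BigCategoricalDtype(n_categories={self.n_categories})"


def expensive_fallback(obj):
    """Stand-in for a generic inference path that needs repr(obj)."""
    return repr(obj)


def infer_dtype(obj):
    # Cheap check first: extension dtypes are returned unchanged.
    if isinstance(obj, BaseDtype):
        return obj
    # Only objects that fail the cheap check reach the costly path.
    return expensive_fallback(obj)


big = BigCategoricalDtype(100_000_000)
result = infer_dtype(big)
```

In this sketch, `result` is the same object as `big` and `big.repr_calls` stays at zero, since the early return skips the fallback entirely.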

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13295
shwina authored May 5, 2023
1 parent 427e2af commit ceacfa4
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions python/cudf/cudf/core/dtypes.py
@@ -42,7 +42,11 @@ def dtype(arbitrary):
     -------
     dtype: the cuDF-supported dtype that best matches `arbitrary`
     """
-    # first, try interpreting arbitrary as a NumPy dtype that we support:
+    # first, check if `arbitrary` is one of our extension types:
+    if isinstance(arbitrary, cudf.core.dtypes._BaseDtype):
+        return arbitrary
+
+    # next, try interpreting arbitrary as a NumPy dtype that we support:
     try:
         np_dtype = np.dtype(arbitrary)
         if np_dtype.kind in ("OU"):
@@ -54,10 +58,6 @@ def dtype(arbitrary):
             raise TypeError(f"Unsupported type {np_dtype}")
         return np_dtype

-    # next, check if `arbitrary` is one of our extension types:
-    if isinstance(arbitrary, cudf.core.dtypes._BaseDtype):
-        return arbitrary
-
     # use `pandas_dtype` to try and interpret
     # `arbitrary` as a Pandas extension type.
     # Return the corresponding NumPy/cuDF type.
