First check for `BaseDtype` when inferring the data type of an arbitrary object (#13295)

We have an internal utility called `dtype()` that attempts to infer the data type of an arbitrary object. One of the first things that `dtype()` does is attempt to call `np.dtype(obj)`. That can be slow for categorical data types with extremely large cardinality, as it copies data to host (in particular, it attempts to call the object's `__repr__`).

Before this PR:

```python
import cudf

dtype = cudf.CategoricalDtype(categories=range(100_000_000))

%%time
x = cudf.core.dtypes.dtype(dtype)
CPU times: user 3.75 s, sys: 885 ms, total: 4.64 s
Wall time: 4.63 s
```

This PR ensures we attempt far less expensive inference first, before calling `np.dtype(...)`.

After this PR:

```python
%%time
x = cudf.core.dtypes.dtype(dtype)
CPU times: user 13 µs, sys: 1 µs, total: 14 µs
Wall time: 19.1 µs
```

Authors:
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)

URL: #13295
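
The ordering described above can be illustrated with a minimal, library-free sketch. It is not the cuDF implementation: `BaseDtype`, `BigCategoricalDtype`, `generic_lookup`, and `infer_dtype` are hypothetical names, and `generic_lookup` is only a stand-in for the `np.dtype(obj)` call, assumed here to trigger the object's `__repr__` as the message describes.

```python
class BaseDtype:
    """Hypothetical stand-in for a library's extension-dtype base class."""


class BigCategoricalDtype(BaseDtype):
    """Hypothetical dtype whose __repr__ is assumed to be expensive."""

    def __init__(self, n_categories):
        self.n_categories = n_categories
        self.repr_calls = 0

    def __repr__(self):
        # Count calls so we can observe whether the slow path ran.
        self.repr_calls += 1
        return f"BigCategoricalDtype(n_categories={self.n_categories})"


def generic_lookup(obj):
    """Stand-in for the generic path (np.dtype(obj) in the commit message).

    Formatting the object with !r triggers its __repr__.
    """
    return f"generic:{obj!r}"


def infer_dtype(obj):
    # Cheap check first: an object that is already a dtype is returned as-is.
    if isinstance(obj, BaseDtype):
        return obj
    # Only otherwise fall back to the generic, potentially slow path.
    return generic_lookup(obj)


big = BigCategoricalDtype(100_000_000)
result = infer_dtype(big)
print(result is big, big.repr_calls)  # prints: True 0
```

Because the `isinstance` check runs before the generic path, the cost of the call no longer depends on the number of categories; inputs that are not dtype instances still go through the fallback.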