First check for `BaseDtype` when inferring the data type of an arbitrary object #13295

shwina · 2023-05-04T20:08:43Z

We have an internal utility, dtype(), that attempts to infer the data type of an arbitrary object. One of the first things dtype() does is call np.dtype(obj). That can be slow for categorical data types with extremely high cardinality, because it copies data to host (in particular, it attempts to call the object's __repr__):

Before this PR:

dtype = cudf.CategoricalDtype(categories=range(100_000_000))
%%time
x = cudf.core.dtypes.dtype(dtype)
CPU times: user 3.75 s, sys: 885 ms, total: 4.64 s
Wall time: 4.63 s

This PR ensures we attempt the far less expensive inference first, before calling np.dtype(...).

After this PR:

%%time
x = cudf.core.dtypes.dtype(dtype)
CPU times: user 13 µs, sys: 1 µs, total: 14 µs
Wall time: 19.1 µs
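
The reordering described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the actual cuDF code: `BaseDtype`, `BigCategoricalDtype`, `generic_inference`, `infer_dtype_old`, and `infer_dtype_new` are hypothetical names, and `generic_inference` is only a stand-in for a fallback such as np.dtype(obj) that ends up touching the object's __repr__ before giving up.

```python
class BaseDtype:
    """Hypothetical stand-in for a library's extension-dtype base class."""


class BigCategoricalDtype(BaseDtype):
    """Hypothetical dtype whose repr is expensive to build."""

    def __init__(self, n_categories):
        self.n_categories = n_categories
        self.repr_calls = 0  # counts how often the slow path is hit

    def __repr__(self):
        # Stand-in for an expensive device-to-host copy of the categories.
        self.repr_calls += 1
        return f"BigCategoricalDtype(n_categories={self.n_categories})"


def generic_inference(obj):
    """Stand-in for a generic fallback such as np.dtype(obj).

    Assumption: the fallback touches repr(obj) and then fails with
    TypeError for objects it does not understand.
    """
    repr(obj)
    raise TypeError("cannot interpret object as a generic dtype")


def infer_dtype_old(obj):
    # Old order: try the generic fallback first, then the cheap type check.
    try:
        return generic_inference(obj)
    except TypeError:
        pass
    if isinstance(obj, BaseDtype):
        return obj
    raise TypeError("cannot infer a dtype")


def infer_dtype_new(obj):
    # New order: the cheap isinstance check short-circuits the slow path.
    if isinstance(obj, BaseDtype):
        return obj
    return generic_inference(obj)


d = BigCategoricalDtype(100_000_000)
infer_dtype_new(d)
print(d.repr_calls)  # 0: the expensive repr was never built
infer_dtype_old(d)
print(d.repr_calls)  # 1: the old order paid for the repr once
```

Both functions return the same object for an extension dtype; only the order of the checks differs, which is why the change is non-breaking.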

bdice

Good catch. How did you come across this?

wence-

Ooof.

shwina · 2023-05-05T10:32:00Z

/merge

shwina added 2 commits May 4, 2023 16:03

Check for BaseDtype first when inferring the data type of an arbitrary object
35e4c12

Merge branch 'branch-23.06' of github.com:rapidsai/cudf into fix-dtype-infer-order
176d955

shwina requested a review from a team as a code owner May 4, 2023 20:08

shwina requested review from wence- and brandon-b-miller May 4, 2023 20:08

github-actions bot added the Python (Affects Python cuDF API.) label May 4, 2023

bdice approved these changes May 4, 2023

shwina added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels May 4, 2023

wence- approved these changes May 5, 2023

rapids-bot bot merged commit ceacfa4 into rapidsai:branch-23.06 May 5, 2023
