[WIP] cuDF.dtype objects #6160

brandon-b-miller · 2020-09-04T20:17:11Z

Closes #4823
Closes #5714

Implements cudf.StringDtype, cudf.BooleanDtype, cudf.UIntXYDtype, cudf.IntXYDtype, cudf.Datetime64XYDtype, and cudf.Timedelta64XYDtype, along with generic types such as cudf.Number, to form a numpy-like hierarchy. Removes np.object as a cudf dtype object.
Implements cudf.api.types, which is similar to pd.api.types but adds several numpy-like utility functions.
Implements cudf.Scalar, extending cudf._lib.scalar.Scalar, along with a set of ops compatible with Python scalars. The type attribute of a cuDF dtype yields these scalars.

If merged, this API would introduce our own dtype object, attached to cuDF objects via the dtype attribute. In this picture, there would be one separate dtype class for each dtype we support. These are related to each other via a numpy-like hierarchy, e.g. cudf.Float64Dtype and cudf.Int8Dtype are subclasses of cudf.Floating and cudf.Integer respectively, both of which are themselves subclasses of cudf.Number. The intent is to replicate numpy-like usage for these dtypes: isinstance(cudf_dtype, cudf.Number) should give the same result as np.issubdtype(numpy_dtype, np.number), with the cuDF dtype standing in for the corresponding numpy dtype. All cuDF dtype objects carry the numpy dtype that was previously used as an attribute, so that when we absolutely need a numpy dtype, such as when interacting with cupy, we can use cudf.Int64Dtype().to_numpy to recover it.
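The hierarchy described above can be illustrated with a minimal sketch. The class names below mirror the ones in the description, but they are standalone stand-ins written for illustration, not the classes from this PR, and the `numpy_dtype` attribute name and `_numpy_name` field are assumptions.

```python
import numpy as np


class Dtype:
    """Hypothetical base class; each concrete subclass names the numpy dtype it wraps."""

    _numpy_name = None

    @property
    def numpy_dtype(self):
        # Recover the wrapped numpy dtype for interop (e.g. with cupy).
        return np.dtype(self._numpy_name)


# Generic (abstract) levels of the hierarchy.
class Number(Dtype):
    pass


class Integer(Number):
    pass


class Floating(Number):
    pass


# Concrete dtypes, one class per supported dtype.
class Int8Dtype(Integer):
    _numpy_name = "int8"


class Float64Dtype(Floating):
    _numpy_name = "float64"


class StringDtype(Dtype):
    _numpy_name = "object"


def is_number(dtype):
    # Mirrors np.issubdtype(numpy_dtype, np.number) using isinstance.
    return isinstance(dtype, Number)


# The isinstance check agrees with numpy's own hierarchy check.
for dt in (Int8Dtype(), Float64Dtype(), StringDtype()):
    assert is_number(dt) == np.issubdtype(dt.numpy_dtype, np.number)
```

With this layout, a check such as `isinstance(dt, Integer)` replaces `np.issubdtype(dt, np.integer)`, and no lookup table is needed to map between the two hierarchies.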

In places where we previously used numpy logic to determine type casting rules and operations, such as np.find_common_type, np.promote_types, and similar, we would now use the equivalents in cudf.api.types, such as cudf.api.types.find_common_type. These functions drop the cuDF dtypes to their numpy versions, call the numpy function on the results, then promote the result back to a cuDF dtype. This means that we as developers should be able to use them on any combination of cuDF and numpy types with the same APIs and it should just work. In addition, some rarely used attributes of numpy dtype objects are forwarded to the numpy version of the object, such as np.dtype('int64').num.
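The "drop to numpy, call numpy, promote back" pattern can be sketched roughly as follows. `SketchDtype` and this `find_common_type` are hypothetical stand-ins, not the code from this PR, and the sketch uses `np.result_type` rather than `np.find_common_type`, which recent NumPy releases deprecate.

```python
import numpy as np


class SketchDtype:
    """Hypothetical stand-in for a cuDF dtype object wrapping a numpy dtype."""

    def __init__(self, numpy_dtype):
        self.numpy_dtype = np.dtype(numpy_dtype)

    def __eq__(self, other):
        return isinstance(other, SketchDtype) and self.numpy_dtype == other.numpy_dtype

    def __hash__(self):
        return hash(self.numpy_dtype)

    def __repr__(self):
        return f"SketchDtype({self.numpy_dtype.name!r})"

    def __getattr__(self, name):
        # Only reached when normal lookup fails: forward rarely used
        # attributes (e.g. `num`, `itemsize`) to the wrapped numpy dtype.
        if name == "numpy_dtype":
            # Guard against infinite recursion on a half-initialized object.
            raise AttributeError(name)
        return getattr(self.numpy_dtype, name)


def to_numpy(dtype):
    # Accept either a wrapped dtype or anything np.dtype() understands.
    return dtype.numpy_dtype if isinstance(dtype, SketchDtype) else np.dtype(dtype)


def find_common_type(dtypes):
    # 1. Drop every input to its numpy version.
    numpy_dtypes = [to_numpy(dt) for dt in dtypes]
    # 2. Let numpy apply its promotion rules.
    common = np.result_type(*numpy_dtypes)
    # 3. Promote the result back to the wrapper type.
    return SketchDtype(common)


# Mixed wrapped and plain numpy inputs go through the same call.
mixed = find_common_type([SketchDtype("int64"), np.float32])
```

Because the conversion happens at the boundary of the helper, callers do not need to know whether they hold a wrapped dtype or a plain numpy one.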

cuDF dtypes produce GPU-backed scalars of their own dtype via their type attribute. Currently these wrap libcudf scalar objects. They are meant to avoid copying scalars to host in as many cases as possible, such as in operations like some_column - some_column.mean(). That said, yielding cuDF scalar objects from reduction operations requires that these scalars work with other Python scalars in an intuitive way. This PR therefore also aims to implement a sensible set of unary and binary ops on the resulting scalars. Generally this involves a host copy, as libcudf scalars currently do not support these ops and we often need to perform control flow around the result, which forces a host copy in any case. We may be able to get around this by caching the value on the host if we know it hasn't changed.
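The host-caching idea mentioned above might look something like the following sketch. `SketchScalar` is a hypothetical, pure-Python stand-in: the "device" value is just a Python attribute, and the `host_copies` counter exists only to make the simulated copies visible. It is not the cudf.Scalar implementation.

```python
import operator


class SketchScalar:
    """Hypothetical stand-in for a device-backed scalar with a cached host copy."""

    def __init__(self, value):
        self._device_value = value  # pretend this lives on the GPU
        self._host_cache = None     # filled in on the first host read
        self.host_copies = 0        # counts simulated device-to-host copies

    @property
    def value(self):
        # Copy to host only if no valid cached copy exists.
        if self._host_cache is None:
            self.host_copies += 1
            self._host_cache = self._device_value
        return self._host_cache

    def set_value(self, value):
        # Changing the device value invalidates the host cache.
        self._device_value = value
        self._host_cache = None

    def _binop(self, other, op, reflect=False):
        lhs = self.value
        rhs = other.value if isinstance(other, SketchScalar) else other
        if reflect:
            lhs, rhs = rhs, lhs
        return SketchScalar(op(lhs, rhs))

    def __add__(self, other):
        return self._binop(other, operator.add)

    def __radd__(self, other):
        return self._binop(other, operator.add, reflect=True)

    def __sub__(self, other):
        return self._binop(other, operator.sub)

    def __rsub__(self, other):
        return self._binop(other, operator.sub, reflect=True)

    def __neg__(self):
        return SketchScalar(operator.neg(self.value))

    def __bool__(self):
        # Control flow on the result needs the value on the host.
        return bool(self.value)


s = SketchScalar(3)
difference = 10 - s  # Python scalar on the left dispatches to __rsub__
total = s + 1        # second op reuses the cached host value
```

In this sketch the two operations on `s` trigger a single simulated host copy, and a further copy happens only after `set_value` invalidates the cache.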

Ultimately the purpose of this PR is to provide a more unified dtype interface for both the developer and the user. The user will not be confused by a mix of numpy dtypes, numpy object dtypes, custom cuDF dtypes and pandas nullable dtypes: everything will be a cuDF dtype. Furthermore, developers will no longer need to worry about special-casing their code in many situations, or using imported dicts/functions to map between numpy, pyarrow, libcudf and pandas nullable dtypes. This logic will be contained in the dtype objects themselves, set up at init and readily available to the programmer.

brandon-b-miller · 2020-09-16T13:19:10Z

@kkraus14 thanks for the review - addressing these questions/items today.

harrism · 2020-10-02T01:38:57Z

Should this be pushed to 0.17?

kkraus14 · 2020-10-02T03:24:40Z

> Should this be pushed to 0.17?

Yes

brandon-b-miller · 2020-10-02T12:43:01Z

Yes - a lot more to be done here in dask and elsewhere.

github-actions · 2021-02-16T20:19:35Z

This PR has been marked rotten due to no recent activity in the past 90d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

brandon-b-miller · 2021-05-11T17:24:54Z

Closing PR. This PR needs to be more or less reworked after the array refactor. While a lot of the code is useful, the codebase drift and a better understanding of the problem will probably yield a very different looking pull request ultimately.

brandon-b-miller added 30 commits July 28, 2020 06:47

initial dtype work

1c83eac

begin to plumb dtype

33bd96c

migrate dtypes to cudf main __init__

baf138c

numerical column plumbing

bdb87fa

update dtype classes, mappings

4a3fe71

start to plumb stringcolumn

1cf2c3e

inherit from basic cython class

dbc4970

plumb numerical column __repr__, default_na_value

ba42bd8

plumb some parts of unary

60272e2

make a factory and fix bugs

c03be40

more progress on columns, dtype object

7f6cb36

forgot string O

a81c368

more progress

572c39f

column tests pass

4f6f316

working up through test_array_func

ee6ece5

more tests pass

62c5e17

merge 0.15 and resolve conflicts

139465f

handle list dtype in _Dtype

ef5b9cb

fix series syntax error

9320755

add timedelta dtypes

dac2940

fix some numericalcolumn bugs

6eee9eb

fix index type mapping dicts

1ace460

pass all binop tests

df6426b

more progress

92d1a64

all column tests pass

59b3673

move more stuff to cudf.api.types

297a31a

forgot entire api/ folder

e5def6e

fix mutable_column_view

b4d344f

working through dataframe.py tests

22fd5d9

pass join tests

c5a0b62

brandon-b-miller added 10 commits September 16, 2020 08:47

to_numpy -> numpy_dtype

a8b380b

extra to_numpy -> numpy_dtype that were missed

1dc151a

add docstrings, respond to reviews

46a9c2f

minor fixes and code removal

81e6058

remove cudf_dtype_from_pydata_dtype

d7930eb

update api calls for find_common_type to be numpy-like

c290a15

let pandas handle categorical edge cases

e90e325

fix categorical creation and casting throughout cudf

3d8ca2f

remove old code

2653384

continued bugfixes

123784b

brandon-b-miller added the "2 - In Progress" label and removed the "3 - Ready for Review" label Oct 2, 2020

brandon-b-miller changed the title ~~[REVIEW] cuDF.dtype objects~~ [WIP] cuDF.dtype objects Oct 2, 2020

This was referenced Jan 11, 2021

[BUG] get_dummies fails in dask-cudf due to dask categorical type checking #7111

Closed

Generalize is_categorical_dtype to support alternative backends dask/dask#7054

Closed

github-actions bot added the rotten label Feb 16, 2021

caryr35 closed this May 11, 2021
