[WIP] cuDF.dtype objects #6160

brandon-b-miller
Contributor

@brandon-b-miller brandon-b-miller commented Sep 4, 2020

Closes #4823
Closes #5714

  • Implements cudf.StringDtype, cudf.BooleanDtype, cudf.UIntXYDtype, cudf.IntXYDtype, cudf.Datetime64XYDtype, and cudf.Timedelta64XYDtype, along with generic types such as cudf.Number, to form a numpy-like hierarchy. Removes np.object as a cuDF dtype.
  • Implements cudf.api.types, which is similar to pd.api.types but adds several numpy-like utility functions.
  • Implements cudf.Scalar, which extends cudf._lib.scalar.Scalar, and a set of ops compatible with Python scalars. The type attribute of a cuDF dtype yields these scalars.

If merged, this API would introduce our own dtype object, attached to cuDF objects via the dtype attribute. In this picture, there would be one separate dtype class for each dtype we support. These are related to each other via a numpy-like hierarchy, e.g. cudf.Float64Dtype and cudf.Int8Dtype are subclasses of cudf.Floating and cudf.Integer respectively, both of which are themselves subclasses of cudf.Number. The intent is to replicate numpy-like usage for these dtypes: isinstance(cudf_dtype, cudf.Number) should give the same answer as np.issubdtype(numpy_dtype, np.number), with results expressed as the corresponding cuDF dtype. Every cuDF dtype object carries the numpy dtype that was previously used as an attribute, so that when we absolutely need a numpy dtype, such as when interacting with cupy, we can recover it with cudf.Int64Dtype().to_numpy.
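The hierarchy described above can be sketched in plain Python. This is only an illustration under stated assumptions: the classes below are hypothetical stand-ins that borrow the names from this description, not the classes in this PR, and the `_numpy_name` attribute and the `Generic` and `Flexible` base classes are invented for the example.

```python
import numpy as np


# Hypothetical stand-ins for the proposed cudf classes (not the real cuDF API).
class Generic:
    _numpy_name = None  # assumed attribute: name of the wrapped NumPy dtype

    @property
    def to_numpy(self):
        # Recover the NumPy dtype, e.g. for interop with cupy.
        return np.dtype(self._numpy_name)


class Number(Generic):
    pass


class Integer(Number):
    pass


class Floating(Number):
    pass


class Flexible(Generic):
    pass


class Int8Dtype(Integer):
    _numpy_name = "int8"


class Float64Dtype(Floating):
    _numpy_name = "float64"


class StringDtype(Flexible):
    _numpy_name = "object"


# isinstance on the sketch dtype should agree with np.issubdtype
# on the NumPy dtype it wraps.
for dt in (Int8Dtype(), Float64Dtype(), StringDtype()):
    assert isinstance(dt, Number) == np.issubdtype(dt.to_numpy, np.number)
```

With this layout, a check such as isinstance(dtype, Integer) replaces np.issubdtype(dtype, np.integer) without a lookup table.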

In places where we previously used numpy logic to determine type casting rules and operations, such as np.find_common_type, np.promote_types, and similar, we would now use the equivalents in cudf.api.types, such as cudf.api.types.find_common_type. These functions convert cuDF dtypes to their numpy versions, call the numpy function on the results, and then convert the result back to a cuDF dtype. As developers, we should therefore be able to call them on any combination of cuDF and numpy types with the same APIs. In addition, some sparsely used attributes of numpy dtype objects, such as np.dtype('int64').num, are forwarded to the numpy version of the object.
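The "convert to numpy, call numpy, convert back" pattern might look roughly like the sketch below. `SketchDtype` and this `find_common_type` are hypothetical and are not the code in this PR. The sketch calls np.result_type rather than np.find_common_type, since the latter is not available in NumPy 2.x.

```python
import numpy as np


class SketchDtype:
    """Hypothetical wrapper around a NumPy dtype (not the real cuDF class)."""

    def __init__(self, numpy_dtype):
        self.to_numpy = np.dtype(numpy_dtype)

    def __eq__(self, other):
        return isinstance(other, SketchDtype) and self.to_numpy == other.to_numpy

    def __hash__(self):
        return hash(self.to_numpy)

    def __repr__(self):
        return f"SketchDtype({self.to_numpy.name!r})"

    def __getattr__(self, name):
        # Only called when normal lookup fails: forward rarely used
        # attributes (such as .num) to the wrapped NumPy dtype.
        if name == "to_numpy":
            raise AttributeError(name)
        return getattr(self.to_numpy, name)


def find_common_type(dtypes):
    """Accept any mix of SketchDtype and NumPy-style dtypes."""
    numpy_dtypes = [
        d.to_numpy if isinstance(d, SketchDtype) else np.dtype(d)
        for d in dtypes
    ]
    # Let NumPy apply its promotion rules, then wrap the result again.
    return SketchDtype(np.result_type(*numpy_dtypes))


common = find_common_type([SketchDtype("int32"), np.float32])
print(common)
```

Because the promotion rules stay in NumPy, the wrapper needs no casting table of its own.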

cuDF dtypes produce GPU-backed scalars of their own dtype via their type attribute. Currently these wrap libcudf scalar objects. They are meant to avoid copying scalars to the host wherever possible, for example in operations like some_column - some_column.mean(). That said, returning cuDF scalar objects from reduction operations requires that these scalars interoperate with Python scalars in an intuitive way. This PR therefore also aims to implement a sensible set of unary and binary ops on the resulting scalars. Generally this involves a host copy, because libcudf scalars do not currently support these ops and we often need to perform control flow around the result, which forces a host copy in any case. We may be able to reduce the cost by caching the value on the host when we know it hasn't changed.
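The host-caching idea can be illustrated without a GPU. In the sketch below, `FakeDeviceValue` is a hypothetical stand-in for a libcudf scalar that counts how often its value is copied to the host, and `SketchScalar` is an invented class rather than the cudf.Scalar from this PR.

```python
import operator


class FakeDeviceValue:
    """Hypothetical stand-in for a libcudf scalar living on the GPU."""

    def __init__(self, value):
        self._value = value
        self.copies_to_host = 0

    def copy_to_host(self):
        self.copies_to_host += 1
        return self._value


class SketchScalar:
    """Invented scalar wrapper that caches its host value after one copy."""

    def __init__(self, value):
        self._device = FakeDeviceValue(value)
        self._host_cache = None
        self._cache_valid = False

    @property
    def value(self):
        # Copy to host only on first access; later reads reuse the cache.
        if not self._cache_valid:
            self._host_cache = self._device.copy_to_host()
            self._cache_valid = True
        return self._host_cache

    @staticmethod
    def _unwrap(other):
        return other.value if isinstance(other, SketchScalar) else other

    def __add__(self, other):
        return SketchScalar(operator.add(self.value, self._unwrap(other)))

    def __radd__(self, other):
        return SketchScalar(operator.add(self._unwrap(other), self.value))

    def __sub__(self, other):
        return SketchScalar(operator.sub(self.value, self._unwrap(other)))

    def __rsub__(self, other):
        return SketchScalar(operator.sub(self._unwrap(other), self.value))

    def __neg__(self):
        return SketchScalar(operator.neg(self.value))

    def __bool__(self):
        # Control flow such as `if scalar:` needs the value on the host.
        return bool(self.value)


s = SketchScalar(3)
total = s + 2        # first access copies the value to the host
difference = 10 - s  # reuses the cached host value
```

A real implementation would also have to invalidate the cache whenever the device value changes, which this sketch does not model.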

Ultimately, the purpose of this PR is to provide a more unified dtype interface for both the developer and the user. Users will no longer be confused by a mix of numpy dtypes, numpy object dtypes, custom cuDF dtypes, and pandas nullable dtypes: everything will be a cuDF dtype. Furthermore, developers will no longer need to special-case their code in many situations, or use imported dicts and functions to map between numpy, pyarrow, libcudf, and pandas nullable dtypes. This mapping logic will be set up in the dtype objects themselves at init and be directly available to the programmer.

@brandon-b-miller
Contributor Author

@kkraus14 thanks for the review - addressing these questions/items today.

@harrism
Member

harrism commented Oct 2, 2020

Should this be pushed to 0.17?

@kkraus14
Collaborator

kkraus14 commented Oct 2, 2020

> Should this be pushed to 0.17?

Yes

@brandon-b-miller added the "2 - In Progress" (Currently a work in progress) label and removed the "3 - Ready for Review" (Ready for review by team) label on Oct 2, 2020
@brandon-b-miller changed the title from "[REVIEW] cuDF.dtype objects" to "[WIP] cuDF.dtype objects" on Oct 2, 2020
@brandon-b-miller
Contributor Author

Yes - a lot more to be done here in dask and elsewhere.

@github-actions

This PR has been marked rotten due to no recent activity in the past 90d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

@caryr35 caryr35 closed this May 11, 2021
@brandon-b-miller
Contributor Author

brandon-b-miller commented May 11, 2021

Closing PR. This PR needs to be more or less reworked after the array refactor. While a lot of the code is useful, codebase drift and a better understanding of the problem will probably yield a very different-looking pull request in the end.

Labels
2 - In Progress Currently a work in progress Python Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[FEA] Remove object dtype
[DISCUSSION] Python cuDF.dtype
6 participants