[WIP] cuDF.dtype objects #6160
@kkraus14 thanks for the review - addressing these questions/items today.
Should this be pushed to 0.17?
Yes
Yes - a lot more to be done here in dask and elsewhere.
This PR has been marked rotten due to no recent activity in the past 90d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.
Closing PR. This PR needs to be more or less reworked after the array refactor. While a lot of the code is useful, the codebase drift and a better understanding of the problem will probably yield a very different looking pull request ultimately.
Closes #4823
Closes #5714
- `cudf.StringDtype`, `cudf.BooleanDtype`, `cudf.UIntXYDtype`, `cudf.IntXYDtype`, `cudf.Datetime64XYDtype`, `cudf.Timedelta64XYDtype`, and the generic types such as `cudf.Number`, which together form a numpy-like hierarchy. Removes `np.object` as a cudf dtype object.
- `cudf.api.types`, which is similar to `pd.api.types` except with the addition of several numpy-like utility functions.
- `cudf.Scalar`, extending `cudf._lib.scalar.Scalar`, and a set of ops compatible with Python scalars. The `type` attribute of cuDF dtypes yields these scalars.

If merged, this API would introduce our own object that is attached to cuDF objects via the
`dtype` attribute. In this picture, there would be one separate dtype object for each dtype we support. These are related to each other via a numpy-like hierarchy, e.g. `cudf.Float64Dtype` and `cudf.Int8Dtype` are subclasses of `cudf.Floating` and `cudf.Integer` respectively, both of which are themselves subclasses of `cudf.Number`. The intent is to replicate numpy-like usage for these dtypes, meaning we want to get the same result from `isinstance(cudf_dtype, cudf.Number)` as we would from writing `np.issubdtype(numpy_dtype, np.number)`, except yielding the corresponding cuDF dtype. All cuDF dtype objects carry whatever numpy dtype was previously used as an attribute, so that when we absolutely need to use a numpy dtype, such as when interacting with `cupy`, we can use `cudf.Int64Dtype().to_numpy` and recover it.

In places where we previously used numpy logic to determine various type casting rules and operations, such as in uses of
`np.find_common_type`, `np.promote_types`, and similar, we would now use `cudf.api.types.find_common_type`. These functions basically drop the cuDF dtypes to their numpy versions and call the numpy function on the results, then promote the result of that back to a cuDF dtype. This means that we as developers should be able to just use them on any combination of cuDF and numpy types with the same APIs and it should just work. In addition, some sparsely used attributes of numpy objects are forwarded to the numpy version of the object, such as `np.dtype('int64').num`.

cuDF dtypes produce GPU-backed scalars of their own dtype via their
`type` attribute. Currently these wrap libcudf scalar objects. These are to be used to avoid having to copy scalars to host in as many cases as possible, such as in operations like `some_column - some_column.mean()`. That said, yielding cuDF scalar objects from reduction operations necessitates that these scalars can be used with other Python scalars in an intuitive way. This PR therefore also aims to implement a sensible set of unaops and binops on the resulting scalars. Generally this involves a host copy, as libcudf scalars currently do not support these ops and we often need to perform control flow around the result, which forces a host copy in any case. We may be able to get around this by caching the value on the host if we know it hasn't changed.

Ultimately the purpose of this PR is to provide a more unified dtype interface for both the developer and the user. The user will not be confused by mixing of numpy dtypes, numpy object dtypes, custom cuDF dtypes and pandas nullable dtypes: everything will be a cuDF dtype. Furthermore, developers will no longer need to worry about special casing their code in many situations, or using imported dicts/functions to map between numpy, pyarrow, libcudf and pandas nullable dtypes. This logic will be contained in the dtype objects themselves at init and available to the programmer on hand.
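
For readers who want a concrete picture of the dtype hierarchy and the `find_common_type` round-trip described in this PR, here is a minimal sketch in plain Python and numpy. It is not the code from this PR: the class names mirror the ones mentioned above, but the `Generic` base class, the `_np_name` attribute, the `_FROM_NUMPY` lookup table, and the `as_numpy` helper are all hypothetical, and `to_numpy` is assumed to be a property because the description writes it without a call. `np.result_type` stands in for the numpy promotion functions named above.

```python
import numpy as np


class Generic:
    """Root of the sketch hierarchy (hypothetical, not the cudf class)."""

    _np_name = None  # name of the numpy dtype this class wraps

    @property
    def to_numpy(self):
        # Assumed to be a property, following `cudf.Int64Dtype().to_numpy`.
        return np.dtype(self._np_name)

    def __eq__(self, other):
        return type(self) is type(other)

    def __hash__(self):
        return hash(type(self))

    def __repr__(self):
        return f"{type(self).__name__}()"


# Generic (abstract) types forming the numpy-like hierarchy.
class Number(Generic):
    pass


class Integer(Number):
    pass


class Floating(Number):
    pass


# Concrete dtypes, each carrying its numpy counterpart.
class Int8Dtype(Integer):
    _np_name = "int8"


class Int64Dtype(Integer):
    _np_name = "int64"


class Float32Dtype(Floating):
    _np_name = "float32"


class Float64Dtype(Floating):
    _np_name = "float64"


# Hypothetical lookup used to promote a numpy dtype back to a sketch dtype.
_FROM_NUMPY = {
    cls().to_numpy: cls
    for cls in (Int8Dtype, Int64Dtype, Float32Dtype, Float64Dtype)
}


def as_numpy(dtype):
    """Drop a sketch dtype to numpy; pass numpy-like inputs through."""
    return dtype.to_numpy if isinstance(dtype, Generic) else np.dtype(dtype)


def find_common_type(dtypes):
    """Drop to numpy, let numpy promote, then wrap the result again."""
    np_result = np.result_type(*(as_numpy(d) for d in dtypes))
    # Raises KeyError for numpy dtypes this sketch does not cover.
    return _FROM_NUMPY[np_result]()


# Hierarchy checks mirror np.issubdtype on the numpy side.
assert isinstance(Int8Dtype(), Number)
assert np.issubdtype(Int8Dtype().to_numpy, np.number)

# Mixed sketch and numpy inputs go through the same function.
assert find_common_type([Int8Dtype(), np.float32]) == Float32Dtype()
assert find_common_type([Int64Dtype(), Float32Dtype()]) == Float64Dtype()
```

The design choice illustrated here is that promotion rules stay owned by numpy; the dtype objects only handle the conversion in and out, which is why mixed cuDF and numpy inputs can share one API.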