[FEA] A cudf.dtype function similar to np.dtype(...) #8915

Closed
shwina opened this issue Jul 30, 2021 · 2 comments · Fixed by #8949
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@shwina
Contributor

shwina commented Jul 30, 2021

It would be useful to have a cudf.dtype(...) function that takes an arbitrary dtype-like object and coerces it to a dtype that we can handle. Specifically, it should accept a superset of what np.dtype() accepts, and also handle Pandas extension types as well as our own dtypes.

Some examples:

  • Passing a NumPy-like dtype yields a np.dtype() instance:

    cudf.dtype(np.dtype("int8")) -> np.dtype("int8")
    cudf.dtype(np.int8) -> np.dtype("int8")
    cudf.dtype(np.float16) -> np.dtype("float32")
  • Passing a Pandas extension type yields a corresponding NumPy type (since we use NumPy dtypes for our own "nullable" types):

    cudf.dtype(pd.Int8Dtype()) -> np.dtype("int8")
    cudf.dtype(pd.StringDtype()) -> np.dtype("object")
  • Passing a string alias yields the appropriate type:

    cudf.dtype("int8") -> np.dtype("int8")
    cudf.dtype("Boolean") -> np.dtype("bool")
  • Passing Python builtin types works (just like np.dtype):

    cudf.dtype(int) -> np.dtype("int64")
    cudf.dtype(float) -> np.dtype("float64")
  • Passing a cuDF type works:

    cudf.dtype(cudf.ListDtype("int64")) -> cudf.ListDtype("int64")
  • Special case: float16 is coerced to float32 since we don't have a float16 type:

    cudf.dtype(np.float16 | np.dtype("float16") | "float16") -> np.dtype("float32")
  • Special case: unicode types are coerced to np.dtype("object"):

    cudf.dtype(np.dtype("U")) -> np.dtype("object")

Note: we could potentially extend this to handle Arrow types too, but let's punt on that for now.
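
To make the NumPy-only cases above concrete, here is a minimal sketch. It is not the implementation that landed in #8949: the name coerce_dtype and the alias table are hypothetical, and Pandas extension types and cuDF dtypes are deliberately left out so that the block depends only on NumPy.

```python
import numpy as np

# Hypothetical alias table: the issue only names "Boolean" explicitly.
_STRING_ALIASES = {"Boolean": "bool"}


def coerce_dtype(obj):
    """Hypothetical stand-in for the proposed cudf.dtype (NumPy-only cases)."""
    # Translate string aliases that NumPy itself may not recognize.
    if isinstance(obj, str):
        obj = _STRING_ALIASES.get(obj, obj)

    # Let NumPy handle dtype instances, scalar types, strings and builtins.
    dt = np.dtype(obj)

    # Special case from the issue: float16 is widened to float32.
    if dt == np.dtype("float16"):
        return np.dtype("float32")

    # Special case from the issue: unicode dtypes become object.
    if dt.kind == "U":
        return np.dtype("object")

    return dt


print(coerce_dtype(np.int8))
print(coerce_dtype("Boolean"))
print(coerce_dtype(np.float16))
print(coerce_dtype(np.dtype("U")))
```

Handling the two special cases after the call to np.dtype() means every spelling of float16 (the scalar type, the dtype instance, or the string) goes through the same branch.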

@shwina shwina added feature request New feature or request Needs Triage Need team to review and classify labels Jul 30, 2021
@shwina shwina added the Python Affects Python cuDF API. label Jul 30, 2021
@sarahyurick
Contributor

(Copying this here so I don't forget from our slack discussion) Should we also make it so that cudf.dtype(array) for instance returns array.dtype? i.e., anything that has a .dtype attribute, we return its dtype?

@vyasr
Contributor

vyasr commented Jul 30, 2021

I'm supportive of creating this feature, but I think it requires some care to get it just right. Some considerations:

  • We don't want to duplicate functionality between this and the various dtype utilities we already have in cudf/core/dtypes.py and cudf/api/types.py. Ideally, where appropriate this function would call through to functions like those.
  • Because our dtype APIs are called so frequently, they turn out to be somewhat performance critical. Optimizing them is out of scope for this work, but one thing we should be cognizant of is splitting dtype identification into many small functions rather than one massive one. In cases where we need to use such utilities internally in cudf, we want to be able to take advantage of whatever information we might already have to do checks/conversions faster, as compared to API functions where users could pass any of a large possible set of inputs.
  • We should take this opportunity to very clearly document the behavior of dtype utilities. Some functions have surprising behaviors, and unless we clarify exactly what qualifies as "numeric-like" (for example), users may be caught off guard. pandas.api.types has this problem, and to some extent we are stuck with it, but we can at least improve documentation and perhaps create more sane internal APIs for use in cuDF code. Here's an example of the kind of (IMO) unexpected behavior that absolutely has to be documented (but is not documented in pandas):
In [13]: pd.api.types.is_list_like({})
Out[13]: True

(Copying this here so I don't forget from our slack discussion) Should we also make it so that cudf.dtype(array) for instance returns array.dtype? i.e., anything that has a .dtype attribute, we return its dtype?

Yes, I think we'd want to do that for consistency. However, in line with my previous points I'm guessing we'll only want that to be true in user-facing functions, not in internal utilities.

@beckernick beckernick added this to the CuDF Python Refactoring milestone Aug 2, 2021
@sarahyurick sarahyurick removed their assignment Aug 3, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Aug 3, 2021
@shwina shwina self-assigned this Aug 3, 2021
rapids-bot bot pushed a commit that referenced this issue Aug 13, 2021
Closes #8915

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #8949