cudf.dtype function #8949

Merged on Aug 13, 2021 (27 commits)

@shwina shwina commented Aug 4, 2021

Closes #8915

@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 4, 2021
@galipremsagar galipremsagar left a comment

Overall the changes look good; some comments below.

Comment on lines 54 to 56
# no NumPy type corresponding to this type
# always object?
return np.dtype("object")
Contributor

This seems reasonable to me, as pandas is moving towards object as the default dtype when no dtype is provided:

>>> pd.Series()
<stdin>:1: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Series([], dtype: float64)

    np_dtype = np.dtype("<m8[ns]")
elif np_dtype.str == "<M8":
    np_dtype = np.dtype("<M8[ns]")
return np_dtype
Contributor

If someone calls cudf.dtype('complex'), I think we would end up returning np.dtype('complex') here. Should we validate that the dtype exists in our cudf type map before returning?

>>> np.dtype('complex')
dtype('complex128')
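
A minimal sketch of the kind of validation being asked about, using only NumPy. The helper name `checked_numpy_dtype` is hypothetical and is not the function added in this PR; the kind string is borrowed from the check quoted later in this review.

```python
import numpy as np

# Kind codes: b=bool, i=int, u=uint, f=float, U=str, O=object,
# M=datetime64, m=timedelta64. Complex dtypes have kind 'c'.
SUPPORTED_KINDS = "biufUOMm"


def checked_numpy_dtype(arg):
    """Hypothetical helper: resolve `arg` with NumPy, then reject any
    dtype whose kind is outside SUPPORTED_KINDS."""
    np_dtype = np.dtype(arg)
    if np_dtype.kind not in SUPPORTED_KINDS:
        raise TypeError(f"Unsupported type {np_dtype}")
    return np_dtype


print(checked_numpy_dtype("int32"))

try:
    checked_numpy_dtype("complex")
except TypeError as err:
    # NumPy itself accepts 'complex'; the extra check is what rejects it.
    print(err)
```

Without the extra check, anything NumPy can parse would be passed through, which is the concern raised in the comment above.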

Contributor Author

Fixed - let me know if you think the way I'm handling unsupported NumPy types is OK

Comment on lines 2080 to 2081
  dtype = pd.api.types.pandas_dtype(dtype)
- np_type = np.dtype(dtype).type
+ np_type = cudf.dtype(dtype).type
Contributor

Can we now squeeze these two separate dtype calls into a single cudf.dtype call? Or is there something specific about calling pd.api.types.pandas_dtype first?

np_type = cudf.dtype(dtype).type

Contributor Author

Fixed
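
To illustrate why a single resolution step can be enough, here is a small NumPy-only sketch. It assumes that `cudf.dtype` behaves like `np.dtype` for plain NumPy inputs, which is what the linked issue title suggests but which this example does not verify.

```python
import numpy as np

# np.dtype accepts strings, scalar types, and existing dtype objects.
for spec in ("int64", np.int64, np.dtype("int64")):
    resolved = np.dtype(spec)
    # .type is the scalar class associated with the dtype.
    print(repr(spec), "->", resolved, resolved.type)

# Resolving an already-resolved dtype returns an equal dtype, so a
# second conversion step after the first adds nothing.
once = np.dtype("float32")
twice = np.dtype(once)
print(once == twice, twice.type is np.float32)
```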

python/cudf/cudf/tests/test_dtypes.py (outdated, resolved)
@shwina shwina marked this pull request as ready for review August 9, 2021 21:39
@shwina shwina requested a review from a team as a code owner August 9, 2021 21:39
@shwina shwina requested review from galipremsagar and isVoid August 9, 2021 21:39
@shwina shwina added non-breaking Non-breaking change tech debt improvement Improvement / enhancement to an existing function labels Aug 9, 2021
except TypeError:
    pass
else:
    if np_dtype.kind not in "biufUOMm":
Contributor

Suggested change
- if np_dtype.kind not in "biufUOMm":
+ if np_dtype not in cudf._lib.types.np_to_cudf_types:

To make this maintainable, should we just look up our np<->libcudf type map here? This way, any new dtype support added will automatically be supported by cudf.dtype.

Contributor Author

Agree -- but it would be nicer if the source of truth was in a more obviously named constant. For example, something like: cudf._lib.types.SUPPORTED_NUMPY_TYPES.

Contributor

+1 to have a cudf._lib.types.SUPPORTED_NUMPY_TYPES

Contributor Author

There's a slight problem: <M8 is an acceptable return type here, but it's not a SUPPORTED_NUMPY_TYPE (supported types are <M8[unit]).
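
The mismatch can be reproduced with NumPy alone. In the sketch below, `SUPPORTED_NUMPY_TYPES` is an illustrative subset written for this example (the real list lives in cuDF and is not reproduced here), and `is_acceptable` is a hypothetical helper showing one way to special-case the unit-less dtypes.

```python
import numpy as np

# Illustrative subset only; an assumption made for this example.
SUPPORTED_NUMPY_TYPES = {
    np.dtype("int64"),
    np.dtype("float64"),
    np.dtype("<M8[ns]"),
    np.dtype("<m8[ns]"),
}

unitless = np.dtype("<M8")  # datetime64 with no unit attached
print(unitless.str)
# Only unit-qualified entries are in the set, so this prints False.
print(unitless in SUPPORTED_NUMPY_TYPES)


def is_acceptable(np_dtype):
    """Hypothetical check: accept set members, plus the unit-less
    datetime/timedelta dtypes that the set does not contain."""
    if np_dtype.str in ("<M8", "<m8"):
        return True
    return np_dtype in SUPPORTED_NUMPY_TYPES


print(is_acceptable(unitless), is_acceptable(np.dtype("complex128")))
```

A pure set lookup is easier to maintain than a kind string, but as the comment notes it needs this extra branch (or an extra set entry) for the unit-less case.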

@vyasr vyasr left a comment

Mostly looks good, some small suggestions. We also should replace all instances of np.dtype throughout cudf if possible.

python/cudf/cudf/api/types.py (outdated, resolved; 4 threads)


@pytest.mark.parametrize(
"in_dtype,expect",
Contributor

We probably want to test more inputs that don't translate to numpy dtypes, specifically more cudf- and pandas-specific extension types.

Contributor Author

The cuDF-specific types (i.e., instances of cudf._BaseDtype) are less interesting since we just return those as-is. But I did add a few more tests.

Contributor

Maybe also worth testing pandas interval/datetime/timedelta dtypes.

Contributor Author

Added interval cases, but as far as I know, pandas uses NumPy datetime/timedelta types as the dtype for DatetimeIndex/TimedeltaIndex.
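
A quick way to check this claim is to inspect the index dtypes directly. The sketch below uses only public pandas and NumPy calls; the sample values are arbitrary.

```python
import numpy as np
import pandas as pd

dt_index = pd.DatetimeIndex(["2021-08-13"])
td_index = pd.TimedeltaIndex(["1 day"])
interval_index = pd.IntervalIndex.from_breaks([0, 1, 2])

for name, idx in [
    ("datetime", dt_index),
    ("timedelta", td_index),
    ("interval", interval_index),
]:
    # A plain NumPy dtype is an instance of np.dtype;
    # pandas extension dtypes such as IntervalDtype are not.
    print(name, idx.dtype, isinstance(idx.dtype, np.dtype))
```

Time-zone-aware datetimes are a separate case, since pandas represents them with `DatetimeTZDtype` rather than a NumPy dtype.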

Co-authored-by: Vyas Ramasubramani <[email protected]>
codecov bot commented Aug 9, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@2e980b8).
The diff coverage is n/a.

❗ Current head e826428 differs from pull request most recent head 2a684be. Consider uploading reports for the commit 2a684be to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.10    #8949   +/-   ##
===============================================
  Coverage                ?   10.59%           
===============================================
  Files                   ?      114           
  Lines                   ?    19080           
  Branches                ?        0           
===============================================
  Hits                    ?     2022           
  Misses                  ?    17058           
  Partials                ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2e980b8...2a684be.

@shwina shwina requested a review from a team as a code owner August 10, 2021 19:32
shwina commented Aug 11, 2021

rerun tests

@shwina shwina requested review from galipremsagar and vyasr August 11, 2021 23:05
@vyasr vyasr left a comment

Looks like something you did got you stuck triggering isort/circular import issues? Anyway, the import changes generally look good along with the main changes. I had a couple of minor additional comments, but nothing pressing.

@@ -787,12 +787,13 @@ cdef class _CPackedColumns:
"""
Construct a ``PackedColumns`` object from a ``cudf.DataFrame``.
"""
from cudf.core import RangeIndex, dtypes
import cudf.core.dtypes
Contributor

Why not just import _BaseIndex? Not a big deal either way, just curious.

Contributor Author

In [17]: %%timeit
    ...: import cudf.core.dtypes
    ...: cudf.core.dtypes._BaseDtype
    ...:
    ...:
407 ns ± 3.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [19]: %%timeit
    ...: from cudf.core.dtypes import _BaseDtype
    ...: _BaseDtype
    ...:
    ...:
875 ns ± 1.48 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Contributor

Oh sure, if it's for performance, that works for me.
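
The same comparison can be run outside IPython with the standard `timeit` module. This sketch uses `os.path` as a stand-in module, which is an assumption made so the example runs without cuDF installed. Results vary by machine and Python version, so no particular ordering is asserted.

```python
import timeit

N = 200_000  # number of repetitions per measurement

# Style 1: import the module, then look the attribute up through it.
attr_access = timeit.timeit("import os.path\nos.path.join", number=N)

# Style 2: bind the name directly with a from-import.
from_import = timeit.timeit("from os.path import join\njoin", number=N)

print(f"import os.path; os.path.join : {attr_access / N * 1e9:.0f} ns per loop")
print(f"from os.path import join     : {from_import / N * 1e9:.0f} ns per loop")
```

Both statements hit the module cache after the first iteration, so the measurement reflects the cost of the import statement itself rather than of loading the module.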



python/cudf/cudf/_lib/copying.pyx (outdated, resolved)
python/cudf/cudf/_lib/transform.pyx (outdated, resolved)
python/cudf/cudf/core/frame.py (resolved)
shwina commented Aug 12, 2021

@gpucibot merge

@galipremsagar galipremsagar added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Aug 12, 2021
@galipremsagar
Contributor

Thanks for working on this, @shwina! This greatly helps other dtype-related APIs, especially with the changes coming up on the cuIO side.

shwina commented Aug 13, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2b92220 into rapidsai:branch-21.10 Aug 13, 2021
@galipremsagar galipremsagar added breaking Breaking change and removed non-breaking Non-breaking change labels Aug 17, 2021
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Linked issue: [FEA] A cudf.dtype function similar to np.dtype(...)
4 participants