-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cudf.dtype
function
#8949
cudf.dtype
function
#8949
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall changes look good, some comments..
python/cudf/cudf/api/types.py
Outdated
# no NumPy type corresponding to this type | ||
# always object? | ||
return np.dtype("object") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable to me, as pandas is moving towards object
as default type if no dtype is provided:
>>> pd.Series()
<stdin>:1: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Series([], dtype: float64)
python/cudf/cudf/api/types.py
Outdated
np_dtype = np.dtype("<m8[ns]") | ||
elif np_dtype.str == "<M8": | ||
np_dtype = np.dtype("<M8[ns]") | ||
return np_dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone does a cudf.dtype('complex')
, I think we would end up returning np.dtype('complex')
here, should we validate if the dtype exists in our cudf type map before returning?
>>> np.dtype('complex')
dtype('complex128')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed - let me know if you think the way I'm handling unsupported NumPy types is OK
dtype = pd.api.types.pandas_dtype(dtype) | ||
np_type = np.dtype(dtype).type | ||
np_type = cudf.dtype(dtype).type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we now squeeze these two separate dtype calls into a single cudf.dtype
call? or is there something specific about calling pd.api.types.pandas_dtype
first?
np_type = cudf.dtype(dtype).type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
python/cudf/cudf/api/types.py
Outdated
except TypeError: | ||
pass | ||
else: | ||
if np_dtype.kind not in "biufUOMm": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if np_dtype.kind not in "biufUOMm": | |
if np_dtype not in cudf._lib.types.np_to_cudf_types: |
To make this maintainable should we just lookup our np
<->libcudf
type-map here? This was any new dtype support added will automatically be supported here by cudf.dtype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree -- but it would be nicer if the source of truth was in a more obiously named constant. For exmaple, something like: cudf._lib.types.SUPPORTED_NUMPY_TYPES
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to have a cudf._lib.types.SUPPORTED_NUMPY_TYPES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a slight problem here where <M8
is an acceptable return type here, but it's not a SUPPORTED_NUMPY_TYPE
(supported types are <M8[unit]
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good, some small suggestions. We also should replace all instances of np.dtype
throughout cudf if possible.
|
||
|
||
@pytest.mark.parametrize( | ||
"in_dtype,expect", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to test more inputs that don't translate to numpy dtypes, specifically more cudf- and pandas-specific extension types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cuDF specific types (i.e., instances of cudf._BaseDtype
) are less interesting since we just return those as-is. But I did add a few more tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also worth testing pandas interval/datetime/timedelta dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added interval cases, but as far as I know, Pandas uses numpy datetime/timedelta types as their dtype for DatetimeIndex
/TimedeltaIndex
.
Co-authored-by: Vyas Ramasubramani <[email protected]>
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #8949 +/- ##
===============================================
Coverage ? 10.59%
===============================================
Files ? 114
Lines ? 19080
Branches ? 0
===============================================
Hits ? 2022
Misses ? 17058
Partials ? 0 Continue to review full report at Codecov.
|
…cudf-dtype-function
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like something you did got you stuck triggering isort/circular import issues? Anyway, the import changes generally look good along with the main changes. I had a couple of minor additional comments, but nothing pressing.
@@ -787,12 +787,13 @@ cdef class _CPackedColumns: | |||
""" | |||
Construct a ``PackedColumns`` object from a ``cudf.DataFrame``. | |||
""" | |||
from cudf.core import RangeIndex, dtypes | |||
import cudf.core.dtypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just import _BaseIndex
? Not a big deal either way, just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [17]: %%timeit
...: import cudf.core.dtypes
...: cudf.core.dtypes._BaseDtype
...:
...:
407 ns ± 3.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [19]: %%timeit
...: from cudf.core.dtypes import _BaseDtype
...: _BaseDtype
...:
...:
875 ns ± 1.48 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sure it's for performance works for me.
|
||
|
||
@pytest.mark.parametrize( | ||
"in_dtype,expect", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also worth testing pandas interval/datetime/timedelta dtypes.
@gpucibot merge |
Thanks for working on this @shwina ! This greatly helps other |
…cudf-dtype-function
…cudf-dtype-function
@gpucibot merge |
Closes #8915