cudf.dtype function #8949

Merged
merged 27 commits into from
Aug 13, 2021
Changes from 4 commits
60c7c87
Replace cudf.dtype -> np.dtype
shwina Aug 4, 2021
5e50f52
First stab at cudf.dtype
shwina Aug 4, 2021
367b743
Handle datetimes/timedeltas in cudf.dtype
shwina Aug 4, 2021
d04a5f1
Fix test
shwina Aug 4, 2021
85351e9
Handle disallowed numpy types
shwina Aug 5, 2021
3c9dd97
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
shwina Aug 5, 2021
67cca8a
Update python/cudf/cudf/tests/test_dtypes.py
shwina Aug 5, 2021
a10eae0
Some fixes
shwina Aug 6, 2021
89ac918
Remaining failures
shwina Aug 9, 2021
acda2ee
Merge branch 'cudf-dtype-function' of github.com:shwina/cudf into cud…
shwina Aug 9, 2021
64a3290
Style
shwina Aug 9, 2021
a62ab32
Update python/cudf/cudf/api/types.py
shwina Aug 9, 2021
f79e59f
cudf.dtype -> np.dtype
shwina Aug 10, 2021
9dceb80
Merge branch 'cudf-dtype-function' of github.com:shwina/cudf into cud…
shwina Aug 10, 2021
d0bef49
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
shwina Aug 10, 2021
3eba47c
Progress
shwina Aug 11, 2021
048629c
More fix
shwina Aug 11, 2021
40736c4
Early returns
shwina Aug 11, 2021
550c7ba
More tests
shwina Aug 11, 2021
1cfa67c
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
shwina Aug 11, 2021
72d6304
Resolve circular import issues
shwina Aug 11, 2021
c8925f5
Unused import
shwina Aug 12, 2021
26df99a
Space
shwina Aug 12, 2021
fec34d9
Add interval tests
shwina Aug 12, 2021
5fc19a9
:(
shwina Aug 12, 2021
11156f5
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
shwina Aug 12, 2021
2a684be
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
shwina Aug 13, 2021
1 change: 1 addition & 0 deletions python/cudf/cudf/__init__.py
@@ -15,6 +15,7 @@
register_index_accessor,
register_series_accessor,
)
+from cudf.api.types import dtype
from cudf.core import (
NA,
BaseIndex,
29 changes: 29 additions & 0 deletions python/cudf/cudf/api/types.py
@@ -27,6 +27,35 @@
)


+def dtype(arbitrary):
+    try:
+        np_dtype = np.dtype(arbitrary)
+        if np_dtype.name == "float16":
+            np_dtype = np.dtype("float32")
+        elif np_dtype.name in ("object", "str"):
+            np_dtype = np.dtype("object")
+        elif np_dtype.str == "<m8":
+            np_dtype = np.dtype("<m8[ns]")
+        elif np_dtype.str == "<M8":
+            np_dtype = np.dtype("<M8[ns]")
+        return np_dtype
Contributor

If someone calls cudf.dtype('complex'), I think we would end up returning np.dtype('complex') here. Should we validate that the dtype exists in our cudf type map before returning?

>>> np.dtype('complex')
dtype('complex128')

Contributor Author

Fixed - let me know if you think the way I'm handling unsupported NumPy types is OK
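
One way to picture the validation discussed in this thread is an allow-list check on the resolved NumPy dtype. This is a minimal sketch only: `SUPPORTED_KINDS` and `checked_numpy_dtype` are hypothetical names, the allow-list is an assumption, and it does not reflect how the PR or cudf's real type map handles unsupported types.

```python
import numpy as np

# Hypothetical allow-list of NumPy dtype kinds (bool, int, uint, float,
# timedelta, datetime, object). This is an assumption, not cudf's type map.
SUPPORTED_KINDS = frozenset("biufmMO")


def checked_numpy_dtype(arbitrary):
    """Resolve `arbitrary` with NumPy and reject kinds outside the allow-list."""
    np_dtype = np.dtype(arbitrary)
    if np_dtype.kind not in SUPPORTED_KINDS:
        raise TypeError(f"Unsupported type {np_dtype}")
    return np_dtype


print(checked_numpy_dtype("int8"))
try:
    checked_numpy_dtype("complex")  # resolves to complex128, kind "c"
except TypeError as err:
    print(err)
```

Raising `TypeError` here is a design choice for the sketch; whether an unsupported NumPy type should raise or fall through to the pandas branch is exactly what the thread is asking.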

+    except TypeError:
+        pass
+    if isinstance(arbitrary, cudf.core.dtypes._BaseDtype):
+        return arbitrary
+    elif isinstance(arbitrary, pd.CategoricalDtype):
+        return cudf.CategoricalDtype.from_pandas(arbitrary)
+    elif isinstance(arbitrary, pd.IntervalDtype):
+        return cudf.IntervalDtype.from_pandas(arbitrary)
+    pd_dtype = pd.api.types.pandas_dtype(arbitrary)
+    try:
+        return pd_dtype.numpy_dtype
+    except AttributeError:
+        # no NumPy type corresponding to this type
+        # always object?
+        return np.dtype("object")
Contributor

This seems reasonable to me, as pandas is moving towards object as default type if no dtype is provided:

>>> pd.Series()
<stdin>:1: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Series([], dtype: float64)
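
The pandas fallback in the diff can be tried without cudf. The sketch below is an illustration under assumptions: `numpy_dtype_or_object` is a hypothetical helper name, and it only mirrors the final `pandas_dtype` / `numpy_dtype` / `object` step, not the earlier branches of `cudf.dtype`.

```python
import numpy as np
import pandas as pd


def numpy_dtype_or_object(arbitrary):
    """Return the NumPy dtype behind a pandas dtype, or object if there is none."""
    pd_dtype = pd.api.types.pandas_dtype(arbitrary)
    if isinstance(pd_dtype, np.dtype):
        # pandas_dtype already handed back a plain NumPy dtype.
        return pd_dtype
    try:
        # Masked extension dtypes such as Int8Dtype expose `numpy_dtype`.
        return pd_dtype.numpy_dtype
    except AttributeError:
        # Extension dtypes without a NumPy counterpart fall back to object.
        return np.dtype("object")


print(numpy_dtype_or_object(pd.Int8Dtype()))
print(numpy_dtype_or_object("boolean"))
print(numpy_dtype_or_object("category"))
```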



def is_numeric_dtype(obj):
"""Check whether the provided array or dtype is of a numeric dtype.

8 changes: 4 additions & 4 deletions python/cudf/cudf/core/column/column.py
@@ -432,7 +432,7 @@ def view(self, dtype: Dtype) -> ColumnBase:

"""

-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)

if dtype.kind in ("o", "u", "s"):
raise TypeError(
@@ -2078,11 +2078,11 @@ def as_column(
data
)
dtype = pd.api.types.pandas_dtype(dtype)
-np_type = np.dtype(dtype).type
+np_type = cudf.dtype(dtype).type
Contributor

Can we now squeeze these two separate dtype calls into a single cudf.dtype call? Or is there something specific about calling pd.api.types.pandas_dtype first?

np_type = cudf.dtype(dtype).type

Contributor Author

Fixed

if np_type == np.bool_:
pa_type = pa.bool_()
else:
-pa_type = np_to_pa_dtype(np.dtype(dtype))
+pa_type = np_to_pa_dtype(cudf.dtype(dtype))
data = as_column(
pa.array(
arbitrary,
@@ -2131,7 +2131,7 @@ def _construct_array(
Construct a CuPy or NumPy array from `arbitrary`
"""
try:
-dtype = dtype if dtype is None else np.dtype(dtype)
+dtype = dtype if dtype is None else cudf.dtype(dtype)
arbitrary = cupy.asarray(arbitrary, dtype=dtype)
except (TypeError, ValueError):
native_dtype = dtype
4 changes: 2 additions & 2 deletions python/cudf/cudf/core/column/datetime.py
@@ -71,7 +71,7 @@ def __init__(
mask : Buffer; optional
The validity mask
"""
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if data.size % dtype.itemsize:
raise ValueError("Buffer size must be divisible by element size")
if size is None:
@@ -236,7 +236,7 @@ def __cuda_array_interface__(self) -> Mapping[builtins.str, Any]:
return output

def as_datetime_column(self, dtype: Dtype, **kwargs) -> DatetimeColumn:
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if dtype == self.dtype:
return self
return libcudf.unary.cast(self, dtype=dtype)
6 changes: 3 additions & 3 deletions python/cudf/cudf/core/column/numerical.py
@@ -53,7 +53,7 @@ def __init__(
The dtype associated with the data Buffer
mask : Buffer, optional
"""
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if data.size % dtype.itemsize:
raise ValueError("Buffer size must be divisible by element size")
if size is None:
@@ -253,7 +253,7 @@ def as_decimal_column(
return libcudf.unary.cast(self, dtype)

def as_numerical_column(self, dtype: Dtype, **kwargs) -> NumericalColumn:
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if dtype == self.dtype:
return self
return libcudf.unary.cast(self, dtype)
@@ -608,7 +608,7 @@ def _safe_cast_to_int(col: ColumnBase, dtype: DtypeObj) -> ColumnBase:
else:
raise TypeError(
f"Cannot safely cast non-equivalent "
-f"{col.dtype.type.__name__} to {np.dtype(dtype).type.__name__}"
+f"{col.dtype.type.__name__} to {cudf.dtype(dtype).type.__name__}"
)


8 changes: 4 additions & 4 deletions python/cudf/cudf/core/column/string.py
@@ -5054,7 +5054,7 @@ def __contains__(self, item: ScalarLike) -> bool:
def as_numerical_column(
self, dtype: Dtype, **kwargs
) -> "cudf.core.column.NumericalColumn":
-out_dtype = np.dtype(dtype)
+out_dtype = cudf.dtype(dtype)

if out_dtype.kind in {"i", "u"}:
if not libstrings.is_integer(self).all():
@@ -5096,7 +5096,7 @@ def _as_datetime_or_timedelta_column(self, dtype, format):
def as_datetime_column(
self, dtype: Dtype, **kwargs
) -> "cudf.core.column.DatetimeColumn":
-out_dtype = np.dtype(dtype)
+out_dtype = cudf.dtype(dtype)

# infer on host from the first not na element
# or return all null column if all values
@@ -5120,7 +5120,7 @@ def as_datetime_column(
def as_timedelta_column(
self, dtype: Dtype, **kwargs
) -> "cudf.core.column.TimeDeltaColumn":
-out_dtype = np.dtype(dtype)
+out_dtype = cudf.dtype(dtype)
format = "%D days %H:%M:%S"
return self._as_datetime_or_timedelta_column(out_dtype, format)

@@ -5379,7 +5379,7 @@ def view(self, dtype) -> "cudf.core.column.ColumnBase":
raise ValueError(
"Can not produce a view of a string column with nulls"
)
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
str_byte_offset = self.base_children[0].element_indexing(self.offset)
str_end_byte_offset = self.base_children[0].element_indexing(
self.offset + self.size
4 changes: 2 additions & 2 deletions python/cudf/cudf/core/column/timedelta.py
@@ -60,7 +60,7 @@ def __init__(
The number of null values.
If None, it is calculated automatically.
"""
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if data.size % dtype.itemsize:
raise ValueError("Buffer size must be divisible by element size")
if size is None:
@@ -353,7 +353,7 @@ def as_string_column(
)

def as_timedelta_column(self, dtype: Dtype, **kwargs) -> TimeDeltaColumn:
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if dtype == self.dtype:
return self
return libcudf.unary.cast(self, dtype=dtype)
6 changes: 6 additions & 0 deletions python/cudf/cudf/core/dtypes.py
@@ -559,6 +559,12 @@ def to_arrow(self):
pa.from_numpy_dtype(self.subtype), self.closed
)

+    @classmethod
+    def from_pandas(cls, pd_dtype: pd.IntervalDtype) -> "IntervalDtype":
+        return cls(
+            subtype=pd_dtype.subtype
+        )  # TODO: needs `closed` when we upgrade Pandas


def is_categorical_dtype(obj):
"""Check whether an array-like or dtype is of the Categorical dtype.
3 changes: 2 additions & 1 deletion python/cudf/cudf/core/scalar.py
@@ -5,6 +5,7 @@
import pyarrow as pa
from pandas._libs.missing import NAType as pd_NAType

+import cudf
from cudf._lib.scalar import DeviceScalar, _is_null_host_scalar
from cudf.core.column.column import ColumnBase
from cudf.core.dtypes import Decimal64Dtype, ListDtype, StructDtype
@@ -171,7 +172,7 @@ def _preprocess_host_value(self, value, dtype):
dtype = value.dtype

if not isinstance(dtype, Decimal64Dtype):
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)

if not valid:
value = NA
2 changes: 1 addition & 1 deletion python/cudf/cudf/core/series.py
@@ -3774,7 +3774,7 @@ def one_hot_encoding(self, cats, dtype="float64"):
cats = cats.to_pandas()
else:
cats = pd.Series(cats, dtype="object")
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)

def encode(cat):
if cat is None:
2 changes: 1 addition & 1 deletion python/cudf/cudf/testing/_utils.py
@@ -245,7 +245,7 @@ def _get_args_kwars_for_assert_exceptions(func_args_and_kwargs):


def gen_rand(dtype, size, **kwargs):
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if dtype.kind == "f":
res = np.random.random(size=size).astype(dtype)
if kwargs.get("positive_only", False):
10 changes: 5 additions & 5 deletions python/cudf/cudf/testing/dataset_generator.py
@@ -380,7 +380,7 @@ def rand_dataframe(
)
)
else:
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if dtype.kind in ("i", "u"):
column_params.append(
ColumnParameters(
@@ -428,7 +428,7 @@ def rand_dataframe(
dtype=dtype, size=cardinality
),
is_sorted=False,
-dtype=np.dtype(dtype),
+dtype=cudf.dtype(dtype),
)
)
elif dtype.kind == "m":
@@ -440,7 +440,7 @@ def rand_dataframe(
dtype=dtype, size=cardinality
),
is_sorted=False,
-dtype=np.dtype(dtype),
+dtype=cudf.dtype(dtype),
)
)
elif dtype.kind == "b":
@@ -450,7 +450,7 @@ def rand_dataframe(
null_frequency=null_frequency,
generator=boolean_generator(cardinality),
is_sorted=False,
-dtype=np.dtype(dtype),
+dtype=cudf.dtype(dtype),
)
)
else:
@@ -538,7 +538,7 @@ def get_values_for_nested_data(dtype, lists_max_length):
Returns list of values based on dtype.
"""
cardinality = np.random.randint(0, lists_max_length)
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if dtype.kind in ("i", "u"):
values = int_generator(dtype=dtype, size=cardinality)()
elif dtype.kind == "f":
14 changes: 8 additions & 6 deletions python/cudf/cudf/tests/test_binops.py
@@ -931,7 +931,7 @@ def test_ufunc_ops(lhs, rhs, ops):
def dtype_scalar(val, dtype):
if dtype == "str":
return str(val)
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
if dtype.type in {np.datetime64, np.timedelta64}:
res, _ = np.datetime_data(dtype)
return dtype.type(val, res)
@@ -1695,13 +1695,15 @@ def test_binops_with_lhs_numpy_scalar(frame, dtype):
)

if dtype == "datetime64[s]":
-val = np.dtype(dtype).type(4, "s")
+val = cudf.dtype(dtype).type(4, "s")
elif dtype == "timedelta64[s]":
-val = np.dtype(dtype).type(4, "s")
+val = cudf.dtype(dtype).type(4, "s")
elif dtype == "category":
val = np.int64(4)
elif dtype == "str":
val = str(4)
else:
-val = np.dtype(dtype).type(4)
+val = cudf.dtype(dtype).type(4)

expected = val == data.to_pandas()
got = val == data
@@ -2793,11 +2795,11 @@ def test_column_null_scalar_comparison(dtype, null_scalar, cmpop):
# a new series where all the elements are <NA>.

if isinstance(null_scalar, np.datetime64):
-if np.dtype(dtype).kind not in "mM":
+if cudf.dtype(dtype).kind not in "mM":
pytest.skip()
null_scalar = null_scalar.astype(dtype)

-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)

data = [1, 2, 3, 4, 5]
sr = cudf.Series(data, dtype=dtype)
2 changes: 1 addition & 1 deletion python/cudf/cudf/tests/test_categorical.py
@@ -799,7 +799,7 @@ def test_categorical_setitem_with_nan():
@pytest.mark.parametrize("dtype", list(NUMERIC_TYPES) + ["object"])
@pytest.mark.parametrize("input_obj", [[1, cudf.NA, 3]])
def test_series_construction_with_nulls(input_obj, dtype):
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
input_obj = [
dtype.type(v) if v is not cudf.NA else cudf.NA for v in input_obj
]
5 changes: 3 additions & 2 deletions python/cudf/cudf/tests/test_contains.py
@@ -4,6 +4,7 @@
import pandas as pd
import pytest

+import cudf
from cudf import Series
from cudf.core.index import RangeIndex, as_index
from cudf.testing._utils import (
@@ -82,7 +83,7 @@ def test_rangeindex_contains():

@pytest.mark.parametrize("dtype", NUMERIC_TYPES)
def test_lists_contains(dtype):
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
inner_data = np.array([1, 2, 3], dtype=dtype)

data = Series([inner_data])
@@ -96,7 +97,7 @@ def test_lists_contains(dtype):

@pytest.mark.parametrize("dtype", DATETIME_TYPES + TIMEDELTA_TYPES)
def test_lists_contains_datetime(dtype):
-dtype = np.dtype(dtype)
+dtype = cudf.dtype(dtype)
inner_data = np.array([1, 2, 3])

unit, _ = np.datetime_data(dtype)
31 changes: 31 additions & 0 deletions python/cudf/cudf/tests/test_dtypes.py
@@ -257,3 +257,34 @@ def test_lists_of_structs_dtype(data):

assert_column_array_dtype_equal(got._column, expected)
assert expected.equals(got._column.to_arrow())


+@pytest.mark.parametrize(
+    "in_dtype,expect",
Contributor

We probably want to test more inputs that don't translate to numpy dtypes, specifically more cudf- and pandas-specific extension types.

Contributor Author

The cuDF specific types (i.e., instances of cudf._BaseDtype) are less interesting since we just return those as-is. But I did add a few more tests.

Contributor

Maybe also worth testing pandas interval/datetime/timedelta dtypes.

Contributor Author

Added interval cases, but as far as I know, Pandas uses numpy datetime/timedelta types as their dtype for DatetimeIndex/TimedeltaIndex.
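
The point about pandas datetime and timedelta dtypes can be checked directly. This is a small sketch using only NumPy and pandas; it covers the timezone-naive case the comment describes, plus the timezone-aware case, where pandas uses its own `DatetimeTZDtype` instead.

```python
import numpy as np
import pandas as pd

dt_index = pd.DatetimeIndex(["2021-08-13"])
td_index = pd.TimedeltaIndex(["1 days"])

# Timezone-naive datetime and timedelta indexes carry plain NumPy dtypes.
for index in (dt_index, td_index):
    print(isinstance(index.dtype, np.dtype), index.dtype.kind)

# A timezone-aware index is the exception: its dtype is a pandas extension dtype.
tz_index = dt_index.tz_localize("UTC")
print(isinstance(tz_index.dtype, pd.DatetimeTZDtype))
```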

+    [
+        (np.dtype("int8"), np.dtype("int8")),
+        (np.int8, np.dtype("int8")),
+        (np.float16, np.dtype("float32")),
+        (pd.Int8Dtype(), np.dtype("int8")),
+        (pd.StringDtype(), np.dtype("object")),
+        ("int8", np.dtype("int8")),
+        ("boolean", np.dtype("bool")),
+        (int, np.dtype("int64")),
+        (float, np.dtype("float64")),
+        (cudf.ListDtype("int64"), cudf.ListDtype("int64")),
+        ("float16", np.dtype("float32")),
+        (np.dtype("U"), np.dtype("object")),
+        ("timedelta64", np.dtype("<m8[ns]")),
+        ("timedelta64[ns]", np.dtype("<m8[ns]")),
+        ("timedelta64[ms]", np.dtype("<m8[ms]")),
+        ("timedelta64[D]", np.dtype("<m8[D]")),
+        ("<m8[s]", np.dtype("<m8[s]")),
+        ("datetime64", np.dtype("<M8[ns]")),
+        ("datetime64[ns]", np.dtype("<M8[ns]")),
+        ("datetime64[ms]", np.dtype("<M8[ms]")),
+        ("datetime64[D]", np.dtype("<M8[D]")),
+        ("<M8[s]", np.dtype("<M8[s]")),
+    ],
+)
+def test_dtype(in_dtype, expect):
+    assert_eq(cudf.dtype(in_dtype), expect)
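
The NumPy-only cases in this parametrization can be reproduced without a GPU. The sketch below is an approximation, not the PR's code: `normalize_numpy_dtype` is a hypothetical name, it checks `kind` rather than the `name` comparison used in the diff (so sized string dtypes such as `U5` also map to object), and it detects unit-less datetimes with `np.datetime_data` instead of comparing against `"<m8"` / `"<M8"`.

```python
import numpy as np


def normalize_numpy_dtype(arbitrary):
    """Apply rewrites similar to the NumPy branch of the function under test."""
    np_dtype = np.dtype(arbitrary)
    if np_dtype == np.dtype("float16"):
        # Half precision is widened to single precision.
        return np.dtype("float32")
    if np_dtype.kind in ("O", "U"):
        # Object and unicode string dtypes both map to object.
        return np.dtype("object")
    if np_dtype.kind in ("m", "M") and np.datetime_data(np_dtype)[0] == "generic":
        # Unit-less timedelta64/datetime64 defaults to nanosecond resolution.
        return np.dtype(f"{np_dtype.kind}8[ns]")
    return np_dtype


for spec in ("float16", "timedelta64", "datetime64", "datetime64[ms]", "U"):
    print(spec, "->", normalize_numpy_dtype(spec))
```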
2 changes: 1 addition & 1 deletion python/cudf/cudf/tests/test_joining.py
@@ -810,7 +810,7 @@ def test_join_datetimes_index(dtype):
pdf = pdf_lhs.join(pdf_rhs, sort=True)
gdf = gdf_lhs.join(gdf_rhs, sort=True)

-assert gdf["d"].dtype == np.dtype(dtype)
+assert gdf["d"].dtype == cudf.dtype(dtype)

assert_join_results_equal(pdf, gdf, how="inner")
