[WIP] cuDF.dtype objects #6160

Closed

Commits (82 total, all by brandon-b-miller; changes shown from 72 commits):
1c83eac  initial dtype work (Jul 28, 2020)
33bd96c  begin to plumb dtype (Jul 28, 2020)
baf138c  migrate dtypes to cudf main __init__ (Jul 29, 2020)
bdb87fa  numerical column plumbing (Jul 29, 2020)
4a3fe71  update dtype classes, mappings (Jul 29, 2020)
1cf2c3e  start to plumb stringcolumn (Jul 29, 2020)
dbc4970  inherit from basic cython class (Jul 30, 2020)
ba42bd8  plumb numerical column __repr__, default_na_value (Jul 30, 2020)
60272e2  plumb some parts of unary (Jul 30, 2020)
c03be40  make a factory and fix bugs (Jul 30, 2020)
7f6cb36  more progress on columns, dtype object (Jul 31, 2020)
a81c368  forgot string O (Jul 31, 2020)
572c39f  more progress (Jul 31, 2020)
4f6f316  column tests pass (Aug 3, 2020)
ee6ece5  working up through test_array_func (Aug 3, 2020)
62c5e17  more tests pass (Aug 4, 2020)
139465f  merge 0.15 and resolve conflicts (Aug 19, 2020)
ef5b9cb  handle list dtype in _Dtype (Aug 21, 2020)
9320755  fix series syntax error (Aug 21, 2020)
dac2940  add timedelta dtypes (Aug 21, 2020)
6eee9eb  fix some numericalcolumn bugs (Aug 21, 2020)
1ace460  fix index type mapping dicts (Aug 21, 2020)
df6426b  pass all binop tests (Aug 24, 2020)
92d1a64  more progress (Aug 25, 2020)
59b3673  all column tests pass (Aug 26, 2020)
297a31a  move more stuff to cudf.api.types (Aug 26, 2020)
e5def6e  forgot entire api/ folder (Aug 26, 2020)
b4d344f  fix mutable_column_view (Aug 26, 2020)
22fd5d9  working through dataframe.py tests (Aug 26, 2020)
c5a0b62  pass join tests (Aug 27, 2020)
d47de03  fix categorical tests (Aug 27, 2020)
fe180a3  more bugfixes (Aug 27, 2020)
cad48d0  more progress (Aug 28, 2020)
6a1785c  all repr tests pass (Aug 30, 2020)
8552907  all timedelta tests pass (Aug 31, 2020)
2b59285  sorting tests pass (Aug 31, 2020)
b2851a2  fix more tests (Aug 31, 2020)
9540643  hackily pass select_dtype tests (Sep 1, 2020)
781b42e  all dataframe tests pass! (Sep 1, 2020)
13fe291  much more progress (Sep 2, 2020)
3c047ef  fix indexing tests (Sep 2, 2020)
a139571  less than 10 tests still failing (Sep 3, 2020)
40a1699  merge 0.16 and resolve conflicts (Sep 3, 2020)
ea24184  fix bugs (Sep 3, 2020)
ddf340b  fix a few more bugs (Sep 3, 2020)
4a14042  construct from string tests (Sep 3, 2020)
55cec7e  clean up dtypes.py (Sep 3, 2020)
c28c7b6  fixed some bugs (Sep 4, 2020)
bad1dc2  a little iteration on dtypes.py (Sep 4, 2020)
0938507  implement the scalar type attribute (Sep 4, 2020)
e5a489d  cleanup and style (Sep 4, 2020)
62a7d5b  bug fixes and type attribute plumbing/iteration (Sep 9, 2020)
80baff4  fix repr and move around testing utilities (Sep 9, 2020)
38e11af  clean up reduce.pyx (Sep 9, 2020)
22b299d  implement cudf::scalar -> cudf.Scalar -> Buffer, column (Sep 9, 2020)
1552c0a  minor bugfixes (Sep 10, 2020)
78caafa  add __int__ and __float__ to scalar (Sep 10, 2020)
a9fe2fb  partially implement scalar binops (Sep 11, 2020)
455af02  partial tests for scalar binop result dtype (Sep 11, 2020)
e4c0bf1  scalar binop updates (Sep 13, 2020)
42828c0  convert a list of cudf.Scalars into a contiguous column (Sep 14, 2020)
0d3d6a0  migrate scalar methods to python (Sep 14, 2020)
63e1387  actually include scalar.py and update tests (Sep 14, 2020)
2005d65  fix the rest of test_reductions.py (Sep 14, 2020)
523919c  fix indexing error (Sep 14, 2020)
c730301  fix as_scalar (Sep 14, 2020)
c5450c2  remove unecessary code (Sep 14, 2020)
7bc0893  minor bugfixes (Sep 14, 2020)
a3a4893  scalar plumbing, cudf.api.types additions, bug fixes (Sep 15, 2020)
6bf121c  add cudf.api.types.isscalar(element) (Sep 15, 2020)
165f86c  plumbing (Sep 15, 2020)
cec9528  scalars may __round__ (Sep 15, 2020)
a8b380b  to_numpy -> numpy_dtype (Sep 16, 2020)
1dc151a  extra to_numpy -> numpy_dtype that were missed (Sep 16, 2020)
46a9c2f  add docstrings, respond to reviews (Sep 16, 2020)
81e6058  minor fixes and code removal (Sep 17, 2020)
d7930eb  remove cudf_dtype_from_pydata_dtype (Sep 17, 2020)
c290a15  update api calls for find_common_type to be numpy-like (Sep 17, 2020)
e90e325  let pandas handle categorical edge cases (Sep 17, 2020)
3d8ca2f  fix categorical creation and casting throughout cudf (Sep 17, 2020)
2653384  remove old code (Sep 17, 2020)
123784b  continued bugfixes (Sep 17, 2020)
34 changes: 33 additions & 1 deletion python/cudf/cudf/__init__.py
@@ -8,6 +8,7 @@

import rmm

import cudf.api.types
from cudf import core, datasets, testing
from cudf._version import get_versions
from cudf.core import (
@@ -31,8 +32,39 @@
UInt64Index,
from_pandas,
merge,
Scalar
)
from cudf.core.dtypes import (
BooleanDtype,
CategoricalDtype,
Datetime,
Datetime64MSDtype,
Datetime64NSDtype,
Datetime64SDtype,
Datetime64USDtype,
Flexible,
Float32Dtype,
Float64Dtype,
Floating,
Generic,
Int8Dtype,
Int16Dtype,
Int32Dtype,
Int64Dtype,
Integer,
Number,
StringDtype,
Timedelta,
Timedelta64MSDtype,
Timedelta64NSDtype,
Timedelta64SDtype,
Timedelta64USDtype,
UInt8Dtype,
UInt16Dtype,
UInt32Dtype,
UInt64Dtype,
dtype,
)
from cudf.core.dtypes import CategoricalDtype
from cudf.core.groupby import Grouper
from cudf.core.ops import (
add,
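For orientation, a minimal usage sketch of the dtype objects this diff exports; the zero-argument constructors and the string spelling accepted by cudf.dtype() are assumptions, since this WIP diff does not show their signatures:

```python
# Sketch only: exercising the newly exported cudf dtype objects.
import cudf

int_dtype = cudf.Int32Dtype()      # assumed zero-argument constructor
same_dtype = cudf.dtype("int32")   # assumed numpy-style string spelling

# A Series built with the explicit dtype object should report it back.
s = cudf.Series([1, 2, 3], dtype=int_dtype)
print(s.dtype)
```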
18 changes: 7 additions & 11 deletions python/cudf/cudf/_lib/aggregation.pyx
@@ -16,8 +16,10 @@ from cudf._lib.types cimport (
underlying_type_t_interpolation,
underlying_type_t_null_policy,
underlying_type_t_type_id,
_Dtype
)
from cudf._lib.types import Interpolation
from cudf.core.dtypes import dtype as cudf_dtype

try:
# Numba >= 0.49
@@ -241,24 +243,18 @@ cdef class _AggregationFactory:
cdef string cpp_str

# Handling UDF type
nb_type = numpy_support.from_dtype(kwargs['dtype'])
nb_type = numpy_support.from_dtype(kwargs['dtype'].to_numpy)
type_signature = (nb_type[:],)
compiled_op = cudautils.compile_udf(op, type_signature)
output_np_dtype = np.dtype(compiled_op[1])
output_np_dtype = cudf_dtype(np.dtype(compiled_op[1]))
cpp_str = compiled_op[0].encode('UTF-8')
if output_np_dtype not in np_to_cudf_types:
if cudf_dtype(output_np_dtype) not in np_to_cudf_types:
raise TypeError(
"Result of window function has unsupported dtype {}"
.format(op[1])
)
Comment on lines +251 to 255

Collaborator:

Wouldn't we already error in trying to construct the cudf_dtype object here instead of having to check in np_to_cudf_types?

brandon-b-miller (Contributor, Author), Sep 16, 2020:
It shouldn't need to check the dict, that's correct. The dtype should error upon construction. The only question is if this case deserves its own error. To figure that out I think I need to look closely here and figure out what exactly the user could be doing that would cause them to run into this. More to follow here.

tid = (
<libcudf_types.type_id> (
<underlying_type_t_type_id> (
np_to_cudf_types[output_np_dtype]
)
)
)
out_dtype = libcudf_types.data_type(tid)
cdef _Dtype pydtype = output_np_dtype
out_dtype = pydtype.get_libcudf_type()

agg.c_obj = move(libcudf_aggregation.make_udf_aggregation(
libcudf_aggregation.udf_type.PTX, cpp_str, out_dtype
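The edit above shows the central pattern of this PR: the per-call lookup in np_to_cudf_types is replaced by a method on the dtype object itself. A pure-Python sketch of that idea follows; the type ids and class body are placeholders for illustration, not the actual implementation:

```python
# Illustrative only: each dtype object carries its libcudf type id and
# exposes it via get_libcudf_type(), so call sites no longer index the
# np_to_cudf_types dict themselves. Enum values are placeholders.
from enum import IntEnum

import numpy as np


class TypeId(IntEnum):
    INT32 = 0
    FLOAT64 = 1


_NP_TO_TID = {
    np.dtype("int32"): TypeId.INT32,
    np.dtype("float64"): TypeId.FLOAT64,
}


class _Dtype:
    def __init__(self, np_dtype):
        # Mirrors the .to_numpy attribute used in the diff above.
        self.to_numpy = np.dtype(np_dtype)

    def get_libcudf_type(self):
        # The mapping now lives with the dtype, behind one method.
        return _NP_TO_TID[self.to_numpy]


out_dtype = _Dtype("float64").get_libcudf_type()
print(out_dtype)
```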
24 changes: 6 additions & 18 deletions python/cudf/cudf/_lib/binaryop.pyx
@@ -23,10 +23,11 @@ from cudf._lib.cpp.types cimport (
type_id,
)

from cudf.utils.dtypes import is_string_dtype
from cudf.api.types import is_string_dtype

from cudf._lib.cpp.binaryop cimport binary_operator
cimport cudf._lib.cpp.binaryop as cpp_binaryop
from cudf._lib.types cimport _Dtype


class BinaryOperation(IntEnum):
@@ -170,19 +171,13 @@ def binaryop(lhs, rhs, op, dtype):
"""
Dispatches a binary op call to the appropriate libcudf function:
"""
cdef _Dtype py_dtype = dtype
op = BinaryOperation[op.upper()]
cdef binary_operator c_op = <binary_operator> (
<underlying_type_t_binary_operator> op
)
cdef type_id tid = (
<type_id> (
<underlying_type_t_type_id> (
np_to_cudf_types[np.dtype(dtype)]
)
)
)

cdef data_type c_dtype = data_type(tid)
cdef data_type c_dtype = py_dtype.get_libcudf_type()

if isinstance(lhs, Scalar) or np.isscalar(lhs) or lhs is None:

@@ -229,15 +224,8 @@ def binaryop_udf(Column lhs, Column rhs, udf_ptx, dtype):
"""
cdef column_view c_lhs = lhs.view()
cdef column_view c_rhs = rhs.view()

cdef type_id tid = (
<type_id> (
<underlying_type_t_type_id> (
np_to_cudf_types[np.dtype(dtype)]
)
)
)
cdef data_type c_dtype = data_type(tid)
cdef _Dtype pydtype = dtype
cdef data_type c_dtype = pydtype.get_libcudf_type()

cdef string cpp_str = udf_ptx.encode("UTF-8")

29 changes: 7 additions & 22 deletions python/cudf/cudf/_lib/column.pyx
@@ -8,7 +8,7 @@ import rmm
import cudf

from cudf.core.buffer import Buffer
from cudf.utils.dtypes import is_categorical_dtype, is_list_dtype
from cudf.api.types import is_categorical_dtype, is_list_dtype
import cudf._lib as libcudfxx

from cpython.buffer cimport PyObject_CheckBuffer
@@ -41,6 +41,8 @@ from cudf._lib.cpp.lists.lists_column_view cimport lists_column_view
from cudf._lib.cpp.scalar.scalar cimport scalar
from cudf._lib.scalar cimport Scalar
cimport cudf._lib.cpp.types as libcudf_types
from cudf._lib.types cimport _Dtype

cimport cudf._lib.cpp.unary as libcudf_unary

cdef class Column:
@@ -316,14 +318,8 @@ cdef class Column:
col = self.base_children[0]
else:
col = self
data_dtype = col.dtype

cdef libcudf_types.type_id tid = <libcudf_types.type_id> (
<underlying_type_t_type_id> (
np_to_cudf_types[np.dtype(data_dtype)]
)
)
cdef libcudf_types.data_type dtype = libcudf_types.data_type(tid)
cdef _Dtype pydtype = col.dtype
cdef libcudf_types.data_type dtype = pydtype.get_libcudf_type()
cdef libcudf_types.size_type offset = self.offset
cdef vector[mutable_column_view] children
cdef void* data
@@ -374,19 +370,8 @@ cdef class Column:
else:
col = self

data_dtype = col.dtype
cdef libcudf_types.type_id tid

if not is_list_dtype(self.dtype):
tid = <libcudf_types.type_id> (
<underlying_type_t_type_id> (
np_to_cudf_types[np.dtype(data_dtype)]
)
)
else:
tid = libcudf_types.type_id.LIST

cdef libcudf_types.data_type dtype = libcudf_types.data_type(tid)
cdef _Dtype pydtype = col.dtype
cdef libcudf_types.data_type dtype = pydtype.get_libcudf_type()
cdef libcudf_types.size_type offset = self.offset
cdef vector[column_view] children
cdef void* data
3 changes: 2 additions & 1 deletion python/cudf/cudf/_lib/copying.pyx
@@ -1,6 +1,7 @@
# Copyright (c) 2020, NVIDIA CORPORATION.

import pandas as pd
from cudf.api.types import is_integer_dtype

from libcpp cimport bool
from libcpp.memory cimport make_unique, unique_ptr
@@ -129,7 +130,7 @@ def copy_range(Column input_column,


def gather(Table source_table, Column gather_map, bool keep_index=True):
assert pd.api.types.is_integer_dtype(gather_map.dtype)
assert is_integer_dtype(gather_map.dtype)

cdef unique_ptr[table] c_result
cdef table_view source_table_view
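A small sketch of the swapped predicate, assuming cudf.api.types.is_integer_dtype behaves like the pandas helper it replaces when given cudf dtypes:

```python
# Sketch: the gather-map dtype check now goes through cudf's own
# type predicate rather than pandas'.
import cudf
from cudf.api.types import is_integer_dtype

gather_map = cudf.Series([0, 2, 1])
assert is_integer_dtype(gather_map.dtype)
```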
6 changes: 6 additions & 0 deletions python/cudf/cudf/_lib/cpp/scalar/scalar.pxd
@@ -23,6 +23,7 @@ cdef extern from "cudf/scalar/scalar.hpp" namespace "cudf" nogil:
numeric_scalar(T value, bool is_valid) except +
void set_value(T value) except +
T value() except +
T* data() except +

cdef cppclass timestamp_scalar[T](scalar):
timestamp_scalar() except +
@@ -34,6 +35,8 @@
int64_t ticks_since_epoch_64 "ticks_since_epoch"() except +
int32_t ticks_since_epoch_32 "ticks_since_epoch"() except +
T value() except +
T* data() except +


cdef cppclass duration_scalar[T](scalar):
duration_scalar() except +
@@ -44,10 +47,13 @@
duration_scalar(int32_t value, bool is_valid) except +
int64_t ticks "count"() except +
T value() except +
T* data() except +


cdef cppclass string_scalar(scalar):
string_scalar() except +
string_scalar(string st) except +
string_scalar(string st, bool is_valid) except +
string_scalar(string_scalar other) except +
string to_string() except +
const char* data() except +
2 changes: 1 addition & 1 deletion python/cudf/cudf/_lib/groupby.pyx
@@ -178,7 +178,7 @@ def _drop_unsupported_aggs(Table values, aggs):
if all(len(v) == 0 for v in aggs.values()):
return aggs

from cudf.utils.dtypes import (
from cudf.api.types import (
is_categorical_dtype,
is_string_dtype,
is_list_dtype
9 changes: 4 additions & 5 deletions python/cudf/cudf/_lib/parquet.pyx
@@ -11,7 +11,8 @@ import json
from cython.operator import dereference
import numpy as np

from cudf.utils.dtypes import np_to_pa_dtype, is_categorical_dtype
from cudf.utils.dtypes import np_to_pa_dtype
from cudf.api.types import is_categorical_dtype
from libc.stdlib cimport free
from libc.stdint cimport uint8_t
from libcpp.memory cimport shared_ptr, unique_ptr, make_unique
@@ -102,7 +103,6 @@ cpdef generate_pandas_metadata(Table table, index):
)
else:
types.append(np_to_pa_dtype(col.dtype))

# Indexes
if index is not False:
for name in table._index.names:
@@ -134,16 +134,15 @@
index_descriptors.append(descr)
else:
col_names.append(name)

metadata_df = table.head(0).to_pandas()
Collaborator:

Is this needed because our duck typing no longer works in this situation?

brandon-b-miller (Contributor, Author), Sep 16, 2020:
Yes, I felt like this was a pretty bad hack. The pyarrow function here writes the metadata by subbing in the names of the dtype objects attached to the dataframe. So if we pass it our frame it writes a parquet file that other readers can't read because they don't understand the dtypes. I suspect there's a better way of doing this, but ideally that solution wouldn't need to expect that the pyarrow function works in a specific way, e.g. an actual API for doing this. We might be able to contribute that to pyarrow directly.

metadata = pa.pandas_compat.construct_metadata(
table,
metadata_df,
col_names,
index_levels,
index_descriptors,
index,
types,
)

md = metadata[b'pandas']
json_str = md.decode("utf-8")
return json_str
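A short sketch of the workaround discussed in the thread above: pyarrow records the dtype names of whatever frame it is handed, so an empty pandas view of the table keeps the written pandas metadata readable by other parquet consumers.

```python
# Sketch of the idea only; the real call passes metadata_df into
# pa.pandas_compat.construct_metadata as shown in the diff above.
import cudf

gdf = cudf.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# head(0) keeps the schema but no rows, and to_pandas() converts the
# cudf dtypes to pandas/numpy ones before pyarrow serialises their names.
metadata_df = gdf.head(0).to_pandas()
print(metadata_df.dtypes)
```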
24 changes: 10 additions & 14 deletions python/cudf/cudf/_lib/reduce.pyx
@@ -8,11 +8,14 @@ from cudf._lib.cpp.column.column cimport column
from cudf._lib.scalar cimport Scalar
from cudf._lib.column cimport Column
from cudf._lib.types import np_to_cudf_types
from cudf._lib.types cimport underlying_type_t_type_id
from cudf._lib.types cimport underlying_type_t_type_id, _Dtype
from cudf._lib.move cimport move
from cudf._lib.aggregation cimport make_aggregation, aggregation
from libcpp.memory cimport unique_ptr
import numpy as np
from cudf.core.dtypes import dtype as cudf_dtype
from cudf.api.types import find_common_type
from cudf.core.scalar import Scalar as PyScalar


def reduce(reduction_op, Column incol, dtype=None, **kwargs):
Expand All @@ -29,26 +32,19 @@ def reduce(reduction_op, Column incol, dtype=None, **kwargs):
A numpy data type to use for the output, defaults
to the same type as the input column
"""

dtype = cudf_dtype(dtype)
col_dtype = incol.dtype
if reduction_op in ['sum', 'sum_of_squares', 'product']:
col_dtype = np.find_common_type([col_dtype], [np.uint64])
col_dtype = find_common_type([col_dtype], [np.uint64])
col_dtype = col_dtype if dtype is None else dtype

cdef column_view c_incol_view = incol.view()
cdef unique_ptr[scalar] c_result
cdef unique_ptr[aggregation] c_agg = move(make_aggregation(
reduction_op, kwargs
))
cdef type_id tid = (
<type_id> (
<underlying_type_t_type_id> (
np_to_cudf_types[np.dtype(col_dtype)]
)
)
)

cdef data_type c_out_dtype = data_type(tid)
cdef _Dtype data_dtype = col_dtype
cdef data_type c_out_dtype = data_dtype.get_libcudf_type()

# check empty case
if len(incol) <= incol.null_count:
Expand All @@ -65,8 +61,8 @@ def reduce(reduction_op, Column incol, dtype=None, **kwargs):
c_out_dtype
))

py_result = Scalar.from_unique_ptr(move(c_result))
return py_result.value
cy_result = Scalar.from_unique_ptr(move(c_result))
return PyScalar(cy_result)


def scan(scan_op, Column incol, inclusive, **kwargs):
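The user-visible change in reduce() is the return type: the Cython-level Scalar is now wrapped in the Python-level cudf Scalar instead of being eagerly converted to a host value. A sketch of where a caller might notice this; how the frontend unwraps the result is an assumption, not shown in this diff:

```python
# Sketch only: user-facing reductions route through libcudf reduce internally.
import cudf

s = cudf.Series([1, 2, 3, 4])
total = s.sum()
# With this change the low-level reduce hands back a cudf Scalar object,
# so device-to-host transfer can be deferred until the value is needed.
print(total)
```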
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/scalar.pxd
@@ -4,6 +4,7 @@ from libcpp.memory cimport unique_ptr
from libcpp cimport bool

from cudf._lib.cpp.scalar.scalar cimport scalar
from libc.stdint cimport uintptr_t
Collaborator:
Why is this needed here? It doesn't look to be used in this header.



cdef class Scalar: