Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc Python/Cython optimizations #7686

Merged
merged 36 commits into from
Mar 24, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
935648b
Move validation directly into set_by_label and use a raw dict to stor…
vyasr Mar 19, 2021
806a3ef
Remove all references to OrderedColumnDict.
vyasr Mar 19, 2021
40a7b17
Move validation to separate method and use in both set_by_label and c…
vyasr Mar 19, 2021
a1c576e
Format with black.
vyasr Mar 19, 2021
788d9d6
Expose parameter to make validation optional.
vyasr Mar 19, 2021
6a64285
Coerce constructor input to dict before calling items.
vyasr Mar 19, 2021
e7d0981
Make construction safe.
vyasr Mar 19, 2021
c39932c
Final cleanup and documentation.
vyasr Mar 19, 2021
4ff09fc
Address style issues.
vyasr Mar 19, 2021
9433582
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into f…
shwina Mar 22, 2021
74f2884
Merge remote-tracking branch 'origin/branch-0.19' into feature/optimi…
vyasr Mar 22, 2021
0178127
CA fix
shwina Mar 22, 2021
efea63d
Prioritize numeric columns
shwina Mar 22, 2021
c3b6444
Lazily compute and delete column length on demand.
vyasr Mar 22, 2021
01b2cf5
Remove redundant clear cache in setitem.
vyasr Mar 22, 2021
8899258
Remove mypy annotation for column length.
vyasr Mar 22, 2021
3507785
Merge branch 'feature/optimize_accessor_copy' of github.com:vyasr/cud…
shwina Mar 22, 2021
7f8e1cd
Undo
shwina Mar 22, 2021
f2e4609
Don't validate when copying type metadata
shwina Mar 22, 2021
72598fb
Prioritize numeric dtypes in is_numerical_dtype
shwina Mar 22, 2021
fa220b6
Add unsafe CA ctor
shwina Mar 22, 2021
3760077
Revert "Prioritize numeric dtypes in is_numerical_dtype"
shwina Mar 22, 2021
de9ca28
Change error message back so that tests pass.
vyasr Mar 23, 2021
e35d03b
Faster is_numerical_dtype
shwina Mar 23, 2021
e2fd533
Faster is_numerical_dtype
shwina Mar 23, 2021
64ca702
Even faster is_numerical_dtype
shwina Mar 23, 2021
749edf1
Enable fast path for constructing a Buffer from a DeviceBuffer
shwina Mar 23, 2021
739ec57
Add validation option to insert and standardize error message.
vyasr Mar 23, 2021
498b70e
Fix style.
vyasr Mar 23, 2021
3cd012b
Merge remote-tracking branch 'vyasr/feature/optimize_accessor_copy' i…
shwina Mar 23, 2021
c28866c
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into v…
shwina Mar 23, 2021
01e13fa
Undo formatting change
shwina Mar 23, 2021
89a0301
Add TODO
shwina Mar 23, 2021
5e73de7
init->create + doc
shwina Mar 24, 2021
a4fe7b4
Use dict comprehension instead of building list
shwina Mar 24, 2021
eadcc9c
Use enumeration instead
shwina Mar 24, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 20 additions & 12 deletions python/cudf/cudf/_lib/table.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -99,22 +99,30 @@ cdef class Table:
cdef vector[unique_ptr[column]].iterator it = columns.begin()

# First construct the index, if any
cdef int i

index = None
if index_names is not None:
index_columns = []
for _ in index_names:
index_columns.append(Column.from_unique_ptr(
move(dereference(it))
))
it += 1
index = Table(dict(zip(index_names, index_columns)))
index_data = ColumnAccessor._create_unsafe(
{
name: Column.from_unique_ptr(
move(dereference(it + i))
)
for i, name in enumerate(index_names)
}
)
index = Table(data=index_data)

# Construct the data dict
data_columns = []
for _ in column_names:
data_columns.append(Column.from_unique_ptr(move(dereference(it))))
it += 1
data = dict(zip(column_names, data_columns))
cdef int n_index_columns = len(index_names) if index_names else 0
data = ColumnAccessor._create_unsafe(
{
name: Column.from_unique_ptr(
move(dereference(it + i + n_index_columns))
)
for i, name in enumerate(column_names)
}
)

return Table(data=data, index=index)

Expand Down
4 changes: 4 additions & 0 deletions python/cudf/cudf/core/buffer.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,10 @@ def __init__(
self.ptr = data.ptr
self.size = data.size
self._owner = owner or data._owner
elif isinstance(data, rmm.DeviceBuffer):
self.ptr = data.ptr
self.size = data.size
self._owner = data
elif hasattr(data, "__array_interface__") or hasattr(
data, "__cuda_array_interface__"
):
Expand Down
24 changes: 14 additions & 10 deletions python/cudf/cudf/core/column/column.py
Original file line number Diff line number Diff line change
Expand Up @@ -1017,7 +1017,9 @@ def distinct_count(
return cpp_distinct_count(self, ignore_nulls=dropna)

def astype(self, dtype: Dtype, **kwargs) -> ColumnBase:
if is_categorical_dtype(dtype):
if is_numerical_dtype(dtype):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numerical types being the most common [[citation needed]], and is_numerical_dtype now being quite fast, it makes sense to do this check first.

Copy link
Contributor

@vyasr vyasr Mar 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still definitely pro doing this, but I'm working on prototyping the (amortized) constant-time approach I suggested and I'll update you once that's done. Hopefully that will make ordering concerns here largely moot.

return self.as_numerical_column(dtype)
elif is_categorical_dtype(dtype):
return self.as_categorical_column(dtype, **kwargs)
elif pd.api.types.pandas_dtype(dtype).type in {
np.str_,
Expand Down Expand Up @@ -1548,6 +1550,16 @@ def build_column(
"""
dtype = pd.api.types.pandas_dtype(dtype)

if is_numerical_dtype(dtype):
assert data is not None
return cudf.core.column.NumericalColumn(
data=data,
dtype=dtype,
mask=mask,
size=size,
offset=offset,
null_count=null_count,
)
if is_categorical_dtype(dtype):
if not len(children) == 1:
raise ValueError(
Expand Down Expand Up @@ -1634,15 +1646,7 @@ def build_column(
children=children,
)
else:
assert data is not None
return cudf.core.column.NumericalColumn(
data=data,
dtype=dtype,
mask=mask,
size=size,
offset=offset,
null_count=null_count,
)
raise TypeError(f"Unrecognized dtype: {dtype}")


def build_categorical_column(
Expand Down
23 changes: 17 additions & 6 deletions python/cudf/cudf/core/column_accessor.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,7 @@

import cudf
from cudf.core import column
from cudf.utils.utils import (
cached_property,
to_flat_dict,
to_nested_dict,
)
from cudf.utils.utils import cached_property, to_flat_dict, to_nested_dict

if TYPE_CHECKING:
from cudf.core.column import ColumnBase
Expand Down Expand Up @@ -84,6 +80,21 @@ def __init__(
self.multiindex = multiindex
self._level_names = level_names

@classmethod
def _create_unsafe(
cls,
data: Dict[Any, ColumnBase],
multiindex: bool = False,
level_names=None,
) -> ColumnAccessor:
# create a ColumnAccessor without verifying column
# type or size
obj = cls()
obj._data = data
obj.multiindex = multiindex
obj._level_names = level_names
return obj

def __iter__(self):
return self._data.__iter__()

Expand Down Expand Up @@ -167,7 +178,7 @@ def _column_length(self):
return 0

def _clear_cache(self):
cached_properties = "columns", "names", "_grouped_data"
cached_properties = ("columns", "names", "_grouped_data")
for attr in cached_properties:
try:
self.__delattr__(attr)
Expand Down
4 changes: 3 additions & 1 deletion python/cudf/cudf/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -2408,7 +2408,9 @@ def _copy_type_metadata(
for name, col, other_col in zip(
self._data.keys(), self._data.values(), other._data.values()
):
self._data[name] = other_col._copy_type_metadata(col)
self._data.set_by_label(
name, other_col._copy_type_metadata(col), validate=False
)

if include_index:
if self._index is not None and other._index is not None:
Expand Down
15 changes: 6 additions & 9 deletions python/cudf/cudf/utils/dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -144,16 +144,13 @@ def numeric_normalize_types(*args):


def is_numerical_dtype(obj):
shwina marked this conversation as resolved.
Show resolved Hide resolved
if is_categorical_dtype(obj):
# TODO: we should handle objects with a `.dtype` attribute,
# e.g., arrays, here.
try:
dtype = np.dtype(obj)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone gives us a Pandas nullable integer type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We certainly aren't handling this currently. On branch-0.19

>>> cudf.Series([1, 2, 3], dtype=pd.Int64Dtype())  # TypeError
>>> cudf.utils.dtypes.is_numerical_dtype(pd.Int64Dtype()) # TypeError

I agree we shouldl support this. But how to do so in an efficient way is a difficult question. @vyasr and I were talking about this a couple of days ago, and he has some ideas for how to make dtype introspection faster/cheaper. We can perhaps take on this problem there?

except TypeError:
return False
if is_list_dtype(obj):
return False
return (
np.issubdtype(obj, np.bool_)
or np.issubdtype(obj, np.floating)
or np.issubdtype(obj, np.signedinteger)
or np.issubdtype(obj, np.unsignedinteger)
)
return dtype.kind in "biuf"


def is_string_dtype(obj):
Expand Down