Adding cudf.cut method #8002

Merged
merged 70 commits into from
Jun 11, 2021
Changes from 20 commits
Commits
fd6fb9c
interval dtype and tests
marlenezw Dec 11, 2020
bdce72c
fixing merge conflicts
marlenezw Apr 20, 2021
f4c8329
adding updates from branch-20
marlenezw Apr 20, 2021
7cfd192
removing faulty merge.
marlenezw Apr 20, 2021
800e134
more merge conflict fixes.
marlenezw Apr 20, 2021
43e74d1
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw Apr 21, 2021
aae0d16
changes that allow us to return catindex.
marlenezw Apr 23, 2021
9709562
updating branch.
marlenezw Apr 23, 2021
5733daf
final changes and tests.
marlenezw Apr 27, 2021
044f54e
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw Apr 27, 2021
d926f83
updated changes and removing old code.
marlenezw Apr 27, 2021
0892a36
removing unnecessary changes.
marlenezw Apr 27, 2021
5b4936f
changing closed parameters to fix failing tests.
marlenezw Apr 27, 2021
b5a8982
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw Apr 27, 2021
534bf73
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw Apr 28, 2021
93ee3bd
more tests.
marlenezw May 5, 2021
0a94e42
resolving merge conflicts
marlenezw May 5, 2021
6a8bcfa
changes for series input.
marlenezw May 5, 2021
ab54eee
adding changes for parameters retbins,labels, and precision.Also allo…
marlenezw May 6, 2021
2ded2e5
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw May 6, 2021
2a6ea6b
removing breakpoint that was causing failures.
marlenezw May 7, 2021
e1b34b7
handling for bins that are interval index and or a sequence of scalar…
marlenezw May 10, 2021
58cf825
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw May 10, 2021
33f8d0f
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw May 11, 2021
548d71c
adding changes to give correct output with series and one more test.
marlenezw May 11, 2021
1ee4f1a
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw May 11, 2021
257b4d5
adding handling for the case where we have a series, duplicates dropp…
marlenezw May 12, 2021
10d4316
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw May 12, 2021
eb4f3cb
changing x min and max into scalars to avoid using cupy.
marlenezw May 14, 2021
b20da61
removing breakpoint
marlenezw May 14, 2021
4b0a837
Merge branch 'branch-0.20' of https://github.com/rapidsai/cudf into c…
marlenezw May 14, 2021
30ecf09
fixing some style issues.
marlenezw May 14, 2021
d5d8dc6
Update python/cudf/cudf/tests/test_cut.py
marlenezw May 19, 2021
809bc36
Update python/cudf/cudf/tests/test_cut.py
marlenezw May 19, 2021
00a819f
Update python/cudf/cudf/core/cut.py
marlenezw May 19, 2021
1011cb1
Update python/cudf/cudf/core/cut.py
marlenezw May 19, 2021
8b05f16
Update python/cudf/cudf/core/column/categorical.py
marlenezw May 19, 2021
f32613f
adding base_mask to col to get correct null later.
marlenezw May 19, 2021
0e7f270
resolve merge conflicts.
marlenezw May 19, 2021
2d73337
more changes to tests.
marlenezw May 24, 2021
ad66b38
Merge branch 'branch-21.06' of https://github.com/rapidsai/cudf into …
marlenezw May 24, 2021
17ca933
Merge branch 'branch-21.06' of https://github.com/rapidsai/cudf into …
marlenezw May 26, 2021
c78a0cd
updating tests.
marlenezw May 26, 2021
646f6a8
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
marlenezw Jun 2, 2021
4d30454
style changes.
marlenezw Jun 2, 2021
b19edc3
fixning mypy style issue.
marlenezw Jun 2, 2021
f1a4e43
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
marlenezw Jun 4, 2021
8dc5070
fixing error that assumes all categories have a dtype.
marlenezw Jun 4, 2021
62f5739
Update python/cudf/cudf/core/cut.py
marlenezw Jun 4, 2021
9aeafda
Update python/cudf/cudf/core/cut.py
marlenezw Jun 4, 2021
ad85c37
using cupy instead of sequence for calcualting bin value
marlenezw Jun 7, 2021
13dbe21
using cupy.linespace instead of sequence.
marlenezw Jun 8, 2021
c413cac
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
marlenezw Jun 8, 2021
8bb81fc
style fixes.
marlenezw Jun 8, 2021
e4c7ae5
fixing merge conflicts.
marlenezw Jun 8, 2021
1fa45f1
changing to numpy.
marlenezw Jun 8, 2021
ec4d5b8
keeping bins computations on the host.
marlenezw Jun 8, 2021
d3bb368
style changes.
marlenezw Jun 8, 2021
c8d8ffd
Update python/cudf/cudf/core/cut.py
marlenezw Jun 9, 2021
588d44a
Update python/cudf/cudf/core/dtypes.py
marlenezw Jun 9, 2021
4acb1a7
Update python/cudf/cudf/tests/test_cut.py
marlenezw Jun 9, 2021
e667f64
Update python/cudf/cudf/core/cut.py
marlenezw Jun 9, 2021
57486cd
adding test for raise exception and removing stale code.
marlenezw Jun 9, 2021
d3ffa19
Update python/cudf/cudf/core/cut.py
marlenezw Jun 10, 2021
32a9255
Update python/cudf/cudf/core/cut.py
marlenezw Jun 10, 2021
ac20cb0
Update python/cudf/cudf/core/cut.py
marlenezw Jun 10, 2021
9c924be
removing stale code after switching to host.
marlenezw Jun 10, 2021
a038312
Merge branch 'cut-pr' of https://github.com/marlenezw/cudf into cut-pr
marlenezw Jun 10, 2021
8f9264d
style changes and updates from reviews.
marlenezw Jun 10, 2021
002752a
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
marlenezw Jun 10, 2021
1 change: 1 addition & 0 deletions python/cudf/cudf/__init__.py
@@ -40,6 +40,7 @@
     UInt64Index,
     from_pandas,
     merge,
+    cut,
 )
 from cudf.core.algorithms import factorize
 from cudf.core.dtypes import (
1 change: 1 addition & 0 deletions python/cudf/cudf/core/__init__.py
@@ -26,3 +26,4 @@
 from cudf.core.multiindex import MultiIndex
 from cudf.core.scalar import NA, Scalar
 from cudf.core.series import Series
+from cudf.core.cut import cut
9 changes: 8 additions & 1 deletion python/cudf/cudf/core/column/categorical.py
@@ -34,6 +34,8 @@
     is_mixed_with_object_dtype,
     min_signed_type,
     min_unsigned_type,
+    is_interval_dtype,
+    is_struct_dtype,
 )

 if TYPE_CHECKING:
@@ -1091,7 +1093,12 @@ def to_pandas(self, index: pd.Index = None, **kwargs) -> pd.Series:

         signed_dtype = min_signed_type(len(col.categories))
         codes = col.cat().codes.astype(signed_dtype).fillna(-1).to_array()
-        categories = col.categories.dropna(drop_nan=True).to_pandas()
+        if is_interval_dtype(col.categories.dtype) or is_struct_dtype(
+            col.categories.dtype
+        ):
+            categories = col.categories.to_pandas()
+        else:
+            categories = col.categories.dropna(drop_nan=True).to_pandas()
         data = pd.Categorical.from_codes(
             codes, categories=categories, ordered=col.ordered
         )
12 changes: 9 additions & 3 deletions python/cudf/cudf/core/column/interval.py
@@ -74,7 +74,9 @@ def to_arrow(self):
     def from_struct_column(self, closed="right"):
         return IntervalColumn(
             size=self.size,
-            dtype=IntervalDtype(self.dtype.fields["left"], closed),
+            dtype=IntervalDtype(
+                self.dtype.fields["left" if "left" in self.dtype.fields else "0"], closed
+            ),
             mask=self.base_mask,
             offset=self.offset,
             null_count=self.null_count,
@@ -87,7 +89,9 @@ def copy(self, deep=True):
         struct_copy = super().copy(deep=deep)
         return IntervalColumn(
             size=struct_copy.size,
-            dtype=IntervalDtype(struct_copy.dtype.fields["left"], closed),
+            dtype=IntervalDtype(
+                struct_copy.dtype.fields["left" if "left" in struct_copy.dtype.fields else "0"], closed
+            ),
             mask=struct_copy.base_mask,
             offset=struct_copy.offset,
             null_count=struct_copy.null_count,
@@ -100,7 +104,9 @@ def as_interval_column(self, dtype, **kwargs):
         # a user can directly input the string `interval` as the dtype
         # when creating an interval series or interval dataframe
         if dtype == "interval":
-            dtype = IntervalDtype(self.dtype.fields["left"], self.closed)
+            dtype = IntervalDtype(
+                self.dtype.fields["left" if "left" in self.dtype.fields else "0"], self.closed
+            )
         return IntervalColumn(
             size=self.size,
             dtype=dtype,
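A note on the field lookup in the three interval.py hunks above: a fallback from "left" to "0" only takes effect when the condition tests whether the key exists, because a bare non-empty string is always truthy. The following is a minimal sketch in plain Python, assuming the struct dtype's fields behave like a dict; `left_field_name` and both sample mappings are hypothetical and not part of cudf.

```python
def left_field_name(fields):
    """Pick the key that holds the left-edge dtype (hypothetical helper)."""
    # `"left" if "left" else "0"` always yields "left", since a non-empty
    # string is truthy; testing membership in the mapping gives a real fallback
    return "left" if "left" in fields else "0"


# assumed shapes of the field mapping: named fields vs. positional names
named_fields = {"left": "int64", "right": "int64"}
positional_fields = {"0": "int64", "1": "int64"}

print(left_field_name(named_fields))       # left
print(left_field_name(positional_fields))  # 0
```

With the membership test, a struct whose fields are named "0" and "1" no longer raises a KeyError on the "left" lookup.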
197 changes: 197 additions & 0 deletions python/cudf/cudf/core/cut.py
@@ -0,0 +1,197 @@
+from cudf._lib.labeling import label_bins
+from cudf.core.column import as_column
+from cudf.core.column import build_categorical_column
+from cudf.core.index import IntervalIndex
+from cudf.utils.dtypes import is_list_like
+import cupy
+import cudf
+
+
+def cut(
+    x,
+    bins,
+    right: bool = True,
+    labels=None,
+    retbins: bool = False,
+    precision: int = 3,
+    include_lowest: bool = False,
+    duplicates: str = "raise",
+    ordered: bool = True,
+):
+
+    """
+    Bin values into discrete intervals.
+    Use cut when you need to segment and sort data values into bins. This
+    function is also useful for going from a continuous variable to a
+    categorical variable.
+    Parameters
+    ----------
+    x : array-like
+        The input array to be binned. Must be 1-dimensional.
+    bins : int, sequence of scalars, or IntervalIndex
+        The criteria to bin by.
+        * int : Defines the number of equal-width bins in the
+        range of x. The range of x is extended by .1% on each
+        side to include the minimum and maximum values of x.
+    right : bool, default True
+        Indicates whether bins include the rightmost edge or not.
+    labels : array or False, default None
+        Specifies the labels for the returned bins. Must be the same
+        length as the resulting bins. If False, returns only integer
+        indicators of the bins. If True, raises an error. When ordered=False,
+        labels must be provided.
+    retbins : bool, default False
+        Whether to return the bins or not.
+    precision : int, default 3
+        The precision at which to store and display the bin labels.
+    include_lowest : bool, default False
+        Whether the first interval should be left-inclusive or not.
+    duplicates : {default 'raise', 'drop'}, optional
+        If bin edges are not unique, raise ValueError or drop non-uniques.
+    ordered : bool, default True
+        Whether the labels are ordered or not. Applies to returned types
+        Categorical and Series (with Categorical dtype). If True,
+        the resulting categorical will be ordered. If False, the resulting
+        categorical will be unordered (labels must be provided).
+    Returns
+    -------
+    out : CategoricalIndex
+        An array-like object representing the respective bin for each value
+        of x. The type depends on the value of labels.
+    bins : numpy.ndarray or IntervalIndex.
+        The computed or specified bins. Only returned when retbins=True.
+        For scalar or sequence bins, this is an ndarray with the computed
+        bins. If duplicates='drop', non-unique bin edges are dropped. For
+        an IntervalIndex bins, this is equal to bins.
+    Examples
+    --------
+    Discretize into three equal-sized bins.
+    >>> cudf.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
+    CategoricalIndex([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0],
+    ... (5.0, 7.0],(0.994, 3.0]], categories=[(0.994, 3.0],
+    ... (3.0, 5.0], (5.0, 7.0]], ordered=True, dtype='category')
+    >>> cudf.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
+    (CategoricalIndex([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0],
+    ... (5.0, 7.0],(0.994, 3.0]],categories=[(0.994, 3.0],
+    ... (3.0, 5.0], (5.0, 7.0]],ordered=True, dtype='category'),
+    array([0.994, 3. , 5. , 7. ]))
+    >>> cudf.cut(np.array([1, 7, 5, 4, 6, 3]),
+    ... 3, labels=["bad", "medium", "good"])
+    CategoricalIndex(['bad', 'good', 'medium', 'medium', 'good', 'bad'],
+    ... categories=['bad', 'medium', 'good'],ordered=True,
+    ... dtype='category')
+    >>> cudf.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
+    ... labels=["B", "A", "B"], ordered=False)
+    CategoricalIndex(['B', 'B', 'A', 'A', 'B', 'B'], categories=['A', 'B'],
+    ... ordered=False, dtype='category')
+    >>> cudf.cut([0, 1, 1, 2], bins=4, labels=False)
+    array([0, 1, 1, 3], dtype=int32)
+    Passing a Series as an input returns a Series with categorical dtype:
+    >>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
+    ... index=['a', 'b', 'c', 'd', 'e'])
+    >>> pd.cut(s, 3)
+    """
+    left_inclusive = False
+    right_inclusive = True
+
+    if not ordered and labels is None:
+        raise ValueError("'labels' must be provided if 'ordered = False'")
+
+    if duplicates not in ["raise", "drop"]:
+        raise ValueError(
+            "invalid value for 'duplicates' parameter, valid options are: "
+            "raise, drop"
+        )
+
+    # input_arr is a column holding the values of the array x
+    input_arr = as_column(x)
+
+    # create the bins
+    x = cupy.asarray(x)
+    rng = (x.min(), x.max())
+    mn, mx = [mi + 0.0 for mi in rng]
+    bins = cupy.linspace(mn, mx, bins + 1, endpoint=True)
+
+    # extend the range of x by 0.1% on each side to include
+    # the minimum and maximum values of x.
+    adj = (mx - mn) * 0.001
+    if right:
+        bins[0] -= adj
+    else:
+        bins[-1] += adj
+
+    if right and include_lowest:
+        bins[0] = bins[0] - 10 ** (-precision)
+
+    # adjust bin edges precision
+    bins = cupy.around(bins, precision)
+
+    # checking for the correct inclusivity values
+    if right:
+        closed = "right"
+    else:
+        closed = "left"
+        left_inclusive = True
+
+    if labels is None:
+        # get labels for categories
+        interval_labels = IntervalIndex.from_breaks(bins, closed=closed)
+    elif labels is not False:
+        if not (is_list_like(labels)):
+            raise ValueError(
+                "Bin labels must either be False, None or passed in as a "
+                "list-like argument"
+            )
+        if ordered and len(set(labels)) != len(labels):
+            raise ValueError(
+                "labels must be unique if ordered=True; pass ordered=False for"
+                " duplicate labels"
+            )
+        else:
+            if len(labels) != len(bins) - 1:
+                raise ValueError(
+                    "Bin labels must be one fewer than the number of bin edges"
+                )
+            if not ordered and len(set(labels)) != len(labels):
+                interval_labels = cudf.CategoricalIndex(
+                    labels, categories=None, ordered=False
+                )
+            else:
+                interval_labels = (
+                    labels if len(set(labels)) == len(labels) else None
+                )
+
+    # get the left and right edges of the bins as columns
+    left_edges = as_column(bins[:-1:])
+    right_edges = as_column(bins[+1::])
+    # the input arr must be changed to the same type as the edges
+    input_arr = input_arr.astype(left_edges._dtype)
+    # get the indexes for the appropriate number
+    index_labels = label_bins(
+        input_arr, left_edges, left_inclusive, right_edges, right_inclusive
+    )
+    if index_labels.base_mask:
+        index_labels._base_mask = None
+
+    if labels is False:
+        # if labels is false we return the bin indexes
+        indx_arr = index_labels.values
+        return indx_arr
+
+    if labels is not None:
+        if not ordered and len(set(labels)) != len(labels):
+            # when we have duplicate labels and ordered is False, we
+            # should allow duplicate categories
+            new_data = [interval_labels[i][0] for i in index_labels.values]
+            return cudf.CategoricalIndex(
+                new_data, categories=sorted(set(labels)), ordered=False
+            )
+    col = build_categorical_column(
+        categories=interval_labels, codes=index_labels, ordered=ordered
+    )
+    categorical_index = cudf.core.index.as_index(col)
+    if retbins:
+        # if retbins is true we return the bins as well
+        return categorical_index, bins
+    else:
+        return categorical_index
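The bin-edge computation in cut.py can be sketched without a GPU. This is an illustrative plain-Python version, not the cudf implementation: `equal_width_edges` is a hypothetical name, Python lists stand in for cupy arrays, and the `include_lowest` adjustment is left out.

```python
def equal_width_edges(values, n_bins, right=True, precision=3):
    """Return n_bins + 1 rounded bin edges spanning the range of values."""
    mn, mx = float(min(values)), float(max(values))
    # n_bins + 1 evenly spaced edges from mn to mx, both ends included
    edges = [mn + (mx - mn) * i / n_bins for i in range(n_bins + 1)]
    # widen the open side by 0.1% of the range so that the extreme value
    # on that side still falls inside a bin
    adj = (mx - mn) * 0.001
    if right:
        edges[0] -= adj
    else:
        edges[-1] += adj
    # round the edges to the requested precision
    return [round(e, precision) for e in edges]


print(equal_width_edges([1, 7, 5, 4, 6, 3], 3))
# [0.994, 3.0, 5.0, 7.0]
print(equal_width_edges([1, 7, 5, 4, 6, 3], 3, right=False))
# [1.0, 3.0, 5.0, 7.006]
```

The first call reproduces the edges shown in the docstring example for three bins. Only the open side of the outermost bin is moved, which matches the if/else on `right` in the function body.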
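The labelling step, which cut.py delegates to label_bins, can be approximated on the host with the standard bisect module. This is a sketch under assumptions: `bin_index` is a hypothetical helper, None stands in for a null entry, and the edges are taken from the docstring example.

```python
from bisect import bisect_left, bisect_right


def bin_index(value, edges, right=True):
    """Return the index of the bin holding value, or None if it is outside."""
    if right:
        # bins are (left, right]: a value equal to an edge belongs
        # to the bin on its left
        idx = bisect_left(edges, value) - 1
    else:
        # bins are [left, right): a value equal to an edge belongs
        # to the bin on its right
        idx = bisect_right(edges, value) - 1
    return idx if 0 <= idx < len(edges) - 1 else None


edges = [0.994, 3.0, 5.0, 7.0]
values = [1, 7, 5, 4, 6, 3]

# integer codes, comparable to what labels=False returns
codes = [bin_index(v, edges) for v in values]
print(codes)  # [0, 2, 1, 1, 2, 0]

# unique labels: each code selects one label
labels = ["bad", "medium", "good"]
named = [labels[c] if c is not None else None for c in codes]
print(named)  # ['bad', 'good', 'medium', 'medium', 'good', 'bad']

# duplicate labels (ordered=False): categories are the distinct labels
dup_labels = ["B", "A", "B"]
unordered = [dup_labels[c] if c is not None else None for c in codes]
categories = sorted(set(dup_labels))
print(unordered, categories)  # ['B', 'B', 'A', 'A', 'B', 'B'] ['A', 'B']
```

The printed label sequences agree with the `labels=[...]` examples in the docstring. A value that sits exactly on the outermost open edge gets None, which is why the edge computation widens the range slightly before labelling.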
7 changes: 5 additions & 2 deletions python/cudf/cudf/core/index.py
@@ -21,6 +21,7 @@
     ColumnBase,
     DatetimeColumn,
     IntervalColumn,
+    StructColumn,
     NumericalColumn,
     StringColumn,
     TimeDeltaColumn,
@@ -536,7 +537,6 @@ def to_frame(self, index=True, name=None):
             col_name = 0
         else:
             col_name = self.name
-
         return cudf.DataFrame(
             {col_name: self._values}, index=self if index else None
         )
@@ -1136,7 +1136,6 @@ def to_series(self, index=None, name=None):
         Series
             The dtype will be based on the type of the Index values.
         """
-
         return cudf.Series(
             self._values,
             index=self.copy(deep=False) if index is None else index,
@@ -2926,6 +2925,10 @@ def as_index(arbitrary, **kwargs) -> Index:
         return TimedeltaIndex(arbitrary, **kwargs)
     elif isinstance(arbitrary, CategoricalColumn):
         return CategoricalIndex(arbitrary, **kwargs)
+    elif isinstance(arbitrary, IntervalColumn):
+        return IntervalIndex(arbitrary, **kwargs)
+    elif isinstance(arbitrary, StructColumn):
+        return IntervalIndex(arbitrary, **kwargs)
     elif isinstance(arbitrary, cudf.Series):
         return as_index(arbitrary._column, **kwargs)
     elif isinstance(arbitrary, pd.RangeIndex):
2 changes: 1 addition & 1 deletion python/cudf/cudf/core/series.py
@@ -1217,6 +1217,7 @@ def __str__(self):
         return self.to_string()

     def __repr__(self):
+        breakpoint()
         _, height = get_terminal_size()
         max_rows = (
             height
@@ -1231,7 +1232,6 @@ def __repr__(self):
             preprocess = cudf.concat([top, bottom])
         else:
             preprocess = self.copy()
-
         preprocess.index = preprocess.index._clean_nulls_from_index()
         if (
             preprocess.nullable