Rewrites sample API (#10262)
This PR rewrites the sample API. Functionally, the API now accepts either a cupy random state or a numpy random state. If a host (numpy) random state is passed, the sampled rows should match the pandas result given the same initial state and operation sequence. If a device (cupy) random state is passed instead, we should expect higher performance when the sample count is large.
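
As a minimal illustration of the host random state behavior described above, the sketch below uses plain numpy only. The helper `sample_positions` is hypothetical and is not part of cudf; it is only meant to show why "same initial state and operation sequence" matters, since a `RandomState` advances every time it is used.

```python
import numpy as np


def sample_positions(random_state, population_size, n):
    """Hypothetical helper: draw n distinct row positions on the host."""
    return random_state.choice(population_size, size=n, replace=False)


# Two states created from the same seed yield the same positions.
first = sample_positions(np.random.RandomState(42), 10, 3)
second = sample_positions(np.random.RandomState(42), 10, 3)
same_start_matches = bool((first == second).all())

# A shared state advances between calls, so the second draw comes from a
# later point in the random stream than the first one.
shared = np.random.RandomState(42)
draw_a = sample_positions(shared, 10, 3)
draw_b = sample_positions(shared, 10, 3)
```

Passing an already-used state therefore only reproduces a pandas result if the same draws were made on it beforehand.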

Syntactically, this PR refactors the existing code into two sub-methods, each handling a different sampling axis. The sub-methods are type annotated.

Sampling from `cudf.Index/cudf.MultiIndex` is deprecated. 

This PR is breaking because:
1. Users who previously called the `sample` API now get different rows.
2. To align with the pandas API, `keep_index` is renamed to `ignore_index` and its semantics are negated.
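
For callers migrating across the rename, the negation can be sketched as a small keyword translation. `translate_sample_kwargs` is a hypothetical helper, not something this PR adds; it only relies on the defaults visible in the diff (`keep_index=True` before, `ignore_index=False` after), which describe the same behavior.

```python
def translate_sample_kwargs(kwargs):
    """Hypothetical helper: map legacy `keep_index` to `ignore_index`."""
    new_kwargs = dict(kwargs)
    if "keep_index" in new_kwargs:
        if "ignore_index" in new_kwargs:
            raise TypeError("Pass only one of `keep_index` and `ignore_index`")
        # The meaning is negated: keeping the index == not ignoring it.
        new_kwargs["ignore_index"] = not new_kwargs.pop("keep_index")
    return new_kwargs


legacy_call = {"n": 3, "keep_index": False}
migrated_call = translate_sample_kwargs(legacy_call)
```

A call that never passed `keep_index` needs no change, because the old and new defaults both preserve the index.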

The current implementation does not depend on `libcudf.copying.sample`, so the Cython bindings are removed.

Performance: at 10K rows, this PR is 39% slower than the current implementation, amounting to about 0.3 ms. At 100M rows, this PR is 7% slower when using a cupy random state.
<details>
<summary>Benchmark Axis=0</summary>

```
-------------------------------------------------------------------------------------- benchmark 'axis=0': 6 tests ---------------------------------------------------------------------------------------
Name (time in ms)                                                Min                   Max                  Mean              StdDev                Median                 IQR            Outliers  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisIndex-CupyRandomState] (afte)        296.7751 (455.90)     299.2855 (401.57)     297.9519 (448.88)     1.1162 (94.15)      297.7824 (451.66)     2.0472 (192.32)        2;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (afte)     4,435.3055 (>1000.0)  4,717.0815 (>1000.0)  4,507.1635 (>1000.0)  119.8772 (>1000.0)  4,452.5009 (>1000.0)  115.2876 (>1000.0)       1;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (befo)       276.1754 (424.26)     276.4792 (370.97)     276.2995 (416.26)     0.1258 (10.61)      276.3024 (419.08)     0.2010 (18.88)         1;0       5
sample_df[size10K-AxisIndex-CupyRandomState] (afte)           1.0789 (1.66)         1.2420 (1.67)         1.1238 (1.69)       0.0683 (5.76)         1.0962 (1.66)       0.0721 (6.77)          1;0       5
sample_df[size10K-AxisIndex-NumpyRandomState] (afte)          0.9018 (1.39)         1.1441 (1.54)         0.9140 (1.38)       0.0182 (1.54)         0.9094 (1.38)       0.0106 (1.0)         11;11     346
sample_df[size10K-AxisIndex-NumpyRandomState] (befo)          0.6510 (1.0)          0.7453 (1.0)          0.6638 (1.0)        0.0119 (1.0)          0.6593 (1.0)        0.0108 (1.01)        76;44     638
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
</details>
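
The percentages quoted for `axis=0` appear to come from the Min column of the table above; this is an inference from the 1.39 ratio shown for the 10K numpy case, not something the description states. Under that assumption, the arithmetic can be checked with a short sketch (`slowdown_percent` is an illustrative name):

```python
def slowdown_percent(before, after):
    """Relative slowdown of `after` versus `before`, in percent."""
    return (after / before - 1.0) * 100.0


# Min timings in ms, copied from the axis=0 benchmark table.
before_10k, after_10k_numpy = 0.6510, 0.9018
before_100m, after_100m_cupy = 276.1754, 296.7751

pct_10k = slowdown_percent(before_10k, after_10k_numpy)
abs_10k_ms = after_10k_numpy - before_10k
pct_100m = slowdown_percent(before_100m, after_100m_cupy)

print(f"10K rows: {pct_10k:.1f}% slower (+{abs_10k_ms:.2f} ms)")
print(f"100M rows: {pct_100m:.1f}% slower")
```

Rounded to whole percentages, these give the 39% and 7% figures quoted in the description.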

For `axis=1` sampling, this PR is faster than the current implementation when a numpy random state is provided for the `random_state` parameter, and slower when a seed is provided instead.
<details>
<summary> Benchmark axis=1 </summary>

```
--------------------------------------------------------------------------------- benchmark 'axis=1': 6 tests ----------------------------------------------------------------------------------
Name (time in us)                                               Min                 Max                Mean             StdDev              Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisColumn-NumpyRandomState] (afte)     173.2660 (1.0)      290.5080 (1.14)     178.2199 (1.0)       8.0913 (1.58)     175.7130 (1.0)      2.0767 (1.73)      227;419    2707
sample_df[size100M-AxisColumn-Seed] (afte)                 441.9110 (2.55)     617.1150 (2.42)     452.4197 (2.54)     14.1272 (2.76)     447.1345 (2.54)     7.9060 (6.59)      151;162    1484
sample_df[size100M-AxisColumn-Seed] (befo)                 297.1560 (1.72)     477.1500 (1.87)     307.8915 (1.73)     17.2036 (3.36)     300.5620 (1.71)     9.4080 (7.85)      159;168    1695
sample_df[size10K-AxisColumn-NumpyRandomState] (afte)      176.6440 (1.02)     254.9110 (1.0)      180.0217 (1.01)      5.1152 (1.0)      178.8940 (1.02)     1.1990 (1.0)       226;405    3542
sample_df[size10K-AxisColumn-Seed] (afte)                  451.6370 (2.61)     689.8120 (2.71)     465.9937 (2.61)     14.3921 (2.81)     463.0710 (2.64)     6.7365 (5.62)        62;91    1183
sample_df[size10K-AxisColumn-Seed] (befo)                  309.4000 (1.79)     413.9080 (1.62)     316.5210 (1.78)      7.6379 (1.49)     315.2130 (1.79)     5.4100 (4.51)        66;42     826
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
</details>

Part of #10153

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10262
isVoid authored Mar 4, 2022
1 parent b5337d7 commit 1e5b01f
Showing 10 changed files with 499 additions and 323 deletions.
26 changes: 0 additions & 26 deletions python/cudf/cudf/_lib/copying.pyx
@@ -656,32 +656,6 @@ def get_element(Column input_column, size_type index):
)


def sample(input, size_type n,
           bool replace, int64_t seed, bool keep_index=True):
    cdef table_view tbl_view = table_view_from_table(input, not keep_index)
    cdef cpp_copying.sample_with_replacement replacement

    if replace:
        replacement = cpp_copying.sample_with_replacement.TRUE
    else:
        replacement = cpp_copying.sample_with_replacement.FALSE

    cdef unique_ptr[table] c_output
    with nogil:
        c_output = move(
            cpp_copying.sample(tbl_view, n, replacement, seed)
        )

    return data_from_unique_ptr(
        move(c_output),
        column_names=input._column_names,
        index_names=(
            None if keep_index is False
            else input._index_names
        )
    )


def segmented_gather(Column source_column, Column gather_map):
    cdef shared_ptr[lists_column_view] source_LCV = (
        make_shared[lists_column_view](source_column.view())
9 changes: 1 addition & 8 deletions python/cudf/cudf/_lib/cpp/copying.pxd
@@ -1,4 +1,4 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2022, NVIDIA CORPORATION.

from libc.stdint cimport int32_t, int64_t, uint8_t
from libcpp cimport bool
@@ -175,10 +175,3 @@ cdef extern from "cudf/copying.hpp" namespace "cudf" nogil:
    ctypedef enum sample_with_replacement:
        FALSE 'cudf::sample_with_replacement::FALSE',
        TRUE 'cudf::sample_with_replacement::TRUE',

    cdef unique_ptr[table] sample (
        table_view input,
        size_type n,
        sample_with_replacement replacement,
        int64_t seed
    ) except +
22 changes: 22 additions & 0 deletions python/cudf/cudf/core/_base_index.py
@@ -3,6 +3,7 @@
from __future__ import annotations

import pickle
import warnings
from functools import cached_property
from typing import Any, Set

@@ -1528,6 +1529,27 @@ def _split_columns_by_levels(self, levels):
[],
)

    def sample(
        self,
        n=None,
        frac=None,
        replace=False,
        weights=None,
        random_state=None,
        axis=None,
        ignore_index=False,
    ):
        warnings.warn(
            "Index.sample is deprecated and will be removed.", FutureWarning,
        )
        return cudf.core.index._index_from_data(
            self.to_frame()
            .sample(
                n, frac, replace, weights, random_state, axis, ignore_index
            )
            ._data
        )


def _get_result_name(left_name, right_name):
    if left_name == right_name:
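
The deprecation added to `Index.sample` above uses the standard `warnings.warn(..., FutureWarning)` pattern. As a sketch of how a caller or test might observe it without cudf installed, the snippet below uses a hypothetical stand-in, `deprecated_sample`, which simply returns the first `n` items rather than a random sample; only the warning mechanics are the point.

```python
import warnings


def deprecated_sample(values, n):
    """Hypothetical stand-in that warns the same way Index.sample does."""
    warnings.warn(
        "Index.sample is deprecated and will be removed.", FutureWarning
    )
    return values[:n]  # placeholder result, not a real random sample


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered
    result = deprecated_sample([10, 20, 30], 2)

future_warnings = [
    w for w in caught if issubclass(w.category, FutureWarning)
]
```

In a pytest suite, `pytest.warns(FutureWarning)` expresses the same check more compactly.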
27 changes: 27 additions & 0 deletions python/cudf/cudf/core/dataframe.py
@@ -36,6 +36,7 @@
import cudf
import cudf.core.common
from cudf import _lib as libcudf
from cudf._typing import ColumnLike
from cudf.api.types import (
    _is_scalar_or_zero_d_array,
    is_bool_dtype,
@@ -6322,6 +6323,32 @@ def nunique(self, axis=0, dropna=True):

        return cudf.Series(super().nunique(method="sort", dropna=dropna))

    def _sample_axis_1(
        self,
        n: int,
        weights: Optional[ColumnLike],
        replace: bool,
        random_state: np.random.RandomState,
        ignore_index: bool,
    ):
        if replace:
            # Since cuDF does not support multiple columns with same name,
            # sample with replace=True at axis 1 is unsupported.
            raise NotImplementedError(
                "Sample is not supported for axis 1/`columns` when"
                "`replace=True`."
            )

        sampled_column_labels = random_state.choice(
            self._column_names, size=n, replace=False, p=weights
        )

        result = self._get_columns_by_label(sampled_column_labels)
        if ignore_index:
            result.reset_index(drop=True)

        return result


def from_dataframe(df, allow_copy=False):
    return df_protocol.from_dataframe(df, allow_copy=allow_copy)
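
The core of `_sample_axis_1` above is a call to `RandomState.choice` over the column labels with `replace=False`. The sketch below isolates that step with plain numpy. `sample_column_labels` is a hypothetical helper, and normalizing the weights inside it is an assumption borrowed from the removed `frame.py` code; where the new implementation normalizes weights is not visible in this excerpt.

```python
import numpy as np


def sample_column_labels(column_names, n, weights=None, random_state=None):
    """Hypothetical helper: pick n distinct column labels on the host."""
    rs = random_state if random_state is not None else np.random.RandomState()
    if weights is not None:
        # choice() requires probabilities that sum to 1.
        weights = np.asarray(weights, dtype="float64")
        weights = weights / weights.sum()
    labels = list(column_names)
    return rs.choice(labels, size=n, replace=False, p=weights).tolist()


names = ["a", "b", "c", "d"]
# Zero weights exclude "c" and "d", so only "a" and "b" can be drawn.
picked_weighted = sample_column_labels(
    names, 2, weights=[1, 1, 0, 0], random_state=np.random.RandomState(0)
)
picked_uniform = sample_column_labels(
    names, 2, random_state=np.random.RandomState(0)
)
```

Because labels are drawn without replacement, the result never contains a duplicate column name, which is consistent with the `NotImplementedError` raised for `replace=True`.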
195 changes: 1 addition & 194 deletions python/cudf/cudf/core/frame.py
@@ -50,7 +50,7 @@
from cudf.core.window import Rolling
from cudf.utils import ioutils
from cudf.utils.docutils import copy_docstring
-from cudf.utils.dtypes import find_common_type, is_column_like
+from cudf.utils.dtypes import find_common_type

T = TypeVar("T", bound="Frame")

@@ -1659,199 +1659,6 @@ def shift(self, periods=1, freq=None, axis=0, fill_value=None):
zip(self._column_names, data_columns), self._index
)

    @annotate("FRAME_SAMPLE", color="orange", domain="cudf_python")
    def sample(
        self,
        n=None,
        frac=None,
        replace=False,
        weights=None,
        random_state=None,
        axis=None,
        keep_index=True,
    ):
        """Return a random sample of items from an axis of object.

        You can use random_state for reproducibility.

        Parameters
        ----------
        n : int, optional
            Number of items from axis to return. Cannot be used with frac.
            Default = 1 if frac = None.
        frac : float, optional
            Fraction of axis items to return. Cannot be used with n.
        replace : bool, default False
            Allow or disallow sampling of the same row more than once.
            replace == True is not yet supported for axis = 1/"columns"
        weights : str or ndarray-like, optional
            Only supported for axis=1/"columns"
        random_state : int, numpy RandomState or None, default None
            Seed for the random number generator (if int), or None.
            If None, a random seed will be chosen.
            if RandomState, seed will be extracted from current state.
        axis : {0 or ‘index’, 1 or ‘columns’, None}, default None
            Axis to sample. Accepts axis number or name.
            Default is stat axis for given data type
            (0 for Series and DataFrames). Series and Index doesn't
            support axis=1.

        Returns
        -------
        Series or DataFrame or Index
            A new object of same type as caller containing n items
            randomly sampled from the caller object.

        Examples
        --------
        >>> import cudf as cudf
        >>> df = cudf.DataFrame({"a":{1, 2, 3, 4, 5}})
        >>> df.sample(3)
           a
        1  2
        3  4
        0  1
        >>> sr = cudf.Series([1, 2, 3, 4, 5])
        >>> sr.sample(10, replace=True)
        1    4
        3    1
        2    4
        0    5
        0    1
        4    5
        4    1
        0    2
        0    3
        3    2
        dtype: int64
        >>> df = cudf.DataFrame(
        ... {"a":[1, 2], "b":[2, 3], "c":[3, 4], "d":[4, 5]})
        >>> df.sample(2, axis=1)
           a  c
        0  1  3
        1  2  4
        """

        if frac is not None and frac > 1 and not replace:
            raise ValueError(
                "Replace has to be set to `True` "
                "when upsampling the population `frac` > 1."
            )
        elif frac is not None and n is not None:
            raise ValueError(
                "Please enter a value for `frac` OR `n`, not both"
            )

        if frac is None and n is None:
            n = 1
        elif frac is not None:
            if axis is None or axis == 0 or axis == "index":
                n = int(round(self.shape[0] * frac))
            else:
                n = int(round(self.shape[1] * frac))

        if axis is None or axis == 0 or axis == "index":
            if n > 0 and self.shape[0] == 0:
                raise ValueError(
                    "Cannot take a sample larger than 0 when axis is empty"
                )

            if not replace and n > self.shape[0]:
                raise ValueError(
                    "Cannot take a larger sample than population "
                    "when 'replace=False'"
                )

            if weights is not None:
                raise NotImplementedError(
                    "weights is not yet supported for axis=0/index"
                )

            if random_state is None:
                seed = np.random.randint(
                    np.iinfo(np.int64).max, dtype=np.int64
                )
            elif isinstance(random_state, np.random.mtrand.RandomState):
                _, keys, pos, _, _ = random_state.get_state()
                seed = 0 if pos >= len(keys) else pos
            else:
                seed = np.int64(random_state)

            result = self.__class__._from_data(
                *libcudf.copying.sample(
                    self,
                    n=n,
                    replace=replace,
                    seed=seed,
                    keep_index=keep_index,
                )
            )
            result._copy_type_metadata(self)

            return result
        else:
            if len(self.shape) != 2:
                raise ValueError(
                    f"No axis named {axis} for "
                    f"object type {self.__class__}"
                )

            if replace:
                raise NotImplementedError(
                    "Sample is not supported for "
                    f"axis {axis} when 'replace=True'"
                )

            if n > 0 and self.shape[1] == 0:
                raise ValueError(
                    "Cannot take a sample larger than 0 when axis is empty"
                )

            columns = np.asarray(self._data.names)
            if not replace and n > columns.size:
                raise ValueError(
                    "Cannot take a larger sample "
                    "than population when 'replace=False'"
                )

            if weights is not None:
                if is_column_like(weights):
                    weights = np.asarray(weights)
                else:
                    raise ValueError(
                        "Strings can only be passed to weights "
                        "when sampling from rows on a DataFrame"
                    )

                if columns.size != len(weights):
                    raise ValueError(
                        "Weights and axis to be sampled must be of same length"
                    )

                total_weight = weights.sum()
                if total_weight != 1:
                    if not isinstance(weights.dtype, float):
                        weights = weights.astype("float64")
                    weights = weights / total_weight

            np.random.seed(random_state)
            gather_map = np.random.choice(
                columns, size=n, replace=replace, p=weights
            )

            if isinstance(self, cudf.MultiIndex):
                # TODO: Need to update this once MultiIndex is refactored,
                # should be able to treat it similar to other Frame object
                result = cudf.Index(self.to_frame(index=False)[gather_map])
            else:
                result = self[gather_map]
                if not keep_index:
                    result.index = None

            return result

    @classmethod
    @annotate("FRAME_FROM_ARROW", color="orange", domain="cudf_python")
    def from_arrow(cls, data):
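
The `n`/`frac` validation in the removed `Frame.sample` above can be restated as a standalone function, which makes the rules easier to see in isolation. `resolve_sample_size` is a hypothetical name used only for this sketch; whether the rewritten implementation keeps exactly these checks is not shown in this excerpt.

```python
def resolve_sample_size(axis_length, n=None, frac=None, replace=False):
    """Hypothetical helper mirroring the n/frac checks of the removed code."""
    if frac is not None and frac > 1 and not replace:
        raise ValueError(
            "Replace has to be set to `True` "
            "when upsampling the population `frac` > 1."
        )
    if frac is not None and n is not None:
        raise ValueError("Please enter a value for `frac` OR `n`, not both")

    if frac is None and n is None:
        n = 1  # default sample size
    elif frac is not None:
        n = int(round(axis_length * frac))

    if n > 0 and axis_length == 0:
        raise ValueError(
            "Cannot take a sample larger than 0 when axis is empty"
        )
    if not replace and n > axis_length:
        raise ValueError(
            "Cannot take a larger sample than population "
            "when 'replace=False'"
        )
    return n


default_size = resolve_sample_size(10)
half_size = resolve_sample_size(10, frac=0.5)
upsampled_size = resolve_sample_size(10, frac=1.5, replace=True)
```

Note that `round` in Python rounds halves to the nearest even integer, so a `frac` that lands exactly on `x.5` rows may round down.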