Rewrites sample API (#10262)

This PR rewrites the `sample` API. Functionally, the API now accepts either a cupy random state or a numpy random state. If a host (numpy) random state is passed, the sampled rows should match the pandas result given the same initial state and sequence of operations. If a device (cupy) random state is passed instead, we should expect higher performance when the sample count is large.

Syntactically, this PR refactors the existing code into two sub-methods, one for each axis to sample from. The sub-methods are type annotated.

Sampling from `cudf.Index`/`cudf.MultiIndex` is deprecated.

This PR is breaking because:

1. Users who previously called the `sample` API now get different rows.
2. To align with the pandas API, `keep_index` is renamed to `ignore_index` and its semantics are negated.

The current implementation does not depend on `libcudf.copying.sample`, so the Cython bindings are removed.

Performance: at 10K rows, this PR is 39% slower than the current implementation, amounting to 0.3 ms. At 100M rows, this PR is 7% slower when using a cupy random state.
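The slowdown figures quoted above appear consistent with the Min column of the axis=0 benchmark (NumpyRandomState at 10K rows, CupyRandomState at 100M rows). That mapping is an inference from the numbers, not something the PR states. A quick check of the arithmetic, with the helper name `relative_slowdown` chosen only for illustration:

```python
def relative_slowdown(after_ms: float, before_ms: float) -> float:
    """Return how much slower `after_ms` is than `before_ms`, in percent."""
    return (after_ms - before_ms) / before_ms * 100.0


# Min timings in milliseconds, copied from the axis=0 benchmark table.
# Assumption: these are the rows the quoted percentages refer to.
BEFORE_10K = 0.6510          # size10K, NumpyRandomState, before
AFTER_10K_NUMPY = 0.9018     # size10K, NumpyRandomState, after
BEFORE_100M = 276.1754       # size100M, NumpyRandomState, before
AFTER_100M_CUPY = 296.7751   # size100M, CupyRandomState, after

slowdown_10k = relative_slowdown(AFTER_10K_NUMPY, BEFORE_10K)
extra_10k_ms = AFTER_10K_NUMPY - BEFORE_10K
slowdown_100m = relative_slowdown(AFTER_100M_CUPY, BEFORE_100M)

print(f"10K rows:  {slowdown_10k:.1f}% slower (+{extra_10k_ms:.2f} ms)")
print(f"100M rows: {slowdown_100m:.1f}% slower")
```
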
<details>
<summary>Benchmark Axis=0</summary>

```
------------------------------------------------------------------------------------- benchmark 'axis=0': 6 tests --------------------------------------------------------------------------------------
Name (time in ms)                                                       Min                   Max                  Mean              StdDev                Median                 IQR  Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisIndex-CupyRandomState] (afte)      296.7751 (455.90)     299.2855 (401.57)     297.9519 (448.88)      1.1162 (94.15)     297.7824 (451.66)     2.0472 (192.32)       2;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (afte)  4,435.3055 (>1000.0)  4,717.0815 (>1000.0)  4,507.1635 (>1000.0)  119.8772 (>1000.0)  4,452.5009 (>1000.0)  115.2876 (>1000.0)       1;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (befo)     276.1754 (424.26)     276.4792 (370.97)     276.2995 (416.26)      0.1258 (10.61)     276.3024 (419.08)      0.2010 (18.88)       1;0       5
sample_df[size10K-AxisIndex-CupyRandomState] (afte)           1.0789 (1.66)         1.2420 (1.67)         1.1238 (1.69)       0.0683 (5.76)         1.0962 (1.66)       0.0721 (6.77)       1;0       5
sample_df[size10K-AxisIndex-NumpyRandomState] (afte)          0.9018 (1.39)         1.1441 (1.54)         0.9140 (1.38)       0.0182 (1.54)         0.9094 (1.38)        0.0106 (1.0)     11;11     346
sample_df[size10K-AxisIndex-NumpyRandomState] (befo)           0.6510 (1.0)          0.7453 (1.0)          0.6638 (1.0)        0.0119 (1.0)          0.6593 (1.0)       0.0108 (1.01)     76;44     638
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

For `axis=1` sampling, this PR is faster than the current implementation when a numpy random state is passed as the `random_state` parameter, but slower when a seed is passed instead.
<details>
<summary>Benchmark axis=1</summary>

```
---------------------------------------------------------------------- benchmark 'axis=1': 6 tests ------------------------------------------------------------------------
Name (time in us)                                                   Min              Max             Mean          StdDev           Median            IQR  Outliers  Rounds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisColumn-NumpyRandomState] (afte)   173.2660 (1.0)  290.5080 (1.14)   178.2199 (1.0)   8.0913 (1.58)   175.7130 (1.0)  2.0767 (1.73)   227;419    2707
sample_df[size100M-AxisColumn-Seed] (afte)              441.9110 (2.55)  617.1150 (2.42)  452.4197 (2.54)  14.1272 (2.76)  447.1345 (2.54)  7.9060 (6.59)   151;162    1484
sample_df[size100M-AxisColumn-Seed] (befo)              297.1560 (1.72)  477.1500 (1.87)  307.8915 (1.73)  17.2036 (3.36)  300.5620 (1.71)  9.4080 (7.85)   159;168    1695
sample_df[size10K-AxisColumn-NumpyRandomState] (afte)   176.6440 (1.02)   254.9110 (1.0)  180.0217 (1.01)    5.1152 (1.0)  178.8940 (1.02)   1.1990 (1.0)   226;405    3542
sample_df[size10K-AxisColumn-Seed] (afte)               451.6370 (2.61)  689.8120 (2.71)  465.9937 (2.61)  14.3921 (2.81)  463.0710 (2.64)  6.7365 (5.62)     62;91    1183
sample_df[size10K-AxisColumn-Seed] (befo)               309.4000 (1.79)  413.9080 (1.62)  316.5210 (1.78)   7.6379 (1.49)  315.2130 (1.79)  5.4100 (4.51)     66;42     826
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

Part of #10153

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10262
rapidsai · Mar 4, 2022 · 1e5b01f
1 parent b5337d7
commit 1e5b01f
Showing 10 changed files with 499 additions and 323 deletions.
diff --git a/python/cudf/cudf/_lib/copying.pyx b/python/cudf/cudf/_lib/copying.pyx
@@ -656,32 +656,6 @@ def get_element(Column input_column, size_type index):
     )
 
 
-def sample(input, size_type n,
-           bool replace, int64_t seed, bool keep_index=True):
-    cdef table_view tbl_view = table_view_from_table(input, not keep_index)
-    cdef cpp_copying.sample_with_replacement replacement
-
-    if replace:
-        replacement = cpp_copying.sample_with_replacement.TRUE
-    else:
-        replacement = cpp_copying.sample_with_replacement.FALSE
-
-    cdef unique_ptr[table] c_output
-    with nogil:
-        c_output = move(
-            cpp_copying.sample(tbl_view, n, replacement, seed)
-        )
-
-    return data_from_unique_ptr(
-        move(c_output),
-        column_names=input._column_names,
-        index_names=(
-            None if keep_index is False
-            else input._index_names
-        )
-    )
-
-
 def segmented_gather(Column source_column, Column gather_map):
     cdef shared_ptr[lists_column_view] source_LCV = (
         make_shared[lists_column_view](source_column.view())
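
The binding removed above took a `keep_index` flag. Per the commit message, the public `sample` API now takes `ignore_index` with the opposite meaning. A minimal sketch of a caller-side shim for code that still passes the old keyword follows. The helper `translate_sample_kwargs` is hypothetical and is not part of cuDF or of this PR:

```python
def translate_sample_kwargs(kwargs: dict) -> dict:
    """Hypothetical helper: map the legacy `keep_index` flag to `ignore_index`.

    `ignore_index` is the logical negation of `keep_index`, as described
    in the commit message.
    """
    new_kwargs = dict(kwargs)  # do not mutate the caller's dict
    if "keep_index" in new_kwargs:
        if "ignore_index" in new_kwargs:
            raise TypeError(
                "Pass either `keep_index` or `ignore_index`, not both."
            )
        new_kwargs["ignore_index"] = not new_kwargs.pop("keep_index")
    return new_kwargs


legacy_call = {"n": 3, "replace": False, "keep_index": False}
print(translate_sample_kwargs(legacy_call))
```

Raising on a call that passes both keywords is a design choice of this sketch, so that conflicting intent is never resolved silently.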

diff --git a/python/cudf/cudf/_lib/cpp/copying.pxd b/python/cudf/cudf/_lib/cpp/copying.pxd
@@ -1,4 +1,4 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2022, NVIDIA CORPORATION.
 
 from libc.stdint cimport int32_t, int64_t, uint8_t
 from libcpp cimport bool
@@ -175,10 +175,3 @@ cdef extern from "cudf/copying.hpp" namespace "cudf" nogil:
     ctypedef enum sample_with_replacement:
         FALSE 'cudf::sample_with_replacement::FALSE',
         TRUE 'cudf::sample_with_replacement::TRUE',
-
-    cdef unique_ptr[table] sample (
-        table_view input,
-        size_type n,
-        sample_with_replacement replacement,
-        int64_t seed
-    ) except +
diff --git a/python/cudf/cudf/core/_base_index.py b/python/cudf/cudf/core/_base_index.py
@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 import pickle
+import warnings
 from functools import cached_property
 from typing import Any, Set
 
@@ -1528,6 +1529,27 @@ def _split_columns_by_levels(self, levels):
             [],
         )
 
+    def sample(
+        self,
+        n=None,
+        frac=None,
+        replace=False,
+        weights=None,
+        random_state=None,
+        axis=None,
+        ignore_index=False,
+    ):
+        warnings.warn(
+            "Index.sample is deprecated and will be removed.", FutureWarning,
+        )
+        return cudf.core.index._index_from_data(
+            self.to_frame()
+            .sample(
+                n, frac, replace, weights, random_state, axis, ignore_index
+            )
+            ._data
+        )
+
 
 def _get_result_name(left_name, right_name):
     if left_name == right_name:
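
The `Index.sample` hunk above follows a common deprecation pattern: emit a `FutureWarning`, then delegate to the implementation that remains supported. A self-contained sketch of that pattern using only the standard library follows. `FrameLike` and `IndexLike` are hypothetical stand-ins, not cuDF classes, and the "sampling" is a deterministic placeholder:

```python
import warnings


class FrameLike:
    """Hypothetical stand-in for the type that owns the real implementation."""

    def __init__(self, values):
        self.values = list(values)

    def sample(self, n):
        # Placeholder for real sampling: take the first n items so the
        # sketch stays deterministic.
        return self.values[:n]


class IndexLike:
    """Hypothetical stand-in for an index type whose `sample` is deprecated."""

    def __init__(self, values):
        self.values = list(values)

    def to_frame(self):
        return FrameLike(self.values)

    def sample(self, n):
        # Warn first, then delegate to the supported implementation.
        warnings.warn(
            "IndexLike.sample is deprecated and will be removed.",
            FutureWarning,
        )
        return IndexLike(self.to_frame().sample(n))


# Record warnings so the deprecation can be observed programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sampled = IndexLike([10, 20, 30]).sample(2)

print(sampled.values, [w.category.__name__ for w in caught])
```

`warnings.warn` also accepts a `stacklevel` argument. Passing `stacklevel=2` attributes the warning to the caller's line rather than to the wrapper, which may be worth considering here.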

diff --git a/python/cudf/cudf/core/dataframe.py b/python/cudf/cudf/core/dataframe.py
@@ -36,6 +36,7 @@
 import cudf
 import cudf.core.common
 from cudf import _lib as libcudf
+from cudf._typing import ColumnLike
 from cudf.api.types import (
     _is_scalar_or_zero_d_array,
     is_bool_dtype,
@@ -6322,6 +6323,32 @@ def nunique(self, axis=0, dropna=True):
 
         return cudf.Series(super().nunique(method="sort", dropna=dropna))
 
+    def _sample_axis_1(
+        self,
+        n: int,
+        weights: Optional[ColumnLike],
+        replace: bool,
+        random_state: np.random.RandomState,
+        ignore_index: bool,
+    ):
+        if replace:
+            # Since cuDF does not support multiple columns with same name,
+            # sample with replace=True at axis 1 is unsupported.
+            raise NotImplementedError(
+                "Sample is not supported for axis 1/`columns` when"
+                "`replace=True`."
+            )
+
+        sampled_column_labels = random_state.choice(
+            self._column_names, size=n, replace=False, p=weights
+        )
+
+        result = self._get_columns_by_label(sampled_column_labels)
+        if ignore_index:
+            result.reset_index(drop=True)
+
+        return result
+
 
 def from_dataframe(df, allow_copy=False):
     return df_protocol.from_dataframe(df, allow_copy=allow_copy)
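
One detail in the `NotImplementedError` message added in `_sample_axis_1` above: Python joins adjacent string literals with no separator, so the message as written renders without a space between "when" and `` `replace=True` ``. A small sketch of the behavior (the variable names are illustrative only):

```python
# Adjacent string literals are concatenated at compile time, exactly as written.
message_as_written = (
    "Sample is not supported for axis 1/`columns` when"
    "`replace=True`."
)

# Adding a trailing space to the first literal restores the intended wording.
message_with_space = (
    "Sample is not supported for axis 1/`columns` when "
    "`replace=True`."
)

print(message_as_written)
# Sample is not supported for axis 1/`columns` when`replace=True`.
print(message_with_space)
# Sample is not supported for axis 1/`columns` when `replace=True`.
```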

diff --git a/python/cudf/cudf/core/frame.py b/python/cudf/cudf/core/frame.py
@@ -50,7 +50,7 @@
 from cudf.core.window import Rolling
 from cudf.utils import ioutils
 from cudf.utils.docutils import copy_docstring
-from cudf.utils.dtypes import find_common_type, is_column_like
+from cudf.utils.dtypes import find_common_type
 
 T = TypeVar("T", bound="Frame")
 
@@ -1659,199 +1659,6 @@ def shift(self, periods=1, freq=None, axis=0, fill_value=None):
             zip(self._column_names, data_columns), self._index
         )
 
-    @annotate("FRAME_SAMPLE", color="orange", domain="cudf_python")
-    def sample(
-        self,
-        n=None,
-        frac=None,
-        replace=False,
-        weights=None,
-        random_state=None,
-        axis=None,
-        keep_index=True,
-    ):
-        """Return a random sample of items from an axis of object.
-
-        You can use random_state for reproducibility.
-
-        Parameters
-        ----------
-        n : int, optional
-            Number of items from axis to return. Cannot be used with frac.
-            Default = 1 if frac = None.
-        frac : float, optional
-            Fraction of axis items to return. Cannot be used with n.
-        replace : bool, default False
-            Allow or disallow sampling of the same row more than once.
-            replace == True is not yet supported for axis = 1/"columns"
-        weights : str or ndarray-like, optional
-            Only supported for axis=1/"columns"
-        random_state : int, numpy RandomState or None, default None
-            Seed for the random number generator (if int), or None.
-            If None, a random seed will be chosen.
-            if RandomState, seed will be extracted from current state.
-        axis : {0 or ‘index’, 1 or ‘columns’, None}, default None
-            Axis to sample. Accepts axis number or name.
-            Default is stat axis for given data type
-            (0 for Series and DataFrames). Series and Index doesn't
-            support axis=1.
-
-        Returns
-        -------
-        Series or DataFrame or Index
-            A new object of same type as caller containing n items
-            randomly sampled from the caller object.
-
-        Examples
-        --------
-        >>> import cudf as cudf
-        >>> df = cudf.DataFrame({"a":{1, 2, 3, 4, 5}})
-        >>> df.sample(3)
-           a
-        1  2
-        3  4
-        0  1
-
-        >>> sr = cudf.Series([1, 2, 3, 4, 5])
-        >>> sr.sample(10, replace=True)
-        1    4
-        3    1
-        2    4
-        0    5
-        0    1
-        4    5
-        4    1
-        0    2
-        0    3
-        3    2
-        dtype: int64
-
-        >>> df = cudf.DataFrame(
-        ... {"a":[1, 2], "b":[2, 3], "c":[3, 4], "d":[4, 5]})
-        >>> df.sample(2, axis=1)
-           a  c
-        0  1  3
-        1  2  4
-        """
-
-        if frac is not None and frac > 1 and not replace:
-            raise ValueError(
-                "Replace has to be set to `True` "
-                "when upsampling the population `frac` > 1."
-            )
-        elif frac is not None and n is not None:
-            raise ValueError(
-                "Please enter a value for `frac` OR `n`, not both"
-            )
-
-        if frac is None and n is None:
-            n = 1
-        elif frac is not None:
-            if axis is None or axis == 0 or axis == "index":
-                n = int(round(self.shape[0] * frac))
-            else:
-                n = int(round(self.shape[1] * frac))
-
-        if axis is None or axis == 0 or axis == "index":
-            if n > 0 and self.shape[0] == 0:
-                raise ValueError(
-                    "Cannot take a sample larger than 0 when axis is empty"
-                )
-
-            if not replace and n > self.shape[0]:
-                raise ValueError(
-                    "Cannot take a larger sample than population "
-                    "when 'replace=False'"
-                )
-
-            if weights is not None:
-                raise NotImplementedError(
-                    "weights is not yet supported for axis=0/index"
-                )
-
-            if random_state is None:
-                seed = np.random.randint(
-                    np.iinfo(np.int64).max, dtype=np.int64
-                )
-            elif isinstance(random_state, np.random.mtrand.RandomState):
-                _, keys, pos, _, _ = random_state.get_state()
-                seed = 0 if pos >= len(keys) else pos
-            else:
-                seed = np.int64(random_state)
-
-            result = self.__class__._from_data(
-                *libcudf.copying.sample(
-                    self,
-                    n=n,
-                    replace=replace,
-                    seed=seed,
-                    keep_index=keep_index,
-                )
-            )
-            result._copy_type_metadata(self)
-
-            return result
-        else:
-            if len(self.shape) != 2:
-                raise ValueError(
-                    f"No axis named {axis} for "
-                    f"object type {self.__class__}"
-                )
-
-            if replace:
-                raise NotImplementedError(
-                    "Sample is not supported for "
-                    f"axis {axis} when 'replace=True'"
-                )
-
-            if n > 0 and self.shape[1] == 0:
-                raise ValueError(
-                    "Cannot take a sample larger than 0 when axis is empty"
-                )
-
-            columns = np.asarray(self._data.names)
-            if not replace and n > columns.size:
-                raise ValueError(
-                    "Cannot take a larger sample "
-                    "than population when 'replace=False'"
-                )
-
-            if weights is not None:
-                if is_column_like(weights):
-                    weights = np.asarray(weights)
-                else:
-                    raise ValueError(
-                        "Strings can only be passed to weights "
-                        "when sampling from rows on a DataFrame"
-                    )
-
-                if columns.size != len(weights):
-                    raise ValueError(
-                        "Weights and axis to be sampled must be of same length"
-                    )
-
-                total_weight = weights.sum()
-                if total_weight != 1:
-                    if not isinstance(weights.dtype, float):
-                        weights = weights.astype("float64")
-                    weights = weights / total_weight
-
-            np.random.seed(random_state)
-            gather_map = np.random.choice(
-                columns, size=n, replace=replace, p=weights
-            )
-
-            if isinstance(self, cudf.MultiIndex):
-                # TODO: Need to update this once MultiIndex is refactored,
-                # should be able to treat it similar to other Frame object
-                result = cudf.Index(self.to_frame(index=False)[gather_map])
-            else:
-                result = self[gather_map]
-                if not keep_index:
-                    result.index = None
-
-            return result
-
     @classmethod
     @annotate("FRAME_FROM_ARROW", color="orange", domain="cudf_python")
     def from_arrow(cls, data):