Improve User Guide docs (#10663)
This PR makes some minor improvements to the cuDF user guide and some docstrings.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10663
bdice authored Apr 14, 2022
1 parent 77fa49e commit 14a3261
Showing 7 changed files with 91 additions and 70 deletions.
58 changes: 31 additions & 27 deletions docs/cudf/source/basics/basics.rst
@@ -15,36 +15,40 @@ The following table lists all of cudf types. For methods requiring dtype argumen
.. rst-class:: special-table
.. table::

+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
| Kind of Data | Data Type | Scalar | String Aliases |
+========================+==================+=====================================================================================+=============================================+
| Integer | | np.int8_, np.int16_, np.int32_, np.int64_, np.uint8_, np.uint16_, | ``'int8'``, ``'int16'``, ``'int32'``, |
| | | np.uint32_, np.uint64_ | ``'int64'``, ``'uint8'``, ``'uint16'``, |
| | | | ``'uint32'``, ``'uint64'`` |
+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
| Float | | np.float32_, np.float64_ | ``'float32'``, ``'float64'`` |
+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
| Strings | | `str <https://docs.python.org/3/library/stdtypes.html#str>`_ | ``'string'``, ``'object'`` |
+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
| Datetime | | np.datetime64_ | ``'datetime64[s]'``, ``'datetime64[ms]'``, |
| | | | ``'datetime64[us]'``, ``'datetime64[ns]'`` |
+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
| Timedelta | | np.timedelta64_ | ``'timedelta64[s]'``, ``'timedelta64[ms]'``,|
| (duration type) | | | ``'timedelta64[us]'``, ``'timedelta64[ns]'``|
+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
| Categorical | CategoricalDtype | (none) | ``'category'`` |
+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
| Boolean | | np.bool_ | ``'bool'`` |
+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
| Decimal | Decimal32Dtype, | (none) | (none) |
| | Decimal64Dtype, | | |
| | Decimal128Dtype | | |
+------------------------+------------------+-------------------------------------------------------------------------------------+---------------------------------------------+
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Kind of Data | Data Type | Scalar | String Aliases |
+=================+==================+==============================================================+==============================================+
| Integer | | np.int8_, np.int16_, np.int32_, np.int64_, np.uint8_, | ``'int8'``, ``'int16'``, ``'int32'``, |
| | | np.uint16_, np.uint32_, np.uint64_ | ``'int64'``, ``'uint8'``, ``'uint16'``, |
| | | | ``'uint32'``, ``'uint64'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Float | | np.float32_, np.float64_ | ``'float32'``, ``'float64'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Strings | | `str <https://docs.python.org/3/library/stdtypes.html#str>`_ | ``'string'``, ``'object'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Datetime | | np.datetime64_ | ``'datetime64[s]'``, ``'datetime64[ms]'``, |
| | | | ``'datetime64[us]'``, ``'datetime64[ns]'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Timedelta | | np.timedelta64_ | ``'timedelta64[s]'``, ``'timedelta64[ms]'``, |
| (duration type) | | | ``'timedelta64[us]'``, ``'timedelta64[ns]'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Categorical | CategoricalDtype | (none) | ``'category'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Boolean | | np.bool_ | ``'bool'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Decimal | Decimal32Dtype, | (none) | (none) |
| | Decimal64Dtype, | | |
| | Decimal128Dtype | | |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Lists | ListDtype | list | ``'list'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+
| Structs | StructDtype | dict | ``'struct'`` |
+-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+

**Note: All dtypes above are Nullable**

.. _np.int8:
.. _np.int16:
.. _np.int8:
.. _np.int16:
.. _np.int32:
.. _np.int64:
.. _np.uint8:
4 changes: 2 additions & 2 deletions docs/cudf/source/basics/internals.rst
@@ -54,7 +54,7 @@ As another example, the ``StringColumn`` backing the Series
2. No mask buffer as there are no nulls in the Series
3. Two children columns:

- A column of 8-bit characters
- A column of UTF-8 characters
``['d', 'o', 'y', 'o', 'u', 'h' ... '?']``
- A column of "offsets" to the characters column (in this case,
``[0, 2, 5, 9, 12, 19]``)
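
The relationship between the two child columns can be sketched in plain Python. This is only an illustration of the offsets layout, not cuDF's implementation: the helper name `decode_strings` is hypothetical, and the input strings are an assumption chosen to be consistent with the characters and offsets shown above.

```python
# Illustrative sketch only: rebuild strings from a flat character buffer
# plus an offsets list, mirroring the two child columns described above.
# The concrete strings are an assumption that matches offsets [0, 2, 5, 9, 12, 19].

def decode_strings(chars: bytes, offsets: list) -> list:
    """Slice the UTF-8 buffer between each pair of consecutive offsets."""
    return [
        chars[start:end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]

chars = "doyouhaveanycheese?".encode("utf-8")  # flat characters column
offsets = [0, 2, 5, 9, 12, 19]                 # n strings need n + 1 offsets

strings = decode_strings(chars, offsets)
print(strings)
```

Because the offsets are byte positions into a UTF-8 buffer, a string containing multi-byte characters would advance the offsets by more than one per character.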
@@ -172,7 +172,7 @@ Selecting columns by index:
>>> ca.select_by_index(1)
ColumnAccessor(OrderedColumnDict([('y', <cudf.core.column.string.StringColumn object at 0x7f5a7d578830>)]), multiindex=False, level_names=(None,))
>>> ca.select_by_index([0, 1])
ColumnAccessor(OrderedColumnDict([('x', <cudf.core.column.numerical.NumericalColumn object at 0x7f5a7d5789e0>), ('y', <cudf.core.column.string.StringColumn object at 0x7f5a7d578830>)]), multiindex=False, level_names=(None,))
ColumnAccessor(OrderedColumnDict([('x', <cudf.core.column.numerical.NumericalColumn object at 0x7f5a7d5789e0>), ('y', <cudf.core.column.string.StringColumn object at 0x7f5a7d578830>)]), multiindex=False, level_names=(None,))
>>> ca.select_by_index(slice(1, 3))
ColumnAccessor(OrderedColumnDict([('y', <cudf.core.column.string.StringColumn object at 0x7f5a7d578830>), ('z', <cudf.core.column.numerical.NumericalColumn object at 0x7f5a7d5788c0>)]), multiindex=False, level_names=(None,))
24 changes: 12 additions & 12 deletions docs/cudf/source/basics/io-gds-integration.rst
@@ -1,14 +1,14 @@
GPUDirect Storage Integration
=============================

Many IO APIs can use GPUDirect Storage (GDS) library to optimize IO operations.
GDS enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.
GDS also has a compatibility mode that allows the library to fall back to copying through a CPU bounce buffer.
Many IO APIs can use GPUDirect Storage (GDS) library to optimize IO operations.
GDS enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.
GDS also has a compatibility mode that allows the library to fall back to copying through a CPU bounce buffer.
The SDK is available for download `here <https://developer.nvidia.com/gpudirect-storage>`_.
GDS is also included in CUDA Toolkit 11.4 and higher.

Use of GPUDirect Storage in cuDF is enabled by default, but can be disabled through the environment variable ``LIBCUDF_CUFILE_POLICY``.
This variable also controls the GDS compatibility mode.
Use of GPUDirect Storage in cuDF is enabled by default, but can be disabled through the environment variable ``LIBCUDF_CUFILE_POLICY``.
This variable also controls the GDS compatibility mode.

There are three valid values for the environment variable:

@@ -20,17 +20,17 @@ If no value is set, behavior will be the same as the "GDS" option.

This environment variable also affects how cuDF treats GDS errors.
When ``LIBCUDF_CUFILE_POLICY`` is set to "GDS" and a GDS API call fails for any reason, cuDF falls back to the internal implementation with bounce buffers.
When ``LIBCUDF_CUFILE_POLICY`` is set to "ALWAYS" and a GDS API call fails for any reason (unlikely, given that the compatibility mode is on),
When ``LIBCUDF_CUFILE_POLICY`` is set to "ALWAYS" and a GDS API call fails for any reason (unlikely, given that the compatibility mode is on),
cuDF throws an exception to propagate the error to the user.

Operations that support the use of GPUDirect Storage:

- `read_avro`
- `read_parquet`
- `read_orc`
- `to_csv`
- `to_parquet`
- `to_orc`
- :py:func:`cudf.read_avro`
- :py:func:`cudf.read_parquet`
- :py:func:`cudf.read_orc`
- :py:meth:`cudf.DataFrame.to_csv`
- :py:meth:`cudf.DataFrame.to_parquet`
- :py:meth:`cudf.DataFrame.to_orc`

Several parameters that can be used to tune the performance of GDS-enabled I/O are exposed through environment variables:

4 changes: 2 additions & 2 deletions docs/cudf/source/basics/io-nvcomp-integration.rst
@@ -1,14 +1,14 @@
nvCOMP Integration
=============================

Some types of compression/decompression can be performed using either `nvCOMP library <https://github.com/NVIDIA/nvcomp>`_ or the internal implementation.
Some types of compression/decompression can be performed using either the `nvCOMP library <https://github.com/NVIDIA/nvcomp>`_ or the internal implementation.

Which implementation is used by default depends on the data format and the compression type.
Behavior can be influenced through environment variable ``LIBCUDF_NVCOMP_POLICY``.

There are three valid values for the environment variable:

- "STABLE": Only enable the nvCOMP in places where it has been deemed stable for production use.
- "STABLE": Only enable the nvCOMP in places where it has been deemed stable for production use.
- "ALWAYS": Enable all available uses of nvCOMP, including new, experimental combinations.
- "OFF": Disable nvCOMP use whenever possible and use the internal implementations instead.
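
As a minimal sketch of selecting one of these policies from Python: the helper `set_nvcomp_policy` below is hypothetical (it is not part of cuDF) and only validates the value and sets the environment variable. The text does not say when libcudf reads the variable, so setting it before importing `cudf` is assumed here as the conservative choice.

```python
import os

# The three values listed above for LIBCUDF_NVCOMP_POLICY.
VALID_NVCOMP_POLICIES = {"STABLE", "ALWAYS", "OFF"}

def set_nvcomp_policy(policy: str) -> None:
    """Hypothetical helper: validate the policy, then export it."""
    if policy not in VALID_NVCOMP_POLICIES:
        raise ValueError(
            f"invalid policy {policy!r}; expected one of "
            f"{sorted(VALID_NVCOMP_POLICIES)}"
        )
    os.environ["LIBCUDF_NVCOMP_POLICY"] = policy

# Assumption: call this before importing cudf or performing any I/O.
set_nvcomp_policy("OFF")
print(os.environ["LIBCUDF_NVCOMP_POLICY"])
```

The same effect can be had by exporting the variable in the shell before launching Python; the helper only guards against misspelled values.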

46 changes: 31 additions & 15 deletions python/cudf/cudf/core/cut.py
@@ -1,3 +1,5 @@
# Copyright (c) 2021-2022, NVIDIA CORPORATION.

from collections.abc import Sequence

import cupy
@@ -21,21 +23,27 @@ def cut(
duplicates: str = "raise",
ordered: bool = True,
):
"""Bin values into discrete intervals.
"""
Bin values into discrete intervals.
Use cut when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable.
Parameters
----------
x : array-like
The input array to be binned. Must be 1-dimensional.
bins : int, sequence of scalars, or IntervalIndex
The criteria to bin by.
* int : Defines the number of equal-width bins in the
range of x. The range of x is extended by .1% on each
side to include the minimum and maximum values of x.
* int : Defines the number of equal-width bins in the range of `x`. The
range of `x` is extended by .1% on each side to include the minimum
and maximum values of `x`.
* sequence of scalars : Defines the bin edges allowing for non-uniform
width. No extension of the range of `x` is done.
* IntervalIndex : Defines the exact bins to be used. Note that
IntervalIndex for `bins` must be non-overlapping.
right : bool, default True
Indicates whether bins includes the rightmost edge or not.
labels : array or False, default None
@@ -66,30 +74,38 @@ def cut(
For scalar or sequence bins, this is an ndarray with the computed
bins. If set duplicates=drop, bins will drop non-unique bin. For
an IntervalIndex bins, this is equal to bins.
Examples
--------
Discretize into three equal-sized bins.
>>> cudf.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
CategoricalIndex([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0],
... (5.0, 7.0],(0.994, 3.0]], categories=[(0.994, 3.0],
... (3.0, 5.0], (5.0, 7.0]], ordered=True, dtype='category')
(5.0, 7.0], (0.994, 3.0]], categories=[(0.994, 3.0],
(3.0, 5.0], (5.0, 7.0]], ordered=True, dtype='category')
>>> cudf.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
(CategoricalIndex([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0],
... (5.0, 7.0],(0.994, 3.0]],categories=[(0.994, 3.0],
... (3.0, 5.0], (5.0, 7.0]],ordered=True, dtype='category'),
array([0.994, 3. , 5. , 7. ]))
(5.0, 7.0], (0.994, 3.0]], categories=[(0.994, 3.0],
(3.0, 5.0], (5.0, 7.0]], ordered=True, dtype='category'),
array([0.994, 3. , 5. , 7. ]))
>>> cudf.cut(np.array([1, 7, 5, 4, 6, 3]),
... 3, labels=["bad", "medium", "good"])
... 3, labels=["bad", "medium", "good"])
CategoricalIndex(['bad', 'good', 'medium', 'medium', 'good', 'bad'],
... categories=['bad', 'medium', 'good'],ordered=True,
... dtype='category')
categories=['bad', 'medium', 'good'],ordered=True,
dtype='category')
>>> cudf.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
... labels=["B", "A", "B"], ordered=False)
... labels=["B", "A", "B"], ordered=False)
CategoricalIndex(['B', 'B', 'A', 'A', 'B', 'B'], categories=['A', 'B'],
... ordered=False, dtype='category')
ordered=False, dtype='category')
>>> cudf.cut([0, 1, 1, 2], bins=4, labels=False)
array([0, 1, 1, 3], dtype=int32)
Passing a Series as an input returns a Series with categorical dtype:
>>> s = cudf.Series(np.array([2, 4, 6, 8, 10]),
... index=['a', 'b', 'c', 'd', 'e'])
>>> cudf.cut(s, 3)
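
The integer `bins` case described in the `cut` docstring above can be worked through by hand. The following pure-Python sketch is not cuDF's implementation, and the helper name `equal_width_edges` is hypothetical. It assumes that with `right=True` only the lowest edge is moved down by 0.1% of the data range, which is what the `retbins=True` output in the docstring (`0.994, 3., 5., 7.`) suggests; the docstring's wording ("on each side") is looser than that.

```python
# Illustrative sketch only: compute equal-width bin edges for an integer
# `bins` argument, reproducing the edges shown in the retbins example.

def equal_width_edges(values, n_bins, right=True):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    edges[-1] = float(hi)  # pin the last edge to the maximum exactly
    adjustment = (hi - lo) * 0.001  # 0.1% of the data range
    if right:
        # Bins are (left, right], so lower the first edge to include the minimum.
        edges[0] -= adjustment
    else:
        # Bins are [left, right), so raise the last edge to include the maximum.
        edges[-1] += adjustment
    return edges

edges = equal_width_edges([1, 7, 5, 4, 6, 3], 3)
print([round(e, 3) for e in edges])
```

For the input `[1, 7, 5, 4, 6, 3]` the range is 6, so each of the three bins is 2 wide and the adjustment is 0.006, which accounts for the `0.994` lower edge in the docstring output.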
21 changes: 11 additions & 10 deletions python/cudf/cudf/core/groupby/groupby.py
@@ -566,19 +566,20 @@ def mult(df):
.. code-block::
>>> df = pd.DataFrame({
'a': [1, 1, 2, 2],
'b': [1, 2, 1, 2],
'c': [1, 2, 3, 4]})
... 'a': [1, 1, 2, 2],
... 'b': [1, 2, 1, 2],
... 'c': [1, 2, 3, 4],
... })
>>> gdf = cudf.from_pandas(df)
>>> df.groupby('a').apply(lambda x: x.iloc[[0]])
a b c
a
1 0 1 1 1
2 2 2 1 3
a b c
a
1 0 1 1 1
2 2 2 1 3
>>> gdf.groupby('a').apply(lambda x: x.iloc[[0]])
a b c
0 1 1 1
2 2 1 3
a b c
0 1 1 1
2 2 1 3
"""
if not callable(function):
raise TypeError(f"type {type(function)} is not callable")
4 changes: 2 additions & 2 deletions python/cudf/cudf/core/single_column_frame.py
@@ -81,8 +81,8 @@ def name(self, value):

@property # type: ignore
@_cudf_nvtx_annotate
def ndim(self):
"""Get the dimensionality (always 1 for single-columned frames)."""
def ndim(self): # noqa: D401
"""Number of dimensions of the underlying data, by definition 1."""
return 1

@property # type: ignore
