diff --git a/conda/environments/cudf_dev_cuda11.5.yml b/conda/environments/cudf_dev_cuda11.5.yml index bdde007e33e..15f4bff583e 100644 --- a/conda/environments/cudf_dev_cuda11.5.yml +++ b/conda/environments/cudf_dev_cuda11.5.yml @@ -54,6 +54,10 @@ dependencies: - hypothesis - sphinx-markdown-tables - sphinx-copybutton + - sphinx-autobuild + - myst-nb + - scipy + - dask-cuda - mimesis<4.1 - packaging - protobuf diff --git a/docs/cudf/source/basics/basics.rst b/docs/cudf/source/basics/basics.rst deleted file mode 100644 index 9b8983fba49..00000000000 --- a/docs/cudf/source/basics/basics.rst +++ /dev/null @@ -1,62 +0,0 @@ -Basics -====== - - -Supported Dtypes ----------------- - -cuDF uses dtypes for Series or individual columns of a DataFrame. cuDF uses NumPy dtypes, NumPy provides support for ``float``, ``int``, ``bool``, -``'timedelta64[s]'``, ``'timedelta64[ms]'``, ``'timedelta64[us]'``, ``'timedelta64[ns]'``, ``'datetime64[s]'``, ``'datetime64[ms]'``, -``'datetime64[us]'``, ``'datetime64[ns]'`` (note that NumPy does not support timezone-aware datetimes). - - -The following table lists all of cudf types. For methods requiring dtype arguments, strings can be specified as indicated. See the respective documentation sections for more on each type. - -.. rst-class:: special-table -.. 
table:: - - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Kind of Data | Data Type | Scalar | String Aliases | - +=================+==================+==============================================================+==============================================+ - | Integer | | np.int8_, np.int16_, np.int32_, np.int64_, np.uint8_, | ``'int8'``, ``'int16'``, ``'int32'``, | - | | | np.uint16_, np.uint32_, np.uint64_ | ``'int64'``, ``'uint8'``, ``'uint16'``, | - | | | | ``'uint32'``, ``'uint64'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Float | | np.float32_, np.float64_ | ``'float32'``, ``'float64'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Strings | | `str `_ | ``'string'``, ``'object'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Datetime | | np.datetime64_ | ``'datetime64[s]'``, ``'datetime64[ms]'``, | - | | | | ``'datetime64[us]'``, ``'datetime64[ns]'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Timedelta | | np.timedelta64_ | ``'timedelta64[s]'``, ``'timedelta64[ms]'``, | - | (duration type) | | | ``'timedelta64[us]'``, ``'timedelta64[ns]'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Categorical | CategoricalDtype | (none) | ``'category'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | 
Boolean | | np.bool_ | ``'bool'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Decimal | Decimal32Dtype, | (none) | (none) | - | | Decimal64Dtype, | | | - | | Decimal128Dtype | | | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Lists | ListDtype | list | ``'list'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - | Structs | StructDtype | dict | ``'struct'`` | - +-----------------+------------------+--------------------------------------------------------------+----------------------------------------------+ - -**Note: All dtypes above are Nullable** - -.. _np.int8: -.. _np.int16: -.. _np.int32: -.. _np.int64: -.. _np.uint8: -.. _np.uint16: -.. _np.uint32: -.. _np.uint64: -.. _np.float32: -.. _np.float64: -.. _np.bool: https://numpy.org/doc/stable/user/basics.types.html -.. _np.datetime64: https://numpy.org/doc/stable/reference/arrays.datetime.html#basic-datetimes -.. _np.timedelta64: https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic diff --git a/docs/cudf/source/basics/index.rst b/docs/cudf/source/basics/index.rst deleted file mode 100644 index a29866d7e32..00000000000 --- a/docs/cudf/source/basics/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -====== -Basics -====== - - -.. 
toctree:: - :maxdepth: 2 - - basics - io.rst - groupby.rst - PandasCompat.rst - dask-cudf.rst - internals.rst - \ No newline at end of file diff --git a/docs/cudf/source/conf.py b/docs/cudf/source/conf.py index d65b77ef74b..c8b30120924 100644 --- a/docs/cudf/source/conf.py +++ b/docs/cudf/source/conf.py @@ -46,10 +46,13 @@ "numpydoc", "IPython.sphinxext.ipython_console_highlighting", "IPython.sphinxext.ipython_directive", - "nbsphinx", "PandasCompat", + "myst_nb", ] +jupyter_execute_notebooks = "force" +execution_timeout = 300 + copybutton_prompt_text = ">>> " autosummary_generate = True ipython_mplbackend = "str" diff --git a/docs/cudf/source/index.rst b/docs/cudf/source/index.rst index 90b287bd1b6..2c1df4a0c12 100644 --- a/docs/cudf/source/index.rst +++ b/docs/cudf/source/index.rst @@ -14,7 +14,6 @@ the details of CUDA programming. :caption: Contents: user_guide/index - basics/index api_docs/index diff --git a/docs/cudf/source/user_guide/10min.ipynb b/docs/cudf/source/user_guide/10min.ipynb index ab006847fc6..02e1ba40f1f 100644 --- a/docs/cudf/source/user_guide/10min.ipynb +++ b/docs/cudf/source/user_guide/10min.ipynb @@ -2,6 +2,7 @@ "cells": [ { "cell_type": "markdown", + "id": "e9357872", "metadata": {}, "source": [ "10 Minutes to cuDF and Dask-cuDF\n", @@ -26,6 +27,7 @@ { "cell_type": "code", "execution_count": 1, + "id": "92eed4cb", "metadata": {}, "outputs": [], "source": [ @@ -45,6 +47,7 @@ }, { "cell_type": "markdown", + "id": "ed6c6047", "metadata": {}, "source": [ "Object Creation\n", @@ -53,6 +56,7 @@ }, { "cell_type": "markdown", + "id": "aeedd961", "metadata": {}, "source": [ "Creating a `cudf.Series` and `dask_cudf.Series`." 
@@ -61,6 +65,7 @@ { "cell_type": "code", "execution_count": 2, + "id": "cf8b08e5", "metadata": {}, "outputs": [ { @@ -87,6 +92,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "083a5898", "metadata": {}, "outputs": [ { @@ -112,6 +118,7 @@ }, { "cell_type": "markdown", + "id": "6346e1b1", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column." @@ -120,6 +127,7 @@ { "cell_type": "code", "execution_count": 4, + "id": "83d1e7f5", "metadata": {}, "outputs": [ { @@ -313,6 +321,7 @@ { "cell_type": "code", "execution_count": 5, + "id": "71b61d62", "metadata": {}, "outputs": [ { @@ -502,6 +511,7 @@ }, { "cell_type": "markdown", + "id": "c7cb5abc", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n", @@ -512,6 +522,7 @@ { "cell_type": "code", "execution_count": 6, + "id": "07a62244", "metadata": {}, "outputs": [ { @@ -586,6 +597,7 @@ { "cell_type": "code", "execution_count": 7, + "id": "f5cb0c65", "metadata": {}, "outputs": [ { @@ -658,6 +670,7 @@ }, { "cell_type": "markdown", + "id": "025eac40", "metadata": {}, "source": [ "Viewing Data\n", @@ -666,6 +679,7 @@ }, { "cell_type": "markdown", + "id": "47a567e8", "metadata": {}, "source": [ "Viewing the top rows of a GPU dataframe." @@ -674,6 +688,7 @@ { "cell_type": "code", "execution_count": 8, + "id": "ab8cbdb8", "metadata": {}, "outputs": [ { @@ -737,6 +752,7 @@ { "cell_type": "code", "execution_count": 9, + "id": "2e923d8a", "metadata": {}, "outputs": [ { @@ -799,6 +815,7 @@ }, { "cell_type": "markdown", + "id": "61257b4b", "metadata": {}, "source": [ "Sorting by values." 
@@ -807,6 +824,7 @@ { "cell_type": "code", "execution_count": 10, + "id": "512770f9", "metadata": {}, "outputs": [ { @@ -996,6 +1014,7 @@ { "cell_type": "code", "execution_count": 11, + "id": "1a13993f", "metadata": {}, "outputs": [ { @@ -1184,6 +1203,7 @@ }, { "cell_type": "markdown", + "id": "19bce4c4", "metadata": {}, "source": [ "Selection\n", @@ -1194,6 +1214,7 @@ }, { "cell_type": "markdown", + "id": "ba55980e", "metadata": {}, "source": [ "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)." @@ -1202,6 +1223,7 @@ { "cell_type": "code", "execution_count": 12, + "id": "885989a6", "metadata": {}, "outputs": [ { @@ -1242,6 +1264,7 @@ { "cell_type": "code", "execution_count": 13, + "id": "14a74255", "metadata": {}, "outputs": [ { @@ -1281,6 +1304,7 @@ }, { "cell_type": "markdown", + "id": "498d79f2", "metadata": {}, "source": [ "## Selection by Label" @@ -1288,6 +1312,7 @@ }, { "cell_type": "markdown", + "id": "4b8b8e13", "metadata": {}, "source": [ "Selecting rows from index 2 to index 5 from columns 'a' and 'b'." @@ -1296,6 +1321,7 @@ { "cell_type": "code", "execution_count": 14, + "id": "d40bc19c", "metadata": {}, "outputs": [ { @@ -1368,6 +1394,7 @@ { "cell_type": "code", "execution_count": 15, + "id": "7688535b", "metadata": {}, "outputs": [ { @@ -1439,6 +1466,7 @@ }, { "cell_type": "markdown", + "id": "8a64ce7a", "metadata": {}, "source": [ "## Selection by Position" @@ -1446,6 +1474,7 @@ }, { "cell_type": "markdown", + "id": "dfba2bb2", "metadata": {}, "source": [ "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames." 
@@ -1454,6 +1483,7 @@ { "cell_type": "code", "execution_count": 16, + "id": "fb8d6d43", "metadata": {}, "outputs": [ { @@ -1477,6 +1507,7 @@ { "cell_type": "code", "execution_count": 17, + "id": "263231da", "metadata": {}, "outputs": [ { @@ -1542,6 +1573,7 @@ }, { "cell_type": "markdown", + "id": "2223b089", "metadata": {}, "source": [ "You can also select elements of a `DataFrame` or `Series` with direct index access." @@ -1550,6 +1582,7 @@ { "cell_type": "code", "execution_count": 18, + "id": "13f6158b", "metadata": {}, "outputs": [ { @@ -1613,6 +1646,7 @@ { "cell_type": "code", "execution_count": 19, + "id": "3cf4aa26", "metadata": {}, "outputs": [ { @@ -1634,6 +1668,7 @@ }, { "cell_type": "markdown", + "id": "ff633b2d", "metadata": {}, "source": [ "## Boolean Indexing" @@ -1641,6 +1676,7 @@ }, { "cell_type": "markdown", + "id": "bbdef48f", "metadata": {}, "source": [ "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing." @@ -1649,6 +1685,7 @@ { "cell_type": "code", "execution_count": 20, + "id": "becb916f", "metadata": {}, "outputs": [ { @@ -1726,6 +1763,7 @@ { "cell_type": "code", "execution_count": 21, + "id": "b9475c43", "metadata": {}, "outputs": [ { @@ -1802,6 +1840,7 @@ }, { "cell_type": "markdown", + "id": "ecf982f5", "metadata": {}, "source": [ "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API." @@ -1810,6 +1849,7 @@ { "cell_type": "code", "execution_count": 22, + "id": "fc2fc9f9", "metadata": {}, "outputs": [ { @@ -1866,6 +1906,7 @@ { "cell_type": "code", "execution_count": 23, + "id": "1a05a07f", "metadata": {}, "outputs": [ { @@ -1921,6 +1962,7 @@ }, { "cell_type": "markdown", + "id": "7f8955a0", "metadata": {}, "source": [ "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword. 
Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`." @@ -1929,6 +1971,7 @@ { "cell_type": "code", "execution_count": 24, + "id": "49485a4b", "metadata": {}, "outputs": [ { @@ -1986,6 +2029,7 @@ { "cell_type": "code", "execution_count": 25, + "id": "0f3a9116", "metadata": {}, "outputs": [ { @@ -2042,6 +2086,7 @@ }, { "cell_type": "markdown", + "id": "c355af07", "metadata": {}, "source": [ "Using the `isin` method for filtering." @@ -2050,6 +2095,7 @@ { "cell_type": "code", "execution_count": 26, + "id": "f44a5a57", "metadata": {}, "outputs": [ { @@ -2112,6 +2158,7 @@ }, { "cell_type": "markdown", + "id": "79a50beb", "metadata": {}, "source": [ "## MultiIndex" @@ -2119,6 +2166,7 @@ }, { "cell_type": "markdown", + "id": "14e70234", "metadata": {}, "source": [ "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex." @@ -2127,6 +2175,7 @@ { "cell_type": "code", "execution_count": 27, + "id": "882973ed", "metadata": {}, "outputs": [ { @@ -2153,6 +2202,7 @@ }, { "cell_type": "markdown", + "id": "c10971cc", "metadata": {}, "source": [ "This index can back either axis of a DataFrame." @@ -2161,6 +2211,7 @@ { "cell_type": "code", "execution_count": 28, + "id": "5417aeb9", "metadata": {}, "outputs": [ { @@ -2238,6 +2289,7 @@ { "cell_type": "code", "execution_count": 29, + "id": "4d6fb4ff", "metadata": {}, "outputs": [ { @@ -2311,6 +2363,7 @@ }, { "cell_type": "markdown", + "id": "63dc11d8", "metadata": {}, "source": [ "Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported." 
@@ -2319,6 +2372,7 @@ { "cell_type": "code", "execution_count": 30, + "id": "3644920c", "metadata": {}, "outputs": [ { @@ -2340,6 +2394,7 @@ }, { "cell_type": "markdown", + "id": "697a9a36", "metadata": {}, "source": [ "Missing Data\n", @@ -2348,6 +2403,7 @@ }, { "cell_type": "markdown", + "id": "86655274", "metadata": {}, "source": [ "Missing data can be replaced by using the `fillna` method." @@ -2356,6 +2412,7 @@ { "cell_type": "code", "execution_count": 31, + "id": "28b06c52", "metadata": {}, "outputs": [ { @@ -2381,6 +2438,7 @@ { "cell_type": "code", "execution_count": 32, + "id": "7fb6a126", "metadata": {}, "outputs": [ { @@ -2405,6 +2463,7 @@ }, { "cell_type": "markdown", + "id": "7a0b732f", "metadata": {}, "source": [ "Operations\n", @@ -2413,6 +2472,7 @@ }, { "cell_type": "markdown", + "id": "1e8b0464", "metadata": {}, "source": [ "## Stats" @@ -2420,6 +2480,7 @@ }, { "cell_type": "markdown", + "id": "7523512b", "metadata": {}, "source": [ "Calculating descriptive statistics for a `Series`." @@ -2428,6 +2489,7 @@ { "cell_type": "code", "execution_count": 33, + "id": "f7cb604e", "metadata": {}, "outputs": [ { @@ -2448,6 +2510,7 @@ { "cell_type": "code", "execution_count": 34, + "id": "b8957a5f", "metadata": {}, "outputs": [ { @@ -2467,6 +2530,7 @@ }, { "cell_type": "markdown", + "id": "71fa928a", "metadata": {}, "source": [ "## Applymap" @@ -2474,6 +2538,7 @@ }, { "cell_type": "markdown", + "id": "d98d6f7b", "metadata": {}, "source": [ "Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe." 
@@ -2482,8 +2547,17 @@ { "cell_type": "code", "execution_count": 35, + "id": "5e627811", "metadata": {}, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/core/series.py:2223: FutureWarning: Series.applymap is deprecated and will be removed in a future cuDF release. Use Series.apply instead.\n", + " warnings.warn(\n" + ] + }, { "data": { "text/plain": [ @@ -2525,6 +2599,7 @@ { "cell_type": "code", "execution_count": 36, + "id": "96cf628e", "metadata": {}, "outputs": [ { @@ -2564,6 +2639,7 @@ }, { "cell_type": "markdown", + "id": "cd69c00a", "metadata": {}, "source": [ "## Histogramming" @@ -2571,6 +2647,7 @@ }, { "cell_type": "markdown", + "id": "39982866", "metadata": {}, "source": [ "Counting the number of occurrences of each unique value of variable." @@ -2579,6 +2656,7 @@ { "cell_type": "code", "execution_count": 37, + "id": "62808675", "metadata": {}, "outputs": [ { @@ -2619,6 +2697,7 @@ { "cell_type": "code", "execution_count": 38, + "id": "5b2a42ce", "metadata": {}, "outputs": [ { @@ -2658,6 +2737,7 @@ }, { "cell_type": "markdown", + "id": "2d7e62e4", "metadata": {}, "source": [ "## String Methods" @@ -2665,6 +2745,7 @@ }, { "cell_type": "markdown", + "id": "4e704eca", "metadata": {}, "source": [ "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information." 
@@ -2673,6 +2754,7 @@ { "cell_type": "code", "execution_count": 39, + "id": "c73e70bb", "metadata": {}, "outputs": [ { @@ -2703,6 +2785,7 @@ { "cell_type": "code", "execution_count": 40, + "id": "697c1c94", "metadata": {}, "outputs": [ { @@ -2732,6 +2815,7 @@ }, { "cell_type": "markdown", + "id": "dfc1371e", "metadata": {}, "source": [ "## Concat" @@ -2739,6 +2823,7 @@ }, { "cell_type": "markdown", + "id": "f6fb9b53", "metadata": {}, "source": [ "Concatenating `Series` and `DataFrames` row-wise." @@ -2747,6 +2832,7 @@ { "cell_type": "code", "execution_count": 41, + "id": "60538bbd", "metadata": {}, "outputs": [ { @@ -2778,6 +2864,7 @@ { "cell_type": "code", "execution_count": 42, + "id": "17953847", "metadata": {}, "outputs": [ { @@ -2808,6 +2895,7 @@ }, { "cell_type": "markdown", + "id": "27f0d621", "metadata": {}, "source": [ "## Join" @@ -2815,6 +2903,7 @@ }, { "cell_type": "markdown", + "id": "fd35f1a7", "metadata": {}, "source": [ "Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index." @@ -2823,6 +2912,7 @@ { "cell_type": "code", "execution_count": 43, + "id": "52ada00a", "metadata": {}, "outputs": [ { @@ -2916,6 +3006,7 @@ { "cell_type": "code", "execution_count": 44, + "id": "409fcf92", "metadata": {}, "outputs": [ { @@ -3003,6 +3094,7 @@ }, { "cell_type": "markdown", + "id": "d9dcb86b", "metadata": {}, "source": [ "## Append" @@ -3010,6 +3102,7 @@ }, { "cell_type": "markdown", + "id": "1f896819", "metadata": {}, "source": [ "Appending values from another `Series` or array-like object." @@ -3018,13 +3111,14 @@ { "cell_type": "code", "execution_count": 45, + "id": "9976c1ce", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/core/indexed_frame.py:2271: FutureWarning: append is deprecated and will be removed in a future version. 
Use concat instead.\n", + "/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/core/indexed_frame.py:2329: FutureWarning: append is deprecated and will be removed in a future version. Use concat instead.\n", " warnings.warn(\n" ] }, @@ -3056,6 +3150,7 @@ { "cell_type": "code", "execution_count": 46, + "id": "fe5c54ab", "metadata": {}, "outputs": [ { @@ -3085,6 +3180,7 @@ }, { "cell_type": "markdown", + "id": "9fa10ef3", "metadata": {}, "source": [ "## Grouping" @@ -3092,6 +3188,7 @@ }, { "cell_type": "markdown", + "id": "8a6e41f5", "metadata": {}, "source": [ "Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm." @@ -3100,6 +3197,7 @@ { "cell_type": "code", "execution_count": 47, + "id": "2a8cafa7", "metadata": {}, "outputs": [], "source": [ @@ -3111,6 +3209,7 @@ }, { "cell_type": "markdown", + "id": "0179d60c", "metadata": {}, "source": [ "Grouping and then applying the `sum` function to the grouped data." @@ -3119,6 +3218,7 @@ { "cell_type": "code", "execution_count": 48, + "id": "7c56d186", "metadata": {}, "outputs": [ { @@ -3193,6 +3293,7 @@ { "cell_type": "code", "execution_count": 49, + "id": "f8823b30", "metadata": {}, "outputs": [ { @@ -3266,6 +3367,7 @@ }, { "cell_type": "markdown", + "id": "a84cb883", "metadata": {}, "source": [ "Grouping hierarchically then applying the `sum` function to grouped data." @@ -3274,6 +3376,7 @@ { "cell_type": "code", "execution_count": 50, + "id": "2184e3ad", "metadata": {}, "outputs": [ { @@ -3364,6 +3467,7 @@ { "cell_type": "code", "execution_count": 51, + "id": "4ec311c1", "metadata": {}, "outputs": [ { @@ -3453,6 +3557,7 @@ }, { "cell_type": "markdown", + "id": "dedfeb1b", "metadata": {}, "source": [ "Grouping and applying statistical functions to specific columns, using `agg`." 
@@ -3461,6 +3566,7 @@ { "cell_type": "code", "execution_count": 52, + "id": "2563d8b2", "metadata": {}, "outputs": [ { @@ -3531,6 +3637,7 @@ { "cell_type": "code", "execution_count": 53, + "id": "22c77e75", "metadata": {}, "outputs": [ { @@ -3600,6 +3707,7 @@ }, { "cell_type": "markdown", + "id": "6d074822", "metadata": {}, "source": [ "## Transpose" @@ -3607,6 +3715,7 @@ }, { "cell_type": "markdown", + "id": "16c0f0a8", "metadata": {}, "source": [ "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF." @@ -3615,6 +3724,7 @@ { "cell_type": "code", "execution_count": 54, + "id": "e265861e", "metadata": {}, "outputs": [ { @@ -3682,6 +3792,7 @@ { "cell_type": "code", "execution_count": 55, + "id": "1fe9b972", "metadata": {}, "outputs": [ { @@ -3744,14 +3855,16 @@ }, { "cell_type": "markdown", + "id": "9ce02827", "metadata": {}, "source": [ "Time Series\n", - "------------\n" + "------------" ] }, { "cell_type": "markdown", + "id": "fec907ff", "metadata": {}, "source": [ "`DataFrames` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps." @@ -3760,6 +3873,7 @@ { "cell_type": "code", "execution_count": 56, + "id": "7a425d3f", "metadata": {}, "outputs": [ { @@ -3839,6 +3953,7 @@ { "cell_type": "code", "execution_count": 57, + "id": "87f0e56e", "metadata": {}, "outputs": [ { @@ -3911,6 +4026,7 @@ }, { "cell_type": "markdown", + "id": "0d0e541c", "metadata": {}, "source": [ "Categoricals\n", @@ -3919,6 +4035,7 @@ }, { "cell_type": "markdown", + "id": "a36f9543", "metadata": {}, "source": [ "`DataFrames` support categorical columns." 
@@ -3927,6 +4044,7 @@ { "cell_type": "code", "execution_count": 58, + "id": "05bd8be8", "metadata": {}, "outputs": [ { @@ -4013,6 +4131,7 @@ { "cell_type": "code", "execution_count": 59, + "id": "676b4963", "metadata": {}, "outputs": [ { @@ -4097,6 +4216,7 @@ }, { "cell_type": "markdown", + "id": "e24f2e7b", "metadata": {}, "source": [ "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF." @@ -4105,6 +4225,7 @@ { "cell_type": "code", "execution_count": 60, + "id": "06310c36", "metadata": {}, "outputs": [ { @@ -4124,6 +4245,7 @@ }, { "cell_type": "markdown", + "id": "4eb6f858", "metadata": {}, "source": [ "Accessing the underlying code values of each categorical observation." @@ -4132,6 +4254,7 @@ { "cell_type": "code", "execution_count": 61, + "id": "0f6db260", "metadata": {}, "outputs": [ { @@ -4158,6 +4281,7 @@ { "cell_type": "code", "execution_count": 62, + "id": "b87c4375", "metadata": {}, "outputs": [ { @@ -4183,6 +4307,7 @@ }, { "cell_type": "markdown", + "id": "3f816916", "metadata": {}, "source": [ "Converting Data Representation\n", @@ -4191,6 +4316,7 @@ }, { "cell_type": "markdown", + "id": "64a17f6d", "metadata": {}, "source": [ "## Pandas" @@ -4198,6 +4324,7 @@ }, { "cell_type": "markdown", + "id": "3acdcacc", "metadata": {}, "source": [ "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`." @@ -4206,6 +4333,7 @@ { "cell_type": "code", "execution_count": 63, + "id": "d1fed919", "metadata": {}, "outputs": [ { @@ -4302,6 +4430,7 @@ { "cell_type": "code", "execution_count": 64, + "id": "567c7363", "metadata": {}, "outputs": [ { @@ -4397,6 +4526,7 @@ }, { "cell_type": "markdown", + "id": "c2121453", "metadata": {}, "source": [ "## Numpy" @@ -4404,6 +4534,7 @@ }, { "cell_type": "markdown", + "id": "a9faa2c5", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`." 
@@ -4412,6 +4543,7 @@ { "cell_type": "code", "execution_count": 65, + "id": "5490d226", "metadata": {}, "outputs": [ { @@ -4451,6 +4583,7 @@ { "cell_type": "code", "execution_count": 66, + "id": "b77ac8ae", "metadata": {}, "outputs": [ { @@ -4489,6 +4622,7 @@ }, { "cell_type": "markdown", + "id": "1d24d30f", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`." @@ -4497,6 +4631,7 @@ { "cell_type": "code", "execution_count": 67, + "id": "f71a0ba3", "metadata": {}, "outputs": [ { @@ -4518,6 +4653,7 @@ { "cell_type": "code", "execution_count": 68, + "id": "a45a74b5", "metadata": {}, "outputs": [ { @@ -4538,6 +4674,7 @@ }, { "cell_type": "markdown", + "id": "0d78a4d2", "metadata": {}, "source": [ "## Arrow" @@ -4545,6 +4682,7 @@ }, { "cell_type": "markdown", + "id": "7e35b829", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`." @@ -4553,6 +4691,7 @@ { "cell_type": "code", "execution_count": 69, + "id": "bb9e9a2a", "metadata": {}, "outputs": [ { @@ -4584,6 +4723,7 @@ { "cell_type": "code", "execution_count": 70, + "id": "4d020de7", "metadata": {}, "outputs": [ { @@ -4614,14 +4754,16 @@ }, { "cell_type": "markdown", + "id": "ace7b4f9", "metadata": {}, "source": [ "Getting Data In/Out\n", - "------------------------\n" + "------------------------" ] }, { "cell_type": "markdown", + "id": "161abb12", "metadata": {}, "source": [ "## CSV" @@ -4629,6 +4771,7 @@ }, { "cell_type": "markdown", + "id": "7e5dc381", "metadata": {}, "source": [ "Writing to a CSV file." @@ -4637,6 +4780,7 @@ { "cell_type": "code", "execution_count": 71, + "id": "3a59715f", "metadata": {}, "outputs": [], "source": [ @@ -4649,6 +4793,7 @@ { "cell_type": "code", "execution_count": 72, + "id": "4ebe98ed", "metadata": {}, "outputs": [], "source": [ @@ -4657,6 +4802,7 @@ }, { "cell_type": "markdown", + "id": "0479fc4f", "metadata": {}, "source": [ "Reading from a csv file." 
@@ -4665,6 +4811,7 @@ { "cell_type": "code", "execution_count": 73, + "id": "1a70e831", "metadata": {}, "outputs": [ { @@ -4897,6 +5044,7 @@ { "cell_type": "code", "execution_count": 74, + "id": "4c3d9ca3", "metadata": {}, "outputs": [ { @@ -5128,6 +5276,7 @@ }, { "cell_type": "markdown", + "id": "3d739c6e", "metadata": {}, "source": [ "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard." @@ -5136,6 +5285,7 @@ { "cell_type": "code", "execution_count": 75, + "id": "cb7187d2", "metadata": {}, "outputs": [ { @@ -5547,6 +5697,7 @@ }, { "cell_type": "markdown", + "id": "c0939a1e", "metadata": {}, "source": [ "## Parquet" @@ -5554,6 +5705,7 @@ }, { "cell_type": "markdown", + "id": "14e6a634", "metadata": {}, "source": [ "Writing to parquet files, using the CPU via PyArrow." @@ -5562,6 +5714,7 @@ { "cell_type": "code", "execution_count": 76, + "id": "1812346f", "metadata": {}, "outputs": [], "source": [ @@ -5570,6 +5723,7 @@ }, { "cell_type": "markdown", + "id": "093cd0fe", "metadata": {}, "source": [ "Reading parquet files with a GPU-accelerated parquet reader." @@ -5578,6 +5732,7 @@ { "cell_type": "code", "execution_count": 77, + "id": "2354b20b", "metadata": {}, "outputs": [ { @@ -5809,6 +5964,7 @@ }, { "cell_type": "markdown", + "id": "132c3ff2", "metadata": {}, "source": [ "Writing to parquet files from a `dask_cudf.DataFrame` using PyArrow under the hood." @@ -5817,6 +5973,7 @@ { "cell_type": "code", "execution_count": 78, + "id": "c5d7686c", "metadata": {}, "outputs": [ { @@ -5836,6 +5993,7 @@ }, { "cell_type": "markdown", + "id": "0d73d1dd", "metadata": {}, "source": [ "## ORC" @@ -5843,6 +6001,7 @@ }, { "cell_type": "markdown", + "id": "61b5f466", "metadata": {}, "source": [ "Reading ORC files." 
@@ -5851,6 +6010,33 @@ { "cell_type": "code", "execution_count": 79, + "id": "93364ff3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/tests/data/orc/TestOrcFile.test1.orc'" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os\n", + "from pathlib import Path\n", + "current_dir = os.path.dirname(os.path.realpath(\"__file__\"))\n", + "cudf_root = Path(current_dir).parents[3]\n", + "file_path = os.path.join(cudf_root, \"python\", \"cudf\", \"cudf\", \"tests\", \"data\", \"orc\", \"TestOrcFile.test1.orc\")\n", + "file_path" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "2b6785c7", "metadata": {}, "outputs": [ { @@ -5941,18 +6127,19 @@ "1 [{'key': 'chani', 'value': {'int1': 5, 'string... " ] }, - "execution_count": 79, + "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df2 = cudf.read_orc('/rapids/cudf/python/cudf/cudf/tests/data/orc/TestOrcFile.test1.orc')\n", + "df2 = cudf.read_orc(file_path)\n", "df2" ] }, { "cell_type": "markdown", + "id": "238ce6a4", "metadata": {}, "source": [ "Dask Performance Tips\n", @@ -5967,6 +6154,7 @@ }, { "cell_type": "markdown", + "id": "3de9aeca", "metadata": {}, "source": [ "First, we set up a GPU cluster. With our `client` set up, Dask-cuDF computation will be distributed across the GPUs in the cluster." 
@@ -5974,15 +6162,16 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": 81, + "id": "e4852d48", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "2022-03-29 12:21:32,328 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", - "2022-03-29 12:21:32,394 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" + "2022-04-21 13:26:06,860 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-04-21 13:26:06,904 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" ] }, { @@ -5992,7 +6181,7 @@ "
\n", "
\n", "

Client

\n", - "

Client-4be800f5-af7c-11ec-8df8-c8d9d2247354

\n", + "

Client-20d00fd5-c198-11ec-906c-c8d9d2247354

\n", " \n", "\n", " \n", @@ -6021,7 +6210,7 @@ " \n", "
\n", "

LocalCUDACluster

\n", - "

137d0882

\n", + "

47648c26

\n", "
\n", " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", "
\n", @@ -6058,11 +6247,11 @@ "
\n", "
\n", "

Scheduler

\n", - "

Scheduler-08f95e9e-2c10-4d66-a103-955ab4218e91

\n", + "

Scheduler-f28bff16-cb70-452c-b8af-b9299a8d7b20

\n", " \n", " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", "
\n", - " Comm: tcp://127.0.0.1:35157\n", + " Comm: tcp://127.0.0.1:33995\n", " \n", " Workers: 2\n", @@ -6104,7 +6293,7 @@ " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", @@ -6158,7 +6347,7 @@ "
\n", - " Comm: tcp://127.0.0.1:41411\n", + " Comm: tcp://127.0.0.1:40479\n", " \n", " Total threads: 1\n", @@ -6112,7 +6301,7 @@ "
\n", - " Dashboard: http://127.0.0.1:40997/status\n", + " Dashboard: http://127.0.0.1:38985/status\n", " \n", " Memory: 22.89 GiB\n", @@ -6120,13 +6309,13 @@ "
\n", - " Nanny: tcp://127.0.0.1:42959\n", + " Nanny: tcp://127.0.0.1:33447\n", "
\n", - " Local directory: /home/ashwin/workspace/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-ruvvgno2\n", + " Local directory: /home/ashwin/workspace/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-be7zg92w\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", @@ -6216,10 +6405,10 @@ "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 80, + "execution_count": 81, "metadata": {}, "output_type": "execute_result" } @@ -6237,6 +6426,7 @@ }, { "cell_type": "markdown", + "id": "181e4d10", "metadata": {}, "source": [ "### Persisting Data\n", @@ -6245,7 +6435,8 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": 82, + "id": "d47a1142", "metadata": {}, "outputs": [ { @@ -6321,7 +6512,7 @@ "" ] }, - "execution_count": 81, + "execution_count": 82, "metadata": {}, "output_type": "execute_result" } @@ -6337,36 +6528,37 @@ }, { "cell_type": "code", - "execution_count": 82, + "execution_count": 83, + "id": "c3cb612a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Tue Mar 29 12:21:33 2022 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. 
|\n", - "|===============================+======================+======================|\n", - "| 0 Quadro GV100 Off | 00000000:15:00.0 Off | Off |\n", - "| 36% 49C P2 50W / 250W | 1113MiB / 32508MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 Quadro GV100 Off | 00000000:2D:00.0 Off | Off |\n", - "| 40% 54C P2 50W / 250W | 306MiB / 32498MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "+-----------------------------------------------------------------------------+\n" + "Thu Apr 21 13:26:07 2022 \r\n", + "+-----------------------------------------------------------------------------+\r\n", + "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\r\n", + "|-------------------------------+----------------------+----------------------+\r\n", + "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n", + "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n", + "| | | MIG M. 
|\r\n", + "|===============================+======================+======================|\r\n", + "| 0 Quadro GV100 Off | 00000000:15:00.0 Off | Off |\r\n", + "| 39% 52C P2 51W / 250W | 1115MiB / 32508MiB | 0% Default |\r\n", + "| | | N/A |\r\n", + "+-------------------------------+----------------------+----------------------+\r\n", + "| 1 Quadro GV100 Off | 00000000:2D:00.0 Off | Off |\r\n", + "| 43% 57C P2 52W / 250W | 306MiB / 32498MiB | 0% Default |\r\n", + "| | | N/A |\r\n", + "+-------------------------------+----------------------+----------------------+\r\n", + " \r\n", + "+-----------------------------------------------------------------------------+\r\n", + "| Processes: |\r\n", + "| GPU GI CI PID Type Process name GPU Memory |\r\n", + "| ID ID Usage |\r\n", + "|=============================================================================|\r\n", + "+-----------------------------------------------------------------------------+\r\n" ] } ], @@ -6376,6 +6568,7 @@ }, { "cell_type": "markdown", + "id": "b98810c4", "metadata": {}, "source": [ "Because Dask is lazy, the computation has not yet occurred. We can see that there are twenty tasks in the task graph and we've used about 800 MB of memory. We can force computation by using `persist`. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline)." 
@@ -6383,7 +6576,8 @@ }, { "cell_type": "code", - "execution_count": 83, + "execution_count": 84, + "id": "a929577c", "metadata": {}, "outputs": [ { @@ -6459,7 +6653,7 @@ "" ] }, - "execution_count": 83, + "execution_count": 84, "metadata": {}, "output_type": "execute_result" } @@ -6471,36 +6665,37 @@ }, { "cell_type": "code", - "execution_count": 84, + "execution_count": 85, + "id": "8aa7c079", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Tue Mar 29 12:21:34 2022 \n", - "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\n", - "|-------------------------------+----------------------+----------------------+\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", - "| | | MIG M. |\n", - "|===============================+======================+======================|\n", - "| 0 Quadro GV100 Off | 00000000:15:00.0 Off | Off |\n", - "| 36% 49C P2 50W / 250W | 1113MiB / 32508MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - "| 1 Quadro GV100 Off | 00000000:2D:00.0 Off | Off |\n", - "| 40% 54C P2 50W / 250W | 306MiB / 32498MiB | 0% Default |\n", - "| | | N/A |\n", - "+-------------------------------+----------------------+----------------------+\n", - " \n", - "+-----------------------------------------------------------------------------+\n", - "| Processes: |\n", - "| GPU GI CI PID Type Process name GPU Memory |\n", - "| ID ID Usage |\n", - "|=============================================================================|\n", - "+-----------------------------------------------------------------------------+\n" + "Thu Apr 21 13:26:08 2022 \r\n", + "+-----------------------------------------------------------------------------+\r\n", + "| NVIDIA-SMI 
495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\r\n", + "|-------------------------------+----------------------+----------------------+\r\n", + "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n", + "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n", + "| | | MIG M. |\r\n", + "|===============================+======================+======================|\r\n", + "| 0 Quadro GV100 Off | 00000000:15:00.0 Off | Off |\r\n", + "| 39% 52C P2 52W / 250W | 1115MiB / 32508MiB | 3% Default |\r\n", + "| | | N/A |\r\n", + "+-------------------------------+----------------------+----------------------+\r\n", + "| 1 Quadro GV100 Off | 00000000:2D:00.0 Off | Off |\r\n", + "| 43% 57C P2 51W / 250W | 306MiB / 32498MiB | 0% Default |\r\n", + "| | | N/A |\r\n", + "+-------------------------------+----------------------+----------------------+\r\n", + " \r\n", + "+-----------------------------------------------------------------------------+\r\n", + "| Processes: |\r\n", + "| GPU GI CI PID Type Process name GPU Memory |\r\n", + "| ID ID Usage |\r\n", + "|=============================================================================|\r\n", + "+-----------------------------------------------------------------------------+\r\n" ] } ], @@ -6510,6 +6705,7 @@ }, { "cell_type": "markdown", + "id": "ff9e14b6", "metadata": {}, "source": [ "Because we forced computation, we now have a larger object in distributed GPU memory." 
@@ -6517,6 +6713,7 @@ }, { "cell_type": "markdown", + "id": "bb3b3dee", "metadata": {}, "source": [ "### Wait\n", @@ -6527,7 +6724,8 @@ }, { "cell_type": "code", - "execution_count": 85, + "execution_count": 86, + "id": "ef71bf00", "metadata": {}, "outputs": [], "source": [ @@ -6545,6 +6743,7 @@ }, { "cell_type": "markdown", + "id": "e1099ec0", "metadata": {}, "source": [ "This function will do a basic transformation of every column in the dataframe, but the time spent in the function will vary due to the `time.sleep` statement randomly adding 1-60 seconds of time. We'll run this on every partition of our dataframe using `map_partitions`, which adds the task to our task-graph, and store the result. We can then call `persist` to force execution." @@ -6552,7 +6751,8 @@ }, { "cell_type": "code", - "execution_count": 86, + "execution_count": 87, + "id": "700dd799", "metadata": {}, "outputs": [], "source": [ @@ -6562,6 +6762,7 @@ }, { "cell_type": "markdown", + "id": "5eb83a7e", "metadata": {}, "source": [ "However, some partitions will be done **much** sooner than others. If we had downstream processes that should wait for all partitions to be completed, we can enforce that behavior using `wait`." @@ -6569,16 +6770,17 @@ }, { "cell_type": "code", - "execution_count": 87, + "execution_count": 88, + "id": "73bccf94", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" + "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" ] }, - "execution_count": 87, + "execution_count": 88, "metadata": {}, "output_type": "execute_result" } @@ -6589,21 +6791,22 @@ }, { "cell_type": "markdown", + "id": "447301f5", "metadata": {}, "source": [ - "## With `wait`, we can safely proceed on in our workflow." + "With `wait`, we can safely proceed on in our workflow." 
] }, { "cell_type": "code", "execution_count": null, + "id": "7e06fcf4", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { - "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", @@ -6620,21 +6823,8 @@ "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" - }, - "toc": { - "base_numbering": 1, - "nav_menu": {}, - "number_sections": true, - "sideBar": true, - "skip_h1_title": false, - "title_cell": "Table of Contents", - "title_sidebar": "Contents", - "toc_cell": false, - "toc_position": {}, - "toc_section_display": true, - "toc_window_display": false } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/docs/cudf/source/basics/PandasCompat.rst b/docs/cudf/source/user_guide/PandasCompat.rst similarity index 100% rename from docs/cudf/source/basics/PandasCompat.rst rename to docs/cudf/source/user_guide/PandasCompat.rst diff --git a/docs/cudf/source/user_guide/Working-with-missing-data.ipynb b/docs/cudf/source/user_guide/Working-with-missing-data.ipynb index 54fe774060e..e57aec25fed 100644 --- a/docs/cudf/source/user_guide/Working-with-missing-data.ipynb +++ b/docs/cudf/source/user_guide/Working-with-missing-data.ipynb @@ -2,6 +2,7 @@ "cells": [ { "cell_type": "markdown", + "id": "f8ffbea7", "metadata": {}, "source": [ "# Working with missing data" @@ -9,6 +10,7 @@ }, { "cell_type": "markdown", + "id": "7e3ab093", "metadata": {}, "source": [ "In this section, we will discuss missing (also referred to as `NA`) values in cudf. cudf supports having missing values in all dtypes. These missing values are represented by ``. These values are also referenced as \"null values\"." @@ -16,6 +18,7 @@ }, { "cell_type": "markdown", + "id": "d970a34a", "metadata": {}, "source": [ "1. 
[How to Detect missing values](#How-to-Detect-missing-values)\n", @@ -35,6 +38,7 @@ }, { "cell_type": "markdown", + "id": "8d657a82", "metadata": {}, "source": [ "## How to Detect missing values" @@ -42,6 +46,7 @@ }, { "cell_type": "markdown", + "id": "9ea9f672", "metadata": {}, "source": [ "To detect missing values, you can use `isna()` and `notna()` functions." @@ -50,6 +55,7 @@ { "cell_type": "code", "execution_count": 1, + "id": "58050adb", "metadata": {}, "outputs": [], "source": [ @@ -60,6 +66,7 @@ { "cell_type": "code", "execution_count": 2, + "id": "416d73da", "metadata": {}, "outputs": [], "source": [ @@ -69,6 +76,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "5dfc6bc3", "metadata": {}, "outputs": [ { @@ -141,6 +149,7 @@ { "cell_type": "code", "execution_count": 4, + "id": "4d7f7a6d", "metadata": {}, "outputs": [ { @@ -213,6 +222,7 @@ { "cell_type": "code", "execution_count": 5, + "id": "40edca67", "metadata": {}, "outputs": [ { @@ -236,6 +246,7 @@ }, { "cell_type": "markdown", + "id": "acdf29d7", "metadata": {}, "source": [ "One has to be mindful that in Python (and NumPy), the nan's don’t compare equal, but None's do. Note that cudf/NumPy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`." @@ -244,6 +255,7 @@ { "cell_type": "code", "execution_count": 6, + "id": "c269c1f5", "metadata": {}, "outputs": [ { @@ -264,6 +276,7 @@ { "cell_type": "code", "execution_count": 7, + "id": "99fb083a", "metadata": {}, "outputs": [ { @@ -283,22 +296,23 @@ }, { "cell_type": "markdown", + "id": "4fdb8bc7", "metadata": {}, "source": [ - "So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information.\n", - "\n" + "So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information." 
] }, { "cell_type": "code", "execution_count": 8, + "id": "630ef6bb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", - "1 False\n", + "1 \n", "2 False\n", "3 False\n", "Name: b, dtype: bool" @@ -316,6 +330,7 @@ { "cell_type": "code", "execution_count": 9, + "id": "8162e383", "metadata": {}, "outputs": [], "source": [ @@ -325,6 +340,7 @@ { "cell_type": "code", "execution_count": 10, + "id": "199775b3", "metadata": {}, "outputs": [ { @@ -348,14 +364,15 @@ { "cell_type": "code", "execution_count": 11, + "id": "cd09d80c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0 False\n", - "1 False\n", - "2 False\n", + "0 \n", + "1 \n", + "2 \n", "dtype: bool" ] }, @@ -371,6 +388,7 @@ { "cell_type": "code", "execution_count": 12, + "id": "6b23bb0c", "metadata": {}, "outputs": [], "source": [ @@ -380,6 +398,7 @@ { "cell_type": "code", "execution_count": 13, + "id": "cafb79ee", "metadata": {}, "outputs": [ { @@ -403,6 +422,7 @@ { "cell_type": "code", "execution_count": 14, + "id": "13363897", "metadata": {}, "outputs": [ { @@ -425,6 +445,7 @@ }, { "cell_type": "markdown", + "id": "208a3776", "metadata": {}, "source": [ "## Float dtypes and missing data" @@ -432,16 +453,18 @@ }, { "cell_type": "markdown", + "id": "2c174b88", "metadata": {}, "source": [ "Because ``NaN`` is a float, a column of integers with even one missing values is cast to floating-point dtype. However this doesn't happen by default.\n", "\n", - "By default if a ``NaN`` value is passed to `Series` constructor, it is treated as `` value. " + "By default if a ``NaN`` value is passed to `Series` constructor, it is treated as `` value." ] }, { "cell_type": "code", "execution_count": 15, + "id": "c59c3c54", "metadata": {}, "outputs": [ { @@ -464,6 +487,7 @@ }, { "cell_type": "markdown", + "id": "a9eb2d9c", "metadata": {}, "source": [ "Hence to consider a ``NaN`` as ``NaN`` you will have to pass `nan_as_null=False` parameter into `Series` constructor." 
@@ -472,6 +496,7 @@ { "cell_type": "code", "execution_count": 16, + "id": "ecc5ae92", "metadata": {}, "outputs": [ { @@ -494,6 +519,7 @@ }, { "cell_type": "markdown", + "id": "d1db7b08", "metadata": {}, "source": [ "## Datetimes" @@ -501,15 +527,16 @@ }, { "cell_type": "markdown", + "id": "548d3734", "metadata": {}, "source": [ - "For `datetime64` types, cudf doesn't support having `NaT` values. Instead these values which are specific to numpy and pandas are considered as null values(``) in cudf. The actual underlying value of `NaT` is `min(int64)` and cudf retains the underlying value when converting a cudf object to pandas object.\n", - "\n" + "For `datetime64` types, cudf doesn't support having `NaT` values. Instead these values which are specific to numpy and pandas are considered as null values(``) in cudf. The actual underlying value of `NaT` is `min(int64)` and cudf retains the underlying value when converting a cudf object to pandas object." ] }, { "cell_type": "code", "execution_count": 17, + "id": "de70f244", "metadata": {}, "outputs": [ { @@ -535,6 +562,7 @@ { "cell_type": "code", "execution_count": 18, + "id": "8411a914", "metadata": {}, "outputs": [ { @@ -557,6 +585,7 @@ }, { "cell_type": "markdown", + "id": "df664145", "metadata": {}, "source": [ "any operations on rows having `` values in `datetime` column will result in `` value at the same location in resulting column:" @@ -565,6 +594,7 @@ { "cell_type": "code", "execution_count": 19, + "id": "829c32d0", "metadata": {}, "outputs": [ { @@ -587,6 +617,7 @@ }, { "cell_type": "markdown", + "id": "aa8031ef", "metadata": {}, "source": [ "## Calculations with missing data" @@ -594,6 +625,7 @@ }, { "cell_type": "markdown", + "id": "c587fae2", "metadata": {}, "source": [ "Null values propagate naturally through arithmetic operations between pandas objects." 
@@ -602,6 +634,7 @@ { "cell_type": "code", "execution_count": 20, + "id": "f8f2aec7", "metadata": {}, "outputs": [], "source": [ @@ -611,6 +644,7 @@ { "cell_type": "code", "execution_count": 21, + "id": "0c8a3011", "metadata": {}, "outputs": [], "source": [ @@ -620,6 +654,7 @@ { "cell_type": "code", "execution_count": 22, + "id": "052f6c2b", "metadata": {}, "outputs": [ { @@ -698,6 +733,7 @@ { "cell_type": "code", "execution_count": 23, + "id": "0fb0a083", "metadata": {}, "outputs": [ { @@ -776,6 +812,7 @@ { "cell_type": "code", "execution_count": 24, + "id": "6f8152c0", "metadata": {}, "outputs": [ { @@ -853,6 +890,7 @@ }, { "cell_type": "markdown", + "id": "11170d49", "metadata": {}, "source": [ "While summing the data along a series, `NA` values will be treated as `0`." @@ -861,6 +899,7 @@ { "cell_type": "code", "execution_count": 25, + "id": "45081790", "metadata": {}, "outputs": [ { @@ -886,6 +925,7 @@ { "cell_type": "code", "execution_count": 26, + "id": "39922658", "metadata": {}, "outputs": [ { @@ -905,6 +945,7 @@ }, { "cell_type": "markdown", + "id": "6e99afe0", "metadata": {}, "source": [ "Since `NA` values are treated as `0`, the mean would result to 2 in this case `(1 + 0 + 2 + 3 + 0)/5 = 2`" @@ -913,6 +954,7 @@ { "cell_type": "code", "execution_count": 27, + "id": "b2f16ddb", "metadata": {}, "outputs": [ { @@ -932,6 +974,7 @@ }, { "cell_type": "markdown", + "id": "07f2ec5a", "metadata": {}, "source": [ "To preserve `NA` values in the above calculations, `sum` & `mean` support `skipna` parameter.\n", @@ -942,6 +985,7 @@ { "cell_type": "code", "execution_count": 28, + "id": "d4a463a0", "metadata": {}, "outputs": [ { @@ -962,6 +1006,7 @@ { "cell_type": "code", "execution_count": 29, + "id": "a944c42e", "metadata": {}, "outputs": [ { @@ -981,6 +1026,7 @@ }, { "cell_type": "markdown", + "id": "fb8c8f18", "metadata": {}, "source": [ "Cumulative methods like `cumsum` and `cumprod` ignore `NA` values by default." 
@@ -989,6 +1035,7 @@ { "cell_type": "code", "execution_count": 30, + "id": "4f2a7306", "metadata": {}, "outputs": [ { @@ -1013,6 +1060,7 @@ }, { "cell_type": "markdown", + "id": "c8f6054b", "metadata": {}, "source": [ "To preserve `NA` values in cumulative methods, provide `skipna=False`." @@ -1021,6 +1069,7 @@ { "cell_type": "code", "execution_count": 31, + "id": "d4c46776", "metadata": {}, "outputs": [ { @@ -1045,6 +1094,7 @@ }, { "cell_type": "markdown", + "id": "67077d65", "metadata": {}, "source": [ "## Sum/product of Null/nans" @@ -1052,6 +1102,7 @@ }, { "cell_type": "markdown", + "id": "ffbb9ca1", "metadata": {}, "source": [ "The sum of an empty or all-NA Series of a DataFrame is 0." @@ -1060,6 +1111,7 @@ { "cell_type": "code", "execution_count": 32, + "id": "f430c9ce", "metadata": {}, "outputs": [ { @@ -1080,6 +1132,7 @@ { "cell_type": "code", "execution_count": 33, + "id": "7fde514b", "metadata": {}, "outputs": [ { @@ -1100,6 +1153,7 @@ { "cell_type": "code", "execution_count": 34, + "id": "56cedd17", "metadata": {}, "outputs": [ { @@ -1119,6 +1173,7 @@ }, { "cell_type": "markdown", + "id": "cb188adb", "metadata": {}, "source": [ "The product of an empty or all-NA Series of a DataFrame is 1." @@ -1127,6 +1182,7 @@ { "cell_type": "code", "execution_count": 35, + "id": "d20bbbef", "metadata": {}, "outputs": [ { @@ -1147,6 +1203,7 @@ { "cell_type": "code", "execution_count": 36, + "id": "75abbcfa", "metadata": {}, "outputs": [ { @@ -1167,6 +1224,7 @@ { "cell_type": "code", "execution_count": 37, + "id": "becce0cc", "metadata": {}, "outputs": [ { @@ -1186,6 +1244,7 @@ }, { "cell_type": "markdown", + "id": "0e899e03", "metadata": {}, "source": [ "## NA values in GroupBy" @@ -1193,6 +1252,7 @@ }, { "cell_type": "markdown", + "id": "7fb20874", "metadata": {}, "source": [ "`NA` groups in GroupBy are automatically excluded. 
For example:" @@ -1201,6 +1261,7 @@ { "cell_type": "code", "execution_count": 38, + "id": "1379037c", "metadata": {}, "outputs": [ { @@ -1279,6 +1340,7 @@ { "cell_type": "code", "execution_count": 39, + "id": "d6b91e6f", "metadata": {}, "outputs": [ { @@ -1345,6 +1407,7 @@ }, { "cell_type": "markdown", + "id": "cb83fb11", "metadata": {}, "source": [ "It is also possible to include `NA` in groups by passing `dropna=False`" @@ -1353,9 +1416,8 @@ { "cell_type": "code", "execution_count": 40, - "metadata": { - "scrolled": true - }, + "id": "768c3e50", + "metadata": {}, "outputs": [ { "data": { @@ -1426,6 +1488,7 @@ }, { "cell_type": "markdown", + "id": "133816b4", "metadata": {}, "source": [ "## Inserting missing data" @@ -1433,6 +1496,7 @@ }, { "cell_type": "markdown", + "id": "306082ad", "metadata": {}, "source": [ "All dtypes support insertion of missing value by assignment. Any specific location in series can made null by assigning it to `None`." @@ -1441,6 +1505,7 @@ { "cell_type": "code", "execution_count": 41, + "id": "7ddde1fe", "metadata": {}, "outputs": [], "source": [ @@ -1450,6 +1515,7 @@ { "cell_type": "code", "execution_count": 42, + "id": "16e54597", "metadata": {}, "outputs": [ { @@ -1474,6 +1540,7 @@ { "cell_type": "code", "execution_count": 43, + "id": "f628f94d", "metadata": {}, "outputs": [], "source": [ @@ -1483,9 +1550,8 @@ { "cell_type": "code", "execution_count": 44, - "metadata": { - "scrolled": true - }, + "id": "b30590b7", + "metadata": {}, "outputs": [ { "data": { @@ -1508,6 +1574,7 @@ }, { "cell_type": "markdown", + "id": "a1b123d0", "metadata": {}, "source": [ "## Filling missing values: fillna" @@ -1515,6 +1582,7 @@ }, { "cell_type": "markdown", + "id": "114aa23a", "metadata": {}, "source": [ "`fillna()` can fill in `NA` & `NaN` values with non-NA data." 
@@ -1523,6 +1591,7 @@ { "cell_type": "code", "execution_count": 45, + "id": "59e22668", "metadata": {}, "outputs": [ { @@ -1601,6 +1670,7 @@ { "cell_type": "code", "execution_count": 46, + "id": "05c221ee", "metadata": {}, "outputs": [ { @@ -1625,6 +1695,7 @@ }, { "cell_type": "markdown", + "id": "401f91b2", "metadata": {}, "source": [ "## Filling with cudf Object" @@ -1632,6 +1703,7 @@ }, { "cell_type": "markdown", + "id": "e79346d6", "metadata": {}, "source": [ "You can also fillna using a dict or Series that is alignable. The labels of the dict or index of the Series must match the columns of the frame you wish to fill. The use case of this is to fill a DataFrame with the mean of that column." @@ -1640,6 +1712,7 @@ { "cell_type": "code", "execution_count": 47, + "id": "f52c5d8f", "metadata": {}, "outputs": [], "source": [ @@ -1650,6 +1723,7 @@ { "cell_type": "code", "execution_count": 48, + "id": "6affebe9", "metadata": {}, "outputs": [], "source": [ @@ -1659,6 +1733,7 @@ { "cell_type": "code", "execution_count": 49, + "id": "1ce1b96f", "metadata": {}, "outputs": [], "source": [ @@ -1668,6 +1743,7 @@ { "cell_type": "code", "execution_count": 50, + "id": "90829195", "metadata": {}, "outputs": [], "source": [ @@ -1677,6 +1753,7 @@ { "cell_type": "code", "execution_count": 51, + "id": "c0feac14", "metadata": {}, "outputs": [ { @@ -1708,63 +1785,63 @@ " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", + " \n", + " \n", " \n", " \n", " \n", " \n", - " \n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", " \n", - " 
\n", - " \n", - " \n", + " \n", + " \n", + " \n", " \n", " \n", "
\n", - " Comm: tcp://127.0.0.1:41341\n", + " Comm: tcp://127.0.0.1:40519\n", " \n", " Total threads: 1\n", @@ -6166,7 +6355,7 @@ "
\n", - " Dashboard: http://127.0.0.1:39963/status\n", + " Dashboard: http://127.0.0.1:40951/status\n", " \n", " Memory: 22.89 GiB\n", @@ -6174,13 +6363,13 @@ "
\n", - " Nanny: tcp://127.0.0.1:33675\n", + " Nanny: tcp://127.0.0.1:39133\n", "
\n", - " Local directory: /home/ashwin/workspace/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-phx0wjv_\n", + " Local directory: /home/ashwin/workspace/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-3v0c20ux\n", "
00.7712450.0510241.199239-0.408268-0.676643-1.274743
1-1.1680410.702664-0.270806-0.029322-0.873593-1.214105
2-1.467009-0.143080-0.806151-0.8663711.081735-0.226840
3NaN-0.610798-0.2728950.8122781.074973
4NaNNaN1.396784-0.366725
5-0.439343-1.016239NaNNaN
61.093102-0.7647580.6751231.067536NaN
70.003098-0.7226480.2215682.025961NaN
8-0.095899-1.285156-0.300566-0.3172411.0112750.674891
90.1094652.497843-1.199856-0.877041-1.919394-1.029201
\n", @@ -1772,16 +1849,16 @@ ], "text/plain": [ " A B C\n", - "0 0.771245 0.051024 1.199239\n", - "1 -1.168041 0.702664 -0.270806\n", - "2 -1.467009 -0.143080 -0.806151\n", - "3 NaN -0.610798 -0.272895\n", - "4 NaN NaN 1.396784\n", - "5 -0.439343 NaN NaN\n", - "6 1.093102 -0.764758 NaN\n", - "7 0.003098 -0.722648 NaN\n", - "8 -0.095899 -1.285156 -0.300566\n", - "9 0.109465 2.497843 -1.199856" + "0 -0.408268 -0.676643 -1.274743\n", + "1 -0.029322 -0.873593 -1.214105\n", + "2 -0.866371 1.081735 -0.226840\n", + "3 NaN 0.812278 1.074973\n", + "4 NaN NaN -0.366725\n", + "5 -1.016239 NaN NaN\n", + "6 0.675123 1.067536 NaN\n", + "7 0.221568 2.025961 NaN\n", + "8 -0.317241 1.011275 0.674891\n", + "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 51, @@ -1796,6 +1873,7 @@ { "cell_type": "code", "execution_count": 52, + "id": "a07c1260", "metadata": {}, "outputs": [ { @@ -1827,63 +1905,63 @@ "
00.7712450.0510241.199239-0.408268-0.676643-1.274743
1-1.1680410.702664-0.270806-0.029322-0.873593-1.214105
2-1.467009-0.143080-0.806151-0.8663711.081735-0.226840
3-0.149173-0.610798-0.272895-0.3272240.8122781.074973
4-0.149173-0.0343641.396784-0.3272240.316145-0.366725
5-0.439343-0.034364-0.036322-1.0162390.316145-0.337393
61.093102-0.764758-0.0363220.6751231.067536-0.337393
70.003098-0.722648-0.0363220.2215682.025961-0.337393
8-0.095899-1.285156-0.300566-0.3172411.0112750.674891
90.1094652.497843-1.199856-0.877041-1.919394-1.029201
\n", @@ -1891,16 +1969,16 @@ ], "text/plain": [ " A B C\n", - "0 0.771245 0.051024 1.199239\n", - "1 -1.168041 0.702664 -0.270806\n", - "2 -1.467009 -0.143080 -0.806151\n", - "3 -0.149173 -0.610798 -0.272895\n", - "4 -0.149173 -0.034364 1.396784\n", - "5 -0.439343 -0.034364 -0.036322\n", - "6 1.093102 -0.764758 -0.036322\n", - "7 0.003098 -0.722648 -0.036322\n", - "8 -0.095899 -1.285156 -0.300566\n", - "9 0.109465 2.497843 -1.199856" + "0 -0.408268 -0.676643 -1.274743\n", + "1 -0.029322 -0.873593 -1.214105\n", + "2 -0.866371 1.081735 -0.226840\n", + "3 -0.327224 0.812278 1.074973\n", + "4 -0.327224 0.316145 -0.366725\n", + "5 -1.016239 0.316145 -0.337393\n", + "6 0.675123 1.067536 -0.337393\n", + "7 0.221568 2.025961 -0.337393\n", + "8 -0.317241 1.011275 0.674891\n", + "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 52, @@ -1915,6 +1993,7 @@ { "cell_type": "code", "execution_count": 53, + "id": "9e70d61a", "metadata": {}, "outputs": [ { @@ -1946,63 +2025,63 @@ "
00.7712450.0510241.199239-0.408268-0.676643-1.274743
1-1.1680410.702664-0.270806-0.029322-0.873593-1.214105
2-1.467009-0.143080-0.806151-0.8663711.081735-0.226840
3NaN-0.610798-0.2728950.8122781.074973
4NaN-0.0343641.3967840.316145-0.366725
5-0.439343-0.034364-0.036322-1.0162390.316145-0.337393
61.093102-0.764758-0.0363220.6751231.067536-0.337393
70.003098-0.722648-0.0363220.2215682.025961-0.337393
8-0.095899-1.285156-0.300566-0.3172411.0112750.674891
90.1094652.497843-1.199856-0.877041-1.919394-1.029201
\n", @@ -2010,16 +2089,16 @@ ], "text/plain": [ " A B C\n", - "0 0.771245 0.051024 1.199239\n", - "1 -1.168041 0.702664 -0.270806\n", - "2 -1.467009 -0.143080 -0.806151\n", - "3 NaN -0.610798 -0.272895\n", - "4 NaN -0.034364 1.396784\n", - "5 -0.439343 -0.034364 -0.036322\n", - "6 1.093102 -0.764758 -0.036322\n", - "7 0.003098 -0.722648 -0.036322\n", - "8 -0.095899 -1.285156 -0.300566\n", - "9 0.109465 2.497843 -1.199856" + "0 -0.408268 -0.676643 -1.274743\n", + "1 -0.029322 -0.873593 -1.214105\n", + "2 -0.866371 1.081735 -0.226840\n", + "3 NaN 0.812278 1.074973\n", + "4 NaN 0.316145 -0.366725\n", + "5 -1.016239 0.316145 -0.337393\n", + "6 0.675123 1.067536 -0.337393\n", + "7 0.221568 2.025961 -0.337393\n", + "8 -0.317241 1.011275 0.674891\n", + "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 53, @@ -2033,6 +2112,7 @@ }, { "cell_type": "markdown", + "id": "0ace728d", "metadata": {}, "source": [ "## Dropping axis labels with missing data: dropna" @@ -2040,15 +2120,16 @@ }, { "cell_type": "markdown", + "id": "2ccd7115", "metadata": {}, "source": [ - "Missing data can be excluded using `dropna()`:\n", - "\n" + "Missing data can be excluded using `dropna()`:" ] }, { "cell_type": "code", "execution_count": 54, + "id": "98c57be7", "metadata": {}, "outputs": [ { @@ -2127,6 +2208,7 @@ { "cell_type": "code", "execution_count": 55, + "id": "bc3f273a", "metadata": {}, "outputs": [ { @@ -2187,6 +2269,7 @@ { "cell_type": "code", "execution_count": 56, + "id": "a48d4de0", "metadata": {}, "outputs": [ { @@ -2249,14 +2332,16 @@ }, { "cell_type": "markdown", + "id": "0b1954f9", "metadata": {}, "source": [ - "An equivalent `dropna()` is available for Series. " + "An equivalent `dropna()` is available for Series." 
] }, { "cell_type": "code", "execution_count": 57, + "id": "2dd8f660", "metadata": {}, "outputs": [ { @@ -2279,6 +2364,7 @@ }, { "cell_type": "markdown", + "id": "121eb6d7", "metadata": {}, "source": [ "## Replacing generic values" @@ -2286,6 +2372,7 @@ }, { "cell_type": "markdown", + "id": "3cc4c5f1", "metadata": {}, "source": [ "Often times we want to replace arbitrary values with other values.\n", @@ -2296,6 +2383,7 @@ { "cell_type": "code", "execution_count": 58, + "id": "e6c14e8a", "metadata": {}, "outputs": [], "source": [ @@ -2305,6 +2393,7 @@ { "cell_type": "code", "execution_count": 59, + "id": "a852f0cb", "metadata": {}, "outputs": [ { @@ -2330,6 +2419,7 @@ { "cell_type": "code", "execution_count": 60, + "id": "f6ac12eb", "metadata": {}, "outputs": [ { @@ -2354,6 +2444,7 @@ }, { "cell_type": "markdown", + "id": "a6e1b6d7", "metadata": {}, "source": [ "We can also replace any value with a `` value." @@ -2362,6 +2453,7 @@ { "cell_type": "code", "execution_count": 61, + "id": "f0156bff", "metadata": {}, "outputs": [ { @@ -2386,6 +2478,7 @@ }, { "cell_type": "markdown", + "id": "6673eefb", "metadata": {}, "source": [ "You can replace a list of values by a list of other values:" @@ -2394,6 +2487,7 @@ { "cell_type": "code", "execution_count": 62, + "id": "f3110f5b", "metadata": {}, "outputs": [ { @@ -2418,6 +2512,7 @@ }, { "cell_type": "markdown", + "id": "61521e8b", "metadata": {}, "source": [ "You can also specify a mapping dict:" @@ -2426,6 +2521,7 @@ { "cell_type": "code", "execution_count": 63, + "id": "45862d05", "metadata": {}, "outputs": [ { @@ -2450,6 +2546,7 @@ }, { "cell_type": "markdown", + "id": "04a34549", "metadata": {}, "source": [ "For a DataFrame, you can specify individual values by column:" @@ -2458,6 +2555,7 @@ { "cell_type": "code", "execution_count": 64, + "id": "348caa64", "metadata": {}, "outputs": [], "source": [ @@ -2467,6 +2565,7 @@ { "cell_type": "code", "execution_count": 65, + "id": "cca41ec4", "metadata": {}, "outputs": [ { @@ 
-2545,6 +2644,7 @@ { "cell_type": "code", "execution_count": 66, + "id": "64334693", "metadata": {}, "outputs": [ { @@ -2622,6 +2722,7 @@ }, { "cell_type": "markdown", + "id": "2f0ceec7", "metadata": {}, "source": [ "## String/regular expression replacement" @@ -2629,6 +2730,7 @@ }, { "cell_type": "markdown", + "id": "c6f44740", "metadata": {}, "source": [ "cudf supports replacing string values using `replace` API:" @@ -2637,6 +2739,7 @@ { "cell_type": "code", "execution_count": 67, + "id": "031d3533", "metadata": {}, "outputs": [], "source": [ @@ -2646,6 +2749,7 @@ { "cell_type": "code", "execution_count": 68, + "id": "12b41efb", "metadata": {}, "outputs": [], "source": [ @@ -2655,6 +2759,7 @@ { "cell_type": "code", "execution_count": 69, + "id": "d450df49", "metadata": {}, "outputs": [ { @@ -2732,6 +2837,7 @@ { "cell_type": "code", "execution_count": 70, + "id": "f823bc46", "metadata": {}, "outputs": [ { @@ -2809,6 +2915,7 @@ { "cell_type": "code", "execution_count": 71, + "id": "bc52f6e9", "metadata": {}, "outputs": [ { @@ -2885,14 +2992,16 @@ }, { "cell_type": "markdown", + "id": "7c1087be", "metadata": {}, "source": [ - "Replace a few different values (list -> list):\n" + "Replace a few different values (list -> list):" ] }, { "cell_type": "code", "execution_count": 72, + "id": "7e23eba9", "metadata": {}, "outputs": [ { @@ -2969,6 +3078,7 @@ }, { "cell_type": "markdown", + "id": "42845a9c", "metadata": {}, "source": [ "Only search in column 'b' (dict -> dict):" @@ -2977,6 +3087,7 @@ { "cell_type": "code", "execution_count": 73, + "id": "d2e79805", "metadata": {}, "outputs": [ { @@ -3053,6 +3164,7 @@ }, { "cell_type": "markdown", + "id": "774b42a6", "metadata": {}, "source": [ "## Numeric replacement" @@ -3060,6 +3172,7 @@ }, { "cell_type": "markdown", + "id": "1c1926ac", "metadata": {}, "source": [ "`replace()` can also be used similar to `fillna()`." 
@@ -3068,6 +3181,7 @@ { "cell_type": "code", "execution_count": 74, + "id": "355a2f0d", "metadata": {}, "outputs": [], "source": [ @@ -3077,6 +3191,7 @@ { "cell_type": "code", "execution_count": 75, + "id": "d9eed372", "metadata": {}, "outputs": [], "source": [ @@ -3086,6 +3201,7 @@ { "cell_type": "code", "execution_count": 76, + "id": "ae944244", "metadata": {}, "outputs": [ { @@ -3116,70 +3232,70 @@ " \n", " \n", " 0\n", - " <NA>\n", - " <NA>\n", + " -0.089358787\n", + " -0.728419386\n", " \n", " \n", " 1\n", - " <NA>\n", - " <NA>\n", + " -2.141612003\n", + " -0.574415182\n", " \n", " \n", " 2\n", - " 0.123160746\n", - " 1.09464783\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 3\n", - " <NA>\n", - " <NA>\n", + " 0.774643462\n", + " 2.07287721\n", " \n", " \n", " 4\n", - " <NA>\n", - " <NA>\n", + " 0.93799853\n", + " -1.054129436\n", " \n", " \n", " 5\n", - " 0.68137677\n", - " -0.357346253\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 6\n", - " <NA>\n", - " <NA>\n", + " -0.435293012\n", + " 1.163009584\n", " \n", " \n", " 7\n", - " <NA>\n", - " <NA>\n", + " 1.346623287\n", + " 0.31961371\n", " \n", " \n", " 8\n", - " 1.173285961\n", - " -0.968616065\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 9\n", - " 0.147922362\n", - " -0.154880098\n", + " <NA>\n", + " <NA>\n", " \n", " \n", "\n", "
" ], "text/plain": [ - " 0 1\n", - "0 \n", - "1 \n", - "2 0.123160746 1.09464783\n", - "3 \n", - "4 \n", - "5 0.68137677 -0.357346253\n", - "6 \n", - "7 \n", - "8 1.173285961 -0.968616065\n", - "9 0.147922362 -0.154880098" + " 0 1\n", + "0 -0.089358787 -0.728419386\n", + "1 -2.141612003 -0.574415182\n", + "2 \n", + "3 0.774643462 2.07287721\n", + "4 0.93799853 -1.054129436\n", + "5 \n", + "6 -0.435293012 1.163009584\n", + "7 1.346623287 0.31961371\n", + "8 \n", + "9 " ] }, "execution_count": 76, @@ -3193,15 +3309,16 @@ }, { "cell_type": "markdown", + "id": "0f32607c", "metadata": {}, "source": [ - "Replacing more than one value is possible by passing a list.\n", - "\n" + "Replacing more than one value is possible by passing a list." ] }, { "cell_type": "code", "execution_count": 77, + "id": "59b81c60", "metadata": {}, "outputs": [], "source": [ @@ -3211,6 +3328,7 @@ { "cell_type": "code", "execution_count": 78, + "id": "01a71d4c", "metadata": {}, "outputs": [ { @@ -3241,70 +3359,70 @@ " \n", " \n", " 0\n", - " 5.000000\n", - " 5.000000\n", + " 10.000000\n", + " -0.728419\n", " \n", " \n", " 1\n", - " 5.000000\n", - " 5.000000\n", + " -2.141612\n", + " -0.574415\n", " \n", " \n", " 2\n", - " 0.123161\n", - " 1.094648\n", + " 5.000000\n", + " 5.000000\n", " \n", " \n", " 3\n", - " 5.000000\n", - " 5.000000\n", + " 0.774643\n", + " 2.072877\n", " \n", " \n", " 4\n", - " 5.000000\n", - " 5.000000\n", + " 0.937999\n", + " -1.054129\n", " \n", " \n", " 5\n", - " 0.681377\n", - " -0.357346\n", + " 5.000000\n", + " 5.000000\n", " \n", " \n", " 6\n", - " 5.000000\n", - " 5.000000\n", + " -0.435293\n", + " 1.163010\n", " \n", " \n", " 7\n", - " 5.000000\n", - " 5.000000\n", + " 1.346623\n", + " 0.319614\n", " \n", " \n", " 8\n", - " 1.173286\n", - " -0.968616\n", + " 5.000000\n", + " 5.000000\n", " \n", " \n", " 9\n", - " 0.147922\n", - " -0.154880\n", + " 5.000000\n", + " 5.000000\n", " \n", " \n", "\n", "" ], "text/plain": [ - " 0 1\n", - "0 5.000000 5.000000\n", - "1 
5.000000 5.000000\n", - "2 0.123161 1.094648\n", - "3 5.000000 5.000000\n", - "4 5.000000 5.000000\n", - "5 0.681377 -0.357346\n", - "6 5.000000 5.000000\n", - "7 5.000000 5.000000\n", - "8 1.173286 -0.968616\n", - "9 0.147922 -0.154880" + " 0 1\n", + "0 10.000000 -0.728419\n", + "1 -2.141612 -0.574415\n", + "2 5.000000 5.000000\n", + "3 0.774643 2.072877\n", + "4 0.937999 -1.054129\n", + "5 5.000000 5.000000\n", + "6 -0.435293 1.163010\n", + "7 1.346623 0.319614\n", + "8 5.000000 5.000000\n", + "9 5.000000 5.000000" ] }, "execution_count": 78, @@ -3318,15 +3436,16 @@ }, { "cell_type": "markdown", + "id": "1080e97b", "metadata": {}, "source": [ - "You can also operate on the DataFrame in place:\n", - "\n" + "You can also operate on the DataFrame in place:" ] }, { "cell_type": "code", "execution_count": 79, + "id": "5f0859d7", "metadata": {}, "outputs": [], "source": [ @@ -3336,6 +3455,7 @@ { "cell_type": "code", "execution_count": 80, + "id": "5cf28369", "metadata": {}, "outputs": [ { @@ -3366,70 +3486,70 @@ " \n", " \n", " 0\n", - " <NA>\n", - " <NA>\n", + " -0.089358787\n", + " -0.728419386\n", " \n", " \n", " 1\n", - " <NA>\n", - " <NA>\n", + " -2.141612003\n", + " -0.574415182\n", " \n", " \n", " 2\n", - " 0.123160746\n", - " 1.09464783\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 3\n", - " <NA>\n", - " <NA>\n", + " 0.774643462\n", + " 2.07287721\n", " \n", " \n", " 4\n", - " <NA>\n", - " <NA>\n", + " 0.93799853\n", + " -1.054129436\n", " \n", " \n", " 5\n", - " 0.68137677\n", - " -0.357346253\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 6\n", - " <NA>\n", - " <NA>\n", + " -0.435293012\n", + " 1.163009584\n", " \n", " \n", " 7\n", - " <NA>\n", - " <NA>\n", + " 1.346623287\n", + " 0.31961371\n", " \n", " \n", " 8\n", - " 1.173285961\n", - " -0.968616065\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 9\n", - " 0.147922362\n", - " -0.154880098\n", + " <NA>\n", + " <NA>\n", " \n", " \n", "\n", "" ], "text/plain": [ - " 0 1\n", - "0 \n", - "1 \n", - "2 0.123160746 
1.09464783\n", - "3 \n", - "4 \n", - "5 0.68137677 -0.357346253\n", - "6 \n", - "7 \n", - "8 1.173285961 -0.968616065\n", - "9 0.147922362 -0.154880098" + " 0 1\n", + "0 -0.089358787 -0.728419386\n", + "1 -2.141612003 -0.574415182\n", + "2 \n", + "3 0.774643462 2.07287721\n", + "4 0.93799853 -1.054129436\n", + "5 \n", + "6 -0.435293012 1.163009584\n", + "7 1.346623287 0.31961371\n", + "8 \n", + "9 " ] }, "execution_count": 80, @@ -3444,7 +3564,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -3458,9 +3578,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.8.13" } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/docs/cudf/source/user_guide/10min-cudf-cupy.ipynb b/docs/cudf/source/user_guide/cupy-interop.ipynb similarity index 87% rename from docs/cudf/source/user_guide/10min-cudf-cupy.ipynb rename to docs/cudf/source/user_guide/cupy-interop.ipynb index 1bcb9335256..3f444fe16a5 100644 --- a/docs/cudf/source/user_guide/10min-cudf-cupy.ipynb +++ b/docs/cudf/source/user_guide/cupy-interop.ipynb @@ -2,9 +2,10 @@ "cells": [ { "cell_type": "markdown", + "id": "8e5e6878", "metadata": {}, "source": [ - "# 10 Minutes to cuDF and CuPy\n", + "# Interoperability between cuDF and CuPy\n", "\n", "This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as advanced linear algebra operations)." 
] @@ -12,6 +13,7 @@ { "cell_type": "code", "execution_count": 1, + "id": "8b2d45c3", "metadata": {}, "outputs": [], "source": [ @@ -29,6 +31,7 @@ }, { "cell_type": "markdown", + "id": "e7e64b1a", "metadata": {}, "source": [ "### Converting a cuDF DataFrame to a CuPy Array\n", @@ -45,15 +48,16 @@ { "cell_type": "code", "execution_count": 2, + "id": "45c482ab", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "183 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", - "553 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", - "546 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" + "118 µs ± 77.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n", + "360 µs ± 6.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n", + "355 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], @@ -72,6 +76,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "a565effc", "metadata": {}, "outputs": [ { @@ -98,6 +103,7 @@ }, { "cell_type": "markdown", + "id": "0759ab29", "metadata": {}, "source": [ "### Converting a cuDF Series to a CuPy Array" @@ -105,27 +111,29 @@ }, { "cell_type": "markdown", + "id": "4f35ffbd", "metadata": {}, "source": [ "There are also multiple ways to convert a cuDF Series to a CuPy array:\n", "\n", "1. We can pass the Series to `cupy.asarray` as cuDF Series exposes [`__cuda_array_interface__`](https://docs-cupy.chainer.org/en/stable/reference/interoperability.html).\n", "2. We can leverage the dlpack interface `to_dlpack()`. \n", - "3. We can also use `Series.values` \n" + "3. We can also use `Series.values`" ] }, { "cell_type": "code", "execution_count": 4, + "id": "8f97f304", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "76.8 µs ± 636 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", - "198 µs ± 2.72 µs per loop (mean ± std. dev. 
of 7 runs, 10000 loops each)\n", - "181 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + "54.4 µs ± 66 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n", + "125 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n", + "119 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" ] } ], @@ -140,6 +148,7 @@ { "cell_type": "code", "execution_count": 5, + "id": "f96d5676", "metadata": {}, "outputs": [ { @@ -160,6 +169,7 @@ }, { "cell_type": "markdown", + "id": "c36e5b88", "metadata": {}, "source": [ "From here, we can proceed with normal CuPy workflows, such as reshaping the array, getting the diagonal, or calculating the norm." @@ -168,6 +178,7 @@ { "cell_type": "code", "execution_count": 6, + "id": "2a7ae43f", "metadata": {}, "outputs": [ { @@ -195,6 +206,7 @@ { "cell_type": "code", "execution_count": 7, + "id": "b442a30c", "metadata": {}, "outputs": [ { @@ -219,6 +231,7 @@ { "cell_type": "code", "execution_count": 8, + "id": "be7f4d32", "metadata": {}, "outputs": [ { @@ -238,6 +251,7 @@ }, { "cell_type": "markdown", + "id": "b353bded", "metadata": {}, "source": [ "### Converting a CuPy Array to a cuDF DataFrame\n", @@ -256,13 +270,14 @@ { "cell_type": "code", "execution_count": 9, + "id": "8887b253", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "23.9 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + "14.3 ms ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], @@ -273,6 +288,7 @@ { "cell_type": "code", "execution_count": 10, + "id": "08ec4ffa", "metadata": {}, "outputs": [ { @@ -475,6 +491,7 @@ }, { "cell_type": "markdown", + "id": "6804d291", "metadata": {}, "source": [ "We can check whether our array is Fortran contiguous by using cupy.isfortran or looking at the [flags](https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.flags) of the array." 
@@ -483,6 +500,7 @@ { "cell_type": "code", "execution_count": 11, + "id": "65b8bd0d", "metadata": {}, "outputs": [ { @@ -502,6 +520,7 @@ }, { "cell_type": "markdown", + "id": "151982ad", "metadata": {}, "source": [ "In this case, we'll need to convert it before going to a cuDF DataFrame. In the next two cells, we create the DataFrame by leveraging dlpack and the CUDA array interface, respectively." @@ -510,13 +529,14 @@ { "cell_type": "code", "execution_count": 12, + "id": "27b2f563", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "9.15 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + "6.57 ms ± 9.08 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], @@ -530,13 +550,14 @@ { "cell_type": "code", "execution_count": 13, + "id": "0a0cc290", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "5.74 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" + "4.48 ms ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], @@ -550,6 +571,7 @@ { "cell_type": "code", "execution_count": 14, + "id": "0d2c5beb", "metadata": {}, "outputs": [ { @@ -753,6 +775,7 @@ }, { "cell_type": "markdown", + "id": "395e2bba", "metadata": {}, "source": [ "### Converting a CuPy Array to a cuDF Series\n", @@ -763,6 +786,7 @@ { "cell_type": "code", "execution_count": 15, + "id": "d8518208", "metadata": {}, "outputs": [ { @@ -787,6 +811,7 @@ }, { "cell_type": "markdown", + "id": "7e159619", "metadata": {}, "source": [ "### Interweaving CuDF and CuPy for Smooth PyData Workflows\n", @@ -799,6 +824,7 @@ { "cell_type": "code", "execution_count": 16, + "id": "2bb8ed81", "metadata": {}, "outputs": [ { @@ -1000,6 +1026,7 @@ }, { "cell_type": "markdown", + "id": "2f3d4e78", "metadata": {}, "source": [ "We can just transform it into a CuPy array and use the `axis` argument of `sum`." 
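[Editor's note] `cupy.isfortran` matches the semantics of NumPy's `np.isfortran`, so the contiguity check described above can be sketched on CPU without a GPU; array shapes here are arbitrary.

```python
import numpy as np

# A C-contiguous (row-major) array vs. its Fortran-ordered copy; the
# same check the notebook performs with cupy.isfortran / ndarray.flags.
arr = np.random.rand(4, 3)            # C-contiguous by default
fortran_arr = np.asfortranarray(arr)  # column-major copy

print(np.isfortran(arr))          # False
print(np.isfortran(fortran_arr))  # True
print(fortran_arr.flags["F_CONTIGUOUS"])
```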
@@ -1008,6 +1035,7 @@ { "cell_type": "code", "execution_count": 17, + "id": "2dde030d", "metadata": {}, "outputs": [ { @@ -1035,6 +1063,7 @@ }, { "cell_type": "markdown", + "id": "4450dcc3", "metadata": {}, "source": [ "With just that single line, we're able to seamlessly move between data structures in this ecosystem, giving us enormous flexibility without sacrificing speed." @@ -1042,6 +1071,7 @@ }, { "cell_type": "markdown", + "id": "61bfb868", "metadata": {}, "source": [ "### Converting a cuDF DataFrame to a CuPy Sparse Matrix\n", @@ -1054,6 +1084,7 @@ { "cell_type": "code", "execution_count": 18, + "id": "e531fd15", "metadata": {}, "outputs": [], "source": [ @@ -1072,6 +1103,7 @@ }, { "cell_type": "markdown", + "id": "3f5e6ade", "metadata": {}, "source": [ "We can define a sparsely populated DataFrame to illustrate this conversion to either sparse matrix format." @@ -1080,6 +1112,7 @@ { "cell_type": "code", "execution_count": 19, + "id": "58c7e074", "metadata": {}, "outputs": [], "source": [ @@ -1095,6 +1128,7 @@ { "cell_type": "code", "execution_count": 20, + "id": "9265228d", "metadata": {}, "outputs": [ { @@ -1143,115 +1177,115 @@ " \n", " \n", " 0\n", - " 0.000000\n", " 0.0\n", " 0.0\n", - " 0.000000\n", " 0.0\n", - " 9.37476\n", - " 0.000000\n", + " 0.0\n", " 0.0\n", " 0.0\n", " 0.000000\n", - " 6.237859\n", " 0.0\n", " 0.0\n", " 0.000000\n", " 0.0\n", " 0.0\n", + " 0.0\n", " 0.00000\n", + " 0.000000\n", " 0.0\n", " 0.0\n", - " 0.000000\n", + " 0.0\n", + " 0.0\n", + " 11.308953\n", " \n", " \n", " 1\n", - " 0.000000\n", " 0.0\n", " 0.0\n", - " 0.000000\n", " 0.0\n", - " 0.00000\n", - " 0.000000\n", " 0.0\n", " 0.0\n", - " 0.000000\n", + " 0.0\n", " 0.000000\n", " 0.0\n", " 0.0\n", - " 0.065878\n", + " -5.241297\n", + " 0.0\n", + " 0.0\n", + " 0.0\n", + " 17.58476\n", + " 0.000000\n", " 0.0\n", " 0.0\n", - " 12.35705\n", " 0.0\n", " 0.0\n", " 0.000000\n", " \n", " \n", " 2\n", - " 3.232751\n", " 0.0\n", " 0.0\n", - " 0.000000\n", " 0.0\n", - " 0.00000\n", 
- " 8.341915\n", " 0.0\n", " 0.0\n", - " 0.000000\n", + " 0.0\n", " 0.000000\n", " 0.0\n", " 0.0\n", " 0.000000\n", " 0.0\n", " 0.0\n", + " 0.0\n", " 0.00000\n", + " 0.000000\n", + " 0.0\n", " 0.0\n", " 0.0\n", - " 3.110362\n", + " 0.0\n", + " 0.000000\n", " \n", " \n", " 3\n", - " 0.000000\n", " 0.0\n", " 0.0\n", - " 0.000000\n", " 0.0\n", - " 0.00000\n", - " 0.000000\n", " 0.0\n", " 0.0\n", - " 0.000000\n", + " 0.0\n", " 0.000000\n", " 0.0\n", " 0.0\n", " 0.000000\n", " 0.0\n", " 0.0\n", + " 0.0\n", " 0.00000\n", + " 10.869279\n", + " 0.0\n", + " 0.0\n", " 0.0\n", " 0.0\n", " 0.000000\n", " \n", " \n", " 4\n", - " 0.000000\n", " 0.0\n", " 0.0\n", - " 7.743024\n", " 0.0\n", - " 0.00000\n", - " 0.000000\n", " 0.0\n", " 0.0\n", - " 5.987098\n", - " 0.000000\n", + " 0.0\n", + " 2.526274\n", " 0.0\n", " 0.0\n", " 0.000000\n", " 0.0\n", " 0.0\n", + " 0.0\n", " 0.00000\n", + " 0.000000\n", + " 0.0\n", + " 0.0\n", " 0.0\n", " 0.0\n", " 0.000000\n", @@ -1261,19 +1295,19 @@ "" ], "text/plain": [ - " a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 \\\n", - "0 0.000000 0.0 0.0 0.000000 0.0 9.37476 0.000000 0.0 0.0 0.000000 \n", - "1 0.000000 0.0 0.0 0.000000 0.0 0.00000 0.000000 0.0 0.0 0.000000 \n", - "2 3.232751 0.0 0.0 0.000000 0.0 0.00000 8.341915 0.0 0.0 0.000000 \n", - "3 0.000000 0.0 0.0 0.000000 0.0 0.00000 0.000000 0.0 0.0 0.000000 \n", - "4 0.000000 0.0 0.0 7.743024 0.0 0.00000 0.000000 0.0 0.0 5.987098 \n", + " a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 \\\n", + "0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 \n", + "1 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 -5.241297 0.0 0.0 0.0 \n", + "2 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 \n", + "3 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 \n", + "4 0.0 0.0 0.0 0.0 0.0 0.0 2.526274 0.0 0.0 0.000000 0.0 0.0 0.0 \n", "\n", - " a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 \n", - "0 6.237859 0.0 0.0 0.000000 0.0 0.0 0.00000 0.0 0.0 0.000000 \n", - "1 0.000000 0.0 0.0 0.065878 0.0 0.0 12.35705 
0.0 0.0 0.000000 \n", - "2 0.000000 0.0 0.0 0.000000 0.0 0.0 0.00000 0.0 0.0 3.110362 \n", - "3 0.000000 0.0 0.0 0.000000 0.0 0.0 0.00000 0.0 0.0 0.000000 \n", - "4 0.000000 0.0 0.0 0.000000 0.0 0.0 0.00000 0.0 0.0 0.000000 " + " a13 a14 a15 a16 a17 a18 a19 \n", + "0 0.00000 0.000000 0.0 0.0 0.0 0.0 11.308953 \n", + "1 17.58476 0.000000 0.0 0.0 0.0 0.0 0.000000 \n", + "2 0.00000 0.000000 0.0 0.0 0.0 0.0 0.000000 \n", + "3 0.00000 10.869279 0.0 0.0 0.0 0.0 0.000000 \n", + "4 0.00000 0.000000 0.0 0.0 0.0 0.0 0.000000 " ] }, "execution_count": 20, @@ -1288,63 +1322,64 @@ { "cell_type": "code", "execution_count": 21, + "id": "5ba1a551", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - " (2, 0)\t3.2327506467190874\n", - " (259, 0)\t10.723428115951062\n", - " (643, 0)\t0.47763624588488707\n", - " (899, 0)\t8.857065309921685\n", - " (516, 0)\t8.792407143276648\n", - " (262, 0)\t2.1900894573805396\n", - " (390, 0)\t5.007630701229646\n", - " (646, 0)\t6.630703075588639\n", - " (392, 0)\t5.573713453854357\n", - " (776, 0)\t10.501281989515688\n", - " (904, 0)\t8.261890175181366\n", - " (1033, 0)\t-0.41106824704220446\n", - " (522, 0)\t12.619952511457068\n", - " (139, 0)\t12.753348070606792\n", - " (141, 0)\t4.936902335394504\n", - " (270, 0)\t-1.7695949916946174\n", - " (782, 0)\t4.378746787324408\n", - " (15, 0)\t8.554141682891935\n", - " (527, 0)\t5.1994882136423\n", - " (912, 0)\t2.6101212854793125\n", - " (401, 0)\t5.614628764689268\n", - " (403, 0)\t9.999468341523317\n", - " (787, 0)\t7.6170790481600985\n", - " (404, 0)\t5.105328903336744\n", - " (916, 0)\t1.395526391114967\n", + " (770, 0)\t-1.373354548007899\n", + " (771, 0)\t11.641890592020793\n", + " (644, 0)\t-1.4820515981598015\n", + " (773, 0)\t4.374245789758399\n", + " (646, 0)\t4.58071340724814\n", + " (776, 0)\t5.115792716318899\n", + " (649, 0)\t8.676941295251092\n", + " (522, 0)\t-0.11573951593420229\n", + " (396, 0)\t8.124303607236273\n", + " (652, 
0)\t9.359339954077681\n", + " (141, 0)\t8.50710863345112\n", + " (272, 0)\t7.440244879175392\n", + " (1042, 0)\t4.286859524587998\n", + " (275, 0)\t-0.6091666840632348\n", + " (787, 0)\t10.124449357828695\n", + " (915, 0)\t11.391560911074649\n", + " (1043, 0)\t11.478396096078907\n", + " (408, 0)\t11.204049991287349\n", + " (536, 0)\t13.239689100708974\n", + " (26, 0)\t4.951917355877771\n", + " (794, 0)\t2.736556006961319\n", + " (539, 0)\t12.553519350929216\n", + " (412, 0)\t2.8682583361020786\n", + " (540, 0)\t-1.2121388231076713\n", + " (796, 0)\t6.986443354019786\n", " :\t:\n", - " (9328, 19)\t5.938629381103238\n", - " (9457, 19)\t4.463547879031807\n", - " (9458, 19)\t-0.8034946631917106\n", - " (8051, 19)\t-1.904327616912268\n", - " (8819, 19)\t8.314944347687199\n", - " (7543, 19)\t1.4303204025224376\n", - " (8824, 19)\t5.1559713157589\n", - " (7673, 19)\t7.478681299798863\n", - " (7802, 19)\t0.502526238006068\n", - " (8186, 19)\t-3.824944685072472\n", - " (8570, 19)\t8.442324394481236\n", - " (8571, 19)\t6.204199957873215\n", - " (7420, 19)\t0.297737356585836\n", - " (9212, 19)\t3.934797966994188\n", - " (7421, 19)\t14.26161925450462\n", - " (8574, 19)\t5.826108027573207\n", - " (9214, 19)\t7.209975861932724\n", - " (9825, 19)\t11.155342644729613\n", - " (9702, 19)\t3.55144040779287\n", - " (9578, 19)\t12.638681362546228\n", - " (9712, 19)\t2.3542852760656348\n", - " (9969, 19)\t-2.645175092587592\n", - " (9973, 19)\t-2.2666402312025213\n", - " (9851, 19)\t-4.293381721466055\n", - " (9596, 19)\t6.6580506888430415\n" + " (9087, 19)\t-2.9543770156500395\n", + " (9440, 19)\t3.903613949374532\n", + " (9186, 19)\t0.3141028170017329\n", + " (9571, 19)\t1.7347840594688502\n", + " (9188, 19)\t14.68745562157488\n", + " (9316, 19)\t13.808308442016436\n", + " (9957, 19)\t9.705810918221086\n", + " (9318, 19)\t9.984168186940485\n", + " (9446, 19)\t5.173000114288142\n", + " (9830, 19)\t3.2442816093793607\n", + " (9835, 19)\t5.713078257113576\n", + " (9580, 
19)\t5.373437384911853\n", +    " (9326, 19)\t10.736403419943093\n", +    " (9711, 19)\t-4.003216472911014\n", +    " (9200, 19)\t5.560182026578174\n", +    " (9844, 19)\t6.17251145210342\n", +    " (9333, 19)\t7.085353006324948\n", +    " (9208, 19)\t6.789030498520347\n", +    " (9464, 19)\t4.314887636528589\n", +    " (9720, 19)\t12.446300974563027\n", +    " (9594, 19)\t4.317523130615451\n", +    " (9722, 19)\t-2.3257161477576336\n", +    " (9723, 19)\t1.9288133227037407\n", +    " (9469, 19)\t0.268312217498608\n", +    " (9599, 19)\t4.100996763787237\n" ] } ], @@ -1355,6 +1390,7 @@ }, { "cell_type": "markdown", + "id": "e8e58cd5", "metadata": {}, "source": [ "From here, we could continue our workflow with a CuPy sparse matrix.\n", @@ -1379,9 +1415,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.7" + "version": "3.8.13" } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/docs/cudf/source/basics/dask-cudf.rst b/docs/cudf/source/user_guide/dask-cudf.rst similarity index 100% rename from docs/cudf/source/basics/dask-cudf.rst rename to docs/cudf/source/user_guide/dask-cudf.rst diff --git a/docs/cudf/source/user_guide/data-types.rst b/docs/cudf/source/user_guide/data-types.rst new file mode 100644 index 00000000000..336e578955e --- /dev/null +++ b/docs/cudf/source/user_guide/data-types.rst @@ -0,0 +1,78 @@ +Supported Data Types +==================== + +cuDF lets you store and operate on many different types of data on the +GPU. Each type of data is associated with a data type (or "dtype"). +cuDF supports many data types supported by NumPy and Pandas, including +numeric, datetime, timedelta, categorical and string data types. In +addition, cuDF supports special data types for decimals and "nested +types" (lists and structs). + +Unlike in Pandas, all data types in cuDF are nullable. +See :doc:`Working With Missing Data <missing-data>`. + + +.. rst-class:: special-table +..
table:: + + +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Kind of Data | Data Type | Scalar | String Aliases | + +=================+============================+==============================================================+==============================================+ + | Integer |np.dtype(...) | np.int8_, np.int16_, np.int32_, np.int64_, np.uint8_, | ``'int8'``, ``'int16'``, ``'int32'``, | + | | | np.uint16_, np.uint32_, np.uint64_ | ``'int64'``, ``'uint8'``, ``'uint16'``, | + | | | | ``'uint32'``, ``'uint64'`` | + +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Float |np.dtype(...) | np.float32_, np.float64_ | ``'float32'``, ``'float64'`` | + +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Strings |np.dtype('object') | `str `_ | ``'string'``, ``'object'`` | + +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Datetime |np.dtype('datetime64[...]') | np.datetime64_ | ``'datetime64[s]'``, ``'datetime64[ms]'``, | + | | | | ``'datetime64[us]'``, ``'datetime64[ns]'`` | + +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Timedelta |np.dtype('timedelta64[...]')| np.timedelta64_ | ``'timedelta64[s]'``, ``'timedelta64[ms]'``, | + | (duration type) | | | ``'timedelta64[us]'``, ``'timedelta64[ns]'`` | + +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Categorical |cudf.CategoricalDtype(...) 
|(none)                      | ``'category'``                               | +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Boolean         |np.dtype('bool')            | np.bool_                                                     | ``'bool'``                                   | +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Decimal         |cudf.Decimal32Dtype(...),   |(none)                                                        |(none)                                        | + |                 |cudf.Decimal64Dtype(...),   |                                                              |                                              | + |                 |cudf.Decimal128Dtype(...)   |                                                              |                                              | +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Lists           |cudf.ListDtype(...)         | list                                                         |(none)                                        | +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + | Structs         |cudf.StructDtype(...)       | dict                                                         |(none)                                        | +-----------------+----------------------------+--------------------------------------------------------------+----------------------------------------------+ + + +A note on strings +----------------- + +The data type associated with string data in cuDF is ``"object"``. + +.. code:: python + +   >>> import cudf +   >>> s = cudf.Series(["abc", "def", "ghi"]) +   >>> s.dtype +   dtype('O') + +This is for compatibility with Pandas, but it can be misleading. In +both NumPy and Pandas, ``"object"`` is the data type associated with data +composed of arbitrary Python objects (not just strings). However, +cuDF does not support storing arbitrary Python objects. + + +.. _np.int8: +.. _np.int16: +.. _np.int32: +.. _np.int64: +.. _np.uint8: +.. _np.uint16: +.. _np.uint32: +.. _np.uint64: +.. _np.float32: +.. _np.float64: +.. _np.bool: https://numpy.org/doc/stable/user/basics.types.html +.. _np.datetime64: https://numpy.org/doc/stable/reference/arrays.datetime.html#basic-datetimes +..
_np.timedelta64: https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic diff --git a/docs/cudf/source/basics/groupby.rst b/docs/cudf/source/user_guide/groupby.rst similarity index 100% rename from docs/cudf/source/basics/groupby.rst rename to docs/cudf/source/user_guide/groupby.rst diff --git a/docs/cudf/source/user_guide/guide-to-udfs.ipynb b/docs/cudf/source/user_guide/guide-to-udfs.ipynb index 8026c378156..ef7500a2be9 100644 --- a/docs/cudf/source/user_guide/guide-to-udfs.ipynb +++ b/docs/cudf/source/user_guide/guide-to-udfs.ipynb @@ -2,15 +2,16 @@ "cells": [ { "cell_type": "markdown", + "id": "77149e57", "metadata": {}, "source": [ - "Overview of User Defined Functions with cuDF\n", - "====================================" + "# Overview of User Defined Functions with cuDF" ] }, { "cell_type": "code", "execution_count": 1, + "id": "0c6b65ce", "metadata": {}, "outputs": [], "source": [ @@ -21,6 +22,7 @@ }, { "cell_type": "markdown", + "id": "8826af13", "metadata": {}, "source": [ "Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame style operators. While out of the box functions are flexible and useful, it is sometimes necessary to write custom code, or user-defined functions (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.\n", @@ -39,10 +41,10 @@ }, { "cell_type": "markdown", + "id": "32a8f4fb", "metadata": {}, "source": [ - "Series UDFs\n", - "--------------\n", + "## Series UDFs\n", "\n", "You can execute UDFs on Series in two ways:\n", "\n", @@ -54,14 +56,15 @@ }, { "cell_type": "markdown", + "id": "49399a84", "metadata": {}, "source": [ - "`cudf.Series.apply`\n", - "---------------------" + "### `cudf.Series.apply`" ] }, { "cell_type": "markdown", + "id": "0a209ea2", "metadata": {}, "source": [ "cuDF provides a similar API to `pandas.Series.apply` for applying scalar UDFs to series objects. Here is a very basic example." 
@@ -70,6 +73,7 @@ { "cell_type": "code", "execution_count": 2, + "id": "e28d5b82", "metadata": {}, "outputs": [], "source": [ @@ -79,6 +83,7 @@ }, { "cell_type": "markdown", + "id": "48a9fa5e", "metadata": {}, "source": [ "UDFs destined for `cudf.Series.apply` might look something like this:" @@ -87,6 +92,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "96aeb19f", "metadata": {}, "outputs": [], "source": [ @@ -97,6 +103,7 @@ }, { "cell_type": "markdown", + "id": "e61d0169", "metadata": {}, "source": [ "`cudf.Series.apply` is called like `pd.Series.apply` and returns a new `Series` object:" @@ -105,6 +112,7 @@ { "cell_type": "code", "execution_count": 4, + "id": "8ca08834", "metadata": {}, "outputs": [ { @@ -127,14 +135,15 @@ }, { "cell_type": "markdown", + "id": "c98dab03", "metadata": {}, "source": [ - "Functions with Additional Scalar Arguments\n", - "---------------------------------------------------" + "### Functions with Additional Scalar Arguments" ] }, { "cell_type": "markdown", + "id": "2aa3df6f", "metadata": {}, "source": [ "In addition, `cudf.Series.apply` supports `args=` just like pandas, allowing you to write UDFs that accept an arbitrary number of scalar arguments. Here is an example of such a function and its API call in both pandas and cuDF:" @@ -143,6 +152,7 @@ { "cell_type": "code", "execution_count": 5, + "id": "8d156d01", "metadata": {}, "outputs": [], "source": [ @@ -153,6 +163,7 @@ { "cell_type": "code", "execution_count": 6, + "id": "1dee82d7", "metadata": {}, "outputs": [ { @@ -176,6 +187,7 @@ }, { "cell_type": "markdown", + "id": "22739e28", "metadata": {}, "source": [ "As a final note, `**kwargs` is not yet supported."
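[Editor's note] The `Series.apply` pattern described above has a direct pandas counterpart, so it can be sketched on CPU; the UDF body below is illustrative, not copied from the notebook cells in this diff.

```python
import pandas as pd

# A scalar UDF of the shape cudf.Series.apply accepts; pandas.Series.apply
# shares the call signature, so pandas serves as a CPU stand-in here.
def f(x):
    return x + 1

s = pd.Series([1, 2, 3])
print(s.apply(f).tolist())  # [2, 3, 4]

# args= passes extra positional scalars to the UDF, as in pandas.
def g(x, const):
    return x + const

print(s.apply(g, args=(10,)).tolist())  # [11, 12, 13]
```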
@@ -183,14 +195,15 @@ }, { "cell_type": "markdown", + "id": "afbf33dc", "metadata": {}, "source": [ - "Nullable Data\n", - "----------------" + "### Nullable Data" ] }, { "cell_type": "markdown", + "id": "5dc06e8c", "metadata": {}, "source": [ "The null value `NA` propagates through unary and binary operations. Thus, `NA + 1`, `abs(NA)`, and `NA == NA` all return `NA`. To make this concrete, let's look at the same example from above, this time using nullable data:" @@ -199,6 +212,7 @@ { "cell_type": "code", "execution_count": 7, + "id": "bda261dd", "metadata": {}, "outputs": [ { @@ -224,6 +238,7 @@ { "cell_type": "code", "execution_count": 8, + "id": "0123ae07", "metadata": {}, "outputs": [], "source": [ @@ -235,6 +250,7 @@ { "cell_type": "code", "execution_count": 9, + "id": "e95868dd", "metadata": {}, "outputs": [ { @@ -258,6 +274,7 @@ }, { "cell_type": "markdown", + "id": "97372e15", "metadata": {}, "source": [ "Often, however, you want explicit null handling behavior inside the function. cuDF exposes this capability in the same way as pandas, by interacting directly with the `NA` singleton object. Here's an example of a function with explicit null handling:" @@ -266,6 +283,7 @@ { "cell_type": "code", "execution_count": 10, + "id": "6c65241b", "metadata": {}, "outputs": [], "source": [ @@ -280,6 +298,7 @@ { "cell_type": "code", "execution_count": 11, + "id": "ab0f4dbf", "metadata": {}, "outputs": [ { @@ -303,6 +322,7 @@ }, { "cell_type": "markdown", + "id": "bdddc4e8", "metadata": {}, "source": [ "In addition, `cudf.NA` can be returned from a function directly or conditionally. This capability should allow you to implement custom null handling in a wide variety of cases."
@@ -310,14 +330,15 @@ }, { "cell_type": "markdown", + "id": "54cafbc0", "metadata": {}, "source": [ - "Lower level control with custom `numba` kernels\n", - "---------------------------------------------------------" + "### Lower level control with custom `numba` kernels" ] }, { "cell_type": "markdown", + "id": "00914f2a", "metadata": {}, "source": [ "In addition to the Series.apply() method for performing custom operations, you can also pass Series objects directly into [CUDA kernels written with Numba](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html).\n", @@ -329,6 +350,7 @@ { "cell_type": "code", "execution_count": 12, + "id": "732434f6", "metadata": {}, "outputs": [], "source": [ @@ -338,6 +360,7 @@ { "cell_type": "code", "execution_count": 13, + "id": "4f5997e5", "metadata": {}, "outputs": [], "source": [ @@ -352,6 +375,7 @@ }, { "cell_type": "markdown", + "id": "d9667a55", "metadata": {}, "source": [ "This kernel will take an input array, multiply it by a configurable value (supplied at runtime), and store the result in an output array. Notice that we wrapped our logic in an `if` statement. Because we can launch more threads than the size of our array, we need to make sure that we don't use threads with an index that would be out of bounds. Leaving this out can result in undefined behavior.\n", @@ -362,6 +386,7 @@ { "cell_type": "code", "execution_count": 14, + "id": "ea6008a6", "metadata": {}, "outputs": [], "source": [ @@ -372,6 +397,7 @@ }, { "cell_type": "markdown", + "id": "3fb69909", "metadata": {}, "source": [ "After calling our kernel, our DataFrame is now populated with the result." 
@@ -380,6 +406,7 @@ { "cell_type": "code", "execution_count": 15, + "id": "183a82ed", "metadata": {}, "outputs": [ { @@ -469,6 +496,7 @@ }, { "cell_type": "markdown", + "id": "ab9c305e", "metadata": {}, "source": [ "This API theoretically allows you to write arbitrary kernel logic, potentially accessing and using elements of the series at arbitrary indices, and to use them on cuDF data structures. Advanced developers with some CUDA experience can often use this capability to implement iterative transformations, or spot-treat problem areas of a data pipeline with a custom kernel that does the same job faster." @@ -476,28 +504,29 @@ }, { "cell_type": "markdown", + "id": "0acc6ef2", "metadata": {}, "source": [ - "DataFrame UDFs\n", - "--------------------\n", + "## DataFrame UDFs\n", "\n", "Like `cudf.Series`, there are multiple ways of using UDFs on dataframes, which essentially amount to UDFs that expect multiple columns as input:\n", "\n", "- `cudf.DataFrame.apply`, which functions like `pd.DataFrame.apply` and expects a row udf\n", "- `cudf.DataFrame.apply_rows`, which is a thin wrapper around numba and expects a numba kernel\n", - "- `cudf.DataFrame.apply_chunks`, which is similar to `cudf.DataFrame.apply_rows` but offers lower level control.\n" + "- `cudf.DataFrame.apply_chunks`, which is similar to `cudf.DataFrame.apply_rows` but offers lower level control." ] }, { "cell_type": "markdown", + "id": "2102c3ed", "metadata": {}, "source": [ - "`cudf.DataFrame.apply`\n", - "---------------------------" + "### `cudf.DataFrame.apply`" ] }, { "cell_type": "markdown", + "id": "238bec41", "metadata": {}, "source": [ "`cudf.DataFrame.apply` is the main entrypoint for UDFs that expect multiple columns as input and produce a single output column. Functions intended to be consumed by this API are written in terms of a \"row\" argument. The \"row\" is considered to be like a dictionary and contains all of the column values at a certain `iloc` in a `DataFrame`. 
The function can access these values by key, the keys being the column names corresponding to the desired value. Below is an example function that would be used to add column `A` and column `B` together inside a UDF." @@ -506,6 +535,7 @@ { "cell_type": "code", "execution_count": 16, + "id": "73653918", "metadata": {}, "outputs": [], "source": [ @@ -515,6 +545,7 @@ }, { "cell_type": "markdown", + "id": "b5eb32dd", "metadata": {}, "source": [ "Let's create some very basic toy data containing at least one null." @@ -523,6 +554,7 @@ { "cell_type": "code", "execution_count": 17, + "id": "077feb75", "metadata": {}, "outputs": [ { @@ -592,14 +624,16 @@ }, { "cell_type": "markdown", + "id": "609a3da5", "metadata": {}, "source": [ - "Finally call the function as you would in pandas - by using a lambda function to map the UDF onto \"rows\" of the DataFrame: " + "Finally call the function as you would in pandas - by using a lambda function to map the UDF onto \"rows\" of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 18, + "id": "091e39e1", "metadata": {}, "outputs": [ { @@ -622,6 +656,7 @@ }, { "cell_type": "markdown", + "id": "44e54c31", "metadata": {}, "source": [ "The same function should produce the same result as pandas:" @@ -630,6 +665,7 @@ { "cell_type": "code", "execution_count": 19, + "id": "bd345fab", "metadata": {}, "outputs": [ { @@ -652,6 +688,7 @@ }, { "cell_type": "markdown", + "id": "004fbbba", "metadata": {}, "source": [ "Notice that Pandas returns `object` dtype - see notes on this in the caveats section." @@ -659,6 +696,7 @@ }, { "cell_type": "markdown", + "id": "0b11c172", "metadata": {}, "source": [ "Like `cudf.Series.apply`, these functions support generalized null handling. 
Here's a function that conditionally returns a different value if a certain input is null:" @@ -667,6 +705,7 @@ { "cell_type": "code", "execution_count": 20, + "id": "b70f4b3b", "metadata": {}, "outputs": [ { @@ -737,6 +776,7 @@ { "cell_type": "code", "execution_count": 21, + "id": "0313c8df", "metadata": {}, "outputs": [ { @@ -759,6 +799,7 @@ }, { "cell_type": "markdown", + "id": "313c77f3", "metadata": {}, "source": [ "`cudf.NA` can also be directly returned from a function resulting in data that has the correct nulls in the end, just as if it were run in Pandas. For the following data, the last row fulfills the condition that `1 + 3 > 3` and returns `NA` for that row:" @@ -767,6 +808,7 @@ { "cell_type": "code", "execution_count": 22, + "id": "96a7952a", "metadata": {}, "outputs": [ { @@ -845,6 +887,7 @@ { "cell_type": "code", "execution_count": 23, + "id": "e0815f60", "metadata": {}, "outputs": [ { @@ -867,6 +910,7 @@ }, { "cell_type": "markdown", + "id": "b9c674f4", "metadata": {}, "source": [ "Mixed types are allowed, but will return the common type, rather than object as in Pandas. Here's a null-aware op between an int and a float column:" @@ -875,6 +919,7 @@ { "cell_type": "code", "execution_count": 24, + "id": "495efd14", "metadata": {}, "outputs": [ { @@ -948,6 +993,7 @@ { "cell_type": "code", "execution_count": 25, + "id": "678b0b5a", "metadata": {}, "outputs": [ { @@ -970,6 +1016,7 @@ }, { "cell_type": "markdown", + "id": "ce0897c0", "metadata": {}, "source": [ "Functions may also return scalar values; however, the result will be promoted to a safe type regardless of the data. 
This means even if you have a function like:\n", @@ -991,6 +1038,7 @@ { "cell_type": "code", "execution_count": 26, + "id": "acf48d56", "metadata": {}, "outputs": [ { @@ -1063,6 +1111,7 @@ { "cell_type": "code", "execution_count": 27, + "id": "78a98172", "metadata": {}, "outputs": [ { @@ -1085,6 +1134,7 @@ }, { "cell_type": "markdown", + "id": "2ceaece4", "metadata": {}, "source": [ "Any number of columns and many arithmetic operators are supported, allowing for complex UDFs:" @@ -1093,6 +1143,7 @@ { "cell_type": "code", "execution_count": 28, + "id": "142c30a9", "metadata": {}, "outputs": [ { @@ -1181,6 +1232,7 @@ { "cell_type": "code", "execution_count": 29, + "id": "fee9198a", "metadata": {}, "outputs": [ { @@ -1203,17 +1255,17 @@ }, { "cell_type": "markdown", + "id": "9c587bd2", "metadata": {}, "source": [ - "Numba kernels for DataFrames\n", - "------------------------------------" + "### Numba kernels for DataFrames" ] }, { "cell_type": "markdown", + "id": "adc6a459", "metadata": {}, "source": [ - "\n", "We could apply a UDF on a DataFrame like we did above with `forall`. We'd need to write a kernel that expects multiple inputs, and pass multiple Series as arguments when we execute our kernel. Because this is fairly common and can be difficult to manage, cuDF provides two APIs to streamline this: `apply_rows` and `apply_chunks`. Below, we walk through an example of using `apply_rows`. `apply_chunks` works in a similar way, but also offers more control over low-level kernel behavior.\n", "\n", "Now that we have two numeric columns in our DataFrame, let's write a kernel that uses both of them." @@ -1222,6 +1274,7 @@ { "cell_type": "code", "execution_count": 30, + "id": "90cbcd85", "metadata": {}, "outputs": [], "source": [ @@ -1235,6 +1288,7 @@ }, { "cell_type": "markdown", + "id": "bce045f2", "metadata": {}, "source": [ "Notice that we need to `enumerate` through our `zipped` function arguments (which either match or are mapped to our input column names). 
We can pass this kernel to `apply_rows`. We'll need to specify a few arguments:\n", @@ -1251,6 +1305,7 @@ { "cell_type": "code", "execution_count": 31, + "id": "e782daff", "metadata": {}, "outputs": [ { @@ -1337,6 +1392,7 @@ }, { "cell_type": "markdown", + "id": "6b838b89", "metadata": {}, "source": [ "As expected, we see our conditional addition worked. At this point, we've successfully executed UDFs on the core data structures of cuDF." @@ -1344,9 +1400,10 @@ }, { "cell_type": "markdown", + "id": "fca97003", "metadata": {}, "source": [ - "## Null Handling in `apply_rows` and `apply_chunks`\n", + "### Null Handling in `apply_rows` and `apply_chunks`\n", "\n", "By default, DataFrame methods for applying UDFs like `apply_rows` will handle nulls pessimistically (all rows with a null value will be removed from the output if they are used in the kernel). Exploring how non-pessimistic null handling can lead to undefined behavior is outside the scope of this guide. Suffice it to say, pessimistic null handling is the safe and consistent approach. You can see an example below." ] @@ -1354,6 +1411,7 @@ { "cell_type": "code", "execution_count": 32, + "id": "befd8333", "metadata": {}, "outputs": [ { @@ -1445,6 +1503,7 @@ }, { "cell_type": "markdown", + "id": "c710ce86", "metadata": {}, "source": [ "In the dataframe above, there are three null values. Each column has a null in a different row. When we use our UDF with `apply_rows`, our output should have two nulls due to pessimistic null handling (because we're not using column `c`, the null value there does not matter to us)." @@ -1453,6 +1512,7 @@ { "cell_type": "code", "execution_count": 33, + "id": "d1f3dcaf", "metadata": {}, "outputs": [ { @@ -1546,6 +1606,7 @@ }, { "cell_type": "markdown", + "id": "53b9a2f8", "metadata": {}, "source": [ "As expected, we end up with two nulls in our output. The null values from the columns we used propagated to our output, but the null from the column we ignored did not." 
@@ -1553,10 +1614,10 @@ }, { "cell_type": "markdown", + "id": "4bbefa67", "metadata": {}, "source": [ - "Rolling Window UDFs\n", - "-------------------------\n", + "## Rolling Window UDFs\n", "\n", "For time-series data, we may need to operate on a small \\\"window\\\" of our column at a time, processing each portion independently. We could slide (\\\"roll\\\") this window over the entire column to answer questions like \\\"What is the 3-day moving average of a stock price over the past year?\"\n", "\n", @@ -1566,6 +1627,7 @@ { "cell_type": "code", "execution_count": 34, + "id": "6bc6aea3", "metadata": {}, "outputs": [ { @@ -1593,6 +1655,7 @@ { "cell_type": "code", "execution_count": 35, + "id": "a4c31df1", "metadata": {}, "outputs": [ { @@ -1613,6 +1676,7 @@ }, { "cell_type": "markdown", + "id": "ff40d863", "metadata": {}, "source": [ "Next, we'll define a function to use on our rolling windows. We created this one to highlight how you can include things like loops, mathematical functions, and conditionals. Rolling window UDFs do not yet support null values." @@ -1621,6 +1685,7 @@ { "cell_type": "code", "execution_count": 36, + "id": "eb5a081b", "metadata": {}, "outputs": [], "source": [ @@ -1637,6 +1702,7 @@ }, { "cell_type": "markdown", + "id": "df8ba31d", "metadata": {}, "source": [ "We can execute the function by passing it to `apply`. With `window=3`, `min_periods=3`, and `center=False`, our first two values are `null`." @@ -1645,6 +1711,7 @@ { "cell_type": "code", "execution_count": 37, + "id": "ddec3263", "metadata": {}, "outputs": [ { @@ -1670,6 +1737,7 @@ }, { "cell_type": "markdown", + "id": "187478db", "metadata": {}, "source": [ "We can apply this function to every column in a DataFrame, too." 
@@ -1678,6 +1746,7 @@ { "cell_type": "code", "execution_count": 38, + "id": "8b61094a", "metadata": {}, "outputs": [ { @@ -1759,6 +1828,7 @@ { "cell_type": "code", "execution_count": 39, + "id": "bb8c3019", "metadata": {}, "outputs": [ { @@ -1867,10 +1937,10 @@ }, { "cell_type": "markdown", + "id": "d4785060", "metadata": {}, "source": [ - "GroupBy DataFrame UDFs\n", - "-------------------------------\n", + "## GroupBy DataFrame UDFs\n", "\n", "We can also apply UDFs to grouped DataFrames using `apply_grouped`. This example is also drawn and adapted from the RAPIDS [API documentation]().\n", "\n", @@ -1880,6 +1950,7 @@ { "cell_type": "code", "execution_count": 40, + "id": "3dc272ab", "metadata": {}, "outputs": [ { @@ -1971,6 +2042,7 @@ { "cell_type": "code", "execution_count": 41, + "id": "c0578e0a", "metadata": {}, "outputs": [], "source": [ @@ -1979,6 +2051,7 @@ }, { "cell_type": "markdown", + "id": "4808726f", "metadata": {}, "source": [ "Next we'll define a function to apply to each group independently. In this case, we'll take the rolling average of column `e`, and call that new column `rolling_avg_e`." @@ -1987,6 +2060,7 @@ { "cell_type": "code", "execution_count": 42, + "id": "19f0f7fe", "metadata": {}, "outputs": [], "source": [ @@ -2006,6 +2080,7 @@ }, { "cell_type": "markdown", + "id": "7566f359", "metadata": {}, "source": [ "We can execute this with a very similar API to `apply_rows`. This time, though, it's going to execute independently for each group." @@ -2014,6 +2089,7 @@ { "cell_type": "code", "execution_count": 43, + "id": "c43426c3", "metadata": {}, "outputs": [ { @@ -2157,6 +2233,7 @@ }, { "cell_type": "markdown", + "id": "c8511306", "metadata": {}, "source": [ "Notice how, with a window size of three in the kernel, the first two values in each group for our output column are null." 
@@ -2164,10 +2241,10 @@ }, { "cell_type": "markdown", + "id": "0060678c", "metadata": {}, "source": [ - "Numba Kernels on CuPy Arrays\n", - "-------------------------------------\n", + "## Numba Kernels on CuPy Arrays\n", "\n", "We can also execute Numba kernels on CuPy NDArrays, again thanks to the `__cuda_array_interface__`. We can even run the same UDF on the Series and the CuPy array. First, we define a Series and then create a CuPy array from that Series." ] }, { "cell_type": "code", "execution_count": 44, + "id": "aa6a8509", "metadata": {}, "outputs": [ { @@ -2198,6 +2276,7 @@ }, { "cell_type": "markdown", + "id": "0fed556f", "metadata": {}, "source": [ "Next, we define a UDF and execute it on our Series. We need to allocate a Series of the same size for our output, which we'll call `out`." @@ -2206,6 +2285,7 @@ { "cell_type": "code", "execution_count": 45, + "id": "0bb8bf93", "metadata": {}, "outputs": [ { @@ -2238,6 +2318,7 @@ }, { "cell_type": "markdown", + "id": "a857b169", "metadata": {}, "source": [ "Finally, we execute the same function on our array. We allocate an empty array `out` to store our results." @@ -2246,6 +2327,7 @@ { "cell_type": "code", "execution_count": 46, + "id": "ce60b639", "metadata": {}, "outputs": [ { @@ -2267,14 +2349,15 @@ }, { "cell_type": "markdown", + "id": "b899d51c", "metadata": {}, "source": [ - "Caveats\n", - "---------" + "## Caveats" ] }, { "cell_type": "markdown", + "id": "fe7eb68b", "metadata": {}, "source": [ "- Only numeric, nondecimal scalar types are currently supported; strings and structured types are planned. Attempting to use this API with those types will throw a `TypeError`.\n", @@ -2283,10 +2366,10 @@ }, { "cell_type": "markdown", + "id": "c690563b", "metadata": {}, "source": [ - "Summary\n", - "-----------\n", + "## Summary\n", "\n", "This guide has covered a lot of content. 
At this point, you should hopefully feel comfortable writing UDFs (with or without null values) that operate on\n", "\n", @@ -2323,5 +2406,5 @@ } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/docs/cudf/source/user_guide/index.rst b/docs/cudf/source/user_guide/index.rst index 1061008eb3c..4e9c97cfae3 100644 --- a/docs/cudf/source/user_guide/index.rst +++ b/docs/cudf/source/user_guide/index.rst @@ -6,7 +6,14 @@ User Guide .. toctree:: :maxdepth: 2 - 10min.ipynb - 10min-cudf-cupy.ipynb - guide-to-udfs.ipynb - Working-with-missing-data.ipynb + 10min.md + pandas-comparison.rst + data-types.rst + io.rst + Working-with-missing-data.md + groupby.rst + guide-to-udfs.md + cupy-interop.md + dask-cudf.rst + internals.rst + PandasCompat.rst diff --git a/docs/cudf/source/basics/internals.rst b/docs/cudf/source/user_guide/internals.rst similarity index 100% rename from docs/cudf/source/basics/internals.rst rename to docs/cudf/source/user_guide/internals.rst diff --git a/docs/cudf/source/basics/io-gds-integration.rst b/docs/cudf/source/user_guide/io-gds-integration.rst similarity index 100% rename from docs/cudf/source/basics/io-gds-integration.rst rename to docs/cudf/source/user_guide/io-gds-integration.rst diff --git a/docs/cudf/source/basics/io-nvcomp-integration.rst b/docs/cudf/source/user_guide/io-nvcomp-integration.rst similarity index 100% rename from docs/cudf/source/basics/io-nvcomp-integration.rst rename to docs/cudf/source/user_guide/io-nvcomp-integration.rst diff --git a/docs/cudf/source/basics/io-supported-types.rst b/docs/cudf/source/user_guide/io-supported-types.rst similarity index 100% rename from docs/cudf/source/basics/io-supported-types.rst rename to docs/cudf/source/user_guide/io-supported-types.rst diff --git a/docs/cudf/source/basics/io.rst b/docs/cudf/source/user_guide/io.rst similarity index 100% rename from docs/cudf/source/basics/io.rst rename to docs/cudf/source/user_guide/io.rst diff --git 
a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md new file mode 100644 index 00000000000..e0e4dc0157e --- /dev/null +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -0,0 +1,155 @@ +# Comparison of cuDF and Pandas + +cuDF is a DataFrame library that closely matches the Pandas API, but +leverages NVIDIA GPUs to accelerate computation. However, +there are some differences between cuDF and Pandas, both in terms of API +and behavior. This page documents the similarities and differences +between cuDF and Pandas. + +## Supported operations + +cuDF supports many of the same data structures and operations as +Pandas. This includes `Series`, `DataFrame`, `Index` and +operations on them such as unary and binary operations, indexing, +filtering, concatenating, joining, groupby and window operations - +among many others. + +The best way to see if we support a particular Pandas API is to search +our [API docs](/api_docs/index). + +## Data types + +cuDF supports many common data types supported by Pandas, including +numeric, datetime, timestamp, string, and categorical data types. In +addition, we support special data types for decimal, list and "struct" +values. See the section on [Data Types](data-types) for +details. + +Note that we do not support custom data types like Pandas' +`ExtensionDtype`. + +## Null (or "missing") values + +Unlike Pandas, *all* data types in cuDF are nullable, +meaning they can contain missing values (represented by `cudf.NA`). 
+ +```{code} python +>>> s = cudf.Series([1, 2, cudf.NA]) +>>> s +0 1 +1 2 +2 <NA> +dtype: int64 +``` + +Nulls are not coerced to `nan` in any situation; +compare the behavior of cuDF with Pandas below: + +```{code} python +>>> s = cudf.Series([1, 2, cudf.NA], dtype="category") +>>> s +0 1 +1 2 +2 <NA> +dtype: category +Categories (2, int64): [1, 2] + +>>> s = pd.Series([1, 2, pd.NA], dtype="category") +>>> s +0 1 +1 2 +2 NaN +dtype: category +Categories (2, int64): [1, 2] +``` + +See the docs on [missing data](Working-with-missing-data) for +details. + +## Iteration + +Iterating over a cuDF `Series`, `DataFrame` or `Index` is not +supported. This is because iterating over data that resides on the GPU +will yield *extremely* poor performance, as GPUs are optimized for +highly parallel operations rather than sequential operations. + +In the vast majority of cases, it is possible to avoid iteration and +use an existing function or method to accomplish the same task. If you +absolutely must iterate, copy the data from GPU to CPU by using +`.to_arrow()` or `.to_pandas()`, then copy the result back to GPU +using `.from_arrow()` or `.from_pandas()`. + +## Result ordering + +By default, `join` (or `merge`) and `groupby` operations in cuDF +do *not* guarantee output ordering. +Compare the results obtained from Pandas and cuDF below: + +```{code} python + >>> import cupy as cp + >>> df = cudf.DataFrame({'a': cp.random.randint(0, 1000, 1000), 'b': range(1000)}) + >>> df.groupby("a").mean().head() + b + a + 742 694.5 + 29 840.0 + 459 525.5 + 442 363.0 + 666 7.0 + >>> df.to_pandas().groupby("a").mean().head() + b + a + 2 643.75 + 6 48.00 + 7 631.00 + 9 906.00 + 10 640.00 +``` + +To match Pandas behavior, you must explicitly pass `sort=True`: + +```{code} python +>>> df.groupby("a", sort=True).mean().head() + b +a +2 643.75 +6 48.00 +7 631.00 +9 906.00 +10 640.00 +``` + +## Column names + +Unlike Pandas, cuDF does not support duplicate column names. 
+It is best to use strings for column names. + +## No true `"object"` data type + +In Pandas and NumPy, the `"object"` data type is used for +collections of arbitrary Python objects. For example, in Pandas you +can do the following: + +```{code} python +>>> import pandas as pd +>>> s = pd.Series(["a", 1, [1, 2, 3]]) +>>> s +0 a +1 1 +2 [1, 2, 3] +dtype: object +``` + +For compatibility with Pandas, cuDF reports the data type for strings +as `"object"`, but we do *not* support storing or operating on +collections of arbitrary Python objects. + +## `.apply()` function limitations + +The `.apply()` function in Pandas accepts a user-defined function +(UDF) that can include arbitrary operations that are applied to each +value of a `Series`, `DataFrame`, or in the case of a groupby, +each group. cuDF also supports `apply()`, but it relies on Numba to +JIT compile the UDF and execute it on the GPU. This can be extremely +fast, but imposes a few limitations on what operations are allowed in +the UDF. See the docs on [UDFs](guide-to-udfs) for details.
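
The Pandas half of the comparisons in `pandas-comparison.md` above can be checked on any machine without a GPU. A minimal sketch (the toy data here is hypothetical, chosen only to illustrate the two behaviors the page describes):

```python
import pandas as pd

# Mixed Python objects in a Pandas Series fall back to the "object" dtype --
# the catch-all behavior cuDF deliberately does not replicate.
s = pd.Series(["a", 1, [1, 2, 3]])
assert s.dtype == object

# Pandas groupby sorts group keys by default (sort=True), which is why the
# page tells you to pass sort=True to cuDF's groupby to match Pandas output.
df = pd.DataFrame({"a": [3, 1, 3, 2], "b": [10.0, 20.0, 30.0, 40.0]})
assert df.groupby("a").mean().index.tolist() == [1, 2, 3]          # sorted keys
assert df.groupby("a", sort=False).mean().index.tolist() == [3, 1, 2]  # order of appearance
```

The second pair of assertions mirrors the "Result ordering" section: cuDF's default matches Pandas' `sort=False`, not Pandas' default.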