From dae46b1249034a079dad6e86437fc7f077a7c076 Mon Sep 17 00:00:00 2001 From: brandon-b-miller Date: Wed, 6 Apr 2022 09:30:59 -0700 Subject: [PATCH 1/3] update docs --- .../source/user_guide/guide-to-udfs.ipynb | 1911 +++++++++-------- 1 file changed, 968 insertions(+), 943 deletions(-) diff --git a/docs/cudf/source/user_guide/guide-to-udfs.ipynb b/docs/cudf/source/user_guide/guide-to-udfs.ipynb index 215d11cdbb8..bb2e7a2626a 100644 --- a/docs/cudf/source/user_guide/guide-to-udfs.ipynb +++ b/docs/cudf/source/user_guide/guide-to-udfs.ipynb @@ -7,13 +7,22 @@ "# Overview of User Defined Functions with cuDF" ] }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import cudf" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame style operators. While out of the box functions are flexible and useful, it is sometimes necessary to write custom code, or user-defined functions (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.\n", "\n", - "In conjunction with the broader GPU PyData ecosystem, cuDF provides interfaces to run UDFs on a variety of data structures. Currently, we can only execute UDFs on numeric and Boolean typed data (support for strings is being planned). This guide covers writing and executing UDFs on the following data structures:\n", + "In conjunction with the broader GPU PyData ecosystem, cuDF provides interfaces to run UDFs on a variety of data structures. Currently, we can only execute UDFs on numeric, boolean, datetime, and timedelta typed data (support for strings is being planned). 
This guide covers writing and executing UDFs on the following data structures:\n", "\n", "- Series\n", "- DataFrame\n", @@ -46,19 +55,131 @@ "source": [ "## Series UDFs\n", "\n", - "You can execute UDFs on Series in two ways:\n", + "You can execute UDFs on Series in three ways:\n", "\n", - "- Writing a standard Python function and using `applymap`\n", + "- Writing a standard python function and using `cudf.Series.apply` (recommended)\n", + "- Writing a standard Python function and using `applymap` (deprecated)\n", "- Writing a Numba kernel and using Numba's `forall` syntax\n", "\n", - "Using `applymap` is simpler, but writing a Numba kernel offers the flexibility to build more complex functions (we'll be writing only simple kernels in this guide).\n", - "\n", - "Let's start by importing a few libraries and creating a DataFrame of several Series." + "Using `apply` or `applymap` is simpler, but writing a Numba kernel offers the flexibility to build more complex functions (we'll be writing only simple kernels in this guide)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `cudf.Series.apply`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "cuDF provides a similar API to `pandas.Series.apply` for applying scalar UDFs to series objects. These UDFs have generalized null handling and are slightly more flexible than those that work with `applymap`. 
Here is a very simple example:" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Create a cuDF series\n", + "sr = cudf.Series([1, cudf.NA, 3])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "UDFs destined for `cudf.Series.apply` might look something like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# define a scalar function\n", + "def f(x):\n", + " if x is cudf.NA:\n", + " return 42\n", + " else:\n", + " return 2**x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`cudf.Series.apply` is called like `pd.Series.apply` and returns a new `Series` object:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2\n", + "1 42\n", + "2 8\n", + "dtype: int64" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sr.apply(f)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2\n", + "1 42\n", + "2 8\n", + "dtype: int64" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Check the pandas result\n", + "sr.to_pandas(nullable=True).apply(f)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `cudf.Series.applymap` (deprecated)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`cudf.Series.applymap` originally played a similar role to `cudf.Series.apply` in legacy version of cuDF and is now deprecated. Its main difference is there is no explicit null handling. Functions are written the same way, but can't interact with the `cudf.NA` null value. 
In fact this API assumes that if an input value is null, the output value is also null, regardless of the logic inside the function. Let's look at a simple example." + ] + }, + { + "cell_type": "code", + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -131,15 +252,13 @@ "4 -0.970850 False Sarah" ] }, - "execution_count": 1, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", - "\n", - "import cudf\n", "from cudf.datasets import randomdata \n", "\n", "df = randomdata(nrows=10, dtypes={'a':float, 'b':bool, 'c':str}, seed=12)\n", @@ -155,7 +274,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -168,9 +287,17 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 8, "metadata": {}, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "cudf/core/series.py:2219: FutureWarning: Series.applymap is deprecated and will be removed in a future cuDF release. Use Series.apply instead.\n", + " warnings.warn(\n" + ] + }, { "data": { "text/plain": [ @@ -187,7 +314,7 @@ "Name: a, dtype: float64" ] }, - "execution_count": 3, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -200,16 +327,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "That's all there is to it. For more complex UDFs, though, we'd want to write an actual Numba kernel.\n", + "### Lower level control with custom `numba` kernels" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Many problems in data science and engineering are well studied and there exist known parallel algorithms for making some desired transformation to some data. Many have corresponding CUDA solutions that may not exist as column level API in cuDF. To expose the ability to use these custom kernels, cuDF supports directly using custom cuda kernels written using `numba` on cuDF `Series` objects. 
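The essential idea, one worker per element, can be previewed in plain Python before any CUDA specifics (a serial CPU stand-in for illustration only, not Numba code):

```python
# CPU stand-in for a 1D Numba CUDA kernel. On the GPU each thread
# computes one index i = cuda.grid(1) in parallel; here a serial
# loop plays the role of every "thread" in turn.
def multiply_kernel_cpu(in_col, out_col, multiplier):
    for i in range(len(in_col)):  # i stands in for cuda.grid(1)
        out_col[i] = in_col[i] * multiplier

data = [1.0, 2.0, 3.0]
out = [0.0] * len(data)
multiply_kernel_cpu(data, out, 10.0)
print(out)  # [10.0, 20.0, 30.0]
```

The real kernel replaces the loop with a grid of threads, each handling one index; the body of the loop is essentially what each thread executes.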
In short, this means that if a user has knowledge of how to write a CUDA kernel in numba, they may simply pass cuDF `Series` objects to that kernel as if they were numba device arrays. Let's look at a basic example of how to do this.\n", "\n", - "For more complex logic (for instance, accessing values from multiple input columns or rows, you'll need to use a more complex API. There are several types. First we'll cover writing and running a Numba JITed CUDA kernel.\n", + "Note that this section requires basic CUDA knowledge. Refer to [numba's CUDA documentation](https://numba.pydata.org/numba-doc/latest/cuda/index.html) for details.\n", "\n", - "The easiest way to write a Numba kernel is to use `cuda.grid(1)` to manage our thread indices, and then leverage Numba's `forall` method to configure the kernel for us. Below, define a basic multiplication kernel as an example and use `@cuda.jit` to compile it." + "The easiest way to write a Numba kernel is to use `cuda.grid(1)` to manage thread indices, and then leverage Numba's `forall` method to configure the kernel for us. Below, define a basic multiplication kernel as an example and use `@cuda.jit` to compile it." ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -233,7 +367,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -251,7 +385,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -330,7 +464,7 @@ "4 -0.970850 False Sarah -9.708501" ] }, - "execution_count": 6, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -343,7 +477,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that, while we're operating on the Series `df['e']`, the kernel executes on the [DeviceNDArray](https://numba.pydata.org/numba-doc/dev/cuda/memory.html#device-arrays) \\\"underneath\\\" the Series. 
If you ever need to access the underlying DeviceNDArray of a Series, you can do so with `Series.data.mem`. We'll use this during an example in the Null Handling section of this guide." + "This API allows a user to theoretically write arbitrary kernel logic, potentially accessing and using elements of the series at arbitrary indices and use them on cuDF data structures. Advanced developers with some CUDA experience can often use this capability to implement iterative transformations, or spot treat problem areas of a data pipeline with a custom kernel that does the same job faster." ] }, { @@ -352,43 +486,47 @@ "source": [ "## DataFrame UDFs\n", "\n", - "We could apply a UDF on a DataFrame like we did above with `forall`. We'd need to write a kernel that expects multiple inputs, and pass multiple Series as arguments when we execute our kernel. Because this is fairly common and can be difficult to manage, cuDF provides two APIs to streamline this: `apply_rows` and `apply_chunks`. Below, we walk through an example of using `apply_rows`. `apply_chunks` works in a similar way, but also offers more control over low-level kernel behavior.\n", + "Like `cudf.Series`, there are multiple ways of using UDFs on dataframes, which essentially amount to UDFs that expect multiple columns as input:\n", "\n", - "Now that we have two numeric columns in our DataFrame, let's write a kernel that uses both of them." 
+ "- `cudf.DataFrame.apply`, which functions like `pd.DataFrame.apply` and expects a row udf\n", + "- `cudf.DataFrame.apply_rows`, which is a thin wrapper around numba and expects a numba kernel\n", + "- `cudf.DataFrame.apply_chunks`, which is similar to `cudf.DataFrame.apply_rows` but offers lower level control.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# `cudf.DataFrame.apply`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`cudf.DataFrame.apply` is the main entrypoint for UDFs that expect multiple columns as input and produce a single output column. Functions intended to be consumed by this API are written in terms of a \"row\" argument. The \"row\" is considered to be like a dictionary and contains all of the column values at a certain `iloc` in a `DataFrame`. The function can access these values by key within the function, the keys being the column names corresponding to the desired value. Below is an example function that would be used to add column `A` and column `B` together inside a UDF." ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ - "def conditional_add(x, y, out):\n", - " for i, (a, e) in enumerate(zip(x, y)):\n", - " if a > 0:\n", - " out[i] = a + e\n", - " else:\n", - " out[i] = a" + "def f(row):\n", + " return row['A'] + row['B']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Notice that we need to `enumerate` through our `zipped` function arguments (which either match or are mapped to our input column names). We can pass this kernel to `apply_rows`. We'll need to specify a few arguments:\n", - "- incols\n", - " - A list of names of input columns that match the function arguments. Or, a dictionary mapping input column names to their corresponding function arguments such as `{'col1': 'arg1'}`.\n", - "- outcols\n", - " - A dictionary defining our output column names and their data types. 
These names must match our function arguments.\n", - "- kwargs (optional)\n", - " - We can optionally pass keyword arguments as a dictionary. Since we don't need any, we pass an empty one.\n", - " \n", - "While it looks like our function is looping sequentially through our columns, it actually executes in parallel in multiple threads on the GPU. This parallelism is the heart of GPU-accelerated computing. With that background, we're ready to use our UDF." + "Let's create some very basic toy data containing at least one null." ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -412,214 +550,227 @@ " \n", " \n", " \n", - " a\n", - " b\n", - " c\n", - " e\n", - " out\n", + " A\n", + " B\n", " \n", " \n", " \n", " \n", " 0\n", - " -0.691674\n", - " True\n", - " Dan\n", - " -6.916743\n", - " -0.691674\n", + " 1\n", + " 4\n", " \n", " \n", " 1\n", - " 0.480099\n", - " False\n", - " Bob\n", - " 4.800994\n", - " 5.281093\n", + " 2\n", + " <NA>\n", " \n", " \n", " 2\n", - " -0.473370\n", - " True\n", - " Xavier\n", - " -4.733700\n", - " -0.473370\n", - " \n", - " \n", - " 3\n", - " 0.067479\n", - " True\n", - " Alice\n", - " 0.674788\n", - " 0.742267\n", - " \n", - " \n", - " 4\n", - " -0.970850\n", - " False\n", - " Sarah\n", - " -9.708501\n", - " -0.970850\n", + " 3\n", + " 6\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a b c e out\n", - "0 -0.691674 True Dan -6.916743 -0.691674\n", - "1 0.480099 False Bob 4.800994 5.281093\n", - "2 -0.473370 True Xavier -4.733700 -0.473370\n", - "3 0.067479 True Alice 0.674788 0.742267\n", - "4 -0.970850 False Sarah -9.708501 -0.970850" + " A B\n", + "0 1 4\n", + "1 2 \n", + "2 3 6" ] }, - "execution_count": 8, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df = df.apply_rows(conditional_add, \n", - " incols={'a':'x', 'e':'y'},\n", - " outcols={'out': np.float64},\n", - " kwargs={}\n", - " )\n", - "df.head()" - ] - }, - { - 
"cell_type": "markdown", - "metadata": {}, - "source": [ - "As expected, we see our conditional addition worked. At this point, we've successfully executed UDFs on the core data structures of cuDF." + "df = cudf.DataFrame({\n", + " 'A': [1,2,3],\n", + " 'B': [4,cudf.NA,6]\n", + "})\n", + "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Rolling Window UDFs\n", - "\n", - "For time-series data, we may need to operate on a small \\\"window\\\" of our column at a time, processing each portion independently. We could slide (\\\"roll\\\") this window over the entire column to answer questions like \\\"What is the 3-day moving average of a stock price over the past year?\"\n", - "\n", - "We can apply more complex functions to rolling windows to `rolling` Series and DataFrames using `apply`. This example is adapted from cuDF's [API documentation](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.dataframe.DataFrame.rolling). First, we'll create an example Series and then create a `rolling` object from the Series." 
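Conceptually, the "row" argument behaves like a mapping from column name to that row's value, so the same UDF works on an ordinary Python dict (illustration only; cuDF compiles the function with Numba rather than calling it row by row in Python):

```python
# The row UDF from above, applied to a plain dict that stands in
# for one row of the DataFrame (keys are the column names).
def f(row):
    return row['A'] + row['B']

print(f({'A': 1, 'B': 4}))  # 5
```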
+ "Finally call the function as you would in pandas - by using a lambda function to map the UDF onto \"rows\" of the DataFrame: " ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0 16.0\n", - "1 25.0\n", - "2 36.0\n", - "3 49.0\n", - "4 64.0\n", - "5 81.0\n", - "dtype: float64" - ] - }, - "execution_count": 9, - "metadata": {}, + "0 5\n", + "1 \n", + "2 9\n", + "dtype: int64" + ] + }, + "execution_count": 14, + "metadata": {}, "output_type": "execute_result" } ], "source": [ - "ser = cudf.Series([16, 25, 36, 49, 64, 81], dtype='float64')\n", - "ser" + "df.apply(f, axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The same function should produce the same result as pandas:" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "Rolling [window=3,min_periods=3,center=False]" + "0 5\n", + "1 \n", + "2 9\n", + "dtype: object" ] }, - "execution_count": 10, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "rolling = ser.rolling(window=3, min_periods=3, center=False)\n", - "rolling" + "df.to_pandas(nullable=True).apply(f, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Next, we'll define a function to use on our rolling windows. We created this one to highlight how you can include things like loops, mathematical functions, and conditionals. Rolling window UDFs do not yet support null values." + "Notice that Pandas returns `object` dtype - see notes on this in the caveats section." 
] }, { - "cell_type": "code", - "execution_count": 11, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "import math\n", - "\n", - "def example_func(window):\n", - " b = 0\n", - " for a in window:\n", - " b = max(b, math.sqrt(a))\n", - " if b == 8:\n", - " return 100 \n", - " return b" + "Like `cudf.Series.apply`, these functions support generalized null handling. Here's a function that conditionally returns a different value if a certain input is null:" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 16, "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
a
01
1<NA>
23
\n", + "
" + ], + "text/plain": [ + " a\n", + "0 1\n", + "1 \n", + "2 3" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "We can execute the function by passing it to `apply`. With `window=3`, `min_periods=3`, and `center=False`, our first two values are `null`." + "def f(row):\n", + " x = row['a']\n", + " if x is cudf.NA:\n", + " return 0\n", + " else:\n", + " return x + 1\n", + "\n", + "df = cudf.DataFrame({'a': [1, cudf.NA, 3]})\n", + "df" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0 \n", - "1 \n", - "2 6.0\n", - "3 7.0\n", - "4 100.0\n", - "5 9.0\n", - "dtype: float64" + "0 2\n", + "1 0\n", + "2 4\n", + "dtype: int64" ] }, - "execution_count": 12, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "rolling.apply(example_func)" + "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We can apply this function to every column in a DataFrame, too." + "`cudf.NA` can also be directly returned from a function resulting in data that has the the correct nulls in the end, just as if it were run in Pandas. 
For the following data, the last row fulfills the condition that `1 + 3 > 3` and returns `NA` for that row:" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 18, "metadata": {}, "outputs": [ { @@ -650,57 +801,84 @@ " \n", " \n", " 0\n", - " 55.0\n", - " 55.0\n", + " 1\n", + " 2\n", " \n", " \n", " 1\n", - " 56.0\n", - " 56.0\n", + " 2\n", + " 1\n", " \n", " \n", " 2\n", - " 57.0\n", - " 57.0\n", - " \n", - " \n", - " 3\n", - " 58.0\n", - " 58.0\n", - " \n", - " \n", - " 4\n", - " 59.0\n", - " 59.0\n", + " 3\n", + " 1\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a b\n", - "0 55.0 55.0\n", - "1 56.0 56.0\n", - "2 57.0 57.0\n", - "3 58.0 58.0\n", - "4 59.0 59.0" + " a b\n", + "0 1 2\n", + "1 2 1\n", + "2 3 1" ] }, - "execution_count": 13, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df2 = cudf.DataFrame()\n", - "df2['a'] = np.arange(55, 65, dtype='float64')\n", - "df2['b'] = np.arange(55, 65, dtype='float64')\n", - "df2.head()" + "def f(row):\n", + " x = row['a']\n", + " y = row['b']\n", + " if x + y > 3:\n", + " return cudf.NA\n", + " else:\n", + " return x + y\n", + "\n", + "df = cudf.DataFrame({\n", + " 'a': [1, 2, 3], \n", + " 'b': [2, 1, 1]\n", + "})\n", + "df" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 3\n", + "1 3\n", + "2 \n", + "dtype: int64" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.apply(f, axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Mixed types are allowed, but will return the common type, rather than object as in Pandas. 
Here's a null aware op between an int and a float column:" + ] + }, + { + "cell_type": "code", + "execution_count": 20, "metadata": {}, "outputs": [ { @@ -731,96 +909,92 @@ " \n", " \n", " 0\n", - " <NA>\n", - " <NA>\n", + " 1\n", + " 0.5\n", " \n", " \n", " 1\n", - " <NA>\n", + " 2\n", " <NA>\n", " \n", " \n", " 2\n", - " 7.549834435\n", - " 7.549834435\n", - " \n", - " \n", - " 3\n", - " 7.615773106\n", - " 7.615773106\n", - " \n", - " \n", - " 4\n", - " 7.681145748\n", - " 7.681145748\n", - " \n", - " \n", - " 5\n", - " 7.745966692\n", - " 7.745966692\n", - " \n", - " \n", - " 6\n", - " 7.810249676\n", - " 7.810249676\n", - " \n", - " \n", - " 7\n", - " 7.874007874\n", - " 7.874007874\n", - " \n", - " \n", - " 8\n", - " 7.937253933\n", - " 7.937253933\n", - " \n", - " \n", - " 9\n", - " 100.0\n", - " 100.0\n", + " 3\n", + " 3.14\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a b\n", - "0 \n", - "1 \n", - "2 7.549834435 7.549834435\n", - "3 7.615773106 7.615773106\n", - "4 7.681145748 7.681145748\n", - "5 7.745966692 7.745966692\n", - "6 7.810249676 7.810249676\n", - "7 7.874007874 7.874007874\n", - "8 7.937253933 7.937253933\n", - "9 100.0 100.0" + " a b\n", + "0 1 0.5\n", + "1 2 \n", + "2 3 3.14" ] }, - "execution_count": 14, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "rolling = df2.rolling(window=3, min_periods=3, center=False)\n", - "rolling.apply(example_func)" + "def f(row):\n", + " return row['a'] + row['b']\n", + "\n", + "df = cudf.DataFrame({\n", + " 'a': [1, 2, 3], \n", + " 'b': [0.5, cudf.NA, 3.14]\n", + "})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1.5\n", + "1 \n", + "2 6.14\n", + "dtype: float64" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 
GroupBy DataFrame UDFs\n", - "\n", - "We can also apply UDFs to grouped DataFrames using `apply_grouped`. This example is also drawn and adapted from the RAPIDS [API documentation](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.groupby.groupby.GroupBy.apply_grouped).\n", + "Functions may also return scalar values, however the result will be promoted to a safe type regardless of the data. This means even if you have a function like:\n", "\n", - "First, we'll group our DataFrame based on column `b`, which is either True or False. Note that we currently need to pass `method=\"cudf\"` to use UDFs with GroupBy objects." + "```python\n", + "def f(x):\n", + " if x > 1000:\n", + " return 1.5\n", + " else:\n", + " return 2\n", + "```\n", + "And your data is:\n", + "```python\n", + "[1,2,3,4,5]\n", + "```\n", + "You will get floats in the final data even though a float is never returned. This is because Numba ultimately needs to produce one function that can handle any data, which means if there's any possibility a float could result, you must always assume it will happen. 
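The promotion can be reasoned about with NumPy's common-type rules, which agree with the type unification Numba performs for this case (a plain-NumPy illustration, not cuDF code):

```python
import numpy as np

def f(x):
    # one branch yields a float, the other an int
    if x > 1000:
        return 1.5
    return 2

# Numba compiles a single function whose return type must cover
# both branches, so the result column is promoted to the common
# type of float64 and int64:
print(np.result_type(np.float64, np.int64))  # float64
```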
Here's an example of a function that returns a scalar in some cases:" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 22, "metadata": {}, "outputs": [ { @@ -845,121 +1019,84 @@ " \n", " \n", " a\n", - " b\n", - " c\n", - " e\n", - " out\n", " \n", " \n", " \n", " \n", " 0\n", - " -0.691674\n", - " True\n", - " Dan\n", - " -6.916743\n", - " -0.691674\n", + " 1\n", " \n", " \n", " 1\n", - " 0.480099\n", - " False\n", - " Bob\n", - " 4.800994\n", - " 5.281093\n", + " 3\n", " \n", " \n", " 2\n", - " -0.473370\n", - " True\n", - " Xavier\n", - " -4.733700\n", - " -0.473370\n", - " \n", - " \n", - " 3\n", - " 0.067479\n", - " True\n", - " Alice\n", - " 0.674788\n", - " 0.742267\n", - " \n", - " \n", - " 4\n", - " -0.970850\n", - " False\n", - " Sarah\n", - " -9.708501\n", - " -0.970850\n", + " 5\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a b c e out\n", - "0 -0.691674 True Dan -6.916743 -0.691674\n", - "1 0.480099 False Bob 4.800994 5.281093\n", - "2 -0.473370 True Xavier -4.733700 -0.473370\n", - "3 0.067479 True Alice 0.674788 0.742267\n", - "4 -0.970850 False Sarah -9.708501 -0.970850" + " a\n", + "0 1\n", + "1 3\n", + "2 5" ] }, - "execution_count": 15, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "grouped = df.groupby(['b'])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next we'll define a function to apply to each group independently. In this case, we'll take the rolling average of column `e`, and call that new column `rolling_avg_e`." 
+ "def f(row):\n", + " x = row['a']\n", + " if x > 3:\n", + " return x\n", + " else:\n", + " return 1.5\n", + "\n", + "df = cudf.DataFrame({\n", + " 'a': [1, 3, 5]\n", + "})\n", + "df" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 23, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "0 1.5\n", + "1 1.5\n", + "2 5.0\n", + "dtype: float64" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "def rolling_avg(e, rolling_avg_e):\n", - " win_size = 3\n", - " for i in range(cuda.threadIdx.x, len(e), cuda.blockDim.x):\n", - " if i < win_size - 1:\n", - " # If there is not enough data to fill the window,\n", - " # take the average to be NaN\n", - " rolling_avg_e[i] = np.nan\n", - " else:\n", - " total = 0\n", - " for j in range(i - win_size + 1, i + 1):\n", - " total += e[j]\n", - " rolling_avg_e[i] = total / win_size" + "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We can execute this with a very similar API to `apply_rows`. This time, though, it's going to execute independently for each group." 
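The arithmetic of such a multi-column expression can be checked on plain scalars first (illustration only, using values from the example that follows):

```python
def combine(a, b, c, d, e):
    # same expression as the row UDF, written over plain scalars
    return a + (b - (c / d)) % e

# a=2, b=5, c=4, d=7, e=1: 2 + (5 - 4/7) % 1
print(round(combine(2, 5, 4, 7, 1), 9))  # 2.428571429
# a=3, b=6, c=4, d=8, e=6: 3 + (6 - 0.5) % 6
print(combine(3, 6, 4, 8, 6))  # 8.5
```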
+ "Any number of columns and many arithmetic operators are supported, allowing for complex UDFs:" ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 24, "metadata": {}, "outputs": [ { @@ -986,259 +1123,240 @@ " a\n", " b\n", " c\n", + " d\n", " e\n", - " out\n", - " rolling_avg_e\n", " \n", " \n", " \n", " \n", - " 1\n", - " 0.480099\n", - " False\n", - " Bob\n", - " 4.800994\n", - " 5.281093\n", - " NaN\n", - " \n", - " \n", - " 4\n", - " -0.970850\n", - " False\n", - " Sarah\n", - " -9.708501\n", - " -0.970850\n", - " NaN\n", - " \n", - " \n", - " 6\n", - " 0.801430\n", - " False\n", - " Sarah\n", - " 8.014297\n", - " 8.815727\n", - " 1.035597\n", - " \n", - " \n", - " 7\n", - " -0.933157\n", - " False\n", - " Quinn\n", - " -9.331571\n", - " -0.933157\n", - " -3.675258\n", - " \n", - " \n", " 0\n", - " -0.691674\n", - " True\n", - " Dan\n", - " -6.916743\n", - " -0.691674\n", - " NaN\n", - " \n", - " \n", - " 2\n", - " -0.473370\n", - " True\n", - " Xavier\n", - " -4.733700\n", - " -0.473370\n", - " NaN\n", - " \n", - " \n", - " 3\n", - " 0.067479\n", - " True\n", - " Alice\n", - " 0.674788\n", - " 0.742267\n", - " -3.658552\n", + " 1\n", + " 4\n", + " <NA>\n", + " 8\n", + " 7\n", " \n", " \n", - " 5\n", - " 0.837494\n", - " True\n", - " Wendy\n", - " 8.374940\n", - " 9.212434\n", - " 1.438676\n", + " 1\n", + " 2\n", + " 5\n", + " 4\n", + " 7\n", + " 1\n", " \n", " \n", - " 8\n", - " 0.913899\n", - " True\n", - " Ursula\n", - " 9.138987\n", - " 10.052885\n", - " 6.062905\n", - " \n", - " \n", - " 9\n", - " -0.725581\n", - " True\n", - " George\n", - " -7.255814\n", - " -0.725581\n", - " 3.419371\n", + " 2\n", + " 3\n", + " 6\n", + " 4\n", + " 8\n", + " 6\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a b c e out rolling_avg_e\n", - "1 0.480099 False Bob 4.800994 5.281093 NaN\n", - "4 -0.970850 False Sarah -9.708501 -0.970850 NaN\n", - "6 0.801430 False Sarah 8.014297 8.815727 1.035597\n", - "7 -0.933157 False Quinn -9.331571 -0.933157 
-3.675258\n", - "0 -0.691674 True Dan -6.916743 -0.691674 NaN\n", - "2 -0.473370 True Xavier -4.733700 -0.473370 NaN\n", - "3 0.067479 True Alice 0.674788 0.742267 -3.658552\n", - "5 0.837494 True Wendy 8.374940 9.212434 1.438676\n", - "8 0.913899 True Ursula 9.138987 10.052885 6.062905\n", - "9 -0.725581 True George -7.255814 -0.725581 3.419371" + " a b c d e\n", + "0 1 4 8 7\n", + "1 2 5 4 7 1\n", + "2 3 6 4 8 6" ] }, - "execution_count": 18, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "results = grouped.apply_grouped(rolling_avg,\n", - " incols=['e'],\n", - " outcols=dict(rolling_avg_e=np.float64))\n", - "results" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice how, with a window size of three in the kernel, the first two values in each group for our output column are null." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Numba Kernels on CuPy Arrays\n", + "def f(row):\n", + " return row['a'] + (row['b'] - (row['c'] / row['d'])) % row['e']\n", "\n", - "We can also execute Numba kernels on CuPy NDArrays, again thanks to the `__cuda_array_interface__`. We can even run the same UDF on the Series and the CuPy array. First, we define a Series and then create a CuPy array from that Series." 
+ "df = cudf.DataFrame({\n", + " 'a': [1, 2, 3],\n", + " 'b': [4, 5, 6],\n", + " 'c': [cudf.NA, 4, 4],\n", + " 'd': [8, 7, 8],\n", + " 'e': [7, 1, 6]\n", + "})\n", + "df" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "array([ 1., 2., 3., 4., 10.])" + "0 \n", + "1 2.428571429\n", + "2 8.5\n", + "dtype: float64" ] }, - "execution_count": 19, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "import cupy as cp\n", - "\n", - "s = cudf.Series([1.0, 2, 3, 4, 10])\n", - "arr = cp.asarray(s)\n", - "arr" + "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Next, we define a UDF and execute it on our Series. We need to allocate a Series of the same size for our output, which we'll call `out`." + "# Numba kernels for DataFrames" ] }, { - "cell_type": "code", - "execution_count": 20, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 5\n", - "1 10\n", - "2 15\n", - "3 20\n", - "4 50\n", - "dtype: int32" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "from cudf.utils import cudautils\n", "\n", - "@cuda.jit\n", - "def multiply_by_5(x, out):\n", - " i = cuda.grid(1)\n", - " if i < x.size:\n", - " out[i] = x[i] * 5\n", - " \n", - "out = cudf.Series(cp.zeros(len(s), dtype='int32'))\n", - "multiply_by_5.forall(s.shape[0])(s, out)\n", - "out" + "We could apply a UDF on a DataFrame like we did above with `forall`. We'd need to write a kernel that expects multiple inputs, and pass multiple Series as arguments when we execute our kernel. Because this is fairly common and can be difficult to manage, cuDF provides two APIs to streamline this: `apply_rows` and `apply_chunks`. Below, we walk through an example of using `apply_rows`. 
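The row-wise pattern these kernels implement can be previewed as a plain-Python loop over hypothetical data (a CPU reference only; on the GPU the iterations are distributed across threads rather than run serially):

```python
# CPU reference for a conditional-add kernel of the kind apply_rows
# consumes: out[i] = x[i] + y[i] when x[i] > 0, else x[i].
def conditional_add_cpu(x, y):
    out = []
    for a, e in zip(x, y):
        out.append(a + e if a > 0 else a)
    return out

print(conditional_add_cpu([1.0, -2.0, 3.0], [10.0, 20.0, 30.0]))
# [11.0, -2.0, 33.0]
```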
`apply_chunks` works in a similar way, but also offers more control over low-level kernel behavior.\n", + "\n", + "Now that we have two numeric columns in our DataFrame, let's write a kernel that uses both of them." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "def conditional_add(x, y, out):\n", + " for i, (a, e) in enumerate(zip(x, y)):\n", + " if a > 0:\n", + " out[i] = a + e\n", + " else:\n", + " out[i] = a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Finally, we execute the same function on our array. We allocate an empty array `out` to store our results." + "Notice that we need to `enumerate` through our `zipped` function arguments (which either match or are mapped to our input column names). We can pass this kernel to `apply_rows`. We'll need to specify a few arguments:\n", + "- incols\n", + " - A list of names of input columns that match the function arguments. Or, a dictionary mapping input column names to their corresponding function arguments such as `{'col1': 'arg1'}`.\n", + "- outcols\n", + " - A dictionary defining our output column names and their data types. These names must match our function arguments.\n", + "- kwargs (optional)\n", + " - We can optionally pass keyword arguments as a dictionary. Since we don't need any, we pass an empty one.\n", + " \n", + "While it looks like our function is looping sequentially through our columns, it actually executes in parallel in multiple threads on the GPU. This parallelism is the heart of GPU-accelerated computing. With that background, we're ready to use our UDF." ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 27, "metadata": {}, "outputs": [ { "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abcdeout
014<NA>878.0
1254713.0
2364869.0
\n", + "
" + ], "text/plain": [ - "array([ 5., 10., 15., 20., 50.])" + " a b c d e out\n", + "0 1 4 8 7 8.0\n", + "1 2 5 4 7 1 3.0\n", + "2 3 6 4 8 6 9.0" ] }, - "execution_count": 21, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "out = cp.empty_like(arr)\n", - "multiply_by_5.forall(arr.size)(arr, out)\n", - "out" + "df = df.apply_rows(conditional_add, \n", + " incols={'a':'x', 'e':'y'},\n", + " outcols={'out': np.float64},\n", + " kwargs={}\n", + " )\n", + "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Null Handling in UDFs\n", - "\n", - "Above, we covered most basic usage of UDFs with cuDF.\n", - "\n", - "The remainder of the guide focuses on considerations for executing UDFs on DataFrames containing null values. If your UDFs will read or write any column containing nulls, **you should read this section carefully**. \n", + "As expected, we see our conditional addition worked. At this point, we've successfully executed UDFs on the core data structures of cuDF." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Null Handling in `apply_rows` and `apply_chunks`\n", "\n", - "Writing UDFs that can handle null values is complicated by the fact that a separate bitmask is used to identify when a value is valid and when it's null. By default, DataFrame methods for applying UDFs like `apply_rows` will handle nulls pessimistically (all rows with a null value will be removed from the output if they are used in the kernel). Exploring how not handling not pessimistically can lead to undefined behavior is outside the scope of this guide. Suffice it to say, pessimistic null handling is the safe and consistent approach. You can see an example below." + "By default, DataFrame methods for applying UDFs like `apply_rows` will handle nulls pessimistically (all rows with a null value will be removed from the output if they are used in the kernel). 
Exploring how handling nulls non-pessimistically can lead to undefined behavior is outside the scope of this guide. Suffice it to say, pessimistic null handling is the safe and consistent approach. You can see an example below."
 ]
 },
 {
 "cell_type": "code",
- "execution_count": 22,
+ "execution_count": 28,
 "metadata": {},
 "outputs": [
 {
@@ -1311,7 +1429,7 @@
 "4 979 982 1011"
 ]
 },
- "execution_count": 22,
+ "execution_count": 28,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
@@ -1337,7 +1455,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 23,
+ "execution_count": 29,
 "metadata": {},
 "outputs": [
 {
@@ -1416,7 +1534,7 @@
 "4 979 982 1011 1961.0"
 ]
 },
- "execution_count": 23,
+ "execution_count": 29,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
@@ -1440,189 +1558,128 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "## Generalized NA Support"
+ "## Rolling Window UDFs\n",
+ "\n",
+ "For time-series data, we may need to operate on a small \\\"window\\\" of our column at a time, processing each portion independently. We could slide (\\\"roll\\\") this window over the entire column to answer questions like \\\"What is the 3-day moving average of a stock price over the past year?\\\"\n",
+ "\n",
+ "We can apply more complex functions to the rolling windows of Series and DataFrames by calling `apply` on a `rolling` object. This example is adapted from cuDF's [API documentation](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.dataframe.DataFrame.rolling). First, we'll create an example Series and then create a `rolling` object from the Series." 
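For the simple moving-average question posed above, no UDF is needed at all; a built-in rolling aggregation covers it. A minimal sketch using pandas, whose `rolling` API cuDF mirrors (the five-day `prices` series is an assumption for illustration):

```python
import pandas as pd

# five days of prices; the question above asks for a 3-day moving average
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# with window=3 and min_periods=3, the first two entries have no full
# trailing window behind them and come out null
moving_avg = prices.rolling(window=3, min_periods=3).mean()
```

Custom rolling UDFs, covered next, follow the same windowing rules but let you replace `mean` with arbitrary logic.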
] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 16.0\n", + "1 25.0\n", + "2 36.0\n", + "3 49.0\n", + "4 64.0\n", + "5 81.0\n", + "dtype: float64" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ser = cudf.Series([16, 25, 36, 49, 64, 81], dtype='float64')\n", + "ser" + ] + }, + { + "cell_type": "code", + "execution_count": 31, "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Rolling [window=3,min_periods=3,center=False]" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "More general support for `NA` handling is provided on an experimental basis. Numba is used to translate a standard python function into an operation on the data columns and their masks, and then the reduced and optimized version of this function is runtime compiled and called using the data. \n", - "\n", - "One advantage of this approach apart from the ability to handle nulls generally in an intuitive manner is it results in a very familiar API to Pandas users. Let's see how this works with an example." + "rolling = ser.rolling(window=3, min_periods=3, center=False)\n", + "rolling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's create a simple example DataFrame for demonstrational purposes." + "Next, we'll define a function to use on our rolling windows. We created this one to highlight how you can include things like loops, mathematical functions, and conditionals. Rolling window UDFs do not yet support null values." ] }, { "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
AB
014
12<NA>
236
\n", - "
" - ], - "text/plain": [ - " A B\n", - "0 1 4\n", - "1 2 \n", - "2 3 6" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df = cudf.DataFrame({\n", - " 'A': [1,2,3],\n", - " 'B': [4,cudf.NA,6]\n", - "})\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The entrypoint for UDFs used in this manner is `cudf.DataFrame.apply`. To use it, start by defining a standard python function designed to accept a single dict-like row of the dataframe:" - ] - }, - { - "cell_type": "code", - "execution_count": 25, + "execution_count": 32, "metadata": {}, "outputs": [], "source": [ - "def f(row):\n", - " return row['A'] + row['B']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally call the function as you would in pandas - by using a lambda function to map the UDF onto \"rows\" of the DataFrame: " - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 5\n", - "1 \n", - "2 9\n", - "dtype: int64" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.apply(f, axis=1)" + "import math\n", + "\n", + "def example_func(window):\n", + " b = 0\n", + " for a in window:\n", + " b = max(b, math.sqrt(a))\n", + " if b == 8:\n", + " return 100 \n", + " return b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The same function should produce the same result as pandas:" + "We can execute the function by passing it to `apply`. With `window=3`, `min_periods=3`, and `center=False`, our first two values are `null`." 
] }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0 5\n", - "1 \n", - "2 9\n", - "dtype: object" + "0 \n", + "1 \n", + "2 6.0\n", + "3 7.0\n", + "4 100.0\n", + "5 9.0\n", + "dtype: float64" ] }, - "execution_count": 27, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df.to_pandas(nullable=True).apply(f, axis=1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that Pandas returns `object` dtype - see notes on this in the caveats section." + "rolling.apply(example_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This API supports UDFs that interact with nulls in more complex ways, and leverages the `cudf.NA` singleton object much in the same manner as Pandas, allowing for more flexible functions. As a basic example this function conditions on wether or not a value is `NA` and returns a scalar in that case:" + "We can apply this function to every column in a DataFrame, too." 
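The per-column behavior matches pandas: the UDF runs independently on every column's windows. A quick sketch with the same window settings, assuming a small two-column frame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
})

# the window function is applied to each column separately
out = df_demo.rolling(window=3, min_periods=3).apply(max)
```

As with a single Series, each column's first two rows are null because no full window exists for them yet.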
] }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 34, "metadata": {}, "outputs": [ { @@ -1647,82 +1704,63 @@ " \n", " \n", " a\n", + " b\n", " \n", " \n", " \n", " \n", " 0\n", - " 1\n", + " 55.0\n", + " 55.0\n", " \n", " \n", " 1\n", - " <NA>\n", + " 56.0\n", + " 56.0\n", " \n", " \n", " 2\n", - " 3\n", + " 57.0\n", + " 57.0\n", + " \n", + " \n", + " 3\n", + " 58.0\n", + " 58.0\n", + " \n", + " \n", + " 4\n", + " 59.0\n", + " 59.0\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a\n", - "0 1\n", - "1 \n", - "2 3" - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def f(row):\n", - " x = row['a']\n", - " if x is cudf.NA:\n", - " return 0\n", - " else:\n", - " return x + 1\n", - "\n", - "df = cudf.DataFrame({'a': [1, cudf.NA, 3]})\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 2\n", - "1 0\n", - "2 4\n", - "dtype: int64" + " a b\n", + "0 55.0 55.0\n", + "1 56.0 56.0\n", + "2 57.0 57.0\n", + "3 58.0 58.0\n", + "4 59.0 59.0" ] }, - "execution_count": 29, + "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df.apply(f, axis=1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`cudf.NA` can also be directly returned from a function resulting in data that has the the correct nulls in the end, just as if it were run in Pandas. 
For the following data, the last row fulfills the condition that `1 + 3 > 3` and returns `NA` for that row:" + "df2 = cudf.DataFrame()\n", + "df2['a'] = np.arange(55, 65, dtype='float64')\n", + "df2['b'] = np.arange(55, 65, dtype='float64')\n", + "df2.head()" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 35, "metadata": {}, "outputs": [ { @@ -1753,84 +1791,96 @@ " \n", " \n", " 0\n", - " 1\n", - " 2\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 1\n", - " 2\n", - " 1\n", + " <NA>\n", + " <NA>\n", " \n", " \n", " 2\n", - " 3\n", - " 1\n", + " 7.549834435\n", + " 7.549834435\n", " \n", - " \n", - "\n", - "" - ], - "text/plain": [ - " a b\n", - "0 1 2\n", - "1 2 1\n", - "2 3 1" - ] - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def f(row):\n", - " x = row['a']\n", - " y = row['b']\n", - " if x + y > 3:\n", - " return cudf.NA\n", - " else:\n", - " return x + y\n", - "\n", - "df = cudf.DataFrame({\n", - " 'a': [1, 2, 3], \n", - " 'b': [2, 1, 1]\n", - "})\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "data": { + " \n", + " 3\n", + " 7.615773106\n", + " 7.615773106\n", + " \n", + " \n", + " 4\n", + " 7.681145748\n", + " 7.681145748\n", + " \n", + " \n", + " 5\n", + " 7.745966692\n", + " 7.745966692\n", + " \n", + " \n", + " 6\n", + " 7.810249676\n", + " 7.810249676\n", + " \n", + " \n", + " 7\n", + " 7.874007874\n", + " 7.874007874\n", + " \n", + " \n", + " 8\n", + " 7.937253933\n", + " 7.937253933\n", + " \n", + " \n", + " 9\n", + " 100.0\n", + " 100.0\n", + " \n", + " \n", + "\n", + "" + ], "text/plain": [ - "0 3\n", - "1 3\n", - "2 \n", - "dtype: int64" + " a b\n", + "0 \n", + "1 \n", + "2 7.549834435 7.549834435\n", + "3 7.615773106 7.615773106\n", + "4 7.681145748 7.681145748\n", + "5 7.745966692 7.745966692\n", + "6 7.810249676 7.810249676\n", + "7 7.874007874 7.874007874\n", + "8 7.937253933 7.937253933\n", + "9 
100.0 100.0" ] }, - "execution_count": 31, + "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df.apply(f, axis=1)" + "rolling = df2.rolling(window=3, min_periods=3, center=False)\n", + "rolling.apply(example_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Mixed types are allowed, but will return the common type, rather than object as in Pandas. Here's a null aware op between an int and a float column:" + "## GroupBy DataFrame UDFs\n", + "\n", + "We can also apply UDFs to grouped DataFrames using `apply_grouped`. This example is also drawn and adapted from the RAPIDS [API documentation](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.groupby.groupby.GroupBy.apply_grouped).\n", + "\n", + "First, we'll group our DataFrame based on column `b`, which is either True or False. Note that we currently need to pass `method=\"cudf\"` to use UDFs with GroupBy objects." ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 36, "metadata": {}, "outputs": [ { @@ -1856,199 +1906,115 @@ " \n", " a\n", " b\n", + " c\n", + " e\n", " \n", " \n", " \n", " \n", " 0\n", - " 1\n", - " 0.5\n", + " -0.691674\n", + " True\n", + " Dan\n", + " -0.958380\n", " \n", " \n", " 1\n", - " 2\n", - " <NA>\n", + " 0.480099\n", + " False\n", + " Bob\n", + " -0.729580\n", " \n", " \n", " 2\n", - " 3\n", - " 3.14\n", + " -0.473370\n", + " True\n", + " Xavier\n", + " -0.767454\n", + " \n", + " \n", + " 3\n", + " 0.067479\n", + " True\n", + " Alice\n", + " -0.380205\n", + " \n", + " \n", + " 4\n", + " -0.970850\n", + " False\n", + " Sarah\n", + " 0.342905\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a b\n", - "0 1 0.5\n", - "1 2 \n", - "2 3 3.14" + " a b c e\n", + "0 -0.691674 True Dan -0.958380\n", + "1 0.480099 False Bob -0.729580\n", + "2 -0.473370 True Xavier -0.767454\n", + "3 0.067479 True Alice -0.380205\n", + "4 -0.970850 False Sarah 0.342905" ] }, - "execution_count": 32, + "execution_count": 36, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ - "def f(row):\n", - " return row['a'] + row['b']\n", - "\n", - "df = cudf.DataFrame({\n", - " 'a': [1, 2, 3], \n", - " 'b': [0.5, cudf.NA, 3.14]\n", - "})\n", - "df" + "df = randomdata(nrows=10, dtypes={'a':float, 'b':bool, 'c':str, 'e': float}, seed=12)\n", + "df.head()" ] }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 37, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 1.5\n", - "1 \n", - "2 6.14\n", - "dtype: float64" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "df.apply(f, axis=1)" + "grouped = df.groupby(['b'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Functions may also return scalar values, however the result will be promoted to a safe type regardless of the data. This means even if you have a function like:\n", - "\n", - "```python\n", - "def f(x):\n", - " if x > 1000:\n", - " return 1.5\n", - " else:\n", - " return 2\n", - "```\n", - "And your data is:\n", - "```python\n", - "[1,2,3,4,5]\n", - "```\n", - "You will get floats in the final data even though a float is never returned. This is because Numba ultimately needs to produce one function that can handle any data, which means if there's any possibility a float could result, you must always assume it will happen. Here's an example of a function that returns a scalar in some cases:" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
a
01
13
25
\n", - "
" - ], - "text/plain": [ - " a\n", - "0 1\n", - "1 3\n", - "2 5" - ] - }, - "execution_count": 34, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def f(row):\n", - " x = row['a']\n", - " if x > 3:\n", - " return x\n", - " else:\n", - " return 1.5\n", - "\n", - "df = cudf.DataFrame({\n", - " 'a': [1, 3, 5]\n", - "})\n", - "df" + "Next we'll define a function to apply to each group independently. In this case, we'll take the rolling average of column `e`, and call that new column `rolling_avg_e`." ] }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 38, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 1.5\n", - "1 1.5\n", - "2 5.0\n", - "dtype: float64" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "df.apply(f, axis=1)" + "def rolling_avg(e, rolling_avg_e):\n", + " win_size = 3\n", + " for i in range(cuda.threadIdx.x, len(e), cuda.blockDim.x):\n", + " if i < win_size - 1:\n", + " # If there is not enough data to fill the window,\n", + " # take the average to be NaN\n", + " rolling_avg_e[i] = np.nan\n", + " else:\n", + " total = 0\n", + " for j in range(i - win_size + 1, i + 1):\n", + " total += e[j]\n", + " rolling_avg_e[i] = total / win_size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Any number of columns and many arithmetic operators are supported, allowing for complex UDFs:" + "We can execute this with a very similar API to `apply_rows`. This time, though, it's going to execute independently for each group." 
] }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 39, "metadata": {}, "outputs": [ { @@ -2075,171 +2041,230 @@ " a\n", " b\n", " c\n", - " d\n", " e\n", + " rolling_avg_e\n", " \n", " \n", " \n", " \n", - " 0\n", - " 1\n", - " 4\n", - " <NA>\n", - " 8\n", - " 7\n", + " 1\n", + " 0.480099\n", + " False\n", + " Bob\n", + " -0.729580\n", + " NaN\n", " \n", " \n", - " 1\n", - " 2\n", - " 5\n", - " 4\n", - " 7\n", - " 1\n", + " 4\n", + " -0.970850\n", + " False\n", + " Sarah\n", + " 0.342905\n", + " NaN\n", + " \n", + " \n", + " 6\n", + " 0.801430\n", + " False\n", + " Sarah\n", + " 0.632337\n", + " 0.081887\n", + " \n", + " \n", + " 7\n", + " -0.933157\n", + " False\n", + " Quinn\n", + " -0.420826\n", + " 0.184805\n", + " \n", + " \n", + " 0\n", + " -0.691674\n", + " True\n", + " Dan\n", + " -0.958380\n", + " NaN\n", " \n", " \n", " 2\n", - " 3\n", - " 6\n", - " 4\n", - " 8\n", - " 6\n", + " -0.473370\n", + " True\n", + " Xavier\n", + " -0.767454\n", + " NaN\n", + " \n", + " \n", + " 3\n", + " 0.067479\n", + " True\n", + " Alice\n", + " -0.380205\n", + " -0.702013\n", + " \n", + " \n", + " 5\n", + " 0.837494\n", + " True\n", + " Wendy\n", + " -0.057540\n", + " -0.401733\n", + " \n", + " \n", + " 8\n", + " 0.913899\n", + " True\n", + " Ursula\n", + " 0.466252\n", + " 0.009502\n", + " \n", + " \n", + " 9\n", + " -0.725581\n", + " True\n", + " George\n", + " 0.405245\n", + " 0.271319\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a b c d e\n", - "0 1 4 8 7\n", - "1 2 5 4 7 1\n", - "2 3 6 4 8 6" + " a b c e rolling_avg_e\n", + "1 0.480099 False Bob -0.729580 NaN\n", + "4 -0.970850 False Sarah 0.342905 NaN\n", + "6 0.801430 False Sarah 0.632337 0.081887\n", + "7 -0.933157 False Quinn -0.420826 0.184805\n", + "0 -0.691674 True Dan -0.958380 NaN\n", + "2 -0.473370 True Xavier -0.767454 NaN\n", + "3 0.067479 True Alice -0.380205 -0.702013\n", + "5 0.837494 True Wendy -0.057540 -0.401733\n", + "8 0.913899 True Ursula 0.466252 0.009502\n", + "9 
-0.725581 True George 0.405245 0.271319" ] }, - "execution_count": 36, + "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "def f(row):\n", - " return row['a'] + (row['b'] - (row['c'] / row['d'])) % row['e']\n", - "\n", - "df = cudf.DataFrame({\n", - " 'a': [1, 2, 3],\n", - " 'b': [4, 5, 6],\n", - " 'c': [cudf.NA, 4, 4],\n", - " 'd': [8, 7, 8],\n", - " 'e': [7, 1, 6]\n", - "})\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 \n", - "1 2.428571429\n", - "2 8.5\n", - "dtype: float64" - ] - }, - "execution_count": 37, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.apply(f, axis=1)" + "results = grouped.apply_grouped(rolling_avg,\n", + " incols=['e'],\n", + " outcols=dict(rolling_avg_e=np.float64))\n", + "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## `cudf.Series.apply`" + "Notice how, with a window size of three in the kernel, the first two values in each group for our output column are null." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "cuDF provides a similar API to `pandas.Series.apply` for applying scalar UDFs to series objects. Like pandas, these UDFs do not need to be written in terms of rows. These UDFs have generalized null handling and are slightly more flexible than those that work with `applymap`. Ultimately, `applymap` will be deprecated and removed in favor of `apply`. Here is an example: " + "## Numba Kernels on CuPy Arrays\n", + "\n", + "We can also execute Numba kernels on CuPy NDArrays, again thanks to the `__cuda_array_interface__`. We can even run the same UDF on the Series and the CuPy array. First, we define a Series and then create a CuPy array from that Series." 
] }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 40, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 1., 2., 3., 4., 10.])" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# Create a cuDF series\n", - "sr = cudf.Series([1, cudf.NA, 3])" + "import cupy as cp\n", + "\n", + "s = cudf.Series([1.0, 2, 3, 4, 10])\n", + "arr = cp.asarray(s)\n", + "arr" ] }, { - "cell_type": "code", - "execution_count": 39, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# define a scalar function\n", - "def f(x):\n", - " if x is cudf.NA:\n", - " return 42\n", - " else:\n", - " return 2**x" + "Next, we define a UDF and execute it on our Series. We need to allocate a Series of the same size for our output, which we'll call `out`." ] }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0 2\n", - "1 42\n", - "2 8\n", - "dtype: int64" + "0 5\n", + "1 10\n", + "2 15\n", + "3 20\n", + "4 50\n", + "dtype: int32" ] }, - "execution_count": 40, + "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "sr.apply(f)" + "from cudf.utils import cudautils\n", + "\n", + "@cuda.jit\n", + "def multiply_by_5(x, out):\n", + " i = cuda.grid(1)\n", + " if i < x.size:\n", + " out[i] = x[i] * 5\n", + " \n", + "out = cudf.Series(cp.zeros(len(s), dtype='int32'))\n", + "multiply_by_5.forall(s.shape[0])(s, out)\n", + "out" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we execute the same function on our array. We allocate an empty array `out` to store our results." 
] }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0 2\n", - "1 42\n", - "2 8\n", - "dtype: int64" + "array([ 5., 10., 15., 20., 50.])" ] }, - "execution_count": 41, + "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# Check the pandas result\n", - "sr.to_pandas(nullable=True).apply(f)" + "out = cp.empty_like(arr)\n", + "multiply_by_5.forall(arr.size)(arr, out)\n", + "out" ] }, { @@ -2294,7 +2319,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.12" + "version": "3.8.13" } }, "nbformat": 4, From 38dcced6a31148e1c553db4e4839e00112f0ebe7 Mon Sep 17 00:00:00 2001 From: brandon-b-miller Date: Thu, 7 Apr 2022 07:50:44 -0700 Subject: [PATCH 2/3] remove applymap entirely and update apply --- .../source/user_guide/guide-to-udfs.ipynb | 530 ++++++++++-------- 1 file changed, 307 insertions(+), 223 deletions(-) diff --git a/docs/cudf/source/user_guide/guide-to-udfs.ipynb b/docs/cudf/source/user_guide/guide-to-udfs.ipynb index bb2e7a2626a..cda09e8cc16 100644 --- a/docs/cudf/source/user_guide/guide-to-udfs.ipynb +++ b/docs/cudf/source/user_guide/guide-to-udfs.ipynb @@ -13,7 +13,9 @@ "metadata": {}, "outputs": [], "source": [ - "import cudf" + "import cudf\n", + "from cudf.datasets import randomdata\n", + "import numpy as np" ] }, { @@ -31,22 +33,7 @@ "- CuPy NDArrays\n", "- Numba DeviceNDArrays\n", "\n", - "It also demonstrates cuDF's default null handling behavior, and how to write UDFs that can interact with null values in a limited fashion. Finally, it demonstrates some newer more general null handling via the `apply` API." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "When cuDF executes a UDF, it gets just-in-time (JIT) compiled into a CUDA kernel (either explicitly or implicitly) and is run on the GPU. 
Exploring CUDA and GPU architecture in-depth is out of scope for this guide. At a high level:\n",
- "\n",
- "- Compute is spread across multiple \"blocks\", which have access to both global memory and their own block local memory\n",
- "- Within each block, many \"threads\" operate independently and simultaneously access their block-specific shared memory with low latency\n",
- "\n",
- "\n",
- "This guide covers APIs that automatically handle dividing columns into chunks and assigning them into different GPU blocks for parallel computation (see [apply_chunks](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.dataframe.DataFrame.apply_chunks) or the [numba CUDA JIT API](https://numba.pydata.org/numba-doc/dev/cuda/index.html) if you need to control this yourself)."
+ "It also demonstrates cuDF's default null handling behavior, and how to write UDFs that can interact with null values."
 ]
 },
 {
@@ -55,27 +42,26 @@
 "source": [
 "## Series UDFs\n",
 "\n",
- "You can execute UDFs on Series in three ways:\n",
+ "You can execute UDFs on Series in two ways:\n",
 "\n",
- "- Writing a standard python function and using `cudf.Series.apply` (recommended)\n",
- "- Writing a standard Python function and using `applymap` (deprecated)\n",
+ "- Writing a standard Python function and using `cudf.Series.apply`\n",
 "- Writing a Numba kernel and using Numba's `forall` syntax\n",
 "\n",
- "Using `apply` or `applymap` is simpler, but writing a Numba kernel offers the flexibility to build more complex functions (we'll be writing only simple kernels in this guide)."
+ "Using `apply` is simpler, but writing a Numba kernel offers the flexibility to build more complex functions (we'll be writing only simple kernels in this guide)." 
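To make that trade-off concrete, the same elementwise operation can be phrased either as a scalar function handed to `apply`, or as an explicit indexed kernel. A CPU sketch of the two shapes (the real Numba version replaces the Python loop with `cuda.grid(1)` thread indexing and runs the iterations in parallel):

```python
import numpy as np

def f(x):
    # scalar UDF: written for one element at a time, as `apply` expects
    return x + 1

def kernel_style(x, out):
    # kernel style: explicit elementwise indexing into an output buffer
    for i in range(x.size):
        out[i] = f(x[i])

x = np.array([1, 2, 3])
out = np.empty_like(x)
kernel_style(x, out)
# out is now [2, 3, 4]
```

The kernel form is more verbose, but it is the shape that lets you control indexing, shared state, and multi-column access when the scalar form is too restrictive.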
] },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### `cudf.Series.apply`"
+ "### `cudf.Series.apply`"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "cuDF provides a similar API to `pandas.Series.apply` for applying scalar UDFs to series objects. These UDFs have generalized null handling and are slightly more flexible than those that work with `applymap`. Here is a very simple example:"
+ "cuDF provides a similar API to `pandas.Series.apply` for applying scalar UDFs to series objects. Here is a very basic example."
 ]
 },
 {
@@ -85,7 +71,7 @@
 "outputs": [],
 "source": [
 "# Create a cuDF series\n",
- "sr = cudf.Series([1, cudf.NA, 3])"
+ "sr = cudf.Series([1, 2, 3])"
 ]
 },
 {
@@ -103,10 +89,7 @@
 "source": [
 "# define a scalar function\n",
 "def f(x):\n",
- " if x is cudf.NA:\n",
- " return 42\n",
- " else:\n",
- " return 2**x"
+ " return x + 1"
 ]
 },
 {
@@ -124,9 +107,9 @@
 {
 "data": {
 "text/plain": [
- "0 2\n",
- "1 42\n",
- "2 8\n",
+ "0 2\n",
+ "1 3\n",
+ "2 4\n",
 "dtype: int64"
 ]
 },
@@ -139,6 +122,13 @@
 "sr.apply(f)"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is the same process used to apply the same UDF in pandas, as we can see below:"
+ ]
+ },
 {
 "cell_type": "code",
 "execution_count": 5,
 "metadata": {},
 {
 "data": {
 "text/plain": [
- "0 2\n",
- "1 42\n",
- "2 8\n",
+ "0 2\n",
+ "1 3\n",
+ "2 4\n",
 "dtype: int64"
 ]
 },
@@ ...
 ],
 "source": [
- "# Check the pandas result\n",
- "sr.to_pandas(nullable=True).apply(f)"
+ "sr.to_pandas().apply(f)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### `cudf.Series.applymap` (deprecated)"
+ "### Functions with Additional Scalar Arguments"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "`cudf.Series.applymap` originally played a similar role to `cudf.Series.apply` in legacy version of cuDF and is now deprecated. Its main difference is there is no explicit null handling. 
Functions are written the same way, but can't interact with the `cudf.NA` null value. In fact this API assumes that if an input value is null, the output value is also null, regardless of the logic inside the function. Let's look at a simple example." + "In addition, `cudf.Series.apply` supports `args=` just like pandas, allowing you to write UDFs that accept an arbitrary number of scalar arguments. Here is an example of such a function and its API call in both pandas and cuDF:"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 6,
 "metadata": {},
+ "outputs": [],
+ "source": [
+ "def g(x, const):\n",
+ " return x + const"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
 "outputs": [
 {
 "data": {
- "text/html": [
- "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
abc
0-0.691674TrueDan
10.480099FalseBob
2-0.473370TrueXavier
30.067479TrueAlice
4-0.970850FalseSarah
\n", - "
" - ], "text/plain": [ - " a b c\n", - "0 -0.691674 True Dan\n", - "1 0.480099 False Bob\n", - "2 -0.473370 True Xavier\n", - "3 0.067479 True Alice\n", - "4 -0.970850 False Sarah" + "0 43\n", + "1 44\n", + "2 45\n", + "dtype: int64" ] }, - "execution_count": 6, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "import numpy as np\n", - "from cudf.datasets import randomdata \n", - "\n", - "df = randomdata(nrows=10, dtypes={'a':float, 'b':bool, 'c':str}, seed=12)\n", - "df.head()" + "# cuDF apply\n", + "sr.apply(g, args=(42,))" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 43\n", + "1 44\n", + "2 45\n", + "dtype: int64" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# pandas apply\n", + "sr.to_pandas().apply(g, args=(42,))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Nullable Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Next, we'll define a basic Python function and call it as a UDF with `applymap`." + "Functions used with `cudf.Series.apply` and `cudf.DataFrame.apply` can be expected to handle nulls using the same rules as the rest of cuDF. In most cases this translates to nulls propagating through unary and binary operations and yielding more nulls. 
To make this concrete, let's look at the same example from above, this time using nullable data:" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "1 \n", + "2 3\n", + "dtype: int64" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Create a cuDF series with nulls\n", + "sr = cudf.Series([1, cudf.NA, 3])\n", + "sr" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# redefine the same function from above\n", + "def f(x):\n", + " return x + 1" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2\n", + "1 \n", + "2 4\n", + "dtype: int64" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# cuDF result\n", + "sr.apply(f)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2\n", + "1 \n", + "2 4\n", + "dtype: object" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# pandas result\n", + "sr.to_pandas(nullable=True).apply(f)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Often however the user desires explicit null handling behavior inside the function. cuDF exposes this capability the same way as pandas, by interacting directly with the `NA` singleton object. 
Here's an example of a function with explicit null handling:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ - "def udf(x):\n", - " if x > 0:\n", - " return x + 5\n", + "def f_null_sensitive(x):\n", + " # do something if the input is null\n", + " if x is cudf.NA:\n", + " return 42\n", " else:\n", - " return x - 5" + " return x + 1" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 14, "metadata": {}, "outputs": [ { - "name": "stderr", - "output_type": "stream", - "text": [ - "cudf/core/series.py:2219: FutureWarning: Series.applymap is deprecated and will be removed in a future cuDF release. Use Series.apply instead.\n", - " warnings.warn(\n" - ] - }, + "data": { + "text/plain": [ + "0 2\n", + "1 42\n", + "2 4\n", + "dtype: int64" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# cuDF result\n", + "sr.apply(f_null_sensitive)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ { "data": { "text/plain": [ - "0 -5.691674\n", - "1 5.480099\n", - "2 -5.473370\n", - "3 5.067479\n", - "4 -5.970850\n", - "5 5.837494\n", - "6 5.801430\n", - "7 -5.933157\n", - "8 5.913899\n", - "9 -5.725581\n", - "Name: a, dtype: float64" + "0 2\n", + "1 42\n", + "2 4\n", + "dtype: int64" ] }, - "execution_count": 8, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df['a'].applymap(udf)" + "# pandas result\n", + "sr.to_pandas(nullable=True).apply(f_null_sensitive)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In addition, `cudf.NA` can be returned from a function directly or conditionally. This capability should allow most users to implement custom null handling in a wide variety of cases." 
] }, { @@ -343,7 +418,16 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "df = randomdata(nrows=5, dtypes={'a':int, 'b':int, 'c':int}, seed=12)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ @@ -367,7 +451,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ @@ -385,7 +469,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 19, "metadata": {}, "outputs": [ { @@ -418,53 +502,53 @@ " \n", " \n", " 0\n", - " -0.691674\n", - " True\n", - " Dan\n", - " -6.916743\n", + " 963\n", + " 1005\n", + " 997\n", + " 9630.0\n", " \n", " \n", " 1\n", - " 0.480099\n", - " False\n", - " Bob\n", - " 4.800994\n", + " 977\n", + " 1026\n", + " 980\n", + " 9770.0\n", " \n", " \n", " 2\n", - " -0.473370\n", - " True\n", - " Xavier\n", - " -4.733700\n", + " 1048\n", + " 1026\n", + " 1019\n", + " 10480.0\n", " \n", " \n", " 3\n", - " 0.067479\n", - " True\n", - " Alice\n", - " 0.674788\n", + " 1078\n", + " 960\n", + " 985\n", + " 10780.0\n", " \n", " \n", " 4\n", - " -0.970850\n", - " False\n", - " Sarah\n", - " -9.708501\n", + " 979\n", + " 982\n", + " 1011\n", + " 9790.0\n", " \n", " \n", "\n", "" ], "text/plain": [ - " a b c e\n", - "0 -0.691674 True Dan -6.916743\n", - "1 0.480099 False Bob 4.800994\n", - "2 -0.473370 True Xavier -4.733700\n", - "3 0.067479 True Alice 0.674788\n", - "4 -0.970850 False Sarah -9.708501" + " a b c e\n", + "0 963 1005 997 9630.0\n", + "1 977 1026 980 9770.0\n", + "2 1048 1026 1019 10480.0\n", + "3 1078 960 985 10780.0\n", + "4 979 982 1011 9790.0" ] }, - "execution_count": 11, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -509,7 +593,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ @@ -526,7 +610,7 @@ }, { "cell_type": 
"code", - "execution_count": 13, + "execution_count": 21, "metadata": {}, "outputs": [ { @@ -581,7 +665,7 @@ "2 3 6" ] }, - "execution_count": 13, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } @@ -603,7 +687,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 22, "metadata": {}, "outputs": [ { @@ -615,7 +699,7 @@ "dtype: int64" ] }, - "execution_count": 14, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" } @@ -633,7 +717,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 23, "metadata": {}, "outputs": [ { @@ -645,7 +729,7 @@ "dtype: object" ] }, - "execution_count": 15, + "execution_count": 23, "metadata": {}, "output_type": "execute_result" } @@ -670,7 +754,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 24, "metadata": {}, "outputs": [ { @@ -721,7 +805,7 @@ "2 3" ] }, - "execution_count": 16, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -740,7 +824,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 25, "metadata": {}, "outputs": [ { @@ -752,7 +836,7 @@ "dtype: int64" ] }, - "execution_count": 17, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } @@ -770,7 +854,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 26, "metadata": {}, "outputs": [ { @@ -825,7 +909,7 @@ "2 3 1" ] }, - "execution_count": 18, + "execution_count": 26, "metadata": {}, "output_type": "execute_result" } @@ -848,7 +932,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 27, "metadata": {}, "outputs": [ { @@ -860,7 +944,7 @@ "dtype: int64" ] }, - "execution_count": 19, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } @@ -878,7 +962,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 28, "metadata": {}, "outputs": [ { @@ -933,7 +1017,7 @@ "2 3 3.14" ] }, - "execution_count": 20, + 
"execution_count": 28, "metadata": {}, "output_type": "execute_result" } @@ -951,7 +1035,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 29, "metadata": {}, "outputs": [ { @@ -963,7 +1047,7 @@ "dtype: float64" ] }, - "execution_count": 21, + "execution_count": 29, "metadata": {}, "output_type": "execute_result" } @@ -994,7 +1078,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 30, "metadata": {}, "outputs": [ { @@ -1045,7 +1129,7 @@ "2 5" ] }, - "execution_count": 22, + "execution_count": 30, "metadata": {}, "output_type": "execute_result" } @@ -1066,7 +1150,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 31, "metadata": {}, "outputs": [ { @@ -1078,7 +1162,7 @@ "dtype: float64" ] }, - "execution_count": 23, + "execution_count": 31, "metadata": {}, "output_type": "execute_result" } @@ -1096,7 +1180,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 32, "metadata": {}, "outputs": [ { @@ -1163,7 +1247,7 @@ "2 3 6 4 8 6" ] }, - "execution_count": 24, + "execution_count": 32, "metadata": {}, "output_type": "execute_result" } @@ -1184,7 +1268,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 33, "metadata": {}, "outputs": [ { @@ -1196,7 +1280,7 @@ "dtype: float64" ] }, - "execution_count": 25, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } @@ -1224,7 +1308,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 34, "metadata": {}, "outputs": [], "source": [ @@ -1253,7 +1337,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 35, "metadata": {}, "outputs": [ { @@ -1324,7 +1408,7 @@ "2 3 6 4 8 6 9.0" ] }, - "execution_count": 27, + "execution_count": 35, "metadata": {}, "output_type": "execute_result" } @@ -1356,7 +1440,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 36, "metadata": {}, "outputs": [ { @@ -1429,7 +1513,7 @@ "4 979 982 
1011" ] }, - "execution_count": 28, + "execution_count": 36, "metadata": {}, "output_type": "execute_result" } @@ -1455,7 +1539,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 37, "metadata": {}, "outputs": [ { @@ -1534,7 +1618,7 @@ "4 979 982 1011 1961.0" ] }, - "execution_count": 29, + "execution_count": 37, "metadata": {}, "output_type": "execute_result" } @@ -1567,7 +1651,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 38, "metadata": {}, "outputs": [ { @@ -1582,7 +1666,7 @@ "dtype: float64" ] }, - "execution_count": 30, + "execution_count": 38, "metadata": {}, "output_type": "execute_result" } @@ -1594,7 +1678,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 39, "metadata": {}, "outputs": [ { @@ -1603,7 +1687,7 @@ "Rolling [window=3,min_periods=3,center=False]" ] }, - "execution_count": 31, + "execution_count": 39, "metadata": {}, "output_type": "execute_result" } @@ -1622,7 +1706,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 40, "metadata": {}, "outputs": [], "source": [ @@ -1646,7 +1730,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 41, "metadata": {}, "outputs": [ { @@ -1661,7 +1745,7 @@ "dtype: float64" ] }, - "execution_count": 33, + "execution_count": 41, "metadata": {}, "output_type": "execute_result" } @@ -1679,7 +1763,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 42, "metadata": {}, "outputs": [ { @@ -1746,7 +1830,7 @@ "4 59.0 59.0" ] }, - "execution_count": 34, + "execution_count": 42, "metadata": {}, "output_type": "execute_result" } @@ -1760,7 +1844,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 43, "metadata": {}, "outputs": [ { @@ -1857,7 +1941,7 @@ "9 100.0 100.0" ] }, - "execution_count": 35, + "execution_count": 43, "metadata": {}, "output_type": "execute_result" } @@ -1880,7 +1964,7 @@ }, { "cell_type": "code", - "execution_count": 36, + 
"execution_count": 44, "metadata": {}, "outputs": [ { @@ -1959,7 +2043,7 @@ "4 -0.970850 False Sarah 0.342905" ] }, - "execution_count": 36, + "execution_count": 44, "metadata": {}, "output_type": "execute_result" } @@ -1971,7 +2055,7 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 45, "metadata": {}, "outputs": [], "source": [ @@ -1987,7 +2071,7 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 46, "metadata": {}, "outputs": [], "source": [ @@ -2014,7 +2098,7 @@ }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 47, "metadata": {}, "outputs": [ { @@ -2144,7 +2228,7 @@ "9 -0.725581 True George 0.405245 0.271319" ] }, - "execution_count": 39, + "execution_count": 47, "metadata": {}, "output_type": "execute_result" } @@ -2174,7 +2258,7 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 48, "metadata": {}, "outputs": [ { @@ -2183,7 +2267,7 @@ "array([ 1., 2., 3., 4., 10.])" ] }, - "execution_count": 40, + "execution_count": 48, "metadata": {}, "output_type": "execute_result" } @@ -2205,7 +2289,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 49, "metadata": {}, "outputs": [ { @@ -2219,7 +2303,7 @@ "dtype: int32" ] }, - "execution_count": 41, + "execution_count": 49, "metadata": {}, "output_type": "execute_result" } @@ -2247,7 +2331,7 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 50, "metadata": {}, "outputs": [ { @@ -2256,7 +2340,7 @@ "array([ 5., 10., 15., 20., 50.])" ] }, - "execution_count": 42, + "execution_count": 50, "metadata": {}, "output_type": "execute_result" } From 6e1920b958def25aec9f3960cd9699cae9b592a6 Mon Sep 17 00:00:00 2001 From: brandon-b-miller Date: Fri, 8 Apr 2022 10:05:03 -0700 Subject: [PATCH 3/3] address reviews --- .../source/user_guide/guide-to-udfs.ipynb | 116 ++---------------- 1 file changed, 10 insertions(+), 106 deletions(-) diff --git a/docs/cudf/source/user_guide/guide-to-udfs.ipynb 
b/docs/cudf/source/user_guide/guide-to-udfs.ipynb index cda09e8cc16..41bce8b865e 100644 --- a/docs/cudf/source/user_guide/guide-to-udfs.ipynb +++ b/docs/cudf/source/user_guide/guide-to-udfs.ipynb @@ -122,36 +122,6 @@ "sr.apply(f)" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is the same process that would be used to apply the same UDF in pandas as we can see below:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 2\n", - "1 3\n", - "2 4\n", - "dtype: int64" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sr.to_pandas().apply(f)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -201,27 +171,10 @@ ] }, { - "cell_type": "code", - "execution_count": 8, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 43\n", - "1 44\n", - "2 45\n", - "dtype: int64" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "# pandas apply\n", - "sr.to_pandas().apply(g, args=(42,))" + "As a final note, `**kwargs` is not yet supported." ] }, { @@ -235,7 +188,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Functions used with `cudf.Series.apply` and `cudf.DataFrame.apply` can be expected to handle nulls using the same rules as the rest of cuDF. In most cases this translates to nulls propagating through unary and binary operations and yielding more nulls. To make this concrete, let's look at the same example from above, this time using nullable data:" + "The null value `NA` propagates through unary and binary operations. Thus, `NA + 1`, `abs(NA)`, and `NA == NA` all return `NA`. 
To make this concrete, let's look at the same example from above, this time using nullable data:" ] }, { @@ -298,35 +251,11 @@ "sr.apply(f)" ] }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 2\n", - "1 \n", - "2 4\n", - "dtype: object" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# pandas result\n", - "sr.to_pandas(nullable=True).apply(f)" - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Often however the user desires explicit null handling behavior inside the function. cuDF exposes this capability the same way as pandas, by interacting directly with the `NA` singleton object. Here's an example of a function with explicit null handling:" + "Often however you want explicit null handling behavior inside the function. cuDF exposes this capability the same way as pandas, by interacting directly with the `NA` singleton object. Here's an example of a function with explicit null handling:" ] }, { @@ -367,35 +296,11 @@ "sr.apply(f_null_sensitive)" ] }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 2\n", - "1 42\n", - "2 4\n", - "dtype: int64" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# pandas result\n", - "sr.to_pandas(nullable=True).apply(f_null_sensitive)" - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In addition, `cudf.NA` can be returned from a function directly or conditionally. This capability should allow most users to implement custom null handling in a wide variety of cases." + "In addition, `cudf.NA` can be returned from a function directly or conditionally. This capability should allow you to implement custom null handling in a wide variety of cases." 
] }, { @@ -409,8 +314,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Many problems in data science and engineering are well studied and there exist known parallel algorithms for making some desired transformation to some data. Many have corresponding CUDA solutions that may not exist as column level API in cuDF. To expose the ability to use these custom kernels, cuDF supports directly using custom cuda kernels written using `numba` on cuDF `Series` objects. In short, this means that if a user has knowledge of how to write a CUDA kernel in numba, they may simply pass cuDF `Series` objects to that kernel as if they were numba device arrays. Let's look at a basic example of how to do this.\n", - "\n", + "In addition to the Series.apply() method for performing custom operations, you can also pass Series objects directly into [CUDA kernels written with Numba](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html).\n", "Note that this section requires basic CUDA knowledge. Refer to [numba's CUDA documentation](https://numba.pydata.org/numba-doc/latest/cuda/index.html) for details.\n", "\n", "The easiest way to write a Numba kernel is to use `cuda.grid(1)` to manage thread indices, and then leverage Numba's `forall` method to configure the kernel for us. Below, define a basic multiplication kernel as an example and use `@cuda.jit` to compile it." @@ -446,7 +350,7 @@ "source": [ "This kernel will take an input array, multiply it by a configurable value (supplied at runtime), and store the result in an output array. Notice that we wrapped our logic in an `if` statement. Because we can launch more threads than the size of our array, we need to make sure that we don't use threads with an index that would be out of bounds. Leaving this out can result in undefined behavior.\n", "\n", - "To execute our kernel, we just need to pre-allocate an output array and leverage the `forall` method mentioned above. 
First, we create a Series of all `0.0` in our DataFrame, since we want `float64` output. Next, we run the kernel with `forall`. `forall` requires us to specify our desired number of tasks, so we'll supply in the length of our Series (which we store in `size`). The [__cuda_array_interface__](https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html) is what allows us to directly call our Numba kernel on our Series." + "To execute our kernel, we must pre-allocate an output array and leverage the `forall` method mentioned above. First, we create a Series of all `0.0` in our DataFrame, since we want `float64` output. Next, we run the kernel with `forall`. `forall` requires us to specify our desired number of tasks, so we'll supply the length of our Series (which we store in `size`). The [__cuda_array_interface__](https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html) is what allows us to directly call our Numba kernel on our Series." ] }, { @@ -561,7 +465,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This API allows a user to theoretically write arbitrary kernel logic, potentially accessing and using elements of the series at arbitrary indices and use them on cuDF data structures. Advanced developers with some CUDA experience can often use this capability to implement iterative transformations, or spot treat problem areas of a data pipeline with a custom kernel that does the same job faster." + "This API allows you to write essentially arbitrary kernel logic, potentially accessing and using elements of the series at arbitrary indices on cuDF data structures. Advanced developers with some CUDA experience can often use this capability to implement iterative transformations, or spot treat problem areas of a data pipeline with a custom kernel that does the same job faster."
] }, { @@ -1646,7 +1550,7 @@ "\n", "For time-series data, we may need to operate on a small \\"window\\" of our column at a time, processing each portion independently. We could slide (\\"roll\\") this window over the entire column to answer questions like \\"What is the 3-day moving average of a stock price over the past year?\"\n", "\n", - "We can apply more complex functions to rolling windows to `rolling` Series and DataFrames using `apply`. This example is adapted from cuDF's [API documentation](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.dataframe.DataFrame.rolling). First, we'll create an example Series and then create a `rolling` object from the Series." + "We can apply more complex functions to rolling windows of `rolling` Series and DataFrames using `apply`. This example is adapted from cuDF's [API documentation](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.rolling.html). First, we'll create an example Series and then create a `rolling` object from the Series." ] }, { @@ -1957,7 +1861,7 @@ "source": [ "## GroupBy DataFrame UDFs\n", "\n", - "We can also apply UDFs to grouped DataFrames using `apply_grouped`. This example is also drawn and adapted from the RAPIDS [API documentation](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.groupby.groupby.GroupBy.apply_grouped).\n", + "We can also apply UDFs to grouped DataFrames using `apply_grouped`. This example is also drawn and adapted from the RAPIDS [API documentation]().\n", "\n", "First, we'll group our DataFrame based on column `b`, which is either True or False. Note that we currently need to pass `method=\"cudf\"` to use UDFs with GroupBy objects." ]