diff --git a/docs/cudf/source/Working-with-missing-data.ipynb b/docs/cudf/source/Working-with-missing-data.ipynb new file mode 100644 index 00000000000..54fe774060e --- /dev/null +++ b/docs/cudf/source/Working-with-missing-data.ipynb @@ -0,0 +1,3466 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Working with missing data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this section, we will discuss missing (also referred to as `NA`) values in cudf. cudf supports missing values in all dtypes. These missing values are represented by `<NA>`. They are also referred to as \"null values\"." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1. [How to Detect missing values](#How-to-Detect-missing-values)\n", + "2. [Float dtypes and missing data](#Float-dtypes-and-missing-data)\n", + "3. [Datetimes](#Datetimes)\n", + "4. [Calculations with missing data](#Calculations-with-missing-data)\n", + "5. [Sum/product of Null/nans](#Sum/product-of-Null/nans)\n", + "6. [NA values in GroupBy](#NA-values-in-GroupBy)\n", + "7. [Inserting missing data](#Inserting-missing-data)\n", + "8. [Filling missing values: fillna](#Filling-missing-values:-fillna)\n", + "9. [Filling with cudf Object](#Filling-with-cudf-Object)\n", + "10. [Dropping axis labels with missing data: dropna](#Dropping-axis-labels-with-missing-data:-dropna)\n", + "11. [Replacing generic values](#Replacing-generic-values)\n", + "12. [String/regular expression replacement](#String/regular-expression-replacement)\n", + "13. [Numeric replacement](#Numeric-replacement)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How to Detect missing values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To detect missing values, you can use the `isna()` and `notna()` functions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import cudf\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "df = cudf.DataFrame({'a': [1, 2, None, 4], 'b':[0.1, None, 2.3, 17.17]})" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
010.1
12<NA>
2<NA>2.3
3417.17
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 0.1\n", + "1 2 \n", + "2 2.3\n", + "3 4 17.17" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
0FalseFalse
1FalseTrue
2TrueFalse
3FalseFalse
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 False False\n", + "1 False True\n", + "2 True False\n", + "3 False False" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.isna()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 True\n", + "1 True\n", + "2 False\n", + "3 True\n", + "Name: a, dtype: bool" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['a'].notna()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One has to be mindful that in Python (and NumPy), the nan's don’t compare equal, but None's do. Note that cudf/NumPy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "None == None" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.nan == np.nan" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "3 False\n", + "Name: b, dtype: bool" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['b'] == np.nan" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": 
[], + "source": [ + "s = cudf.Series([None, 1, 2])" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 <NA>\n", + "1 1\n", + "2 2\n", + "dtype: int64" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "dtype: bool" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s == None" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "s = cudf.Series([1, 2, np.nan], nan_as_null=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1.0\n", + "1 2.0\n", + "2 NaN\n", + "dtype: float64" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "dtype: bool" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s == np.nan" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Float dtypes and missing data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because ``NaN`` is a float, a column of integers with even one missing value is cast to a floating-point dtype. However, this doesn't happen by default.\n", + "\n", + "By default, if a ``NaN`` value is passed to the `Series` constructor, it is treated as a `<NA>` value. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "1 2\n", + "2 \n", + "dtype: int64" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cudf.Series([1, 2, np.nan])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Hence to consider a ``NaN`` as ``NaN`` you will have to pass `nan_as_null=False` parameter into `Series` constructor." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1.0\n", + "1 2.0\n", + "2 NaN\n", + "dtype: float64" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cudf.Series([1, 2, np.nan], nan_as_null=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Datetimes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For `datetime64` types, cudf doesn't support having `NaT` values. Instead these values which are specific to numpy and pandas are considered as null values(``) in cudf. 
The actual underlying value of `NaT` is `min(int64)`, and cudf retains the underlying value when converting a cudf object to a pandas object.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2012-01-01 00:00:00.000000\n", + "1 <NA>\n", + "2 2012-01-01 00:00:00.000000\n", + "dtype: datetime64[us]" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "datetime_series = cudf.Series([pd.Timestamp(\"20120101\"), pd.NaT, pd.Timestamp(\"20120101\")])\n", + "datetime_series" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2012-01-01\n", + "1 NaT\n", + "2 2012-01-01\n", + "dtype: datetime64[ns]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "datetime_series.to_pandas()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Any operation on rows with `<NA>` values in a `datetime` column results in a `<NA>` value at the same location in the resulting column:" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0 days 00:00:00\n", + "1 <NA>\n", + "2 0 days 00:00:00\n", + "dtype: timedelta64[us]" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "datetime_series - datetime_series" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Calculations with missing data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Null values propagate naturally through arithmetic operations between cudf objects." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "df1 = cudf.DataFrame({'a':[1, None, 2, 3, None], 'b':cudf.Series([np.nan, 2, 3.2, 0.1, 1], nan_as_null=False)})" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "df2 = cudf.DataFrame({'a':[1, 11, 2, 34, 10], 'b':cudf.Series([0.23, 22, 3.2, None, 1])})" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
01NaN
1<NA>2.0
223.2
330.1
4<NA>1.0
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 NaN\n", + "1 2.0\n", + "2 2 3.2\n", + "3 3 0.1\n", + "4 1.0" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
010.23
11122.0
223.2
334<NA>
4101.0
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 0.23\n", + "1 11 22.0\n", + "2 2 3.2\n", + "3 34 \n", + "4 10 1.0" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
02NaN
1<NA>24.0
246.4
337<NA>
4<NA>2.0
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 2 NaN\n", + "1 24.0\n", + "2 4 6.4\n", + "3 37 \n", + "4 2.0" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1 + df2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While summing the data along a series, `NA` values will be treated as `0`." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "1 \n", + "2 2\n", + "3 3\n", + "4 \n", + "Name: a, dtype: int64" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['a']" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "6" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['a'].sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since `NA` values are treated as `0`, the mean would result to 2 in this case `(1 + 0 + 2 + 3 + 0)/5 = 2`" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.0" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['a'].mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To preserve `NA` values in the above calculations, `sum` & `mean` support `skipna` parameter.\n", + "By default it's value is\n", + "set to `True`, we can change it to `False` to preserve `NA` values." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nan" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['a'].sum(skipna=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nan" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['a'].mean(skipna=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Cumulative methods like `cumsum` and `cumprod` ignore `NA` values by default." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "1 <NA>\n", + "2 3\n", + "3 6\n", + "4 <NA>\n", + "Name: a, dtype: int64" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['a'].cumsum()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To preserve `NA` values in cumulative methods, provide `skipna=False`." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "1 <NA>\n", + "2 <NA>\n", + "3 <NA>\n", + "4 <NA>\n", + "Name: a, dtype: int64" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['a'].cumsum(skipna=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sum/product of Null/nans" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The sum of an empty or all-NA Series or column of a DataFrame is 0." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cudf.Series([np.nan], nan_as_null=False).sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nan" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cudf.Series([np.nan], nan_as_null=False).sum(skipna=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cudf.Series([], dtype='float64').sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The product of an empty or all-NA Series or column of a DataFrame is 1." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.0" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cudf.Series([np.nan], nan_as_null=False).prod()" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nan" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cudf.Series([np.nan], nan_as_null=False).prod(skipna=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.0" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cudf.Series([], dtype='float64').prod()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## NA values in GroupBy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`NA` groups in GroupBy are automatically excluded. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
01NaN
1<NA>2.0
223.2
330.1
4<NA>1.0
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 NaN\n", + "1 2.0\n", + "2 2 3.2\n", + "3 3 0.1\n", + "4 1.0" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
b
a
23.2
1NaN
30.1
\n", + "
" + ], + "text/plain": [ + " b\n", + "a \n", + "2 3.2\n", + "1 NaN\n", + "3 0.1" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.groupby('a').mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is also possible to include `NA` in groups by passing `dropna=False`" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
b
a
23.2
1NaN
30.1
<NA>1.5
\n", + "
" + ], + "text/plain": [ + " b\n", + "a \n", + "2 3.2\n", + "1 NaN\n", + "3 0.1\n", + " 1.5" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.groupby('a', dropna=False).mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Inserting missing data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All dtypes support insertion of missing value by assignment. Any specific location in series can made null by assigning it to `None`." + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [], + "source": [ + "series = cudf.Series([1, 2, 3, 4])" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "1 2\n", + "2 3\n", + "3 4\n", + "dtype: int64" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "series" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [], + "source": [ + "series[2] = None" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "1 2\n", + "2 \n", + "3 4\n", + "dtype: int64" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "series" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Filling missing values: fillna" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`fillna()` can fill in `NA` & `NaN` values with non-NA data." + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
01NaN
1<NA>2.0
223.2
330.1
4<NA>1.0
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 NaN\n", + "1 2.0\n", + "2 2 3.2\n", + "3 3 0.1\n", + "4 1.0" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 10.0\n", + "1 2.0\n", + "2 3.2\n", + "3 0.1\n", + "4 1.0\n", + "Name: b, dtype: float64" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['b'].fillna(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Filling with cudf Object" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also fillna using a dict or Series that is alignable. The labels of the dict or index of the Series must match the columns of the frame you wish to fill. The use case of this is to fill a DataFrame with the mean of that column." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [], + "source": [ + "import cupy as cp\n", + "dff = cudf.DataFrame(cp.random.randn(10, 3), columns=list('ABC'))" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [], + "source": [ + "dff.iloc[3:5, 0] = np.nan" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [], + "source": [ + "dff.iloc[4:6, 1] = np.nan" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [], + "source": [ + "dff.iloc[5:8, 2] = np.nan" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC
00.7712450.0510241.199239
1-1.1680410.702664-0.270806
2-1.467009-0.143080-0.806151
3NaN-0.610798-0.272895
4NaNNaN1.396784
5-0.439343NaNNaN
61.093102-0.764758NaN
70.003098-0.722648NaN
8-0.095899-1.285156-0.300566
90.1094652.497843-1.199856
\n", + "
" + ], + "text/plain": [ + " A B C\n", + "0 0.771245 0.051024 1.199239\n", + "1 -1.168041 0.702664 -0.270806\n", + "2 -1.467009 -0.143080 -0.806151\n", + "3 NaN -0.610798 -0.272895\n", + "4 NaN NaN 1.396784\n", + "5 -0.439343 NaN NaN\n", + "6 1.093102 -0.764758 NaN\n", + "7 0.003098 -0.722648 NaN\n", + "8 -0.095899 -1.285156 -0.300566\n", + "9 0.109465 2.497843 -1.199856" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dff" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC
00.7712450.0510241.199239
1-1.1680410.702664-0.270806
2-1.467009-0.143080-0.806151
3-0.149173-0.610798-0.272895
4-0.149173-0.0343641.396784
5-0.439343-0.034364-0.036322
61.093102-0.764758-0.036322
70.003098-0.722648-0.036322
8-0.095899-1.285156-0.300566
90.1094652.497843-1.199856
\n", + "
" + ], + "text/plain": [ + " A B C\n", + "0 0.771245 0.051024 1.199239\n", + "1 -1.168041 0.702664 -0.270806\n", + "2 -1.467009 -0.143080 -0.806151\n", + "3 -0.149173 -0.610798 -0.272895\n", + "4 -0.149173 -0.034364 1.396784\n", + "5 -0.439343 -0.034364 -0.036322\n", + "6 1.093102 -0.764758 -0.036322\n", + "7 0.003098 -0.722648 -0.036322\n", + "8 -0.095899 -1.285156 -0.300566\n", + "9 0.109465 2.497843 -1.199856" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dff.fillna(dff.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC
00.7712450.0510241.199239
1-1.1680410.702664-0.270806
2-1.467009-0.143080-0.806151
3NaN-0.610798-0.272895
4NaN-0.0343641.396784
5-0.439343-0.034364-0.036322
61.093102-0.764758-0.036322
70.003098-0.722648-0.036322
8-0.095899-1.285156-0.300566
90.1094652.497843-1.199856
\n", + "
" + ], + "text/plain": [ + " A B C\n", + "0 0.771245 0.051024 1.199239\n", + "1 -1.168041 0.702664 -0.270806\n", + "2 -1.467009 -0.143080 -0.806151\n", + "3 NaN -0.610798 -0.272895\n", + "4 NaN -0.034364 1.396784\n", + "5 -0.439343 -0.034364 -0.036322\n", + "6 1.093102 -0.764758 -0.036322\n", + "7 0.003098 -0.722648 -0.036322\n", + "8 -0.095899 -1.285156 -0.300566\n", + "9 0.109465 2.497843 -1.199856" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dff.fillna(dff.mean()[1:3])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dropping axis labels with missing data: dropna" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Missing data can be excluded using `dropna()`:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
01NaN
1<NA>2.0
223.2
330.1
4<NA>1.0
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 NaN\n", + "1 2.0\n", + "2 2 3.2\n", + "3 3 0.1\n", + "4 1.0" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
223.2
330.1
\n", + "
" + ], + "text/plain": [ + " a b\n", + "2 2 3.2\n", + "3 3 0.1" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.dropna(axis=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
1
2
3
4
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: []\n", + "Index: [0, 1, 2, 3, 4]" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.dropna(axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An equivalent `dropna()` is available for Series. " + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "2 2\n", + "3 3\n", + "Name: a, dtype: int64" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['a'].dropna()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Replacing generic values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Often times we want to replace arbitrary values with other values.\n", + "\n", + "`replace()` in Series and `replace()` in DataFrame provides an efficient yet flexible way to perform such replacements." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [], + "source": [ + "series = cudf.Series([0.0, 1.0, 2.0, 3.0, 4.0])" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.0\n", + "1 1.0\n", + "2 2.0\n", + "3 3.0\n", + "4 4.0\n", + "dtype: float64" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "series" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 5.0\n", + "1 1.0\n", + "2 2.0\n", + "3 3.0\n", + "4 4.0\n", + "dtype: float64" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "series.replace(0, 5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also replace any value with a `<NA>` value." + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 <NA>\n", + "1 1.0\n", + "2 2.0\n", + "3 3.0\n", + "4 4.0\n", + "dtype: float64" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "series.replace(0, None)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can replace a list of values with a list of other values:" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 4.0\n", + "1 3.0\n", + "2 2.0\n", + "3 1.0\n", + "4 0.0\n", + "dtype: float64" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "series.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also specify a mapping dict:" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + 
"metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 10.0\n", + "1 100.0\n", + "2 2.0\n", + "3 3.0\n", + "4 4.0\n", + "dtype: float64" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "series.replace({0: 10, 1: 100})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For a DataFrame, you can specify individual values by column:" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [], + "source": [ + "df = cudf.DataFrame({\"a\": [0, 1, 2, 3, 4], \"b\": [5, 6, 7, 8, 9]})" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
005
116
227
338
449
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 0 5\n", + "1 1 6\n", + "2 2 7\n", + "3 3 8\n", + "4 4 9" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
0100100
116
227
338
449
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 100 100\n", + "1 1 6\n", + "2 2 7\n", + "3 3 8\n", + "4 4 9" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.replace({\"a\": 0, \"b\": 5}, 100)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## String/regular expression replacement" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "cudf supports replacing string values using the `replace` API:" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [], + "source": [ + "d = {\"a\": list(range(4)), \"b\": list(\"ab..\"), \"c\": [\"a\", \"b\", None, \"d\"]}" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [], + "source": [ + "df = cudf.DataFrame(d)" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
00aa
11bb
22.<NA>
33.d
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "0 0 a a\n", + "1 1 b b\n", + "2 2 . <NA>\n", + "3 3 . d" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
00aa
11bb
22A Dot<NA>
33A Dotd
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "0 0 a a\n", + "1 1 b b\n", + "2 2 A Dot <NA>\n", + "3 3 A Dot d" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.replace(\".\", \"A Dot\")" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
00aa
11<NA><NA>
22A Dot<NA>
33A Dotd
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "0 0 a a\n", + "1 1 <NA> <NA>\n", + "2 2 A Dot <NA>\n", + "3 3 A Dot d" + ] + }, + "execution_count": 71, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.replace([\".\", \"b\"], [\"A Dot\", None])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Replace a few different values (list -> list):\n" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
00bb
11bb
22--<NA>
33--d
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "0 0 b b\n", + "1 1 b b\n", + "2 2 -- <NA>\n", + "3 3 -- d" + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.replace([\"a\", \".\"], [\"b\", \"--\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Only search in column 'b' (dict -> dict):" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
00aa
11bb
22replacement value<NA>
33replacement valued
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "0 0 a a\n", + "1 1 b b\n", + "2 2 replacement value <NA>\n", + "3 3 replacement value d" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.replace({\"b\": \".\"}, {\"b\": \"replacement value\"})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Numeric replacement" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`replace()` can also be used similarly to `fillna()`." + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [], + "source": [ + "df = cudf.DataFrame(cp.random.randn(10, 2))" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [], + "source": [ + "df[np.random.rand(df.shape[0]) > 0.5] = 1.5" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01
0<NA><NA>
1<NA><NA>
20.1231607461.09464783
3<NA><NA>
4<NA><NA>
50.68137677-0.357346253
6<NA><NA>
7<NA><NA>
81.173285961-0.968616065
90.147922362-0.154880098
\n", + "
" + ], + "text/plain": [ + " 0 1\n", + "0 <NA> <NA>\n", + "1 <NA> <NA>\n", + "2 0.123160746 1.09464783\n", + "3 <NA> <NA>\n", + "4 <NA> <NA>\n", + "5 0.68137677 -0.357346253\n", + "6 <NA> <NA>\n", + "7 <NA> <NA>\n", + "8 1.173285961 -0.968616065\n", + "9 0.147922362 -0.154880098" + ] + }, + "execution_count": 76, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.replace(1.5, None)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Replacing more than one value is possible by passing a list.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [], + "source": [ + "df00 = df.iloc[0, 0]" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01
05.0000005.000000
15.0000005.000000
20.1231611.094648
35.0000005.000000
45.0000005.000000
50.681377-0.357346
65.0000005.000000
75.0000005.000000
81.173286-0.968616
90.147922-0.154880
\n", + "
" + ], + "text/plain": [ + " 0 1\n", + "0 5.000000 5.000000\n", + "1 5.000000 5.000000\n", + "2 0.123161 1.094648\n", + "3 5.000000 5.000000\n", + "4 5.000000 5.000000\n", + "5 0.681377 -0.357346\n", + "6 5.000000 5.000000\n", + "7 5.000000 5.000000\n", + "8 1.173286 -0.968616\n", + "9 0.147922 -0.154880" + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.replace([1.5, df00], [5, 10])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also operate on the DataFrame in place:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [], + "source": [ + "df.replace(1.5, None, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01
0<NA><NA>
1<NA><NA>
20.1231607461.09464783
3<NA><NA>
4<NA><NA>
50.68137677-0.357346253
6<NA><NA>
7<NA><NA>
81.173285961-0.968616065
90.147922362-0.154880098
\n", + "
" + ], + "text/plain": [ + " 0 1\n", + "0 <NA> <NA>\n", + "1 <NA> <NA>\n", + "2 0.123160746 1.09464783\n", + "3 <NA> <NA>\n", + "4 <NA> <NA>\n", + "5 0.68137677 -0.357346253\n", + "6 <NA> <NA>\n", + "7 <NA> <NA>\n", + "8 1.173285961 -0.968616065\n", + "9 0.147922362 -0.154880098" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/cudf/source/index.rst b/docs/cudf/source/index.rst index 061d5dba126..bba0ed824b1 100644 --- a/docs/cudf/source/index.rst +++ b/docs/cudf/source/index.rst @@ -14,6 +14,7 @@ Welcome to cuDF's documentation! 10min-cudf-cupy.ipynb guide-to-udfs.ipynb internals.md + Working-with-missing-data.ipynb Indices and tables ==================