Implement DataFrame diff() #9817

Merged: 51 commits, Feb 5, 2022
Showing changes from 48 commits

Commits
287dced
create new pr
skirui-source Dec 1, 2021
721cbaa
docstrings
skirui-source Dec 2, 2021
4aa42c9
added df.diff() method. need to add tests
skirui-source Dec 2, 2021
bc2e827
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Dec 3, 2021
e245a63
added checks for null values and non-numeric dtypes and axis
skirui-source Dec 3, 2021
0f2bc48
added tests, all passing. ready for initial review
skirui-source Dec 3, 2021
2b81389
added example to doctrings
skirui-source Dec 3, 2021
7dbce30
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Dec 3, 2021
7cdb55a
minor edits to tests
skirui-source Dec 3, 2021
1e22293
.
skirui-source Dec 3, 2021
76c2926
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Dec 8, 2021
3bd8017
checks dtypes for all columns in a dataframe
skirui-source Dec 8, 2021
54ba362
fixed merge conflict in test_dataframe.py
skirui-source Dec 8, 2021
496e837
split long string to multiple lines
skirui-source Dec 8, 2021
2767b75
addressed review: use has_nulls to check for nans
skirui-source Dec 8, 2021
72d14bd
removed nan-constraints check
skirui-source Dec 8, 2021
3f046ac
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Dec 8, 2021
4ba5e5d
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Dec 9, 2021
798c642
added tests for mix of numeric and non-numeric dtypes
skirui-source Dec 9, 2021
0978972
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Dec 10, 2021
df2140e
addressed reviews by brandon
skirui-source Dec 10, 2021
c2e1e5a
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Dec 13, 2021
9141225
numeric types docs- fix
skirui-source Dec 14, 2021
79211f1
added test cases for decimal64 dtypes
skirui-source Dec 14, 2021
a936bcf
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Dec 14, 2021
43e3a7f
Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …
skirui-source Jan 6, 2022
8df6881
wip: moved binary_operator to DecimalBaseColumn
skirui-source Jan 7, 2022
db742bb
added checks for either decimal32 or decimal64
skirui-source Jan 7, 2022
18e8ff3
fixed merge conflict in test_dataframe.py
skirui-source Jan 18, 2022
a29d1a3
addressed michael's review comments
skirui-source Jan 18, 2022
ee1031e
fixed merge conflict in cudf_dev_cuda11.5.yml
skirui-source Jan 19, 2022
3dc90b0
use const seed for random generated number -- cases
skirui-source Jan 19, 2022
f09d4b5
fixed merge conflict in decimal.py
skirui-source Jan 19, 2022
a90f324
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
skirui-source Jan 21, 2022
bfc550a
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
skirui-source Jan 24, 2022
c8a424d
added check for periods>len(dataframe)
skirui-source Jan 25, 2022
b91a795
added test for decimal32dtype, all tests passing. ready for review
skirui-source Jan 25, 2022
331945f
use column_empty instead of cudf.NA to create df with all-nulls
skirui-source Jan 25, 2022
559edc3
apply bradley's suggestions to docstrings
skirui-source Jan 25, 2022
f33b931
add dots to indicate continuation in docstring examples
skirui-source Jan 25, 2022
85c2bcd
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
skirui-source Jan 25, 2022
2dcabe9
added checks for periods as integer, and axis
skirui-source Jan 25, 2022
e6b2400
minor test-fixes. ready for review
skirui-source Jan 25, 2022
3d59ba2
fixed regex issues, all tests passing now
skirui-source Jan 25, 2022
9c4a26b
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
skirui-source Jan 25, 2022
2abbd2e
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
skirui-source Jan 26, 2022
e0722ae
using pandas is_integer() and float() instead
skirui-source Jan 26, 2022
82f941b
Merge branch 'branch-22.04' of https://github.com/rapidsai/cudf into …
skirui-source Feb 3, 2022
cdf5187
omit unnecessary extra tests.
skirui-source Feb 4, 2022
50b5085
only use context manager around function that raises
skirui-source Feb 4, 2022
575c118
context manager around function only
skirui-source Feb 4, 2022
75 changes: 75 additions & 0 deletions python/cudf/cudf/core/dataframe.py
@@ -19,6 +19,7 @@
import pyarrow as pa
from nvtx import annotate
from pandas._config import get_option
from pandas.core.dtypes.common import is_float, is_integer
from pandas.io.formats import console
from pandas.io.formats.printing import pprint_thing
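
Note on the new import: `is_float` and `is_integer` are used by the `periods` check in `diff()` further down. Integers pass through, floats with an integral value (such as `2.0`) are cast to `int`, and anything else raises `ValueError`. A minimal stand-alone sketch of that rule follows. It uses only the standard library; `normalize_periods` is a hypothetical name, and the `numbers.Integral` / `float` checks are stand-ins for the pandas helpers, so NumPy scalars and other edge cases may behave differently than in the PR.

```python
import numbers


def normalize_periods(periods):
    """Return periods as an int or raise ValueError (hypothetical helper)."""
    # Booleans are rejected on purpose: pandas' is_integer does not treat
    # bool as an integer, and this stand-in follows that behaviour.
    if isinstance(periods, numbers.Integral) and not isinstance(periods, bool):
        return int(periods)
    # Floats are accepted only when they carry an integral value, e.g. 2.0.
    if isinstance(periods, float) and periods.is_integer():
        return int(periods)
    raise ValueError("periods must be an integer")


print(normalize_periods(2))     # 2
print(normalize_periods(-3.0))  # -3
try:
    normalize_periods(1.5)
except ValueError as err:
    print(err)                  # periods must be an integer
```

Casting integral floats instead of rejecting them keeps calls such as `diff(periods=2.0)` working, which appears to be the intent of the nested check in the PR.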

@@ -2604,6 +2605,80 @@ def insert(self, loc, name, value, nan_as_null=None):

        self._data.insert(name, value, loc=loc)

    def diff(self, periods=1, axis=0):
        """
        First discrete difference of element.

        Calculates the difference of a DataFrame element compared with another
        element in the DataFrame (default is element in previous row).

        Parameters
        ----------
        periods : int, default 1
            Periods to shift for calculating difference,
            accepts negative values.
        axis : {0 or 'index', 1 or 'columns'}, default 0
            Take difference over rows (0) or columns (1).
            Only row-wise (0) shift is supported.

        Returns
        -------
        DataFrame
            First differences of the DataFrame.

        Notes
        -----
        Diff currently only supports numeric dtype columns.

        Examples
        --------
        >>> import cudf
        >>> gdf = cudf.DataFrame({'a': [1, 2, 3, 4, 5, 6],
        ...                       'b': [1, 1, 2, 3, 5, 8],
        ...                       'c': [1, 4, 9, 16, 25, 36]})
        >>> gdf
           a  b   c
        0  1  1   1
        1  2  1   4
        2  3  2   9
        3  4  3  16
        4  5  5  25
        5  6  8  36
        >>> gdf.diff(periods=2)
              a     b     c
        0  <NA>  <NA>  <NA>
        1  <NA>  <NA>  <NA>
        2     2     1     8
        3     2     2    12
        4     2     3    16
        5     2     5    20

        """
        if not is_integer(periods):
            if not (is_float(periods) and periods.is_integer()):
                raise ValueError("periods must be an integer")
            periods = int(periods)

        axis = self._get_axis_from_axis_arg(axis)
        if axis != 0:
            raise NotImplementedError("Only axis=0 is supported.")

        if not all(is_numeric_dtype(i) for i in self.dtypes):
            raise NotImplementedError(
                "DataFrame.diff only supports numeric dtypes"
            )

        if abs(periods) > len(self):
            df = cudf.DataFrame._from_data(
                {
                    name: column_empty(len(self), dtype=dtype, masked=True)
                    for name, dtype in zip(self.columns, self.dtypes)
                }
            )
            return df

        return self - self.shift(periods=periods)

    def drop(
        self,
        labels=None,
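
Note on the `diff()` hunk above: the result is computed as `self - self.shift(periods=periods)`, with an all-null frame returned early when `abs(periods)` exceeds the number of rows. The same idea can be sketched on a single column with plain Python lists, using `None` for nulls. The `shift` and `diff` functions below are illustrative stand-ins written for this note, not cudf APIs.

```python
def shift(values, periods):
    """Shift a list by `periods` rows, filling vacated slots with None."""
    n = len(values)
    if periods == 0:
        return list(values)
    if abs(periods) >= n:
        # Every row is shifted out of range, so the result is all-null.
        return [None] * n
    if periods > 0:
        return [None] * periods + values[: n - periods]
    return values[-periods:] + [None] * (-periods)


def diff(values, periods=1):
    """Row-wise values - shift(values, periods); nulls propagate."""
    shifted = shift(values, periods)
    return [
        None if a is None or b is None else a - b
        for a, b in zip(values, shifted)
    ]


c = [1, 4, 9, 16, 25, 36]  # column 'c' from the docstring example
print(diff(c, periods=2))   # [None, None, 8, 12, 16, 20]
print(diff(c, periods=-1))  # [-3, -5, -7, -9, -11, None]
print(diff(c, periods=7))   # [None, None, None, None, None, None]
```

With `periods=2` the output matches column `c` of the docstring example. A negative `periods` compares each row with a later row, so the nulls land at the end instead of the start.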
71 changes: 71 additions & 0 deletions python/cudf/cudf/tests/test_dataframe.py
@@ -9070,6 +9070,77 @@ def test_dataframe_add_suffix():
    assert_eq(got, expected)


@pytest.mark.parametrize(
    "data",
    [
        np.random.RandomState(seed=10).randint(-50, 50, (25, 30)),
        np.random.RandomState(seed=10).random_sample((4, 4)),
        np.array([1.123, 2.343, 5.890, 0.0]),
        [True, False, True, False, False],
        {"a": [1.123, 2.343, np.nan, np.nan], "b": [None, 3, 9.08, None]},
    ],
)
@pytest.mark.parametrize("periods", (-5, -1, 0, 1, 5))
def test_diff_dataframe_numeric_dtypes(data, periods):
    gdf = cudf.DataFrame(data)
    pdf = gdf.to_pandas()

    actual = gdf.diff(periods=periods, axis=0)
    expected = pdf.diff(periods=periods, axis=0)

    assert_eq(
        expected, actual, check_dtype=False,
    )


@pytest.mark.parametrize(
    ("precision", "scale"), [(5, 2), (4, 3), (8, 5), (3, 1), (6, 4)],
)
@pytest.mark.parametrize(
    "dtype", [cudf.Decimal32Dtype, cudf.Decimal64Dtype],
)
def test_diff_decimal_dtypes(precision, scale, dtype):
    gdf = cudf.DataFrame(
        np.random.default_rng(seed=42).uniform(10.5, 75.5, (10, 6)),
        dtype=dtype(precision=precision, scale=scale),
    )
    pdf = gdf.to_pandas()

    actual = gdf.diff()
    expected = pdf.diff()

    assert_eq(
        expected, actual, check_dtype=False,
    )
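
Note on the decimal test above: differencing fixed-point values is exact and keeps the column's scale, which is why decimal columns can be compared against pandas directly. A rough illustration with Python's `decimal` module follows. `decimal.Decimal` is only a stand-in for cudf's `Decimal32Dtype` / `Decimal64Dtype` columns here, and the sample values are made up for this note.

```python
from decimal import Decimal

# A toy "column" stored at scale 2 (two digits after the decimal point).
col = [Decimal("10.50"), Decimal("33.12"), Decimal("75.25")]

# First difference with periods=1: the first row has no predecessor,
# so it is null (None here).
diffs = [None] + [cur - prev for prev, cur in zip(col, col[1:])]
print(diffs)  # [None, Decimal('22.62'), Decimal('42.13')]

# The scale of the inputs is preserved in the result.
print(diffs[1].as_tuple().exponent)  # -2
```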


def test_diff_dataframe_invalid_axis():
    gdf = cudf.DataFrame(np.array([1.123, 2.343, 5.890, 0.0]))
    # Only the call expected to raise belongs inside the context manager.
    with pytest.raises(NotImplementedError, match="Only axis=0 is supported."):
        gdf.diff(periods=1, axis=1)
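
Note on `match=` in the test above: `pytest.raises` applies the pattern with `re.search`, so the string is a regular expression and the trailing `.` matches any character. The standard-library sketch below shows the difference between the raw pattern and an escaped one. The message text is taken from the PR; the variable names are made up for this note.

```python
import re

pattern = "Only axis=0 is supported."
message = "Only axis=0 is supported."
near_miss = "Only axis=0 is supported!"

# Unescaped: '.' is a wildcard, so a message ending in '!' also matches.
print(bool(re.search(pattern, near_miss)))             # True

# Escaped: the pattern now requires the literal text.
print(bool(re.search(re.escape(pattern), near_miss)))  # False
print(bool(re.search(re.escape(pattern), message)))    # True
```

The unescaped pattern is harmless for this message, but wrapping the string in `re.escape` is a safer habit when an error message contains characters such as `(`, `[` or `+`.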


@pytest.mark.parametrize(
    "data",
    [
        {
            "int_col": [1, 2, 3, 4, 5],
            "float_col": [1.0, 2.0, 3.0, 4.0, 5.0],
            "string_col": ["a", "b", "c", "d", "e"],
        },
        ["a", "b", "c", "d", "e"],
        [np.nan, None, np.nan, None],
    ],
)
def test_diff_dataframe_non_numeric_dtypes(data):
    gdf = cudf.DataFrame(data)
    # Only the call expected to raise belongs inside the context manager.
    with pytest.raises(
        NotImplementedError,
        match="DataFrame.diff only supports numeric dtypes",
    ):
        gdf.diff(periods=2, axis=0)


def test_dataframe_assign_cp_np_array():
    m, n = 5, 3
    cp_ndarray = cupy.random.randn(m, n)