-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Separate window bounds calculation from aggregation functions #29428
REF: Separate window bounds calculation from aggregation functions #29428
Conversation
pandas/core/window/rolling.py
Outdated
|
||
return self._apply(f, func, args=args, kwargs=kwargs, center=False, raw=raw) | ||
# Why do we always pass center=False? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO?
pandas/_libs/window_indexer.pyx
Outdated
return start, end | ||
|
||
def get_window_bounds(self): | ||
return self.start, self.end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newline
pandas/_libs/window_indexer.pyx
Outdated
|
||
# TODO: Maybe will need to use this? | ||
# max window size | ||
#self.win = (self.end - self.start).max() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove?
pandas/_libs/window_indexer.pyx
Outdated
# max window size | ||
#self.win = (self.end - self.start).max() | ||
|
||
def build(self, const int64_t[:] index, int64_t win, bint left_closed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like the function doesnt use self?
pandas/_libs/window_indexer.pyx
Outdated
""" | ||
def __init__(self, ndarray values, int64_t win, object closed, object index=None): | ||
cdef: | ||
ndarray start_s, start_e, end_s, end_e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ndarray[int64_t, ndim=1]?
are there more window/rolling-centric .pyx files on the horizon? if so, would it make sense to make a _libs/window/ directory? |
@jbrockmendel Should just be |
sounds good. I think skiplist may belong in there too. If the intra-pandas dependencies of _libs/windows/ can be tighted locked down (e.g. "only _libs.util") thatd be great |
@jreback @jbrockmendel tests are passing locally now. Since this PR is already bulky, the follow up PR will be
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. followups noted. ping on green.
@TomAugspurger @jorisvandenbossche @jbrockmendel if any comments
end_e = start_e + win | ||
self.end = np.concatenate([end_s, end_e]) | ||
|
||
def get_window_bounds(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm interesting, though I think still can type. Anyhow for a try in a followup.
Happy to defer to others here. Things seem nice based on a quick skim. |
@@ -442,80 +182,75 @@ cdef inline void remove_sum(float64_t val, int64_t *nobs, float64_t *sum_x) nogi | |||
sum_x[0] = sum_x[0] - val | |||
|
|||
|
|||
def roll_sum(ndarray[float64_t] values, int64_t win, int64_t minp, | |||
object index, object closed): | |||
def roll_sum_variable(ndarray[float64_t] values, ndarray[int64_t] start, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does ndarray[type_t]
vs type_t[:]
make a difference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely noticeable?
# np buffer
In [1]: N = 1_000_000
...: s = pd.Series(range(N), index=pd.date_range('2019', periods=N, freq='s'))
...: roll = s.rolling('1H')
...: %timeit roll.sum()
28.8 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# memoryview
In [1]: N = 1_000_000
...: s = pd.Series(range(N), index=pd.date_range('2019', periods=N, freq='s'))
...: roll = s.rolling('1H')
...: %timeit roll.sum()
28.8 ms ± 416 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
|
||
# fixed window | ||
output = np.empty(N, dtype=float) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does float vs np.float64 matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure in a cython context but:
In [2]: %timeit np.empty(N, dtype=float)
127 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [3]: %timeit np.empty(N, dtype=np.float64)
127 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [4]: N
Out[4]: 1000000
with nogil: | ||
for i in range(minp - 1): | ||
val = values[i] | ||
add_skew(val, &nobs, &x, &xx, &xxx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can x, xx, and xxx have more informative names? (not a blocker, as its the same as status quo)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah copied verbatim, but can address in a followup
pandas/_libs/window.pyx
Outdated
|
||
Parameters | ||
---------- | ||
values: numpy array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after colon, in these files we usually specify the dtype as if it were a cython annotation, so np.ndarray[np.float64]
minp, index, | ||
closed, | ||
floor=0) | ||
counts = roll_sum_fixed(np.concatenate([np.isfinite(arr).astype(float), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if is is perf-relevant, i expect there is a cnp version of isfinite.
same question before about float vs float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was getting TypeError: only size-1 arrays can be converted to Python scalars
from the test suite when trying to use cnp.math.isfinite
here. I can try to get it working later but I am fairly confident that it won't be a performance bottleneck here.
if n == 0: | ||
return obj | ||
|
||
arr = np.asarray(obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we be confident we wont get here with e.g. datetime64tz?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each block of data is attempted to be cast to float first:
try:
values = self._prep_values(b.values)
except (TypeError, NotImplementedError):
if isinstance(obj, ABCDataFrame):
exclude.extend(b.columns)
del block_list[i]
continue
else:
raise DataError("No numeric types to aggregate")
Therefore np.asarray(obj)
should always be valid here. (Also copied verbatim from the refactor)
@@ -1414,8 +1454,14 @@ def skew(self, **kwargs): | |||
) | |||
|
|||
def kurt(self, **kwargs): | |||
window_func = self._get_cython_func_type("roll_kurt") | |||
kwargs.pop("require_min_periods", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens here if the user passes a weird value for require_min_periods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require_min_periods
is effectively an internal variable and shouldn't be expected from an external API. I need to pop here because of kwargs passed from other super
calls.
|
||
import numpy as np | ||
from numpy cimport ndarray, int64_t | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there anything in this module that we can/should test independently of the rest of the imlpementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a 2nd follow up PR, I am planning on allowing users to create their own "window indexers" to be passed into rolling(...)
. In that PR I can add tests for these existing indexers then. They have been effectively smoke tested since they get hit with every rolling test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor follow up comments, thanks @mroeschke
output[:] = NaN | ||
return output | ||
|
||
win = (end - start).max() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can create win about (followup ok)
end_e = start_e + win | ||
self.end = np.concatenate([end_s, end_e]) | ||
|
||
def get_window_bounds(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its optimizing readability :->
@@ -96,280 +96,20 @@ def _check_minp(win, minp, N, floor=None) -> int: | |||
# Physical description: 366 p. | |||
# Series: Prentice-Hall Series in Automatic Computation | |||
|
|||
# ---------------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think using _check_minp above? and likely can be moved to indexer.pyx anyhow (next pass)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also the references above are misplaced, not sure where they should go
kwargs=kwargs, | ||
raw=raw, | ||
offset=offset, | ||
func=func, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy error: "partial" gets multiple values for keyword argument "func"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What version of mypy raises this? I get this with 0.740
(pandas-dev) matthewroeschke:pandas-mroeschke matthewroeschke$ mypy pandas
pandas/core/indexes/frozen.py:112: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as overloaded function)
pandas/core/indexes/frozen.py:112: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Union[int, slice]], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], int], Any]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Any], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Iterable[Any]], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], DefaultNamedArg(Optional[Callable[[Any], Any]], 'key'), DefaultNamedArg(bool, 'reverse')], None]")
Found 6 errors in 1 file (checked 807 source files)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm getting that error on 0.740 with --check-untyped-defs (on #28339)
The problem is that the required argument for partial is named func, so I assume you can't also pass func as a keyword argument.
functools.partial(func, /, *args, **keywords)
EDIT: 0.730 -> 0.740
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking into this further, I think this is a false positive from mypy.
the __new__
of class partial seems to be able to handle this use case. testing with a minimum examples doesn't seem to break. so it appears that it is a typeshed issue.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Pre-req for #28987
Currently many of the aggregation functions in
window.pyx
follow the form:This PR refactors out the window bound calculation into
window_indexer.pyx
and validation so the aggregation functions can be of the form:The methods therefore in
rolling.py
now have the following pattern:start
andend
window bounds from functionality inwindow_indexer.pyx
values
,start
,end
,min periods
into the aggregation function.