ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809
Instead of adding a new keyword, might be nice if |
I would rather suggest the following: I always thought that the
This behavior is very confusing for the users (myself included), but it also creates bugs: see #25161, #25226
Instead of relying on
Example of the current use of loffset with resample:
>>> start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
>>> rng = pd.date_range(start, end, freq="1231min")
>>> ts = pd.Series(np.arange(len(rng)), index=rng)
>>> ts.resample("1min", loffset=-pd.Timedelta("1min")).count()
Example of the current broken loffset argument:
>>> ts.groupby(pd.Grouper(freq="1min", loffset=-pd.Timedelta("1min"))).count()
That being said, I agree that the naming of adjust_timestamp is not ideal. I would rename it into:
The line https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L1728 would be replaced by something roughly equivalent to:
TL;DR:
What do you think? |
I just realised that
>>> start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
>>> rng = pd.date_range(start, end, freq="1231min")
>>> ts = pd.Series(np.arange(len(rng)), index=rng)
>>> ts.resample("1min", loffset=-pd.Timedelta("365D")).count()
So I would suggest the following instead:
I will not fix What do you think ? |
I think base and loffset actually are pretty useful. However, for a non-evenly divisible freq the issue is that you likely simply want to use the first (or maybe the last) timestamp as the base. So how about we just add that ability in base to accept the string first or last rather than adding another keyword?
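In the pandas releases that followed this PR, the idea of anchoring the bins on the first timestamp is exposed as origin="start" rather than through base. A minimal sketch with made-up sample data (the series and the 17min frequency are assumptions chosen for illustration), comparing it with the default anchoring at midnight of the first day:

```python
import numpy as np
import pandas as pd

# Hypothetical data: seven samples, 7 minutes apart, starting at 00:03.
rng = pd.date_range("2000-01-01 00:03:00", periods=7, freq="7min")
ts = pd.Series(np.arange(len(rng)), index=rng)

# Default (origin="start_day"): bins are aligned to midnight of the first day.
by_day = ts.resample("17min").sum()

# origin="start": bins are aligned to the first timestamp of the series.
by_start = ts.resample("17min", origin="start").sum()

print(by_day)
print(by_start)
```

With the default, the first bin label is midnight; with origin="start" it is the first timestamp of the series, so the same samples land in different bins.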
@jreback this won't fix the issue that I'm trying to tackle. The idea is to be able to have a fixed timestamp as an "origin" that does not depend on the time series. So neither the base argument with I could use the |
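To make the "fixed origin" idea concrete, here is a sketch using the origin argument as it exists in released pandas. The two series, the 17min frequency and the reference timestamp are all assumptions for illustration. Because 17 minutes does not divide a day, bins anchored at each series' own midnight fall on different grids, while a shared origin puts both on the same grid:

```python
import pandas as pd

# Hypothetical fixed reference point, independent of either series.
origin = pd.Timestamp("2000-01-01")

# Two overlapping series that start on different days.
a = pd.Series(1, index=pd.date_range("2000-01-01 00:00", periods=400, freq="5min"))
b = pd.Series(1, index=pd.date_range("2000-01-02 00:00", periods=200, freq="5min"))

# Default: each series is anchored at midnight of its own first day.
default_a = a.resample("17min").sum()
default_b = b.resample("17min").sum()

# Fixed origin: both series share one 17-minute grid.
fixed_a = a.resample("17min", origin=origin).sum()
fixed_b = b.resample("17min", origin=origin).sum()

# Count the bin labels the two results have in common.
shared_default = len(default_a.index.intersection(default_b.index))
shared_fixed = len(fixed_a.index.intersection(fixed_b.index))
print(shared_default, shared_fixed)
```

With the default anchoring the two results share no bin labels, even though the series overlap in time; with the fixed origin the labels coincide wherever the series overlap.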
@hasB4K not averse to changing things. But we currently have base, loffset, so I don't really like the idea of another pretty opaque option. I would be onboard with deprecating both of these and replacing with 2 options, e.g. origin and offset come to mind. |
So would this signature be ok with you @jreback?
|
sure that looks reasonable. we would need to have a pretty nice deprecation message that shows one how to convert base and/or loffset to the new args (as well as a whatsnew and warning box in the docs); they can basically be the same though. it's how we want folks to migrate. |
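A sketch of what such a conversion could look like, written against the origin/offset API as released. The sample data is made up, and the old calls appear only in comments because base and loffset are no longer accepted by current pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical data: nine one-minute samples starting at midnight.
rng = pd.date_range("2000-01-01 00:00:00", periods=9, freq="1min")
ts = pd.Series(np.arange(len(rng)), index=rng)

# Old: ts.resample("5min", base=2).sum()
# New: shift the bin edges by two minutes with `offset`.
shifted_bins = ts.resample("5min", offset="2min").sum()

# Old: ts.resample("5min", loffset="30s").sum()
# New: resample first, then shift the resulting labels explicitly.
relabelled = ts.resample("5min").sum()
relabelled.index = relabelled.index + pd.Timedelta("30s")

print(shifted_bins)
print(relabelled)
```

The difference is that offset moves the bin edges themselves (so the aggregated values change), whereas the loffset replacement only moves the labels after aggregation.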
Perfect, I will implement that in this PR then 🙂
Yep, it seems quite necessary! Is there an example of a nice deprecation message in the current (or in the old) code that I could look into? |
there are some (recently removed in 1.0.0) deprecation messages in resample on how to handle the freq arg. maybe not great but ok :-> |
@jreback I still need to add more examples for 'origin' and 'offset' and update the "what's new" part of the doc, but otherwise, it's ready for review 🙂 |
Co-Authored-By: William Ayd <[email protected]>
…with_day_freq_on_dst
very nice @hasB4K this was quite some PR! please have a read thru the built docs (https://dev.pandas.io/), will take a little before they are there. and if needed issue a followup to clarify. and keep em coming! |
Thank you @jreback! 😃 The inputs and guidance from @mroeschke, @WillAyd and you were really interesting and challenging in a good way! I am really happy with the current state of this new functionality. Thank you all! 🎉 |
The "base" kwarg is no longer valid for resample in pandas. See pandas-dev/pandas#31809
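For code that relied on base, the offset argument added by this PR is accepted by both resample and pd.Grouper. A small sketch with made-up data (the 3min frequency and 1min offset are arbitrary choices) checking that the two spellings produce the same bins:

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten one-minute samples starting at midnight.
rng = pd.date_range("2000-01-01 00:00:00", periods=10, freq="1min")
ts = pd.Series(np.arange(len(rng)), index=rng)

# The same shifted bins, requested through resample and through pd.Grouper.
via_resample = ts.resample("3min", offset="1min").sum()
via_grouper = ts.groupby(pd.Grouper(freq="3min", offset="1min")).sum()

print(via_resample)
print(via_grouper)
```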
EDIT: this PR has changed, now instead of adding adjust_timestamp we are adding origin and offset arguments to resample and pd.Grouper (see #31809 (comment))
Hello,
This enhancement is an alternative to the base argument present in pd.Grouper or in the method resample. It adds the adjust_timestamp argument to change the current behavior of: https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L1728
adjust_timestamp is the timestamp on which to adjust the grouping. If None is passed, the first day of the time series at midnight is used.
Currently the bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like 30D) or that divide a day (like 90s or 1min). But it can create inconsistencies with some frequencies that do not meet this criterion.
Here is a simple snippet from a test that I added that proves that the current behavior can lead to some inconsistencies. Inconsistencies that can be fixed if we use adjust_timestamp:
I think this PR is ready to be merged, but I am of course open to any suggestions or criticism. 😉
For instance, I am not sure if the naming of adjust_timestamp is correct. An alternative could be base_timestamp or ref_timestamp 🤔?
Cheers,
base argument #25226
closes groupby(pd.Grouper) ignores loffset #28302
closes resample becomes non-deterministic, depending on DateTimeIndex values #28675
closes BUG: resample closed='left' not binning correctly. #4197
closes ENH: resample(..., base='start') for automaticly determining base. #8521
loffset and base in the code
loffset and base in the doc
origin and offset
offset example)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff