BUG: make pct_change can handle the anchored freq #28664 #28681
pct_change didn't work when the freq is anchored (like 1W, 1M, BM), so when the freq is anchored, use data.asfreq(freq) instead of the raw data.
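One way to read this description is sketched below in plain Python. It is only a model under stated assumptions, not the code of this PR and not pandas itself: `business_month_end` and `next_business_month_end` are hypothetical helpers that imitate a 'BM' step while ignoring holidays, and the dict filter stands in for the `asfreq`-like step. The point it illustrates is that once a daily series is reduced to the dates that sit on the anchored frequency, shifting by one step maps every remaining label to a distinct label.

```python
import calendar
from datetime import date, timedelta


def business_month_end(year, month):
    """Last weekday (Mon-Fri) of the month; holidays are ignored (assumption)."""
    day = date(year, month, calendar.monthrange(year, month)[1])
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


def next_business_month_end(d):
    """Hypothetical stand-in for moving a label forward by one 'BM' step."""
    end = business_month_end(d.year, d.month)
    if d < end:
        return end
    year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    return business_month_end(year, month)


# A 70-day daily series with values 0..69, starting on 2019-01-01.
start = date(2019, 1, 1)
daily = {start + timedelta(days=i): i for i in range(70)}

# asfreq-like step: keep only the rows that sit on a business month end.
anchored = {d: v for d, v in daily.items()
            if d == business_month_end(d.year, d.month)}

# Shifting the reduced labels by one step yields distinct labels.
shifted = {next_business_month_end(d): v for d, v in anchored.items()}

# Percent change wherever a current value and a shifted value share a label.
pct = {d: anchored[d] / shifted[d] - 1 for d in anchored if d in shifted}
for label, change in pct.items():
    print(label, round(change, 4))  # 2019-02-28 0.9333
```

In this model the only comparable pair is 2019-01-31 (value 30) and 2019-02-28 (value 58), so a single percent change is produced and no label is repeated.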
@0xF4D3C0D3 just a general comment: your thoroughness is very much appreciated, but it would be helpful if you could scale back the length of some of the posts. Full tracebacks or huge outputs are difficult to sift through and make it unclear what problem(s) we are trying to address, so if you can limit posts to actionable items, I think that would help the review process.
@WillAyd Thanks for your advice, I'll bear it in mind. I'm going to edit it to leave only the main point.
@0xF4D3C0D3 I'm still just lost on this PR - can you try copy / pasting the test case from the issue this is trying to solve and start from there? We can discuss parametrization and changes from there.
@WillAyd below is the original code sample:

import pandas as pd
import random
import numpy as np

# Creating the time-series index
n = 60
index = pd.date_range('01/13/2020', periods=70, freq='D')

# Creating the dataframe
df = pd.DataFrame({"A": np.random.uniform(low=0.5, high=13.3, size=(70,)),
                   "B": np.random.uniform(low=10.5, high=45.3, size=(70,)),
                   "C": np.random.uniform(low=70.5, high=85, size=(70,)),
                   "D": np.random.uniform(low=50.5, high=65.7, size=(70,))}, index=index)
df.pct_change(freq='BM')

and this is the MCVE of it:

import pandas as pd
pd.Series(range(70), pd.date_range('2019', periods=70, freq='D')).pct_change(freq='BM')

It raises an error like this:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-e6e97a8957ae> in <module>
----> 1 pd.Series(range(70), pd.date_range('2019', periods=70, freq='D')).pct_change(freq='BM')
~/workspace/venv/marketing_ai3/lib/python3.7/site-packages/pandas/core/generic.py in pct_change(self, periods, fill_method, limit, freq, **kwargs)
9958 rs = (data.div(data.shift(periods=periods, freq=freq, axis=axis,
9959 **kwargs)) - 1)
-> 9960 rs = rs.reindex_like(data)
9961 if freq is None:
9962 mask = isna(com.values_from_object(data))
...
~/workspace/venv/marketing_ai3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3085 # trying to reindex on an axis with duplicates
3086 if not self.is_unique and len(indexer):
-> 3087 raise ValueError("cannot reindex from a duplicate axis")
3088
3089 def reindex(self, target, method=None, level=None, limit=None,
ValueError: cannot reindex from a duplicate axis

So at first I added singular (1) and plural (3) periods, and anchored ('BM') and unanchored ('D') freqs. Then, at @jreback's request, I added two more parameters from the original code sample: 70, and an empty periods argument, which is equivalent to 1.
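The "duplicate axis" in the message can be made visible with a plain-Python sketch. This is a model of what a one-step 'BM' shift does to the MCVE's daily labels, not the pandas implementation: `business_month_end` and `shift_one_bm` are hypothetical helpers, and holidays are ignored. Under those assumptions, the 70 daily dates collapse onto only three distinct labels.

```python
import calendar
from collections import Counter
from datetime import date, timedelta


def business_month_end(year, month):
    """Last weekday (Mon-Fri) of the month; holidays are ignored (assumption)."""
    day = date(year, month, calendar.monthrange(year, month)[1])
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


def shift_one_bm(d):
    """Hypothetical stand-in for shifting one label forward by one 'BM' step."""
    end = business_month_end(d.year, d.month)
    if d < end:
        return end
    year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    return business_month_end(year, month)


# Same labels as the MCVE: 70 consecutive days starting on 2019-01-01.
days = [date(2019, 1, 1) + timedelta(days=i) for i in range(70)]
shifted = [shift_one_bm(d) for d in days]

# Count how many source dates land on each shifted label.
counts = Counter(shifted)
for label, n in sorted(counts.items()):
    print(label, n)
# 2019-01-31 30
# 2019-02-28 28
# 2019-03-29 12
```

Any later step that needs unique labels, such as the `reindex_like` call shown in the traceback, has to deal with those repeated labels.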
Maybe I am overlooking it, but I think something has just gotten confused during the back and forth. If you have this minimum test, please start with that and we can build back up from there.
@WillAyd ok, I replaced the test with the MCVE for the sake of simplicity, and also to merge with upstream.
Great thanks for breaking this down to the minimum case. We can build up from here
Slight edit to wording and if you can move as suggested by Jeff I think this is good
Co-Authored-By: William Ayd <[email protected]>
All benchmarks were run in the same environment (a Mac laptop).
First benchmark:
Second benchmark:
Third benchmark:
Fourth benchmark:
Overall, they look relatively worse. I don't know why the performance fluctuates each time I measure it, but I think this is acceptable, because when I benchmarked upstream/master against upstream/master, the results fluctuated in the same way. As I said in another issue, #28634, I think the fluctuation occurs because the computing time is too short for a meaningful relative comparison.
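The point about very short timings can be illustrated with the standard library's `timeit`. This is only a sketch, not the benchmark tool or workload used in this PR: `workload` is a hypothetical stand-in for a fast operation. Batching many calls per sample and comparing the minimum across repeats tends to be less sensitive to background noise than comparing single short runs.

```python
import statistics
import timeit


def workload():
    """Hypothetical stand-in for a very short operation under test."""
    return sum(range(1000))


# Each sample times 1000 calls, so one sample is long enough to measure.
samples = timeit.repeat(workload, number=1000, repeat=7)

best = min(samples)                      # least disturbed sample
median = statistics.median(samples)      # typical sample
spread = (max(samples) - best) / best    # relative run-to-run fluctuation

print(f"best={best:.6f}s median={median:.6f}s spread={spread:.1%}")
```

If the spread between repeats of the same code is as large as the difference between two branches, that difference is probably not meaningful.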
I increased the periods from 3 to 5 for a more detailed explanation. the first of
def test_pct_change_with_duplicate_axis(self):
    # GH 28664
    common_idx = date_range("2019-11-14", periods=5, freq="D")
    result = Series(range(5), common_idx).pct_change(freq="B")
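To see why freq="B" on this five-day index can also produce repeated labels, here is a small plain-Python sketch. It models a one-step business-day shift with a hypothetical helper, `next_business_day`, which ignores holidays; it is not pandas code. The index starts on a Thursday, so Friday, Saturday, and Sunday all move to the same Monday.

```python
from datetime import date, timedelta


def next_business_day(d):
    """First weekday (Mon-Fri) strictly after d; holidays ignored (assumption)."""
    d += timedelta(days=1)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d


# Same labels as the test: five consecutive days starting on 2019-11-14.
days = [date(2019, 11, 14) + timedelta(days=i) for i in range(5)]
shifted = [next_business_day(d) for d in days]

for src, dst in zip(days, shifted):
    print(src, "->", dst)
# 2019-11-14 -> 2019-11-15
# 2019-11-15 -> 2019-11-18
# 2019-11-16 -> 2019-11-18
# 2019-11-17 -> 2019-11-18
# 2019-11-18 -> 2019-11-19
```

With only five rows, the collision on 2019-11-18 is easy to list by hand, which is what keeps the expected result short.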
actually can you replicate the original issue here, I think BM was used. I don't mind having B as well.
@jreback I had two options. The original issue used freq BM and periods 70. I thought you would also want an explicit, explainable expected result; however, with periods 70 the expected result would be quite long, and it would be hard to explain why it looks the way it does.
So the two options are as follows:
- replicate the original issue, but omit the explanation
- use freq B, since it's also an anchored offset, and leave the explanation as it is
Neither option is hard :D
Please tell me which one you prefer.
i c, ok, then let's leave it.
thanks @0xF4D3C0D3
- closes #28664 (Bug: pct_change with frequency set as 'BM' throws value error)
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry