Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: make pct_change can handle the anchored freq #28664 #28681

Merged
merged 26 commits into from
Nov 15, 2019

Conversation

dongho-jung
Copy link
Contributor

@dongho-jung dongho-jung commented Sep 30, 2019

pct_change didn't work when the freq is anchored(like 1W, 1M, BM)
so when the freq is anchored, use data.asfreq(freq) instead of the raw
data.
pandas/tests/series/test_timeseries.py Outdated Show resolved Hide resolved
pandas/core/generic.py Outdated Show resolved Hide resolved
@WillAyd WillAyd added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Oct 1, 2019
pandas/core/generic.py Outdated Show resolved Hide resolved
@jreback jreback added the Frequency DateOffsets label Oct 1, 2019
pandas/tests/series/test_timeseries.py Outdated Show resolved Hide resolved
pandas/tests/series/test_timeseries.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
pandas/core/generic.py Show resolved Hide resolved
pandas/tests/series/test_timeseries.py Outdated Show resolved Hide resolved
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@0xF4D3C0D3 just a general comment - your thoroughness is very much appreciated but it would be helpful if you could scale back the length of some of the posts. Posting full tracebacks or huge output is difficult to sift through and makes it unclear what problem(s) we are trying to address, so if you can try and limit to actionable items I think would help review process

pandas/tests/series/test_timeseries.py Outdated Show resolved Hide resolved
@dongho-jung
Copy link
Contributor Author

@WillAyd Thanks for your advice, I'll bear it in my mind. I'm gonna edit it to leave the main point only.

pandas/tests/series/test_timeseries.py Outdated Show resolved Hide resolved
@TomAugspurger TomAugspurger added this to the 1.0 milestone Oct 19, 2019
@dongho-jung dongho-jung requested a review from jreback October 19, 2019 14:11
@WillAyd
Copy link
Member

WillAyd commented Oct 22, 2019

@0xF4D3C0D3 I'm still just lost on this PR - can you try copy / pasting the test case from the issue this is trying to solve and start from there? We can dicuss parametrization and changes from there

@dongho-jung
Copy link
Contributor Author

@0xF4D3C0D3 I'm still just lost on this PR - can you try copy / pasting the test case from the issue this is trying to solve and start from there? We can dicuss parametrization and changes from there

@WillAyd below is the original code sample:

import pandas as pd
import random
import numpy as np


Creating the time-series index 
n=60
index = pd.date_range('01/13/2020', periods = 70,freq='D') 
  
Creating the dataframe  
df = pd.DataFrame({"A":np.random.uniform(low=0.5, high=13.3, size=(70,)), 
                   "B":np.random.uniform(low=10.5, high=45.3, size=(70,)),  
                   "C":np.random.uniform(low=70.5, high=85, size=(70,)), 
                   "D":np.random.uniform(low=50.5, high=65.7, size=(70,))}, index = index) 


df.pct_change(freq='BM')

and this is the mcve of it:

import pandas as pd
pd.Series(range(70), pd.date_range('2019', periods=70, freq='D')).pct_change(freq='BM')

It raises an error like this:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-e6e97a8957ae> in <module>
----> 1 pd.Series(range(70), pd.date_range('2019', periods=70, freq='D')).pct_change(freq='BM')

~/workspace/venv/marketing_ai3/lib/python3.7/site-packages/pandas/core/generic.py in pct_change(self, periods, fill_method, limit, freq, **kwargs)
   9958         rs = (data.div(data.shift(periods=periods, freq=freq, axis=axis,
   9959                                   **kwargs)) - 1)
-> 9960         rs = rs.reindex_like(data)
   9961         if freq is None:
   9962             mask = isna(com.values_from_object(data))

...

~/workspace/venv/marketing_ai3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3085         # trying to reindex on an axis with duplicates
   3086         if not self.is_unique and len(indexer):
-> 3087             raise ValueError("cannot reindex from a duplicate axis")
   3088 
   3089     def reindex(self, target, method=None, level=None, limit=None,

ValueError: cannot reindex from a duplicate axis

So I had added singular(1) and plural(3) periods, anchored('BM') and unanchored('D') freq at first, then as according to @jreback's request, I added new parameters 70 and empty periods which is equivalent to 1 from the original code sample.

@WillAyd
Copy link
Member

WillAyd commented Nov 8, 2019

and this is the mcve of it:

import pandas as pd
pd.Series(range(70), pd.date_range('2019', periods=70, freq='D')).pct_change(freq='BM')

Maybe I am overlooking it but I think something has just gotten confused during the back and forth. If you have this minimum test please start with that and we can build back up from there

@dongho-jung
Copy link
Contributor Author

and this is the mcve of it:

import pandas as pd
pd.Series(range(70), pd.date_range('2019', periods=70, freq='D')).pct_change(freq='BM')

Maybe I am overlooking it but I think something has just gotten confused during the back and forth. If you have this minimum test please start with that and we can build back up from there

@WillAyd ok, I replaced the test with the mcve for the sake of simplicity, and also to merge with upstream.
I think it would be better if it's just a sanity test in the beginning.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great thanks for breaking this down to the minimum case. We can build up from here

pandas/tests/series/test_timeseries.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight edit to wording and if you can move as suggested by Jeff I think this is good

doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
@dongho-jung
Copy link
Contributor Author

  1. I moved a whatsnew entry from Other to Reshaping
  2. I also changed slightly edited to the wording according to what @WillAyd suggested.
  3. the important part, asv benchmark, is as follows:
All benchmark is in the same environment (Mac, Laptop)

First benchmark:

+         129±3μs         146±20μs     1.14  index_cached_properties.IndexCache.time_is_monotonic_decreasing('TimedeltaIndex')
+     1.89±0.05ms       2.14±0.3ms     1.14  index_cached_properties.IndexCache.time_is_unique('MultiIndex')
+      1.69±0.1μs       1.92±0.2μs     1.13  index_cached_properties.IndexCache.time_is_all_dates('MultiIndex')
+        799±30ns         905±20ns     1.13  index_cached_properties.IndexCache.time_is_monotonic_decreasing('RangeIndex')
+        370±10μs         409±40μs     1.10  index_cached_properties.IndexCache.time_is_monotonic('MultiIndex')
+       940±100ns       1.04±0.1μs     1.10  index_cached_properties.IndexCache.time_is_all_dates('Float64Index')
-        517±20ns         461±20ns     0.89  index_cached_properties.IndexCache.time_is_all_dates('Int64Index')
-      1.29±0.2μs       1.11±0.1μs     0.86  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')
-      2.36±0.4μs       2.03±0.1μs     0.86  index_cached_properties.IndexCache.time_inferred_type('IntervalIndex')

Second benchmark:

+     1.12±0.03μs       1.44±0.2μs     1.29  index_cached_properties.IndexCache.time_values('DatetimeIndex')
+     1.05±0.02μs       1.23±0.1μs     1.17  index_cached_properties.IndexCache.time_is_all_dates('PeriodIndex')
+        603±20ns         667±20ns     1.11  index_cached_properties.IndexCache.time_is_monotonic_increasing('Int64Index')
-      1.20±0.1μs       1.08±0.1μs     0.90  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')
-        480±10ns         433±20ns     0.90  index_cached_properties.IndexCache.time_is_unique('Int64Index')
-      4.23±0.7μs       3.74±0.4μs     0.89  index_cached_properties.IndexCache.time_engine('TimedeltaIndex')
-        10.9±3μs         6.12±2μs     0.56  index_cached_properties.IndexCache.time_engine('Float64Index')

Third benchmark:

+      37.5±0.5μs         46.0±5μs     1.23  index_cached_properties.IndexCache.time_engine('RangeIndex')
+      1.06±0.2μs       1.27±0.2μs     1.20  index_cached_properties.IndexCache.time_values('TimedeltaIndex')
+       990±100ns       1.19±0.2μs     1.20  index_cached_properties.IndexCache.time_is_all_dates('TimedeltaIndex')
-      1.22±0.1μs       1.09±0.1μs     0.89  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')
-     1.22±0.09μs      1.06±0.03μs     0.86  index_cached_properties.IndexCache.time_inferred_type('DatetimeIndex')
-      1.42±0.2μs       1.11±0.1μs     0.78  index_cached_properties.IndexCache.time_values('UInt64Index')
-      2.68±0.6μs       2.00±0.1μs     0.74  index_cached_properties.IndexCache.time_inferred_type('IntervalIndex')
-        11.1±3μs         6.80±2μs     0.61  index_cached_properties.IndexCache.time_engine('Float64Index')

Forth benchmark:

+        777±30ns         896±70ns     1.15  index_cached_properties.IndexCache.time_is_monotonic('Int64Index')
+        437±20ns         500±30ns     1.14  index_cached_properties.IndexCache.time_is_all_dates('RangeIndex')
+      6.95±0.4μs       7.90±0.4μs     1.14  index_cached_properties.IndexCache.time_engine('UInt64Index')
+      1.12±0.1μs       1.27±0.1μs     1.13  index_cached_properties.IndexCache.time_values('UInt64Index')
+      1.01±0.1μs       1.13±0.2μs     1.12  index_cached_properties.IndexCache.time_values('Float64Index')
+      2.18±0.3μs       2.42±0.4μs     1.11  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
-      1.26±0.2μs       1.11±0.1μs     0.89  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')
-        507±30ns         447±30ns     0.88  index_cached_properties.IndexCache.time_inferred_type('Int64Index')

Overall, they look like being worse in a relative view. I don't know why their performance fluctuates whenever I observed. But I think this is quite acceptable because when I have benchmarked between upstream/master and upstream/master, they also fluctuate likewise.

As I told at another issue #28634, I think the fluctuation occurs because its computing time is too short to compare relatively.

pandas/tests/series/test_timeseries.py Outdated Show resolved Hide resolved
@dongho-jung
Copy link
Contributor Author

I increased the periods from 3 to 5 for a more detailed explanation. the first of pct_change should be NaN as there is no previous value and the next should be inf as the previous value is 0 and the current value is 1. The next two values should be NaN because 2019-11-16, 2019-11-17 are not business days. The last value should be 3 since the previous value is 1 and the current value is 1.

def test_pct_change_with_duplicate_axis(self):
# GH 28664
common_idx = date_range("2019-11-14", periods=5, freq="D")
result = Series(range(5), common_idx).pct_change(freq="B")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually can you replicate the original issue here, I think BM was used. I don't mind having B as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I had two options, the original issue used freq BM and periods 70. I thought you also want the explicit and explainable expected, however, if I used periods 70 then the expected would be quite long and hard to explain why it is so.

so two options are as follows:

  1. replicate the original issue, but omit the explanation
  2. use freq B since it's also anchored offset, and leave the explanation as it is

both options are not hard :D
please tell me which one you more prefer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i c, ok, then let's leave it.

@jreback jreback merged commit 710d82c into pandas-dev:master Nov 15, 2019
@jreback
Copy link
Contributor

jreback commented Nov 15, 2019

thanks @0xF4D3C0D3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Frequency DateOffsets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: pct_change with frequency set as 'BM' throws value error
5 participants