BUG date_range for Annual interval gives end of year instead of start #9312

Closed
mangecoeur opened this issue Jan 20, 2015 · 14 comments
Labels
Frequency DateOffsets

Comments

@mangecoeur
Contributor

When creating an annual date range from the start of a year, the dates generated fall at the END of that year and do not include the datetime provided. You would expect a range starting at start_date, followed by subsequent dates one year apart.

pd.date_range(datetime.datetime(2006, 1, 1, 0,0,0), periods=2, freq='A')

<class 'pandas.tseries.index.DatetimeIndex'>
[2006-12-31, 2007-12-31]
Length: 2, Freq: A-DEC, Timezone: None

For reference, dateutil.rrule gives the intuitive behavior:

list(rrule(YEARLY, count=2, dtstart=datetime.datetime(2006, 1, 1, 0,0,0)))

[datetime.datetime(2006, 1, 1, 0, 0), datetime.datetime(2007, 1, 1, 0, 0)]
@rockg
Contributor

rockg commented Jan 21, 2015

I don't think this is a bug, but a choice at the time the aliases were made. This is clearly laid out in the date offset documentation ("Offset Aliases" section).

A     year end frequency
BA    business year end frequency
AS    year start frequency
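A minimal sketch of the difference between the two anchorings listed above. It passes offset objects (`pd.offsets.YearEnd`, `pd.offsets.YearBegin`) rather than alias strings, on the assumption that the alias spellings may differ between pandas releases; the variable names are illustrative only.

```python
import pandas as pd

start = pd.Timestamp("2006-01-01")

# Year-end anchoring (what the 'A' alias stands for in this thread):
# the start date is rolled forward to Dec 31 of the same year.
year_end = pd.date_range(start, periods=2, freq=pd.offsets.YearEnd())

# Year-start anchoring (what the 'AS' alias stands for in this thread):
# the start date already sits on the anchor, so it is kept.
year_start = pd.date_range(start, periods=2, freq=pd.offsets.YearBegin())

print(list(year_end.strftime("%Y-%m-%d")))    # ['2006-12-31', '2007-12-31']
print(list(year_start.strftime("%Y-%m-%d")))  # ['2006-01-01', '2007-01-01']
```

The second call reproduces the rrule result from the original report.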

@TomAugspurger
Contributor

Agreed with @rockg. I'm guessing that year-end by default is common in many fields, so we can't really say that year-start frequency is more intuitive. Plus, pd.date_range(start='2015-01-01', end='2015-12-31', freq='A') forces you to drop at least one of those dates.

@mangecoeur
Contributor Author

Well it is also inconsistent with rrule which would be the closest analogue outside of Pandas. I actually ran into this when converting code from using rrule to pandas and it definitely seems odd. I'm not sure what application defaults to the end of the year.

As for pd.date_range(start='2015-01-01', end='2015-12-31', freq='A'): it seems strange to me to drop the starting date, since most ranges in Python are left-closed (they include the start but not the end). To me this is like writing range(0,10,10) and getting 10 instead of 0. Similarly, if you use periods=1, you supply a start date that is not included in the resulting range. This IMHO is weird behaviour, considering that the simplest case should follow the most common pattern.
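The left-closed analogy can be sketched with the standard library alone. `yearly_from` is a hypothetical helper written for illustration, not part of pandas or dateutil; it shows the behaviour the comment expects, where the first element is the start value itself.

```python
from datetime import datetime


def yearly_from(start, periods):
    """Left-closed yearly sequence: the first element is `start` itself.

    Note: datetime.replace raises ValueError for a Feb 29 start in a
    non-leap target year; this sketch does not handle that case.
    """
    return [start.replace(year=start.year + i) for i in range(periods)]


# The built-in range includes the start and excludes the stop.
print(list(range(0, 10, 10)))  # [0]

# The same convention applied to yearly dates.
print(yearly_from(datetime(2006, 1, 1), 2))
```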

@TomAugspurger
Contributor

Is left-closed more common, though (I agree that it is for other Python ranges)? For date stuff, I could see this being the natural behavior (I'm thinking of finance). This had to have been a deliberate choice, and not an implementation detail, I think.

Anyway, this would be a big break in backwards compatibility, and I don't think that the benefit justifies the break. Do you? If we did do anything, I'd break date_range into two functions, one for specifying start, end and freq, and another for specifying (start or end), periods, and freq.

Usually in these situations I'd say we need better documentation. The offsets docs are clear I think, maybe we should note that in the docstring for date_range / DatetimeIndex?

@rockg
Contributor

rockg commented Jan 21, 2015

I personally agree that the start of the year would be a more natural default (I'm in finance, and yearly aggregations are usually easier to think of as year beginning), but it is too entrenched to change now. The same case could be made for M, which is month end, and which I don't like either. One needs to use MS, just like AS.

@TomAugspurger
Contributor

Interesting. I wonder why this was the choice originally, then. This would be a really big break in compatibility, though :(.

@mangecoeur
Contributor Author

I'm sure left-closed is common usage in Python precisely because it matches the behaviour of range and indexing - any deviations from this habit need to be clearly documented.

I'm not sure of the origin of this behaviour, it's odd from a physical sciences perspective, especially considering that "W - Weekly" gives you the start of the week - why start of the week but end of the year??

It at least needs to be better explained in the date_range docs (I totally missed that table before).

To avoid a backwards incompatible break I would just add new aliases (in any case "A" for "Annual" is odd, when even the docs say "A for Year end frequency", see also #9313).

I would create aliases with Y for Year:
YE = Year End
YS = Year Start
Y = the default (Year Start for preference)

I would also add
AE = Year End

for completeness. This would break nothing and give reasonable behaviour for someone new.

The A aliases could be kept or deprecated over time (I have no strong opinion on this).

As for Month, it should probably also get a ME alias (so at least you can be explicit) but there's otherwise not a good transition story... Maybe just doing a clean break would actually make more sense...
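The proposal above amounts to a lookup from new names to existing ones. A minimal sketch, assuming the mapping exactly as listed in this comment; `PROPOSED_ALIASES` and `resolve_alias` are hypothetical names for illustration and are not pandas APIs.

```python
# Hypothetical table: proposed alias -> existing alias named in this thread.
PROPOSED_ALIASES = {
    "YE": "A",   # Year End
    "YS": "AS",  # Year Start
    "Y": "AS",   # proposed default: Year Start
    "AE": "A",   # explicit Year End, for completeness
    "ME": "M",   # explicit Month End
}


def resolve_alias(freq):
    """Return the existing alias for a proposed one; pass others through."""
    return PROPOSED_ALIASES.get(freq, freq)


print(resolve_alias("Y"))  # AS
print(resolve_alias("A"))  # A
```

Because unknown keys pass through unchanged, existing aliases such as A and M keep their current meaning, which is the backwards-compatible property the comment is after.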

@TomAugspurger
Contributor

I like folding in the change here w/ the one from #9313.

The month story is tougher to fix. Let's see what others have to say. Sorry about closing the issue early.

@TomAugspurger TomAugspurger reopened this Jan 21, 2015
@rockg
Contributor

rockg commented Jan 21, 2015

I see no reason why we would change 'A' or 'M' behavior...there are offsets that already do start of the respective frequency. We may not think they are named the best, but there must have been a reason at the time, and in the grand scheme it's not a big deal--one time of getting it wrong and then one extra letter. I agree that making the documentation clearer and adding more explicit end frequencies would be a good resolution.

@mangecoeur
Contributor Author

@rockg certainly no need to change A, but adding Y would still make sense. I think that since pandas is still under heavy development, it's a good time to adjust this kind of detail. If someone can weigh in explaining why that behaviour was chosen, it would help the discussion.

@rockg
Contributor

rockg commented Jan 21, 2015

Yes, I agree adding Y makes sense.

@jreback
Contributor

jreback commented Jan 22, 2015

I believe a lot of this came from here: http://pytseries.sourceforge.net/core.constants.html

@jreback
Contributor

jreback commented Jan 25, 2015

There has been a closed argument to date_range for a while, which lets you specify left/right/both to suit your needs.

In [1]: pd.date_range?
Type:        function
String form: <function date_range at 0x106ab0c08>
File:        /Users/jreback/pandas/pandas/tseries/index.py
Definition:  pd.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None)
Docstring:
Return a fixed frequency datetime index, with day (calendar) as the default
frequency

Parameters
----------
start : string or datetime-like, default None
    Left bound for generating dates
end : string or datetime-like, default None
    Right bound for generating dates
periods : integer or None, default None
    If None, must specify start and end
freq : string or DateOffset, default 'D' (calendar daily)
    Frequency strings can have multiples, e.g. '5H'
tz : string or None
    Time zone name for returning localized DatetimeIndex, for example
    Asia/Hong_Kong
normalize : bool, default False
    Normalize start/end dates to midnight before generating date range
name : str, default None
    Name of the resulting index
closed : string or None, default None
    Make the interval closed with respect to the given frequency to
    the 'left', 'right', or both sides (None)

Notes
-----
2 of start, end, or periods must be specified

Returns
-------
rng : DatetimeIndex

@jreback jreback closed this as completed Jan 25, 2015
@jreback jreback added the Frequency DateOffsets label Jan 25, 2015
@TomAugspurger
Contributor

I think closed only affects the values after they've been aligned at start or end, e.g.

In [5]: pd.date_range(datetime.datetime(2006, 1, 1, 0,0,0), periods=2, freq='A', closed='left')
Out[5]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-12-31]
Length: 1, Freq: A-DEC, Timezone: None

so it makes the date_range with two elements, [2006-12-31, 2007-12-31], and then only includes the first with closed='left' since 2007-12-31 falls on the right edge, which is open.
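The two-step behaviour described here (align first, then apply closed) can be sketched in plain Python. `year_end_range` is a hypothetical helper that mirrors the explanation in this comment; it is an assumption about the observed behaviour, not pandas' actual implementation.

```python
from datetime import date


def year_end_range(start, periods, closed=None):
    """Sketch: year-end dates from `start`, filtered by `closed` afterwards."""
    # Step 1: align. Roll `start` forward to Dec 31, then step by one year.
    points = [date(start.year + i, 12, 31) for i in range(periods)]
    if not points:
        return points

    # Step 2: treat `start` as the left edge and the last aligned point as
    # the right edge, then drop any point sitting on an open edge.
    left_edge, right_edge = start, points[-1]
    if closed == "left":
        points = [p for p in points if p != right_edge]
    elif closed == "right":
        points = [p for p in points if p != left_edge]
    return points


print(year_end_range(date(2006, 1, 1), 2))                 # both year ends
print(year_end_range(date(2006, 1, 1), 2, closed="left"))  # first year end only
```

In this sketch the closed argument never moves a point from year end to year start; it only removes points that land on an open edge, which is why it does not address the original report.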

The issue here is that, by default, the initial two elements should be year start, [2006-01-01, 2007-01-01], independent of the closed argument. I'm cool with creating an alias Y for AS.


4 participants