-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: add shortcut to Timestamp constructor #30676
Conversation
Simply adding a shortcut breaks date_range in an unintuitive way (see test |
pandas/_libs/tslibs/timestamps.pyx
Outdated
@@ -389,7 +389,10 @@ class Timestamp(_Timestamp): | |||
# User passed tzinfo instead of tz; avoid silently ignoring | |||
tz, tzinfo = tzinfo, None | |||
|
|||
if isinstance(ts_input, str): | |||
if isinstance(ts_input, Timestamp) and tz is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're going to need to check that all the other kwargs are None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll change this, but first I need to solve #24329, because right now we rely on Timestamp constructor changing ts.value when called on a localized DST Timestamp (this is why the tests for this PR break currently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel Solved #24329. Now this PR can be reviewed again.
0a9c79e
to
ec8249b
Compare
Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-01-24 09:24:16 UTC |
"2019-03-10 01:00", | ||
marks=pytest.mark.xfail(reason="GH 24329"), | ||
), | ||
["dateutil/US/Pacific", "shift_backward", "2019-03-10 01:00"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shortcut, along with the fix to #24329 makes this exception no longer necessary, as a correct value gets returned without an error.
UPDATE: #31155 made the check succeed with dateutil.__version__ >= 2.7.0
. With this shortcut, the exception is not necessary with any version of dateutil
. What happened before was that during making a date_range
we would call pd.Timestamp
twice and this would alter the object in case of a dateutil
timezone near a winter/summer DST switch, which would make the test fail. #31155 made sure that the object didn't get altered with an updated version of dateutil
and this shortcut eliminates the danger of the Timestamp
being altered altogether, because the shortcut simply returns that same object.
pandas/_libs/tslibs/timestamps.pyx
Outdated
@@ -379,6 +379,8 @@ class Timestamp(_Timestamp): | |||
_date_attributes = [year, month, day, hour, minute, second, | |||
microsecond, nanosecond] | |||
|
|||
_non_ts_attributes = [freq, tz, unit, tzinfo] + _date_attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we check for emtpy kwargs similar to the way we check for empty date attributes.
pandas/_libs/tslibs/timestamps.pyx
Outdated
# GH 30543 if pd.Timestamp already passed, return it | ||
# check that only ts_input is passed | ||
if (isinstance(ts_input, Timestamp) and not | ||
any(arg is not None for arg in _non_ts_attributes)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as with _date_attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. One question since we're focusing on performance: does it make a difference if you write out the verbose and foo is None and bar is None and baz is None...
? i have a suspicion that cython doesnt optimize this list comprehension
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel Let's test it then.
The current implementation:
>>> ts = pd.Timestamp("2019-01-01")
>>> timeit.timeit(lambda: pd.Timestamp(ts), number=10000000)
5.860465700000002
And now with this implementation:
if (isinstance(ts_input, Timestamp) and freq is None and
tz is None and
unit is None and
year is None and
month is None and
day is None and
hour is None and
minute is None and
second is None and
microsecond is None and
nanosecond is None and
tzinfo is None):
return ts_input
>>> ts = pd.Timestamp("2019-01-01")
>>> timeit.timeit(lambda: pd.Timestamp(ts), number=10000000)
4.43885800000001
I've also tested this when we supply other arguments, but the overhead (or early exit from the shortcut if condition) aren't noticeable then, because the non-shortcut Timestamp is so much slower.
I think you are right, and we probably incur the overhead because Cython doesn't explode the list comprehension and instead calls the Python API.
To be honest, the exact reasoning doesn't matter, because I don't think we'll find anything faster than chaining arg is None and
checks.
I suppose I should also move this to whatsnew for version 1.1.0? |
month is None and day is None and hour is None and | ||
minute is None and second is None and | ||
microsecond is None and nanosecond is None and | ||
tzinfo is None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel This appears the way to go if we need maximum performance.
We do lose a bit of speed (between 10 and 20 percent) because we implement the shortcut after errorchecks and _date_attributes
.
Is this fine or should we hoist it after (or before) _date_attributes = [year, month, day, hour, minute, second, microsecond, nanosecond]
? I think the current way is tidier, but it's a tradeoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure all of these extra checks are worth adding to improve perf when passing a Timestamp to a Timestamp constructor - @AlexKirko how often are you expecting that to actually happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how often are you expecting that to actually happen?
For internal usage there are a lot of places where we do:
if isinstance(obj, (datetime, np.datetime64)):
obj = Timestamp(obj)
That's not exactly the usage being checked here, but could benefit in the same way from an optimized no-other-arguments-passed check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admittedly, I don't know the library well enough to comment on internal usage, but @jbrockmendel has already done that.
However, what I've done repeatedly in my own projects when on a deadline is take a Dataframe
column or a list
and just cast the elements into the type I need, trusting that if it already is that class, the performance loss won't be noticeable in the larger program (I do lots of ad-hoc data science modeling). I don't think it's as much of a problem in production-quality code, but a lot of people I work with use pandas
to quickly preprocess data for sklearn
.
You tend to rely on being able to cut corners when working with a well-supported package, and, currently, calling Timestamp on a Timestamp is more than 10 000 times slower than the proposed shortcut, which can be a nasty shock for someone expecting to just blitz through type conversions during data wrangling.
@WillAyd We could make the code a bit more compact with what was proposed in 4c9eb70
, which you can look up above, but this incurs about a 25% performance loss. I believe that if we do the shortcut, might as well add extra two lines. It's a bit grating but the gain is worth it, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment to the effect of "we do this verbose thing because cython wont optimize a list comprehension (as of cython 0.29.x)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel Added the comment you requested.
when you say cast? are you doing an astype or some sort of loop? can i show the original example perf issue |
When I say "cast" mean a
@jreback Thanks for suggesting designing and running more tests. I went ahead and did that. TLDR: depending on the usage conditions, the shortcut can speed up conversion to Timestamp to very different degrees. We speed things up about:
In addition:
I think that implementing the shortcut is definitely worth it. As much as we might prefer it to not be true, some people will always loop when there is a better solution available (and sometimtes it's not available). I don't think moving the check to the very top of the function is worth it, as that is very ugly, but using expanded What do you think? The performance tests code and the results are below. If we run the original perf issue code on master:
And on the current commit of the PR branch (
So with original issue code, the shortcut runs about 7 times faster. Let's test this further.
And on the current commit of the PR branch (
So we save about 85% of time on a loop, 30% of time on
So hoisting appears to help on loops.
With list comprehensions, this runs about two times slower in a loop, and a bit slower in lambda. My guess is that since the comprehension doesn't depend on the value converted, it gets optimized away. Also pinging @jbrockmendel @WillAyd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you add the benchmarks you did as asv's.
I don't think this is really a big deal as converting scalars is a anti-pattern, we don't do this at scale (sure we do this often on a single basis), but on purpose we don't even store Timestamps (these are stored as i8), so if users are doing it and converting via .apply
then tough.
that said this is ok as its not onerous on maitenance.
Timestamp(self.ts) | ||
|
||
def time_identity_series_apply(self): | ||
self.ts_series.apply(lambda x: Timestamp(x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added benchmarks for scalar cast and apply. Didn't touch astype, as this PR doesn't change astype performance. Is this what you had in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
benchmark.tslibs is intended to only depend on _libs.tslibs, so shouldnt have any Series objects in it. just benchmark the Timestamp contructor here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as long as we're at it, might as well have cases for single-arguments for np.datetime64, pydatetime, and tzaware
Timestamp(self.dttime_aware) | ||
|
||
def time_from_pd_timestamp(self): | ||
Timestamp(self.ts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
Was this what you had in mind? I think we should leave only the scalar transformer for benchmarking this shortcut. Don't see much sense in adding the benchmarks for transforming a Series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. i think if you add @jbrockmendel comment about cython should be good. ping on green.
@jreback |
@jreback |
lgtm. can you rebase, ping on green. |
33fca04
to
c80f748
Compare
@jreback |
thanks @AlexKirko very nice! keep em coming |
…ndexing-1row-df * upstream/master: (194 commits) DOC Remove Python 2 specific comments from documentation (pandas-dev#31198) Follow up PR: pandas-dev#28097 Simplify branch statement (pandas-dev#29243) BUG: DatetimeIndex.snap incorrectly setting freq (pandas-dev#31188) Move DataFrame.info() to live with similar functions (pandas-dev#31317) ENH: accept a dictionary in plot colors (pandas-dev#31071) PERF: add shortcut to Timestamp constructor (pandas-dev#30676) CLN/MAINT: Clean and annotate stata reader and writers (pandas-dev#31072) REF: define _get_slice_axis in correct classes (pandas-dev#31304) BUG: DataFrame.floordiv(ser, axis=0) not matching column-wise bheavior (pandas-dev#31271) PERF: optimize is_scalar, is_iterator (pandas-dev#31294) BUG: Series rolling count ignores min_periods (pandas-dev#30923) xfail sparse warning; closes pandas-dev#31310 (pandas-dev#31311) REF: DatetimeIndex.get_value wrap DTI.get_loc (pandas-dev#31314) CLN: internals.managers (pandas-dev#31316) PERF: avoid copies if possible in fill_binop (pandas-dev#31300) Add test for multiindex json (pandas-dev#31307) BUG: passing TDA and wrong freq to TimedeltaIndex (pandas-dev#31268) BUG: inconsistency between PeriodIndex.get_value vs get_loc (pandas-dev#31172) CLN: remove _set_subtyp (pandas-dev#31301) CI: Updated version of macos image (pandas-dev#31292) ...
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This implements a shortcut in the Timestamp constructor to cut down on processing if Timestamp is passed. We still need to check if the timezone was passed correctly. Then, if a Timestamp was passed, and there is no timezone, we just return that same Timestamp.
A test is added to check that the Timestamp is still the same object.
PR for timedelta to be added once I confirm that this is the approach we want to go with.