Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: add shortcut to Timestamp constructor #30676

Merged
merged 12 commits into from
Jan 26, 2020

Conversation

AlexKirko
Copy link
Member

@AlexKirko AlexKirko commented Jan 4, 2020

This implements a shortcut in the Timestamp constructor to cut down on processing if Timestamp is passed. We still need to check if the timezone was passed correctly. Then, if a Timestamp was passed, and there is no timezone, we just return that same Timestamp.
A test is added to check that the Timestamp is still the same object.

PR for timedelta to be added once I confirm that this is the approach we want to go with.

@AlexKirko
Copy link
Member Author

Simply adding a shortcut breaks date_range in an unintuitive way (see test TestDatetimeIndexTimezones.test_dti_construction_ambiguous_endpoint). I'll look deeper into this today to see why date_range no longer takes into account daylight savings with my fix.

@@ -389,7 +389,10 @@ class Timestamp(_Timestamp):
# User passed tzinfo instead of tz; avoid silently ignoring
tz, tzinfo = tzinfo, None

if isinstance(ts_input, str):
if isinstance(ts_input, Timestamp) and tz is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're going to need to check that all the other kwargs are None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll change this, but first I need to solve #24329, because right now we rely on Timestamp constructor changing ts.value when called on a localized DST Timestamp (this is why the tests for this PR break currently).

Copy link
Member Author

@AlexKirko AlexKirko Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how digging through the code goes, this PR might also take care of #24329, but I still need a couple days to analyze tz_localize, pd.date_range, and the Timestamp constructor.
Update: a separate PR is needed. Dealing with #24329 in PR #30995, will return here after that gets done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel Solved #24329. Now this PR can be reviewed again.

@alimcmaster1 alimcmaster1 added the Performance Memory or execution speed performance label Jan 4, 2020
@pep8speaks
Copy link

pep8speaks commented Jan 15, 2020

Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-24 09:24:16 UTC

"2019-03-10 01:00",
marks=pytest.mark.xfail(reason="GH 24329"),
),
["dateutil/US/Pacific", "shift_backward", "2019-03-10 01:00"],
Copy link
Member Author

@AlexKirko AlexKirko Jan 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shortcut, along with the fix to #24329 makes this exception no longer necessary, as a correct value gets returned without an error.
UPDATE: #31155 made the check succeed with dateutil.__version__ >= 2.7.0 . With this shortcut, the exception is not necessary with any version of dateutil. What happened before was that during making a date_range we would call pd.Timestamp twice and this would alter the object in case of a dateutil timezone near a winter/summer DST switch, which would make the test fail. #31155 made sure that the object didn't get altered with an updated version of dateutil and this shortcut eliminates the danger of the Timestamp being altered altogether, because the shortcut simply returns that same object.

@@ -379,6 +379,8 @@ class Timestamp(_Timestamp):
_date_attributes = [year, month, day, hour, minute, second,
microsecond, nanosecond]

_non_ts_attributes = [freq, tz, unit, tzinfo] + _date_attributes
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we check for emtpy kwargs similar to the way we check for empty date attributes.

# GH 30543 if pd.Timestamp already passed, return it
# check that only ts_input is passed
if (isinstance(ts_input, Timestamp) and not
any(arg is not None for arg in _non_ts_attributes)):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as with _date_attributes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. One question since we're focusing on performance: does it make a difference if you write out the verbose and foo is None and bar is None and baz is None...? i have a suspicion that cython doesnt optimize this list comprehension

Copy link
Member Author

@AlexKirko AlexKirko Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel Let's test it then.
The current implementation:

>>> ts = pd.Timestamp("2019-01-01")
>>> timeit.timeit(lambda: pd.Timestamp(ts), number=10000000)
5.860465700000002

And now with this implementation:

        if (isinstance(ts_input, Timestamp) and freq is None and
                tz is None and
                unit is None and
                year is None and
                month is None and
                day is None and
                hour is None and
                minute is None and
                second is None and
                microsecond is None and
                nanosecond is None and
                tzinfo is None):
            return ts_input
>>> ts = pd.Timestamp("2019-01-01")
>>> timeit.timeit(lambda: pd.Timestamp(ts), number=10000000)
4.43885800000001

I've also tested this when we supply other arguments, but the overhead (or early exit from the shortcut if condition) aren't noticeable then, because the non-shortcut Timestamp is so much slower.
I think you are right, and we probably incur the overhead because Cython doesn't explode the list comprehension and instead calls the Python API.
To be honest, the exact reasoning doesn't matter, because I don't think we'll find anything faster than chaining arg is None and checks.

@AlexKirko
Copy link
Member Author

AlexKirko commented Jan 15, 2020

I suppose I should also move this to whatsnew for version 1.1.0?
Update: moved it since we already have the 1.0.0 release candidate.

month is None and day is None and hour is None and
minute is None and second is None and
microsecond is None and nanosecond is None and
tzinfo is None):
Copy link
Member Author

@AlexKirko AlexKirko Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel This appears the way to go if we need maximum performance.
We do lose a bit of speed (between 10 and 20 percent) because we implement the shortcut after errorchecks and _date_attributes.
Is this fine or should we hoist it after (or before) _date_attributes = [year, month, day, hour, minute, second, microsecond, nanosecond]? I think the current way is tidier, but it's a tradeoff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure all of these extra checks are worth adding to improve perf when passing a Timestamp to a Timestamp constructor - @AlexKirko how often are you expecting that to actually happen?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how often are you expecting that to actually happen?

For internal usage there are a lot of places where we do:

if isinstance(obj, (datetime, np.datetime64)):
    obj = Timestamp(obj)

That's not exactly the usage being checked here, but could benefit in the same way from an optimized no-other-arguments-passed check.

Copy link
Member Author

@AlexKirko AlexKirko Jan 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly, I don't know the library well enough to comment on internal usage, but @jbrockmendel has already done that.

However, what I've done repeatedly in my own projects when on a deadline is take a Dataframe column or a list and just cast the elements into the type I need, trusting that if it already is that class, the performance loss won't be noticeable in the larger program (I do lots of ad-hoc data science modeling). I don't think it's as much of a problem in production-quality code, but a lot of people I work with use pandas to quickly preprocess data for sklearn.

You tend to rely on being able to cut corners when working with a well-supported package, and, currently, calling Timestamp on a Timestamp is more than 10 000 times slower than the proposed shortcut, which can be a nasty shock for someone expecting to just blitz through type conversions during data wrangling.

@WillAyd We could make the code a bit more compact with what was proposed in 4c9eb70, which you can look up above, but this incurs about a 25% performance loss. I believe that if we do the shortcut, might as well add extra two lines. It's a bit grating but the gain is worth it, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment to the effect of "we do this verbose thing because cython wont optimize a list comprehension (as of cython 0.29.x)"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel Added the comment you requested.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2020

when you say cast? are you doing an astype or some sort of loop? can i show the original example perf issue

@AlexKirko
Copy link
Member Author

AlexKirko commented Jan 18, 2020

When I say "cast" mean a timeit loop of pd.Timestamp(ts) in the python console. Something along the lines of:

ts = pd.Timestamp(100)
timeit(lambda: pd.Timestamp(ts), number=10000000)

@jreback Thanks for suggesting designing and running more tests. I went ahead and did that.

TLDR: depending on the usage conditions, the shortcut can speed up conversion to Timestamp to very different degrees. We speed things up about:

  1. >10 000 times in a timeit call in the console (testing done previously).
  2. ~7 times in a %timeit call in Jupyter Notebook
  3. ~30% when using pd.Series.apply
  4. Nothing when using pd.Series.astype.

In addition:

  1. Hoisting the check to the very top of __new__speeds up loop performance by about 20% and doesn't speed up anything else.
  2. Replacing all the is not None checks with not any on a list comprehension slows down loop performance by about 100%, slows down apply a bit, and doesn't impact astype.

I think that implementing the shortcut is definitely worth it. As much as we might prefer it to not be true, some people will always loop when there is a better solution available (and sometimtes it's not available). I don't think moving the check to the very top of the function is worth it, as that is very ugly, but using expanded is not None gives a large performance boost in loops, so I think we should keep it.

What do you think?

The performance tests code and the results are below.

If we run the original perf issue code on master:

IN:
ts = pd.Timestamp('2020')
%timeit pd.Timestamp(ts)
OUT:
2.17 µs ± 168 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

And on the current commit of the PR branch (bf08cc8):

IN:
ts = pd.Timestamp('2020')
%timeit pd.Timestamp(ts)
OUT:
372 ns ± 31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So with original issue code, the shortcut runs about 7 times faster.
This is, of course, the worst possible case, but loops might be used only by somebody just starting with pandas and Python in general.

Let's test this further.
On master:

IN:
t_ser = pd.Series(range(100000))
t_ser = t_ser.astype('<M8[ns]')
%timeit t_ser.apply(lambda x: pd.Timestamp(x))
%timeit t_ser.astype('<M8[ns]')
OUT:
634 ms ± 25.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
146 µs ± 2.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And on the current commit of the PR branch (bf08cc8):

IN:
t_ser = pd.Series(range(100000))
t_ser = t_ser.astype('<M8[ns]')
%timeit t_ser.apply(lambda x: pd.Timestamp(x))
%timeit t_ser.astype('<M8[ns]')
OUT:
464 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
151 µs ± 4.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So we save about 85% of time on a loop, 30% of time on apply, and nothing on astype. My guess is that astype has its own shortcut for when the user tries to cast a Series into its own dtype.
I also hoisted the check to the very top of the constructor and ran this again:

IN:
ts = pd.Timestamp('2020')
%timeit pd.Timestamp(ts)
OUT:
298 ns ± 18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


IN:
t_ser = pd.Series(range(100000))
t_ser = t_ser.astype('<M8[ns]')
%timeit t_ser.apply(lambda x: pd.Timestamp(x))
%timeit t_ser.astype('<M8[ns]')
OUT:
451 ms ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
160 µs ± 8.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So hoisting appears to help on loops.
And also ran it on commit 4c9eb70 (the solution with not any(arg is not None for arg in _non_ts_attributes):

IN:
ts = pd.Timestamp('2020')
%timeit pd.Timestamp(ts)
OUT:
623 ns ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

IN:
t_ser = pd.Series(range(100000))
t_ser = t_ser.astype('<M8[ns]')
%timeit t_ser.apply(lambda x: pd.Timestamp(x))
%timeit t_ser.astype('<M8[ns]')
OUT:
482 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
154 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

With list comprehensions, this runs about two times slower in a loop, and a bit slower in lambda. My guess is that since the comprehension doesn't depend on the value converted, it gets optimized away.

Also pinging @jbrockmendel @WillAyd
These testing results should make it easier to decide on what we want to implement here.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexKirko

ok can you add the benchmarks you did as asv's.

I don't think this is really a big deal as converting scalars is a anti-pattern, we don't do this at scale (sure we do this often on a single basis), but on purpose we don't even store Timestamps (these are stored as i8), so if users are doing it and converting via .apply then tough.

that said this is ok as its not onerous on maitenance.

@jreback jreback added this to the 1.1 milestone Jan 18, 2020
Timestamp(self.ts)

def time_identity_series_apply(self):
self.ts_series.apply(lambda x: Timestamp(x))
Copy link
Member Author

@AlexKirko AlexKirko Jan 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback

Added benchmarks for scalar cast and apply. Didn't touch astype, as this PR doesn't change astype performance. Is this what you had in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

benchmark.tslibs is intended to only depend on _libs.tslibs, so shouldnt have any Series objects in it. just benchmark the Timestamp contructor here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as we're at it, might as well have cases for single-arguments for np.datetime64, pydatetime, and tzaware

@AlexKirko AlexKirko requested a review from jreback January 19, 2020 07:30
Timestamp(self.dttime_aware)

def time_from_pd_timestamp(self):
Timestamp(self.ts)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
Was this what you had in mind? I think we should leave only the scalar transformer for benchmarking this shortcut. Don't see much sense in adding the benchmarks for transforming a Series.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect, thanks

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. i think if you add @jbrockmendel comment about cython should be good. ping on green.

@AlexKirko
Copy link
Member Author

@jreback
Done.

@AlexKirko AlexKirko requested a review from jreback January 21, 2020 08:59
@AlexKirko
Copy link
Member Author

@jreback
All green, should be ready to merge.

@jreback
Copy link
Contributor

jreback commented Jan 24, 2020

lgtm. can you rebase, ping on green.

@AlexKirko
Copy link
Member Author

@jreback
Rebased, all green.
Also cleaned up the conflicts with my previour 2 PRs. Don't need one of the dateutil xfails now.

@jreback jreback merged commit 35df212 into pandas-dev:master Jan 26, 2020
@jreback
Copy link
Contributor

jreback commented Jan 26, 2020

thanks @AlexKirko very nice! keep em coming

@AlexKirko AlexKirko deleted the perf-timestamp branch January 27, 2020 06:55
keechongtan added a commit to keechongtan/pandas that referenced this pull request Jan 27, 2020
…ndexing-1row-df

* upstream/master: (194 commits)
  DOC Remove Python 2 specific comments from documentation (pandas-dev#31198)
  Follow up PR: pandas-dev#28097 Simplify branch statement (pandas-dev#29243)
  BUG: DatetimeIndex.snap incorrectly setting freq (pandas-dev#31188)
  Move DataFrame.info() to live with similar functions (pandas-dev#31317)
  ENH: accept a dictionary in plot colors (pandas-dev#31071)
  PERF: add shortcut to Timestamp constructor (pandas-dev#30676)
  CLN/MAINT: Clean and annotate stata reader and writers (pandas-dev#31072)
  REF: define _get_slice_axis in correct classes (pandas-dev#31304)
  BUG: DataFrame.floordiv(ser, axis=0) not matching column-wise bheavior (pandas-dev#31271)
  PERF: optimize is_scalar, is_iterator (pandas-dev#31294)
  BUG: Series rolling count ignores min_periods (pandas-dev#30923)
  xfail sparse warning; closes pandas-dev#31310 (pandas-dev#31311)
  REF: DatetimeIndex.get_value wrap DTI.get_loc (pandas-dev#31314)
  CLN: internals.managers (pandas-dev#31316)
  PERF: avoid copies if possible in fill_binop (pandas-dev#31300)
  Add test for multiindex json (pandas-dev#31307)
  BUG: passing TDA and wrong freq to TimedeltaIndex (pandas-dev#31268)
  BUG: inconsistency between PeriodIndex.get_value vs get_loc (pandas-dev#31172)
  CLN: remove _set_subtyp (pandas-dev#31301)
  CI: Updated version of macos image (pandas-dev#31292)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Timestamp/Timedelta constructors when passed a Timestamp/Timedelta
6 participants