Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Faster Series.__getattribute__ #20834

Merged

Conversation

TomAugspurger
Copy link
Contributor

Closes #19764

@TomAugspurger
Copy link
Contributor Author

Basic idea is to let the index class say whether it can hold valid python identifiers. Classes that can

  1. Index
  2. MultiIndex

I think the rest can't.

@TomAugspurger
Copy link
Contributor Author

Note that it's can contain valid identifiers, not that it does.

I'd be happy to hear alternative names to is_dotable. I was just thinking .<key> accessing pronounced as "dot key".

@@ -31,6 +31,7 @@ class NumericIndex(Index):

"""
_is_numeric_dtype = True
_is_dotable = False # Can't contain Python identifiers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: DatetimeIndex and friends inherit from Int64Index (which inherits from here).

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is much cleaner than my idea of a hack !

I would try to use getattr in the name (ideas: needs_getattr_check, can_getattr, ..), but not that important ;-)

@@ -958,6 +958,7 @@ Performance Improvements
- Improved performance of :func:`pandas.core.groupby.GroupBy.any` and :func:`pandas.core.groupby.GroupBy.all` (:issue:`15435`)
- Improved performance of :func:`pandas.core.groupby.GroupBy.pct_change` (:issue:`19165`)
- Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)
- Improved performance of ``Series.__getattribute__`` when the Series has certain index types. This manifiested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's getattr instead of getattribute I think

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 26, 2018

Hmm the issue with names like getattr* is that it describes why it might be useful, but not what it actually does... How about _can_hold_identifiers?

@pep8speaks
Copy link

pep8speaks commented Apr 26, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 01, 2018 at 00:10 Hours UTC

@codecov
Copy link

codecov bot commented Apr 26, 2018

Codecov Report

Merging #20834 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20834      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49280    49340      +60     
==========================================
+ Hits        45229    45288      +59     
- Misses       4051     4052       +1
Flag Coverage Δ
#multiple 90.18% <100%> (ø) ⬆️
#single 41.94% <100%> (+0.06%) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 95.94% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.64% <100%> (ø) ⬆️
pandas/util/testing.py 84.59% <0%> (-0.21%) ⬇️
pandas/core/indexes/timedeltas.py 91.15% <0%> (-0.07%) ⬇️
pandas/core/indexes/datetimes.py 95.73% <0%> (-0.04%) ⬇️
pandas/core/arrays/base.py 83.95% <0%> (ø) ⬆️
pandas/io/formats/latex.py 100% <0%> (ø) ⬆️
pandas/api/extensions/__init__.py 100% <0%> (ø) ⬆️
pandas/io/pytables.py 92.41% <0%> (ø) ⬆️
pandas/core/internals.py 95.59% <0%> (+0.01%) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6cacdde...1b598d0. Read the comment docs.

@@ -4375,7 +4375,8 @@ def __getattr__(self, name):
name in self._accessors):
return object.__getattribute__(self, name)
else:
if name in self._info_axis:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the right fix. The issue is that some index types DO allow a non-holding index type to be testing in __contains__, e.g. a string in a DTI. This check can be non-trivial because the index hashtable may need to be created to test if its a valid method.

Instead I would create a _contains private method to do this, where by it must be an exact dtype (and not a possible one, like a string in a DTI).

Note we already have contains method so can't really use that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. a string in a DTI

These can't be valid identifiers though, so they by definition can't be going through __getatrr__.

We can perform this cheap check at the class level, without having to deal with values at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this isn't addressing the other performance problems with DTI, which are left for #17754

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said, this is not a good solution here. Call a method on the object. In the future folks will have no idea this is called, nor where it is, it add unecessary context and complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future folks will have no idea this is called

We have a comment explaining it on the base Index class and a link back to the original issue.

In what case will this cause issues? It's short-circuting a check that will always return False.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with Joris. This if block is specifically for dot selection. It makes sense to be there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is that you instead should be calling a function like

if self._info_axis._._can_hold(name):
     return self[name]

putting this here buries all kinds of logic in an obscure routine. no-one will be able to understand / find this later.
These should be the simplest routines possible, while pushing logic to the axis object itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it be a function? To outcome doesn't depend on name.

putting this here buries all kinds of logic in an obscure routine

Can you say exactly what logic you think I'm burying here? You have if self._info_axis._can_hold(name). I have if self._info_axis._can_hold_identifiers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are checking for a random attribute on an index class. This is so fragile, with no documentation whatsoever. Simply make it a function call, I am not sure why this is that hard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why this is that hard.

Uhm...

@jorisvandenbossche
Copy link
Member

+1 on _can_hold_identifiers, much better than my suggestions :-)

@jorisvandenbossche jorisvandenbossche added Datetime Datetime data dtype Performance Memory or execution speed performance labels Apr 27, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.0 milestone Apr 27, 2018
@TomAugspurger
Copy link
Contributor Author

@jreback could you clarify your proposed solution? I don't see how

NDFrame._info_axis_can_hold_identifiers()

is any clearer than just checking the attribute off the info axis. Why add the extra layer of indirection to just look up a piece of static data?

@TomAugspurger
Copy link
Contributor Author

Oh dang, I apparently misunderstood how getattr works (this is on master).

In [1]: import pandas as pd

In [2]: s = pd.Series(1, index=pd.date_range('2017', periods=10))

In [3]: getattr(s, '2017-01-01')
Out[3]: 1

On this branch that would raise. Ugh....

@TomAugspurger
Copy link
Contributor Author

Obviously doing that is an awful idea (right? Am I missing a valid use case?), so I'm probably OK just breaking it...

@jorisvandenbossche
Copy link
Member

Yeah, I am totally fine with breaking that

@WillAyd
Copy link
Member

WillAyd commented Apr 27, 2018

Might have missed some conversations and I realize this is wrapped up in #19764 (comment) but have we considered dropping support for accessing index elements of a Series by name altogether? IMO seems strange to support and fraught with peril. Perhaps a little less dangerous than the previous .ix usage because its more limited in scope but not that far off

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 27, 2018 via email

@WillAyd
Copy link
Member

WillAyd commented Apr 27, 2018

How is that use case any different from doing index.loc['S'] in his examples? I'll admit up front that I'm biased against dot-notation but I find there to be a ton of nuances to supporting this that make it tough to simply convey how and when it actually works:

# Don't start with anything but a letter or underscore
>>> ser = pd.Series(range(3), index=['1S', '1I', '1R'])
>>> ser.1S
SyntaxError: invalid syntax

# Don't use reserved words
>>> ser = pd.Series(range(3), index=['and', 'or', 'for'])
>>> ser.1S
SyntaxError: invalid syntax

# Hope this doesn't ever mangle
ser = pd.Series(range(3), index=['index', 'values', 'unique'])

Non-ASCII names would not be supported in Python2.7 (though I suppose that shouldn't dictate pandas approach at this point). As you probably alluded to in your usage of getattr above too I'm not sure how / if it should be resolving something like the below:

>>> ser = pd.Series(1, index=pd.date_range('2017-01-01', periods=3, freq='s'))
>>> getattr(ser, '2017-01-01')
2017-01-01 00:00:00    1
2017-01-01 00:00:01    1
2017-01-01 00:00:02    1
Freq: S, dtype: int64

Don't mean to hijack this thread and this discussion could probably be moved elsewhere but figured I'd throw it out there as a future deprecation candidate

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 27, 2018 via email

@jreback
Copy link
Contributor

jreback commented Apr 27, 2018

how is that case compelling? why is this not using indexing under the hood (eg .loc[‘beta’]), rather relying on a attribute access which has these pitfalls?

the UX could certainly provide direct attribute access but that is not a pandas issue / problem

@jorisvandenbossche
Copy link
Member

Use case where I, somewhat unknowingly (as you don't directly interact with the "Series" object in this case), used this in the past is with apply on the rows:

df.apply(lambda row: Point(row.x, row.y), axis=1)

For sure, the above can also be written with [] (or actually more performantly by not converting each row into a Series).
But, I think we are for sure going to break a lot of code, for what benefit? Accessing columns as attributes also has its known pitfalls.

Anyhow, let's discuss that somewhere else if we want to continue this discussion.

For the actual PR, any objections for merging this?

I think the main question we need to answer if we are OK with breaking getattr(s, 'invalid_identifier') (so something that would never work as s.invalid_identifier) ?
I am fine with that.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs an asv (e.g. the printing case is prob ok)

@@ -4375,7 +4375,8 @@ def __getattr__(self, name):
name in self._accessors):
return object.__getattribute__(self, name)
else:
if name in self._info_axis:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is that you instead should be calling a function like

if self._info_axis._._can_hold(name):
     return self[name]

putting this here buries all kinds of logic in an obscure routine. no-one will be able to understand / find this later.
These should be the simplest routines possible, while pushing logic to the axis object itself.

@TomAugspurger
Copy link
Contributor Author

Just calling repr(s) doesn't hit the slow code.

In [6]: %timeit self.time_series_datetimeindex_repr()
1.21 s ± 57.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vs.

In [8]: %timeit self.time_series_datetimeindex_repr()
2.08 µs ± 45.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@TomAugspurger TomAugspurger mentioned this pull request Apr 30, 2018
71 tasks
@TomAugspurger
Copy link
Contributor Author

Laying out the motivation for e0710f3:

  1. There's no logic in Series.__getattr__. All of the logic about whether name could possibly be in Series.index, so that Series.foo could succeed, is on the index class, where it belongs
  2. I used a property so that this is easier to document, and navigate between child and parent classes than a class attribute
  3. I overrode _can_hold_identifiers on child classes, since having the parent make that decision for the child seems improper.
  4. I used a property rather than a function, since the result doesn't depend on name. We're trying to determine whether series.foo could possibly be in series.index, not whether it is in Series.index.

@TomAugspurger
Copy link
Contributor Author

@jreback any thoughts on #20834 (comment)? Does the reason for not using a function make sense?

@jreback
Copy link
Contributor

jreback commented Apr 30, 2018

let me look

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 30, 2018 via email

@jreback
Copy link
Contributor

jreback commented Apr 30, 2018

. used a property rather than a function, since the result doesn't depend on name. We're trying to determine whether series.foo could possibly be in series.index, not whether it is in Series.index.

this is simply not True. you have a comparison

if (self._info_axis._can_hold_identifiers and
+                    name in self._info_axis):

you are checking a property on the index and then do a __contains__ check on the same index.

@TomAugspurger
Copy link
Contributor Author

Right, I'm talking about the _can_hold_identifiers bit. name in self._info_axis is working properly.

You would replace that with something like if self._info_axis._can_hold_identifiers_and_holds_name?

def _can_hold_identifiers_and_holds_name(self, name):
     # dispatch on the type of index we are
     # no real need to subclass as this check can be confined to 
     # a small number of indexes
     if self.is_object() or self.is_categorical():
        return name in self
     return False

Why do two things in the function?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 30, 2018

But at this point, I really, really am over this change. I'm just going to implement that and be done with it :)

@@ -4375,8 +4375,7 @@ def __getattr__(self, name):
name in self._accessors):
return object.__getattribute__(self, name)
else:
if (self._info_axis._can_hold_identifiers and
name in self._info_axis):
if self._info_axis._can_hold_identifiers_and_holds_name(name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much nicer! thanks!

@jreback
Copy link
Contributor

jreback commented May 1, 2018

lgtm. (assume it passes).

@jorisvandenbossche jorisvandenbossche merged commit 28edd06 into pandas-dev:master May 1, 2018
@jorisvandenbossche
Copy link
Member

Thanks Tom!

@douglasmacdonald
Copy link

Hello,

pd.__version__ 0.24.2

Is this problem related to this work?

Very slow

%prun df.loc['2018-11-20']


1367 function calls (1353 primitive calls) in 12.332 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
2   12.330    6.165   12.330    6.165 {method 'get_loc' of 'pandas._libs.index.DatetimeEngine' objects}

Very fast

%prun  df.loc['2018-11-20':'2018-11-20']

     1199 function calls (1192 primitive calls) in 0.001 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    6    0.000    0.000    0.000    0.000 offsets.py:2308(delta)
    4    0.000    0.000    0.000    0.000 {method 'get_loc' of 'pandas._libs.index.DatetimeEngine' objects}
    2    0.000    0.000    0.000    0.000 {pandas._libs.tslibs.parsing.parse_time_string}

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented May 6, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Very slow printing of Series with DatetimeIndex
6 participants