PERF: Faster Series.getattribute #20834

TomAugspurger · 2018-04-26T19:51:13Z

Closes #19764

Closes pandas-dev#19764

TomAugspurger · 2018-04-26T19:52:21Z

Basic idea is to let the index class say whether it can hold valid python identifiers. Classes that can

Index
MultiIndex

I think the rest can't.

TomAugspurger · 2018-04-26T19:53:34Z

Note that it's can contain valid identifiers, not that it does.

I'd be happy to hear alternative names to is_dotable. I was just thinking .<key> accessing pronounced as "dot key".

TomAugspurger · 2018-04-26T19:54:34Z

pandas/core/indexes/numeric.py

@@ -31,6 +31,7 @@ class NumericIndex(Index):

    """
    _is_numeric_dtype = True
+    _is_dotable = False  # Can't contain Python identifiers


note: DatetimeIndex and friends inherit from Int64Index (which inherits from here).

jorisvandenbossche

Yes, this is much cleaner than my idea of a hack !

I would try to use getattr in the name (ideas: needs_getattr_check, can_getattr, ..), but not that important ;-)

jorisvandenbossche · 2018-04-26T20:00:05Z

doc/source/whatsnew/v0.23.0.txt

@@ -958,6 +958,7 @@ Performance Improvements
 - Improved performance of :func:`pandas.core.groupby.GroupBy.any` and :func:`pandas.core.groupby.GroupBy.all` (:issue:`15435`)
 - Improved performance of :func:`pandas.core.groupby.GroupBy.pct_change` (:issue:`19165`)
 - Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)
+- Improved performance of ``Series.__getattribute__`` when the Series has certain index types. This manifiested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`)


It's getattr instead of getattribute I think

TomAugspurger · 2018-04-26T20:12:44Z

Hmm the issue with names like getattr* is that it describes why it might be useful, but not what it actually does... How about _can_hold_identifiers?

pep8speaks · 2018-04-26T20:14:27Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 01, 2018 at 00:10 Hours UTC

codecov · 2018-04-26T21:40:39Z

Codecov Report

Merging #20834 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20834      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49280    49340      +60     
==========================================
+ Hits        45229    45288      +59     
- Misses       4051     4052       +1

Flag	Coverage Δ
#multiple	`90.18% <100%> (ø)`	⬆️
#single	`41.94% <100%> (+0.06%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/generic.py	`95.94% <100%> (ø)`	⬆️
pandas/core/indexes/base.py	`96.64% <100%> (ø)`	⬆️
pandas/util/testing.py	`84.59% <0%> (-0.21%)`	⬇️
pandas/core/indexes/timedeltas.py	`91.15% <0%> (-0.07%)`	⬇️
pandas/core/indexes/datetimes.py	`95.73% <0%> (-0.04%)`	⬇️
pandas/core/arrays/base.py	`83.95% <0%> (ø)`	⬆️
pandas/io/formats/latex.py	`100% <0%> (ø)`	⬆️
pandas/api/extensions/__init__.py	`100% <0%> (ø)`	⬆️
pandas/io/pytables.py	`92.41% <0%> (ø)`	⬆️
pandas/core/internals.py	`95.59% <0%> (+0.01%)`	⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6cacdde...1b598d0. Read the comment docs.

jreback · 2018-04-26T22:17:50Z

pandas/core/generic.py

@@ -4375,7 +4375,8 @@ def __getattr__(self, name):
                name in self._accessors):
            return object.__getattribute__(self, name)
        else:
-            if name in self._info_axis:


this is not the right fix. The issue is that some index types DO allow a non-holding index type to be testing in __contains__, e.g. a string in a DTI. This check can be non-trivial because the index hashtable may need to be created to test if its a valid method.

Instead I would create a _contains private method to do this, where by it must be an exact dtype (and not a possible one, like a string in a DTI).

Note we already have contains method so can't really use that.

e.g. a string in a DTI

These can't be valid identifiers though, so they by definition can't be going through __getatrr__.

We can perform this cheap check at the class level, without having to deal with values at all.

Note that this isn't addressing the other performance problems with DTI, which are left for #17754

as I said, this is not a good solution here. Call a method on the object. In the future folks will have no idea this is called, nor where it is, it add unecessary context and complexity.

In the future folks will have no idea this is called

We have a comment explaining it on the base Index class and a link back to the original issue.

In what case will this cause issues? It's short-circuting a check that will always return False.

Agreed with Joris. This if block is specifically for dot selection. It makes sense to be there.

my point is that you instead should be calling a function like

if self._info_axis._._can_hold(name): return self[name]

putting this here buries all kinds of logic in an obscure routine. no-one will be able to understand / find this later.
These should be the simplest routines possible, while pushing logic to the axis object itself.

Why would it be a function? To outcome doesn't depend on name.

putting this here buries all kinds of logic in an obscure routine

Can you say exactly what logic you think I'm burying here? You have if self._info_axis._can_hold(name). I have if self._info_axis._can_hold_identifiers.

you are checking for a random attribute on an index class. This is so fragile, with no documentation whatsoever. Simply make it a function call, I am not sure why this is that hard.

I am not sure why this is that hard.

Uhm...

jorisvandenbossche · 2018-04-27T08:03:28Z

+1 on _can_hold_identifiers, much better than my suggestions :-)

TomAugspurger · 2018-04-27T11:09:58Z

@jreback could you clarify your proposed solution? I don't see how

NDFrame._info_axis_can_hold_identifiers()

is any clearer than just checking the attribute off the info axis. Why add the extra layer of indirection to just look up a piece of static data?

TomAugspurger · 2018-04-27T11:25:11Z

Oh dang, I apparently misunderstood how getattr works (this is on master).

In [1]: import pandas as pd

In [2]: s = pd.Series(1, index=pd.date_range('2017', periods=10))

In [3]: getattr(s, '2017-01-01')
Out[3]: 1

On this branch that would raise. Ugh....

TomAugspurger · 2018-04-27T11:26:37Z

Obviously doing that is an awful idea (right? Am I missing a valid use case?), so I'm probably OK just breaking it...

jorisvandenbossche · 2018-04-27T11:30:07Z

Yeah, I am totally fine with breaking that

WillAyd · 2018-04-27T21:48:47Z

Might have missed some conversations and I realize this is wrapped up in #19764 (comment) but have we considered dropping support for accessing index elements of a Series by name altogether? IMO seems strange to support and fraught with peril. Perhaps a little less dangerous than the previous .ix usage because its more limited in scope but not that far off

TomAugspurger · 2018-04-27T21:51:47Z

Allen Downey had a use case for it here: https://github.com/AllenDowney/ModSimPy/blob/master/code/chap05.ipynb

…

On Fri, Apr 27, 2018 at 4:48 PM, William Ayd ***@***.***> wrote: Might have missed some conversations and I realize this is wrapped up in #19764 (comment) <#19764 (comment)> but have we considered dropping support for accessing index elements of a Series by name altogether? IMO seems strange to support and fraught with peril. Perhaps a little less dangerous than the previous .ix usage because its more limited in scope but not that far off — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#20834 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHInMw8wt16RluF2YLgVgxtVDz9TsPks5ts5JGgaJpZM4TnRnQ> .

WillAyd · 2018-04-27T22:16:26Z

How is that use case any different from doing index.loc['S'] in his examples? I'll admit up front that I'm biased against dot-notation but I find there to be a ton of nuances to supporting this that make it tough to simply convey how and when it actually works:

# Don't start with anything but a letter or underscore
>>> ser = pd.Series(range(3), index=['1S', '1I', '1R'])
>>> ser.1S
SyntaxError: invalid syntax

# Don't use reserved words
>>> ser = pd.Series(range(3), index=['and', 'or', 'for'])
>>> ser.1S
SyntaxError: invalid syntax

# Hope this doesn't ever mangle
ser = pd.Series(range(3), index=['index', 'values', 'unique'])

Non-ASCII names would not be supported in Python2.7 (though I suppose that shouldn't dictate pandas approach at this point). As you probably alluded to in your usage of getattr above too I'm not sure how / if it should be resolving something like the below:

>>> ser = pd.Series(1, index=pd.date_range('2017-01-01', periods=3, freq='s'))
>>> getattr(ser, '2017-01-01')
2017-01-01 00:00:00    1
2017-01-01 00:00:01    1
2017-01-01 00:00:02    1
Freq: S, dtype: int64

Don't mean to hijack this thread and this discussion could probably be moved elsewhere but figured I'd throw it out there as a future deprecation candidate

TomAugspurger · 2018-04-27T22:26:53Z

No worries about hijacking. Allen's case is compelling, because he's using building a higher-level interface on top of pandas. If your data only ever has these three known fields with known names, then doing `.key` seems natural. It'd be good to have broad feedback before deprecating this.

…

On Fri, Apr 27, 2018 at 5:16 PM, William Ayd ***@***.***> wrote: How is that use case any different from doing index.loc['S'] in his examples? I'll admit up front that I'm biased against dot-notation but I find there to be a ton of nuances to supporting this that make it tough to simply convey how and when it actually works: # Don't start with anything but a letter or underscore>>> ser = pd.Series(range(3), index=['1S', '1I', '1R'])>>> ser.1SSyntaxError: invalid syntax # Don't use reserved words>>> ser = pd.Series(range(3), index=['and', 'or', 'for'])>>> ser.1SSyntaxError: invalid syntax # Hope this doesn't ever mangle ser = pd.Series(range(3), index=['index', 'values', 'unique']) Non-ASCII names would not be supported in Python2.7 (though I suppose that shouldn't dictate pandas approach at this point). As you probably alluded to in your usage of getattr above too I'm not sure how / if it should be resolving something like the below: >>> ser = pd.Series(1, index=pd.date_range('2017-01-01', periods=3, freq='s'))>>> getattr(ser, '2017-01-01')2017-01-01 00:00:00 12017-01-01 00:00:01 12017-01-01 00:00:02 1 Freq: S, dtype: int64 Don't mean to hijack this thread and this discussion could probably be moved elsewhere but figured I'd throw it out there as a future deprecation candidate — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#20834 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHInPrEdvzXNi6ldioVviuCjlJ9mjDks5ts5jBgaJpZM4TnRnQ> .

jreback · 2018-04-27T23:04:40Z

how is that case compelling? why is this not using indexing under the hood (eg .loc[‘beta’]), rather relying on a attribute access which has these pitfalls?

the UX could certainly provide direct attribute access but that is not a pandas issue / problem

jorisvandenbossche · 2018-04-28T09:36:42Z

Use case where I, somewhat unknowingly (as you don't directly interact with the "Series" object in this case), used this in the past is with apply on the rows:

df.apply(lambda row: Point(row.x, row.y), axis=1)

For sure, the above can also be written with [] (or actually more performantly by not converting each row into a Series).
But, I think we are for sure going to break a lot of code, for what benefit? Accessing columns as attributes also has its known pitfalls.

Anyhow, let's discuss that somewhere else if we want to continue this discussion.

For the actual PR, any objections for merging this?

I think the main question we need to answer if we are OK with breaking getattr(s, 'invalid_identifier') (so something that would never work as s.invalid_identifier) ?
I am fine with that.

jreback

needs an asv (e.g. the printing case is prob ok)

jreback · 2018-04-28T13:51:07Z

pandas/core/generic.py

@@ -4375,7 +4375,8 @@ def __getattr__(self, name):
                name in self._accessors):
            return object.__getattribute__(self, name)
        else:
-            if name in self._info_axis:


my point is that you instead should be calling a function like

if self._info_axis._._can_hold(name): return self[name]

putting this here buries all kinds of logic in an obscure routine. no-one will be able to understand / find this later.
These should be the simplest routines possible, while pushing logic to the axis object itself.

TomAugspurger · 2018-04-28T16:48:09Z

Just calling repr(s) doesn't hit the slow code.

In [6]: %timeit self.time_series_datetimeindex_repr()
1.21 s ± 57.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vs.

In [8]: %timeit self.time_series_datetimeindex_repr()
2.08 µs ± 45.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

TomAugspurger · 2018-04-30T12:41:44Z

Laying out the motivation for e0710f3:

There's no logic in Series.__getattr__. All of the logic about whether name could possibly be in Series.index, so that Series.foo could succeed, is on the index class, where it belongs
I used a property so that this is easier to document, and navigate between child and parent classes than a class attribute
I overrode _can_hold_identifiers on child classes, since having the parent make that decision for the child seems improper.
I used a property rather than a function, since the result doesn't depend on name. We're trying to determine whether series.foo could possibly be in series.index, not whether it is in Series.index.

TomAugspurger · 2018-04-30T17:54:06Z

@jreback any thoughts on #20834 (comment)? Does the reason for not using a function make sense?

jreback · 2018-04-30T18:05:40Z

let me look

TomAugspurger · 2018-04-30T18:15:44Z

No rush. Won't merge without your OK.

…

On Mon, Apr 30, 2018 at 1:05 PM, Jeff Reback ***@***.***> wrote: let me look — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#20834 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIkDZ4bgdPNIVYjdBV_sZg9S0Etnyks5tt1J5gaJpZM4TnRnQ> .

jreback · 2018-04-30T23:49:26Z

. used a property rather than a function, since the result doesn't depend on name. We're trying to determine whether series.foo could possibly be in series.index, not whether it is in Series.index.

this is simply not True. you have a comparison

if (self._info_axis._can_hold_identifiers and
+                    name in self._info_axis):

you are checking a property on the index and then do a __contains__ check on the same index.

TomAugspurger · 2018-04-30T23:58:13Z

Right, I'm talking about the _can_hold_identifiers bit. name in self._info_axis is working properly.

You would replace that with something like if self._info_axis._can_hold_identifiers_and_holds_name?

def _can_hold_identifiers_and_holds_name(self, name):
     # dispatch on the type of index we are
     # no real need to subclass as this check can be confined to 
     # a small number of indexes
     if self.is_object() or self.is_categorical():
        return name in self
     return False

Why do two things in the function?

TomAugspurger · 2018-04-30T23:58:59Z

But at this point, I really, really am over this change. I'm just going to implement that and be done with it :)

jreback · 2018-05-01T00:18:55Z

pandas/core/generic.py

@@ -4375,8 +4375,7 @@ def __getattr__(self, name):
                name in self._accessors):
            return object.__getattribute__(self, name)
        else:
-            if (self._info_axis._can_hold_identifiers and
-                    name in self._info_axis):
+            if self._info_axis._can_hold_identifiers_and_holds_name(name):


much nicer! thanks!

jreback · 2018-05-01T00:19:58Z

lgtm. (assume it passes).

jorisvandenbossche · 2018-05-01T06:45:29Z

Thanks Tom!

douglasmacdonald · 2019-05-06T16:21:43Z

Hello,

pd.__version__ 0.24.2

Is this problem related to this work?

Very slow

%prun df.loc['2018-11-20']


1367 function calls (1353 primitive calls) in 12.332 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
2   12.330    6.165   12.330    6.165 {method 'get_loc' of 'pandas._libs.index.DatetimeEngine' objects}

Very fast

%prun  df.loc['2018-11-20':'2018-11-20']

     1199 function calls (1192 primitive calls) in 0.001 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    6    0.000    0.000    0.000    0.000 offsets.py:2308(delta)
    4    0.000    0.000    0.000    0.000 {method 'get_loc' of 'pandas._libs.index.DatetimeEngine' objects}
    2    0.000    0.000    0.000    0.000 {pandas._libs.tslibs.parsing.parse_time_string}

TomAugspurger · 2019-05-06T19:37:36Z

That's a different issue. This was about attribute access, not indexing.

…

On Mon, May 6, 2019 at 11:21 AM douglasmacdonald ***@***.***> wrote: Hello, pd.__version__ 0.24.2 Is this problem related to this work? Very slow %prun df.loc['2018-11-20'] 1367 function calls (1353 primitive calls) in 12.332 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 2 12.330 6.165 12.330 6.165 {method 'get_loc' of 'pandas._libs.index.DatetimeEngine' objects} Very fast %prun df.loc['2018-11-20':'2018-11-20'] 1199 function calls (1192 primitive calls) in 0.001 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 6 0.000 0.000 0.000 0.000 offsets.py:2308(delta) 4 0.000 0.000 0.000 0.000 {method 'get_loc' of 'pandas._libs.index.DatetimeEngine' objects} 2 0.000 0.000 0.000 0.000 {pandas._libs.tslibs.parsing.parse_time_string} — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#20834 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAKAOIVQJAYUZXIJF6TBNLLPUBLJ5ANCNFSM4E45DHIA> .

PERF: Faster Series.__getattribute__

ad6a202

Closes pandas-dev#19764

TomAugspurger mentioned this pull request Apr 26, 2018

PERF: Very slow printing of Series with DatetimeIndex #19764

Closed

TomAugspurger commented Apr 26, 2018

View reviewed changes

jorisvandenbossche approved these changes Apr 26, 2018

View reviewed changes

Renamed

55072a0

TomAugspurger added 2 commits April 26, 2018 15:16

Linting

ad871ce

Fixed find-replace fail

4b7334c

jreback requested changes Apr 26, 2018

View reviewed changes

jorisvandenbossche added Datetime Datetime data dtype Performance Memory or execution speed performance labels Apr 27, 2018

jorisvandenbossche added this to the 0.23.0 milestone Apr 27, 2018

jreback requested changes Apr 28, 2018

View reviewed changes

asv

857364f

Move to instance properties

e0710f3

TomAugspurger mentioned this pull request Apr 30, 2018

RLS: 0.23.0 #20531

Closed

71 tasks

function

1b598d0

jreback reviewed May 1, 2018

View reviewed changes

jreback approved these changes May 1, 2018

View reviewed changes

jorisvandenbossche merged commit 28edd06 into pandas-dev:master May 1, 2018

TomAugspurger deleted the datetimeindex-perf branch May 1, 2018 11:18

This was referenced Jun 13, 2018

Jupyter Notebook displays Series VERY slowly #18137

Closed

printing out pandas series takes ages jupyter/notebook#3082

Closed

TomAugspurger mentioned this pull request Apr 7, 2019

PERF: tab completion with a large index #18587

Open

PERF: Faster Series.__getattribute__ #20834

PERF: Faster Series.__getattribute__ #20834

Conversation

TomAugspurger commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018

TomAugspurger commented Apr 26, 2018

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Apr 26, 2018 • edited Loading

pep8speaks commented Apr 26, 2018 • edited Loading

Comment last updated on May 01, 2018 at 00:10 Hours UTC

codecov bot commented Apr 26, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Apr 27, 2018

TomAugspurger commented Apr 27, 2018

TomAugspurger commented Apr 27, 2018

TomAugspurger commented Apr 27, 2018

jorisvandenbossche commented Apr 27, 2018

WillAyd commented Apr 27, 2018

TomAugspurger commented Apr 27, 2018 via email

WillAyd commented Apr 27, 2018

TomAugspurger commented Apr 27, 2018 via email

jreback commented Apr 27, 2018

jorisvandenbossche commented Apr 28, 2018

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Apr 28, 2018

TomAugspurger commented Apr 30, 2018

TomAugspurger commented Apr 30, 2018

jreback commented Apr 30, 2018

TomAugspurger commented Apr 30, 2018 via email

jreback commented Apr 30, 2018

TomAugspurger commented Apr 30, 2018

TomAugspurger commented Apr 30, 2018 • edited Loading

Choose a reason for hiding this comment

jreback commented May 1, 2018

jorisvandenbossche commented May 1, 2018

douglasmacdonald commented May 6, 2019

TomAugspurger commented May 6, 2019 via email

PERF: Faster Series.getattribute #20834

PERF: Faster Series.getattribute #20834

TomAugspurger commented Apr 26, 2018 •

edited

Loading

pep8speaks commented Apr 26, 2018 •

edited

Loading

codecov bot commented Apr 26, 2018 •

edited

Loading

TomAugspurger commented Apr 30, 2018 •

edited

Loading