PERF: Faster Series.__getattribute__ #20834
Basic idea is to let the index class say whether it can hold valid python identifiers. Classes that can
I think the rest can't. |
Note that the flag means the index can contain valid identifiers, not that it does. I'd be happy to hear alternative names to
pandas/core/indexes/numeric.py
Outdated
@@ -31,6 +31,7 @@ class NumericIndex(Index):

     """
     _is_numeric_dtype = True
+    _is_dotable = False  # Can't contain Python identifiers
note: DatetimeIndex and friends inherit from Int64Index (which inherits from here).
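A minimal sketch of why the inheritance note matters, using toy stand-in classes rather than the real pandas ones (the `Toy*` names are hypothetical; only `_is_dotable` comes from the diff above). A class-level flag set on the numeric base class is picked up by every subclass through normal attribute lookup, so datetime-like indexes would get `False` without their own override:

```python
# Toy stand-ins for the pandas classes; only the inheritance chain matters here.
class ToyIndex:
    _is_dotable = True  # default: labels might be valid Python identifiers


class ToyNumericIndex(ToyIndex):
    _is_dotable = False  # numeric labels can never be identifiers


class ToyInt64Index(ToyNumericIndex):
    pass


class ToyDatetimeIndex(ToyInt64Index):
    pass


# The flag is resolved through the MRO, so no per-class override is needed.
print(ToyDatetimeIndex._is_dotable)  # False
print(ToyIndex._is_dotable)          # True
```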
Yes, this is much cleaner than my idea of a hack !
I would try to use getattr in the name (ideas: needs_getattr_check, can_getattr, ..), but not that important ;-)
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -958,6 +958,7 @@ Performance Improvements
 - Improved performance of :func:`pandas.core.groupby.GroupBy.any` and :func:`pandas.core.groupby.GroupBy.all` (:issue:`15435`)
 - Improved performance of :func:`pandas.core.groupby.GroupBy.pct_change` (:issue:`19165`)
 - Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)
+- Improved performance of ``Series.__getattribute__`` when the Series has certain index types. This manifested in slow printing of large Series with a ``DatetimeIndex`` (:issue:`19764`)
It's getattr instead of getattribute I think
Hmm the issue with names like |
Hello @TomAugspurger! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 01, 2018 at 00:10 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #20834 +/- ##
==========================================
+ Coverage 91.77% 91.78% +<.01%
==========================================
Files 153 153
Lines 49280 49340 +60
==========================================
+ Hits 45229 45288 +59
- Misses 4051 4052 +1
Continue to review full report at Codecov.
@@ -4375,7 +4375,8 @@ def __getattr__(self, name):
                 name in self._accessors):
             return object.__getattribute__(self, name)
         else:
             if name in self._info_axis:
this is not the right fix. The issue is that some index types DO allow a non-holding index type to be tested in __contains__, e.g. a string in a DTI. This check can be non-trivial because the index hashtable may need to be created to test if it's a valid method.
Instead I would create a _contains private method to do this, whereby it must be an exact dtype (and not a possible one, like a string in a DTI).
Note we already have a contains method so can't really use that.
e.g. a string in a DTI
These can't be valid identifiers though, so they by definition can't be going through __getattr__.
We can perform this cheap check at the class level, without having to deal with values at all.
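To make the "cheap check at the class level" concrete, here is a rough sketch with hypothetical toy classes (`CountingIndex`, `CountingDatetimeLikeIndex`, and `ToySeries` are invented for illustration and are not pandas code). The counter stands in for the potentially expensive membership test; when the class-level flag is `False`, `__getattr__` never reaches it:

```python
class CountingIndex:
    """Hypothetical index that records how often membership is tested."""

    _can_hold_identifiers = True

    def __init__(self, labels):
        self._labels = list(labels)
        self.contains_calls = 0

    def __contains__(self, key):
        self.contains_calls += 1  # stands in for building a hash table
        return key in self._labels


class CountingDatetimeLikeIndex(CountingIndex):
    _can_hold_identifiers = False  # labels are timestamps, never identifiers


class ToySeries:
    def __init__(self, values, index):
        # Write through __dict__ so __getattr__ is never triggered here.
        self.__dict__["_values"] = dict(zip(index._labels, values))
        self.__dict__["_info_axis"] = index

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        axis = self._info_axis
        if axis._can_hold_identifiers and name in axis:
            return self._values[name]
        raise AttributeError(name)


s = ToySeries([1, 2], CountingDatetimeLikeIndex(["2017-01-01", "2017-01-02"]))
print(hasattr(s, "some_missing_attribute"))  # False
print(s._info_axis.contains_calls)           # 0: membership test was skipped

t = ToySeries([1, 2], CountingIndex(["alpha", "beta"]))
print(t.alpha)                               # 1
print(t._info_axis.contains_calls)           # 1
```

Because `and` short-circuits, the flag lookup costs one attribute access per failed lookup regardless of how large the index is.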
Note that this isn't addressing the other performance problems with DTI, which are left for #17754
as I said, this is not a good solution here. Call a method on the object. In the future folks will have no idea this is called, nor where it is; it adds unnecessary context and complexity.
In the future folks will have no idea this is called
We have a comment explaining it on the base Index class and a link back to the original issue.
In what case will this cause issues? It's short-circuiting a check that will always return False.
Agreed with Joris. This if block is specifically for dot selection. It makes sense to be there.
my point is that you instead should be calling a function like
if self._info_axis._can_hold(name):
    return self[name]
putting this here buries all kinds of logic in an obscure routine. no-one will be able to understand / find this later.
These should be the simplest routines possible, while pushing logic to the axis object itself.
Why would it be a function? The outcome doesn't depend on name.
putting this here buries all kinds of logic in an obscure routine
Can you say exactly what logic you think I'm burying here? You have if self._info_axis._can_hold(name). I have if self._info_axis._can_hold_identifiers.
you are checking for a random attribute on an index class. This is so fragile, with no documentation whatsoever. Simply make it a function call, I am not sure why this is that hard.
I am not sure why this is that hard.
Uhm...
+1 on |
@jreback could you clarify your proposed solution? I don't see how NDFrame._info_axis_can_hold_identifiers() is any clearer than just checking the attribute off the info axis. Why add the extra layer of indirection to just look up a piece of static data? |
Oh dang, I apparently misunderstood how getattr works (this is on master).
In [1]: import pandas as pd
In [2]: s = pd.Series(1, index=pd.date_range('2017', periods=10))
In [3]: getattr(s, '2017-01-01')
Out[3]: 1
On this branch that would raise. Ugh....
Obviously doing that is an awful idea (right? Am I missing a valid use case?), so I'm probably OK just breaking it... |
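For readers unsure why the `getattr` call above reaches `__getattr__` at all: the built-in `getattr` accepts any string, not only valid identifiers, so a name that could never be written after a dot can still arrive in `__getattr__`. A small pure-Python sketch (the `Labelled` class is hypothetical and only imitates label-as-attribute access):

```python
class Labelled:
    """Hypothetical container that exposes dict keys as attributes."""

    def __init__(self, data):
        # Write through __dict__ so __getattr__ is not triggered here.
        self.__dict__["_data"] = dict(data)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


obj = Labelled({"2017-01-01": 1, "beta": 2})

print("2017-01-01".isidentifier())  # False: cannot be written as obj.2017-01-01
print(getattr(obj, "2017-01-01"))   # 1: getattr() does not require an identifier
print(obj.beta)                     # 2
```

This is the behaviour that a class-level "cannot hold identifiers" shortcut would stop supporting.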
Yeah, I am totally fine with breaking that |
Might have missed some conversations and I realize this is wrapped up in #19764 (comment) but have we considered dropping support for accessing index elements of a Series by name altogether? IMO seems strange to support and fraught with peril. Perhaps a little less dangerous than the previous .ix usage because it's more limited in scope but not that far off
Allen Downey had a use case for it here:
https://github.com/AllenDowney/ModSimPy/blob/master/code/chap05.ipynb
…On Fri, Apr 27, 2018 at 4:48 PM, William Ayd ***@***.***> wrote:
Might have missed some conversations and I realize this is wrapped up in #19764
(comment)
<#19764 (comment)>
but have we considered dropping support for accessing index elements of a
Series by name altogether? IMO seems strange to support and fraught with
peril. Perhaps a little less dangerous than the previous .ix usage
because its more limited in scope but not that far off
How is that use case any different from doing index.loc['S'] in his examples? I'll admit up front that I'm biased against dot-notation but I find there to be a ton of nuances to supporting this that make it tough to simply convey how and when it actually works:
# Don't start with anything but a letter or underscore
>>> ser = pd.Series(range(3), index=['1S', '1I', '1R'])
>>> ser.1S
SyntaxError: invalid syntax
# Don't use reserved words
>>> ser = pd.Series(range(3), index=['and', 'or', 'for'])
>>> ser.and
SyntaxError: invalid syntax
# Hope this doesn't ever mangle
ser = pd.Series(range(3), index=['index', 'values', 'unique'])
Non-ASCII names would not be supported in Python2.7 (though I suppose that shouldn't dictate pandas approach at this point). As you probably alluded to in your usage of getattr above too I'm not sure how / if it should be resolving something like the below:
>>> ser = pd.Series(1, index=pd.date_range('2017-01-01', periods=3, freq='s'))
>>> getattr(ser, '2017-01-01')
2017-01-01 00:00:00    1
2017-01-01 00:00:01    1
2017-01-01 00:00:02    1
Freq: S, dtype: int64
Don't mean to hijack this thread and this discussion could probably be moved elsewhere but figured I'd throw it out there as a future deprecation candidate
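The pitfalls listed in this comment can be summarised with a small standard-library check. This is only a sketch: `dot_accessible` is a hypothetical helper, and the `existing` set is an illustrative subset of attribute names assumed to already exist on the object, not an authoritative list:

```python
import keyword


def dot_accessible(label, reserved_attrs=()):
    """Return True if `obj.<label>` could plausibly reach an index label."""
    return (
        isinstance(label, str)
        and label.isidentifier()            # rules out '1S', '2017-01-01', ...
        and not keyword.iskeyword(label)    # rules out 'and', 'or', 'for', ...
        and label not in reserved_attrs     # rules out shadowed attribute names
    )


# Attribute names assumed to already exist on the object (illustrative subset).
existing = {"index", "values", "unique"}

for label in ["1S", "and", "index", "S"]:
    print(label, dot_accessible(label, existing))
# 1S False
# and False
# index False
# S True
```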
No worries about hijacking.
Allen's case is compelling, because he's building a higher-level
interface on top of pandas. If your data only ever has these three known
fields with known names, then doing `.key` seems natural.
It'd be good to have broad feedback before deprecating this.
…On Fri, Apr 27, 2018 at 5:16 PM, William Ayd ***@***.***> wrote:
How is that use case any different from doing index.loc['S'] in his
examples? I'll admit up front that I'm biased against dot-notation but I
find there to be a ton of nuances to supporting this that make it tough to
simply convey how and when it actually works:
# Don't start with anything but a letter or underscore
>>> ser = pd.Series(range(3), index=['1S', '1I', '1R'])
>>> ser.1S
SyntaxError: invalid syntax
# Don't use reserved words
>>> ser = pd.Series(range(3), index=['and', 'or', 'for'])
>>> ser.and
SyntaxError: invalid syntax
# Hope this doesn't ever mangle
ser = pd.Series(range(3), index=['index', 'values', 'unique'])
Non-ASCII names would not be supported in Python2.7 (though I suppose that
shouldn't dictate pandas approach at this point). As you probably alluded
to in your usage of getattr above too I'm not sure how / if it should be
resolving something like the below:
>>> ser = pd.Series(1, index=pd.date_range('2017-01-01', periods=3, freq='s'))
>>> getattr(ser, '2017-01-01')
2017-01-01 00:00:00    1
2017-01-01 00:00:01    1
2017-01-01 00:00:02    1
Freq: S, dtype: int64
Don't mean to hijack this thread and this discussion could probably be
moved elsewhere but figured I'd throw it out there as a future deprecation
candidate
how is that case compelling? why is this not using indexing under the hood (e.g. .loc['beta']), rather than relying on attribute access, which has these pitfalls? the UX could certainly provide direct attribute access but that is not a pandas issue / problem
Use case where I, somewhat unknowingly (as you don't directly interact with the "Series" object in this case), used this in the past is with
For sure, the above can also be written with Anyhow, let's discuss that somewhere else if we want to continue this discussion. For the actual PR, any objections to merging this? I think the main question we need to answer is whether we are OK with breaking
needs an asv (e.g. the printing case is prob ok)
Just calling repr(s) doesn't hit the slow code.
In [6]: %timeit self.time_series_datetimeindex_repr()
1.21 s ± 57.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vs.
|
Laying out the motivation for e0710f3:
|
@jreback any thoughts on #20834 (comment)? Does the reason for not using a function make sense? |
let me look |
No rush. Won't merge without your OK.
…On Mon, Apr 30, 2018 at 1:05 PM, Jeff Reback ***@***.***> wrote:
let me look
this is simply not True. you have a comparison
you are checking a property on the index and then do a |
Right, I'm talking about the You would replace that with something like
Why do two things in the function? |
But at this point, I really, really am over this change. I'm just going to implement that and be done with it :) |
pandas/core/generic.py
Outdated
@@ -4375,8 +4375,7 @@ def __getattr__(self, name):
                 name in self._accessors):
             return object.__getattribute__(self, name)
         else:
-            if (self._info_axis._can_hold_identifiers and
-                    name in self._info_axis):
+            if self._info_axis._can_hold_identifiers_and_holds_name(name):
much nicer! thanks!
lgtm. (assume it passes). |
Thanks Tom! |
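For reference, the method form the review converged on might look roughly like the sketch below. The method name `_can_hold_identifiers_and_holds_name` and the flag `_can_hold_identifiers` appear in the diff; the `SketchIndex` classes and the method body are assumptions for illustration, not the merged pandas implementation:

```python
class SketchIndex:
    """Hypothetical index; only the method name is taken from the diff."""

    _can_hold_identifiers = True

    def __init__(self, labels):
        self._labels = set(labels)

    def __contains__(self, key):
        return key in self._labels

    def _can_hold_identifiers_and_holds_name(self, name):
        # Cheap class-level check first, then the membership test.
        if self._can_hold_identifiers:
            return name in self
        return False


class SketchDatetimeIndex(SketchIndex):
    _can_hold_identifiers = False


print(SketchIndex(["alpha"])._can_hold_identifiers_and_holds_name("alpha"))  # True
print(SketchIndex(["alpha"])._can_hold_identifiers_and_holds_name("beta"))   # False
print(
    SketchDatetimeIndex(["2017-01-01"])._can_hold_identifiers_and_holds_name(
        "2017-01-01"
    )
)  # False: the flag short-circuits even though the label is present
```

The caller then needs only one call on the axis object, which is the shape requested in the review, while the flag still avoids the membership test for index types that cannot hold identifiers.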
Hello,
Is this problem related to this work? Very slow
Very fast
|
That's a different issue. This was about attribute access, not indexing.
…On Mon, May 6, 2019 at 11:21 AM douglasmacdonald ***@***.***> wrote:
Hello,
pd.__version__ 0.24.2
Is this problem related to this work?
Very slow
%prun df.loc['2018-11-20']
1367 function calls (1353 primitive calls) in 12.332 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2 12.330 6.165 12.330 6.165 {method 'get_loc' of 'pandas._libs.index.DatetimeEngine' objects}
Very fast
%prun df.loc['2018-11-20':'2018-11-20']
1199 function calls (1192 primitive calls) in 0.001 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
6 0.000 0.000 0.000 0.000 offsets.py:2308(delta)
4 0.000 0.000 0.000 0.000 {method 'get_loc' of 'pandas._libs.index.DatetimeEngine' objects}
2 0.000 0.000 0.000 0.000 {pandas._libs.tslibs.parsing.parse_time_string}
Closes #19764