DEPR: Deprecate convert parameter in take #17352
9e0bcad
to
4471360
Compare
Codecov Report
@@ Coverage Diff @@
## master #17352 +/- ##
==========================================
- Coverage 91.03% 91.02% -0.02%
==========================================
Files 162 162
Lines 49567 49577 +10
==========================================
+ Hits 45125 45126 +1
- Misses 4442 4451 +9
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #17352 +/- ##
==========================================
- Coverage 91.27% 91.25% -0.02%
==========================================
Files 163 163
Lines 49766 49785 +19
==========================================
+ Hits 45422 45433 +11
- Misses 4344 4352 +8
Continue to review full report at Codecov.
4471360
to
3a1de27
Compare
pandas/core/groupby.py
Outdated
-        return self.data.take(self.sort_idx, axis=self.axis, convert=False)
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            return self.data.take(self.sort_idx, axis=self.axis, convert=False)
@jreback @jorisvandenbossche : This is a tricky one for me here. On the one hand, I want to remove convert=False, but because axis of the groupby object might be invalid when calling Series.take, this is not possible for me to do (when True, it re-indexes the value according to the length of the slice along the index).
On the other hand, this catch_warnings(), while an easy way to get things to pass for now, is not a solution I want to merge. Any suggestions on how to handle this problem? (Or is it best that we just expose convert for all relevant implementations? I'm not a fan of that.)
Actually, might have found the answer myself 😄
78f0c4e
to
bfa4bf6
Compare
@@ -3354,7 +3354,7 @@ def dropna(self, axis=0, how='any', thresh=None, subset=None,
        else:
            raise TypeError('must specify how or thresh')

        result = self.take(mask.nonzero()[0], axis=axis, convert=False)
I don't think these internal uses should be removed. We're guaranteed the indexer is inbounds, so it's an unnecessary performance hit to check it.
What do you mean by this? DataFrame.take doesn't even respect the convert parameter. Thus, this will have no effect on anything. 😄
Ok, this example doesn't matter b/c it's on a frame, but does with the usages of _data.take
And Series.take
I don't think there's a bounds check anymore in Series.take or _data.take because I removed all of them for that same reason. We have so many bounds checks all over the place that removing just one of them is okay now.
pandas/core/series.py
Outdated
        if kwargs:
            nv.validate_take(tuple(), kwargs)

        # check/convert indicies here
        if convert:
            indices = maybe_convert_indices(indices, len(self._get_axis(axis)))
Not sure this is safe to remove - with non-numpy types, we call to an internal take impl that doesn't bounds check. Can you try pd.Series([1,2,3], dtype='category').take([-1, -2]) on your branch?
Okay, so that breaks:
2 NaN
1 1.0
dtype: category
Categories (3, int64): [1, 2, 3]
That being said, that call works without dtype='category', meaning we do handle negative indices correctly in some cases already. I might as well add that handling for category too.
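For readers following along, the two jobs being discussed here (turning negative positions into positive ones, and bounds checking) can be sketched as below. This is only an illustration under stated assumptions: `convert_indices` is a hypothetical helper written for this note, not the pandas `maybe_convert_indices` implementation, and it operates on a plain NumPy array rather than a Categorical.

```python
import numpy as np


def convert_indices(indices, n):
    """Hypothetical helper: normalize negative positions, then bounds-check.

    `n` is the length of the axis being indexed.
    """
    # Copy so the caller's array is never mutated in place.
    indices = np.array(indices, dtype=np.intp)

    # Step (a): map negative positions onto their positive equivalents.
    negative = indices < 0
    indices[negative] += n

    # Step (b): anything still outside [0, n) is out of bounds.
    if ((indices < 0) | (indices >= n)).any():
        raise IndexError("indices are out-of-bounds")
    return indices


values = np.array([1, 2, 3])
# -1 and -2 become positions 2 and 1 before the take.
print(values.take(convert_indices([-1, -2], len(values))))  # [3 2]
```

Skipping both steps is what makes an internal take faster, and it is only safe when the caller already knows the indexer is non-negative and in bounds.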
So I originally added this because I wanted the ability to a) convert neg to positive, and b) bounds check; these are not the default because of perf (IOW internally we often know that we don't need either of these things, but from a public perspective you do need this). These shouldn't have been on the public NDFrame though (so good to remove). So the question is, do we need a |
I'd be in favor of a |
I suppose that in the name of performance, that's fair. I think I would just need to implement this solely for |
Yes, that's correct. |
@chris-b1 : Sorry, somehow your answer only confused me more. Yes or no, do I need to implement |
yes
@chris-b1 : Alright, here goes. Let's see what CI has to say about my implementation. |
pandas/core/frame.py
Outdated
@@ -2033,7 +2033,7 @@ def _ixs(self, i, axis=0):
                 return self.loc[:, lab_slice]
             else:
                 if isinstance(label, Index):
-                    return self.take(i, axis=1, convert=True)
+                    return self.take(i, axis=1)
every reference to an internal take should be to ._take (this is pretty much all of them)
Is this really needed?
Can't we keep take and only use _take where it is needed? IMO it just makes it more complex to understand the code.
Agree with @jorisvandenbossche . I'm not sure I really understand what you mean by "reference to an internal take" since the original references are to a public function.
you have now 2 functions to do the same thing , a public and private version
it is much more clear to simply use _take internally in all cases
the public function is just an interface
otherwise a future reader will not know which to use - recipe for confusion
pls change all uses to _take
There will be confusion anyhow, because then people will wonder what the difference is with public take, and in the majority of the cases, there is no difference. IMO that is more clear by using the public function.
There is AFAIK only one class where negative indexing isn't handled properly, and that is Categorical. Makes me wonder whether we should revisit the option of adding support given this disagreement about design though (@chris-b1 thoughts?)
Let me boil it down to this (in reference to my previous comment): besides potential test failures, is there a reason why we call take_1d in pandas.core.categorical instead of just calling self._codes.take(...)?
pls change all references to _take
.take is ONLY a wrapper function and should not be used anywhere in the code base
this eliminates any possibility of confusion and is in line with the existing style
mixing usages is cause for trouble
your other references are simply not relevant here and are again purely internal uses (take_1d)
if u want to bring them up in another issue by all means
let's keep the scope to changing refs to the NDFrame .take changes
If there were multiple places where I had to use ._take(), then I think I would be more inclined to agree with you @jreback . However, at this point, there is only one place where this had to happen, and if necessary, I can easily "clear up" any confusion by adding a comment.
IMO one usage isn't convincing enough for me to convert all of the other invocations (both in NDFrame and by any other pandas class) to use ._take(). I think I still have to side with @jorisvandenbossche on this one.
@gfyoung if this were a new public function, and you wrote a private impl _take as well, then you certainly would use the private impl everywhere. I don't see a difference here. Pls make the change.
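The pattern under debate in this thread (a public `take` that is only an interface, with the real work in a private `_take` that keeps the `convert` switch) can be sketched roughly as follows. This is a toy illustration, not the code merged in this PR: the `Frame` class, the warning text, and the `None` sentinel default are assumptions made for the example.

```python
import warnings


class Frame:
    """Toy container used only for illustration; not a pandas class."""

    def __init__(self, values):
        self.values = list(values)

    def _take(self, indices, convert=True):
        # Internal version: callers that already know their indexer is
        # valid can pass convert=False and skip the checks below.
        n = len(self.values)
        if convert:
            indices = [i + n if i < 0 else i for i in indices]
            if any(i < 0 or i >= n for i in indices):
                raise IndexError("indices are out-of-bounds")
        return Frame(self.values[i] for i in indices)

    def take(self, indices, convert=None):
        # Public version: warn if the deprecated argument is passed at all,
        # then delegate to the private implementation.
        if convert is not None:
            warnings.warn(
                "The 'convert' parameter is deprecated and will be "
                "removed in a future version.",
                FutureWarning,
                stacklevel=2,
            )
        else:
            convert = True
        return self._take(indices, convert=convert)


print(Frame([1, 2, 3]).take([-1, 0]).values)  # [3, 1]
```

With this layout, internal call sites that use `_take` never trigger the deprecation warning, which is one practical argument for routing internal uses through the private method.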
pandas/core/generic.py
Outdated
@@ -2058,6 +2058,25 @@ def __delitem__(self, key):
            except KeyError:
                pass

    def _take(self, indices, axis=0, convert=True, is_copy=False):
        """
        The internal version of `self.take` that will house the `convert`
more descriptive doc-string with parameters and such
Yep, done.
pandas/core/generic.py
Outdated
            result._set_is_copy(self)

        return result

    def take(self, indices, axis=0, convert=True, is_copy=True, **kwargs):
also deprecate is_copy; that is another internal parameter; it's only necessary for _take
That's a fair point. If you don't mind, I'll do this in a subsequent PR once the infrastructure for _take has been merged in.
See also
--------
numpy.ndarray.take
why do we have this? now that it is generic it shouldn't be needed
Have what exactly? You commented on a not-so-clear part of the diff.
pandas/core/sparse/series.py
Outdated
@@ -604,14 +604,38 @@ def sparse_reindex(self, new_index):

    def take(self, indices, axis=0, convert=True, *args, **kwargs):
        """
        Sparse-compatible version of ndarray.take
use a shared doc string for all public .take()
Done.
pandas/core/generic.py
Outdated
        self._consolidate_inplace()
        new_data = self._data.take(indices,
                                   axis=self._get_block_manager_axis(axis),
pass through convert
Done.
e0b209b
to
35812d0
Compare
so this still needs to change virtually all uses of |
47e1879
to
f30634f
Compare
        if convert:
            indices = maybe_convert_indices(indices, len(self._get_axis(axis)))

        indices = _ensure_platform_int(indices)
        new_index = self.index.take(indices)
        new_values = self._values.take(indices)
        return (self._constructor(new_values, index=new_index, fastpath=True)
                .__finalize__(self))
any reason you can't just call the super method here?
For ._constructor? I was trying to preserve existing (working) code as much as possible when making these changes.
no, the entire routine
give it a try
https://travis-ci.org/pandas-dev/pandas/builds/279118227
I gave it a try, and there were a bunch of failures.
ok, not really sure why it shouldn't work. it's almost the same code
f30634f
to
e9c9bd9
Compare
https://travis-ci.org/pandas-dev/pandas/jobs/279023089 @jreback : Not sure why doing this conversion to |
e9c9bd9
to
41504af
Compare
889fb83
to
ec299bc
Compare
xref pandas-devgh-16948. The parameter is not respected, nor is it a parameter in many 'take' implementations.
ec299bc
to
9325f21
Compare
@jreback : I've done all of the converting (without breaking tests). PTAL. |
thanks @gfyoung ! |
* 'master' of github.com:pandas-dev/pandas: (188 commits) Separate out _convert_datetime_to_tsobject (pandas-dev#17715) DOC: remove whatsnew note for xref pandas-dev#17131 BUG: Regression in .loc accepting a boolean Index as an indexer (pandas-dev#17738) DEPR: Deprecate cdate_range and merge into bdate_range (pandas-dev#17691) CLN: replace %s syntax with .format in pandas.core: categorical, common, config, config_init (pandas-dev#17735) Fixed the memory usage explanation of categorical in gotchas from O(nm) to O(n+m) (pandas-dev#17736) TST: add backward compat for offset testing for pickles (pandas-dev#17733) remove unused time conversion funcs (pandas-dev#17711) DEPR: Deprecate convert parameter in take (pandas-dev#17352) BUG:Time Grouper bug fix when applied for list groupers (pandas-dev#17587) BUG: Fix some PeriodIndex resampling issues (pandas-dev#16153) BUG: Fix unexpected sort in groupby (pandas-dev#17621) DOC: Fixed typo in documentation for 'pandas.DataFrame.replace' (pandas-dev#17731) BUG: Fix series rename called with str altering name rather index (GH17407) (pandas-dev#17654) DOC: Add examples for MultiIndex.get_locs + cleanups (pandas-dev#17675) Doc improvements for IntervalIndex and Interval (pandas-dev#17714) BUG: DataFrame sort_values and multiple "by" columns fails to order NaT correctly Last of the timezones funcs (pandas-dev#17669) Add missing file to _pyxfiles, delete commented-out (pandas-dev#17712) update imports of DateParseError, remove unused imports from tslib (pandas-dev#17713) ...
coming up on the 3.5 build of master seems we r not suppressing somewhere (or prob needs to change to use _take maybe) can u take care of @gfyoung |
@jreback : The cause of the warning is that an older version of Thus, this is a compatibility issue, as it disappears in later versions of I'll put a check in the test to expect that warning for older versions of |
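A test that expects the warning only under some conditions, as proposed above, could look roughly like this. It is a sketch under assumptions: `legacy_call` is a hypothetical stand-in for the third-party code path that still passes the deprecated argument, and the boolean flag stands in for whatever version check the real test would use.

```python
import warnings


def legacy_call(uses_deprecated_arg):
    """Hypothetical stand-in for code that may still pass `convert`."""
    if uses_deprecated_arg:
        warnings.warn("the 'convert' parameter is deprecated", FutureWarning)
    return "ok"


def check_call(expect_warning):
    # Record every warning raised inside the block instead of printing it.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = legacy_call(uses_deprecated_arg=expect_warning)

    future = [w for w in caught if issubclass(w.category, FutureWarning)]
    # The warning must appear exactly when the old code path is in use.
    assert bool(future) == expect_warning
    return result


check_call(expect_warning=True)   # older dependency: warning expected
check_call(expect_warning=False)  # newer dependency: no warning expected
```

Asserting in both directions keeps the test from silently passing once the older dependency version is no longer part of the build matrix.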
xref #16948.
The parameter is not respected, nor is it a parameter in many 'take' implementations.
cc @jorisvandenbossche