DEPR/API: disallow lists within list for set_index #24697
Codecov Report
@@ Coverage Diff @@
## master #24697 +/- ##
===========================================
- Coverage 92.38% 43.06% -49.32%
===========================================
Files 166 166
Lines 52310 52309 -1
===========================================
- Hits 48326 22527 -25799
- Misses 3984 29782 +25798
|
Codecov Report
@@ Coverage Diff @@
## master #24697 +/- ##
==========================================
+ Coverage 91.75% 91.75% +<.01%
==========================================
Files 173 173
Lines 52960 52966 +6
==========================================
+ Hits 48595 48601 +6
Misses 4365 4365
|
I don't think we silently deprecated anything. There was a DOC PR (#22775) updating the docstring, and I think it was rather an oversight by a contributor who didn't know that behaviour, which was not caught during review. And that is a good catch! So I think the documentation change in this PR is certainly welcome, and the code changes require some discussion. Would you want to do the docstring changes in a separate PR that can be merged more quickly? |
@h-vetinari and to give you some expectations: I think most of us are rather swamped with trying to get the rc out, so I am not sure we will have time to finalize the set_index related discussions before the RC |
There was no accusation; I used "silent deprecation" to mean that this functionality is suddenly not mentioned anymore.
Which is fair enough. That's why I put up this PR (even though I don't really have the time), to let you judge the trade-off (e.g. currently adding more list-likes due to #22486, vs. the ambiguity we're talking about in #24046). |
While I understand @jorisvandenbossche 's concern...
... I think on aligning the docs to the functionalities we can all agree, and
I think this is an important argument even leaving aside #24046 , because I see (sorry for not noticing before) that tuples are among these "list-likes", which I think is just wrong. We discussed in several other issues the fact that in pandas we want to distinguish tuples - which are either keys of a So I think this PR really deserves to get considered for the release. Will review in detail later today. |
I cautioned about the tuple case twice in this thread, but it was an explicit review request to use In any case, I agree with you, hence this PR (as well as #24688). |
I see... in a similar case I resolved to 32ee973#diff-1e79abbbdd150d4771b91ea60a4e1cc7R2701 ... but I agree this is annoying, so see #24702 . |
I saw @TomAugspurger mentioned cutting the RC soon. In the unlikely case that someone still wants to consider this on short notice, feel free to commit into this PR - I'm offline the next 6-7h. |
doc/source/whatsnew/v0.24.0.rst
Outdated
- :meth:`DataFrame.set_index` now allows all one-dimensional list-likes, raises a ``TypeError`` for incorrect types,
has an improved ``KeyError`` message, and will not fail on duplicate column names with ``drop=True``. (:issue:`22484`)
- :meth:`DataFrame.set_index` now gives a better (and less frequent) KeyError, raises a ``ValueError`` for incorrect types,
and will not fail on duplicate column names with ``drop=True``. (:issue:`22484`)
Well done everyone on getting the RC ready and out the door! Was wondering if this PR should still be considered for 0.24 nevertheless? Seems unlikely, but not impossible, considering e.g. what @jreback wrote in #24060:
@TomAugspurger @jorisvandenbossche @jreback @toobaz @WillAyd |
@h-vetinari would not be averse to: 1) improving the doc-string / examples 2) raising better errors on existing behavior, but the impl here does too many things. Basically the non-controversial things would be fine. |
Split off #24762.
The changes here are as advertised, and really minimal. The only things worth mentioning are:
- if not isinstance(keys, list):
- keys = [keys]
+ if (is_scalar(keys) or isinstance(keys, tuple)
+ or isinstance(keys, (ABCIndexClass, ABCSeries, np.ndarray))):
+ # make sure we have a container of keys/arrays we can iterate over
+ # tuples can appear as valid column keys!
+ keys = [keys]
+ elif not isinstance(keys, list):
+ raise ValueError(err_msg)
which replaces indiscriminate list-wrapping with a very reasonable list of allowed types, and
+ depr_warn = False
for col in keys:
if (is_scalar(col) or isinstance(col, tuple)) and col in self:
# if col is a valid column key, everything is fine
continue
elif is_scalar(col) and col not in self:
# tuples that are not keys are not considered missing,
# but illegal (see below)
missing.append(col)
- elif (not is_list_like(col, allow_sets=False)
- or getattr(col, 'ndim', 1) > 1):
- raise TypeError(...)
+ elif isinstance(col, list):
+ depr_warn = True
+ elif (not isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray))
+ or getattr(col, 'ndim', 1) > 1):
+ raise ValueError(err_msg)
if missing:
raise KeyError('{}'.format(missing))
+ if depr_warn:
+ msg = ('passing lists within a list to the parameter "keys" is '
+ 'deprecated and will be removed in a future version.')
+ warnings.warn(msg, FutureWarning, stacklevel=2)
which does the actual deprecation and restricts to the allowed types (the The rest is just doc changes, removing some test cases added by #22486 from the test parametrization, and catching the deprecation warnings in the tests (without changing the tests themselves). |
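For readers who want to try the proposed checks outside of pandas internals, here is a minimal standalone sketch of the logic in the diff above. The function name `normalize_keys`, the constant `ARRAY_TYPES`, and the error message text are hypothetical and chosen only for illustration; the public classes `pd.Index` and `pd.Series` stand in for the internal `ABCIndexClass` and `ABCSeries`. This is not the code that was proposed for `DataFrame.set_index` itself, only an approximation of its control flow.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.api.types import is_scalar

# Assumption: public classes stand in for the internal ABCIndexClass / ABCSeries.
ARRAY_TYPES = (pd.Index, pd.Series, np.ndarray)


def normalize_keys(df, keys):
    """Hypothetical helper mirroring the key checks sketched in the diff."""
    # Illustrative message only; the PR's actual err_msg is not shown in the thread.
    err_msg = ("keys must be a column label, an Index/Series/ndarray, "
               "or a list containing only those")

    if is_scalar(keys) or isinstance(keys, (tuple,) + ARRAY_TYPES):
        # Wrap single keys/arrays; tuples can be valid column keys.
        keys = [keys]
    elif not isinstance(keys, list):
        raise ValueError(err_msg)

    missing = []
    depr_warn = False
    for col in keys:
        if (is_scalar(col) or isinstance(col, tuple)) and col in df:
            # A valid column label, nothing more to check.
            continue
        elif is_scalar(col):
            # Scalar that is not a column: report it as missing.
            missing.append(col)
        elif isinstance(col, list):
            # List of values inside the list: allowed, but deprecated.
            depr_warn = True
        elif (not isinstance(col, ARRAY_TYPES)
              or getattr(col, "ndim", 1) > 1):
            # Anything else (including tuples that are not keys) is illegal.
            raise ValueError(err_msg)

    if missing:
        raise KeyError("{}".format(missing))
    if depr_warn:
        msg = ('passing lists within a list to the parameter "keys" is '
               'deprecated and will be removed in a future version.')
        warnings.warn(msg, FutureWarning, stacklevel=2)
    return keys
```

With a frame such as `pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})`, the sketch accepts `"A"` or `["A", np.array(["a", "b", "c"])]` silently, warns for `["A", ["a", "b", "c"]]`, raises `KeyError` for `["A", "Z"]`, and raises `ValueError` for a set.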
Sorry, I haven't been able to keep up. Are either of the following being deprecated?
In [8]: df.set_index([['a', 'b', 'c']])
Out[8]:
A B
a 1 4
b 2 5
c 3 6
In [9]: df.set_index(['A', ['a', 'b', 'c']])
Out[9]:
B
A
1 a 4
2 b 5
3 c 6
What's our recommendation to users doing this (the message should tell them)? |
@TomAugspurger I've adapted the warning message to give a clearer example of that. |
@TomAugspurger @jorisvandenbossche @toobaz @jreback |
not for 0.24.0 |
@jreback @TomAugspurger @toobaz |
@TomAugspurger @jorisvandenbossche @toobaz @jreback |
@h-vetinari To make sure there is a common understanding: this PR deprecates the usage of lists for the case of array-like of values (so not for the list of labels/array-likes). Is there any ambiguous case this is solving? Or is it only with the intent to make it less complex/confusing for users? |
That's correct.
One of the main points of discussion about the capabilities of This PR tries to break the stalemate between longstanding capabilities (including mixing column labels and arrays since before v0.12.) and this ambiguity, by deprecating the array-like role of lists, which can functionally be easily replaced by Series/Index/np.ndarray. This also improves the situation with the confusingly-similar-yet-totally-different |
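As a rough illustration of the replacement described above, the following sketch rewrites the two calls from the earlier question so that the values are passed as an Index, Series, or ndarray instead of a nested list. The frame contents mirror the example outputs in this thread; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
labels = ["a", "b", "c"]

# Instead of df.set_index([['a', 'b', 'c']]): pass the values as an Index.
only_values = df.set_index(pd.Index(labels))

# Instead of df.set_index(['A', ['a', 'b', 'c']]): keep the column label,
# and wrap the values in an ndarray or a Series.
mixed = df.set_index(["A", np.array(labels)])
mixed_series = df.set_index(["A", pd.Series(labels)])

print(only_values)
print(mixed)
```

Both spellings avoid the list-inside-a-list form, so a list passed as `keys` can be read unambiguously as a list of column labels and array-likes.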
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -79,6 +79,7 @@ Deprecations
~~~~~~~~~~~~

- Deprecated the `M (months)` and `Y (year)` `units` parameter of :func: `pandas.to_timedelta`, :func: `pandas.Timedelta` and :func: `pandas.TimedeltaIndex` (:issue:`16344`)
- :meth:`DataFrame.set_index` has deprecated using lists of values *within* lists. It remains possible to pass array-likes, both directly and within a list.
what would help here is a sub-section note that shows a good example of this.
I'm assuming you'll find the section I added too long, but it's easier to shorten than to add. |
@jreback |
@jreback @jorisvandenbossche @TomAugspurger @gfyoung |
After some discussion in real space: this is introducing more of a special case. This is not desired, as it adds user complexity without a corresponding improvement. Thanks for the try here. This is a pretty tricky area of the API and we don't want to make it more confusing. |
@jreback @jorisvandenbossche @TomAugspurger PS. Lists might seem like a special case here (although #22264 did exactly the same for |
git diff upstream/master -u -- "*.py" | flake8 --diff
I wanted to add this before the RC gets cut for two reasons.
On the one hand, the docs have silently deprecated arrays in df.set_index; compare master: and 0.23.4:
This is IMO a breaking change. I'd wait for the outcome of the discussion of #24046, but I feel that could easily fall under the table before the RC, so I wanted to provide a worked-out implementation.
Equally importantly, #22486 added capabilities for lots of list-likes within a list to df.set_index, and has not seen a release yet. Therefore, deprecation would be much easier now than after 0.24.0rc.
@jreback @TomAugspurger @jorisvandenbossche @toobaz @WillAyd