DEPR/API: tuples in labels #24688
Comments
Worth a discussion for sure. Taking my own workflow as a bias, I've never intentionally found a use case for tuples as a label, though we get a surprising number of issues popping up as a result, so I suspect others do have workflows where this is valid. We just added a |
As long as we have MultiIndex, I am not sure it is possible to get rid of all cases where you might end up with tuples. First, when referring to a MultiIndex key, you need to use a tuple. Eg the
Another aspect is that some operations will give you tuples.
In a simplified data model where there would be no multi-indexed columns possible, it would make sense to restrict column names eg to only strings (and disallow tuples), but in the current pandas data model, I am not sure it is possible / desirable. |
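As one illustration of operations that hand back tuples, here is a minimal sketch assuming a small two-level MultiIndex (the labels and level names are made up for the example):

```python
import pandas as pd

# Hypothetical two-level index, purely for illustration.
mi = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["letter", "number"])

# Converting to a plain list yields one tuple per row.
as_list = mi.tolist()

# Flattening yields a one-level Index whose entries are tuples.
flat = mi.to_flat_index()

print(as_list)
print(type(flat).__name__, flat[0])
```

So even with tuple labels disallowed on input, results like these would presumably still contain tuples.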
Wouldn't this exactly solve that? Today it's potentially unclear whether a tuple used as an indexer refers to a location in a MultiIndex or one particular label that happens to be a tuple. If we disallow the latter, the ambiguity is gone, no? |
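A minimal sketch of the ambiguity in question, assuming nothing beyond the public Index constructor and its tupleize_cols flag: the same list of tuples can become either a two-level MultiIndex or a flat Index of tuple labels, and the same tuple key is valid for both.

```python
import pandas as pd

tuples = [("a", 1), ("b", 2)]

# By default, a list of tuples is turned into a MultiIndex.
multi = pd.Index(tuples)

# With tupleize_cols=False, each tuple is kept as one atomic label.
flat = pd.Index(tuples, tupleize_cols=False)

key = ("a", 1)
# The same key means "level values 'a' and 1" for one index
# and "the single label ('a', 1)" for the other.
print(type(multi).__name__, multi.get_loc(key))
print(type(flat).__name__, flat.get_loc(key))
```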
This was badly formulated on my part. Of course, the tuples would be necessary for accessing a
Your second example is intriguing and more relevant. Because with |
The ambiguity I referred to was the one between a tuple as values vs a tuple as a single label (but indeed, for sure, the amount of ambiguous cases decreases a bit :-)) Whether the tuple label is referring to a tuple key in a flat index or to a single key of multi-level columns, it is still this tuple that is conflicting with tuple as array-like (that is how I interpreted "because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise") |
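For reference, a short sketch of how the public type helpers in pandas.api.types classify a tuple, which is where the "label vs. array-like" tension comes from (the variable name is just for illustration):

```python
from pandas.api.types import is_hashable, is_list_like, is_scalar

label = ("a", 1)

# A tuple is hashable, so it can serve as a label...
hashable = is_hashable(label)
# ...but it is also list-like, so it can be read as a collection of labels...
list_like = is_list_like(label)
# ...and it is not considered a scalar.
scalar = is_scalar(label)

print(hashable, list_like, scalar)
```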
On second thought, it may be sufficient to allow tuples in |
You're right there as well, those cases would still need extra logic (for a good reason, as you showed in the example). But I'm guessing the vast majority of bugs are due to your first point (i.e. using "tuple-as-single-label"), so that would still be a substantial improvement/simplification (considering the complexity that would be needed to not just crash and burn as the OP illustrated) |
Yes, if we are speaking about bugs and not ambiguities/complexities in the API, you are probably right. |
That's exactly what I have in my bug report and PR. I'd welcome this suggestion. See the following code:
from pandas._libs import lib
lib.clean_index_list([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog'))])[0]
Output: |
pandas/pandas/tests/indexing/multiindex/test_iloc.py Lines 23 to 37 in f160c7d
|
Seeing this now only. I'm against suppressing tuples as values. I think @h-vetinari 's strongest point is
... but even that is simply solved by resorting to stuff like
Moreover, there is huge evidence of people using tuples as labels in their code. And even evidence of people who have no reason to stop. And cases where we can't provide better solutions.
Then, there is the more general argument of what exactly we would be forbidding. What about types which subclass tuple? Other hashable iterables? As in #24984, in order to avoid properly rewriting some internal code which should definitely be rewritten, we plan to complicate the (description of the) API.
By the way, disallowing tuples would mean at the very minimum showing a warning for every usage, for some time. That's much more annoying for me to implement than a couple of refactorings which we should do anyway.
The one argument that could convince me is that there are many cases of API ambiguity due to accepting tuples. However, I can't find any. And on this topic,
This is wrong, or just backwards-compatible code to be removed. Tuples are not list-likes, and this is orthogonal to having them as atomic labels. And they can't be list-likes because they are |
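On the question of what exactly would be forbidden, a small pure-Python sketch (the Point type is hypothetical) showing that "is it a tuple?" already has more than one answer, and that tuples are not the only hashable iterables:

```python
from collections import namedtuple
from collections.abc import Hashable, Iterable

Point = namedtuple("Point", ["x", "y"])  # hypothetical tuple subclass
p = Point(1, 2)

# A namedtuple passes an isinstance check but not an exact type check.
passes_isinstance = isinstance(p, tuple)
is_exact_tuple = type(p) is tuple
print(passes_isinstance, is_exact_tuple)

# Other objects are hashable and iterable as well.
candidates = (p, (1, 2), frozenset({1, 2}), "ab")
for obj in candidates:
    print(type(obj).__name__, isinstance(obj, Hashable), isinstance(obj, Iterable))
```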
The use case I've had where tuple-labels were indispensable is where I had |
I haven't found a previous issue for this despite a bunch of searching, apologies if I overlooked something.
Yes, tuples are hashable and can therefore be used for indexing in principle, but not least with using df.loc[(lvl1, lvl2), col] for MultiIndexing, it becomes a big complexity burden to allow tuples as labels.
There are also a couple of places where some inputs take either labels or arrays (e.g. DataFrame.set_index), and it's a hassle to get all the corner cases right, because one needs to check if a tuple is a valid key (both for the frame and the MI.names), and interpret it as a list-like otherwise (or whatever the precedence might be within the method).
Tuples are also notoriously hard to get into an array, because numpy will often change the semantics if it's not is_scalar.
Furthermore, there's lots of bugs hiding because many of these corner cases are not tested:
I'd bet that there are 100s of similar cases that are hidden because the vast majority of tests do not check tuples as index entries.
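To make the numpy point above concrete, a minimal sketch (the values are made up): passing a list of tuples to np.array produces a 2-D array, so keeping each tuple as a single element means building an object array by hand.

```python
import numpy as np

tuples = [(1, 2), (3, 4)]

# numpy unpacks the tuples into a second dimension.
unpacked = np.array(tuples)
print(unpacked.shape)

# To keep each tuple as one element, fill an object array element by element.
atomic = np.empty(len(tuples), dtype=object)
for i, t in enumerate(tuples):
    atomic[i] = t
print(atomic.shape, atomic[0])
```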
@WillAyd just commented on an issue:
Why allow it at all in this case? Why not just deprecate tuples in Index / column names / MultiIndex.names?