-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: capabilities of df.set_index #24046
Comments
@h-vetinari should |
@jbrockmendel |
@jreback
I've opened this issue here for exactly this purpose (discussing your objections to existing capabilities of |
@h-vetinari While I think locking #22225 was an unnecessary move from @jreback , you have to realize that the "''overruling approving reviews''" thing is not a good argument to raise in such a discussion. True, in pandas we look for devs consensus, but in the end we prefer to do so by argumenting rather than by any form of proper vote. So that comment was not very useful, to use an euphemism. Back to the topic of the discussion: if I understand correctly, @jreback is argumenting that
If this summary of the discussion is correct, then I think I am also against introducing And actually, I'm probably in favor of deprecating I understand you argument about
is a bad idea. ''There should be one - and preferably only one - obvious way to do it." EDIT: By the way: sorry for not reacting before to the ping - busy period. |
@h-vetinari In my opinion, the locking was not the best way to handle the discussion in the PR, so sorry about that. In the meantime, Jeff has unlocked the conversation there, but let's continue the discussion here. On the topic: given the behaviour of
@toobaz I personally almost never seen someone use |
Agreed that locking was not appropriate. On the issue itself, to me it's pretty clear that In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
In [3]: df.set_index([['a', 'b', 'c']])
Out[3]:
A
a 1
b 2
c 3 Since In [4]: df.A.set_index([['a', 'b', 'c']]) work as well. |
I don't think continuing to post here or in the associated PR is an effective use of anyone's time. Why don't we just add it as a discussion point to the next dev chat? |
The implementation is fine, and deserves to go in 0.24.0 if we can agree on
the desired behavior. No need to delay I think.
…On Mon, Jan 7, 2019 at 10:09 AM William Ayd ***@***.***> wrote:
I don't think continuing to post here or in the associated PR is an
effective use of anyone's time. Why don't we just add it as a discussion
point to the next dev chat?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24046 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIrAEcEJoHmYf7ZMXGa7rvVpAlYUFks5vA3EtgaJpZM4Y9ZRa>
.
|
There isn't agreeance on desired behavior hence why I suggest moving to a separate forum. I don't think it's something we need to push into 0.24 at the end here either |
I really don't think this is a difficult decision though, is it? Do we want
which isn't correct, as shown in |
I agree that if it seems difficult to find an agreement, that discussing this on the next dev chat can be more productive / effective. |
@TomAugspurger We all agree (I think) that it makes sense for One alternative (which I don't particularly like) is what (I think) |
I am -1 due to ambiguity. I don't know what the desired behavior of the following is: df = pd.DataFrame(np.ones((3,3)))
df.set_index([1, 0, 2]) |
That is not fully the discussion. As that is about a DataFrame, and that behaviour is already defined (it first prefers column names). Given the confusion and talking next to each other, it might be good if someone attempts to make a good illustrated and complete summary of the actual discussion. |
Is |
No, it's not. But as already stated, if |
On deprecating passing values, rather that column labels to In [11]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
In [12]: df.set_index(["A", [1, 2, 3]])
Out[12]:
B
A
1 1 4
2 2 5
3 3 6 Without that, I think you'd have some ugly In [18]: df.set_axis(pd.MultiIndex.from_arrays([df.A, [1, 2, 3]]), inplace=False).drop(['A'], axis=1)
Out[18]:
B
A
1 1 4
2 2 5
3 3 6 |
you could just raise / warn on ambguity |
this needs coupling with possibly deprecating set_axis as well |
I would personally be happy to get rid of |
Deprecating the one method that works as expected?!
The problem is also for |
Sorry, I removed the new content and also most of the old. Without arguing for one position over another, here's the state of things as seen from the user's point of view who looks to the documentation for guidance:
At the same time, all example uses of I think it's fair to say that the docs currently advocate for always setting the index directly and gives no indication that setting a series index in a method-chain is supported. I care about this more than I care about which competing alternative wins out. My personal preference is still to keep |
You're not trying to understand. |
I'm pretty sure this will never pass. |
I understand that, to you, the functionality is not removed because it lives on in
It doesn't have to be everywhere or at the same time, but removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability. I'm saying it could be a goal to strive for, like striving to have tuples always be MI-keys (and was an argument why "special-casing" in option 3 can be desirable).
@toobaz, this is part of the reason why I consider shifting the array-capability from |
We mostly disagree on what constitutes productive interaction. And from my part, this is the last "metacomment" in this discussion.
I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to So among the options you listed, only 4 and 5 are feasible. To be honest I don't like 5, but this might be subjective. Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"? |
I agree that users should not pay the price. I also don't want the overloaded interpretation in
As a deprecation, yes. After the removal, not so much (because
I think 3 and 5 are feasible. The overlap is 5, and I'd be happy to support that.
Indeed, and we both have our failings there. Thanks for taking the time/energy to stick with it. |
Why "obvious"? For the name "
3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments. I guess we never discussed much 5, so I'll be happy to mention it in the next live chat. I still think we will end up choosing 4. |
That plays a large role, yes, because we're
I have enough arguments**, but so far you have hardly responded to them, except by appeal to authority. 4 or 5 is your opinion, 3 or 5 is mine. Of course, you're in a much better position to enforce your ideas, but that does not improve the strength of your argument. Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5. ** I realised I haven't even fully articulated one of the most important arguments against 4: saying that arrays can only be used in
|
This is a way to waste the time of both of us. So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, instead than doing more productive things. But since you later repeated your arguments in a tidy way, I will for the last time to assume you're trying to be constructive, and repeat my objections to your arguments. But neither your arguments nor my replies are new - and my replies are not just mine, other devs contributed to this and related discussions. If you want to just discuss the points below over and over, let's do this in a live chat so at least we wast less time.
The fact that an ability is added later on does not mean it cannot be deprecated (actually the opposite).
"setting index with arrays" is a functionality (not so sure it is even a "core" one - setting from columns is much more frequent in the code I see), having it in
Your argument would suggest we want to make them almost-duplicates (apart from the
As already clearly stated: you are right this would simplify our life, but we don't want our users to pay for. We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now. As for terminology, 2D arrays are collections of 1D arrays, which are collections of elements, so lists of lists make perfect sense.
There is nothing to "solve" if the |
I appreciate the time you took, and thanks for that. Indeed I was trying to have a productive discussion, and what seemed like rants to you was my response to the perception that you dismissed my points out of hand (until your last reply).
The point is that the concepts
Except maybe if it is the cause of the hotly-contested ambiguity of Anyway, thanks for taking the time to respond. I'll note that you didn't react to my attempt at extending an olive branch though:
|
In API design, "adherence to language" is only one of the many arguments - and an argument which is not new in this discussion. And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.
You seem to think that this problem is important because it stalled for a year - maybe it stalled for a year because there were more important things ;-)
Flat lists and nested lists should be clearly different objects to our users.
I did. I said I don't like 5, I said why, but I also said I'm happy to discuss at our next live chat. Now, unless there are new arguments, I suggest we wait for that. |
Then the consequence should be deprecating
Is there a date/time already? |
Setting the index from columns and from data are two operations (I guess this is what you mean by "concepts") which are different enough (we have seen) as to raise the need to avoid ambiguity. It can be done through different functions, or different args, or by defining special cases users should be aware of, but we definitely agree they are distinct. Which of these solutions we pick is mostly not a matter of linguistics.
No |
(but just write me privately if you want to set up a chat with me) |
I'd like to point out that the tone of this thread makes me a bit uncomfortable. As a reminder, this project has a code of conduct https://github.com/pandas-dev/pandas-governance/blob/master/code-of-conduct.md In such discussions, I think we (both maintainers and contributors) need to stick to facts and technical arguments and leave feelings and editorial comments out of the process. There is a risk in technical arguments to stoop to emotive conjugation (https://en.wikipedia.org/wiki/Emotive_conjugation) in describing others' actions. In general my understanding is that this project operates on the basis of consensus-based decision making -- when there is no consensus about a change, the default option is probably to do nothing. In theory as the BDFL I can help settle disagreements, but I would prefer not to except in truly exceptional circumstances. I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document. |
@wesm I have striven to avoid any emotionally charged words, but don't claim that I always succeeded. I believe all participants truly want the best for the combination of user- & maintainer-base, but such impassioned arguments take a lot of time and energy (which, I presume is the reason why many participants have not joined the discussion anymore). I do object to the way some things were handled in this whole episode, but will not dwell on that. My main reasons for not resigning from this discussion are that I don't want the case dismissed without a fair hearing/counter-argument (even if I'm not core-dev), and that I feel the picture is much less one-sided even on the dev-side, as the impression that the last few comments might give.
I'd be happy to participate in another format, but didn't know a better way than through an issue here. PS. Thanks for the link about emotive conjugation. "How would I describe myself in their shoes" will be an excellent self-check before speaking/posting. |
In the last dev chat, which was held just few minutes ago, this issue was discussed and there was clear consensus for option 4, that is, "deprecate using set_index with arrays, and point to set_axis instead". (Related to the "tone" of the discussion, I definitely try to keep emotionally charged language away from my comments, but whenever should I fail to do so, I welcome being explicitly reprimanded - ideally in private. I'm not a native speaker, and in any case I definitely won't be offended by any such rebuke. On the other hand, language aside, I think that when a discussion takes more of our time and energy than it is worth, there is nothing wrong in stating it, even if it is not a "fact or technical argument" on pandas itself.) |
@toobaz can you give some reasoning why this is the clear preference? |
@toobaz @jorisvandenbossche |
To be honest there was more a recall of the reasoning already exposed here than any new argument. Some devs (e.g., @WillAyd ) were already clearly in favour of option 4. The only new thing was a proposal (by @TomAugspurger if I recall correctly) to deprecate @h-vetinari you probably already know, but just in case: the call was publicly announced on the [pydata] mailing list. In any case, again, if there had been new arguments I would have written them here. It is hard to see the 76 comments here + other in related issues a "lack of documentation and transparency".
Wlll reply there. |
@toobaz
I do not consider dispersed discussion in several threads and comments as appropriate documentation (again: not as a criticism of you or the other devs, but rather of the current process). I added a comment about this in #28568. |
@h-vetinari The calls and meeting notes are public.
Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :) |
OK, thanks for the clarification ;-) |
This is coming out of a discussion that has stalled #22225 (which is about adding
.set_index
to Series, see #21684). The discussion has shifted away from what capabilities a putativeSeries.set_index
should have, but what capabilitiesdf.set_index
has currently.The main issue (for @jreback) is that
df.set_index
takes arrays:Further on:
I don't think I am confusing them. If I want to set the
.index
-attribute of a Series/DataFrame, then using.set_index
is the most reasonable name by far. If anything,set_axis
should be a superset ofset_index
(and a putativeset_columns
), that just switches between the two based on theaxis
-kwarg.More than that, the current capabilities of
df.set_index
are a proper superset ofdf.set_axis(axis=0)
**, in that it's possible to fillkeys
with onlySeries
/Index
/ndarray
/list
etc.:** there is one caveat, in that lists (and only lists; out of all containers) need to be wrapped in another list, i.e.
df.set_index([[0, 8, 3, 0]])
instead ofdf.set_index([0, 8, 3, 0])
. This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).Summing up:
set_index
is the most natural name for setting the.index
-attributedf.set_index
should be able to process list-likes (as it currently does; this is the source of the ambiguity of the list case).df.set_axis
should be able to do everything thatdf.set_index
does, and just switch between operating on index/columns based on theaxis
-kwarg (after all,index
andcolumns
are the two axes of a DF).set_columns
on aDataFrame
axis
-kwarg ofset_axis
should just switch between the behaviour ofset_index
(i.e. dealing with keys and array-likes) andset_columns
.Series.set_index
should support the same signature asdf.set_index
, with the exception of thedrop
-keyword (which only makes sense for column labels).set_index
andset_axis
methods should be exactly the same.Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.
EDIT: Forgot to add an xref from @jreback:
In that issue, there's discussion largely around
.rename
, and how to make that method more consistent. Also discussed was potentially introducing.relabel
, as well as.set_columns
.The text was updated successfully, but these errors were encountered: