Allowing the index to be referenced by name, like a column #8162
Comments
I recall another issue about this. Can you have a look for it? Also, this is not difficult; want to try a PR? |
Yeah, I'd love to take a shot at implementing this. I spent a few minutes looking for the old issue but couldn't find anything other than the tangentially relevant #8082. Do you remember any other details? |
I think I am remembering implementing (then reverting) this you will need to change need good tests! |
I think this is a great idea. I did something similar in xray. A few things to consider for a full-fledged implementation:
|
@shoyer - thank you so much! I was pondering the first myself. Great point about the type; I wonder if Index follows the Series interface exactly. If so, it shouldn't be a problem. The second and third hadn't even occurred to me. It looks like Index and Series both inherit from IndexOpsMixin (https://github.com/pydata/pandas/blob/master/pandas/core/base.py#L283); see https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L74 and https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L80. @jreback thoughts? |
This is very simple: just change the methods I showed above. |
In [7]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['a', 'b', 'c'])
In [8]: df.index.name = 'idx'

Does

     A  B
idx
a    1  4
b    2  5
c    3  6

with

  idx  A  B
0   a  1  4
1   b  2  5
2   c  3  6

with |
@TomAugspurger actually, I think it should either be your first example, or something like:

    idx  A  B
idx
a     a  1  4
b     b  2  5
c     c  3  6

This has the disadvantage of now having a redundant column/index with the same name. But I don't like changing the index based on indexing particular columns -- if you want that, you can use |
I think the first one is simpler too. We're not hiding that it's the index, and we're not promoting it to be a column, we're just allowing it to be referred to and used as a column. |
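For comparison, both output shapes discussed above can already be produced with existing pandas calls. This is only a minimal sketch of today's behavior, not the proposed feature; it reuses the `df` from the earlier comment, and the variable names `flat` and `both` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6]},
    index=pd.Index(["a", "b", "c"], name="idx"),
)

# Shape with a fresh integer index: the named index is promoted to a column.
flat = df.reset_index()

# Shape with the original index kept and a redundant 'idx' column added.
both = df.assign(idx=df.index)[["idx", "A", "B"]]

print(flat)
print(both)
```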
But it should be consistent, I think. If |
I agree with @jorisvandenbossche. Columns are never going to be fully interchangeable with indexes (even after this change), and if you're explicitly indexing the index as a column you presumably want it as a series, not an index. Another edge case to test for: let's make sure |
+1 for @shoyer's example. I should have explained why I think that including I had an issue and PR about the @shoyer's groupby that I never finished off. We can handle groupby separately, but If this goes into 0.15, I'll finish up that PR. |
@shoyer didn't know about the level=idx! The groupby was on my list because it's such a pain in the butt. One question, is wrapping the index in a series and adding it onto the dataframe essentially a no-op, or is it going to be horribly inefficient for larger dataframes? |
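As a rough illustration of the `level=` spelling mentioned above, and of what wrapping the index in a Series looks like with the existing API, here is a small sketch. The data is made up, and the snippet makes no claim about how either call performs on large frames.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4]},
    index=pd.Index(["a", "b", "a", "b"], name="idx"),
)

# Group by the named index level without resetting the index first.
by_level = df.groupby(level="idx")["A"].sum()

# Wrap the index in a Series (indexed by itself, named after the index).
idx_as_series = df.index.to_series()

print(by_level)
print(idx_as_series)
```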
I think a broader theme of the issue is that it is intuitive to think of an "index" as a special type of column, rather than as a separate type of entity. |
Just to reraise this with another use-case, this would help out matplotlib with their labeled data plotting. I haven't looked recently, but an earlier version had to workaround not being able to use I'm less sure about the need to allow |
@TomAugspurger in defense of |
There is currently code on that branch so that

plt.plot('foo', data=df)
plt.plot(df['foo'])

Will grab both the index to use as the index instead of But, major 👍 from me on this ability. I don't have a view on the list slicing, but the name should be something other than |
This "problem" was also on the ggplot todo list. I would vote for |
Transplanting from #17061 on convergence in Index/Series behavior. It would be nice to be able to access |
If I've understood the suggestion correctly, I'm -1 on it, because of the ambiguity in what should happen if a column has the same name as the index

In [5]: df
Out[5]:
   a  b
a
7  1  4
8  2  5
9  3  6

# what does df['a'] return? |
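To make the collision concrete, here is a small sketch of how pandas behaves today with a frame like the one above: `df['a']` returns the column, and the index values have to be reached through `df.index`. The variable names are illustrative.

```python
import pandas as pd

# A column and the index share the name 'a'.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6]},
    index=pd.Index([7, 8, 9], name="a"),
)

# Today, label lookup with [] only sees columns.
col = df["a"]

# The index values are reachable only via the index itself.
idx = df.index.to_series()

print(col.tolist())
print(idx.tolist())
```

Under the proposal, some rule would have to decide which of these two `df['a']` means.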
closing as per today's discussion then - thanks anyway for the issue |
@MarcoGorelli Is there a link to any notes from the discussions? |
yes but they just say "agreed to close" 😄 https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit# A related issue which was brought up is #27652, which may still be considered |
😞 I am curious what the persuasive argument was. I see how from inside of pandas the index is very special, but from the outside it just looks like any other column. In the case where you need to consume input from users of many types (which may just be a Matplotlib problem) being able to treat dict-of-array, dataframes, h5py groups, xarray, [anything that returns an array for On the other hand I see the namespace problem may be intractable and the above use case might be niche enough that it is not worth the engineering and documentation effort to make it work. |
The main pain point was cases where the index name(s) matched a column label |
That doesn't seem like a great reason not to proceed. @MarcoGorelli if there's a deeper reason, it would be great to know so I can properly give up hope :). Otherwise, judging from all the other comments, this doesn't seem like an impossible thing, and I'm happy to contribute. |
Personally, I'd rather not add even more auto-magic and inconsistencies. This is going to open up more issues. There's enough to work on. If a PDEP were raised, I'd probably vote it down, sorry. But that doesn't mean you need to give up hope 😄 If you can get another core member on board, write a PDEP with them, and then get a 2/3 majority of core members to vote it up, then you could bypass my negativity |
What if we allowed the index of a dataframe to be referred to in the usual ways?
I find myself setting and resetting indices very often to join to a different dataframe or to pull in the values of the index to a subselection of the dataframe, etc. I figure this is because of how the data is stored under the hood, but wouldn't this be convenient?
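For context, the set/reset dance for joins looks roughly like the sketch below with the existing API. The frames `left` and `right` and their column names are made up for illustration; `right_index=True` is one existing way to avoid the reset in this particular case.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame(
    {"y": [10, 20, 30]},
    index=pd.Index(["a", "b", "c"], name="idx"),
)

# Workaround: promote the index to a column, then join on it by name.
merged_reset = left.merge(right.reset_index(), left_on="key", right_on="idx")

# Alternative: join a column on the left against the index on the right.
merged_idx = left.merge(right, left_on="key", right_index=True)

print(merged_reset)
print(merged_idx)
```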