Allowing the index to be referenced by name, like a column #8162

makmanalp · 2014-09-02T20:58:40Z

Idea: use df.index/df.columns names to automatically choose axis along which to broadcast #13243
Request for some kind of named arguments loc #11373
Partial Selection on MultiIndex: The need for empty slice support & dict indexing #4036

What if we allowed the index of a dataframe to be referred to in the usual ways?

data = pd.read_table("...", index_col="id")
data.id  # breaks
data["id"]  # breaks

I find myself setting and resetting indices very often to join to a different dataframe or to pull in the values of the index to a subselection of the dataframe, etc. I figure this is because of how the data is stored under the hood, but wouldn't this be convenient?
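For reference, a minimal sketch of the friction described above. The small frame stands in for the read_table call; its column names and values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the file read above.
data = pd.DataFrame({"id": [10, 20, 30], "value": [1.5, 2.5, 3.5]}).set_index("id")

# The index name is not a column label, so label lookup fails.
try:
    data["id"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# The usual workaround: temporarily move the index back into the columns.
as_column = data.reset_index()["id"]
```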

jreback · 2014-09-02T21:15:10Z

I recall another issue about this - can u have a look for it?

further this is not difficult

want to try a pr?

makmanalp · 2014-09-02T22:12:35Z

Yeah, I'd love to take a shot at implementing this. I spent a few minutes looking for the old issue but couldn't find anything other than the tangentially relevant #8082 . Do you remember any other details?

jreback · 2014-09-02T23:08:38Z

I think I am remembering implementing (then reverting) this

you will need to change __getattr__ and _get_item_cached in core/generic.py

need good tests!

shoyer · 2014-09-04T03:25:57Z

I think this is a great idea. I did something similar in xray.

A few things to consider for a full-fledged implementation:

1. What should the type of data['id'] be? I think it should be a Series (i.e., data.index.to_series() or pd.Series(data.index, data.index)) rather than an Index (data.index), to follow the rule that the items in a DataFrame are always Series objects.
2. This should work with a MultiIndex. In this case, you should get a Series where the values are only from the named level (i.e., pd.Series(data.index.get_level_values('id'), data.index)).
3. Don't forget indexing columns with lists. This should also work, returning a DataFrame: data[['id', 'other_col']]
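The first two points can be sketched with calls that already exist in pandas. This only illustrates the return values being proposed, not an implementation of the feature; the frames and the "tag" level name are assumptions.

```python
import pandas as pd

data = pd.DataFrame(
    {"other_col": [1.0, 2.0, 3.0]},
    index=pd.Index([10, 20, 30], name="id"),
)

# Single index: the index values as a Series aligned to the index itself.
id_series = data.index.to_series()

# MultiIndex: only the values of the named level, aligned to the full index.
multi = pd.DataFrame(
    {"other_col": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [(10, "x"), (20, "y"), (30, "z")], names=["id", "tag"]
    ),
)
level_series = pd.Series(
    multi.index.get_level_values("id"), index=multi.index, name="id"
)
```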

makmanalp · 2014-09-04T15:28:28Z

@shoyer - thank you so much! I was pondering the first myself - great point about the type, I wonder if Index follows the Series interface exactly. If so, shouldn't be a problem. Second and third hadn't even occurred to me.

It looks like Index and Series both inherit from IndexOpsMixin (https://github.com/pydata/pandas/blob/master/pandas/core/base.py#L283)

https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L74 and https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L80

@jreback thoughts?

jreback · 2014-09-04T15:40:20Z

this is very simple

just change the methods I showed above
and wrap with _constructor

TomAugspurger · 2014-09-04T18:44:46Z

Regarding @shoyer's #3, with

In [7]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['a', 'b', 'c'])
In [8]: df.index.name = 'idx'

Does df[['idx', 'A', 'B']] return

     A  B
idx
a    1  4
b    2  5
c    3  6

with idx in the index still, or

  idx  A  B
0   a  1  4
1   b  2  5
2   c  3  6

with idx as a column? It should be the second one IMO.

shoyer · 2014-09-04T18:48:59Z

@TomAugspurger actually, I think it should either be your first example, or something like:

     idx  A  B
idx
a      a  1  4
b      b  2  5
c      c  3  6

This has the disadvantage of now having a redundant column/index with the same name. But I don't like changing the index based on indexing particular columns -- if you want that, you can use reset_index().
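Both candidate results can be produced today with existing methods, which may help when comparing them. This is a sketch of the two shapes under discussion, not of the proposed df[['idx', 'A', 'B']] behaviour itself; the variable names are made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])
df.index.name = "idx"

# idx promoted to a regular column, with a fresh integer index.
promoted = df.reset_index()[["idx", "A", "B"]]

# idx kept as the index and also repeated as a column.
duplicated = df.assign(idx=df.index)[["idx", "A", "B"]]
```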

makmanalp · 2014-09-04T22:32:48Z

I think the first one is simpler too. We're not hiding that it's the index, and we're not promoting it to be a column, we're just allowing it to be referred to and used as a column.

jorisvandenbossche · 2014-09-05T06:32:11Z

But it should be consistent I think. If df['idx'] returns the index wrapped in a Series, then df[['idx', 'A', 'B']] should also return it as a Series, and thus a DataFrame with 3 columns I think (so the example as @shoyer showed it). df[['idx', 'A', 'B']] and df[['A', 'B']] should not be the same I think.

shoyer · 2014-09-05T07:37:09Z

I agree with @jorisvandenbossche. Columns are never going to be fully interchangeable with indexes (even after this change), and if you're explicitly indexing the index as a column you presumably want it as a series, not an index.

Another edge case to test for: let's make sure df.groupby('idx') works. Right now you need to write df.groupby(level='idx').

TomAugspurger · 2014-09-05T12:37:02Z

+1 for @shoyer's example. I should have explained why I think that including idx in the slice should return it as a column. First of all there's the mental model that df[<list>] always returns a DataFrame whose columns are in the list. Second, this would be the only way to do things like df[['idx', 'A', 'B']].sum(1) without resorting to the ugly old way of reset_index()ing.

I had an issue and PR about @shoyer's groupby case that I never finished off. We can handle groupby separately, but if this goes into 0.15, I'll finish up that PR.

makmanalp · 2014-09-05T18:21:35Z

@shoyer didn't know about the level=idx! The groupby was on my list because it's such a pain in the butt.

One question, is wrapping the index in a series and adding it onto the dataframe essentially a no-op, or is it going to be horribly inefficient for larger dataframes?

shoyer · 2014-10-02T23:40:55Z

I think a broader theme of the issue is that it is intuitive to think of an "index" as a special type of column, rather than as a separate type of entity.

TomAugspurger · 2015-08-14T02:37:03Z

Just to re-raise this with another use case, this would help out matplotlib with their labeled data plotting. I haven't looked recently, but an earlier version had to work around not being able to use __getitem__ to get to the index.

I'm less sure about the need to allow df[['index_name', 'other_col']].

makmanalp · 2015-08-14T16:41:41Z

@TomAugspurger in defense of df[['index_name', 'other_col']], what's nice about it is that it saves you from a ton of gross foo.reset_index().blah.set_index() and other similar cruft that isn't really meaningful and obscures what your code is actually trying to do.

tacaswell · 2015-08-20T01:04:00Z

There is currently code on that branch so that

plt.plot('foo', data=df)
plt.plot(df['foo'])

will both grab the index to use as the x values instead of a range, but that is implemented only for plot and nothing else.

But, major 👍 from me on this ability. I don't have a view on the list slicing, but the name should be something other than id as that seems like a source of endless collisions.

jankatins · 2015-08-20T06:46:31Z

This "problem" was also on the ggplot todo list. I would vote for df["__index__"] being treated specially (i.e. returning df.index) and for a named index to also show up in df[[<...>]]

jbrockmendel · 2017-07-24T17:01:58Z

Transplanting from #17061 on convergence in Index/Series behavior.

It would be nice to be able to access foo.dt without first having to check whether foo is an Index or Series. This could be accomplished by having DatetimeIndex, PeriodIndex, and TimedeltaIndex have a property dt that just returns self. If others agree, I'll put together a PR. Thoughts?
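The check this proposal would remove looks roughly like the sketch below. The helper name year_of is hypothetical and only illustrates the current asymmetry between Series and datetime-like Index objects.

```python
import pandas as pd


def year_of(values):
    """Return the year component of a datetime-like Series or Index."""
    # A Series exposes datetime fields through the .dt accessor,
    # while a DatetimeIndex exposes the same fields directly.
    if isinstance(values, pd.Series):
        return values.dt.year
    return values.year


idx = pd.date_range("2017-07-24", periods=3, freq="D")
ser = pd.Series(idx)

years_from_index = year_of(idx)
years_from_series = year_of(ser)
```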

MarcoGorelli · 2023-03-30T10:34:06Z

If I've understood the suggestion correctly, I'm -1 on it, because of the ambiguity in what should happen if a column has the same name as the index

In [5]: df
Out[5]:
   a  b
a
7  1  4
8  2  5
9  3  6

# what does df['a'] return?
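The frame above can be rebuilt as follows. With current pandas, df['a'] resolves to the column and the index has to be requested explicitly; the sketch only shows that baseline and makes no claim about how the proposal would have resolved the clash.

```python
import pandas as pd

# Index named "a" alongside a column that is also labelled "a".
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6]},
    index=pd.Index([7, 8, 9], name="a"),
)

column_a = df["a"]               # today: always the column
index_a = df.index.to_series()   # the index must be asked for explicitly
```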

MarcoGorelli · 2023-04-12T19:08:15Z

closing as per today's discussion then - thanks anyway for the issue

tacaswell · 2023-04-12T19:12:12Z

@MarcoGorelli Is there a link to any notes from the discussions?

MarcoGorelli · 2023-04-12T19:21:16Z

yes but they just say "agreed to close" 😄 https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit#

A related issue which was brought up is #27652, which may still be considered

tacaswell · 2023-04-13T00:17:16Z

😞

I am curious what the persuasive argument was.

I see how from inside of pandas the index is very special, but from the outside it just looks like any other column. In the case where you need to consume input from users of many types (which may just be a Matplotlib problem), being able to treat dict-of-array, dataframes, h5py groups, xarray, [anything that returns an array for __getitem__(key: str) -> Array], etc. the same way is pretty nice. From the outside it is a weird wart that there is data on dataframes that cannot be extracted via __getitem__.
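A consumer in that position can paper over the difference with a small adapter. The sketch below is one possible shape for it: get_by_name is a hypothetical helper, and its "columns first, then index levels" rule is an assumption that silently sidesteps the name-clash problem rather than solving it.

```python
import numpy as np
import pandas as pd


def get_by_name(data, key):
    """Hypothetical helper: fetch an array by name from a mapping-like container.

    For a DataFrame, a column label wins over an index level of the same name.
    """
    if isinstance(data, pd.DataFrame):
        if key in data.columns:
            return data[key].to_numpy()
        if key in data.index.names:
            return data.index.get_level_values(key).to_numpy()
        raise KeyError(key)
    # Anything else is assumed to return an array-like from __getitem__.
    return np.asarray(data[key])


frame = pd.DataFrame({"y": [1.0, 2.0]}, index=pd.Index([0.5, 1.5], name="x"))
mapping = {"x": [0.5, 1.5], "y": [1.0, 2.0]}

x_from_frame = get_by_name(frame, "x")
x_from_mapping = get_by_name(mapping, "x")
```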

On the other hand I see the namespace problem may be intractable and the above use case might be niche enough that it is not worth the engineering and documentation effort to make it work.

jbrockmendel · 2023-04-13T01:19:18Z

The main pain point was cases where the index name(s) matched a column label

davidgilbertson · 2023-07-19T21:30:48Z

That doesn't seem like a great reason to not proceed. .groupby works seamlessly across columns and indexes. If an index and column share a name, it errors with ValueError: <name> is both an index level and a column label, which is ambiguous.
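The groupby behaviour referred to here can be reproduced with a small frame. This is a sketch assuming a recent pandas release (older versions handled the clash differently); the data is made up.

```python
import pandas as pd

df = pd.DataFrame({"b": [4, 5, 6]}, index=pd.Index([7, 7, 9], name="a"))

# A string key that matches only an index level name is resolved to that level.
sums = df.groupby("a")["b"].sum()

# Add a column with the same name as the index: the key is now ambiguous.
clash = df.assign(a=[1, 2, 3])
try:
    clash.groupby("a")["b"].sum()
    raised = False
except ValueError:
    raised = True
```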

@MarcoGorelli if there's a deeper reason, it would be great to know so I can properly give up hope :). Otherwise, from all the other comments this doesn't seem like an impossible thing, I'm happy to contribute.

MarcoGorelli · 2023-07-20T06:43:09Z

Personally, I'd rather not add even more auto-magic and inconsistencies. This is going to open up more issues. There's enough to work on. If a PDEP were raised, I'd probably vote down, sorry

But that doesn't mean you need to give up hope 😄 If you can get another core member on board, write a PDEP with them, and then get a 2/3 majority of core members to vote it up, then you could bypass my negativity

jreback added API Design labels Sep 3, 2014

shoyer mentioned this issue Sep 22, 2014

Bloomberg Hackathon #8323

Closed

shoyer mentioned this issue Apr 27, 2015

Towards "pandas 1.0" #10000

Closed

jorisvandenbossche added this to the 0.17.0 milestone Jul 26, 2015

jorisvandenbossche mentioned this issue Aug 14, 2015

API: select levels of a MultiIndex #10816

Open

jreback modified the milestones: Next Major Release, 0.17.0 Sep 1, 2015

jreback added Prio-medium labels Sep 1, 2015

This was referenced Jul 25, 2017

WIP: Refactor accessors, unify usage, make "recipe" #17042

Closed

DISC: add accessor attributes to Index for consistency with Series #17134

Open

jorisvandenbossche mentioned this issue Aug 9, 2017

Add dt accessor to Index #17204

Closed

jreback modified the milestones: Next Major Release, High Level Issue Tracking Sep 24, 2017

Dr-Irv mentioned this issue Mar 23, 2018

BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

Closed

TomAugspurger removed the Master Tracker High level tracker for similar issues label Jul 6, 2018

TomAugspurger removed this from the High Level Issue Tracking milestone Jul 6, 2018

datapythonista added this to the Someday milestone Jul 8, 2018

jorisvandenbossche mentioned this issue Aug 1, 2019

API: Meta-issue for making consistent API's to refer to column names and index names #27652

Open

jbrockmendel removed Effort Medium labels Oct 21, 2019

jbrockmendel self-assigned this Dec 26, 2019

jorisvandenbossche mentioned this issue Jan 31, 2020

Enhancement: code sugar for index.get_level_values #31444

Closed

ChrisStuff mentioned this issue Jun 15, 2020

ENH: Inconsistency when string refers both to index level and column label #34791

Closed

mroeschke added Enhancement and removed API Design labels Apr 11, 2021

mroeschke removed this from the Someday milestone Oct 13, 2022

jbrockmendel removed their assignment Mar 30, 2023

MarcoGorelli closed this as completed Apr 12, 2023
