inconsistent name handling in value_counts, part 2 #11579

Closed
corr724 opened this issue Nov 12, 2015 · 5 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug

Comments

@corr724

corr724 commented Nov 12, 2015

I was happy to see in the release notes for 0.17.0 that value_counts no longer discards the series name, but the implementation wasn't what I expected.

0.17.0 gives

>>> series = pd.Series([1731, 364, 813, 1731], name='user_id')
>>> series.value_counts()
1731    2
813     1
364     1
Name: user_id, dtype: int64

which doesn't set the index name.

In my opinion the old series name belongs in the index, not in the series name:

>>> series.value_counts()
user_id
1731    2
813     1
364     1
dtype: int64

Why:

  • It's logical: the user_id has moved to the index, and the values now represent occurrence counts
  • This would be consistent with how .groupby().size() behaves
  • Adding a missing index name is cumbersome and requires creating a temporary variable
  • In many cases the series name is discarded, while index names tend to stick around: for example, pd.DataFrame({'n': series.value_counts(), 'has_duplicates': series.value_counts() > 1}) should really have user_id as an index name
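A minimal sketch of the workaround described in the list above, for a pandas version where value_counts leaves the index unnamed. Note that `rename_axis` is not mentioned in this thread; it is shown only as one possible way to avoid the temporary variable.

```python
import pandas as pd

series = pd.Series([1731, 364, 813, 1731], name='user_id')

# Workaround with a temporary variable: name the index by hand.
counts = series.value_counts()
counts.index.name = series.name

# The same result in one chain (rename_axis is an assumption, not from the report).
chained = series.value_counts().rename_axis(series.name)

# Once the index is named, the name carries over into a DataFrame built from it.
df = pd.DataFrame({'n': counts, 'has_duplicates': counts > 1})
print(df.index.name)
```

The chained form avoids the temporary variable, but it still has to repeat `series.name`, which is the inconvenience the report is about.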

There are three options:

  • result.name = None and result.index.name = series.name
  • result.name = series.name and result.index.name = series.name
  • result.name = 'size' or 'count' and result.index.name = series.name

The first option seems more elegant to me but @sinhrks, who reported #10150, apparently expected result.name to be filled, so perhaps there are use cases where the second option is useful.
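To make the three options concrete, here is a rough sketch that applies each one on top of whatever value_counts returns. The helper `value_counts_named` and its `option` parameter are hypothetical names used for illustration only; they are not part of pandas.

```python
import pandas as pd

def value_counts_named(series, option):
    """Hypothetical helper: apply one of the three naming options listed above."""
    result = series.value_counts()
    # All three options move the old series name onto the index.
    result.index.name = series.name
    if option == 1:
        result.name = None          # option 1: no result name
    elif option == 2:
        result.name = series.name   # option 2: keep the series name as well
    elif option == 3:
        result.name = 'count'       # option 3: a fixed descriptive name
    else:
        raise ValueError("option must be 1, 2 or 3")
    return result

series = pd.Series([1731, 364, 813, 1731], name='user_id')
for option in (1, 2, 3):
    result = value_counts_named(series, option)
    print(option, repr(result.name), repr(result.index.name))
```

Because the helper sets both names explicitly, it gives the same names regardless of how the installed pandas version names the value_counts result.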

@jreback
Contributor

jreback commented Nov 12, 2015

Can you show where you think .groupby preserves? AFAICT it does not.

In [111]: s = pd.Series([1731, 364, 813, 1731], name='user_id', index=pd.Index(range(4), name='foo'))

In [112]: s
Out[112]: 
foo
0    1731
1     364
2     813
3    1731
Name: user_id, dtype: int64

In [113]: s.groupby(s.values).count()           
Out[113]: 
364     1
813     1
1731    2
dtype: int64

In [114]: s.value_counts()
Out[114]: 
1731    2
813     1
364     1
Name: user_id, dtype: int64

@jreback jreback added Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Nov 12, 2015
@sinhrks
Member

sinhrks commented Nov 12, 2015

@corr724 #10150 isn't the same as your second option. It is:

  • result.name = series.name and result.index.name = None

Agreed that the first option is the consistent one, considering how groupby().count() behaves:

s.groupby(s).count().name
# None
s.groupby(s).count().index.name
# 'user_id'

@corr724
Author

corr724 commented Nov 12, 2015

Yes, the three options are all ways to change the current behavior. The second option was intended as a compromise between the first option and the change in #10150.

@jreback: That's because s.values returns a plain array and discards all pandas metadata, including the name. Grouping with either s.groupby(s) or df.groupby('column') will set the index name.
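A small sketch of the difference, reusing the series from the report. The variable names are illustrative, and only the index name is inspected because the name of the result itself may differ between pandas versions.

```python
import pandas as pd

s = pd.Series([1731, 364, 813, 1731], name='user_id')
df = s.to_frame()

# Grouping by a bare array: there is no name to carry over to the index.
by_values = s.groupby(s.values).count()

# Grouping by the named series, or by a column label: the index gets that name.
by_series = s.groupby(s).count()
by_column = df.groupby('user_id').size()

for label, result in [('s.values', by_values),
                      ('s', by_series),
                      ("'user_id'", by_column)]:
    print(label, repr(result.index.name))
```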

@jreback
Copy link
Contributor

jreback commented Nov 13, 2015

ok, that seems reasonable then.

I think option 1 is fine (though I'm ok with name=size/count as well, your option 3). We should just make sure that groupby.size/count behaves the same way.

@mroeschke mroeschke added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug and removed API Design Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 21, 2021
@lukemanley
Member

closing as this was implemented in #49912
