-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: various .value_counts() result in different names / indices #49497
Comments
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage in case anyone has thoughts - hoping we can consider this a bug and change the behaviour in version 2.0 |
It could break existing code. In your examples, if you now do: pd.DataFrame(ser.value_counts()) you get a I think that we'd need a deprecation cycle for this??? |
True, it would break part of the API, but it would make the API more consistent Not sure how to raise a deprecation warning for this - where would the warning be raised? |
Isn't this inconsistent with other Series methods, e.g.
|
I don't think so, no - in
In |
I guess I could see an argument that Not sure how I feel about an operation just introducing a name like "counts"/"normalized" |
Isn't that what already happens in the groupby case though, where an extra column is inserted with the name In [11]: df.groupby('a', as_index=False)['b'].value_counts()
Out[11]:
a b count
0 1 4 1
1 2 4 1
2 3 5 1 I find it odd that the corresponding In [19]: df.groupby('a')['b'].value_counts()
Out[19]:
a b
1 4 1
2 4 1
3 5 1
Name: b, dtype: int64 not least because If the name of the resulting series were
would be aligned |
One thing to consider is to add a parameter to |
For indexes, the resulting
This doesn't seem quite right as it cannot support a
In the (as a side note - I think |
That's a good point @lukemanley , thanks, I've added that to the issue description @Dr-Irv true, there could be an extra parameter, but then
I've added this to the agenda on Wednesday, hope we can touch base on it |
Do I understand correctly that you are discussing |
Yes, this discussion is limited to |
Currently all Series methods that return a Series keep the Series name (excluding |
In In [5]: ser
Out[5]:
a 1
b 1
c 3
Name: foo, dtype: int64
In [6]: ser.mode()
Out[6]:
0 1
Name: foo, dtype: int64 whereas in In [7]: ser.value_counts()
Out[7]:
1 2
3 1
Name: foo, dtype: int64 Hence why I think the original Series name should propagate to the result Index name, rather than to the result Series name |
It might be worth having the could also have a parameter for the index name as well.
|
Isn't this what I think we should prefer method chaining in this case. |
I see your point. Still might be worth introducing it here to allow the current 1.x behavior to be preserved, then we deprecate the parameter later. Would require two deprecation cycles, one for the values of the default arguments and one for the new behavior. As I said earlier, this is a kind of pivot table operation, so having control of the names does make sense. I don’t think we have other places in pandas where we create a default name like this. |
I agree. Much prefer reducing arguments in peripheral cases where method chaining does the job. Some actions are so common that arguments are useful to many end users ( and may not be avoidable) but chaining gives a greater understanding of how the api can be used in generality. |
I agree with @MarcoGorelli's point about that all those examples you give, the values in the series "stay" (in some form, filtered or reduced) in the series' values. While for value_counts, the values move to the index, and the values of the series are now counts. To make this very explicit with a non-numerical example: >>> df = pd.DataFrame({'species': ["cat", "cat", "dog"]})
>>> df["species"].value_counts()
cat 2
dog 1
Name: species, dtype: int64
>>> df["species"].value_counts().reset_index()
index species
0 cat 2
1 dog 1 After resetting the index, the column with the species names (the values in the original "species" column), is no longer in the "species" column. I often tell people that >>> df.groupby("species").size().reset_index()
species 0
0 cat 2
1 dog 1
>>> df.groupby("species").size().rename("count").reset_index()
species count
0 cat 2
1 dog 1 With this method the original name of the column you are grouping by actually gets "correctly" set to the index (or grouping column, after a reset of the index). |
That seems like a nice potential intermediate solution. It could provide an easier migration path (although an actual deprecation would still be very noisy given how much # (hypothetical, not actual code with current pandas)
>>> df["species"].value_counts(name="counts").reset_index()
species count
0 cat 2
1 dog 1
I certainly agree with the general principle. We should prefer and promote composability of methods. But just to be very concrete, using my example from above, I can think of the following options to do the renaming: >>> df["species"].value_counts().rename("count").rename_axis("species").reset_index()
species count
0 cat 2
1 dog 1
>>> df["species"].value_counts().reset_index(name="count").rename(columns={"index": "species"})
species count
0 cat 2
1 dog 1
>>> df["species"].value_counts().reset_index().rename(columns={"species": "count", "index": "species"})
species count
0 cat 2
1 dog 1 (I am adding a So it's not just an easy single "rename" in addition to Given that this is not that straightforward to achieve, I think practicality could beat purity here, and an additional keyword argument could be worth it (I find a potential |
I'm persuaded by the arguments of @MarcoGorelli and @jorisvandenbossche, and find the Regarding adding the |
Thanks all for the discussion, and really nice example from Joris! Regarding adding a
makes sense to me.
Same, just confirmed this by scraping Kaggle notebooks and looking through them, feature_count = train[i].value_counts(dropna=False)[:40].reset_index().rename(columns={i: 'count', 'index': i}) so it seems that they also find the current behaviour unexpected |
+1 on the suggested behavior, -0.5 on adding a new keyword |
If we do the API breaking change (and it's really a behavior change), I think we should introduce the two new params (
I checked some of our code, and we have used |
if we're telling users "Do X to get the old behavior", why can't the "X" be "use rename(...)" so we don't have to add new arguments? |
Fair enough. Having said that, I think that the examples given by @jorisvandenbossche in #49497 (comment) suggest that the new arguments would be a good thing. |
@Dr-Irv what would be the rationale for also having FYI, I just noticed that R's tally method uses "n" instead of "count" as the resulting column name, so that's also an option. Although I think for us "count" is fine given that this resembles the method name. |
Maybe it is because of symmetry with naming the columns, so if we have |
and also given that that's what pandas already does in the groupby (as_index=False) case:
Regarding adding extra arguments, I'd be -1 on that: using |
From the last pandas-dev call, it was agreed that this change is desirable, but there's no agreement on how to make the change. I can see the following options:
In short, I'd favour option 4. I'll go through the options one-by-one, explaining drawbacks I see: 1. always raise a warning on
|
Also +1 to option 4 |
+1 to option 4 ; option 3 wouldn't object (but also noisy) |
+1 option 4. option 2 and 3 I'm somewhat indifferent to. - seeing merits and drawbacks. |
Some feedback on the specifics of the different options:
Small note, for this specific case users can (should?) do
I think this could be simpler: replace it with Assuming we would, if
If we do this (and to be clear I am not opposed to this), this is IMO clearly not a bug fix, rather it's an intentional breaking change of long standing behaviour.
See above, I am convinced we can provide an easy way to get the future behaviour with this new keyword (and in a way that could keep working the same in the future). Just to throw out one more variant that I mentioned briefly on the call: kind of a 2b: still add a warning in 1.5.x, but only a DeprecationWarning (and not FutureWarning, so end users won't see it this was relied upon in actual code (while the authors of that code can still see it and fix it), but will only see it in interactive exploration that calls it directly), and change it in 2.0.0. |
I have another idea. What if we did something like ask users to do So then you could do option 2 or 3, and there is an easy way to both get the future behavior AND silence the warnings, and you can use |
To enable the behaviour globally, instead of an import, it could also be done as an option (that might be more consistent with other cases where we use options to enable a certain behaviour) |
I do like the sound of that more, thanks Irv and Joris! So a warning could be something like
And this could go in to 1.5.3 |
I don't think adding a way to silence this via global option (sure could be done via context manager as well) is actually helpful The problem as a user is you see a warning and you just want to adapt your code so that it works under the current and new versions. Here you just now make 2 changes - set the option and then remove setting the option later. I don't believe we have this for any other deprecation (or anything really) and hence would be -1 on 2b +1 on option 4 (just change it for 2.0) |
But in version 2.0, the setting would be a NO-OP. So you could leave it in your code and nothing would happen. |
Given the breadth of 2, I think it is also reasonable to go with option 4. |
Thinking about this further, I think this is the suggestion I like the most. In pandas 2.0.0, the option becomes a no-op, people can remove it but they can also safely keep it, and there needn't be any hurry from the pandas side to deprecate it. I'm putting together a PDEP for this (both for visibility, and because - as discussed in the governance meeting - it's the kind of change that probably warrants one) |
I don’t like the precedent that api changes of this size/scope merit a PDEP.
…On Thu, Nov 24, 2022 at 12:06 PM Marco Edward Gorelli < ***@***.***> wrote:
But in version 2.0, the setting would be a NO-OP. So you could leave it in
your code and nothing would happen.
Thinking about this further, I think this is the suggestion I like the
most. In pandas 2.0.0, the option becomes a no-op, people can remove it but
they can also safely keep it, and there needn't be any hurry from the
pandas side to deprecate it.
I'm putting together a PDEP for this (both for visibility, and because -
as discussed in the governance meeting - it's the kind of change that
probably warrants one)
—
Reply to this email directly, view it on GitHub
<#49497 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5UM6FE6JD3UQM36XLGPP3WJ7DFZANCNFSM6AAAAAARWCAIS4>
.
You are receiving this because you are on a team that was mentioned.Message
ID: ***@***.***>
|
That's fair - let's keep the discussion here then Looking back through this thread, the option that's received the most support (and also no strong objections) is to "just change" (option 4), so I'll raise a PR to do that - if people do have strong objections, they can block the PR and we can explore other options |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Let's start with:
Issue Description
ser.value_counts
:name
is from original Series,index
is unnameddf.value_counts
:name
isNone
,index
names from columns of original DataFramedf.groupby('a')['b'].value_counts()
:name
isb
, index names area
andb
df.groupby('a', as_index=False)['b'].value_counts()
: new column is'count'
, index is a newRangeIndex
Expected Behavior
My suggestion is:
ser.value_counts
:name
is'count'
,index
is name of original Seriesdf.value_counts
:name
is'count'
,index
names from columns of original DataFramedf.groupby('a')['b'].value_counts()
:name
is'count'
, index names area
andb
df.groupby('a', as_index=False)['b'].value_counts()
: new column is'count'
, index is a newRangeIndex
[same as now]NOTE: if
normalize=True
, then replace'count'
with'proportion'
Bug or API change?
Given that:
df.groupby('a')['b'].value_counts().reset_index()
errorsdf.groupby('a')[['b']].value_counts().reset_index()
anddf.groupby('a', as_index=False)[['b']].value_counts()
don't matchmulti_idx.value_counts()
doesn't preserve the multiindex's name anywhereI'd be more inclined to consider it a bug and to make the change in 2.0
Installed Versions
INSTALLED VERSIONS
commit : a215264
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.0.0.dev0+550.ga215264d47
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 59.8.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.3
hypothesis : 6.54.6
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.5.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 0.8.3
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.0
numba : 0.56.2
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.8
pyarrow : 9.0.0
pyreadstat : 1.1.9
pyxlsb : 1.0.9
s3fs : 2021.11.0
scipy : 1.9.1
snappy :
sqlalchemy : 1.4.41
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.9.0
xlrd : 2.0.1
zstandard : 0.18.0
tzdata : None
None
The text was updated successfully, but these errors were encountered: