.loc on Hierarchical Index with single-valued index level can drop that index level in place #13842
Comments
pls read the documentation: http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers. A scalar will always drop and a list will never drop. Since you are showing scalars, this is as expected.
I think this issue should be re-opened. @jreback I know you're probably busy, but I think you missed the part where an existing DataFrame's index is modified in place by the use of that .loc. That is a serious issue: whether I used the correct syntax or not, .loc should never modify something without reassignment, right? From a user's perspective that is horrifying. Thank you, however, regarding the syntax. I've read through that documentation several times (possibly older versions) and I thought I'd finally understood MultiIndex slicing. It's not exactly the most straightforward thing.
@jreback Actually, further, "a scalar will always drop and a list will never drop" is simply not true. Look more closely at the example I gave you. I have two different dataframes with the same number of levels, and in both cases I provided a scalar. In one case the index level was dropped; in the other it was not. If level 0 has more than one unique value, it does not drop. If it has only one unique value, it does. The scary part in particular is that in one of the cases the dataframe was also modified in place. You are correct, however, that df1.loc[pd.IndexSlice[[1], :, :]] gives the expected behavior.
@mborysow I believe you're correct about the bug. I'll reopen. I've also edited your original post to be a bit more succinct 😄
@TomAugspurger Thanks. I'll try to get to the point more quickly next time. =)
I believe that's the difference between unique vs. dupes (though I could be wrong). Let's keep this issue focused on
@mborysow the problem is you are addressing 'things you don't like' and not a focused example, e.g.
which does look buggy. please have a look and see if you can come up with the reason why (in the code).
@TomAugspurger Oops. I didn't notice the duplicate indices in there... I swapped in a dataframe that didn't have any, see below. Result is exactly the same, FYI.
@jreback Sorry. I appreciate the feedback on issue submission. I was trying to point out the difference in behavior in the two cases. I'll try to focus it down next time. Maybe I should have opened two issues: one pointing out the difference in result, and the other pointing out the side effect. Would that have been better?
@mborysow yes that would have been better. The first is a user question, the 2nd a bug. Ideally an issue is a simple repro that gets right to the point. The longer it is, the more likely it won't be read / acted on / understood immediately and will just cause confusion.
@jreback Should I go ahead and create a new issue for that now? I suppose there's a reasonable chance they stem from the same root cause.
for what exactly?
So there were two issues.
Example output next, and code to copy and paste to reproduce below.

df2, where 'A' is not dropped (A has more than one unique value):

df1, where 'A' is dropped (all rows have A = 1):
Code to reproduce:
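(The pasted code did not survive this extract; the following is a hedged reconstruction based on the descriptions in the thread. The column name "value" and the exact data are illustrative, and the noted behavior is what the thread reports for pandas 0.18.1.)

import pandas as pd

# df1: level A has only the single value 1 (tuples 111, 112, 121, 122 per the thread)
df1 = pd.DataFrame(
    {"value": range(4)},
    index=pd.MultiIndex.from_tuples(
        [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2)], names=list("ABC")),
)

# df2: level A has more than one unique value (tuples 111, 112, 221, 222)
df2 = pd.DataFrame(
    {"value": range(4)},
    index=pd.MultiIndex.from_tuples(
        [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2)], names=list("ABC")),
)

# Scalar slicer: per the thread, on 0.18.1 this drops level A for df1
# (and modifies df1 in place) but keeps level A for df2.
print(df1.loc[pd.IndexSlice[1, :, :]].index.names)
print(df2.loc[pd.IndexSlice[1, :, :]].index.names)

# List slicer: keeps level A in both cases, as discussed above.
print(df1.loc[pd.IndexSlice[[1], :, :], :].index.names)
print(df2.loc[pd.IndexSlice[[1], :, :], :].index.names)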
Personally, it wouldn't surprise me to find out that the root cause of both those things is the same.
as I said before this is as expected. use a list to have no drops, whether unique or not. further, using a non-unique MI is generally not supported that well
Yeah, I will use lists from now on for sure. But just to clarify, maybe I misunderstand what you are calling unique... I assumed non-unique in this context meant that two rows shared an exact index. When you say non-unique MI, are you also referring to the following as a non-unique MI? Is it the former, or the latter?
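(The two examples referenced here were not preserved in this extract; the sketch below is a reconstruction of the distinction being asked about, not the original screenshots.)

import pandas as pd

# Former: two rows share the exact same index tuple, so the MultiIndex is non-unique.
dup_rows = pd.MultiIndex.from_tuples(
    [(1, 1, 1), (1, 1, 1), (1, 2, 2)], names=list("ABC"))
print(dup_rows.is_unique)   # False

# Latter: values repeat within a level, but every full tuple is distinct.
dup_level = pd.MultiIndex.from_tuples(
    [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2)], names=list("ABC"))
print(dup_level.is_unique)  # True, i.e. this one counts as a unique MultiIndex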
I agree with @mborysow that this behavior isn't very intuitive. It feels like an implementation detail that has leaked into the API. For operations that select out a single value along a level, I don't see why we couldn't always drop that level from the index. @mborysow What @jreback means by "non-unique" is about whole rows being duplicated; each row in your second example is distinct, so that one would count as unique.
@shoyer I actually would rather argue the opposite: I would prefer that it never drop the level. Worse, though, is that selecting one value versus multiple gives you a different number of levels. Some programmatic code may just choose items that pass some threshold; if sometimes that's just one item, then everywhere I do this I need code to check what the new shape of the index is, and that's not fun. For the same reason, if you select none (e.g., via an empty list in the slicer), I think it should just return an empty DataFrame with the index intact. Otherwise, any time I choose a variable number of items from the DataFrame I have to check for two separate outliers (0 values or 1 value). It makes much more sense to me that .loc and similar indexing methods should just return a consistent number of levels regardless of what is selected.
I think I did a poor job of explaining the alternative, which is closer to the existing behavior. I agree that behavior absolutely should not depend on data values or their length. However, it's OK to make distinctions based on types. The current behavior (for unique MultiIndexes) is: for scalar values, drop the level; for lists and slices, keep the level.
This mirrors the rule for dropping an axis with normal indexing, which in turn mirrors similar behavior from numpy. In fact, this is where the different behavior depending on uniqueness arises: indexing a non-unique index with a scalar returns an object that still has that axis (by necessity), whereas indexing a unique index with a scalar drops the axis. Changing this behavior (to never drop levels/axes) might be desirable, but it would be a major API change, so it would be best discussed in a separate issue.
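(A minimal illustration of that scalar-vs-list rule; this sketch is mine, not from the thread.)

import numpy as np
import pandas as pd

arr = np.arange(6).reshape(2, 3)
print(arr[0].shape)    # (3,): a scalar index drops the axis
print(arr[[0]].shape)  # (1, 3): a list index keeps the axis

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s.loc["a"])      # scalar label: the axis is dropped, a scalar comes back
print(s.loc[["a"]])    # list of labels: a one-element Series, axis kept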
@shoyer Ahh. Then I agree completely. =)
@shoyer I agree completely with the following that you said: "I agree that behavior absolutely should not depend on data values or their length. However, it's OK to make distinctions based on types. The current behavior (for unique MultiIndexes) is: For scalar values, drop the level." Are we in agreement, though, that this is not what is currently happening? @jreback called what I described above the expected behavior. In the comment I made above there are two dataframes: in the first case the indices are 111, 112, 221, and 222, and in the other they are 111, 112, 121, and 122. Those are clearly unique indices based on the description above, which is why I clarified. These two have different behaviors when indexing on the scalar value; they behave the same when indexing on the list value. I'm perfectly happy with scalar vs. list indexing working as you've described it if it's consistent, but that's the problem: it's not currently consistent.

Anyhow, the thing I'm sure has been communicated and acknowledged is the side effect (the in-place index modification). That's a clear bug. The thing I'm not sure has been communicated is the difference in behavior based on the dataframe. I suspect strongly that the two are related, but I can't say for certain. This is the thing I was trying to clarify whether or not I should create a new issue for. Sorry if I'm beating a dead horse; just paranoid that I'm not communicating the issue well.
Yes, this looks like a bug to me. Both of these indexes are unique and lex-sorted (monotonic), so they should work the same way when indexed with a scalar value.
Hopped into a debugger and found where it is happening. This is for version 0.18.1. I can look a little deeper if I can be directed. Let me know if this is helpful or not. Trying to help point in the right direction...

df1 (all A = 1)... df2 (multiple values for A)

The actual culprit for the overwrite is in pandas.core.generic.NDFrame.xs, line 1778. This code block is NOT reached by df2, only df1. At this point these two lines are executed:

result = self.iloc[loc]
result.index = new_index

loc is slice(None, None, None) for the df1 case. I'm guessing that self.iloc[:] returns the initial dataframe rather than a new dataframe object pointing to the same data. Right here the index is overwritten (here the index has had level 0 dropped).
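(A small diagnostic sketch for checking that guess; it is my reconstruction rather than code from the thread, and the reported behaviour is specific to pandas 0.18.x.)

import pandas as pd

df1 = pd.DataFrame(
    {"value": range(4)},
    index=pd.MultiIndex.from_tuples(
        [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2)], names=list("ABC")),
)

view = df1.iloc[:]
# Reported to be True on 0.18.1 (the no-op slice hands back the original object),
# in which case reassigning view.index would also rewrite df1.index.
# On current pandas this prints False.
print(view is df1)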
... A little more... In pandas.indexes.multi.MultiIndex.get_loc_level, line 1710:

for i, k in enumerate(key):
    if not isinstance(k, slice):
        k = self._get_level_indexer(k, level=i)
        if isinstance(k, slice):
            # everything
            if k.start == 0 and k.stop == len(self):
                k = slice(None, None)
        else:
            k_index = k
    if isinstance(k, slice):
        if k == slice(None, None):
            continue
        else:
            raise TypeError(key)

At "k.start == 0 and k.stop == len(self)": in df1, indexing with the scalar value of 1 selects everything, so k is replaced by slice(None, None) and the loop just continues. For df2, selecting A=1 does not select all the values in the index, so the TypeError exception is raised in the last line of the code pasted above (line 1719 in pandas.indexes.multi.MultiIndex.get_loc_level).
@jreback, @shoyer Is the above helpful in finding a solution?

result = self.iloc[loc]
result.index = new_index

could become like this:

result = self.iloc[loc]
if isinstance(loc, slice) and loc == slice(None, None, None):
    pass
else:
    result.index = new_index
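(A hedged way to check whether any such fix removes the side effect, using an illustrative df1 like the one described in the thread; the "2 with the bug" value reflects the reported 0.18.1 behaviour.)

import pandas as pd

df1 = pd.DataFrame(
    {"value": range(4)},
    index=pd.MultiIndex.from_tuples(
        [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2)], names=list("ABC")),
)

before = df1.index.nlevels           # 3
_ = df1.loc[pd.IndexSlice[1, :, :]]  # the selection reported to mutate df1
after = df1.index.nlevels            # 2 with the bug, 3 once fixed
print(before, after)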
Oh, or as @shoyer just linked to another issue, don't return the original object for .iloc[:] and .loc[:]. =) That makes more sense. Ok, so the inconsistency is a separate issue. I will create a new issue about it later.
@mborysow I would suggest something like this instead (if you don't fix the underlying issue):
@shoyer I'm in an air-gapped environment and have actually never made a pull request (also, most of my experience is with Mercurial). I'll take a stab at it tonight when I get home.
ok thanks. if u think we need additional tests pls PR
Small Example
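(The example code itself is not preserved in this extract; the sketch below is a reconstruction of what it presumably showed, with an illustrative column name. The Out numbering follows the Expected Output note below, and the in-place change is the behaviour reported for pandas 0.18.1.)

import pandas as pd

df1 = pd.DataFrame(
    {"value": range(4)},
    index=pd.MultiIndex.from_tuples(
        [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2)], names=list("ABC")),
)

df1.loc[pd.IndexSlice[1, :, :]]  # Out[14]: the slice, with level A dropped
df1                              # Out[15]: should still be the original df1; on 0.18.1 level A is gone here too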
Expected Output
The output of the slice (Out[14]) is correct, but df1 should not be modified in place. So the expected Out[15] is the original df1.

I'm still not good at submitting issues here with code and print out, so I appreciate your patience. Also, thank you guys for making pandas as amazing as it is!!
Anyhow...
I have dataframes that sometimes have up to 5 levels on their MultiIndex. It's not uncommon for me to want to just grab a subset containing only one value on a certain level. If one level of that index has only one value, then .loc can drop that level in place. I'd say this is highly undesirable.
First the normal behavior. Here's my input:
When I have a multi-indexed dataframe, and I do:
df.loc[1]
I get:
I personally expect it to return the original multi-index where the first level has only that value. Sadly, it drops it entirely (I think this is terrible, since if you plan on resetting the index or concatenating later, you've just unwittingly lost information).
Anyhow, I recognize now that you need to provide an index for all levels; e.g., the way I expected it to work can actually be achieved by (for a three-level index):
df.loc[pd.IndexSlice[1, :, :]]
Here's the rub... If the level that I indexed above has more than one unique value, this works fine. If it has only one, then once again that level gets dropped, but worse, the index is modified in place during the .loc operation.
Here's the dataframe showing the bad behavior:
df.loc[pd.IndexSlice[1, :, :]] gives:
Same syntax as the other case, but it dropped index level A. Worse is that this is now what df itself looks like:
print(df)
If I modify the syntax slightly, i.e., df.loc[pd.IndexSlice[1, :, :], :] (with the original, unmodified frame), I get the expected result:
I've tried to provide a code sample with comments that demonstrates the problem.
Code Sample, a copy-pastable example if possible
Here's what I get from running the code:
Expected Output
What I expect from all of the examples above is:
Output of pd.show_versions():
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.6.3-300.fc24.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
nose: None
pip: 8.0.3
setuptools: 20.1.1
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5.1
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.10
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None