BUG: Fix .to_excel() for MultiIndex containing a NaN value #13511 #13551
values = levels.take(labels)
if levels._can_hold_na:
    values = levels.take(labels, fill_value=True)
You don't need the conditional; just pass fill_value.
It won't have an effect if there are no NAs.
If I remove the conditional, the tests test_to_excel_multiindex, test_to_excel_multiindex_cols and test_to_excel_multiindex_dates fail with equivalent tracebacks and the same exception. Here's the traceback of test_to_excel_multiindex as an example:
======================================================================
ERROR: test_to_excel_multiindex (pandas.io.tests.test_excel.Openpyxl20Tests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/tests/test_excel.py", line 1776, in wrapped
    orig_method(self, *args, **kwargs)
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/tests/test_excel.py", line 1321, in test_to_excel_multiindex
    frame.to_excel(path, 'test1', header=False)
  File "/home/mpuels/progs/pandas-mpuels/pandas/core/frame.py", line 1431, in to_excel
    startrow=startrow, startcol=startcol)
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/excel.py", line 875, in write_cells
    for cell in cells:
  File "/home/mpuels/progs/pandas-mpuels/pandas/formats/format.py", line 1984, in get_formatted_cells
    self._format_body()):
  File "/home/mpuels/progs/pandas-mpuels/pandas/formats/format.py", line 1955, in _format_hierarchical_rows
    values = levels.take(labels, fill_value=True)
  File "/home/mpuels/progs/pandas-mpuels/pandas/indexes/base.py", line 1438, in take
    raise ValueError(msg.format(self.__class__.__name__))
ValueError: Unable to fill values because Int64Index cannot contain NA
I added the conditional because take(level, fill_value=True) only works when the corresponding level of the MultiIndex contains NaNs. When it doesn't, the aforementioned exception is raised.
Here is a small example:
df = (pd.DataFrame({'c1': [1, 1, 2, 2],
                    'c2': [None] + "b c d".split(),
                    'v': [6, 7, 8, 9]})
      .set_index(['c1', 'c2']))
df
yields
c1  c2   v
1        6
1   b    7
2   c    8
2   d    9
df.index
yields
MultiIndex(levels=[[1, 2], [u'b', u'c', u'd']],
           labels=[[0, 0, 1, 1], [-1, 0, 1, 2]],
           names=[u'c1', u'c2'])
The first level doesn't contain NaNs, so .take(_, fill_value=True) raises an exception:
levels_0 = df.index.levels[0]
labels_0 = df.index.labels[0]
values_0 = levels_0.take(labels_0, fill_value=True)
values_0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-bbcc31a032be> in <module>()
1 levels_0 = df.index.levels[0]
2 labels_0 = df.index.labels[0]
----> 3 values_0 = levels_0.take(labels_0, fill_value=True); values_0
/home/mpuels/progs/pandas-mpuels/pandas/indexes/base.pyc in take(self, indices, axis, allow_fill, fill_value, **kwargs)
1436 if allow_fill and fill_value is not None:
1437 msg = 'Unable to fill values because {0} cannot contain NA'
-> 1438 raise ValueError(msg.format(self.__class__.__name__))
1439 taken = self.values.take(indices)
1440 return self._shallow_copy(taken)
ValueError: Unable to fill values because Int64Index cannot contain NA
If the level of the MultiIndex contains NaNs, take(_, fill_value=True) works:
levels_1 = df.index.levels[1]
labels_1 = df.index.labels[1]
values_1 = levels_1.take(labels_1, fill_value=True)
values_1
Index([nan, u'b', u'c', u'd'], dtype='object', name=u'c2')
@mpuels What matters is not whether the MultiIndex level contains NaNs, but whether it can contain NaNs.
In [4]: level = pd.Index(['a', 'b'])
In [5]: level._can_hold_na
Out[5]: True
In [6]: level.hasnans
Out[6]: False
In [7]: level.take([0,0,1], fill_value=True)
Out[7]: Index([u'a', u'a', u'b'], dtype='object')
But you are correct that you need this conditional here (checking whether it can contain NaNs).
In principle, you could also make it a one-liner by passing levels._can_hold_na to allow_fill in take.
@jorisvandenbossche You are right. It doesn't matter whether a level contains NaNs, only whether it can contain NaNs.
Unfortunately the approach with the one-liner doesn't work. I've taken the same DataFrame as above and changed take(labels_0, fill_value=True) to take(labels_0, fill_value=False). The method take still raises the same exception:
levels_0 = df.index.levels[0]
labels_0 = df.index.labels[0]
values_0 = levels_0.take(labels_0, fill_value=False); values_0
...
ValueError: Unable to fill values because Int64Index cannot contain NA
If the level cannot contain NaNs as labels, only take(_, fill_value=None) works.
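The behaviour described here follows from the check visible in the traceback above (`if allow_fill and fill_value is not None`): any non-None fill_value, including False, counts as a request to fill. The following is a minimal sketch of that, assuming a reasonably recent pandas; the names int_level, codes and errors are illustrative only, and the class name in the error message differs between pandas versions.

```python
import pandas as pd

# An integer level cannot hold NA, like level 0 in the example above.
int_level = pd.Index([1, 2])
codes = [0, 0, 1, 1]  # illustrative positions into the level

# fill_value=None (the default) does not request filling, so take() succeeds.
taken = int_level.take(codes, fill_value=None)
print(taken.tolist())

# Any non-None fill_value, even False, requests filling and raises.
errors = {}
for fill_value in (True, False):
    try:
        int_level.take(codes, fill_value=fill_value)
    except ValueError as exc:
        errors[fill_value] = str(exc)

for fill_value, message in errors.items():
    print(fill_value, "->", message)
```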
It is allow_fill that you have to pass True/False to, depending on _can_hold_na.
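A minimal sketch of this suggestion, not necessarily the code that was merged: allow_fill is driven by the private attribute _can_hold_na, so fill_value=True only takes effect for levels that can hold NA. It assumes a pandas version where MultiIndex exposes codes (called labels in the version used in this thread); the helper level_values is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 1, 2, 2],
                   "c2": [None, "b", "c", "d"],
                   "v": [6, 7, 8, 9]}).set_index(["c1", "c2"])


def level_values(index, i):
    """Hypothetical helper: expand level i of a MultiIndex to per-row values."""
    levels = index.levels[i]
    codes = index.codes[i]  # `labels` in older pandas versions
    # Only allow filling (-1 codes become NA) when the level can hold NA.
    return levels.take(codes, allow_fill=levels._can_hold_na, fill_value=True)


values_0 = level_values(df.index, 0)  # integer level, no filling attempted
values_1 = level_values(df.index, 1)  # string level, the -1 code becomes NaN
print(values_0.tolist())
print(values_1.tolist())
```

Because _can_hold_na is private, its behaviour may change between pandas versions.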
Current coverage is 85.32% (diff: 100%)
@@             master   #13551   diff @@
==========================================
  Files           141      141
  Lines         50679    50679
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
  Hits          43240    43240
  Misses         7439     7439
  Partials          0        0
@mpuels I think something went wrong with the merge/rebase. Normally, running the following should do exactly what you want to clean it up here:
f907286 to 7272a96
@jorisvandenbossche Thanks for the help on git! Additionally, I wanted to change the commit message of commit 7272a96 to something like 'CLN: Break line to avoid long line.'. I tried
@mpuels No, it is certainly possible. Normally using
Further, when this PR is merged, the commits will be squashed anyway and the PR's title is used for the squashed commit (so it is not that important to change the latest commit's message).
frame = self.frame
frame.A = np.arange(len(frame))
frame.iloc[0, 0] = None
frame.set_index(['A', 'B'], inplace=True)
don't use inplace, assign instead
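As a small illustration of this review comment, the sketch below assigns the result of set_index to a name instead of passing inplace=True. The frame contents and the name indexed are made up for the example and are not the test's actual data.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the real test uses a shared fixture instead.
frame = pd.DataFrame({"A": np.arange(3),
                      "B": ["x", "y", "z"],
                      "C": [1.0, 2.0, 3.0]})

# set_index returns a new DataFrame; the original is left untouched.
indexed = frame.set_index(["A", "B"])
print(indexed)
```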
@jreback I used
to assign to the first row of column
Then the values of the Index would be predictable and I could refer to the first row of
And can you please give me a rule of thumb for when to use test data that is already available and when to construct new test data? Thanks!
@mpuels best to use a sample frame that you construct as the expected (almost always).
@mpuels Can you rebase and update according to the comments?
@jorisvandenbossche Sorry for the long delay. I don't have time today, but will do it tomorrow.
No problem!
7272a96 to 2335cee
lgtm. @jorisvandenbossche
@mpuels Thanks a lot!
git diff upstream/master | flake8 --diff