BUG: Fix .to_excel() for MultiIndex containing a NaN value #13511 #13551

mpuels · 2016-07-03T01:46:10Z

closes #13511 (NaN label in MultiIndex is assigned a non NaN value when writing to excel file)
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2016-07-03T02:09:24Z

pandas/formats/format.py

-                    values = levels.take(labels)
+
+                    if levels._can_hold_na:
+                        values = levels.take(labels, fill_value=True)


you don't need the conditional - just pass fill_value
it won't have an effect if there are no NAs

If I remove the conditional, the tests test_to_excel_multiindex, test_to_excel_multiindex_cols and test_to_excel_multiindex_dates fail with equivalent tracebacks and the same Exception. Here's the traceback of test_to_excel_multiindex as an example:

======================================================================
ERROR: test_to_excel_multiindex (pandas.io.tests.test_excel.Openpyxl20Tests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/tests/test_excel.py", line 1776, in wrapped
    orig_method(self, *args, **kwargs)
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/tests/test_excel.py", line 1321, in test_to_excel_multiindex
    frame.to_excel(path, 'test1', header=False)
  File "/home/mpuels/progs/pandas-mpuels/pandas/core/frame.py", line 1431, in to_excel
    startrow=startrow, startcol=startcol)
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/excel.py", line 875, in write_cells
    for cell in cells:
  File "/home/mpuels/progs/pandas-mpuels/pandas/formats/format.py", line 1984, in get_formatted_cells
    self._format_body()):
  File "/home/mpuels/progs/pandas-mpuels/pandas/formats/format.py", line 1955, in _format_hierarchical_rows
    values = levels.take(labels, fill_value=True)
  File "/home/mpuels/progs/pandas-mpuels/pandas/indexes/base.py", line 1438, in take
    raise ValueError(msg.format(self.__class__.__name__))
ValueError: Unable to fill values because Int64Index cannot contain NA

I added the conditional, because take(level, fill_value=True) only works when the corresponding level of the MultiIndex contains NaNs. When it doesn't, the aforementioned exception is raised.

Here is a small example:

df = (pd.DataFrame({'c1': [1,1,2,2],
                    'c2': [None] + "b c d".split(),
                    'v' : [6,7,8,9]})
      .set_index(['c1', 'c2']))
df

yields

c1  c2  v
1       6
1   b   7
2   c   8
2   d   9

df.index

yields

MultiIndex(levels=[[1, 2], [u'b', u'c', u'd']], labels=[[0, 0, 1, 1], [-1, 0, 1, 2]], names=[u'c1', u'c2'])

The first level doesn't contain NaNs, so .take(_, fill_value=True) raises an Exception:

levels_0 = df.index.levels[0]
labels_0 = df.index.labels[0]
values_0 = levels_0.take(labels_0, fill_value=True)
values_0

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-bbcc31a032be> in <module>()
      1 levels_0 = df.index.levels[0]
      2 labels_0 = df.index.labels[0]
----> 3 values_0 = levels_0.take(labels_0, fill_value=True); values_0

/home/mpuels/progs/pandas-mpuels/pandas/indexes/base.pyc in take(self, indices, axis, allow_fill, fill_value, **kwargs)
   1436     if allow_fill and fill_value is not None:
   1437         msg = 'Unable to fill values because {0} cannot contain NA'
-> 1438         raise ValueError(msg.format(self.__class__.__name__))
   1439     taken = self.values.take(indices)
   1440     return self._shallow_copy(taken)

ValueError: Unable to fill values because Int64Index cannot contain NA

If the level of the MultiIndex contains NaNs, take(_, fill_value=True) works:

levels_1 = df.index.levels[1]
labels_1 = df.index.labels[1]
values_1 = levels_1.take(labels_1, fill_value=True)
values_1

Index([nan, u'b', u'c', u'd'], dtype='object', name=u'c2')
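The labels printed above encode the missing c2 entry as -1. As a minimal sketch of why that matters, the snippet below compares a plain take() with one that passes a fill value. It assumes a recent pandas release; the variable names are illustrative and not taken from the PR.

```python
import numpy as np
import pandas as pd

# Level values and integer codes, mirroring the c2 level of the example.
level = pd.Index(["b", "c", "d"])
codes = np.array([-1, 0, 1, 2])  # -1 marks the missing label

# Without a fill value, take() follows NumPy indexing: -1 is the last element.
naive = level.take(codes)

# With a non-None fill_value, -1 is treated as missing and becomes NaN.
filled = level.take(codes, fill_value=True)

print(list(naive))
print(list(filled))
```

The first result starts with 'd' instead of a missing value, which would explain how a NaN label can end up as a non-NaN value in the written file.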

@mpuels What matters is not whether the MultiIndex level contains NaNs, but whether it can contain NaNs.

In [4]: level = pd.Index(['a', 'b'])

In [5]: level._can_hold_na
Out[5]: True

In [6]: level.hasnans
Out[6]: False

In [7]: level.take([0,0,1], fill_value=True)
Out[7]: Index([u'a', u'a', u'b'], dtype='object')

But you are correct you need this conditional here (checking if it can contain NaNs).
In principle, you could also make a one-liner of it by passing levels._can_hold_na to allow_fill in take

@jorisvandenbossche You are right. What matters is not whether a level contains NaNs, but whether it can contain NaNs.

Unfortunately the approach with the one-liner doesn't work. I've taken the same DataFrame as above and changed take(labels_0, fill_value=True) to take(labels_0, fill_value=False). The method take still raises the same exception:

levels_0 = df.index.levels[0]
labels_0 = df.index.labels[0]
values_0 = levels_0.take(labels_0, fill_value=False); values_0
...
ValueError: Unable to fill values because Int64Index cannot contain NA

If the level cannot contain NaNs as labels, only take(_, fill_value=None) works.

It is allow_fill that you have to pass True/False to, depending on _can_hold_na.
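A minimal sketch of that suggestion, assuming a recent pandas release: the helper name take_level_values is hypothetical, and _can_hold_na is a private attribute whose behavior may change between versions.

```python
import numpy as np
import pandas as pd


def take_level_values(level, codes):
    # Hypothetical helper: allow_fill is the switch, fill_value=True only
    # asks take() to use the level's own NA value for codes equal to -1.
    return level.take(codes, allow_fill=level._can_hold_na, fill_value=True)


int_level = pd.Index([1, 2])            # integer level, cannot hold NA
obj_level = pd.Index(["b", "c", "d"])   # object/string level, can hold NA

# fill_value=False is still "not None", so the integer level raises.
try:
    int_level.take(np.array([0, 0, 1, 1]), fill_value=False)
    raised = False
except ValueError:
    raised = True

ints = take_level_values(int_level, np.array([0, 0, 1, 1]))
objs = take_level_values(obj_level, np.array([-1, 0, 1, 2]))

print(raised, list(ints), list(objs))
```

With allow_fill switched off for the integer level, fill_value is ignored, so the same call can be used for every level without a separate conditional.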

codecov-io · 2016-07-03T03:00:37Z

Current coverage is 85.32% (diff: 100%)

Merging #13551 into master will not change coverage

@@             master     #13551   diff @@
==========================================
  Files           141        141          
  Lines         50679      50679          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43240      43240          
  Misses         7439       7439          
  Partials          0          0

Powered by Codecov. Last update 474fd05...2335cee

jorisvandenbossche · 2016-07-04T13:32:48Z

@mpuels I think something went wrong with the merge/rebase. Normally, running the following should do exactly what you want to clean it up here:

git checkout fix-multiindex-nan-label-to-excel
git fetch upstream
git rebase upstream/master
git push -f origin fix-multiindex-nan-label-to-excel

mpuels · 2016-07-04T17:02:32Z

@jorisvandenbossche Thanks for the help on git! Additionally I wanted to change the commit message of commit 7272a96 to something like 'CLN: Break line to avoid long line.'. I tried git rebase -i HEAD~3 and changed the commit message there, but it didn't work. Or is it impossible?

jorisvandenbossche · 2016-07-04T17:09:37Z

@mpuels No, it is certainly possible. Normally using git rebase -i should work to rename a commit (I think there is the option 'reword' to select?). But if it is the latest commit, you can also do git commit --amend -m "new message".

Further, when this PR is merged, the commits will be squashed anyway and the PR's title is used for the squashed commit (so it is not that important to change the latest commit's message).

jreback · 2016-07-05T10:44:53Z

pandas/io/tests/test_excel.py

+        frame = self.frame
+        frame.A = np.arange(len(frame))
+        frame.iloc[0, 0] = None
+        frame.set_index(['A', 'B'], inplace=True)


don't use inplace, assign instead

mpuels · 2016-07-11T20:14:28Z

@jreback I used

frame.iloc[0, 0] = ...

to assign to the first row of column A, because self.frame's index contains random values. Would it be OK if I don't use self.frame but instead construct new test data? Something like

df = pd.DataFrame({'A': [1,2,3],
                   'B': [10,20,30],
                   'C': np.random.sample(3)})
df.loc[0, 'A'] = None
df = df.set_index(['A', 'B'])

Then the values of the Index would be predictable and I could refer to the first row of A using df.loc.

And can you please give me a rule of thumb when to use test data which is already available and when to construct new test data? Thanks!

jreback · 2016-07-11T21:03:45Z

@mpuels best to use a sample frame that you construct as the expected (almost always).
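A rough sketch of that advice, assuming a recent pandas release. The frame is built explicitly so its values are predictable, and set_index is assigned rather than applied in place. The round trip goes through an in-memory CSV purely as a stand-in for the Excel write/read, so that the sketch does not depend on an Excel engine; it does not exercise the code path fixed in this PR. The function names are hypothetical.

```python
import io

import numpy as np
import pandas as pd


def make_expected_frame():
    # Small deterministic frame; the first 'A' label is missing.
    df = pd.DataFrame({"A": [np.nan, 2.0, 3.0],
                       "B": [10, 20, 30],
                       "C": [0.5, 1.5, 2.5]})
    return df.set_index(["A", "B"])  # assign the result, no inplace=True


def roundtrip_via_csv(frame):
    # Stand-in for the Excel round trip (assumption, see the text above).
    buffer = io.StringIO()
    frame.to_csv(buffer)
    buffer.seek(0)
    return pd.read_csv(buffer, index_col=[0, 1])


expected = make_expected_frame()
result = roundtrip_via_csv(expected)

# The missing label should come back as NaN, not as another level value.
assert np.isnan(result.index.get_level_values("A")[0])
```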

jorisvandenbossche · 2016-07-23T16:10:17Z

@mpuels Can you rebase and update according to the comments?

mpuels · 2016-07-23T16:31:52Z

@jorisvandenbossche Sorry for the long delay. I don't have time today, but will do it tomorrow.

jorisvandenbossche · 2016-07-23T16:42:43Z

No problem!

jreback · 2016-07-25T11:56:23Z

lgtm. @jorisvandenbossche

jorisvandenbossche · 2016-07-25T15:07:12Z

@mpuels Thanks a lot!

jreback reviewed Jul 3, 2016

jorisvandenbossche added Bug IO Excel read_excel, to_excel labels Jul 3, 2016

jorisvandenbossche added this to the 0.18.2 milestone Jul 3, 2016

jorisvandenbossche added the MultiIndex label Jul 3, 2016

mpuels force-pushed the fix-multiindex-nan-label-to-excel branch from f907286 to 7272a96 Compare July 4, 2016 16:57

jreback reviewed Jul 5, 2016

mpuels added 4 commits July 24, 2016 22:58

BUG: Fix .to_excel() for MultiIndex containing a NaN value pandas-dev…

9abc4e8

…#13511

CLN: Get rid of conditional.

ba41db6

BUG: Fix .to_excel() for MultiIndex containing a NaN value pandas-dev…

335cf86

…#13511

TST: Construct DataFrame specifically for test, instead of reusing ex…

2335cee

…isting one.

mpuels force-pushed the fix-multiindex-nan-label-to-excel branch from 7272a96 to 2335cee Compare July 24, 2016 20:59

jorisvandenbossche merged commit 4c2840e into pandas-dev:master Jul 25, 2016
