Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix .to_excel() for MultiIndex containing a NaN value #13511 #13551

Merged

Conversation

mpuels
Copy link
Contributor

@mpuels mpuels commented Jul 3, 2016

values = levels.take(labels)

if levels._can_hold_na:
values = levels.take(labels, fill_value=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the conditional - just pass fill_valiue
it won't have an effect of no nas

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove the conditional, the tests test_to_excel_multiindex, test_to_excel_multiindex_cols and test_to_excel_multiindex_dates fail with equivalent tracebacks and the same Exception. Here's the traceback of test_to_excel_multiindex as an example:

======================================================================
ERROR: test_to_excel_multiindex (pandas.io.tests.test_excel.Openpyxl20Tests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/tests/test_excel.py", line 1776, in wrapped
    orig_method(self, *args, **kwargs)
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/tests/test_excel.py", line 1321, in test_to_excel_multiindex
    frame.to_excel(path, 'test1', header=False)
  File "/home/mpuels/progs/pandas-mpuels/pandas/core/frame.py", line 1431, in to_excel
    startrow=startrow, startcol=startcol)
  File "/home/mpuels/progs/pandas-mpuels/pandas/io/excel.py", line 875, in write_cells
    for cell in cells:
  File "/home/mpuels/progs/pandas-mpuels/pandas/formats/format.py", line 1984, in get_formatted_cells
    self._format_body()):
  File "/home/mpuels/progs/pandas-mpuels/pandas/formats/format.py", line 1955, in _format_hierarchical_rows
    values = levels.take(labels, fill_value=True)
  File "/home/mpuels/progs/pandas-mpuels/pandas/indexes/base.py", line 1438, in take
    raise ValueError(msg.format(self.__class__.__name__))
ValueError: Unable to fill values because Int64Index cannot contain NA

I added the conditional, because take(level, fill_value=True) only works when the corresponding level of the MultiIndex contains NaNs. When it doesn't, the aforementioned exception is raised.

Here is a small example:

df = (pd.DataFrame({'c1': [1,1,2,2],
                    'c2': [None] + "b c d".split(),
                    'v' : [6,7,8,9]})
        .set_index(['c1', 'c2']))

df

yields

c1  c2  v
1       6
1   b   7
2   c   8
2   d   9

df.index

yields

MultiIndex(levels=[[1, 2], [u'b', u'c', u'd']],
           labels=[[0, 0, 1, 1], [-1, 0, 1, 2]],
           names=[u'c1', u'c2'])

The first level doesn't contain NaNs, so .take(_, fill_value=True) raises an Exception:

levels_0 = df.index.levels[0]
labels_0 = df.index.labels[0]
values_0 = levels_0.take(labels_0, fill_value=True)
values_0

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-bbcc31a032be> in <module>()
      1 levels_0 = df.index.levels[0]
      2 labels_0 = df.index.labels[0]
----> 3 values_0 = levels_0.take(labels_0, fill_value=True); values_0

/home/mpuels/progs/pandas-mpuels/pandas/indexes/base.pyc in take(self, indices, axis, allow_fill, fill_value, **kwargs)
   1436             if allow_fill and fill_value is not None:
   1437                 msg = 'Unable to fill values because {0} cannot contain NA'
-> 1438                 raise ValueError(msg.format(self.__class__.__name__))
   1439             taken = self.values.take(indices)
   1440         return self._shallow_copy(taken)

ValueError: Unable to fill values because Int64Index cannot contain NA

If the level of the MultiIndex contains NaNs, take(_, fill_values=True) works:

levels_1 = df.index.levels[1]
labels_1 = df.index.labels[1]
values_1 = levels_1.take(labels_1, fill_value=True)
values_1

Index([nan, u'b', u'c', u'd'], dtype='object', name=u'c2')

Copy link
Member

@jorisvandenbossche jorisvandenbossche Jul 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpuels It does not work whether the multiindex level contains NaNs or not, but if it can contain NaNs.

In [4]: level = pd.Index(['a', 'b'])

In [5]: level._can_hold_na
Out[5]: True

In [6]: level.hasnans
Out[6]: False

In [7]: level.take([0,0,1], fill_value=True)
Out[7]: Index([u'a', u'a', u'b'], dtype='object')

But you are correct you need this conditional here (checking if it can contain NaNs).
In principle, you could also make a one-liner of it by passing levels._can_hold_na to allow_fill in take

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche You are right. It doesn't matter if a level contains NaNs, but if it can contain NaNs.

Unfortunately the approach with the one-liner doesn't work. I've taken the same DataFrame as above and changed take(labels_0, fill_value=True) to take(labels_0, fill_falue=False). The method take still raises the same exception:

levels_0 = df.index.levels[0]
labels_0 = df.index.labels[0]
values_0 = levels_0.take(labels_0, fill_value=False); values_0

...
ValueError: Unable to fill values because Int64Index cannot contain NA

If the level cannot contain NaNs as labels, only take(_, fill_value=None) works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the allow_fill that you have pass True/False to depending on _can_hold_na

@codecov-io
Copy link

codecov-io commented Jul 3, 2016

Current coverage is 85.32% (diff: 100%)

Merging #13551 into master will not change coverage

@@             master     #13551   diff @@
==========================================
  Files           141        141          
  Lines         50679      50679          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43240      43240          
  Misses         7439       7439          
  Partials          0          0          

Powered by Codecov. Last update 474fd05...2335cee

@jorisvandenbossche jorisvandenbossche added Bug IO Excel read_excel, to_excel labels Jul 3, 2016
@jorisvandenbossche jorisvandenbossche added this to the 0.18.2 milestone Jul 3, 2016
@jorisvandenbossche
Copy link
Member

@mpuels I think there went something wrong with merge/rebase. Normally, running the following should do exactly what you want to clean it up here:

git checkout fix-multiindex-nan-label-to-excel
git fetch upstream
git rebase upstream/master
git push -f origin/fix-multiindex-nan-label-to-excel

@mpuels mpuels force-pushed the fix-multiindex-nan-label-to-excel branch from f907286 to 7272a96 Compare July 4, 2016 16:57
@mpuels
Copy link
Contributor Author

mpuels commented Jul 4, 2016

@jorisvandenbossche Thanks for the help on git! Additionally I wanted to change the commit message of commit 7272a96 to something like 'CLN: Break line to avoid long line.'. I tried git rebase -i HEAD~3 and changed the commit message there, but it didn't work. Or is it impossible?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jul 4, 2016

@mpuels No, it is certainly possible. Normally using git rebase -i should work to rename a commit (I think there is the option 'reword' to select?). But if it is the latest commit, you can also do git commit --amend -m "new message".

Further, when this PR is merged, the commits will be squased any way and the PR's title is used for the squashed commit (so it is not that important to change the latest commit's message)

frame = self.frame
frame.A = np.arange(len(frame))
frame.iloc[0, 0] = None
frame.set_index(['A', 'B'], inplace=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use inplace, assign instead

@mpuels
Copy link
Contributor Author

mpuels commented Jul 11, 2016

@jreback I used

frame.iloc[0, 0] = ...

to assign to the first row of column A, because self.frame's index contains random values. Would it be OK if I don't use self.frame but instead construct new test data? Something like

df = pd.DataFrame({'A': [1,2,3],
                   'B': [10,20,30],
                   'C': np.random.sample(3)})
df.loc[0, 'A'] = None
df = df.set_index(['A', 'B'])

Then the values of the Index would be predictable and I could refer to the first row of A using df.loc.

And can you please give me a rule of thumb when to use test data which is already available and when to construct new test data? Thanks!

@jreback
Copy link
Contributor

jreback commented Jul 11, 2016

@mpuels best to use a sample frame that you construct as the expected (almost always).

@jorisvandenbossche
Copy link
Member

@mpuels Can you rebase and update according to the comments?

@mpuels
Copy link
Contributor Author

mpuels commented Jul 23, 2016

@jorisvandenbossche Sorry for the long delay. I don't have time today, but will do it tomorrow.

@jorisvandenbossche
Copy link
Member

No problem!

@mpuels mpuels force-pushed the fix-multiindex-nan-label-to-excel branch from 7272a96 to 2335cee Compare July 24, 2016 20:59
@jreback
Copy link
Contributor

jreback commented Jul 25, 2016

lgtm. @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche merged commit 4c2840e into pandas-dev:master Jul 25, 2016
@jorisvandenbossche
Copy link
Member

@mpuels Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NaN label in MultiIndex is assigned a non NaN value when writing to excel file
4 participants