Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change pandas ExcelWriter engine from XlsxWriter to openpyxl #329

Merged
merged 6 commits into from
Jul 5, 2020
Merged

Change pandas ExcelWriter engine from XlsxWriter to openpyxl #329

merged 6 commits into from
Jul 5, 2020

Conversation

JS3xton
Copy link
Contributor

@JS3xton JS3xton commented May 7, 2020

Changing pandas ExcelWriter engine from XlsxWriter to openpyxl eliminates the need for the XlsxWriter package. Since we're already considering openpyxl for the read engine, this would reduce the number of external packages upon which FlowCal depends.

Closes #326.

Comparison of Excel files generated by >python -m FlowCal.excel_ui -i ./examples/experiment.xlsx in Python 3.8, Anaconda 2020.02:

Before pull request (9cd83ef) After pull request (95b3ac1)

The column widths appear narrower despite being calculated the same way, but overall they appear similar.

FlowCal.excel_ui.write_workbook() also works correctly in Python 2.7, Anaconda v4.3.0.

NOTE: If this pull request is adopted, #325 can be slightly simplified since openpyxl will already have to be imported.

-Eliminates need for `XlsxWriter` package.
@JS3xton JS3xton requested a review from castillohair May 7, 2020 19:15
@castillohair
Copy link
Collaborator

The column widths appear narrower despite being calculated the same way, but overall they appear similar.

Yeah that's mildly annoying. What do you think about doing something like this in excel_ui.write_workbook()?

writer.sheets[sheet_name].column_dimensions[col_letter].width = 1.1*width

NOTE: If this pull request is adopted, #325 can be slightly simplified since openpyxl will already have to be imported

So what's the plan if we want to fully switch to openpyxl? It seems it will not just be accepting this PR and #325.

@JS3xton
Copy link
Contributor Author

JS3xton commented May 9, 2020

writer.sheets[sheet_name].column_dimensions[col_letter].width = 1.1*width

I'm fine with more spacing. Please update with whatever you desire.

So what's the plan if we want to fully switch to openpyxl? It seems it will not just be accepting this PR and #325.

I would merge #329 first and then update (merge from develop) and make the minor simplifications to #325 (which I'm happy to do when the time comes). Specifically, openpyxl can simply be imported, and we can Except directly on openpyxl.utils.exceptions.InvalidFileException. Everything else should be consistent between the two pull requests.

@castillohair
Copy link
Collaborator

#325 has been merged. Please simplify this PR. There are also conflicts between this PR and the current develop branch. Please fix these.

Finally, we need to find out what are the oldest versions of openpyxl and pandas we can support with this PR.

@JS3xton
Copy link
Contributor Author

JS3xton commented Jun 15, 2020

Finally, we need to find out what are the oldest versions of openpyxl and pandas we can support with this PR.

For pandas, ExcelWriter appears to add the engine parameter in pandas version 0.13.0, and it supported the 'openpyxl' option from the beginning. We already require pandas version 0.16.1, so we're clear there.

The handoff from pandas to openpyxl happens at the sheet instances. Back to pandas version 0.13.0, ExcelWriter has a self.sheets dictionary, which is populated by a call to openpyxl.workbook.Workbook().create_sheet().

For openpyxl, Worksheet appears to have self.column_dimensions and ColumnDimension appears to have self.width back to openpyxl version 1.0, so we may not have a minimum requirement for openpyxl, at least for this PR.

Other things to consider:

I have a mild preference that we bump the openpyxl version up to 2.4.7 so we still permit Anaconda 4.4.0 out of the box.

-Some exception handling that was no longer necessary was removed.
-A comment noting pandas versions that would not recognize
 openpyxl was corrected.
@JS3xton
Copy link
Contributor Author

JS3xton commented Jun 15, 2020

Merge conflicts with develop have been resolved. Unit tests still pass in Python 3.8 + Anaconda 2020.02 and Python 2.7 + Anaconda 4.4.0.

Tasks remaining:

  • Agree on a minimum openpyxl version number.
  • Get column spacing to @castillohair's liking.

@castillohair
Copy link
Collaborator

  • I originally chose openpyxl version 2.4.1 to match Anaconda version 4.3.0, which I previously thought was our minimum supported version.

I have a mild preference that we bump the openpyxl version up to 2.4.7 so we still permit Anaconda 4.4.0 out of the box.

I don't understand this. If it works with 2.4.1, why would we bump it up to 2.4.7? It should still work with Anaconda 4.4.0, right?

@JS3xton
Copy link
Contributor Author

JS3xton commented Jun 19, 2020

All Anaconda versions before the current one (2020.02) have a pandas version < 0.25.0, which causes FlowCal to fall back on xlrd (so the FlowCal unit tests still pass).

However, pandas versions 0.25.x throw an ImportError with openpyxl versions < 2.4.8 (no Anaconda version uses pandas version 0.25.x, though).

Moreover, currently, pandas versions >= 1.0.0 throw an ImportError with openpyxl versions < 2.5.7. And I don't think anything bars them from raising the minimum supported openpyxl version in the future. Unit tests therefore fail, for example, under a current installation if openpyxl is held at version 2.4.1 in the requirements.txt file.

A good way to proceed might therefore be to set the minimum openpyxl version to 2.2.0 to ensure openpyxl.utils.exceptions.InvalidFileException is available, and also to more gracefully handle the case where the openpyxl version is less than what pandas requires. I.e. add another clause where FlowCal falls back to xlrd. If we remove xlrd in the future, though, we won't be able to rely on that and will probably have to increase some minimum supported package versions substantially. Also it appears that writing Excel files does not suffer the ImportError (just from reading the underlying pandas source code), but that seems sloppily inconsistent to me.

@JS3xton
Copy link
Contributor Author

JS3xton commented Jun 19, 2020

The aforementioned changes have been made (4742848).

All unit tests pass in Python 3.8 + Anaconda 2020.02 and Python 2.7 + Anaconda 4.4.0.

>python -m unittest test.test_excel_ui passes with a vanilla install where openpyxl is held to version 2.2.0. Turns out, it still passes with version 2.1.5 thanks to the new ImportError exception, but I still think version 2.2.0 is a good place to stop since I clearly wrote the code expecting openpyxl.utils.exceptions.InvalidFileException to exist. This requirement should also be more robust if we remove xlrd later and can no longer fall back on it.

@castillohair
Copy link
Collaborator

castillohair commented Jun 21, 2020

Running the Excel UI with the current minimum packages on Python 2.7 fails with the following message:

Calculating statistics for all samples...
Generating histograms table...
Saving output Excel file...
Traceback (most recent call last):
  File "/Users/castillohair/.pyenv/versions/2.7.16/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/castillohair/.pyenv/versions/2.7.16/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/castillohair/Documents/Programs/FlowCal-recent/FlowCal/FlowCal/excel_ui.py", line 1710, in <module>
    run_command_line()
  File "/Users/castillohair/Documents/Programs/FlowCal-recent/FlowCal/FlowCal/excel_ui.py", line 1707, in run_command_line
    hist_sheet=args.histogram_sheet)
  File "/Users/castillohair/Documents/Programs/FlowCal-recent/FlowCal/FlowCal/excel_ui.py", line 1636, in run
    write_workbook(output_path, table_list)
  File "/Users/castillohair/Documents/Programs/FlowCal-recent/FlowCal/FlowCal/excel_ui.py", line 259, in write_workbook
    df.to_excel(writer, sheet_name=sheet_name, index=False)
  File "/Users/castillohair/.pyenv/versions/py2.7_flowcal_min/lib/python2.7/site-packages/pandas/core/frame.py", line 1269, in to_excel
    startrow=startrow, startcol=startcol)
  File "/Users/castillohair/.pyenv/versions/py2.7_flowcal_min/lib/python2.7/site-packages/pandas/io/excel.py", line 778, in write_cells
    xcell.style = xcell.style.copy(**style_kwargs)
  File "/Users/castillohair/.pyenv/versions/py2.7_flowcal_min/lib/python2.7/site-packages/openpyxl/styles/styleable.py", line 107, in style
    protection=self.protection
  File "/Users/castillohair/.pyenv/versions/py2.7_flowcal_min/lib/python2.7/site-packages/openpyxl/styles/__init__.py", line 42, in __init__
    self._font = font
  File "/Users/castillohair/.pyenv/versions/py2.7_flowcal_min/lib/python2.7/site-packages/openpyxl/descriptors/base.py", line 35, in __set__
    raise TypeError('expected ' + str(self.expected_type))
TypeError: expected <class 'openpyxl.styles.fonts.Font'>

It looks like openpyxl 2.2 actually contains the Font class in question. However, it is likely that pandas is using it in a way that openpyxl's API didn't support at the time. Using more recent openpyxl versions while keeping the same pandas 0.16.1 result in the following error:

Calculating statistics for all samples...
Generating histograms table...
Saving output Excel file...
Traceback (most recent call last):
  File "/Users/castillohair/.pyenv/versions/2.7.16/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/castillohair/.pyenv/versions/2.7.16/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/castillohair/Documents/Programs/FlowCal-recent/FlowCal/FlowCal/excel_ui.py", line 1710, in <module>
    run_command_line()
  File "/Users/castillohair/Documents/Programs/FlowCal-recent/FlowCal/FlowCal/excel_ui.py", line 1707, in run_command_line
    hist_sheet=args.histogram_sheet)
  File "/Users/castillohair/Documents/Programs/FlowCal-recent/FlowCal/FlowCal/excel_ui.py", line 1636, in run
    write_workbook(output_path, table_list)
  File "/Users/castillohair/Documents/Programs/FlowCal-recent/FlowCal/FlowCal/excel_ui.py", line 259, in write_workbook
    df.to_excel(writer, sheet_name=sheet_name, index=False)
  File "/Users/castillohair/.pyenv/versions/py2.7_flowcal_min/lib/python2.7/site-packages/pandas/core/frame.py", line 1269, in to_excel
    startrow=startrow, startcol=startcol)
  File "/Users/castillohair/.pyenv/versions/py2.7_flowcal_min/lib/python2.7/site-packages/pandas/io/excel.py", line 749, in write_cells
    from openpyxl.cell import get_column_letter
ImportError: cannot import name get_column_letter

This is also likely due to an incorrect use of openpyxl. Looking at pandas's code, support for openpyxl 2 is annotated as experimental and the offending line can be found immediately below. This is not changed until pandas 0.23 . Accordingly, the Excel UI runs successfully with pandas 0.23.0 and openpyxl 2.2.0. So we will need to update pandas to this version, which will result in all Anaconda versions before 5.2.0 being unsupported. I'm honestly not too concerned about deprecating slightly older versions. In my experience, non-programmers haven't had much issue when instructed to "install a newer anaconda version", and programmers can deal with dependencies via pip and/or virtual environments.

As an aside, we currently don't have a unit test for excel file writing, which needs to get fixed.

Finally, I want to point something out. The fact that the minimum openpyxl and pandas versions are intertwined will be a headache in the future, especially considering that both change frequently. The worst is that this shouldn't be our problem, since openpyxl is a pandas dependency and should be handled by it. Unfortunately, it is listed as an optional dependency and not enforced by the package managers. While there is a mechanism to automatically deal with these type of dependencies, it seems that pandas decided that excel libraries were not important enough to be included. Hopefully this will change in the future.

So, in summary, I would like to have the following before accepting this PR:

  • The minimum pandas needs to be updated to 0.23.
  • A simple unittest for excel_ui.write_workbook() should be made.

@JS3xton
Copy link
Contributor Author

JS3xton commented Jun 23, 2020

  • The minimum pandas needs to be updated to 0.23.

Should be done in (a41a816), unless there's something more complicated that I missed.

  • A simple unittest for excel_ui.write_workbook() should be made.

How exactly do you wanna do that?

@castillohair
Copy link
Collaborator

  • A simple unittest for excel_ui.write_workbook() should be made.

How exactly do you wanna do that?

I can think of two things:

  • Test 1: just write a dataframe with excel_ui.write_workbook(). The test's purpose would be to check that calling excel_ui.write_workbook() doesn't result in exceptions.
  • Test 2: write a small dataframe with excel_ui.write_workbook(), read it back with pandas and check that the contents are correct.

@JS3xton
Copy link
Contributor Author

JS3xton commented Jul 4, 2020

  • A simple unittest for excel_ui.write_workbook() should be made.

How exactly do you wanna do that?

I can think of two things:

  • Test 1: just write a dataframe with excel_ui.write_workbook(). The test's purpose would be to check that calling excel_ui.write_workbook() doesn't result in exceptions.
  • Test 2: write a small dataframe with excel_ui.write_workbook(), read it back with pandas and check that the contents are correct.

425bb07 adds a version of Test 2. It's based on test_read_table() and test_pickle_unpickle().

All unit tests pass in Python 3.8 + Anaconda 2020.02 and Python 2.7 + Anaconda 5.2.0.

I don't understand your previous comment well enough to know if it covers the issues stated therein (e.g. all unit tests still pass in Python 2.7 + Anaconda 4.4.0. I couldn't get a virtual environment to work with only the minimum package requirements).

@castillohair
Copy link
Collaborator

Yeah everything is working now. The issue in the comment you mention was fixed by increasing the minimum pandas version number. The fact that the issue wasn't caught by unit testing is what made me suggest the writing unit test. I will merge now.

@castillohair castillohair merged commit 3affd34 into taborlab:develop Jul 5, 2020
@JS3xton JS3xton deleted the excel-write-engine branch July 5, 2020 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants