ASV: occasional asv failures on xlwt #19779

Closed
jreback opened this issue Feb 20, 2018 · 11 comments · Fixed by #19811 or #19926
Labels
IO Excel read_excel, to_excel Performance Memory or execution speed performance
Milestone

Comments

@jreback
Contributor

jreback commented Feb 20, 2018

https://travis-ci.org/pandas-dev/pandas-ci/jobs/343616367
this is on current pandas master (this is just the CI job running it).

excel asv's

The following asvs benchmarks (if any) failed.
[ 43.02%] ··· Running io.excel.Excel.time_read_excel                  1/3 failed
                   xlwt      failed 
DONE displaying failed asvs benchmarks.

I have seen this pass as well. Maybe a race condition?

@jreback jreback added Performance Memory or execution speed performance IO Excel read_excel, to_excel labels Feb 20, 2018
@jreback jreback added this to the 0.23.0 milestone Feb 20, 2018
@jreback
Contributor Author

jreback commented Feb 20, 2018

From asv_bench/benchmarks/io/excel.py:

    def setup(...):
        ....
        self.bio_write = BytesIO()
        self.bio_write.seek(0)
        self.writer_write = ExcelWriter(self.bio_write, engine=engine)

    def time_read_excel(self, engine):
        read_excel(self.bio_read)

    def time_write_excel(self, engine):
        self.df.to_excel(self.writer_write, sheet_name='Sheet1')
        self.writer_write.save()

I think time_write_excel should do the .writer_write setup itself.

jreback added a commit to jreback/pandas that referenced this issue Feb 21, 2018
jreback added a commit that referenced this issue Feb 21, 2018
@jorisvandenbossche
Member

If it is defined in the setup function (as it is now), it should be available in the benchmark function. It would be strange if this solved it.

@jreback jreback reopened this Feb 26, 2018
@jreback
Contributor Author

jreback commented Feb 26, 2018

It's passing 1 out of n times: https://travis-ci.org/pandas-dev/pandas-ci/jobs/345996588

@jreback
Contributor Author

jreback commented Feb 26, 2018

cc @mroeschke @WillAyd if you have any ideas

@WillAyd
Member

WillAyd commented Feb 26, 2018

Perhaps we should explicitly close the BytesIO objects that get created, in a teardown? Given how intermittent this is, I'm wondering if the GC is taking a pass at closing those for us but getting tripped up by the parallelized execution that asv provides.
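
A minimal sketch of what explicit cleanup could look like, assuming asv's `setup`/`teardown` hooks. The class name `BufferBenchmark` and the payload are hypothetical, and plain bytes stand in for a real workbook so the snippet runs without pandas:

```python
from io import BytesIO


class BufferBenchmark:
    """Hypothetical asv-style benchmark; not the actual pandas benchmark."""

    def setup(self):
        # Stand-in payload; the real benchmark would hold an Excel workbook.
        self.bio_read = BytesIO(b"fake workbook bytes")

    def time_read(self):
        self.bio_read.seek(0)
        self.bio_read.read()

    def teardown(self):
        # Close explicitly instead of relying on garbage collection.
        self.bio_read.close()


# Drive the hooks by hand, in the order a benchmark runner would.
bench = BufferBenchmark()
bench.setup()
bench.time_read()
bench.teardown()
print(bench.bio_read.closed)
```

Whether this would cure the intermittent failure is a separate question; it only removes the dependence on when the garbage collector runs.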

@mroeschke
Member

Looks to be specifically a problem with the time_read_excel benchmark.

File "/home/travis/build/pandas-dev/pandas-ci/pandas/asv_bench/benchmarks/io/excel.py", line 29, in time_read_excel
                    read_excel(self.bio_read)
File "/home/travis/miniconda3/envs/pandas/lib/python3.6/site-packages/pandas-0.23.0.dev0+381.g8f1dfa74e-py3.6-linux-x86_64.egg/pandas/util/_decorators.py", line 172, in wrapper
                    return func(*args, **kwargs)
File "/home/travis/miniconda3/envs/pandas/lib/python3.6/site-packages/pandas-0.23.0.dev0+381.g8f1dfa74e-py3.6-linux-x86_64.egg/pandas/util/_decorators.py", line 172, in wrapper
                    return func(*args, **kwargs)
File "/home/travis/miniconda3/envs/pandas/lib/python3.6/site-packages/pandas-0.23.0.dev0+381.g8f1dfa74e-py3.6-linux-x86_64.egg/pandas/io/excel.py", line 315, in read_excel
                    io = ExcelFile(io, engine=engine)
File "/home/travis/miniconda3/envs/pandas/lib/python3.6/site-packages/pandas-0.23.0.dev0+381.g8f1dfa74e-py3.6-linux-x86_64.egg/pandas/io/excel.py", line 391, in __init__
                    self.book = xlrd.open_workbook(file_contents=data)
File "/home/travis/miniconda3/envs/pandas/lib/python3.6/site-packages/xlrd/__init__.py", line 116, in open_workbook
                    with open(filename, "rb") as f:
                TypeError: expected str, bytes or os.PathLike object, not NoneType

Digging into xlrd.open_workbook for the file_contents variable.
http://www.lexicon.net/sjmachin/xlrd.html#xlrd.open_workbook-function

file_contents
... as a string or an mmap.mmap object or some other behave-alike object. If file_contents is supplied, filename will not be used, except (possibly) in messages.

Looks like filename=None is the default as well, but for some reason it's being used despite the note above?
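
One possible explanation, sketched here under assumptions: if the buffer has already been read once, a second `read()` returns empty bytes, and a truthiness check on `file_contents` would then fall through to `filename`. The function `open_workbook_like` is a hypothetical stand-in that only mimics such a check; it is not xlrd's code.

```python
from io import BytesIO


def open_workbook_like(filename=None, file_contents=None):
    """Hypothetical stand-in mimicking a truthiness check on file_contents."""
    if file_contents:
        return file_contents[:4]
    # Empty bytes are falsy, so this branch runs even though
    # file_contents was passed explicitly.
    with open(filename, "rb") as f:
        return f.read(4)


bio = BytesIO(b"\xd0\xcf\x11\xe0 rest of workbook")
first = bio.read()   # consumes the whole buffer
second = bio.read()  # position is now at the end, so nothing is left

peek = open_workbook_like(file_contents=first)  # works on the first read

error = None
try:
    open_workbook_like(file_contents=second)
except TypeError as exc:
    # open(None, "rb") fails in the same way as the traceback above
    error = str(exc)
    print(error)
```

This would also fit the intermittency if the timed function runs more than once per `setup` call, since only the first call would see a fresh buffer.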

@jreback
Contributor Author

jreback commented Feb 27, 2018

It's funny because it has worked at times. Really odd.

@WillAyd
Member

WillAyd commented Feb 27, 2018

While it doesn't explain why this is happening, I think adding io.seek(0) just before the line below will "fix" the issue at hand (at least it did locally for me):

data = io.read()
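
The effect of that rewind can be seen on a plain `BytesIO`. This is a sketch with a hypothetical helper `read_all`, not the pandas code itself:

```python
from io import BytesIO


def read_all(buf, rewind):
    """Hypothetical helper: read the buffer, optionally rewinding first."""
    if rewind:
        buf.seek(0)
    return buf.read()


buf = BytesIO(b"workbook bytes")

# Without a rewind, only the first read sees any data.
without_rewind = [read_all(buf, rewind=False) for _ in range(2)]

# With a rewind, every read returns the full payload.
with_rewind = [read_all(buf, rewind=True) for _ in range(2)]

print(without_rewind)
print(with_rewind)
```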

@jreback
Contributor Author

jreback commented Feb 27, 2018

Can you replicate this in a test (and then fix it)?

@mcrot
Contributor

mcrot commented Mar 20, 2018

Hi all,

I'm completely new to pandas development and have just set up a working environment following the guide Contributing to pandas, because I wanted to contribute to another issue. I'm not sure whether this is the right place to raise the following test failure, but it seems to be related to #19926:

When running the tests for pandas.io.excel

pytest pandas/tests/io/test_excel.py

it comes up with three failures (because of 3 parameters) for the test method TestXlrdReader.test_read_from_http_url:

F
pandas/tests/io/test_excel.py:557 (TestXlrdReader.test_read_from_http_url[.xls])
self = <pandas.tests.io.test_excel.TestXlrdReader object at 0x7fc1d823a1d0>
ext = '.xls'

    @tm.network
    def test_read_from_http_url(self, ext):
        url = ('https://raw.github.com/pandas-dev/pandas/master/'
               'pandas/tests/io/data/test1' + ext)
>       url_table = read_excel(url)

test_excel.py:562: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../util/_decorators.py:172: in wrapper
    return func(*args, **kwargs)
../../util/_decorators.py:172: in wrapper
    return func(*args, **kwargs)
../../io/excel.py:315: in read_excel
    io = ExcelFile(io, engine=engine)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.io.excel.ExcelFile object at 0x7fc1d82d50f0>
io = <http.client.HTTPResponse object at 0x7fc1d81e3320>, kwds = {}
err_msg = 'Install xlrd >= 0.9.0 for Excel support'
xlrd = <module 'xlrd' from '/home/mcrot/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xlrd/__init__.py'>
ver = (1, 1), engine = None

    def __init__(self, io, **kwds):
    
        err_msg = "Install xlrd >= 0.9.0 for Excel support"
    
        try:
            import xlrd
        except ImportError:
            raise ImportError(err_msg)
        else:
            ver = tuple(map(int, xlrd.__VERSION__.split(".")[:2]))
            if ver < (0, 9):  # pragma: no cover
                raise ImportError(err_msg +
                                  ". Current version " + xlrd.__VERSION__)
    
        # could be a str, ExcelFile, Book, etc.
        self.io = io
        # Always a string
        self._io = _stringify_path(io)
    
        engine = kwds.pop('engine', None)
    
        if engine is not None and engine != 'xlrd':
            raise ValueError("Unknown engine: {engine}".format(engine=engine))
    
        # If io is a url, want to keep the data as bytes so can't pass
        # to get_filepath_or_buffer()
        if _is_url(self._io):
            io = _urlopen(self._io)
        elif not isinstance(self.io, (ExcelFile, xlrd.Book)):
            io, _, _, _ = get_filepath_or_buffer(self._io)
    
        if engine == 'xlrd' and isinstance(io, xlrd.Book):
            self.book = io
        elif not isinstance(io, xlrd.Book) and hasattr(io, "read"):
            # N.B. xlrd.Book has a read attribute too
            if hasattr(io, 'seek'):
                # GH 19779
>               io.seek(0)
E               io.UnsupportedOperation: seek

../../io/excel.py:392: UnsupportedOperation

It seems like the HTTPResponse object returned by urllib.request.urlopen does not support seeking, although the seek() method is available.
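
The mismatch between "has a `seek` attribute" and "can actually seek" can be reproduced without a network call. The sketch below uses a hypothetical `NonSeekableStream` built on `io.RawIOBase` as a stand-in for the response object, and assumes the object implements `seekable()` the way `io.IOBase` subclasses do:

```python
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a response-like object that cannot seek."""

    def readable(self):
        return True

    def readinto(self, b):
        return 0  # behave like an empty stream


stream = NonSeekableStream()
has_seek = hasattr(stream, "seek")  # the method exists on the base class
can_seek = stream.seekable()        # but the stream reports no seek support

try:
    stream.seek(0)
    seek_failed = False
except io.UnsupportedOperation:
    seek_failed = True

# Guarding on seekable() rather than hasattr() skips the call entirely.
if stream.seekable():
    stream.seek(0)
```

A `seekable()` guard is one alternative to special-casing a concrete class; it is shown only as an option, not as what pandas does.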

This fixes the tests for me:

@@ -10,6 +10,7 @@ import os
 import abc
 import warnings
 import numpy as np
+from http.client import HTTPResponse
 
 from pandas.core.dtypes.common import (
     is_integer, is_float,
@@ -387,7 +388,9 @@ class ExcelFile(object):
             self.book = io
         elif not isinstance(io, xlrd.Book) and hasattr(io, "read"):
             # N.B. xlrd.Book has a read attribute too
-            if hasattr(io, 'seek'):
+            #
+            # http.client.HTTPResponse.seek() -> UnsupportedOperation exception
+            if not isinstance(io, HTTPResponse) and hasattr(io, 'seek'):
                 # GH 19779
                 io.seek(0)

Should I create a pull request here or a new issue, or am I missing something in my setup such that the tests can't run?

My currently installed versions are:

INSTALLED VERSIONS
------------------
commit: 01882ba5b4c21b0caf2e6b9279fb01967aa5d650
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-116-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.23.0.dev0+657.g01882ba
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Mar 20, 2018

@mcrot I would suggest that you open a new issue for that and then submit a PR referencing the issue

mcrot added a commit to mcrot/pandas that referenced this issue Mar 22, 2018
Closes pandas-dev#20434.

Back in pandas-dev#19779 a call of a seek() method was added. This call fails
on HTTPResponse instances with an UnsupportedOperation exception,
so for this case a try..except wrapper was added here.
mcrot added a commit to mcrot/pandas that referenced this issue Apr 3, 2018
Closes pandas-dev#20434.

Back in pandas-dev#19779 a call of a seek() method was added. This call fails
on HTTPResponse instances with an UnsupportedOperation exception,
so for this case a try..except wrapper was added here.
TomAugspurger pushed a commit that referenced this issue Apr 3, 2018
Closes #20434.

Back in #19779 a call of a seek() method was added. This call fails
on HTTPResponse instances with an UnsupportedOperation exception,
so for this case a try..except wrapper was added here.
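
A try..except wrapper of the kind the commit message describes might look roughly like this. It is a sketch only: `rewind_if_possible` and `NoSeek` are hypothetical names, and the actual pandas change may differ in detail.

```python
import io


def rewind_if_possible(buf):
    """Hypothetical helper: seek to the start, tolerating non-seekable streams."""
    try:
        buf.seek(0)
    except io.UnsupportedOperation:
        # A stream that exposes seek() but cannot actually seek.
        return False
    return True


class NoSeek(io.RawIOBase):
    """Stand-in for a stream whose seek() raises UnsupportedOperation."""


consumed = io.BytesIO(b"data")
consumed.read()  # move the position to the end

rewound = rewind_if_possible(consumed)  # seekable buffer is rewound
skipped = rewind_if_possible(NoSeek())  # non-seekable stream is left alone
payload = consumed.read()               # full payload is readable again
```

Compared with an `isinstance` check against one response class, catching the exception also covers other stream types that cannot seek.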
5 participants