BUG: Patch for skipping seek() when loading Excel files from url #20437

mcrot · 2018-03-21T11:31:13Z

closes UnsupportedOperation 'seek' when loading excel files from url #20434
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is my first PR and I tried to follow the guideline. Sorry, if I overlooked something.
I don't have to add anything in whatsnew, because the bug isn't present in the current stable version 0.22, correct?

codecov · 2018-03-21T12:17:30Z

Codecov Report

Merging #20437 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #20437      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%     
==========================================
  Files         153      153              
  Lines       49256    49256              
==========================================
+ Hits        45229    45241      +12     
+ Misses       4027     4015      -12

Flag	Coverage Δ
#multiple	`90.23% <ø> (+0.02%)`	⬆️
#single	`41.9% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/plotting/_converter.py	`66.81% <0%> (+1.73%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73cb32e...88d91f5. Read the comment docs.

TomAugspurger · 2018-03-21T12:35:40Z

cc @WillAyd

WillAyd · 2018-03-21T12:38:23Z

Pretty simple change so lgtm - fine with it if you are Tom. Thanks for the contribution @mcrot

TomAugspurger · 2018-03-21T12:47:46Z

pandas/io/excel.py

@@ -10,6 +10,7 @@
 import abc
 import warnings
 import numpy as np
+from http.client import HTTPResponse


Did this exist in py2? May have to import from pandas.compat.

Thanks for your feedback, Tom. You're right, http.client was named httplib in py2, I forgot to check py2 compatibility, sorry. So maybe import pandas.compat.httplib as httplib and using httplib.HTTPResponse would work, but I will think about your and @WillAyd's comment first.

Shouldn't this have failed on the AppVeyor 2.7 build?

Yes, but it worked .. I'm just trying to figure out why.

Sorry, I've no idea why the code worked for Python 2.7 - not only in AppVeyor, but also in TravisCI and CircleCI. In all of them pytest reports to use py 2.7.14. And the docs clearly state that http.client has come up with py3. As example, http.client is not available in my local py27 environment:

Python 2.7.14 | packaged by conda-forge | (default, Dec 25 2017, 01:16:05) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from http.client import HTTPResponse Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named http.client

How is this possible, is there some CI magic I haven't understood yet?
I've also searched for some other compatibility libs which may have used in the pandas code base in order to achieve this, but haven't found any.

I think it would be good for me to understand this before implementing a py2 compatible fix here.

Are you able to set up a python2 dev environment and reproduce locally? That may be easier than going through the CI logs.

Just tested on your branch in my py2 env

In [1]: import pandas as pd --------------------------------------------------------------------------- ImportError Traceback (most recent call last) <ipython-input-1-af55e7023913> in <module>() ----> 1 import pandas as pd /Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/__init__.pyc in <module>() 55 56 from pandas.util._print_versions import show_versions ---> 57 from pandas.io.api import * 58 from pandas.util._tester import test 59 import pandas.testing /Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/io/api.py in <module>() 7 from pandas.io.parsers import read_csv, read_table, read_fwf 8 from pandas.io.clipboards import read_clipboard ----> 9 from pandas.io.excel import ExcelFile, ExcelWriter, read_excel 10 from pandas.io.pytables import HDFStore, get_store, read_hdf 11 from pandas.io.json import read_json /Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/io/excel.py in <module>() 11 import warnings 12 import numpy as np ---> 13 from http.client import HTTPResponse 14 15 from pandas.core.dtypes.common import ( ImportError: No module named http.client

So something strange is going on.

I wonder if one of our dependencies installs an http backport?

I've tried to use tox -e py27 in order to test with python 2.7 - that did not work, because the bigquery dependency could not be satisfied. I'll try to create a separate directory and build with python 2.7, that's maybe a better idea.

I'll also have a look if there are dependecies to a http backport...

Hm, I've built a local development environment mit py27 and the tests fail as expected, because http.client cannot be imported. So it seems this is only related to the CI mechanism..

In order to check the CI dependencies, I've made a small py2 script which collects all dependencies from the files ci/requirements-2.7(_[A-Z]+)?\.(build|pip|run), imports them one by one (assuming the correct version is installed) and tries to import http.client after each import. Haven't found a backport of http by doing this.

I've added the script and my output here for reference. I had to manually install some of the collected dependencies before (couldn't find 'nomkl' btw). Not sure whether this helps.

import os import re import logging logging.basicConfig(level=logging.INFO) if __name__=='__main__': DEPS_DIR = '/home/mcrot/prj/pandas-mcrot-py2/ci' DEPS_REGEX = r'^requirements-2.7(_[A-Z]+)?\.(build|pip|run)$' DEPS_BLACKLIST = ['python', 'ipython'] DEPS_IMPORT_NAMES = { 'PyCrypto': 'Crypto', 'pytables': 'tables', 'python-dateutil': 'dateutil', 'beautifulsoup4': 'bs4', 'pandas-gbq': 'pandas_gbq'} deps = set() for fn in os.listdir(DEPS_DIR): if re.match(DEPS_REGEX, fn): logging.info("Loading deps from file '%s'..", fn) f = open(os.path.join(DEPS_DIR, fn)) new_deps = [] for s in f.readlines(): dep = s.split('=')[0].strip() if dep in DEPS_BLACKLIST: continue # rename if possible try: dep = DEPS_IMPORT_NAMES[dep] except KeyError: pass new_deps.append(dep) deps = deps.union(new_deps) deps = sorted(d for d in deps) logging.info("Checking these dependencies: %s", deps) for dep in deps: try: mod = __import__(dep) except ImportError: logging.warning("Cannot import module '%s'", dep) continue try: from http.client import HTTPResponse logging.error("Module '%s' DOES IMPORT http.", dep) raise Exception("Backport of http detected in module '%s'" % str(mod)) except ImportError as exc: # as expected for py2 logging.info("Module '%s' does not import http.", mod)

check-backport.txt

So I guess, I need a local CI in order to find out why http.client can be imported?

TomAugspurger · 2018-03-21T12:53:01Z

pandas/io/excel.py

+            #
+            # Excluding the case http.client.HTTPResponse.seek() here
+            # because it raises an UnsupportedOperation exception
+            if not isinstance(io, HTTPResponse) and hasattr(io, 'seek'):


I worry a bit about this hardcoding HTTPResponse here.

IIUC, io could be anything that implements the __fspath__ protocol. Not all of those objects will implement seek.

What's the harm in cactching (OSError, ValueError) here? (I don't really know why we seek in the first place)

Back in #19779 we were getting very intermittent failures on ASV tests due to a TypeError being thrown. tbh I'm still not entirely clear on why it was happening but I believe that intermittently that buffer was being read before this code was being executed. Adding in the conditional seek back to the beginning it prevented the TypeError.

So it's certainly a fair question on your end to which I don't have a great answer. try...catch could be an option, but then we'd probably have to seek back to the start in the catch block and maybe still have the conditional (to distinguish from other TypeErrors) so not sure there is a ton to gain there

TomAugspurger · 2018-03-21T16:54:06Z

That sounds difficult. You might try adding a `conda list` to the end of `ci/install_travis.sh` to print out the final version of all the packages. My also throw a `python -c "import http.client"` there too, to ensure that our hypothesis about a backport is correct.

…

On Wed, Mar 21, 2018 at 11:43 AM, Michael Röttger ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In pandas/io/excel.py <#20437 (comment)>: > @@ -10,6 +10,7 @@ import abc import warnings import numpy as np +from http.client import HTTPResponse Hm, I've built a local development environment mit py27 and the tests fail as expected, because http.client cannot be imported. So it seems this is only related to the CI mechanism.. In order to check the CI dependencies, I've made a small py2 script which collects all dependencies from the files ci/requirements-2.7(_[A-Z]+)?\ .(build|pip|run), imports them one by one (assuming the correct version is installed) and tries to import http.client after each import. Haven't found a backport of http by doing this. I've added the script and my output here for reference. I had to manually install some of the collected dependencies before (couldn't find 'nomkl' btw). Not sure whether this helps. import osimport re import logging logging.basicConfig(level=logging.INFO) if __name__=='__main__': DEPS_DIR = '/home/mcrot/prj/pandas-mcrot-py2/ci' DEPS_REGEX = r'^requirements-2.7(_[A-Z]+)?\.(build|pip|run)$' DEPS_BLACKLIST = ['python', 'ipython'] DEPS_IMPORT_NAMES = { 'PyCrypto': 'Crypto', 'pytables': 'tables', 'python-dateutil': 'dateutil', 'beautifulsoup4': 'bs4', 'pandas-gbq': 'pandas_gbq'} deps = set() for fn in os.listdir(DEPS_DIR): if re.match(DEPS_REGEX, fn): logging.info("Loading deps from file '%s'..", fn) f = open(os.path.join(DEPS_DIR, fn)) new_deps = [] for s in f.readlines(): dep = s.split('=')[0].strip() if dep in DEPS_BLACKLIST: continue # rename if possible try: dep = DEPS_IMPORT_NAMES[dep] except KeyError: pass new_deps.append(dep) deps = deps.union(new_deps) deps = sorted(d for d in deps) logging.info("Checking these dependencies: %s", deps) for dep in deps: try: mod = __import__(dep) except ImportError: logging.warning("Cannot import module '%s'", dep) continue try: from http.client import HTTPResponse logging.error("Module '%s' DOES IMPORT http.", dep) raise Exception("Backport of http detected in module '%s'" % str(mod)) except ImportError as exc: # as expected for py2 logging.info("Module '%s' does not import http.", mod) check-backport.txt <https://github.com/pandas-dev/pandas/files/1834240/check-backport.txt> So I guess, I need a local CI in order to find out why http.client can be imported? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#20437 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHInpeAh5hpMqgPugPwF7Y70MGz9cyks5tgoNDgaJpZM4SzZlx> .

jreback · 2018-03-21T17:03:55Z

this might simpler, in pandas/compat/__init__.py we already do

try:
    import __builtin__ as builtins
    # not writeable when instantiated with string, doesn't handle unicode well
    from cStringIO import StringIO as cStringIO
    # always writeable
    from StringIO import StringIO
    BytesIO = StringIO
    import cPickle
    import httplib
except ImportError:
    import builtins
    from io import StringIO, BytesIO
    cStringIO = StringIO
    import pickle as cPickle
    import http.client as httplib

so you are likely getting:https://docs.python.org/2/library/httplib.html in python 2 (assuming we are doing the import from compat, which we should be doing)

TomAugspurger · 2018-03-21T17:06:53Z

which we should be doing

Right, but the latest CI run did an unconditional import http.client, which should have failed on py2 (and does for us locally, but not on any of the CI runs).

jreback · 2018-03-21T17:10:52Z

its possible this test is skipped on a 2.7 run?

TomAugspurger · 2018-03-21T17:15:21Z

Possibly, but I think that the import is hit by an import pandas, so that shouldn't matter. The entire test suite should fail (unless I'm misunderstanding something).

mcrot · 2018-03-21T17:16:54Z

The test is not skipped and the import http.client statement is on top of pandas.io.excel and unconditional, so it is executed on every test which imports the excel modul.
I've created a new branch in my repo, activated TravisCI and added the suggested commands at the end of the install script. Let's see..

mcrot · 2018-03-22T10:14:12Z

Ok, the TravisCI run my branch why-does-import-http-client-work-in-py2 failed already on the first import pandas as @TomAugspurger expected, see here. I don't understand why the pandas-dev TravisCI runs pass without problems - and also for the other two CI environments.

I've had a look at of a diff of TravisCI logs of both instances and haven't found a significant difference yet. The TravisCI on my repo had to built some wheels first and couldn't resuse Cython cache as the pandas-dev instance did. That shouldn't matter?!

However I'm going to temporarily add the extra output of conda list at the end of travis_install.sh also to this PR in order to see its output on pandas-dev TravisCI instance. Just to be sure we have the same package versions everywhere.

Concerning this PR here, maybe this py2 CI issue is slowly getting a little off-topic.. maybe I should be handled elsewhere in the long run, I'm not sure how you normally deal with CI issues like that. Or there is an easy explanation I cannot see, then sorry for the noise.

TomAugspurger · 2018-03-22T11:02:16Z

Concerning this PR here, maybe this py2 CI issue is slowly getting a little off-topic.. maybe I should be handled elsewhere in the long run

Yes, feel free to just make the changes to fix the excel bug (ideally tested locally with Py2), and we'll fix the CI in a followup.

mcrot · 2018-03-22T14:02:08Z

Ok, thanks .. it's a little weird, because this time the tests here on pandas-dev also fail for py27 as expected.

However I've continued with the PR and just catched OSError and ValueError as suggested by @TomAugspurger. Btw. would catching io.UnsupportedOperation be too specific? It is available in py2 and py3.

Locally the tests run both for py27 and py36, I will push now. I hope this PR does not affect negatively affect ASV tests as mentioned by @WillAyd.

TomAugspurger · 2018-03-22T14:06:16Z

Btw. would catching io.UnsupportedOperation be too specific? It is available in py2 and py3.

That should be equivalent, right? I just assumed it wasn't around in python2, but if so the use that.

mcrot · 2018-03-22T15:00:23Z

Ok, I'm catching io.UnsupportedOperation now instead of (OSError, ValueError). It's working locally for py36 and py27. I've pushed to this branch.

TomAugspurger · 2018-03-22T15:02:32Z

Great. Could you open an issue with a small summary to what we found here about the CI issue?

…

On Thu, Mar 22, 2018 at 10:00 AM, Michael Röttger ***@***.***> wrote: Ok, I'm catching io.UnsupportedOperation now instead of (OSError, ValueError). It's working locally for py36 and py27. I've pushed to this branch. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#20437 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIsROzV6mBjLbLc2acb7Y9fYe0Et4ks5tg7yMgaJpZM4SzZlx> .

mcrot · 2018-03-22T16:02:53Z

Done, see #20453.

jreback · 2018-03-22T22:53:37Z

can you add the test from the issue?

mcrot · 2018-03-23T08:47:00Z

The test which reveals #20434 is already included in the code base. I openend #20434, because
I'd just setup a working environment following the guide Contributing to pandas and afterwards, when running the tests, especially

pytest pandas/tests/io/test_excel.py

I got failures for the test method TestXlrdReader.test_read_from_http_url. The failure is described in #20434.

I understand now that is strange that my tests didn't run, because #19926 introduced the seek() method and the tests should have failed already in the past for that PR. Probably the issue here is more related to my setup than to the code base or the CI, but I'm not sure. Now I think, a better approach would have been to wait for somebody else to confirm the issue #20434 before creating this PR here, right?

jreback · 2018-03-24T22:20:41Z

if the test doesn't fail on CI, then either your fix doesn't actually fix anything, or the CI is not running it (or properly running it). so need to have a failure before it can get fixed.

TomAugspurger · 2018-03-28T16:48:01Z

@mcrot @jreback I suspect this is because we don't run slow tests on py3. So we can do a couple things

mock the test at the network layer and remove the slow marker
ensure that at least one of our slow jobs in running on 3.6

We should make sure that we run the slow test against 3.6 regardless of 2.7. I'm planning some cleanup of our CI setup, to make these easier to spot.

Regarding this PR, I'm fine with merging as is. I'm not sure how difficult mocking would be, but I'd like to get our tests passing again.

jreback · 2018-03-30T21:09:18Z

#20559

let's see if this fails

jreback · 2018-04-01T14:23:52Z

can you rebase this. happy to merge this.

TomAugspurger · 2018-04-02T12:38:55Z

Merged in master and re-ran. @mcrot ping on green.

mcrot · 2018-04-02T21:48:57Z

Green from my point of view. Thanks @jreback @TomAugspurger. If you still need me to do a rebase (not sure), I'm back in my office in about 15h.

Closes pandas-dev#20434. Back in pandas-dev#19779 a call of a seek() method was added. This call fails on HTTPResponse instances with an UnsupportedOperation exception, so for this case a try..except wrapper was added here.

TomAugspurger · 2018-04-03T19:24:01Z

Thanks for the fix, and for turning up the CI issues :)

TomAugspurger reviewed Mar 21, 2018

View reviewed changes

mcrot force-pushed the fix-unsupported-seek-for-httpresponse branch from 5398c6a to 7767703 Compare March 22, 2018 14:02

mcrot force-pushed the fix-unsupported-seek-for-httpresponse branch from 7767703 to f1e50d1 Compare March 22, 2018 14:57

mcrot mentioned this pull request Mar 22, 2018

Intermittent unexpected CI behaviour for py27 runs #20453

Closed

jreback added Testing pandas testing functions or related to the test suite IO Excel read_excel, to_excel labels Mar 22, 2018

jreback added this to the 0.23.0 milestone Apr 1, 2018

mcrot force-pushed the fix-unsupported-seek-for-httpresponse branch from 361e9c7 to 88d91f5 Compare April 3, 2018 14:14

TomAugspurger merged commit aa3fefc into pandas-dev:master Apr 3, 2018

mcrot deleted the fix-unsupported-seek-for-httpresponse branch April 3, 2018 19:42

onsimon mentioned this pull request Oct 7, 2019

BUG: Patch for skipping seek() when loading Excel files from urllib3.response.HTTPResponse object #28826

Closed

4 tasks

BUG: Patch for skipping seek() when loading Excel files from url #20437

BUG: Patch for skipping seek() when loading Excel files from url #20437

Conversation

mcrot commented Mar 21, 2018

codecov bot commented Mar 21, 2018 • edited Loading

Codecov Report

TomAugspurger commented Mar 21, 2018

WillAyd commented Mar 21, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Mar 21, 2018 via email

jreback commented Mar 21, 2018

TomAugspurger commented Mar 21, 2018

jreback commented Mar 21, 2018

TomAugspurger commented Mar 21, 2018

mcrot commented Mar 21, 2018

mcrot commented Mar 22, 2018

TomAugspurger commented Mar 22, 2018

mcrot commented Mar 22, 2018

TomAugspurger commented Mar 22, 2018

mcrot commented Mar 22, 2018

TomAugspurger commented Mar 22, 2018 via email

mcrot commented Mar 22, 2018

jreback commented Mar 22, 2018

mcrot commented Mar 23, 2018

jreback commented Mar 24, 2018

TomAugspurger commented Mar 28, 2018

jreback commented Mar 30, 2018

jreback commented Apr 1, 2018

TomAugspurger commented Apr 2, 2018

mcrot commented Apr 2, 2018

TomAugspurger commented Apr 3, 2018

codecov bot commented Mar 21, 2018 •

edited

Loading