-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Patch for skipping seek() when loading Excel files from url #20437
BUG: Patch for skipping seek() when loading Excel files from url #20437
Conversation
Codecov Report
@@ Coverage Diff @@
## master #20437 +/- ##
==========================================
+ Coverage 91.82% 91.84% +0.02%
==========================================
Files 153 153
Lines 49256 49256
==========================================
+ Hits 45229 45241 +12
+ Misses 4027 4015 -12
Continue to review full report at Codecov.
|
cc @WillAyd |
Pretty simple change so lgtm - fine with it if you are Tom. Thanks for the contribution @mcrot |
pandas/io/excel.py
Outdated
@@ -10,6 +10,7 @@ | |||
import abc | |||
import warnings | |||
import numpy as np | |||
from http.client import HTTPResponse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this exist in py2? May have to import from pandas.compat
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback, Tom. You're right, http.client
was named httplib
in py2, I forgot to check py2 compatibility, sorry. So maybe import pandas.compat.httplib as httplib
and using httplib.HTTPResponse
would work, but I will think about your and @WillAyd's comment first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this have failed on the AppVeyor 2.7 build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it worked .. I'm just trying to figure out why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I've no idea why the code worked for Python 2.7 - not only in AppVeyor, but also in TravisCI and CircleCI. In all of them pytest reports to use py 2.7.14. And the docs clearly state that http.client
has come up with py3. As example, http.client
is not available in my local py27 environment:
Python 2.7.14 | packaged by conda-forge | (default, Dec 25 2017, 01:16:05)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from http.client import HTTPResponse
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named http.client
How is this possible, is there some CI magic I haven't understood yet?
I've also searched for some other compatibility libs which may have used in the pandas code base in order to achieve this, but haven't found any.
I think it would be good for me to understand this before implementing a py2 compatible fix here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you able to set up a python2 dev environment and reproduce locally? That may be easier than going through the CI logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tested on your branch in my py2 env
In [1]: import pandas as pd
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-af55e7023913> in <module>()
----> 1 import pandas as pd
/Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/__init__.pyc in <module>()
55
56 from pandas.util._print_versions import show_versions
---> 57 from pandas.io.api import *
58 from pandas.util._tester import test
59 import pandas.testing
/Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/io/api.py in <module>()
7 from pandas.io.parsers import read_csv, read_table, read_fwf
8 from pandas.io.clipboards import read_clipboard
----> 9 from pandas.io.excel import ExcelFile, ExcelWriter, read_excel
10 from pandas.io.pytables import HDFStore, get_store, read_hdf
11 from pandas.io.json import read_json
/Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/io/excel.py in <module>()
11 import warnings
12 import numpy as np
---> 13 from http.client import HTTPResponse
14
15 from pandas.core.dtypes.common import (
ImportError: No module named http.client
So something strange is going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if one of our dependencies installs an http
backport?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to use tox -e py27
in order to test with python 2.7 - that did not work, because the bigquery
dependency could not be satisfied. I'll try to create a separate directory and build with python 2.7, that's maybe a better idea.
I'll also have a look if there are dependecies to a http
backport...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I've built a local development environment mit py27 and the tests fail as expected, because http.client
cannot be imported. So it seems this is only related to the CI mechanism..
In order to check the CI dependencies, I've made a small py2 script which collects all dependencies from the files ci/requirements-2.7(_[A-Z]+)?\.(build|pip|run)
, imports them one by one (assuming the correct version is installed) and tries to import http.client
after each import. Haven't found a backport of http
by doing this.
I've added the script and my output here for reference. I had to manually install some of the collected dependencies before (couldn't find 'nomkl' btw). Not sure whether this helps.
import os
import re
import logging
logging.basicConfig(level=logging.INFO)
if __name__=='__main__':
DEPS_DIR = '/home/mcrot/prj/pandas-mcrot-py2/ci'
DEPS_REGEX = r'^requirements-2.7(_[A-Z]+)?\.(build|pip|run)$'
DEPS_BLACKLIST = ['python', 'ipython']
DEPS_IMPORT_NAMES = { 'PyCrypto': 'Crypto',
'pytables': 'tables',
'python-dateutil': 'dateutil',
'beautifulsoup4': 'bs4',
'pandas-gbq': 'pandas_gbq'}
deps = set()
for fn in os.listdir(DEPS_DIR):
if re.match(DEPS_REGEX, fn):
logging.info("Loading deps from file '%s'..", fn)
f = open(os.path.join(DEPS_DIR, fn))
new_deps = []
for s in f.readlines():
dep = s.split('=')[0].strip()
if dep in DEPS_BLACKLIST:
continue
# rename if possible
try:
dep = DEPS_IMPORT_NAMES[dep]
except KeyError:
pass
new_deps.append(dep)
deps = deps.union(new_deps)
deps = sorted(d for d in deps)
logging.info("Checking these dependencies: %s", deps)
for dep in deps:
try:
mod = __import__(dep)
except ImportError:
logging.warning("Cannot import module '%s'", dep)
continue
try:
from http.client import HTTPResponse
logging.error("Module '%s' DOES IMPORT http.", dep)
raise Exception("Backport of http detected in module '%s'" % str(mod))
except ImportError as exc:
# as expected for py2
logging.info("Module '%s' does not import http.", mod)
So I guess, I need a local CI in order to find out why http.client
can be imported?
pandas/io/excel.py
Outdated
# | ||
# Excluding the case http.client.HTTPResponse.seek() here | ||
# because it raises an UnsupportedOperation exception | ||
if not isinstance(io, HTTPResponse) and hasattr(io, 'seek'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry a bit about this hardcoding HTTPResponse
here.
IIUC, io
could be anything that implements the __fspath__
protocol. Not all of those objects will implement seek
.
What's the harm in cactching (OSError, ValueError)
here? (I don't really know why we seek in the first place)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back in #19779 we were getting very intermittent failures on ASV tests due to a TypeError
being thrown. tbh I'm still not entirely clear on why it was happening but I believe that intermittently that buffer was being read before this code was being executed. Adding in the conditional seek back to the beginning it prevented the TypeError
.
So it's certainly a fair question on your end to which I don't have a great answer. try...catch could be an option, but then we'd probably have to seek back to the start in the catch block and maybe still have the conditional (to distinguish from other TypeError
s) so not sure there is a ton to gain there
That sounds difficult. You might try adding a `conda list` to the end of
`ci/install_travis.sh` to print out the final version of all the packages.
My also throw a `python -c "import http.client"` there too, to ensure that
our hypothesis about a backport is correct.
…On Wed, Mar 21, 2018 at 11:43 AM, Michael Röttger ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/io/excel.py
<#20437 (comment)>:
> @@ -10,6 +10,7 @@
import abc
import warnings
import numpy as np
+from http.client import HTTPResponse
Hm, I've built a local development environment mit py27 and the tests fail
as expected, because http.client cannot be imported. So it seems this is
only related to the CI mechanism..
In order to check the CI dependencies, I've made a small py2 script which
collects all dependencies from the files ci/requirements-2.7(_[A-Z]+)?\
.(build|pip|run), imports them one by one (assuming the correct version
is installed) and tries to import http.client after each import. Haven't
found a backport of http by doing this.
I've added the script and my output here for reference. I had to manually
install some of the collected dependencies before (couldn't find 'nomkl'
btw). Not sure whether this helps.
import osimport re
import logging
logging.basicConfig(level=logging.INFO)
if __name__=='__main__':
DEPS_DIR = '/home/mcrot/prj/pandas-mcrot-py2/ci'
DEPS_REGEX = r'^requirements-2.7(_[A-Z]+)?\.(build|pip|run)$'
DEPS_BLACKLIST = ['python', 'ipython']
DEPS_IMPORT_NAMES = { 'PyCrypto': 'Crypto',
'pytables': 'tables',
'python-dateutil': 'dateutil',
'beautifulsoup4': 'bs4',
'pandas-gbq': 'pandas_gbq'}
deps = set()
for fn in os.listdir(DEPS_DIR):
if re.match(DEPS_REGEX, fn):
logging.info("Loading deps from file '%s'..", fn)
f = open(os.path.join(DEPS_DIR, fn))
new_deps = []
for s in f.readlines():
dep = s.split('=')[0].strip()
if dep in DEPS_BLACKLIST:
continue
# rename if possible
try:
dep = DEPS_IMPORT_NAMES[dep]
except KeyError:
pass
new_deps.append(dep)
deps = deps.union(new_deps)
deps = sorted(d for d in deps)
logging.info("Checking these dependencies: %s", deps)
for dep in deps:
try:
mod = __import__(dep)
except ImportError:
logging.warning("Cannot import module '%s'", dep)
continue
try:
from http.client import HTTPResponse
logging.error("Module '%s' DOES IMPORT http.", dep)
raise Exception("Backport of http detected in module '%s'" % str(mod))
except ImportError as exc:
# as expected for py2
logging.info("Module '%s' does not import http.", mod)
check-backport.txt
<https://github.com/pandas-dev/pandas/files/1834240/check-backport.txt>
So I guess, I need a local CI in order to find out why http.client can be
imported?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#20437 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHInpeAh5hpMqgPugPwF7Y70MGz9cyks5tgoNDgaJpZM4SzZlx>
.
|
this might simpler, in
so you are likely getting:https://docs.python.org/2/library/httplib.html in python 2 (assuming we are doing the import from compat, which we should be doing) |
Right, but the latest CI run did an unconditional |
its possible this test is skipped on a 2.7 run? |
Possibly, but I think that the import is hit by an |
The test is not skipped and the |
Ok, the TravisCI run my branch I've had a look at of a diff of TravisCI logs of both instances and haven't found a significant difference yet. The TravisCI on my repo had to built some wheels first and couldn't resuse Cython cache as the pandas-dev instance did. That shouldn't matter?! However I'm going to temporarily add the extra output of Concerning this PR here, maybe this py2 CI issue is slowly getting a little off-topic.. maybe I should be handled elsewhere in the long run, I'm not sure how you normally deal with CI issues like that. Or there is an easy explanation I cannot see, then sorry for the noise. |
Yes, feel free to just make the changes to fix the excel bug (ideally tested locally with Py2), and we'll fix the CI in a followup. |
Ok, thanks .. it's a little weird, because this time the tests here on pandas-dev also fail for py27 as expected. However I've continued with the PR and just catched Locally the tests run both for py27 and py36, I will push now. I hope this PR does not affect negatively affect ASV tests as mentioned by @WillAyd. |
5398c6a
to
7767703
Compare
That should be equivalent, right? I just assumed it wasn't around in python2, but if so the use that. |
7767703
to
f1e50d1
Compare
Ok, I'm catching |
Great.
Could you open an issue with a small summary to what we found here about
the CI issue?
…On Thu, Mar 22, 2018 at 10:00 AM, Michael Röttger ***@***.***> wrote:
Ok, I'm catching io.UnsupportedOperation now instead of (OSError,
ValueError). It's working locally for py36 and py27. I've pushed to this
branch.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#20437 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIsROzV6mBjLbLc2acb7Y9fYe0Et4ks5tg7yMgaJpZM4SzZlx>
.
|
Done, see #20453. |
can you add the test from the issue? |
The test which reveals #20434 is already included in the code base. I openend #20434, because
I got failures for the test method I understand now that is strange that my tests didn't run, because #19926 introduced the seek() method and the tests should have failed already in the past for that PR. Probably the issue here is more related to my setup than to the code base or the CI, but I'm not sure. Now I think, a better approach would have been to wait for somebody else to confirm the issue #20434 before creating this PR here, right? |
if the test doesn't fail on CI, then either your fix doesn't actually fix anything, or the CI is not running it (or properly running it). so need to have a failure before it can get fixed. |
@mcrot @jreback I suspect this is because we don't run slow tests on py3. So we can do a couple things
We should make sure that we run the slow test against 3.6 regardless of 2.7. I'm planning some cleanup of our CI setup, to make these easier to spot. Regarding this PR, I'm fine with merging as is. I'm not sure how difficult mocking would be, but I'd like to get our tests passing again. |
let's see if this fails |
can you rebase this. happy to merge this. |
Merged in master and re-ran. @mcrot ping on green. |
Green from my point of view. Thanks @jreback @TomAugspurger. If you still need me to do a rebase (not sure), I'm back in my office in about 15h. |
Closes pandas-dev#20434. Back in pandas-dev#19779 a call of a seek() method was added. This call fails on HTTPResponse instances with an UnsupportedOperation exception, so for this case a try..except wrapper was added here.
361e9c7
to
88d91f5
Compare
Thanks for the fix, and for turning up the CI issues :) |
git diff upstream/master -u -- "*.py" | flake8 --diff
This is my first PR and I tried to follow the guideline. Sorry, if I overlooked something.
I don't have to add anything in whatsnew, because the bug isn't present in the current stable version 0.22, correct?