Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Patch for skipping seek() when loading Excel files from url #20437

Merged

Conversation

mcrot
Copy link
Contributor

@mcrot mcrot commented Mar 21, 2018

This is my first PR and I tried to follow the guideline. Sorry, if I overlooked something.
I don't have to add anything in whatsnew, because the bug isn't present in the current stable version 0.22, correct?

@codecov
Copy link

codecov bot commented Mar 21, 2018

Codecov Report

Merging #20437 into master will increase coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20437      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%     
==========================================
  Files         153      153              
  Lines       49256    49256              
==========================================
+ Hits        45229    45241      +12     
+ Misses       4027     4015      -12
Flag Coverage Δ
#multiple 90.23% <ø> (+0.02%) ⬆️
#single 41.9% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/plotting/_converter.py 66.81% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73cb32e...88d91f5. Read the comment docs.

@TomAugspurger
Copy link
Contributor

cc @WillAyd

@WillAyd
Copy link
Member

WillAyd commented Mar 21, 2018

Pretty simple change so lgtm - fine with it if you are Tom. Thanks for the contribution @mcrot

@@ -10,6 +10,7 @@
import abc
import warnings
import numpy as np
from http.client import HTTPResponse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this exist in py2? May have to import from pandas.compat.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback, Tom. You're right, http.client was named httplib in py2, I forgot to check py2 compatibility, sorry. So maybe import pandas.compat.httplib as httplib and using httplib.HTTPResponse would work, but I will think about your and @WillAyd's comment first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this have failed on the AppVeyor 2.7 build?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it worked .. I'm just trying to figure out why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I've no idea why the code worked for Python 2.7 - not only in AppVeyor, but also in TravisCI and CircleCI. In all of them pytest reports to use py 2.7.14. And the docs clearly state that http.client has come up with py3. As example, http.client is not available in my local py27 environment:

Python 2.7.14 | packaged by conda-forge | (default, Dec 25 2017, 01:16:05) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from http.client import HTTPResponse
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named http.client

How is this possible, is there some CI magic I haven't understood yet?
I've also searched for some other compatibility libs which may have used in the pandas code base in order to achieve this, but haven't found any.

I think it would be good for me to understand this before implementing a py2 compatible fix here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you able to set up a python2 dev environment and reproduce locally? That may be easier than going through the CI logs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tested on your branch in my py2 env

In [1]: import pandas as pd
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-af55e7023913> in <module>()
----> 1 import pandas as pd

/Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/__init__.pyc in <module>()
     55
     56 from pandas.util._print_versions import show_versions
---> 57 from pandas.io.api import *
     58 from pandas.util._tester import test
     59 import pandas.testing

/Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/io/api.py in <module>()
      7 from pandas.io.parsers import read_csv, read_table, read_fwf
      8 from pandas.io.clipboards import read_clipboard
----> 9 from pandas.io.excel import ExcelFile, ExcelWriter, read_excel
     10 from pandas.io.pytables import HDFStore, get_store, read_hdf
     11 from pandas.io.json import read_json

/Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/io/excel.py in <module>()
     11 import warnings
     12 import numpy as np
---> 13 from http.client import HTTPResponse
     14
     15 from pandas.core.dtypes.common import (

ImportError: No module named http.client

So something strange is going on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if one of our dependencies installs an http backport?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to use tox -e py27 in order to test with python 2.7 - that did not work, because the bigquery dependency could not be satisfied. I'll try to create a separate directory and build with python 2.7, that's maybe a better idea.

I'll also have a look if there are dependecies to a http backport...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I've built a local development environment mit py27 and the tests fail as expected, because http.client cannot be imported. So it seems this is only related to the CI mechanism..

In order to check the CI dependencies, I've made a small py2 script which collects all dependencies from the files ci/requirements-2.7(_[A-Z]+)?\.(build|pip|run), imports them one by one (assuming the correct version is installed) and tries to import http.client after each import. Haven't found a backport of http by doing this.

I've added the script and my output here for reference. I had to manually install some of the collected dependencies before (couldn't find 'nomkl' btw). Not sure whether this helps.

import os
import re

import logging
logging.basicConfig(level=logging.INFO)
    
if __name__=='__main__':
    DEPS_DIR = '/home/mcrot/prj/pandas-mcrot-py2/ci'
    DEPS_REGEX = r'^requirements-2.7(_[A-Z]+)?\.(build|pip|run)$'

    DEPS_BLACKLIST = ['python', 'ipython']

    DEPS_IMPORT_NAMES = { 'PyCrypto': 'Crypto',
                          'pytables': 'tables',
                          'python-dateutil': 'dateutil',
                          'beautifulsoup4': 'bs4',
                          'pandas-gbq': 'pandas_gbq'}
        
    deps = set()
    
    for fn in os.listdir(DEPS_DIR):
        if re.match(DEPS_REGEX, fn):
            logging.info("Loading deps from file '%s'..", fn)

            f = open(os.path.join(DEPS_DIR, fn))

            new_deps = []
            for s in f.readlines():
                dep = s.split('=')[0].strip()

                if dep in DEPS_BLACKLIST:
                    continue
                
                # rename if possible
                try:
                    dep = DEPS_IMPORT_NAMES[dep]
                except KeyError:
                    pass

                new_deps.append(dep)
            
            deps = deps.union(new_deps)

    deps = sorted(d for d in deps)

    logging.info("Checking these dependencies: %s", deps)
                
    for dep in deps:

        try:
            mod = __import__(dep)
        except ImportError:
            logging.warning("Cannot import module '%s'", dep)
            continue
            
        try:
            from http.client import HTTPResponse
            logging.error("Module '%s' DOES IMPORT http.", dep)
            raise Exception("Backport of http detected in module '%s'" % str(mod))
        except ImportError as exc:
            # as expected for py2
            logging.info("Module '%s' does not import http.", mod)

check-backport.txt

So I guess, I need a local CI in order to find out why http.client can be imported?

#
# Excluding the case http.client.HTTPResponse.seek() here
# because it raises an UnsupportedOperation exception
if not isinstance(io, HTTPResponse) and hasattr(io, 'seek'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry a bit about this hardcoding HTTPResponse here.

IIUC, io could be anything that implements the __fspath__ protocol. Not all of those objects will implement seek.

What's the harm in cactching (OSError, ValueError) here? (I don't really know why we seek in the first place)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back in #19779 we were getting very intermittent failures on ASV tests due to a TypeError being thrown. tbh I'm still not entirely clear on why it was happening but I believe that intermittently that buffer was being read before this code was being executed. Adding in the conditional seek back to the beginning it prevented the TypeError.

So it's certainly a fair question on your end to which I don't have a great answer. try...catch could be an option, but then we'd probably have to seek back to the start in the catch block and maybe still have the conditional (to distinguish from other TypeErrors) so not sure there is a ton to gain there

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 21, 2018 via email

@jreback
Copy link
Contributor

jreback commented Mar 21, 2018

this might simpler, in pandas/compat/__init__.py we already do

try:
    import __builtin__ as builtins
    # not writeable when instantiated with string, doesn't handle unicode well
    from cStringIO import StringIO as cStringIO
    # always writeable
    from StringIO import StringIO
    BytesIO = StringIO
    import cPickle
    import httplib
except ImportError:
    import builtins
    from io import StringIO, BytesIO
    cStringIO = StringIO
    import pickle as cPickle
    import http.client as httplib

so you are likely getting:https://docs.python.org/2/library/httplib.html in python 2 (assuming we are doing the import from compat, which we should be doing)

@TomAugspurger
Copy link
Contributor

which we should be doing

Right, but the latest CI run did an unconditional import http.client, which should have failed on py2 (and does for us locally, but not on any of the CI runs).

@jreback
Copy link
Contributor

jreback commented Mar 21, 2018

its possible this test is skipped on a 2.7 run?

@TomAugspurger
Copy link
Contributor

Possibly, but I think that the import is hit by an import pandas, so that shouldn't matter. The entire test suite should fail (unless I'm misunderstanding something).

@mcrot
Copy link
Contributor Author

mcrot commented Mar 21, 2018

The test is not skipped and the import http.client statement is on top of pandas.io.excel and unconditional, so it is executed on every test which imports the excel modul.
I've created a new branch in my repo, activated TravisCI and added the suggested commands at the end of the install script. Let's see..

@mcrot
Copy link
Contributor Author

mcrot commented Mar 22, 2018

Ok, the TravisCI run my branch why-does-import-http-client-work-in-py2 failed already on the first import pandas as @TomAugspurger expected, see here. I don't understand why the pandas-dev TravisCI runs pass without problems - and also for the other two CI environments.

I've had a look at of a diff of TravisCI logs of both instances and haven't found a significant difference yet. The TravisCI on my repo had to built some wheels first and couldn't resuse Cython cache as the pandas-dev instance did. That shouldn't matter?!

However I'm going to temporarily add the extra output of conda list at the end of travis_install.sh also to this PR in order to see its output on pandas-dev TravisCI instance. Just to be sure we have the same package versions everywhere.

Concerning this PR here, maybe this py2 CI issue is slowly getting a little off-topic.. maybe I should be handled elsewhere in the long run, I'm not sure how you normally deal with CI issues like that. Or there is an easy explanation I cannot see, then sorry for the noise.

@TomAugspurger
Copy link
Contributor

Concerning this PR here, maybe this py2 CI issue is slowly getting a little off-topic.. maybe I should be handled elsewhere in the long run

Yes, feel free to just make the changes to fix the excel bug (ideally tested locally with Py2), and we'll fix the CI in a followup.

@mcrot
Copy link
Contributor Author

mcrot commented Mar 22, 2018

Ok, thanks .. it's a little weird, because this time the tests here on pandas-dev also fail for py27 as expected.

However I've continued with the PR and just catched OSError and ValueError as suggested by @TomAugspurger. Btw. would catching io.UnsupportedOperation be too specific? It is available in py2 and py3.

Locally the tests run both for py27 and py36, I will push now. I hope this PR does not affect negatively affect ASV tests as mentioned by @WillAyd.

@mcrot mcrot force-pushed the fix-unsupported-seek-for-httpresponse branch from 5398c6a to 7767703 Compare March 22, 2018 14:02
@TomAugspurger
Copy link
Contributor

Btw. would catching io.UnsupportedOperation be too specific? It is available in py2 and py3.

That should be equivalent, right? I just assumed it wasn't around in python2, but if so the use that.

@mcrot mcrot force-pushed the fix-unsupported-seek-for-httpresponse branch from 7767703 to f1e50d1 Compare March 22, 2018 14:57
@mcrot
Copy link
Contributor Author

mcrot commented Mar 22, 2018

Ok, I'm catching io.UnsupportedOperation now instead of (OSError, ValueError). It's working locally for py36 and py27. I've pushed to this branch.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 22, 2018 via email

@mcrot
Copy link
Contributor Author

mcrot commented Mar 22, 2018

Done, see #20453.

@jreback jreback added Testing pandas testing functions or related to the test suite IO Excel read_excel, to_excel labels Mar 22, 2018
@jreback
Copy link
Contributor

jreback commented Mar 22, 2018

can you add the test from the issue?

@mcrot
Copy link
Contributor Author

mcrot commented Mar 23, 2018

The test which reveals #20434 is already included in the code base. I openend #20434, because
I'd just setup a working environment following the guide Contributing to pandas and afterwards, when running the tests, especially

pytest pandas/tests/io/test_excel.py

I got failures for the test method TestXlrdReader.test_read_from_http_url. The failure is described in #20434.

I understand now that is strange that my tests didn't run, because #19926 introduced the seek() method and the tests should have failed already in the past for that PR. Probably the issue here is more related to my setup than to the code base or the CI, but I'm not sure. Now I think, a better approach would have been to wait for somebody else to confirm the issue #20434 before creating this PR here, right?

@jreback
Copy link
Contributor

jreback commented Mar 24, 2018

if the test doesn't fail on CI, then either your fix doesn't actually fix anything, or the CI is not running it (or properly running it). so need to have a failure before it can get fixed.

@TomAugspurger
Copy link
Contributor

@mcrot @jreback I suspect this is because we don't run slow tests on py3. So we can do a couple things

  • mock the test at the network layer and remove the slow marker
  • ensure that at least one of our slow jobs in running on 3.6

We should make sure that we run the slow test against 3.6 regardless of 2.7. I'm planning some cleanup of our CI setup, to make these easier to spot.

Regarding this PR, I'm fine with merging as is. I'm not sure how difficult mocking would be, but I'd like to get our tests passing again.

@jreback
Copy link
Contributor

jreback commented Mar 30, 2018

#20559

let's see if this fails

@jreback
Copy link
Contributor

jreback commented Apr 1, 2018

can you rebase this. happy to merge this.

@jreback jreback added this to the 0.23.0 milestone Apr 1, 2018
@TomAugspurger
Copy link
Contributor

Merged in master and re-ran. @mcrot ping on green.

@mcrot
Copy link
Contributor Author

mcrot commented Apr 2, 2018

Green from my point of view. Thanks @jreback @TomAugspurger. If you still need me to do a rebase (not sure), I'm back in my office in about 15h.

Closes pandas-dev#20434.

Back in pandas-dev#19779 a call of a seek() method was added. This call fails
on HTTPResponse instances with an UnsupportedOperation exception,
so for this case a try..except wrapper was added here.
@mcrot mcrot force-pushed the fix-unsupported-seek-for-httpresponse branch from 361e9c7 to 88d91f5 Compare April 3, 2018 14:14
@TomAugspurger TomAugspurger merged commit aa3fefc into pandas-dev:master Apr 3, 2018
@TomAugspurger
Copy link
Contributor

Thanks for the fix, and for turning up the CI issues :)

@mcrot mcrot deleted the fix-unsupported-seek-for-httpresponse branch April 3, 2018 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UnsupportedOperation 'seek' when loading excel files from url
4 participants