Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889

Merged
merged 1 commit into from
Apr 23, 2014

Conversation

mcwitt
Copy link
Contributor

@mcwitt mcwitt commented Apr 16, 2014

closes #6607
closes #3374

Currently, specifying options that are incompatible with the C parser in read_csv and read_table causes a silent fallback to the python engine. This can be confusing if the user has also passed options that are only supported by the C engine, which are then silently ignored. (See #6607)

For example, the commonly used option sep='\s+' causes a fallback to python which could be avoided by automatically translating this to the equivalent delim_whitespace=True, which is supported by the C engine.

There are some issues with the C parser that need to be fixed in order not to break tests with sep='\s+' which previously fell back to python:

The C parser does not correctly handle leading whitespace with delim_whitespace=True (#3374).

There is a related bug when parsing files with \r-delimited lines and missing values:

In [5]: data = 'a b c\r2 3\r4 5 6'

In [6]: pd.read_table(StringIO(data), delim_whitespace=True)
Out[6]: 
    a  b  c
0   2  3  4
1 NaN  5  6

[2 rows x 3 columns]

Summary of changes

  • Raise ValueError when user specifies engine='c' with C-unsupported options:
In [1]: import pandas as pd

In [2]: from pandas.compat import StringIO

In [3]: data = ' a\tb c\n 1\t2 3\n 4\t5 6'

In [4]: pd.read_table(StringIO(data), engine='c', sep='\s')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
. . .
ValueError: the 'c' engine does not support regex separators
  • Raise ValueError when fallback to python parser causes python-unsupported options to be ignored:
In [5]: pd.read_table(StringIO(data), sep='\s', dtype={'a': float})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
. . .
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)
  • Warn (new class ParserWarning) when C-unsupported options cause a fallback to python:
In [6]: pd.read_table(StringIO(data), skip_footer=1)
pandas/io/parsers.py:615: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skip_footer; you can avoid this warning by specifying engine='python'.
  ParserWarning)
Out[6]: 
    a  b c
0   1  2 3

[1 rows x 2 columns]
  • Raise ValueError when the user specifies both sep and delim_whitespace=True:
In [7]: pd.read_table(StringIO(data), sep='\s', delim_whitespace=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
. . .
ValueError: Specified a delimiter with both sep and delim_whitespace=True; you can only specify one.
  • Translate sep='\s+' to delim_whitespace=True when there are no other C-unsupported options:
In [8]: pd.read_table(StringIO(data), sep='\s+', engine='c')
Out[8]: 
   a  b  c
0  1  2  3
1  4  5  6

[2 rows x 3 columns]
# Old behavior
In [3]: pd.__version__
Out[3]: '0.13.1-663-g21565a3'

In [4]: pd.read_table(StringIO(data), delim_whitespace=True)
Out[4]: 
   Unnamed: 0  a  b   c
0           1  2  3 NaN
1           4  5  6 NaN

[2 rows x 4 columns]
# New behavior
In [9]: pd.read_table(StringIO(data), delim_whitespace=True)
Out[9]: 
   a  b  c
0  1  2  3
1  4  5  6

[2 rows x 3 columns]
  • Fix bug in handling of \r-delimited files (add test)
    (Old behavior shown above)
# New behavior
In [10]: data = 'a b c\r2 3\r4 5 6'

In [11]: pd.read_table(StringIO(data), delim_whitespace=True)
Out[11]: 
   a  b   c
0  2  3 NaN
1  4  5   6

[2 rows x 3 columns]
  • Copy tests in ParserTests that fall back to python to TestPythonParser; leave copies of these tests in ParserTests with the assertion that they raise a ValueError when run under other engines
  • Add description of engine option to docstrings of read_table and read_csv

@jreback
Copy link
Contributor

jreback commented Apr 16, 2014

@mcwitt can u squash a but if possible)

will have to review this but looks good so far

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 16, 2014

I'm actually still a bit nervous about translating sep='\s+' to delim_whitespace=True because the C engine doesn't seem to understand multi-index:

In [4]: text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

In [5]: pd.read_table(StringIO(text), delim_whitespace=True)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
. . .
CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 9

while the python engine at least gets it partially right:

In [6]: pd.read_table(StringIO(text), sep='\s+')
Out[6]: 
                           E
one two three   four        
a   b   10.0032 5     0.3640
    q   20.0000 4     0.1744
x   q   30.0000 3     2.5838

[3 rows x 1 columns]

So the translation could potentially break some user code that reads a multi-indexed data with sep='\s+'. I'm interested in everyone's thoughts on this...

@jreback
Copy link
Contributor

jreback commented Apr 16, 2014

almost all of the column MultiIndex handling is handled after the header is parsed (in either engine) and code is the same for both engines

so this has to be in the header parsing code itself (which isn't that long)

hmm no tests fails for this?

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 16, 2014

Interesting. This hadn't previously been covered for the C parser because the relevant tests just fell back to python due to sep='\s+'. When I added the translation several tests did fail (not in the io module) but I assumed this just hadn't been implemented in the C parser and specified engine='python'.

@jreback
Copy link
Contributor

jreback commented Apr 16, 2014

hmm their are a number of test for mi columns and they do specifically set the engine so their should be no fallback

maybe the changes did something?

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 16, 2014

maybe the changes did something?

I don't think so, the example above is using master.

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 16, 2014

Maybe this could be fixed in a separate PR? This one is getting a bit heavy as it is...

@jreback
Copy link
Contributor

jreback commented Apr 16, 2014

not a problem

pls open a new issue (and ref this pr)
showing a test

then easy to fix later

@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

@mcwitt looks good...can you do a sample session in ipython and show the changes (e.g. warnings produced, and exceptions and such), and put it in this PR description at the top (below what you have now)

any docs need to be updated? E.g. maybe put a sample ParserWarning explanation in io.rst?

pls add a release note(s) as appropriate in doc/source/release.rst (these are all bug fixes? ParserWarning I guess is an API change).

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 21, 2014

@jreback OK, done. Thanks for guiding me through the PR process! I'm looking forward to helping out where I can in the future.

I added a section to io.rst explaining the options that aren't supported in the C parser and introducing ParserWarning. Also added notes to the docstring that skipfooter is unsupported in the C parser and dtype is unsupported in the python parser and updated the release notes.

@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

docs look good!

pls rebase on master and push one more time (it should say can be automatically can be merged)

ping me on green

@jorisvandenbossche any comments?

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 22, 2014

@jreback all systems go!

back to python if C-unsupported options are specified. Currently, C-unsupported
options include:

- ``sep`` other than a single character (e.g. regex separators)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list does not need indentation, the - should just start at the beginning of the line

@jorisvandenbossche
Copy link
Member

Looks like a very solid PR! (just added two very minor comments)

But one more general comment: Is the warning needed if it is only a fallback (and nothing is silently ignored)? In the sense that: does a basic user need to know there is a python and c engine?
The original issue in #6607 was that the dtype argument was ignored. That is of course something that shouldn't gone silent. But the example you show in the PR description (first point of the summary of changes: 'Warn (new class ParserWarning) when C-unsupported options cause a fallback to python'), is about falling back, not about ignoring a provided keyword argument. It seems to me that it is more important to raise an error/warning for the latter than for the first.

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 22, 2014

But one more general comment: Is the warning needed if it is only a fallback (and nothing is silently ignored)? In the sense that: does a basic user need to know there is a python and c engine?

It seems to me that it is more important to raise an error/warning for the latter than for the first.

This definitely makes sense. I guess that before we added the sep='\s+'->delim_whitespace=True translation it made sense to warn the user in that case at least, since there was an equivalent option that didn't require a fallback. But for the other C-unsupported options like skipfooter I don't think it makes sense to warn unless there are also C-only options specified that would then be ignored. (Maybe a ValueError would even be more appropriate in this case?)

So maybe when a C-unsupported option is detected we can scan for python-unsupported options (and vice-versa) and only if one is found raise a warning/error?

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

hmm.... I think now that you are checking, I would RAISE on an unsupported option being passed when the engine is explicity given (prob rare for a user), and warn that you are falling back c->python if no engine is explicity given and the option is unsupported (prob just dtype,skipfooter)?

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 22, 2014

I would RAISE on an unsupported option being passed when the engine is explicity given (prob rare for a user)

Shouldn't we raise when there's a fallback and the engine is given explicitly, even if there are no options that aren't supported by python? Or maybe just a warning in this case?

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

yes; i presume that when the engine paramater is passed the user wants NO fallback (as that would be too much magic), so if an option is illegal for that engine it SHOULD raise.

hmm...so when do we warn? when falling back? but then that would mean that a passed option is ignored by the c-engine? is their anything that it doesn't do? (aside from sep, which is now fixed).
so maybe that's the only case, e.g. sep is passed and its multi-byte, HAVE to fallback, so a warning is appropriate

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 22, 2014

There are two other cases that the C engine doesn't handle: the skip_footer option, and sep=None with delim_whitespace=False (in the latter case I'm assuming the python engine is able to sniff the delimiter).

Here's what I'm doing currently if one of these options is encountered:

If engine='c' is set explicitly, raise a ValueError
If there are no python-unsupported options, silent fallback to python
If there are python-unsupported options, produce a ParserWarning and fall back to python

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

  • engine='c'' ok (means raising is good)
  • no python unsupported options - this should have a Parserwarning - user should know about this, but it should work
  • are python unsupported options - raise - as you are basically ignoring an option

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

pls rebase

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 22, 2014

@jreback OK, this is done and I'm working on getting the tests to pass again. Before we were leaving the tests that fall back in TestParsers with the assertion that they produced ParserWarnings, but with the new behavior they should instead raise ValueErrors since the various test classes specify engine. I'm inclined to just remove these tests completely from ParserTests, leaving a stub with pass and a comment linking them to the corresponding tests in TestPythonParser. Does this sound reasonable?

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

@mcwitt

you should self.assertRaises(ValueError, f)

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 22, 2014

@jreback The trouble is there are tests e.g. test_malformed that look for errors of a certain type, so they raise an AssertionError when they catch my ValueError. Thus wrapping the whole test in assertRaises doesn't seem to work. I'm not sure how to fix this without changing the body of the test.

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

they look like valid c-engine options to me; why would they hit what you are changing?

can you post an example test?

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 22, 2014

skip_footer=1 in the second try...except block is causing the ValueError, which is not the error the test is expecting

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

oh....so that's trying to fallback then? hmm. then just change the test to pick up the ValueError when its a c-engine?

@jorisvandenbossche
Copy link
Member

BTW, something else. The engine keyword seems not documented in the docstring of read_csv

@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 23, 2014

@jorisvandenbossche I added a brief description of engine to the docstring of read_table and read_csv

@jreback OK, the tests should be passing and I've updated the summary of changes at the top.

@@ -113,7 +117,7 @@
chunksize : int, default None
Return TextFileReader object for iteration
skipfooter : int, default 0
Number of line at bottom of file to skip
Number of lines at bottom of file to skip (Unsupported with C parser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@jorisvandenbossche
Copy link
Member

+1 to the very clear and informative error/warning messages (and the clear summary of the changes at the top). @mcwitt Thanks a lot for the effort!

@jreback
Copy link
Contributor

jreback commented Apr 23, 2014

looks good on docs and such

@mcwitt ping when green (I know you are fighting with some tests!)

-raise ValueError when engine='c' specified with unsupported options
-raise ValueError when fallback to python causes options to be ignored
-produce ParserWarning on fallback to python when no options ignored
-fix bug in C parser with leading whitespace and \r-delimited files
 (add test)
-translate sep='\s+' to delim_whitespace=True (add test)
-raise ValueError if the user specifies both `sep` and
 `delim_whitespace=True`
-specify engine='python' in tests with sep='\s+' and multiindex column
 input (work around GH 6893)
-add 'engine' option to docstring of read_csv and read_table
-copy tests that previously fell back to python from ParserTests to
 TestPythonParser and check that they raise ValueErrors when run
 under other engines
@mcwitt
Copy link
Contributor Author

mcwitt commented Apr 23, 2014

@jreback green!

jreback added a commit that referenced this pull request Apr 23, 2014
BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser
@jreback jreback merged commit 7168d98 into pandas-dev:master Apr 23, 2014
@jreback
Copy link
Contributor

jreback commented Apr 23, 2014

@mcwitt thanks!

this is excellent!

pls go for it with other issues!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants