BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889

mcwitt · 2014-04-16T01:16:37Z

closes #6607
closes #3374

Currently, specifying options that are incompatible with the C parser in read_csv and read_table causes a silent fallback to the python engine. This can be confusing if the user has also passed options that are only supported by the C engine, which are then silently ignored. (See #6607)

For example, the commonly used option sep='\s+' causes a fallback to python which could be avoided by automatically translating this to the equivalent delim_whitespace=True, which is supported by the C engine.

There are some issues with the C parser that need to be fixed in order not to break tests with sep='\s+' which previously fell back to python:

The C parser does not correctly handle leading whitespace with delim_whitespace=True (#3374).

There is a related bug when parsing files with \r-delimited lines and missing values:

In [5]: data = 'a b c\r2 3\r4 5 6'

In [6]: pd.read_table(StringIO(data), delim_whitespace=True)
Out[6]: 
    a  b  c
0   2  3  4
1 NaN  5  6

[2 rows x 3 columns]

Summary of changes

Raise ValueError when user specifies engine='c' with C-unsupported options:

In [1]: import pandas as pd

In [2]: from pandas.compat import StringIO

In [3]: data = ' a\tb c\n 1\t2 3\n 4\t5 6'

In [4]: pd.read_table(StringIO(data), engine='c', sep='\s')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
. . .
ValueError: the 'c' engine does not support regex separators

Raise ValueError when fallback to python parser causes python-unsupported options to be ignored:

In [5]: pd.read_table(StringIO(data), sep='\s', dtype={'a': float})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
. . .
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)

Warn (new class ParserWarning) when C-unsupported options cause a fallback to python:

In [6]: pd.read_table(StringIO(data), skip_footer=1)
pandas/io/parsers.py:615: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skip_footer; you can avoid this warning by specifying engine='python'.
  ParserWarning)
Out[6]: 
    a  b c
0   1  2 3

[1 rows x 2 columns]

Raise ValueError when the user specifies both sep and delim_whitespace=True:

In [7]: pd.read_table(StringIO(data), sep='\s', delim_whitespace=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
. . .
ValueError: Specified a delimiter with both sep and delim_whitespace=True; you can only specify one.

Translate sep='\s+' to delim_whitespace=True when there are no other C-unsupported options:

In [8]: pd.read_table(StringIO(data), sep='\s+', engine='c')
Out[8]: 
   a  b  c
0  1  2  3
1  4  5  6

[2 rows x 3 columns]

Fix handling of leading whitespace in the C parser (Handling leading whitespace with delim_whitespace #3374) (add test)

# Old behavior
In [3]: pd.__version__
Out[3]: '0.13.1-663-g21565a3'

In [4]: pd.read_table(StringIO(data), delim_whitespace=True)
Out[4]: 
   Unnamed: 0  a  b   c
0           1  2  3 NaN
1           4  5  6 NaN

[2 rows x 4 columns]

# New behavior
In [9]: pd.read_table(StringIO(data), delim_whitespace=True)
Out[9]: 
   a  b  c
0  1  2  3
1  4  5  6

[2 rows x 3 columns]

Fix bug in handling of \r-delimited files (add test)
(Old behavior shown above)

# New behavior
In [10]: data = 'a b c\r2 3\r4 5 6'

In [11]: pd.read_table(StringIO(data), delim_whitespace=True)
Out[11]: 
   a  b   c
0  2  3 NaN
1  4  5   6

[2 rows x 3 columns]

Copy tests in ParserTests that fall back to python to TestPythonParser; leave copies of these tests in ParserTests with the assertion that they raise a ValueError when run under other engines
Add description of engine option to docstrings of read_table and read_csv

jreback · 2014-04-16T19:15:57Z

@mcwitt can u squash a but if possible)

will have to review this but looks good so far

mcwitt · 2014-04-16T19:53:10Z

I'm actually still a bit nervous about translating sep='\s+' to delim_whitespace=True because the C engine doesn't seem to understand multi-index:

In [4]: text = """                      A       B       C       D        E
one two three   four
a   b   10.0032 5    -0.5109 -2.3358 -0.4645  0.05076  0.3640
a   q   20      4     0.4473  1.4152  0.2834  1.00661  0.1744
x   q   30      3    -0.6662 -0.5243 -0.3580  0.89145  2.5838"""

In [5]: pd.read_table(StringIO(text), delim_whitespace=True)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
. . .
CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 9

while the python engine at least gets it partially right:

In [6]: pd.read_table(StringIO(text), sep='\s+')
Out[6]: 
                           E
one two three   four        
a   b   10.0032 5     0.3640
    q   20.0000 4     0.1744
x   q   30.0000 3     2.5838

[3 rows x 1 columns]

So the translation could potentially break some user code that reads a multi-indexed data with sep='\s+'. I'm interested in everyone's thoughts on this...

jreback · 2014-04-16T19:56:56Z

almost all of the column MultiIndex handling is handled after the header is parsed (in either engine) and code is the same for both engines

so this has to be in the header parsing code itself (which isn't that long)

hmm no tests fails for this?

mcwitt · 2014-04-16T20:01:13Z

Interesting. This hadn't previously been covered for the C parser because the relevant tests just fell back to python due to sep='\s+'. When I added the translation several tests did fail (not in the io module) but I assumed this just hadn't been implemented in the C parser and specified engine='python'.

jreback · 2014-04-16T20:03:28Z

hmm their are a number of test for mi columns and they do specifically set the engine so their should be no fallback

maybe the changes did something?

mcwitt · 2014-04-16T20:06:21Z

maybe the changes did something?

I don't think so, the example above is using master.

mcwitt · 2014-04-16T20:37:31Z

Maybe this could be fixed in a separate PR? This one is getting a bit heavy as it is...

jreback · 2014-04-16T20:58:54Z

not a problem

pls open a new issue (and ref this pr)
showing a test

then easy to fix later

jreback · 2014-04-21T12:35:52Z

@mcwitt looks good...can you do a sample session in ipython and show the changes (e.g. warnings produced, and exceptions and such), and put it in this PR description at the top (below what you have now)

any docs need to be updated? E.g. maybe put a sample ParserWarning explanation in io.rst?

pls add a release note(s) as appropriate in doc/source/release.rst (these are all bug fixes? ParserWarning I guess is an API change).

mcwitt · 2014-04-21T22:59:42Z

@jreback OK, done. Thanks for guiding me through the PR process! I'm looking forward to helping out where I can in the future.

I added a section to io.rst explaining the options that aren't supported in the C parser and introducing ParserWarning. Also added notes to the docstring that skipfooter is unsupported in the C parser and dtype is unsupported in the python parser and updated the release notes.

jreback · 2014-04-21T23:07:54Z

docs look good!

pls rebase on master and push one more time (it should say can be automatically can be merged)

ping me on green

@jorisvandenbossche any comments?

mcwitt · 2014-04-22T00:13:05Z

@jreback all systems go!

jorisvandenbossche · 2014-04-22T09:49:04Z

doc/source/io.rst

+back to python if C-unsupported options are specified. Currently, C-unsupported
+options include:
+
+  - ``sep`` other than a single character (e.g. regex separators)


The list does not need indentation, the - should just start at the beginning of the line

jorisvandenbossche · 2014-04-22T10:02:41Z

Looks like a very solid PR! (just added two very minor comments)

But one more general comment: Is the warning needed if it is only a fallback (and nothing is silently ignored)? In the sense that: does a basic user need to know there is a python and c engine?
The original issue in #6607 was that the dtype argument was ignored. That is of course something that shouldn't gone silent. But the example you show in the PR description (first point of the summary of changes: 'Warn (new class ParserWarning) when C-unsupported options cause a fallback to python'), is about falling back, not about ignoring a provided keyword argument. It seems to me that it is more important to raise an error/warning for the latter than for the first.

mcwitt · 2014-04-22T15:18:57Z

But one more general comment: Is the warning needed if it is only a fallback (and nothing is silently ignored)? In the sense that: does a basic user need to know there is a python and c engine?

It seems to me that it is more important to raise an error/warning for the latter than for the first.

This definitely makes sense. I guess that before we added the sep='\s+'->delim_whitespace=True translation it made sense to warn the user in that case at least, since there was an equivalent option that didn't require a fallback. But for the other C-unsupported options like skipfooter I don't think it makes sense to warn unless there are also C-only options specified that would then be ignored. (Maybe a ValueError would even be more appropriate in this case?)

So maybe when a C-unsupported option is detected we can scan for python-unsupported options (and vice-versa) and only if one is found raise a warning/error?

jreback · 2014-04-22T15:21:56Z

hmm.... I think now that you are checking, I would RAISE on an unsupported option being passed when the engine is explicity given (prob rare for a user), and warn that you are falling back c->python if no engine is explicity given and the option is unsupported (prob just dtype,skipfooter)?

mcwitt · 2014-04-22T19:48:41Z

I would RAISE on an unsupported option being passed when the engine is explicity given (prob rare for a user)

Shouldn't we raise when there's a fallback and the engine is given explicitly, even if there are no options that aren't supported by python? Or maybe just a warning in this case?

jreback · 2014-04-22T19:54:10Z

yes; i presume that when the engine paramater is passed the user wants NO fallback (as that would be too much magic), so if an option is illegal for that engine it SHOULD raise.

hmm...so when do we warn? when falling back? but then that would mean that a passed option is ignored by the c-engine? is their anything that it doesn't do? (aside from sep, which is now fixed).
so maybe that's the only case, e.g. sep is passed and its multi-byte, HAVE to fallback, so a warning is appropriate

mcwitt · 2014-04-22T20:09:21Z

There are two other cases that the C engine doesn't handle: the skip_footer option, and sep=None with delim_whitespace=False (in the latter case I'm assuming the python engine is able to sniff the delimiter).

Here's what I'm doing currently if one of these options is encountered:

If engine='c' is set explicitly, raise a ValueError
If there are no python-unsupported options, silent fallback to python
If there are python-unsupported options, produce a ParserWarning and fall back to python

jreback · 2014-04-22T21:12:13Z

engine='c'' ok (means raising is good)
no python unsupported options - this should have a Parserwarning - user should know about this, but it should work
are python unsupported options - raise - as you are basically ignoring an option

jreback · 2014-04-22T22:00:16Z

pls rebase

mcwitt · 2014-04-22T22:45:01Z

@jreback OK, this is done and I'm working on getting the tests to pass again. Before we were leaving the tests that fall back in TestParsers with the assertion that they produced ParserWarnings, but with the new behavior they should instead raise ValueErrors since the various test classes specify engine. I'm inclined to just remove these tests completely from ParserTests, leaving a stub with pass and a comment linking them to the corresponding tests in TestPythonParser. Does this sound reasonable?

jreback · 2014-04-22T22:50:55Z

@mcwitt

you should self.assertRaises(ValueError, f)

mcwitt · 2014-04-22T23:07:44Z

@jreback The trouble is there are tests e.g. test_malformed that look for errors of a certain type, so they raise an AssertionError when they catch my ValueError. Thus wrapping the whole test in assertRaises doesn't seem to work. I'm not sure how to fix this without changing the body of the test.

jreback · 2014-04-22T23:32:54Z

they look like valid c-engine options to me; why would they hit what you are changing?

can you post an example test?

mcwitt · 2014-04-22T23:38:59Z

skip_footer=1 in the second try...except block is causing the ValueError, which is not the error the test is expecting

jreback · 2014-04-22T23:49:29Z

oh....so that's trying to fallback then? hmm. then just change the test to pick up the ValueError when its a c-engine?

jorisvandenbossche · 2014-04-23T07:08:03Z

BTW, something else. The engine keyword seems not documented in the docstring of read_csv

mcwitt · 2014-04-23T20:09:00Z

@jorisvandenbossche I added a brief description of engine to the docstring of read_table and read_csv

@jreback OK, the tests should be passing and I've updated the summary of changes at the top.

jreback · 2014-04-23T20:09:59Z

pandas/io/parsers.py

@@ -113,7 +117,7 @@
 chunksize : int, default None
    Return TextFileReader object for iteration
 skipfooter : int, default 0
-    Number of line at bottom of file to skip
+    Number of lines at bottom of file to skip (Unsupported with C parser)


jorisvandenbossche · 2014-04-23T20:24:16Z

+1 to the very clear and informative error/warning messages (and the clear summary of the changes at the top). @mcwitt Thanks a lot for the effort!

jreback · 2014-04-23T21:38:36Z

looks good on docs and such

@mcwitt ping when green (I know you are fighting with some tests!)

-raise ValueError when engine='c' specified with unsupported options -raise ValueError when fallback to python causes options to be ignored -produce ParserWarning on fallback to python when no options ignored -fix bug in C parser with leading whitespace and \r-delimited files (add test) -translate sep='\s+' to delim_whitespace=True (add test) -raise ValueError if the user specifies both `sep` and `delim_whitespace=True` -specify engine='python' in tests with sep='\s+' and multiindex column input (work around GH 6893) -add 'engine' option to docstring of read_csv and read_table -copy tests that previously fell back to python from ParserTests to TestPythonParser and check that they raise ValueErrors when run under other engines

mcwitt · 2014-04-23T22:09:01Z

@jreback green!

BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser

jreback · 2014-04-23T22:23:44Z

@mcwitt thanks!

this is excellent!

pls go for it with other issues!

mcwitt mentioned this pull request Apr 16, 2014

read_table fails with MultiIndex input and delim_whitespace=True #6893

Closed

jreback added Bug labels Apr 21, 2014

jreback added this to the 0.14.0 milestone Apr 21, 2014

jorisvandenbossche reviewed Apr 22, 2014
View reviewed changes

jreback reviewed Apr 23, 2014
View reviewed changes

jreback added a commit that referenced this pull request Apr 23, 2014

Merge pull request #6889 from mcwitt/fix-gh6607

7168d98

BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser

jreback merged commit 7168d98 into pandas-dev:master Apr 23, 2014

mcwitt deleted the fix-gh6607 branch April 23, 2014 22:28

mcwitt mentioned this pull request Apr 26, 2014

BUG: Skipfooter disables decimal parameter #6971

Closed

mcwitt mentioned this pull request May 6, 2014

BUG: fix reading multi-index data in python parser #7029

Merged

jreback mentioned this pull request May 16, 2014

read_fwf does not support dtype argument #7141

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889

BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889

mcwitt commented Apr 16, 2014

jreback commented Apr 16, 2014

mcwitt commented Apr 16, 2014

jreback commented Apr 16, 2014

mcwitt commented Apr 16, 2014

jreback commented Apr 16, 2014

mcwitt commented Apr 16, 2014

mcwitt commented Apr 16, 2014

jreback commented Apr 16, 2014

jreback commented Apr 21, 2014

mcwitt commented Apr 21, 2014

jreback commented Apr 21, 2014

mcwitt commented Apr 22, 2014

jorisvandenbossche Apr 22, 2014

jorisvandenbossche commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

jorisvandenbossche commented Apr 23, 2014

mcwitt commented Apr 23, 2014

jreback Apr 23, 2014

jorisvandenbossche commented Apr 23, 2014

jreback commented Apr 23, 2014

mcwitt commented Apr 23, 2014

jreback commented Apr 23, 2014

BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889

BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889

Conversation

mcwitt commented Apr 16, 2014

Summary of changes

jreback commented Apr 16, 2014

mcwitt commented Apr 16, 2014

jreback commented Apr 16, 2014

mcwitt commented Apr 16, 2014

jreback commented Apr 16, 2014

mcwitt commented Apr 16, 2014

mcwitt commented Apr 16, 2014

jreback commented Apr 16, 2014

jreback commented Apr 21, 2014

mcwitt commented Apr 21, 2014

jreback commented Apr 21, 2014

mcwitt commented Apr 22, 2014

jorisvandenbossche Apr 22, 2014

Choose a reason for hiding this comment

jorisvandenbossche commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

mcwitt commented Apr 22, 2014

jreback commented Apr 22, 2014

jorisvandenbossche commented Apr 23, 2014

mcwitt commented Apr 23, 2014

jreback Apr 23, 2014

Choose a reason for hiding this comment

jorisvandenbossche commented Apr 23, 2014

jreback commented Apr 23, 2014

mcwitt commented Apr 23, 2014

jreback commented Apr 23, 2014