BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser #6889
@mcwitt can you squash a bit (if possible)? will have to review this but looks good so far
I'm actually still a bit nervous about translating
In [4]: text = """ A B C D E
one two three four
a b 10.0032 5 -0.5109 -2.3358 -0.4645 0.05076 0.3640
a q 20 4 0.4473 1.4152 0.2834 1.00661 0.1744
x q 30 3 -0.6662 -0.5243 -0.3580 0.89145 2.5838"""
In [5]: pd.read_table(StringIO(text), delim_whitespace=True)
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
. . .
CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 9
while the python engine at least gets it partially right:
In [6]: pd.read_table(StringIO(text), sep='\s+')
Out[6]:
E
one two three four
a b 10.0032 5 0.3640
q 20.0000 4 0.1744
x q 30.0000 3 2.5838
[3 rows x 1 columns]
So the translation could potentially break some user code that reads multi-indexed data with
almost all of the column MultiIndex handling happens after the header is parsed (in either engine), and the code is the same for both engines, so this has to be in the header parsing code itself (which isn't that long). hmm, no tests fail for this?
Interesting. This hadn't previously been covered for the C parser because the relevant tests just fell back to python due to
hmm there are a number of tests for mi columns and they do specifically set the engine, so there should be no fallback. maybe the changes did something?
I don't think so, the example above is using master.
Maybe this could be fixed in a separate PR? This one is getting a bit heavy as it is...
not a problem; pls open a new issue (and ref this PR), then easy to fix later
@mcwitt looks good... can you do a sample session in ipython and show the changes (e.g. warnings produced, exceptions and such), and put it in this PR description at the top (below what you have now)? any docs need to be updated? E.g. maybe put a sample ParserWarning explanation in io.rst? pls add release note(s) as appropriate in doc/source/release.rst (these are all bug fixes? ParserWarning I guess is an API change).
@jreback OK, done. Thanks for guiding me through the PR process! I'm looking forward to helping out where I can in the future. I added a section to
docs look good! pls rebase on master and push one more time (it should say it can be automatically merged). ping me on green. @jorisvandenbossche any comments?
@jreback all systems go!
back to python if C-unsupported options are specified. Currently, C-unsupported
options include:

- ``sep`` other than a single character (e.g. regex separators)
The list does not need indentation, the `-` should just start at the beginning of the line
Looks like a very solid PR! (just added two very minor comments) But one more general comment: Is the warning needed if it is only a fallback (and nothing is silently ignored)? In the sense that: does a basic user need to know there is a python and c engine?
This definitely makes sense. I guess that before we added the So maybe when a C-unsupported option is detected we can scan for python-unsupported options (and vice-versa) and only if one is found raise a warning/error?
hmm.... I think now that you are checking, I would RAISE on an unsupported option being passed when the engine is explicitly given (prob rare for a user), and warn that you are falling back c->python if no engine is explicitly given and the option is unsupported (prob just
Shouldn't we raise when there's a fallback and the engine is given explicitly, even if there are no options that aren't supported by python? Or maybe just a warning in this case?
yes; I presume that when the engine parameter is passed the user wants NO fallback (as that would be too much magic), so if an option is illegal for that engine it SHOULD raise. hmm...so when do we warn? when falling back? but then that would mean that a passed option is ignored by the c-engine? is there anything that it doesn't do? (aside from sep, which is now fixed).
There are two other cases that the C engine doesn't handle: the Here's what I'm doing currently if one of these options is encountered: If
pls rebase
@jreback OK, this is done and I'm working on getting the tests to pass again. Before we were leaving the tests that fall back in
you should
@jreback The trouble is there are tests e.g.
they look like valid c-engine options to me; why would they hit what you are changing? can you post an example test?
oh....so that's trying to fall back then? hmm. then just change the test to pick up the ValueError when it's a c-engine?
BTW, something else. The
@jorisvandenbossche I added a brief description of @jreback OK, the tests should be passing and I've updated the summary of changes at the top.
@@ -113,7 +117,7 @@
 chunksize : int, default None
     Return TextFileReader object for iteration
 skipfooter : int, default 0
-    Number of line at bottom of file to skip
+    Number of lines at bottom of file to skip (Unsupported with C parser)
nice!
+1 to the very clear and informative error/warning messages (and the clear summary of the changes at the top). @mcwitt Thanks a lot for the effort!
looks good on docs and such @mcwitt ping when green (I know you are fighting with some tests!)
-raise ValueError when engine='c' specified with unsupported options
-raise ValueError when fallback to python causes options to be ignored
-produce ParserWarning on fallback to python when no options ignored
-fix bug in C parser with leading whitespace and \r-delimited files (add test)
-translate sep='\s+' to delim_whitespace=True (add test)
-raise ValueError if the user specifies both `sep` and `delim_whitespace=True`
-specify engine='python' in tests with sep='\s+' and multiindex column input (work around GH 6893)
-add 'engine' option to docstring of read_csv and read_table
-copy tests that previously fell back to python from ParserTests to TestPythonParser and check that they raise ValueErrors when run under other engines
@jreback green!
BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser
@mcwitt thanks! this is excellent! pls go for it with other issues!
closes #6607
closes #3374
Currently, specifying options that are incompatible with the C parser in `read_csv` and `read_table` causes a silent fallback to the python engine. This can be confusing if the user has also passed options that are only supported by the C engine, which are then silently ignored. (See #6607)

For example, the commonly used option `sep='\s+'` causes a fallback to python which could be avoided by automatically translating this to the equivalent `delim_whitespace=True`, which is supported by the C engine.

There are some issues with the C parser that need to be fixed in order not to break tests with `sep='\s+'` which previously fell back to python:

- The C parser does not correctly handle leading whitespace with `delim_whitespace=True` (#3374).
- There is a related bug when parsing files with \r-delimited lines and missing values:
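
As a rough illustration of why leading whitespace matters for whitespace-delimited input, here is a pure-Python sketch. It is not the C tokenizer and not the example from the original report; the sample line is made up for illustration. A naive split on `\s+` produces an empty first field when a line starts with whitespace, which is how the field count can end up off by one.

```python
import re

# Hypothetical sample row with leading whitespace (not taken from the report).
line = "  a b  10.0032 5"

# Splitting on the regex keeps an empty string before the first separator.
naive = re.split(r"\s+", line)

# str.split() with no arguments skips leading/trailing whitespace first.
fields = line.split()

print(naive)   # ['', 'a', 'b', '10.0032', '5']
print(fields)  # ['a', 'b', '10.0032', '5']
```
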
Summary of changes

- Raise `ValueError` when user specifies `engine='c'` with C-unsupported options:
- Raise `ValueError` when fallback to python parser causes python-unsupported options to be ignored:
- Raise `ValueError` when the user specifies both `sep` and `delim_whitespace=True`:
- Translate `sep='\s+'` to `delim_whitespace=True` when there are no other C-unsupported options: (Old behavior shown above)
- Copy tests in `ParserTests` that fall back to python to `TestPythonParser`; leave copies of these tests in `ParserTests` with the assertion that they raise a `ValueError` when run under other engines
- Add `engine` option to docstrings of `read_table` and `read_csv`
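
The rules in this summary can be sketched as a small standalone function. This is only a sketch under stated assumptions, not the code in this PR: `resolve_engine`, the two option sets, and the local `ParserWarning` class are hypothetical stand-ins, and `low_memory` is used as an example of a python-unsupported option purely for illustration.

```python
import warnings

# Illustrative option sets (assumptions); the real lists live in pandas.
C_UNSUPPORTED = {"skipfooter"}        # marked "Unsupported with C parser" in the diff
PYTHON_UNSUPPORTED = {"low_memory"}   # assumed example of a C-only option


class ParserWarning(Warning):
    """Local stand-in for the warning class discussed in this PR."""


def resolve_engine(engine=None, sep=None, delim_whitespace=False, **options):
    """Return (engine, sep, delim_whitespace) following the summarised rules."""
    if sep is not None and delim_whitespace:
        raise ValueError("specify either `sep` or `delim_whitespace=True`, not both")

    c_bad = sorted(set(options) & C_UNSUPPORTED)
    py_bad = sorted(set(options) & PYTHON_UNSUPPORTED)

    if engine == "python":
        if py_bad:
            raise ValueError(f"the 'python' engine does not support: {py_bad}")
        return "python", sep, delim_whitespace

    # From here on the engine is 'c' or unspecified.
    if sep == r"\s+" and not c_bad:
        # Translate the whitespace regex so the C engine can be used.
        sep, delim_whitespace = None, True
    elif sep is not None and len(sep) > 1:
        # Any other multi-character separator is treated as a regex.
        c_bad.append("sep")

    if not c_bad:
        return "c", sep, delim_whitespace
    if engine == "c":
        raise ValueError(f"the 'c' engine does not support: {c_bad}")
    if py_bad:
        raise ValueError(f"falling back to 'python' would ignore: {py_bad}")
    warnings.warn(f"falling back to the 'python' engine because of: {c_bad}",
                  ParserWarning)
    return "python", sep, delim_whitespace
```

For example, `resolve_engine(sep=r"\s+")` stays on the C engine with `delim_whitespace=True`, while `resolve_engine(skipfooter=1)` warns and returns the python engine. Raising only when an option would otherwise be ignored keeps the plain fallback case non-fatal.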