BUG: read_table() ignores dtype argument when multi-character separator is specified #6607
Comments
this might be a dup of #4363. would you like to take a stab at fixing it?
Sure, I would like to take a look at it. |
https://github.com/pydata/pandas/wiki
lots of tests in io/tests/test_parser.py that u can model
like this might affect python and/or c parsers
lmk if u need help
OK, thanks for the info. I'll get started looking at the wiki. On Tue, Mar 11, 2014 at 5:08 PM, jreback [email protected] wrote:
I guess I'm running into the same thing
It appears that the C parser isn't able to handle regular expressions (yet?) and when a multi-character separator is passed
    if sep is None and not delim_whitespace:
        if engine == 'c':
            engine = 'python'
    elif sep is not None and len(sep) > 1:
        # wait until regex engine integrated
        if engine not in ('python', 'python-fwf'):
            engine = 'python'
But the Python engine doesn't understand
In my case, where fields are delimited by whitespace, I can replace I'm thinking it would be polite to warn the user when falling back to the Python parser causes options to be ignored, but maybe this isn't worth the trouble if the C engine will be able to handle regexps in the future. |
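For illustration, the warn-on-fallback idea could be sketched roughly as below. This is only a sketch built around the branch quoted above: `choose_engine` and `FallbackWarning` are hypothetical names invented for the example, not pandas API, and the real selection logic involves more options than are shown here.

```python
import warnings


class FallbackWarning(UserWarning):
    """Hypothetical warning category, used only in this sketch."""


def choose_engine(sep, delim_whitespace=False, engine='c'):
    """Return the engine that would actually be used, warning on fallback.

    Hypothetical helper: the branch structure mirrors the snippet quoted
    in the comment above; the warning is the proposed addition.
    """
    requested = engine
    if sep is None and not delim_whitespace:
        if engine == 'c':
            engine = 'python'
    elif sep is not None and len(sep) > 1:
        # a multi-character separator is treated as a regular expression
        if engine not in ('python', 'python-fwf'):
            engine = 'python'
    if engine != requested:
        # tell the user instead of switching engines silently
        warnings.warn(
            "falling back to the %r engine because the %r engine does not "
            "support this separator; engine-specific options may be ignored"
            % (engine, requested),
            FallbackWarning,
            stacklevel=2,
        )
    return engine
```

With this sketch, `choose_engine(r'\s+')` returns `'python'` and emits one `FallbackWarning`, while `choose_engine(',')` returns `'c'` without warning.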
no, the PR is to fix this in the python parser. c-parser cannot handle a regex. Not easy to fix, so no time soon on this.
OK, agreed. On Mon, Mar 17, 2014 at 5:14 AM, jreback [email protected] wrote:
I've made some progress on this. As mentioned above, the C engine is silently failing over to python in the case of a regexp separator (and several other cases). Consequently options specific to the C parser (
(I'm also thinking it might be nice to alias So far I've made some progress on (2), but am having second thoughts about whether this is the best way to go. Would appreciate any advice. |
good analysis, I think doing a combination would be worthwhile
great job on finding the issues! |
OK, I've implemented all of this except instead of translating
(I had added the translation but this was producing errors in several tests in parser_tests.py (e.g. I have also added |
I think you can do the translation, and just check that the c-engine output is == to the python-engine output (you will have to do this in a separate test class probably). The other test classes uses
I think this would help with your problem about triggering warnings. Essentially have it run only once, and have the tests explicitly pass an engine (as you are explicitly testing certain engine behavior)
OK, maybe There are several tests in The test |
go ahead and create TestCompareParsers.
if u can, take tests that currently just fall back (eg the skip footer one), put a copy of the test that falls back and assert for the parser warning - so when the issue is eventually fixed that test will no longer produce a warning and will then fail (alerting the person changing the code). you can put a big comment around the test as well.
what I find is that I insert code to fix an issue rather than trying to find if there is an existing test for the issue (which there usually is not) - so this will provide essentially a nice alert
so bottom line is that a test that currently falls back on a failed option should be moved to python parser |
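A rough sketch of what such a "tripwire" test might look like, using only the standard library. Everything here is hypothetical and for illustration only: `parse` is a stand-in for a reader that falls back to the python engine, and `FallbackWarning` is an invented warning class, not the one used in pandas.

```python
import re
import unittest
import warnings


class FallbackWarning(UserWarning):
    """Hypothetical warning category, used only in this sketch."""


def parse(text, sep, engine='c'):
    """Hypothetical stand-in for a reader that may fall back to 'python'."""
    if len(sep) > 1 and engine == 'c':
        warnings.warn("falling back to the 'python' engine", FallbackWarning)
        engine = 'python'
    if len(sep) > 1:
        # multi-character separators are treated as regular expressions
        rows = [re.split(sep, line) for line in text.split('\n')]
    else:
        rows = [line.split(sep) for line in text.split('\n')]
    return engine, rows


class TestFallback(unittest.TestCase):
    def test_regex_sep_falls_back(self):
        # NOTE: this asserts the *current* fallback behaviour. If the 'c'
        # engine ever handles regex separators, no warning is emitted,
        # assertWarns fails, and whoever changed the code is alerted that
        # this test should be revisited.
        with self.assertWarns(FallbackWarning):
            engine, rows = parse('a  b\n1  2', sep=r'\s+', engine='c')
        self.assertEqual(engine, 'python')
        self.assertEqual(rows, [['a', 'b'], ['1', '2']])
```

Saved as a module, this can be run with `python -m unittest`. The point of the design is that the test passes only while the fallback still happens.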
@mcwitt how's this coming? |
Sorry for the delay. The warnings and translation of On Wed, Apr 9, 2014 at 1:03 PM, jreback [email protected] wrote:
great! this would definitely be a welcome addition!
OK, all of the tests are passing except for several that cause an issue with the C parser:
The failures seem to be caused by the C parser adding an extra column when lines begin with spaces. For example
In [3]: data = ' a b c\n 1 2 3\n 4 5 6'
In [4]: pd.read_table(StringIO(data), engine='c', delim_whitespace=True)
Out[4]:
   Unnamed: 0  a  b  c
0         NaN  1  2  3
1         NaN  4  5  6
[2 rows x 4 columns]
Previously these tests fell back to the python parser (due to
In [5]: pd.read_table(StringIO(data), engine='python', sep='\s+')
Out[5]:
   a  b  c
0  1  2  3
1  4  5  6
[2 rows x 3 columns]
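One way to see how an extra leading column can arise, sketched with the standard `re` module only. This is not a claim about how either pandas parser is implemented; it just shows that splitting a line with leading whitespace on `\s+` yields an empty first field unless the line is stripped first.

```python
import re

data = ' a b c\n 1 2 3\n 4 5 6'
pattern = re.compile(r'\s+')

# Splitting each raw line keeps an empty first field for the leading space,
# which would show up as an extra, unnamed column.
raw = [pattern.split(line) for line in data.split('\n')]

# Stripping each line before splitting drops that empty field.
stripped = [pattern.split(line.strip()) for line in data.split('\n')]

print(raw[0])       # ['', 'a', 'b', 'c']
print(stripped[0])  # ['a', 'b', 'c']
```

The first variant produces four fields per row, matching the shape of the 4-column output above; the second produces three, matching the 3-column output.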
hmm. seems that the c parser is wrong there (though of course there may be some that work around this wrongness!). it makes more sense to do what the python parser is doing. are the tests actually different? or is it the c-parser?
Looks like #3374 is relevant.
Not sure what you mean... The tests expect the behavior of the python parser because previously they fell back to python due to |
yep...looks like #3374 should be closed by you as well. ok... can you fix the c-parser then? (I meant that sometimes you know that something is broken so you write a test that checks the broken behavior), sort of wrong but you need a test, so it happens. |
and definitely need to raise if BOTH
I'll look at this.
Agreed. I think a |
yep |
I realize this looks like a mess... unfortunately I don't know how it can be made any cleaner. Copying tests that fall thru to python from Maybe it would be best to remove these tests from ParserTests for now, leaving just a comment? I have added a dedicated test that a |
related #4363
closes #3374
Here is a minimal example:
Here the dtype argument behaves as expected, and column A has type float. However with sep='\s' the dtype argument appears to be ignored:
Version information