-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support malformed row handling in Python engine #15925
ENH: Support malformed row handling in Python engine #15925
Conversation
e0d5d4e
to
6ab5602
Compare
Codecov Report
@@ Coverage Diff @@
## master #15925 +/- ##
==========================================
+ Coverage 90.96% 90.97% +<.01%
==========================================
Files 145 145
Lines 49557 49576 +19
==========================================
+ Hits 45081 45100 +19
Misses 4476 4476
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #15925 +/- ##
==========================================
- Coverage 90.99% 90.99% -0.01%
==========================================
Files 145 145
Lines 49520 49540 +20
==========================================
+ Hits 45061 45077 +16
- Misses 4459 4463 +4
Continue to review full report at Codecov.
|
@@ -2657,42 +2684,57 @@ def _get_index_name(self, columns): | |||
return index_name, orig_names, columns | |||
|
|||
def _rows_to_cols(self, content): | |||
if self.skipfooter < 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also is this validated to be an integer? (and tested)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Negative numbers tested, yes.
- Verified as an integer, no. Can do in a follow-up (refactored)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
self.read_csv(StringIO(data), error_bad_lines=True) | ||
|
||
stderr = sys.stderr | ||
expected = DataFrame({'a': [1, 4]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think pytest has some facilities to make this a bit easier (the capturing of stderr)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually took this setup from elsewhere in the code. You are correct that is a pytest
facility of doing this (see here). However, if I'm going to make that change, I'd rather apply it to all places where we do this stderr = sys.stderr
.
Could this also be in a follow-up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also add a context manager in utils/testing.py for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's possible, though not sure how if that would integrate well with pytest
's fixture setup that I mentioned earlier.
can you run asv on the parsers to verify no regressions? not 100% sure we actually have one for python engine though...... |
@jreback : No noticeable regressions AFAICT. We actually do have Python engine benchmarks FYI in the |
if ret: | ||
line = ret[0] | ||
break | ||
elif self._empty(orig_line) or line: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the difference between_empty
and _check_empty
? (I would prefer just the _check_empty
as more meanigful name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are. Admittedly, the naming is not clear. _check_empty
doesn't just check for empty lines (_empty
does though). It also removes them. A renaming + documentation would be good as a (third!) follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, pls do so (in followup)
pandas/tests/io/parser/common.py
Outdated
@@ -1695,3 +1695,40 @@ class InvalidBuffer(object): | |||
|
|||
with tm.assertRaisesRegexp(ValueError, msg): | |||
self.read_csv(mock.Mock()) | |||
|
|||
def test_skip_bad_lines(self): | |||
data = 'a\n1\n1,2,3\n4\n5,6,7' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
6ab5602
to
9973a0a
Compare
9973a0a
to
e3a8cca
Compare
@jreback : Everything is green and ready to go. |
thanks! |
Support
warn_bad_lines
anderror_bad_lines
for the Python engine.xref #12686 (master tracker)
Inspired by #15910 (comment)
In addition, the Python parser now raises
pandas.error.ParserError
, which is in line with what the C engine would do.