ENH: Support malformed row handling in Python engine #15925

gfyoung · 2017-04-06T18:28:34Z

Support warn_bad_lines and error_bad_lines for the Python engine.

xref #12686 (master tracker)

In addition, the Python parser now raises pandas.errors.ParserError, which is in line with what the C engine does.
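
For readers who want a feel for the behaviour described above, here is a minimal standard-library sketch. It is not the code in pandas/io/parsers.py: the `read_rows` helper, its message wording, and the local `ParserError` class are assumptions made for illustration. Only the flag names `error_bad_lines` and `warn_bad_lines` and the sample data come from this PR.

```python
import csv
import io
import sys


class ParserError(ValueError):
    """Illustrative stand-in for the ParserError that pandas raises."""


def read_rows(text, error_bad_lines=True, warn_bad_lines=True):
    """Hypothetical helper: parse CSV text, handling rows with too many fields."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if len(row) > len(header):
            msg = "Expected {} fields in line {}, saw {}".format(
                len(header), line_no, len(row))
            if error_bad_lines:
                # Strict mode: the first malformed row aborts parsing.
                raise ParserError(msg)
            if warn_bad_lines:
                # Lenient mode: report the skipped row on stderr.
                sys.stderr.write("Skipping line {}: {}\n".format(line_no, msg))
            continue
        rows.append(row)
    return header, rows


data = "a\n1\n1,2,3\n4\n5,6,7"
header, rows = read_rows(data, error_bad_lines=False, warn_bad_lines=False)
```

With both flags off, the two three-field rows are dropped silently and only the rows `1` and `4` remain, which mirrors the expected frame in the test added by this PR.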

codecov · 2017-04-06T20:05:51Z

Codecov Report

Merging #15925 into master will increase coverage by <.01%.
The diff coverage is 96.42%.

@@            Coverage Diff             @@
##           master   #15925      +/-   ##
==========================================
+ Coverage   90.96%   90.97%   +<.01%     
==========================================
  Files         145      145              
  Lines       49557    49576      +19     
==========================================
+ Hits        45081    45100      +19     
  Misses       4476     4476

Flag	Coverage Δ
#multiple	`88.73% <96.42%> (ø)`	⬆️
#single	`40.6% <3.57%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.64% <96.42%> (-0.02%)`	⬇️
pandas/core/common.py	`91.03% <0%> (+0.34%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4502e82...6ab5602. Read the comment docs.

codecov · 2017-04-06T20:05:58Z

Codecov Report

Merging #15925 into master will decrease coverage by <.01%.
The diff coverage is 96.42%.

@@            Coverage Diff             @@
##           master   #15925      +/-   ##
==========================================
- Coverage   90.99%   90.99%   -0.01%     
==========================================
  Files         145      145              
  Lines       49520    49540      +20     
==========================================
+ Hits        45061    45077      +16     
- Misses       4459     4463       +4

Flag	Coverage Δ
#multiple	`88.75% <96.42%> (-0.01%)`	⬇️
#single	`40.6% <3.57%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.64% <96.42%> (-0.02%)`	⬇️
pandas/core/common.py	`90.68% <0%> (-0.35%)`	⬇️
pandas/util/testing.py	`80.66% <0%> (-0.19%)`	⬇️
pandas/io/pytables.py	`93.06% <0%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f478e4f...e3a8cca. Read the comment docs.

jreback · 2017-04-06T22:37:37Z

pandas/io/parsers.py

@@ -2657,42 +2684,57 @@ def _get_index_name(self, columns):
        return index_name, orig_names, columns

    def _rows_to_cols(self, content):
+        if self.skipfooter < 0:


also is this validated to be an integer? (and tested)?

Negative numbers tested, yes.

Verified as an integer, no. Can do in a follow-up (refactored)
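
A rough sketch of what the follow-up validation could look like, assuming a standalone helper. The name `validate_skipfooter` and the error messages are hypothetical and are not taken from the pandas code base.

```python
import numbers


def validate_skipfooter(skipfooter):
    """Hypothetical check: skipfooter must be a non-negative integer."""
    # Reject floats, strings, None, etc. before comparing against zero.
    if not isinstance(skipfooter, numbers.Integral):
        raise ValueError("skipfooter must be an integer")
    if skipfooter < 0:
        raise ValueError("skipfooter cannot be negative")
    return int(skipfooter)


checked = validate_skipfooter(2)
```

Doing the type check first means a non-numeric value produces a clear message instead of failing on the `< 0` comparison.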

jreback · 2017-04-06T22:40:45Z

pandas/tests/io/parser/common.py

+            self.read_csv(StringIO(data), error_bad_lines=True)
+
+        stderr = sys.stderr
+        expected = DataFrame({'a': [1, 4]})


I think pytest has some facilities to make this a bit easier (the capturing of stderr)

I actually took this setup from elsewhere in the code. You are correct that there is a pytest facility for doing this (see here). However, if I'm going to make that change, I'd rather apply it to all places where we do this stderr = sys.stderr.

Could this also be in a follow-up?

could also add a context manager in util/testing.py for this

That's possible, though I'm not sure whether that would integrate well with pytest's fixture setup that I mentioned earlier.
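
To make the context-manager idea in this thread concrete, here is one possible shape for it. The name `capture_stderr` is hypothetical, and this is not the helper that eventually landed in pandas. On Python 3.5+ the standard library's `contextlib.redirect_stderr` covers the same ground.

```python
import contextlib
import io
import sys


@contextlib.contextmanager
def capture_stderr():
    """Hypothetical helper: swap sys.stderr for an in-memory buffer."""
    original = sys.stderr
    buffer = io.StringIO()
    sys.stderr = buffer
    try:
        yield buffer
    finally:
        # Always restore the real stream, even if the body raises.
        sys.stderr = original


with capture_stderr() as err:
    sys.stderr.write("Skipping line 3\n")

captured = err.getvalue()
```

The `try`/`finally` replaces the manual `stderr = sys.stderr` save-and-restore pattern, so a failing assertion inside the block cannot leave the stream redirected.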

jreback · 2017-04-06T22:42:06Z

can you run asv on the parsers to verify no regressions? not 100% sure we actually have one for python engine though......

gfyoung · 2017-04-07T03:31:35Z

@jreback : No noticeable regressions AFAICT. We actually do have Python engine benchmarks FYI in the io_bench.py file under asv_bench.

jreback · 2017-04-07T12:23:31Z

pandas/io/parsers.py

+                        if ret:
+                            line = ret[0]
+                            break
+                    elif self._empty(orig_line) or line:


what is the difference between _empty and _check_empty? (I would prefer just the _check_empty as more meaningful name)

Yes, there is a difference. Admittedly, the naming is not clear. _check_empty doesn't just check for empty lines (_empty does though). It also removes them. A renaming + documentation would be good as a (third!) follow-up.

ok, pls do so (in followup)
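
The distinction drawn in this thread (a predicate versus a function that also removes lines) can be sketched as follows. The names `is_empty_row` and `drop_empty_rows` are hypothetical, and the emptiness rule used here is an assumption rather than the exact logic in the Python parser.

```python
def is_empty_row(row):
    """Predicate only: True for [] or a row whose fields are all empty."""
    return not row or all(not field for field in row)


def drop_empty_rows(rows):
    """Filter: return a new list with the empty rows removed."""
    return [row for row in rows if not is_empty_row(row)]


rows = [["a"], [], ["", ""], ["1"]]
kept = drop_empty_rows(rows)
```

Naming the second function after what it does (dropping rows) is one way to avoid the ambiguity raised in the review.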

jreback · 2017-04-07T12:24:19Z

pandas/tests/io/parser/common.py

@@ -1695,3 +1695,40 @@ class InvalidBuffer(object):

            with tm.assertRaisesRegexp(ValueError, msg):
                self.read_csv(mock.Mock())
+
+    def test_skip_bad_lines(self):
+        data = 'a\n1\n1,2,3\n4\n5,6,7'


add the issue number

gfyoung · 2017-04-07T17:49:26Z

@jreback : Everything is green and ready to go.

jreback · 2017-04-07T19:47:35Z

thanks!

gfyoung force-pushed the malformed-lines-python branch 4 times, most recently from e0d5d4e to 6ab5602 Compare April 6, 2017 20:05

jreback reviewed Apr 6, 2017

kawochen mentioned this pull request Apr 6, 2017

ENH/DOC/CLN: Document arguments and reconcile C and Python engines for read_csv #12686

Open

22 tasks

jreback added the IO CSV (read_csv, to_csv), Enhancement, and Error Reporting (Incorrect or improved errors from pandas) labels Apr 6, 2017

jreback reviewed Apr 7, 2017

gfyoung force-pushed the malformed-lines-python branch from 6ab5602 to 9973a0a Compare April 7, 2017 14:20

ENH: Support malformed row handling in Python engine

e3a8cca

gfyoung force-pushed the malformed-lines-python branch from 9973a0a to e3a8cca Compare April 7, 2017 15:20

jreback added this to the 0.20.0 milestone Apr 7, 2017

jreback merged commit 5d17a94 into pandas-dev:master Apr 7, 2017

gfyoung deleted the malformed-lines-python branch April 7, 2017 19:58

This was referenced Apr 7, 2017

BUG: Validate the skipfooter parameter in read_csv #15945

Merged

MAINT: Refactor Python engine empty line funcs #15946

Merged

TST: Add test decorators for redirecting stdout and stderr #15952

Merged
