ENH: Support malformed row handling in Python engine #15925

Merged
merged 1 commit into from
Apr 7, 2017

gfyoung
Member

gfyoung commented Apr 6, 2017

Support warn_bad_lines and error_bad_lines for the Python engine.

xref #12686 (master tracker)

Inspired by #15910 (comment)

In addition, the Python parser now raises pandas.errors.ParserError, which is in line with what the C engine does.
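
To make the intended behavior concrete, here is a minimal standard-library sketch of the two flags. It is not the pandas implementation: `read_rows` and `BadLineError` are hypothetical names used only for illustration, and the sketch assumes that a "bad line" is a row with more fields than the header. The sample data is the string used in the test added by this PR.

```python
import csv
import io
import sys


class BadLineError(ValueError):
    """Hypothetical stand-in for the parser error raised on a malformed row."""


def read_rows(text, error_bad_lines=True, warn_bad_lines=True):
    """Parse CSV text, treating rows with too many fields as malformed.

    Illustrative only: mirrors the flag semantics described above,
    not the actual pandas Python engine.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    expected = len(header)
    rows = []
    # The header is line 1, so data rows start at line 2.
    for line_num, row in enumerate(reader, start=2):
        if len(row) > expected:
            msg = "Expected %d fields in line %d, saw %d" % (
                expected, line_num, len(row))
            if error_bad_lines:
                # Strict mode: stop at the first malformed row.
                raise BadLineError(msg)
            if warn_bad_lines:
                # Lenient mode: report the row on stderr, then drop it.
                sys.stderr.write("Skipping line %d: %s\n" % (line_num, msg))
            continue
        rows.append(row)
    return header, rows


data = 'a\n1\n1,2,3\n4\n5,6,7'
header, rows = read_rows(data, error_bad_lines=False, warn_bad_lines=False)
```

With both flags off, only the well-formed rows survive, which corresponds to the expected frame `DataFrame({'a': [1, 4]})` in the new test. With `error_bad_lines=True` the same input raises on line 3.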

gfyoung force-pushed the malformed-lines-python branch 4 times, most recently from e0d5d4e to 6ab5602, on April 6, 2017 20:05

codecov bot commented Apr 6, 2017

Codecov Report

Merging #15925 into master will increase coverage by <.01%.
The diff coverage is 96.42%.

@@            Coverage Diff             @@
##           master   #15925      +/-   ##
==========================================
+ Coverage   90.96%   90.97%   +<.01%     
==========================================
  Files         145      145              
  Lines       49557    49576      +19     
==========================================
+ Hits        45081    45100      +19     
  Misses       4476     4476

Flag        Coverage Δ
#multiple   88.73% <96.42%> (ø) ⬆️
#single     40.6% <3.57%> (-0.02%) ⬇️

Impacted Files          Coverage Δ
pandas/io/parsers.py    95.64% <96.42%> (-0.02%) ⬇️
pandas/core/common.py   91.03% <0%> (+0.34%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4502e82...6ab5602.

codecov bot commented Apr 6, 2017

Codecov Report

Merging #15925 into master will decrease coverage by <.01%.
The diff coverage is 96.42%.

@@            Coverage Diff             @@
##           master   #15925      +/-   ##
==========================================
- Coverage   90.99%   90.99%   -0.01%     
==========================================
  Files         145      145              
  Lines       49520    49540      +20     
==========================================
+ Hits        45061    45077      +16     
- Misses       4459     4463       +4

Flag        Coverage Δ
#multiple   88.75% <96.42%> (-0.01%) ⬇️
#single     40.6% <3.57%> (-0.02%) ⬇️

Impacted Files           Coverage Δ
pandas/io/parsers.py     95.64% <96.42%> (-0.02%) ⬇️
pandas/core/common.py    90.68% <0%> (-0.35%) ⬇️
pandas/util/testing.py   80.66% <0%> (-0.19%) ⬇️
pandas/io/pytables.py    93.06% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f478e4f...e3a8cca.

@@ -2657,42 +2684,57 @@ def _get_index_name(self, columns):
        return index_name, orig_names, columns

    def _rows_to_cols(self, content):
        if self.skipfooter < 0:
Contributor

tested?

Contributor

also is this validated to be an integer? (and tested)?

Member Author

gfyoung commented Apr 7, 2017

  1. Negative numbers tested, yes.
  2. Verified as an integer, no. Can do in a follow-up (refactored)

Contributor

sure
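
For reference, the two checks discussed in this thread (non-negative, and an integer) could be combined in a small validator along these lines. This is only a sketch of the idea for the follow-up: the function name `validate_skipfooter` and the error messages are hypothetical and are not taken from the pandas code base.

```python
import numbers


def validate_skipfooter(skipfooter):
    """Hypothetical validator: skipfooter must be a non-negative integer."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(skipfooter, bool) or not isinstance(
            skipfooter, numbers.Integral):
        raise ValueError("skipfooter must be an integer")
    if skipfooter < 0:
        raise ValueError("skipfooter cannot be negative")
    return int(skipfooter)
```

Doing the type check first means a value such as `1.5` or `"1"` fails with a clear message instead of reaching the `< 0` comparison.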

self.read_csv(StringIO(data), error_bad_lines=True)

stderr = sys.stderr
expected = DataFrame({'a': [1, 4]})
Contributor

I think pytest has some facilities to make this a bit easier (the capturing of stderr)

Member Author

I actually took this setup from elsewhere in the code. You are correct that there is a pytest facility for doing this (see here). However, if I'm going to make that change, I'd rather apply it everywhere we use the stderr = sys.stderr pattern.

Could this also be in a follow-up?

Contributor

yep

Contributor

could also add a context manager in util/testing.py for this

Member Author

That's possible, though I'm not sure whether that would integrate well with the pytest fixture setup that I mentioned earlier.
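
As a rough illustration of the context-manager idea from this thread, the manual `stderr = sys.stderr` swap could be wrapped as shown below. The helper name `capture_stderr` is hypothetical and nothing here is existing pandas code. The pytest alternative mentioned above would be the built-in `capsys` fixture, whose `readouterr()` method returns the captured output.

```python
import sys
from contextlib import contextmanager
from io import StringIO


@contextmanager
def capture_stderr():
    """Hypothetical helper: temporarily redirect sys.stderr to a buffer."""
    original = sys.stderr
    buffer = StringIO()
    sys.stderr = buffer
    try:
        yield buffer
    finally:
        # Always restore the real stream, even if the body raises.
        sys.stderr = original


with capture_stderr() as err:
    # Stand-in for a parser call that warns about a skipped line.
    sys.stderr.write("Skipping line 3: expected 1 fields, saw 3\n")

captured = err.getvalue()
```

The `try`/`finally` is the main gain over the inline pattern: a failing assertion inside the block can no longer leave `sys.stderr` pointing at the buffer.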

jreback added the IO CSV, Enhancement, and Error Reporting labels on Apr 6, 2017
jreback
Contributor

jreback commented Apr 6, 2017

can you run asv on the parsers to verify no regressions? not 100% sure we actually have one for python engine though......

gfyoung
Member Author

gfyoung commented Apr 7, 2017

@jreback : No noticeable regressions AFAICT. We actually do have Python engine benchmarks FYI in the io_bench.py file under asv_bench.

if ret:
line = ret[0]
break
elif self._empty(orig_line) or line:
Contributor

what is the difference between _empty and _check_empty? (I would prefer just _check_empty as the more meaningful name)

Member Author

gfyoung commented Apr 7, 2017

Yes, there is a difference. Admittedly, the naming is not clear. _empty only checks whether a line is empty. _check_empty does more than check: it also removes the empty lines. A renaming + documentation would be good as a (third!) follow-up.

Contributor

ok, pls do so (in followup)
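
The distinction described above (one helper is a pure check, the other filters) might look like the following under clearer names. This is a sketch under assumptions: `is_empty` and `drop_empty` are hypothetical names, and it assumes a parsed line is a list of string fields. It is not the actual body of `_empty` or `_check_empty`.

```python
def is_empty(line):
    """Pure predicate: True if the line has no fields or only blank fields."""
    return not line or all(not field for field in line)


def drop_empty(lines):
    """Filter: return only the lines that are not empty."""
    return [line for line in lines if not is_empty(line)]


lines = [['1'], [], [''], ['4']]
kept = drop_empty(lines)
```

Naming the filter after what it returns makes a call site such as `if ret:` easier to read, since an empty result then plainly means every line was dropped.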

@@ -1695,3 +1695,40 @@ class InvalidBuffer(object):

        with tm.assertRaisesRegexp(ValueError, msg):
            self.read_csv(mock.Mock())

    def test_skip_bad_lines(self):
        data = 'a\n1\n1,2,3\n4\n5,6,7'
Contributor

add the issue number

Member Author

Yep, done.

gfyoung force-pushed the malformed-lines-python branch from 6ab5602 to 9973a0a on April 7, 2017 14:20
gfyoung force-pushed the malformed-lines-python branch from 9973a0a to e3a8cca on April 7, 2017 15:20
gfyoung
Member Author

gfyoung commented Apr 7, 2017

@jreback : Everything is green and ready to go.

jreback added this to the 0.20.0 milestone on Apr 7, 2017
jreback merged commit 5d17a94 into pandas-dev:master on Apr 7, 2017
jreback
Contributor

jreback commented Apr 7, 2017

thanks!
