BUG: read_csv skipfooter fails with invalid quoted line #15910

Closed
chris-b1 opened this issue Apr 5, 2017 · 13 comments
Labels
Bug Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv

@chris-b1
Contributor

chris-b1 commented Apr 5, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
from pandas.compat import StringIO

pd.read_csv(StringIO('''Date,Value
1/1/2012,100.00
1/2/2012,102.00
"a quoted junk row"morejunk'''),  skipfooter=1)

Out[21]
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 20))

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-34-d8dff6b9f4a7> in <module>()
      2 1/1/2012,100.00
      3 1/2/2012,102.00
----> 4 "a quoted junk row" '''),  skipfooter=1)

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    651                     skip_blank_lines=skip_blank_lines)
    652 
--> 653         return _read(filepath_or_buffer, kwds)
    654 
    655     parser_f.__name__ = name

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    404 
    405     try:
--> 406         data = parser.read()
    407     finally:
    408         parser.close()

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in read(self, nrows)
    977                 raise ValueError('skipfooter not supported for iteration')
    978 
--> 979         ret = self._engine.read(nrows)
    980 
    981         if self.options.get('as_recarray'):

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in read(self, rows)
   2066     def read(self, rows=None):
   2067         try:
-> 2068             content = self._get_lines(rows)
   2069         except StopIteration:
   2070             if self._first_chunk:

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in _get_lines(self, rows)
   2717                         while True:
   2718                             try:
-> 2719                                 new_rows.append(next(source))
   2720                                 rows += 1
   2721                             except csv.Error as inst:

Error: ',' expected after '"'

Problem description

This error only happens if the last row has quoting, and is invalid - e.g. delete the morejunk above and it does not error.

Expected Output

successful parse

pandas 0.19.2

@chris-b1 chris-b1 added Bug IO CSV read_csv, to_csv labels Apr 5, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Apr 5, 2017
@chris-b1
Contributor Author

chris-b1 commented Apr 5, 2017

Hmm, I guess this is the same as #13879, although the PR to improve the error message doesn't seem to have caught this case. cc @gfyoung

@chris-b1 chris-b1 added the Error Reporting Incorrect or improved errors from pandas label Apr 5, 2017
@gfyoung
Member

gfyoung commented Apr 5, 2017

@chris-b1 : Could you post the full stack trace? I presume that error message is coming from Python's csv library, but I would like to double-check (no access to a computer ATM).

@chris-b1
Contributor Author

chris-b1 commented Apr 5, 2017

yep, edited in the top comment

@gfyoung
Member

gfyoung commented Apr 5, 2017

Awesome. Yep, I think your diagnosis is correct. I can quickly patch that.

@gfyoung
Member

gfyoung commented Apr 5, 2017

Deeper analysis indicates that you can successfully parse this with the C engine on master:

pd.read_csv(StringIO('''Date,Value
1/1/2012,100.00
1/2/2012,102.00
"a quoted junk row"morejunk'''))

                        Date  Value
0                   1/1/2012  100.0
1                   1/2/2012  102.0
2  a quoted junk rowmorejunk    NaN

However, the Python engine cannot read this correctly (with or without the skipfooter argument). I'm not sure why the Python engine would complain about this. The C engine's parsing seems correct.

@chris-b1 : What do you think?

@gfyoung
Member

gfyoung commented Apr 5, 2017

Here's a simpler example that we can use:

>>> data = 'a\n1\n"a"b'
>>> read_csv(StringIO(data), engine='c')
    a
0   1
1  ab
>>>
>>> read_csv(StringIO(data), engine='python')
...
_csv.Error: ',' expected after '"'
>>>
>>> read_csv(StringIO(data), engine='python', skipfooter=1)
...
_csv.Error: ',' expected after '"'

@gfyoung
Member

gfyoung commented Apr 5, 2017

This inconsistency notwithstanding, it would still be worthwhile to properly catch errors there at that try-except block. A PR can go up for that at the very least.
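
To illustrate the kind of error handling suggested here, below is a minimal sketch using only the standard library. It is not the pandas implementation: `read_rows` and the wording of its message are hypothetical, and it assumes the reader runs with `strict=True`. It parses every row first and drops the footer afterwards, so a malformed footer row still raises, but the error now hints that the footer may be the cause.

```python
import csv
from io import StringIO


def read_rows(text, skipfooter=0):
    """Hypothetical helper: parse all rows strictly, then drop the footer."""
    reader = csv.reader(StringIO(text), strict=True)
    rows = []
    while True:
        try:
            rows.append(next(reader))
        except StopIteration:
            break
        except csv.Error as err:
            msg = str(err)
            if skipfooter > 0:
                # The footer is only removed after every row has been parsed,
                # so a bad footer row can still trigger this error.
                msg += (". The error may come from the footer rows, since "
                        "skipfooter is applied only after all rows are parsed")
            raise ValueError(msg) from err
    # Drop the last `skipfooter` rows, if any were requested.
    return rows[:max(len(rows) - skipfooter, 0)] if skipfooter else rows


print(read_rows('a\n1\n2', skipfooter=1))
```

Calling `read_rows('a\n1\n"a"b', skipfooter=1)` raises a `ValueError` whose message mentions `skipfooter` instead of surfacing a bare `csv.Error`.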

@chris-b1
Contributor Author

chris-b1 commented Apr 5, 2017

Yeah, it does seem like that should parse. The built-in csv reader doesn't complain:

import csv
from io import StringIO  # Python 3; pandas.compat.StringIO was used elsewhere in this thread

data = 'a\n1\n"a"b'
list(csv.reader(StringIO(data)))

Out[16]: [['a'], ['1'], ['ab']]

@gfyoung
Member

gfyoung commented Apr 5, 2017

Oh, interesting...does your original example work with csv.reader(StringIO(...)) ? Maybe try passing in strict=True to csv.reader as well?

@chris-b1
Contributor Author

chris-b1 commented Apr 5, 2017

It does using defaults, but not with strict=True
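
For reference, the difference can be reproduced with the standard library alone. This is a small sketch on the simplified data from earlier in the thread; the variable names are illustrative only.

```python
import csv
from io import StringIO

data = 'a\n1\n"a"b'

# Default (lenient) mode: text after the closing quote is appended to the field.
lenient_rows = list(csv.reader(StringIO(data)))

# Strict mode: the same input raises csv.Error on the last line.
try:
    list(csv.reader(StringIO(data), strict=True))
    strict_error = None
except csv.Error as err:
    strict_error = str(err)

print(lenient_rows)
print(strict_error)
```

The lenient reader returns the last row as `['ab']`, while the strict reader fails with the same "expected after" message seen in the traceback above.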

@gfyoung
Member

gfyoung commented Apr 5, 2017

Ah, that's the reason then. Hmm, it seems like we wouldn't consider that malformed, though. Since we can't "fix" the Python parser, I think we can at least add the test.
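
One possible workaround on the caller's side, sketched here with the standard library only, is to drop the footer lines from the raw text before any strict parsing happens. `drop_footer_lines` is a hypothetical helper, not a pandas function, and it assumes the footer consists of whole physical lines (it would cut a quoted field that spans several lines).

```python
import csv
from io import StringIO


def drop_footer_lines(text, n):
    """Hypothetical helper: remove the last n physical lines of text."""
    lines = text.splitlines()
    # Clamp at zero so asking for more lines than exist yields an empty result.
    kept = lines[:max(len(lines) - n, 0)] if n > 0 else lines
    return "\n".join(kept)


data = 'a\n1\n"a"b'
trimmed = drop_footer_lines(data, 1)

# The malformed footer row is gone, so even a strict reader succeeds.
rows = list(csv.reader(StringIO(trimmed), strict=True))
print(rows)
```

Because the junk row never reaches the parser, the strictness of the reader no longer matters for the footer.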

@gfyoung
Member

gfyoung commented Apr 6, 2017

Actually, here's a "fix" (it just goes to show how broken regex splitting in the Python engine is):

>>> data = 'a\n1\n"a"b'
>>> read_csv(StringIO(data), engine='python', sep='pandas')
    a
0   1
1  ab

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 6, 2017
@jreback jreback modified the milestones: 0.20.0, Next Major Release Apr 6, 2017
@shivampatel16

Here's a simpler example that we can use:

>>> data = 'a\n1\n"a"b'
>>> read_csv(StringIO(data), engine='c')
    a
0   1
1  ab
>>>
>>> read_csv(StringIO(data), engine='python')
...
_csv.Error: ',' expected after '"'
>>>
>>> read_csv(StringIO(data), engine='python', skipfooter=1)
...
_csv.Error: ',' expected after '"'

engine='c' does the job for me. Finally got my task working after a huge but simple hurdle.

Thank you!
