read_csv character encoding bug? #2741

hayd · 2013-01-23T21:45:55Z

This is a weird one from StackOverflow: the file contains some \x00 characters, which seem to be ignored when printing but confuse read_csv:

x = 'x,y\n \x00\x00\x00,Reg\n \x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
X = StringIO(x)

In [3]: pd.read_csv(X)
Out[3]: 
     x    y
0          
1  NaN  NaN
2    I  Swp
3    I  Swp

In [4]: print x
x,y
 ,Reg
 ,Reg
I,Swp
I,Swp
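
A quick way to confirm that the NUL characters really are in the data, even though print hides them, is to look at the repr of one line. This is only a sketch using the sample string from this report; the variable names `second_line` and `nul_count` are made up for illustration.

```python
# Same sample string as in the report above.
x = 'x,y\n \x00\x00\x00,Reg\n \x00\x00\x00,Reg\nI,Swp\nI,Swp\n'

# str.splitlines() does not treat NUL as a line boundary, so the
# second line still carries its three NUL characters.
second_line = x.splitlines()[1]
nul_count = second_line.count('\x00')

print(repr(second_line))  # repr() escapes NUL characters so they become visible
print(nul_count)
```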

wesm · 2013-01-23T22:26:37Z

Yes. The tokenizer uses null terminators in a couple of places as a marker; I'll have to look to see exactly why this is failing.
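
To illustrate in general terms why NUL-terminated strings and data containing NUL bytes do not mix, here is a small sketch using ctypes from the standard library. It is not the pandas tokenizer, only a demonstration of C-string semantics; the sample bytes are made up.

```python
import ctypes

# Hypothetical field value with an embedded NUL byte.
data = b"Reg\x00Swp"
buf = ctypes.create_string_buffer(data)

# .value reads the buffer as a NUL-terminated C string,
# so everything after the first NUL is lost.
as_c_string = buf.value

# .raw returns the underlying bytes; slicing drops the extra
# terminator that create_string_buffer appends.
full_bytes = buf.raw[:len(data)]

print(as_c_string)
print(full_bytes)
```

Any code that uses a NUL as an end-of-token marker has the same ambiguity: it cannot tell a marker from a NUL that was part of the input.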

wesm · 2013-03-29T04:28:13Z

In complete fairness the csv module doesn't handle NULL bytes:

In [6]: import csv; f = csv.reader(StringIO(x))

In [7]: next(f)
Out[7]: ['x', 'y']

In [8]: next(f)
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-8-468f0afdf1b9> in <module>()
----> 1 next(f)

Error: line contains NULL byte

Pushing this issue to some other day (not 0.11)

msiler · 2016-01-14T16:13:22Z

I know this is an old issue, but I'd like to give it a little bump. I have customers giving us a dump of some of their production database tables. The files we are getting are tab delimited and values that were null in the database are null bytes in the text file. I don't know how to get pandas to read this without having to do some manual munging first.
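
For anyone in the same situation, one possible form of the manual munging is to strip the NUL bytes before parsing. The sketch below uses only the standard library; `strip_nul_bytes` and the sample data are hypothetical, and it assumes a single-byte or UTF-8 encoding (removing zero bytes from UTF-16 data would corrupt it).

```python
import csv
import io


def strip_nul_bytes(raw: bytes, encoding: str = "utf-8") -> io.StringIO:
    """Return a text buffer holding `raw` with every NUL byte removed."""
    cleaned = raw.replace(b"\x00", b"")
    # newline="" leaves line endings untouched, as the csv module expects.
    return io.StringIO(cleaned.decode(encoding), newline="")


# Hypothetical tab-delimited dump where a database NULL was written as a NUL byte.
raw = b"id\tname\n1\t\x00\n2\tSwp\n"

rows = list(csv.reader(strip_nul_bytes(raw), delimiter="\t"))
print(rows)
```

The cleaned buffer could also be handed to read_csv with sep='\t'; the fields that held only NUL bytes become empty strings, which read_csv treats as missing values by default.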

wesm · 2016-01-14T21:27:25Z

I'm sorry this never got fixed! I have also seen null bytes used to code nulls. @jreback @jorisvandenbossche let's slate this for 0.18?

wesm · 2016-01-14T21:42:18Z

I'd like to make some changes to the CSV tokenizer to hopefully improve its performance. That would also allow us to get rid of the null terminators that are complicating issues like this, but it would make most sense to undertake that for 1.0 / libpandas. The question is whether there is a quick fix for this particular issue with the existing tokenization strategy.

gfyoung · 2016-07-31T17:45:03Z

I don't believe this is an issue with the C engine anymore:

>>> from pandas.compat import StringIO
>>> from pandas import read_csv
>>> data = 'x,y\n\x00\x00\x00,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
>>> read_csv(StringIO(data), engine='c')
     x    y
0  NaN  Reg
1  NaN  Reg
2    I  Swp
3    I  Swp

Unfortunately, as @wesm pointed out above, it still fails with the Python engine:

>>> read_csv(StringIO(data), engine='python')
...
_csv.Error: line contains NULL byte

This issue, however, seems beyond our control, so I'm not sure we should still classify it as a BUG on the pandas end if it originates in Python's csv module.

jreback · 2016-07-31T18:33:22Z

If you want to put up tests for the C engine and a nice error message for the Python engine, then we can close this.
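
A rough idea of what such an error message could look like, sketched with the standard library only. This is not the pandas implementation: `read_rows` and the message wording are hypothetical. The check is done explicitly rather than by catching csv.Error, since the csv module's own handling of NUL characters varies between Python versions.

```python
import csv
import io


def read_rows(buf):
    """Parse CSV rows, failing early with a descriptive message on NUL characters."""
    lines = []
    for lineno, line in enumerate(buf, start=1):
        if "\x00" in line:
            raise ValueError(
                f"NUL byte detected on line {lineno}; "
                "strip NUL bytes from the input or use a parser that accepts them"
            )
        lines.append(line)
    # csv.reader accepts any iterable of strings, including a list of lines.
    return list(csv.reader(lines))


clean_rows = read_rows(io.StringIO("x,y\nI,Swp\n"))

try:
    read_rows(io.StringIO("x,y\n\x00\x00\x00,Reg\n"))
except ValueError as exc:
    message = str(exc)
    print(message)
```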

wesm · 2016-08-01T18:09:38Z

Relatedly: what is the current implementation gap between the C and pure Python CSV parsers?

gfyoung · 2016-08-01T18:34:58Z

@wesm: We have a list of known differences in #12686. That alone indicates a pretty noticeable gap.

wesm · 2016-08-01T19:19:58Z

Seems like if you can address the regex delimiter problem (easier said than done) then it may be possible to deprecate the Python engine. This would be easier in the possible pandas 2.0 future in which we might add libre2 to the build / development toolchain
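
For context, a regex delimiter means splitting each line on a pattern rather than on a single character, which is why read_csv falls back to the Python engine when given a multi-character separator other than '\s+'. The sketch below shows the idea with the standard re module; `split_with_regex` and the sample text are made up, and it ignores quoting entirely.

```python
import re


def split_with_regex(text, pattern):
    """Split each line of `text` into fields using a regex delimiter."""
    delimiter = re.compile(pattern)
    return [delimiter.split(line) for line in text.splitlines()]


# Semicolon-separated data with inconsistent whitespace around the delimiter.
rows = split_with_regex("x ;y\nI;  Swp\n", r"\s*;\s*")
print(rows)
```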

wesm modified the milestones: 0.18.0, Someday Jan 14, 2016

jreback modified the milestones: Next Major Release, 0.18.0 Jan 30, 2016

jreback added Difficulty Intermediate labels Jan 30, 2016

gfyoung added a commit to forking-repos/pandas that referenced this issue Jul 31, 2016

MAINT: Nicer error msg for NULL byte in read_csv

bce0b6b

Provides a nicer error message for the Python engine in read_csv when the data contains a NULL byte. Closes gh-2741.

gfyoung mentioned this issue Jul 31, 2016

MAINT: Nicer error msg for NULL byte in read_csv #13859

Merged

jreback closed this as completed in #13859 Aug 1, 2016

jreback pushed a commit that referenced this issue Aug 1, 2016

MAINT: Nicer error msg for NULL byte in read_csv (#13859)

d4f95fd

Provides a nicer error message for the Python engine in read_csv when the data contains a NULL byte. Closes gh-2741.

jreback modified the milestones: 0.19.0, Next Major Release Aug 1, 2016

smsaladi mentioned this issue Feb 24, 2018

data after null character dropped in read_csv #19886

Open