read_csv character encoding bug? #2741

Closed
hayd opened this issue Jan 23, 2013 · 10 comments
Labels: Bug, IO Data (IO issues that don't fit into a more specific label)
Contributor

hayd commented Jan 23, 2013

This is a weird one from StackOverflow: the file contains some \x00 bytes, which seem to be ignored when printing but confuse read_csv:

from StringIO import StringIO  # Python 2, matching the print statement below
import pandas as pd

x = 'x,y\n \x00\x00\x00,Reg\n \x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
X = StringIO(x)

In [3]: pd.read_csv(X)
Out[3]: 
     x    y
0          
1  NaN  NaN
2    I  Swp
3    I  Swp

In [4]: print x
x,y
 ,Reg
 ,Reg
I,Swp
I,Swp
Member

wesm commented Jan 23, 2013

Yes. The tokenizer uses null terminators as a marker in a couple of places; I'll have to look to see exactly why this is failing.
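
As a general illustration of why embedded NUL bytes are awkward for code that treats NUL as an end marker, here is a minimal sketch using ctypes from the standard library. It is not the pandas tokenizer and makes no claim about where the tokenizer actually goes wrong; it only shows how a NUL-terminated view of the first field in the example above differs from the raw bytes.

```python
import ctypes

# First field of the second row in the example data: a space followed by three NULs.
field = b" \x00\x00\x00"

# ctypes copies the bytes into a C char buffer and appends one terminating NUL.
buf = ctypes.create_string_buffer(field)

raw_bytes = buf.raw   # every byte in the buffer, including the terminator
c_view = buf.value    # bytes up to the first NUL, as NUL-terminated C code sees them

print(repr(raw_bytes))
print(repr(c_view))
```

Anything after the first NUL is invisible in the NUL-terminated view, which is the kind of ambiguity a marker-based tokenizer has to guard against.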

Member

wesm commented Mar 29, 2013

In complete fairness, the csv module doesn't handle NULL bytes either:

In [6]: import csv; f = csv.reader(StringIO(x))

In [7]: next(f)
Out[7]: ['x', 'y']

In [8]: next(f)
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-8-468f0afdf1b9> in <module>()
----> 1 next(f)

Error: line contains NULL byte

Pushing this issue to some other day (not 0.11).


msiler commented Jan 14, 2016

I know this is an old issue, but I'd like to give it a little bump. I have customers giving us dumps of some of their production database tables. The files we are getting are tab-delimited, and values that were null in the database are null bytes in the text file. I don't know how to get pandas to read this without doing some manual munging first.
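
The manual munging mentioned above could look roughly like the following sketch, which uses only the standard library. The helper names (`strip_nul_lines`, `read_tsv_with_nul_as_missing`) and the sample data are hypothetical, and the sketch assumes that a field consisting only of NUL bytes stands for a database null. It also cannot tell such a field apart from a genuinely empty string.

```python
import csv
import io


def strip_nul_lines(lines):
    """Yield each input line with all NUL characters removed."""
    for line in lines:
        yield line.replace("\x00", "")


def read_tsv_with_nul_as_missing(handle):
    """Parse tab-delimited text, mapping fields left empty after cleaning to None."""
    reader = csv.reader(strip_nul_lines(handle), delimiter="\t")
    header = next(reader)
    rows = [[field if field != "" else None for field in row] for row in reader]
    return header, rows


# Hypothetical sample: the second column of the first data row was NULL in the database.
sample = "id\tname\n1\t\x00\n2\tReg\n"
header, rows = read_tsv_with_nul_as_missing(io.StringIO(sample))
print(header)
print(rows)
```

The cleaned lines from `strip_nul_lines` could equally be joined back into one string, wrapped in `io.StringIO`, and passed to `pandas.read_csv` with `sep="\t"`, which would then report the emptied fields as NaN.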

Member

wesm commented Jan 14, 2016

I'm sorry this never got fixed! I have also seen null bytes used to code nulls. @jreback @jorisvandenbossche let's slate this for 0.18?

wesm modified the milestones: 0.18.0, Someday (Jan 14, 2016)
Member

wesm commented Jan 14, 2016

I'd like to make some changes to the CSV tokenizer to hopefully improve its performance. That would also allow us to get rid of the null terminators that are complicating issues like this, but it would make the most sense to undertake that for 1.0 / libpandas. The question is whether there is a quick fix for this particular issue with the existing tokenization strategy.

jreback modified the milestones: Next Major Release, 0.18.0 (Jan 30, 2016)
Member

gfyoung commented Jul 31, 2016

I don't believe this is an issue with the C engine anymore:

>>> from pandas.compat import StringIO
>>> from pandas import read_csv
>>> data = 'x,y\n\x00\x00\x00,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
>>> read_csv(StringIO(data), engine='c')
     x    y
0  NaN  Reg
1  NaN  Reg
2    I  Swp
3    I  Swp

Unfortunately, as @wesm pointed out above, it still fails with the Python engine:

>>> read_csv(StringIO(data), engine='python')
...
_csv.Error: line contains NULL byte

This issue, however, seems beyond our control, so I'm not sure we should still classify it as a BUG on the pandas end if it originates in Python's csv module.

Contributor

jreback commented Jul 31, 2016

If you want to put up tests for the C engine and a nice error message for the Python engine, then we can close this.
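
One way to picture the "nice error message" idea is the sketch below. It is not the change that was eventually made in pandas; `NullByteError`, `checked_lines`, and `parse_rows` are hypothetical names, and the approach (scanning each line before it reaches the csv module) is just one assumption about how such a message could be produced.

```python
import csv
import io


class NullByteError(ValueError):
    """Hypothetical error raised when the input contains a NUL character."""


def checked_lines(lines):
    """Yield lines unchanged, raising a descriptive error if one contains NUL."""
    for lineno, line in enumerate(lines, start=1):
        if "\x00" in line:
            raise NullByteError(
                "NULL byte detected on line %d; this parser cannot process "
                "NULL bytes, so strip them or use a different engine" % lineno
            )
        yield line


def parse_rows(handle):
    """Parse CSV text into a list of rows after checking every line for NUL."""
    return list(csv.reader(checked_lines(handle)))


clean_rows = parse_rows(io.StringIO("x,y\nI,Swp\n"))

try:
    parse_rows(io.StringIO("x,y\n\x00\x00\x00,Reg\n"))
    message = None
except NullByteError as exc:
    message = str(exc)
print(message)
```

Checking before the csv module sees the line keeps the message independent of whatever the underlying parser would raise.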

gfyoung added a commit to forking-repos/pandas that referenced this issue Jul 31, 2016
Provides a nicer error message for the Python engine
in read_csv when the data contains a NULL byte.

Closes gh-2741.
Member

wesm commented Aug 1, 2016

Relatedly: what is the current implementation gap between the C and pure Python CSV parsers?

Member

gfyoung commented Aug 1, 2016

@wesm: We have a list of known differences in #12686. That alone indicates a pretty noticeable gap.

Member

wesm commented Aug 1, 2016

Seems like if you can address the regex delimiter problem (easier said than done), then it may be possible to deprecate the Python engine. This would be easier in a possible pandas 2.0 future in which we might add libre2 to the build / development toolchain.

jreback pushed a commit that referenced this issue Aug 1, 2016
Provides a nicer error message for the Python engine
in read_csv when the data contains a NULL byte.

Closes gh-2741.
jreback modified the milestones: 0.19.0, Next Major Release (Aug 1, 2016)