read_csv character encoding bug? #2741
Comments
Yes. The tokenizer uses null terminators in a couple of places as a marker, I'll have to look to see exactly why this is failing. |
In complete fairness the
Pushing this issue to some other day (not 0.11) |
I know this is an old issue, but I'd like to give it a little bump. I have customers giving us a dump of some of their production database tables. The files we are getting are tab delimited and values that were null in the database are null bytes in the text file. I don't know how to get pandas to read this without having to do some manual munging first. |
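As a stopgap for dumps like the one described above, the null bytes can be stripped before the text reaches the parser. The following is only a minimal sketch using the standard library's csv module; the helper names `strip_nul` and `read_tab_dump` and the sample data are hypothetical, and it assumes that a field consisting solely of null bytes should be treated as missing. The same cleaned text could presumably be wrapped in `io.StringIO` and passed to `read_csv` with a tab separator instead.

```python
import csv
import io


def strip_nul(lines):
    """Yield each line with NUL characters removed (hypothetical helper)."""
    for line in lines:
        yield line.replace("\x00", "")


def read_tab_dump(text):
    """Parse tab-delimited text; fields left empty after cleaning become None."""
    cleaned = strip_nul(io.StringIO(text, newline=""))
    reader = csv.reader(cleaned, delimiter="\t")
    return [[field if field != "" else None for field in row] for row in reader]


# Made-up sample: the "name" value in the first data row was NULL in the database.
sample = "id\tname\n1\t\x00\n2\tbob\n"
rows = read_tab_dump(sample)
print(rows)
```

Note that this approach cannot distinguish a database NULL from a genuinely empty string, which may or may not matter for a given dump.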
I'm sorry this never got fixed! I have also seen null bytes used to code nulls. @jreback @jorisvandenbossche let's slate this for 0.18? |
I'd like to make some changes to the CSV tokenizer to hopefully improve its performance, which will also allow us to get rid of the null terminators that are complicating issues like this, but it would make most sense to undertake that for 1.0 / libpandas. The question is whether there is a quick fix for this particular issue with the existing tokenization strategy. |
I don't believe this is an issue with the C engine anymore:
>>> from pandas.compat import StringIO
>>> from pandas import read_csv
>>> data = 'x,y\n\x00\x00\x00,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
>>> read_csv(StringIO(data), engine='c')
     x    y
0  NaN  Reg
1  NaN  Reg
2    I  Swp
3    I  Swp
Unfortunately, as @wesm pointed out here, it still does fail with the Python engine:
>>> read_csv(StringIO(data), engine='python')
...
_csv.Error: line contains NULL byte
This issue, however, seems beyond our control, so I'm not sure if we should still classify this as a BUG on the |
If you want to put up tests for the C engine and a nice error message for the Python engine, then we can close this. |
Provides a nicer error message for the Python engine in read_csv when the data contains a NULL byte. Closes pandas-devgh-2741.
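To illustrate what a nicer error message could look like, here is a rough sketch that checks each line for a NULL byte before handing the data to the standard library's csv module. This is not the actual pandas change; the function name `parse_rows` and the message wording are assumptions made for illustration only.

```python
import csv
import io


def parse_rows(text):
    """Parse CSV text, failing early with a descriptive error on NULL bytes.

    Hypothetical helper: reports the 1-based line number of the first
    offending line instead of surfacing a bare low-level parser error.
    """
    lines = []
    for lineno, line in enumerate(io.StringIO(text, newline=""), start=1):
        if "\x00" in line:
            raise ValueError(
                f"NULL byte detected on line {lineno}; "
                "strip NULL bytes from the input or try a different parser"
            )
        lines.append(line)
    return list(csv.reader(lines))


# Same data as in the session earlier in this thread.
data = 'x,y\n\x00\x00\x00,Reg\n\x00\x00\x00,Reg\nI,Swp\nI,Swp\n'
try:
    parse_rows(data)
except ValueError as exc:
    print(exc)
```

Checking up front keeps the message independent of how the underlying csv module reacts to NULL bytes, which may differ between Python versions.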
Relatedly: what is the current implementation gap between the C and pure Python CSV parsers? |
Seems like if you can address the regex delimiter problem (easier said than done) then it may be possible to deprecate the Python engine. This would be easier in the possible pandas 2.0 future in which we might add libre2 to the build / development toolchain |
Provides a nicer error message for the Python engine in read_csv when the data contains a NULL byte. Closes gh-2741.
This is a weird one from StackOverflow: this file has some \x00s which seem to be ignored when printing but confuse read_csv: