Automatic character type encoding? #804

Closed
pdeffebach opened this issue Jan 29, 2021 · 6 comments

Comments

@pdeffebach

I'm working with Japanese data right now, and it's encoded in SHIFT_JIS, except the metadata of the file doesn't know that. So it's gibberish when you try to open it in, say, a text editor.

CSV fails to read it, giving columns of byte arrays instead of strings. But uchardet worked successfully for at least one of the files. So maybe there is room for improvement.

Curiously, it failed in R on Linux but not on Windows.

@quinnj
Member

quinnj commented Jan 29, 2021

> CSV fails to read it, giving columns of byte arrays instead of strings.

Can you explain this a little more? I think the values probably are Strings, but the Base Julia behavior is to print them like byte arrays if the bytes aren't valid UTF-8.
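A minimal sketch of that display behavior, using only Base Julia. The six bytes are copied from the `\x83R\x81[\x83h` part of a column header in the output reported in this thread; the variable names are just for illustration.

```julia
# Bytes taken from the header fragment `\x83R\x81[\x83h` shown in this thread.
bytes = UInt8[0x83, 0x52, 0x81, 0x5b, 0x83, 0x68]

# `String` takes ownership of the vector it is given, so pass a copy.
s = String(copy(bytes))

println(s isa String)   # the value is still a String, not a byte array
println(isvalid(s))     # false here: the bytes are not valid UTF-8
show(s)                 # invalid bytes are printed as \x escapes
println()
```

So a column can have element type `String` and still display as escape sequences when the underlying bytes are in another encoding.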

@pdeffebach
Author

pdeffebach commented Jan 29, 2021

You are right, here is the output:

   Row │ Column1  id     address                            Facility\x83R\x81[\x83h  \x92n\x95\xfb\x8c\U ⋯
       │ Int64    Int64  String                             Int64                    Int64               ⋯
───────┼──────────────────────────────────────────────────────────────────────────────────────────────────
     1 │       1      1  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        1                      ⋯
     2 │       2      2  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        2
     3 │       3      3  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        3
     4 │       4      4  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        4
     5 │       5      5  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        5                      ⋯
     6 │       6      7  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        7
     7 │       7      8  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        8
     8 │       8      9  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        9
     9 │       9     10  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                       10                      ⋯

@nalimilan
Member

There are two separate issues here:

  • Read the file by specifying the right encoding. This can be achieved easily with StringEncodings.
  • Detect the encoding automatically. ICU.jl already supports this, but it's a heavy dependency. Also I'm not sure it's very reliable, so doing that by default would be risky.
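For the first point, a minimal sketch of reading with an explicitly specified encoding, assuming the StringEncodings, CSV, and DataFrames packages are installed. The file name and sample contents are hypothetical; the sketch writes its own small SHIFT_JIS file so it is self-contained. It does not cover automatic detection.

```julia
using CSV, DataFrames, StringEncodings

# Hypothetical sample file, written in SHIFT_JIS so the sketch runs on its own.
path = joinpath(mktempdir(), "facilities_sjis.csv")
write(path, encode("id,コード\n1,10\n2,20\n", enc"SHIFT_JIS"))

# Read the raw bytes and decode them to a regular (UTF-8) Julia String.
text = decode(read(path), enc"SHIFT_JIS")

# Parse the decoded text; CSV.jl accepts an IO source.
df = CSV.read(IOBuffer(text), DataFrame)

println(names(df))
```

Decoding the whole file up front keeps the sketch simple but holds the full contents in memory; for large files, a streaming approach may be preferable.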

@quinnj
Member

quinnj commented Feb 27, 2021

It doesn't seem like there's anything actionable for CSV.jl here, right?

@nalimilan
Member

Well, you could add a dependency on StringEncodings and on an encoding detector, but I guess we don't want to do that. Maybe once optional dependencies are supported...

@quinnj
Member

quinnj commented Mar 1, 2021

Well, I was going off the comment of "auto string encoding detection isn't very reliable", so I figured we wouldn't want to get into that.

@quinnj quinnj closed this as completed Aug 20, 2021