Automatic character type encoding? #804

pdeffebach · 2021-01-29T15:01:25Z

I'm working with Japanese data right now, and it's encoded in SHIFT_JIS, except the metadata of the file doesn't know that. So it's gibberish when you try to open it in, say, a text editor.

CSV fails to read it, giving columns of byte arrays instead of strings. But uchardet worked successfully for at least one of the files. So maybe there is room for improvement.

Curiously, it failed in R on Linux but not on Windows.

quinnj · 2021-01-29T15:40:08Z

> CSV fails to read it, giving columns of byte arrays instead of strings.

Can you explain this a little more? I think the values probably are Strings, but Base Julia prints them like byte arrays when the bytes aren't valid UTF-8.
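
A minimal Base-only sketch of that behavior, for illustration. It assumes the four bytes below are the Shift_JIS encoding of "日本"; they were picked as an example and are not taken from the file in this issue.

```julia
# Shift_JIS bytes for "日本" (assumed example data), wrapped in a String
# without any conversion. Julia does not validate the bytes on construction.
raw = UInt8[0x93, 0xfa, 0x96, 0x7b]
s = String(copy(raw))

println(typeof(s))    # the value is an ordinary String
println(isvalid(s))   # false: the bytes are not well-formed UTF-8
show(s)               # invalid bytes are displayed as \x escapes
println()
```

So a column of values that look like `\x8eD\x96y…` can still have element type `String`; the escapes only indicate that the bytes were never transcoded to UTF-8.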

pdeffebach · 2021-01-29T16:10:39Z

You are right. Here is the output:

   Row │ Column1  id     address                            Facility\x83R\x81[\x83h  \x92n\x95\xfb\x8c\U ⋯
       │ Int64    Int64  String                             Int64                    Int64               ⋯
───────┼──────────────────────────────────────────────────────────────────────────────────────────────────
     1 │       1      1  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        1                      ⋯
     2 │       2      2  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        2
     3 │       3      3  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        3
     4 │       4      4  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        4
     5 │       5      5  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        5                      ⋯
     6 │       6      7  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        7
     7 │       7      8  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        8
     8 │       8      9  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                        9
     9 │       9     10  \x8eD\x96y\x8es\x92\x86\x89\x9b\…                       10                      ⋯

nalimilan · 2021-02-14T15:37:46Z

There are two separate issues here:

1. Read the file by specifying the right encoding. This can be achieved easily with StringEncodings.
2. Detect the encoding automatically. ICU.jl already supports this, but it's a heavy dependency. Also I'm not sure it's very reliable, so doing that by default would be risky.
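
The first point could look roughly like the sketch below, assuming the encoding is already known to be Shift_JIS. The helper name `read_encoded_csv` and the file name in the usage comment are hypothetical; `decode` comes from StringEncodings, and the `"SHIFT_JIS"` label is assumed to be one that the underlying iconv library recognizes.

```julia
using CSV, DataFrames, StringEncodings

# Hypothetical helper: transcode the whole file to UTF-8 in memory, then
# hand the result to CSV.jl. Suitable for files that fit in memory.
function read_encoded_csv(path::AbstractString, encoding::AbstractString)
    text = decode(read(path), encoding)   # Vector{UInt8} -> UTF-8 String
    return CSV.read(IOBuffer(text), DataFrame)
end

# Usage (placeholder file name):
# df = read_encoded_csv("data_sjis.csv", "SHIFT_JIS")
```

This keeps encoding handling outside CSV.jl, which only ever sees UTF-8 input. It does not address the second point: the caller still has to know or guess the encoding.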

quinnj · 2021-02-27T20:27:09Z

It doesn't seem like there's anything actionable for CSV.jl here, right?

nalimilan · 2021-02-28T18:43:59Z

Well, you could add a dependency on StringEncodings and on an encoding detector, but I guess we don't want to do that. Maybe once optional dependencies are supported...

quinnj · 2021-03-01T18:14:18Z

Well, I was going off the comment of "auto string encoding detection isn't very reliable", so I figured we wouldn't want to get into that.

quinnj closed this as completed Aug 20, 2021
