
reading / converting from files in different encodings #7834

Closed
WestleyArgentum opened this issue Aug 4, 2014 · 13 comments

Comments

@WestleyArgentum
Member

I've got a file in latin1 and I've been trying to work out the best way to bring it in and work with it.

The most straightforward approach doesn't quite work:

julia> readall("latin1.txt")
"aaa�aaa"

So right now I take the long road - reading in each individual character and then converting them to a string.

julia> f = open("latin1.txt")
IOStream(<file latin1.txt>)

julia> buff = Char[]
0-element Array{Char,1}

julia> while !eof(f)
          push!(buff, read(f, Uint8))
       end

julia> buff
7-element Array{Char,1}:
 'a'
 'a'
 'a'
 'ÿ'
 'a'
 'a'
 'a'

julia> string(buff...)
"aaaÿaaa"

Would it be possible / reasonable to add a keyword arg to readall (and company) that would do the same sort of conversion as string behind the scenes?

@JeffBezanson
Member

mostly dup of #1792

We unfortunately lack a Latin1String. But this can be done manually by using readbytes, then converting to a Uint32 array, then wrapping as a UTF32String.
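A sketch of that manual route, using the 0.3-era API names from this thread (`readbytes`, `Uint32`, `UTF32String`); the exact `UTF32String` construction is an assumption, not a tested recipe:

```julia
# Latin-1 code points coincide with Unicode U+0000–U+00FF, so widening each
# byte to a Uint32 code point is a correct decode.
bytes = open(readbytes, "latin1.txt")        # Vector{Uint8} of raw Latin-1 data
codepoints = convert(Vector{Uint32}, bytes)  # widen: byte value == code point
str = UTF32String(codepoints)                # assumption: wraps a code point vector
```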

@JeffBezanson
Member

We should probably add Latin1String to Base and leave other encodings to packages.

@ivarne
Member

ivarne commented Aug 5, 2014

Why should Latin1String be special? If we should have a package for encoding, Latin1 seems like a good start.

@JeffBezanson
Member

Ok, fair enough. Always happy to add fewer things to Base :)

@nalimilan
Member

Please don't add Latin1String. If you do that, I'm going to ask for Latin9String, and everybody is going to ask for his own encoding. My experience with text mining in R is that allowing non-Unicode encodings, which are not always supported depending on the system locale and OS type, is a nightmare. The only reasonable approach is to convert everything to Unicode on input.

@ivarne
Member

ivarne commented Aug 5, 2014

That is definitely why I want it in a package. It would be nice if we could make things work with different encodings in a package, though. As long as getindex(::String, ::Int) returns a Char with a Unicode code point, it seems possible to handle things transparently?

@nalimilan
Member

@ivarne Are you really sure it would be useful to support non-Unicode encodings as separate String types, even in a package? It really sounds like much trouble for no real gain. Serious people work with Unicode nowadays. And if getindex needs to return a Unichar anyway, better do the conversion directly on import, rather than on the fly (and possibly several times for the same sequence).

@jiahao
Member

jiahao commented Aug 5, 2014

The only reason why OP's code works at all is that there is an exact embedding of Latin-1 codepoints in the U+00:U+FF code point range in Unicode. So Latin-1 is privileged in a way that other encodings are not. (This doesn't necessarily mean we should special-case it in a String type though.)
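That embedding can be checked directly at the 0.3-era REPL of this thread: a Latin-1 byte and the Unicode code point it decodes to are the same number.

```julia
julia> char(0xff)        # byte 0xff in Latin-1 is 'ÿ'
'ÿ'

julia> uint8('ÿ')        # and U+00FF converts straight back
0xff
```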

@jiahao
Member

jiahao commented Aug 5, 2014

The only reasonable approach is to convert everything to Unicode on input.

💯

@ivarne
Member

ivarne commented Aug 5, 2014

@nalimilan I'm not sure how useful it will be, but Base does not seem like the right place for it. I'm more doodling possible interfaces than anything else.

We currently have a few supported charsets: ASCIIString, UTF8String, UTF32String, and so on. All of them use a backing array and return a Char on indexing. That seems to generalize nicely to other encodings as well.

The only reasonable approach is to convert everything to Unicode on input.

I would still argue that the conversion code could be written in Julia, and then you will need some types to tag things correctly. Then you will need a Char->Unichar mapping, and it seems right to let the user decide whether he needs a copy (in UTF8/16/32) of the whole string or not.

@JeffBezanson
Member

The intent of the Char type is to refer to unicode code points, and a String is a sequence of Chars, so technically Julia only supports unicode. This is necessary to make code like c::Char == 'a' a correct way to look for a certain character. A Char type without a definite character set doesn't really make sense.

However you can implement String any way you want, so the underlying data can be anything as long as it presents unicode Chars. Latin1String makes sense because it can do this very efficiently, but that's not a requirement. I/O is a totally different story --- in the real world there are files in non-unicode encodings. But I agree the best approach is to validate and convert on input.
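A minimal sketch of the kind of Latin1String Jeff describes, in the 0.3-style syntax of this thread (illustrative only, not a concrete Base proposal):

```julia
# A string backed by raw Latin-1 bytes that presents Unicode Chars on
# indexing. Because Latin-1 maps byte-for-byte onto U+0000–U+00FF, indexing
# and length are both O(1), unlike a variable-width encoding such as UTF-8.
immutable Latin1String <: DirectIndexString
    data::Vector{Uint8}
end

Base.getindex(s::Latin1String, i::Int) = char(s.data[i])
Base.length(s::Latin1String) = length(s.data)
Base.endof(s::Latin1String) = length(s.data)
```

Since getindex always hands back a Unicode Char, comparisons like `c == 'a'` keep working unchanged over such a string.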

@nalimilan
Member

I would still argue that the conversion code could be written in Julia, and then you will need some types to tag things correctly.

Better keep the data in a Vector{Uint8} until decode is called on it, as UnicodeExtras does (https://github.com/nolta/UnicodeExtras.jl#file-encoding), similar to how Python's bytearray is used. Otherwise you're going to need tens of String subtypes to support all existing encodings (have a look at the list at https://www.gnu.org/software/libiconv/).

Then you will need a Char->Unichar mapping, and it seems right to let the user decide whether he needs a copy (in UTF8/16/32) of the whole string or not.

I don't think creating a copy is a real problem. In virtually all cases, you don't need to keep two copies of the strings: you just read a small chunk of a text (e.g. one line), convert it to Unicode, and reuse the buffer to read another chunk. The overhead is not that big, and people with big databases will use Unicode anyway.
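The chunk-and-reuse pattern described above might look like this in the 0.3-era API; `decode_latin1` is a hypothetical helper standing in for whatever decoder a package (e.g. UnicodeExtras) would provide:

```julia
# Hypothetical decoder: in a single-byte encoding like Latin-1, each byte is
# already the right Unicode code point.
decode_latin1(raw::Vector{Uint8}) = string(map(char, raw)...)

f = open("latin1.txt")
while !eof(f)
    raw = readbytes(f, 4096)      # one small chunk of Latin-1 bytes
    line = decode_latin1(raw)     # convert just this chunk to Unicode
    # ... process `line`; the byte buffer never outlives the loop iteration
end
close(f)
```

Note that cutting at arbitrary byte boundaries is only safe because Latin-1 is a single-byte encoding; a multi-byte encoding would need care at chunk edges.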

@JeffBezanson
Member

Closing as dup of #1792, feel free to continue the discussion there.
