
reading / converting from files in different encodings #7834

Closed
WestleyArgentum opened this issue Aug 4, 2014 · 13 comments

Comments

@WestleyArgentum
Member

I've got a file in latin1 and I've been trying to work out the best way to bring it in and work with it.

The most straightforward approach doesn't quite work:

julia> readall("latin1.txt")
"aaa�aaa"

So right now I take the long road - reading in each individual character and then converting them to a string.

julia> f = open("latin1.txt")
IOStream(<file latin1.txt>)

julia> buff = Char[]
0-element Array{Char,1}

julia> while !eof(f)
          push!(buff, read(f, Uint8))
       end

julia> buff
7-element Array{Char,1}:
 'a'
 'a'
 'a'
 'ÿ'
 'a'
 'a'
 'a'

julia> string(buff...)
"aaaÿaaa"

Would it be possible / reasonable to add a keyword arg to readall (and company) that would do the same sort of conversion as string behind the scenes?

@JeffBezanson
Member

mostly dup of #1792

We unfortunately lack a Latin1String. But this can be done manually by using readbytes, then converting to a Uint32 array, then wrapping as a UTF32String.
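A sketch of that manual route, using the 0.3-era API names from this thread (`readbytes`, `Uint32`, `UTF32String`); the exact `UTF32String` construction is an assumption, not a tested recipe:

```julia
# Latin-1 code points coincide with Unicode U+0000–U+00FF, so widening each
# byte to a Uint32 code point is a correct decode.
bytes = open(readbytes, "latin1.txt")        # Vector{Uint8} of raw Latin-1 data
codepoints = convert(Vector{Uint32}, bytes)  # widen: byte value == code point
str = UTF32String(codepoints)                # assumption: wraps a code point vector
```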

@JeffBezanson
Member

We should probably add Latin1String to Base and leave other encodings to packages.

@ivarne
Member

ivarne commented Aug 5, 2014

Why should Latin1String be special? If we should have a package for encoding, Latin1 seems like a good start.

@JeffBezanson
Member

Ok, fair enough. Always happy to add fewer things to Base :)

@nalimilan
Member

Please don't add Latin1String. If you do that, I'm going to ask for Latin9String, and everybody is going to ask for his own encoding. My experience with text mining in R is that allowing non-Unicode encodings, which are not always supported depending on the system locale and OS type, is a nightmare. The only reasonable approach is to convert everything to Unicode on input.

@ivarne
Member

ivarne commented Aug 5, 2014

That is definitely why I want it in a package. It would be nice if we could make things work with different encodings in a package, though. As long as getindex(::String, ::Int) returns a Char with a Unicode code point, it seems possible to handle things transparently?

@nalimilan
Member

@ivarne Are you really sure it would be useful to support non-Unicode encodings as separate String types, even in a package? It really sounds like much trouble for no real gain. Serious people work with Unicode nowadays. And if getindex needs to return a Unichar anyway, better do the conversion directly on import, rather than on the fly (and possibly several times for the same sequence).

@jiahao
Member

jiahao commented Aug 5, 2014

The only reason why OP's code works at all is that there is an exact embedding of Latin-1 codepoints in the U+00:U+FF code point range in Unicode. So Latin-1 is privileged in a way that other encodings are not. (This doesn't necessarily mean we should special-case it in a String type though.)
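That embedding can be checked directly at the 0.3-era REPL of this thread: a Latin-1 byte and the Unicode code point it decodes to are the same number.

```julia
julia> char(0xff)        # byte 0xff in Latin-1 is 'ÿ'
'ÿ'

julia> uint8('ÿ')        # and U+00FF converts straight back
0xff
```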

@jiahao
Member

jiahao commented Aug 5, 2014

The only reasonable approach is to convert everything to Unicode on input.

💯

@ivarne
Member

ivarne commented Aug 5, 2014

@nalimilan I'm not sure how useful it will be, but Base does not seem like the right place for it. I'm more doodling possible interfaces than anything else.

We currently have a few supported charsets: ASCIIString, UTF8String, UTF32String, and so on. All of them use a backing array and return a Char on indexing. That seems to generalize nicely to other encodings as well.

The only reasonable approach is to convert everything to Unicode on input.

I would still argue that the conversion code could be written in Julia, and then you will need some types to tag things correctly. Then you will need a Char->Unichar mapping, and it seems right to let the user decide whether he needs a copy (in UTF8/16/32) of the whole string or not.

@JeffBezanson
Member

The intent of the Char type is to refer to unicode code points, and a String is a sequence of Chars, so technically Julia only supports unicode. This is necessary to make code like c::Char == 'a' a correct way to look for a certain character. A Char type without a definite character set doesn't really make sense.

However you can implement String any way you want, so the underlying data can be anything as long as it presents unicode Chars. Latin1String makes sense because it can do this very efficiently, but that's not a requirement. I/O is a totally different story --- in the real world there are files in non-unicode encodings. But I agree the best approach is to validate and convert on input.
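A minimal sketch of the kind of Latin1String Jeff describes, in the 0.3-style syntax of this thread (illustrative only, not a concrete Base proposal):

```julia
# A string backed by raw Latin-1 bytes that presents Unicode Chars on
# indexing. Because Latin-1 maps byte-for-byte onto U+0000–U+00FF, indexing
# and length are both O(1), unlike a variable-width encoding such as UTF-8.
immutable Latin1String <: DirectIndexString
    data::Vector{Uint8}
end

Base.getindex(s::Latin1String, i::Int) = char(s.data[i])
Base.length(s::Latin1String) = length(s.data)
Base.endof(s::Latin1String) = length(s.data)
```

Since getindex always hands back a Unicode Char, comparisons like `c == 'a'` keep working unchanged over such a string.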

@nalimilan
Member

I would still argue that the conversion code could be written in Julia, and then you will need some types to tag things correctly.

Better keep the data in a Vector{Uint8} until decode is called on it, as UnicodeExtras does (https://github.com/nolta/UnicodeExtras.jl#file-encoding), similar to how Python's bytearray is used. Otherwise you're going to need tens of String subtypes to support all existing encodings (have a look at the list at https://www.gnu.org/software/libiconv/).

Then you will need a Char->Unichar mapping, and it seems right to let the user decide whether he needs a copy (in UTF8/16/32) of the whole string or not.

I don't think creating a copy is a real problem. In virtually all cases, you don't need to keep two copies of the strings: you just read a small chunk of a text (e.g. one line), convert it to Unicode, and reuse the buffer to read another chunk. The overhead is not that big, and people with big databases will use Unicode anyway.
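The chunk-and-reuse pattern described above might look like this in the 0.3-era API; `decode_latin1` is a hypothetical helper standing in for whatever decoder a package (e.g. UnicodeExtras) would provide:

```julia
# Hypothetical decoder: in a single-byte encoding like Latin-1, each byte is
# already the right Unicode code point.
decode_latin1(raw::Vector{Uint8}) = string(map(char, raw)...)

f = open("latin1.txt")
while !eof(f)
    raw = readbytes(f, 4096)      # one small chunk of Latin-1 bytes
    line = decode_latin1(raw)     # convert just this chunk to Unicode
    # ... process `line`; the byte buffer never outlives the loop iteration
end
close(f)
```

Note that cutting at arbitrary byte boundaries is only safe because Latin-1 is a single-byte encoding; a multi-byte encoding would need care at chunk edges.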

@JeffBezanson
Member

Closing as dup of #1792, feel free to continue the discussion there.
