readall() silently produces invalid UTF8Strings. #5977

pygy · 2014-02-27T16:27:33Z

Probably related to #1792.

txt = readall(foo) always produces UTF8Strings, with no regards for the incoming byte sequence. It can result in invalid strings (is_valid_utf8(txt) fails).

Having a RawString type, and a package that deals with non-(UTF-8/ASCII) encodings may help.

The text was updated successfully, but these errors were encountered:

JeffBezanson · 2014-02-27T16:57:07Z

I don't see how is_valid_utf8 can return true on the same data that convert to UTF8String fails on, since the conversion uses that exact same function.

pygy · 2014-02-27T17:03:19Z

Notice the .data in the convert call.

I hadn't tried is_valid_utf8 on the array, but it fails as well.

JeffBezanson · 2014-02-27T17:20:04Z

I still don't understand:

julia> txt = UTF8String(Uint8[0xff])
"�"

julia> is_valid_utf8(txt)
false

julia> convert(UTF8String,txt.data)
ERROR: invalid UTF-8 sequence
 in convert at utf8.jl:164

is_valid_utf8 does not return true. Your overall point is quite valid, but I don't understand this part of it.

pao · 2014-02-27T17:26:26Z

@pygy Do you have a test vector for this? I suspect that might help.

pygy · 2014-02-27T17:37:12Z

Indeed, my bad.

I hadn't tested is_valid_utf8(txt) in isolation. I was doing the check in foo(str::String), and there was a stray foo(str::UTF8String) in my code, bypassing the check.

I've edited the bug report.

Back to the point, how about:

abstract Bytestring

immutable (UTF8 | ASCII | Raw)String <: ByteString
    data::Array{UInt8,1}
end

It would allow the hypothetical encodings.jl to define other ByteStrings.

JeffBezanson · 2014-02-27T17:44:31Z

Ok, that's a relief.

I think I'd prefer the UTF8String constructor, or perhaps the I/O functions, to throw an error instead of returning an invalid string. Then you would have to set an encoding for the stream, or pass it, e.g. readall(f, UTF8String). Instead of RawString, I think we should just use byte vectors, since without knowing the encoding a RawString cannot really work like a String. What do you think?

pygy · 2014-02-27T17:52:50Z

I edited my former post while you were posting, but I agree with you about RawStrings. Passing the encoding to I/O functions would be better.

Having UTF8String() doing the validation by default would be the right thing, but, if you already know that the array is valid, it would be nice to be able to skip the check. Something like this, perhaps?

immutable UTF8String <: ByteString
    data::Array{UInt8,1}
    function UTF8String(data::Array{UInt8,1}; validate::Bool = true)
        if validate && !is_valid_utf8(data) throw(ErrorException("invalid UTF-8 sequence")) end
        new(data)
    end
end

JeffBezanson · 2014-02-27T20:51:20Z

closing as dup of #1792.

JeffBezanson closed this as completed Feb 27, 2014

pygy mentioned this issue Feb 27, 2014

reading invalid text data, other encodings #1792

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

readall() silently produces invalid UTF8Strings. #5977

readall() silently produces invalid UTF8Strings. #5977

pygy commented Feb 27, 2014

JeffBezanson commented Feb 27, 2014

pygy commented Feb 27, 2014

JeffBezanson commented Feb 27, 2014

pao commented Feb 27, 2014

pygy commented Feb 27, 2014

JeffBezanson commented Feb 27, 2014

pygy commented Feb 27, 2014

JeffBezanson commented Feb 27, 2014

readall() silently produces invalid UTF8Strings. #5977

readall() silently produces invalid UTF8Strings. #5977

Comments

pygy commented Feb 27, 2014

JeffBezanson commented Feb 27, 2014

pygy commented Feb 27, 2014

JeffBezanson commented Feb 27, 2014

pao commented Feb 27, 2014

pygy commented Feb 27, 2014

JeffBezanson commented Feb 27, 2014

pygy commented Feb 27, 2014

JeffBezanson commented Feb 27, 2014