Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readall() silently produces invalid UTF8Strings. #5977

Closed
pygy opened this issue Feb 27, 2014 · 8 comments
Closed

readall() silently produces invalid UTF8Strings. #5977

pygy opened this issue Feb 27, 2014 · 8 comments

Comments

@pygy
Copy link
Contributor

pygy commented Feb 27, 2014

Probably related to #1792.

txt = readall(foo) always produces UTF8Strings, with no regards for the incoming byte sequence. It can result in invalid strings (is_valid_utf8(txt) fails).

Having a RawString type, and a package that deals with non-(UTF-8/ASCII) encodings may help.

@JeffBezanson
Copy link
Member

I don't see how is_valid_utf8 can return true on the same data that convert to UTF8String fails on, since the conversion uses that exact same function.

@pygy
Copy link
Contributor Author

pygy commented Feb 27, 2014

Notice the .data in the convert call.

I hadn't tried is_valid_utf8 on the array, but it fails as well.

@JeffBezanson
Copy link
Member

I still don't understand:

julia> txt = UTF8String(Uint8[0xff])
"�"

julia> is_valid_utf8(txt)
false

julia> convert(UTF8String,txt.data)
ERROR: invalid UTF-8 sequence
 in convert at utf8.jl:164

is_valid_utf8 does not return true. Your overall point is quite valid, but I don't understand this part of it.

@pao
Copy link
Member

pao commented Feb 27, 2014

@pygy Do you have a test vector for this? I suspect that might help.

@pygy
Copy link
Contributor Author

pygy commented Feb 27, 2014

Indeed, my bad.

I hadn't tested is_valid_utf8(txt) in isolation. I was doing the check in foo(str::String), and there was a stray foo(str::UTF8String) in my code, bypassing the check.

I've edited the bug report.

Back to the point, how about:

abstract Bytestring

immutable (UTF8 | ASCII | Raw)String <: ByteString
    data::Array{UInt8,1}
end

It would allow the hypothetical encodings.jl to define other ByteStrings.

@JeffBezanson
Copy link
Member

Ok, that's a relief.

I think I'd prefer the UTF8String constructor, or perhaps the I/O functions, to throw an error instead of returning an invalid string. Then you would have to set an encoding for the stream, or pass it, e.g. readall(f, UTF8String). Instead of RawString, I think we should just use byte vectors, since without knowing the encoding a RawString cannot really work like a String. What do you think?

@pygy
Copy link
Contributor Author

pygy commented Feb 27, 2014

I edited my former post while you were posting, but I agree with you about RawStrings. Passing the encoding to I/O functions would be better.

Having UTF8String() doing the validation by default would be the right thing, but, if you already know that the array is valid, it would be nice to be able to skip the check. Something like this, perhaps?

immutable UTF8String <: ByteString
    data::Array{UInt8,1}
    function UTF8String(data::Array{UInt8,1}; validate::Bool = true)
        if validate && !is_valid_utf8(data) throw(ErrorException("invalid UTF-8 sequence")) end
        new(data)
    end
end

@JeffBezanson
Copy link
Member

closing as dup of #1792.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants