
reading invalid text data, other encodings #1792

Closed

JeffBezanson opened this issue Dec 20, 2012 · 24 comments

Labels: needs decision (a decision on this change is needed)

@JeffBezanson
Member

We probably need to change 1 or 2 behaviors when reading invalid data or unknown encodings. There are two cases: Char, and things like readuntil/readline.

readuntil can be reasonably defined in terms of bytes: just read everything until a certain value. This is good because then you can at least get the data without explicit support for every encoding. Currently we might return an invalid UTF8String, from which you can get the (unaltered) data. I don't know whether that is the best approach. Maybe there should be a lower-level routine that returns a byte array. We also need functions that do the same for different fixed-width encodings (16-bit, 32-bit).

I don't think reading a Char can be done reasonably without knowing the encoding. The best immediate change I can think of is to give an error for invalid data while trying to read a UTF-8-encoded Char.

@JeffBezanson
Member Author

We could have readuntil(io, c) where the type of c determines what kind of units to read --- byte at a time, Char at a time, Int16 at a time, etc.
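A minimal sketch of this typed-delimiter idea, in present-day Julia syntax (readuntil_units is a hypothetical name, and 2012-era code would have written Uint16 rather than UInt16):

function readuntil_units(io::IO, delim::T) where {T<:Union{UInt8,UInt16,UInt32}}
    out = T[]
    while !eof(io)
        x = read(io, T)        # one fixed-width code unit at a time
        push!(out, x)
        x == delim && break    # delimiter found; stop
    end
    return out
end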

@pao
Member

pao commented Dec 20, 2012

UTF-8 is a variable-width encoding, though, right? If I'm trying to read up to a letter "l"/0x6C in a text that also happens to contain the simplified Chinese character ma3 ("horse"), which is 马/0x9A6C, that might be a problem if I'm not decoding it as UTF-8.

Does there need to be a fast byte-oriented readuntil() and a Unicode-aware readuntil()?

(Note: I do not know any Chinese. This just happens to be a frequent example character on Language Log, and also contains a byte in the low ASCII range.)

@StefanKarpinski
Member

I really think we need a layered approach here where you can get raw bytes and then build strings from those. Searching for a character without specifying an encoding makes no sense. Auto-detecting the encoding by default seems like it would be sufficient. It would ideally be done lazily, in the sense that you could start with ASCII and promote a stream to Latin-1 or UTF-8 if and when a character that gives a clue is encountered (e.g. invalid UTF-8 implies Latin-1).

@JeffBezanson
Member Author

Yes, I was proposing a low-level readuntil that reads int8, int16, etc., since those can be useful independent of any text-related issues. Text reading functions can be built on top. Using readuntil with a Char delimiter is a different case. It has to know the encoding up front. But, for example, reading up to an ASCII character from a UTF-8 stream can use the byte-based readuntil.

@pao
Member

pao commented Dec 21, 2012

The point of the above Chinese character digression was that reading up to an ASCII character from a UTF-8 stream cannot use the byte-based readuntil, since ASCII characters appear in the middle of UTF-8 byte sequences which represent non-ASCII characters.

@nolta
Member

nolta commented Dec 21, 2012

"马" is [e9,a9,ac] in utf8.


@JeffBezanson
Member Author

In fact I believe searching for ASCII characters in this way was one of the major design considerations for UTF-8.
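The property in question is UTF-8's self-synchronization: every lead and continuation byte of a multi-byte sequence has its high bit set, so an ASCII byte can never occur inside the encoding of a non-ASCII character. A quick check in present-day Julia syntax:

bytes = codeunits("马l")              # UTF-8 code units: 0xe9 0xa9 0xac 0x6c
findfirst(==(UInt8('l')), bytes)      # returns 4: no false hit inside 马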

@Keno
Member

Keno commented Dec 21, 2012

@pao I think you might be confusing the Unicode codepoint with the actual UTF-8 encoding.

@pao
Member

pao commented Dec 21, 2012

Apparently so. Nothing to see here, move along. Argh Unicode, so useful yet complicated.

@JeffBezanson
Member Author

Come to think of it, is there a good reason not to have Latin1String instead of ASCIIString? It is also a subset of Unicode, and gives us more characters for almost no additional effort. It seems like a better default in case invalid UTF-8 is encountered, since every byte array is valid Latin-1.

@StefanKarpinski
Member

ASCII is a sub-encoding of UTF-8 whereas Latin-1 is not. Thus, you can concatenate ASCII and UTF-8 data and the result is the UTF-8 encoding of the concatenation of the inputs. With Latin-1 this doesn't work; instead you have to transcode the Latin-1 data first, expanding high bytes to UTF-8 two-byte sequences, before concatenating the data. You may recall that we originally had Latin1String and UTF8String but changed to ASCIIString for exactly these reasons.

@JeffBezanson
Member Author

OK, maybe we should just add a function to convert Latin-1 to UTF-8.
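The conversion is mechanical, since the 256 Latin-1 byte values map one-to-one onto codepoints U+0000 through U+00FF. A sketch in present-day syntax (latin1_to_utf8 is a hypothetical name, not a function that was added):

function latin1_to_utf8(bytes::AbstractVector{UInt8})
    out = UInt8[]
    for b in bytes
        if b < 0x80
            push!(out, b)                   # ASCII byte: already valid UTF-8
        else
            push!(out, 0xc0 | (b >> 6))     # lead byte, always 0xc2 or 0xc3
            push!(out, 0x80 | (b & 0x3f))   # continuation byte
        end
    end
    return String(out)
end

latin1_to_utf8(UInt8[0x63, 0x61, 0x66, 0xe9])   # "café"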

@StefanKarpinski
Member

That doesn't really seem to solve the kinds of problems we've been seeing. Mostly we've been having problems with UTF-8 strings that contain invalid UTF-8 data like 0xff bytes (which can never appear in valid UTF-8). Since 346bf5e we handle this pretty well, but regexes still don't handle it. For readuntil(io, c::Char) where io is UTF-8 encoded and c is an ASCII character, we should search for uint8(c) instead, because that will a) be faster, b) handle invalid UTF-8 data reasonably, and c) be close to working if the data is really Latin-1 that we're misinterpreting as UTF-8.
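A sketch of that dispatch (readuntil_char is a hypothetical name; in present-day Julia the byte-level method readuntil(io, delim::UInt8) exists and returns raw bytes):

function readuntil_char(io::IO, c::Char)
    if isascii(c)
        # fast path: plain byte search, robust even when the stream
        # contains invalid UTF-8 elsewhere
        return String(readuntil(io, UInt8(c), keep=true))
    end
    buf = IOBuffer()
    while !eof(io)
        ch = read(io, Char)   # decodes one character; malformed input
        write(buf, ch)        # comes back as an invalid Char
        ch == c && break
    end
    return String(take!(buf))
end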

@JeffBezanson
Member Author

We already support reading until an Int8; I will just reshuffle the interface a bit. Adding the ability to interpret as Latin-1 is orthogonal to the other stuff. Reading Latin-1 as invalid UTF-8 is one of our problems.

@StefanKarpinski
Member

To make things really smooth, we probably need to do some degree of encoding autodetection. UTF-32 and UTF-16 are easy to autodetect; Latin-1 and UTF-8 are unfortunately both common and hard to distinguish. I was thinking that the way to go would be to start out with ASCII and promote the encoding whenever something that is unequivocally Latin-1 or UTF-8 occurs, which implies a design where the encoding of a stream can change on the fly.
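The easy cases are easy because UTF-16 and UTF-32 text typically begins with a byte-order mark. A BOM-sniffing sketch (detect_bom is a hypothetical helper; Latin-1 versus UTF-8 cannot be settled this way):

function detect_bom(bytes::AbstractVector{UInt8})
    n = length(bytes)
    # check UTF-32 first: its LE BOM begins with the UTF-16 LE BOM
    n >= 4 && bytes[1:4] == [0x00, 0x00, 0xfe, 0xff] && return "UTF-32BE"
    n >= 4 && bytes[1:4] == [0xff, 0xfe, 0x00, 0x00] && return "UTF-32LE"
    n >= 3 && bytes[1:3] == [0xef, 0xbb, 0xbf]       && return "UTF-8"
    n >= 2 && bytes[1:2] == [0xfe, 0xff]             && return "UTF-16BE"
    n >= 2 && bytes[1:2] == [0xff, 0xfe]             && return "UTF-16LE"
    return nothing   # no BOM; fall back to statistics or a user choice
end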

@JeffBezanson
Member Author

Autodetection is not that easy; it requires a decent chunk of sample text and possibly some statistical analysis. We are currently auto-selecting ASCII or UTF-8 for whole lines, which can probably be extended to UTF-16, UTF-32, and Latin-1. However, this does not work well for single characters. I'm not sure we want the state of a stream to change out from under you, such that the behavior of read(file, Char) is different at different times. So we need a good manually-controlled API (with a possible added benefit of better type info) at some level.

We should have an official approach to corrupt UTF-8. One possible answer is that you get an unspecified type of string whose .data field holds the original bytes. Or we could allow a user-passed function to be called on invalid data to decide what to do with it. Or leave it totally undefined. Or the approach might vary based on whether the encoding was explicitly specified or autodetected.

JeffBezanson added a commit that referenced this issue Dec 30, 2012
returns a string for Char delim, otherwise an array
readuntil(s, uint8('\n')) provides a way to read a line without an encoding check
streams are encouraged to provide an efficient readuntil(s, ::Uint8),
  then reading until ASCII delimiters is fast, leaving the encoding
  logic to higher-level layers in io.jl and string.jl.
for #1792
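The same layering survives in present-day Julia, where the delimiter type still selects the method and the return type:

io = IOBuffer("héllo\nworld")
line_bytes = readuntil(io, UInt8('\n'))   # Vector{UInt8}: raw bytes, no decoding
seekstart(io)
line_str = readuntil(io, '\n')            # String: the Char method decodes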
@nalimilan
Member

My two cents from somebody who has worked with text data in French: it is quite common to get a few invalid UTF-8 sequences, for example because a content provider got the encoding of one text wrong in the middle of a very large corpus. Rather than an error, a warning with automatic replacement of invalid sequences would be more practical. That makes it easier to spot the incorrect characters after loading the data; for example, if you load texts stored in XML, it is easier to find where the problem comes from in the parsed result than in the source.
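A sketch of that warn-and-replace policy (sanitize_utf8 is a hypothetical name; it relies on the fact that iterating a present-day Julia String yields invalid Chars for malformed sequences rather than throwing):

function sanitize_utf8(s::AbstractString)
    chars = collect(s)             # malformed sequences become invalid Chars
    if any(!isvalid, chars)
        @warn "invalid UTF-8 sequences replaced with U+FFFD"
    end
    return String([isvalid(c) ? c : '\ufffd' for c in chars])
end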

@pygy
Contributor

pygy commented Feb 27, 2014

Summary of the #5977 discussion (about readall(), but it applies here too):

  • Passing the encoding as a parameter to read* functions would sidestep the auto-detection problem. Default to returning a byte array in the absence of an encoding (this would cover @nalimilan's use case).
  • Define ByteString as the parent of (UTF8|ASCII)String, and deal with more exotic encodings (extended ASCII, UTF-16/UCS-2, EBCDIC ;-], and what not) in an independent package (Encodings.jl?). There you could define other ByteString subtypes.
  • By default, have the *String constructors validate their input. Add an optional named argument validate::Bool = true to the constructors to bypass validation of known-valid arrays (see the sketch below).
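A sketch of that last bullet (ValidatedString is a hypothetical type; the keyword never made it into Base):

struct ValidatedString
    data::Vector{UInt8}
    function ValidatedString(data::Vector{UInt8}; validate::Bool=true)
        if validate && !isvalid(String, data)
            throw(ArgumentError("invalid UTF-8 data"))
        end
        return new(data)
    end
end

bytes = Vector{UInt8}("déjà vu")
ValidatedString(bytes)                    # validated on construction
ValidatedString(bytes; validate=false)    # caller vouches; skip the O(n) scan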

@JeffBezanson
Member Author

I just want to point out that we currently incur the overhead of an encoding check on input to select ASCIIString or UTF8String, but we do not take advantage of this by writing faster UTF-8 code that can assume valid data. The majority of the code in getindex(s::UTF8String, i::Int) is for error checking.

@johnmyleswhite
Member

As a follow-up to my comment about moving between immutable strings and raw bytes, it would be good to make sure we can create strings without doing any encoding checks in the next iteration of string types.

@StefanKarpinski
Member

Ideally, we can merge ASCIIString and UTF8String and keep doing the runtime validity checks but make the valid case a very efficient "fast path". IIRC, we had a version of this some time ago, but there were a bunch of compiler optimizations needed to make it fast enough. Perhaps we have those at this point?


@JeffBezanson
Member Author

You just can't get the best performance without assuming valid data. For example, if you expect 2 continuation bytes you could just fetch the next 2 bytes and do the math, but to validate you'd have to check that 2 bytes are available and that they are actually continuation bytes. That necessarily means more branches. In fact we aren't even validating to this extent now. For example

julia> s = UTF8String(Uint8[0xc0,0x00,0x00,0x00,0x00])
"\Uffffff80\0\0\0"

julia> s[1]
'\Uffffff80'

julia> is_valid_utf8(s)
false

Fully validated UTF-8 access is not amenable to a fast path: validation needs extra checks for every byte, and that cost is unreasonable to pay on every character access.
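To make the branch count concrete, here is a sketch for a three-byte sequence (lead byte 0xe0 through 0xef); the names are hypothetical, and full validation would also have to reject overlong forms and surrogates:

# trusting decode: pure bit arithmetic, no branches
trusting(b1, b2, b3) = Char(((UInt32(b1) & 0x0f) << 12) |
                            ((UInt32(b2) & 0x3f) << 6) |
                            (UInt32(b3) & 0x3f))

# validating decode: one extra branch per continuation byte
function validating(b1, b2, b3)
    ((b2 & 0xc0) == 0x80 && (b3 & 0xc0) == 0x80) ||
        throw(ArgumentError("invalid continuation byte"))
    return trusting(b1, b2, b3)
end

trusting(0xe9, 0xa9, 0xac)   # '马'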

@JeffBezanson changed the title from "reading invalid text data" to "reading invalid text data, other encodings" on Aug 5, 2014
@nalimilan
Member

An update on this old issue for people who might end up here: all string types are going to be replaced with a single String type stored in UTF-8 (#14383). I've just created a package to convert text from/to different encodings: https://github.com/nalimilan/StringEncodings.jl. Finally, additional string types storing data in arbitrary encodings will likely be supported in another package (or maybe in StringEncodings.jl).
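For anyone landing here, the basic interface of that package looks like this (decode and encode are StringEncodings.jl's documented entry points; treat the exact forms as an assumption of this sketch):

using StringEncodings
s = decode([0x63, 0x61, 0x66, 0xe9], "ISO-8859-1")   # "café"
b = encode("café", "ISO-8859-1")                     # back to Latin-1 bytes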

So I'd say this issue can be closed.

@tkelman
Contributor

tkelman commented May 12, 2016

subsumed by #16107

@tkelman closed this as completed on May 12, 2016