
reading invalid text data, other encodings #1792

Closed

JeffBezanson opened this issue Dec 20, 2012 · 24 comments

Labels: needs decision (a decision on this change is needed)

@JeffBezanson
Member

We probably need to change 1 or 2 behaviors when reading invalid data or unknown encodings. There are two cases: Char, and things like readuntil/readline.

readuntil can be reasonably defined in terms of bytes: just read everything until a certain value. This is good because then you can at least get the data without explicit support for every encoding. Currently we might return an invalid UTF8String, from which you can get the (unaltered) data. I don't know whether that is the best approach. Maybe there should be a lower-level routine that returns a byte array. We also need functions that do the same for different fixed-width encodings (16-bit, 32-bit).

I don't think reading a Char can be done reasonably without knowing the encoding. The best immediate change I can think of is to give an error for invalid data while trying to read a UTF-8-encoded Char.

@JeffBezanson
Member Author

We could have readuntil(io, c) where the type of c determines what kind of units to read --- byte at a time, Char at a time, Int16 at a time, etc.
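A minimal sketch of this typed-delimiter idea, in present-day Julia syntax (readuntil_units is a hypothetical name, and 2012-era code would have written Uint16 rather than UInt16):

function readuntil_units(io::IO, delim::T) where {T<:Union{UInt8,UInt16,UInt32}}
    out = T[]
    while !eof(io)
        x = read(io, T)        # one fixed-width code unit at a time
        push!(out, x)
        x == delim && break    # delimiter found; stop
    end
    return out
end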

@pao
Member

pao commented Dec 20, 2012

UTF-8 is a variable-width encoding, though, right? If I'm trying to read up to a letter "l"/0x6C in a text that also happens to contain the simplified Chinese character ma3 ("horse"), which is 马/0x9A6C, that might be a problem if I'm not decoding it as UTF-8.

Does there need to be a fast byte-oriented readuntil() and a Unicode-aware readuntil()?

(Note: I do not know any Chinese. This just happens to be a frequent example character on Language Log, and also contains a byte in the low ASCII range.)

@StefanKarpinski
Member

I really think we need a layered approach here where you can get raw bytes and then build strings from those. Searching for a character without specifying an encoding makes no sense. Auto-detecting the encoding by default seems like it would be sufficient. It would ideally be done lazily, in the sense that you could start with ASCII and promote a stream to Latin-1 or UTF-8 if and when a character that gives a clue is encountered (e.g. invalid UTF-8 implies Latin-1).

@JeffBezanson
Member Author

Yes, I was proposing a low-level readuntil that reads int8, int16, etc., since those can be useful independent of any text-related issues. Text reading functions can be built on top. Using readuntil with a Char delimiter is a different case. It has to know the encoding up front. But, for example, reading up to an ASCII character from a UTF-8 stream can use the byte-based readuntil.

@pao
Member

pao commented Dec 21, 2012

The point of the above Chinese character digression was that reading up to an ASCII character from a UTF-8 stream cannot use the byte-based readuntil, since ASCII characters appear in the middle of UTF-8 byte sequences which represent non-ASCII characters.

@nolta
Member

nolta commented Dec 21, 2012

"马" is [e9,a9,ac] in utf8.


@JeffBezanson
Member Author

In fact I believe searching for ASCII characters in this way was one of the major design considerations for UTF-8.
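The property in question is UTF-8's self-synchronization: every lead and continuation byte of a multi-byte sequence has its high bit set, so an ASCII byte can never occur inside the encoding of a non-ASCII character. A quick check in present-day Julia syntax:

bytes = codeunits("马l")              # UTF-8 code units: 0xe9 0xa9 0xac 0x6c
findfirst(==(UInt8('l')), bytes)      # returns 4: no false hit inside 马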

@Keno
Member

Keno commented Dec 21, 2012

@pao I think you might be confusing the Unicode codepoint with the actual UTF-8 encoding.

@pao
Member

pao commented Dec 21, 2012

Apparently so. Nothing to see here, move along. Argh Unicode, so useful yet complicated.

@JeffBezanson
Member Author

Come to think of it, is there a good reason not to have Latin1String instead of ASCIIString? It is also a subset of Unicode, and gives us more characters for almost no additional effort. It seems like a better default in case invalid UTF-8 is encountered, since every byte array is valid Latin-1.

@StefanKarpinski
Member

ASCII is a sub-encoding of UTF-8 whereas Latin-1 is not. Thus, you can concatenate ASCII and UTF-8 data and the result is the UTF-8 encoding of the concatenation of the inputs. With Latin-1 this doesn't work; instead you have to transcode the Latin-1 data first, expanding high bytes to UTF-8 two-byte sequences, before concatenating the data. You may recall that we originally had Latin1String and UTF8String but changed to ASCIIString for exactly these reasons.

@JeffBezanson
Member Author

OK, maybe we should just add a function to convert Latin-1 to UTF-8.
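The conversion is mechanical, since the 256 Latin-1 byte values map one-to-one onto codepoints U+0000 through U+00FF. A sketch in present-day syntax (latin1_to_utf8 is a hypothetical name, not a function that was added):

function latin1_to_utf8(bytes::AbstractVector{UInt8})
    out = UInt8[]
    for b in bytes
        if b < 0x80
            push!(out, b)                   # ASCII byte: already valid UTF-8
        else
            push!(out, 0xc0 | (b >> 6))     # lead byte, always 0xc2 or 0xc3
            push!(out, 0x80 | (b & 0x3f))   # continuation byte
        end
    end
    return String(out)
end

latin1_to_utf8(UInt8[0x63, 0x61, 0x66, 0xe9])   # "café"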

@StefanKarpinski
Member

That doesn't really seem to solve the kinds of problems we've been seeing. Mostly we've been having problems with UTF-8 strings that contain invalid UTF-8 data like 0xff bytes (which can never appear in valid UTF-8). Since 346bf5e we handle this pretty well, but regexes still don't handle it. For readuntil(io, c::Char) where io is UTF-8 encoded and c is an ASCII character, we should search for uint8(c) instead, because that will a) be faster, b) handle invalid UTF-8 data reasonably, and c) be close to working if the data is really Latin-1 that we're misinterpreting as UTF-8.
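A sketch of that dispatch (readuntil_char is a hypothetical name; in present-day Julia the byte-level method readuntil(io, delim::UInt8) exists and returns raw bytes):

function readuntil_char(io::IO, c::Char)
    if isascii(c)
        # fast path: plain byte search, robust even when the stream
        # contains invalid UTF-8 elsewhere
        return String(readuntil(io, UInt8(c), keep=true))
    end
    buf = IOBuffer()
    while !eof(io)
        ch = read(io, Char)   # decodes one character; malformed input
        write(buf, ch)        # comes back as an invalid Char
        ch == c && break
    end
    return String(take!(buf))
end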

@JeffBezanson
Member Author

We already support reading until an Int8; I will just reshuffle the interface a bit. Adding the ability to interpret as Latin-1 is orthogonal to the other stuff. Reading Latin-1 as invalid UTF-8 is one of our problems.

@StefanKarpinski
Member

To make things really smooth, we probably need to do some degree of encoding autodetection. UTF-32 and UTF-16 are easy to autodetect; Latin-1 and UTF-8 are unfortunately both common and hard to distinguish. I was thinking that the way to go would be to start out with ASCII and promote the encoding whenever something that is unequivocally Latin-1 or UTF-8 occurs, which implies a design where the encoding of a stream can change on the fly.
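The easy cases are easy because UTF-16 and UTF-32 text typically begins with a byte-order mark. A BOM-sniffing sketch (detect_bom is a hypothetical helper; Latin-1 versus UTF-8 cannot be settled this way):

function detect_bom(bytes::AbstractVector{UInt8})
    n = length(bytes)
    # check UTF-32 first: its LE BOM begins with the UTF-16 LE BOM
    n >= 4 && bytes[1:4] == [0x00, 0x00, 0xfe, 0xff] && return "UTF-32BE"
    n >= 4 && bytes[1:4] == [0xff, 0xfe, 0x00, 0x00] && return "UTF-32LE"
    n >= 3 && bytes[1:3] == [0xef, 0xbb, 0xbf]       && return "UTF-8"
    n >= 2 && bytes[1:2] == [0xfe, 0xff]             && return "UTF-16BE"
    n >= 2 && bytes[1:2] == [0xff, 0xfe]             && return "UTF-16LE"
    return nothing   # no BOM; fall back to statistics or a user choice
end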

@JeffBezanson
Member Author

Autodetection is not that easy; it requires a decent chunk of sample text and possibly some statistical analysis. We are currently auto-selecting ASCII or UTF-8 for whole lines, which can probably be extended to UTF-16, UTF-32, and Latin-1. However, this does not work well for single characters. I'm not sure we want the state of a stream to change out from under you, such that the behavior of read(file, Char) is different at different times. So we need a good manually-controlled API (with a possible added benefit of better type info) at some level.

We should have an official approach to corrupt UTF-8. One possible answer is that you get an unspecified type of string whose .data field holds the original bytes. Or we could allow a user-passed function to be called on invalid data to decide what to do with it. Or leave it totally undefined. Or the approach might vary based on whether the encoding was explicitly specified or autodetected.

JeffBezanson added a commit that referenced this issue Dec 30, 2012
returns a string for Char delim, otherwise an array
readuntil(s, uint8('\n')) provides a way to read a line without an encoding check
streams are encouraged to provide an efficient readuntil(s, ::Uint8),
  then reading until ASCII delimiters is fast, leaving the encoding
  logic to higher-level layers in io.jl and string.jl.
for #1792
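The same layering survives in present-day Julia, where the delimiter type still selects the method and the return type:

io = IOBuffer("héllo\nworld")
line_bytes = readuntil(io, UInt8('\n'))   # Vector{UInt8}: raw bytes, no decoding
seekstart(io)
line_str = readuntil(io, '\n')            # String: the Char method decodes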
@nalimilan
Member

My two cents from somebody who has worked with text data in French: it is quite common to get a few invalid UTF-8 sequences, for example because a content provider got the encoding of one text wrong in the middle of a very large corpus. Rather than an error, a warning with automatic replacement of invalid sequences would be more practical. That makes it easier to spot the incorrect characters after loading the data; for example, if you load texts stored in XML, it is easier to find where the problem comes from in the parsed result than in the source.
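A sketch of that warn-and-replace policy (sanitize_utf8 is a hypothetical name; it relies on the fact that iterating a present-day Julia String yields invalid Chars for malformed sequences rather than throwing):

function sanitize_utf8(s::AbstractString)
    chars = collect(s)             # malformed sequences become invalid Chars
    if any(!isvalid, chars)
        @warn "invalid UTF-8 sequences replaced with U+FFFD"
    end
    return String([isvalid(c) ? c : '\ufffd' for c in chars])
end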

@pygy
Contributor

pygy commented Feb 27, 2014

Summary of the #5977 discussion (about readall(), but it applies here too):

  • Passing the encoding as a parameter to read* functions would sidestep the auto-detection problem. Default to returning a byte array in the absence of an encoding (this would cover @nalimilan's use case).
  • Define ByteString as the parent of (UTF8|ASCII)String, and deal with more exotic encodings (extended ASCII, UTF-16/UCS-2, EBCDIC ;-], and what not) in an independent package (Encodings.jl?). There you could define other ByteString subtypes.
  • By default, have the *String constructors validate their input. Add an optional named argument validate::Bool = true to the constructors to bypass validation of known-valid arrays (see the sketch below).
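A sketch of that last bullet (ValidatedString is a hypothetical type; the keyword never made it into Base):

struct ValidatedString
    data::Vector{UInt8}
    function ValidatedString(data::Vector{UInt8}; validate::Bool=true)
        if validate && !isvalid(String, data)
            throw(ArgumentError("invalid UTF-8 data"))
        end
        return new(data)
    end
end

bytes = Vector{UInt8}("déjà vu")
ValidatedString(bytes)                    # validated on construction
ValidatedString(bytes; validate=false)    # caller vouches; skip the O(n) scan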

@JeffBezanson
Member Author

I just want to point out that we currently incur the overhead of an encoding check on input to select ASCIIString or UTF8String, but we do not take advantage of this by writing faster UTF-8 code that can assume valid data. The majority of the code in getindex(s::UTF8String, i::Int) is for error checking.

@johnmyleswhite
Member

As a follow-up to my comment about moving between immutable strings and raw bytes, it would be good to make sure we can create strings without doing any encoding checks in the next iteration of string types.

@StefanKarpinski
Member

Ideally, we can merge ASCIIString and UTF8String and keep doing the runtime validity checks but make the valid case a very efficient "fast path". IIRC, we had a version of this some time ago, but there were a bunch of compiler optimizations needed to make it fast enough. Perhaps we have those at this point?


@JeffBezanson
Member Author

You just can't get the best performance without assuming valid data. For example, if you expect 2 continuation bytes you could just fetch the next 2 bytes and do the math, but to validate you'd have to check that 2 bytes are available and that they are actually continuation bytes. That necessarily means more branches. In fact we aren't even validating to this extent now. For example

julia> s = UTF8String(Uint8[0xc0,0x00,0x00,0x00,0x00])
"\Uffffff80\0\0\0"

julia> s[1]
'\Uffffff80'

julia> is_valid_utf8(s)
false

Fully validated UTF-8 access is not amenable to a fast path: validation needs extra checks for every byte, and that cost is unreasonable to pay on every character access.
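To make the branch count concrete, here is a sketch for a three-byte sequence (lead byte 0xe0 through 0xef); the names are hypothetical, and full validation would also have to reject overlong forms and surrogates:

# trusting decode: pure bit arithmetic, no branches
trusting(b1, b2, b3) = Char(((UInt32(b1) & 0x0f) << 12) |
                            ((UInt32(b2) & 0x3f) << 6) |
                            (UInt32(b3) & 0x3f))

# validating decode: one extra branch per continuation byte
function validating(b1, b2, b3)
    ((b2 & 0xc0) == 0x80 && (b3 & 0xc0) == 0x80) ||
        throw(ArgumentError("invalid continuation byte"))
    return trusting(b1, b2, b3)
end

trusting(0xe9, 0xa9, 0xac)   # '马'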

@JeffBezanson changed the title from "reading invalid text data" to "reading invalid text data, other encodings" on Aug 5, 2014
@nalimilan
Member

An update on this old issue for people who might end up here: all string types are going to be replaced with a single String type stored in UTF-8 (#14383). I've just created a package to convert text from/to different encodings: https://github.com/nalimilan/StringEncodings.jl. Finally, additional string types storing data in arbitrary encodings will likely be supported in another package (or maybe in StringEncodings.jl).
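For anyone landing here, the basic interface of that package looks like this (decode and encode are StringEncodings.jl's documented entry points; treat the exact forms as an assumption of this sketch):

using StringEncodings
s = decode([0x63, 0x61, 0x66, 0xe9], "ISO-8859-1")   # "café"
b = encode("café", "ISO-8859-1")                     # back to Latin-1 bytes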

So I'd say this issue can be closed.

@tkelman
Contributor

tkelman commented May 12, 2016

subsumed by #16107

@tkelman closed this as completed on May 12, 2016