reading invalid text data, other encodings #1792
We could have a byte-oriented readuntil.
UTF-8 is a variable-width encoding, though, right? If I'm trying to read up to a letter "l"/0x6C in a text that also happens to contain the simplified Chinese character ma3 ("horse"), which is 马/0x9A6C, that might be a problem if I'm not decoding it as UTF-8. Does there need to be a fast byte-oriented readuntil? (Note: I do not know any Chinese. This just happens to be a frequent example character on Language Log, and it also contains a byte in the low ASCII range.)
I really think we need a layered approach here where you can get raw bytes and then build strings from those. Searching for a character without specifying an encoding makes no sense. Auto-detecting the encoding by default seems like it would be sufficient. It would ideally be done lazily, in the sense that you could start with ASCII and promote a stream to Latin-1 or UTF-8 if and when a character that gives a clue is encountered (i.e. invalid UTF-8 implies Latin-1).
Yes, I was proposing a low-level byte-based readuntil.
The point of the above Chinese character digression was that reading up to an ASCII character from a UTF-8 stream cannot use a byte-based readuntil.
In fact I believe searching for ASCII characters in this way was one of the design goals of UTF-8.
@pao I think you might be confusing the unicode codepoint with the actual utf-8 encoding.
Apparently so. Nothing to see here, move along. Argh Unicode, so useful yet complicated.
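Concretely, 马 is codepoint U+9A6C, but its UTF-8 encoding is the three bytes 0xE9 0xA9 0xAC: every byte of a multi-byte UTF-8 sequence has the high bit set, so a search for the ASCII byte 0x6C ("l") can never match inside it. A quick illustration in present-day Julia (codeunits postdates this thread, but the byte layout is the same):

```julia
s = "马l"
collect(codeunits(s))              # [0xe9, 0xa9, 0xac, 0x6c]
findfirst(==(0x6c), codeunits(s))  # 4 -- only the real 'l' matches
```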
Come to think of it, is there a good reason not to have Latin-1 strings instead of ASCII ones?
ASCII is a sub-encoding of UTF-8 whereas Latin-1 is not. Thus, you can concatenate ASCII and UTF-8 data and the result is the UTF-8 encoding of the concatenated strings. With Latin-1 this doesn't work; instead you have to transcode the Latin-1 data first, expanding high bytes to two-byte UTF-8 sequences, and then concatenate. You may recall that we originally had a Latin-1 string type.
Ok, maybe we should just add a function to convert latin-1 to utf-8.
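The conversion is mechanical, since the 256 Latin-1 byte values map one-to-one onto the first 256 Unicode codepoints. A sketch in current Julia (latin1_to_utf8 is a hypothetical name, not an actual Base function):

```julia
# Expand each Latin-1 byte to its UTF-8 encoding: ASCII bytes pass
# through unchanged, high bytes (0x80-0xFF) become two-byte sequences.
function latin1_to_utf8(bytes::Vector{UInt8})
    out = UInt8[]
    for b in bytes
        if b < 0x80
            push!(out, b)
        else
            push!(out, 0xC0 | (b >> 6), 0x80 | (b & 0x3F))
        end
    end
    return String(out)
end

latin1_to_utf8(UInt8[0x63, 0x61, 0x66, 0xE9])  # "café"
```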
That doesn't really seem to solve the kinds of problems we've been seeing. Mostly we've been having problems with UTF-8 strings that contain invalid UTF-8 data (e.g. stray high Latin-1 bytes).
We already support reading until an int8; I will just reshuffle the relevant methods.
To make things really smooth, we probably need to do some degree of encoding autodetection. UTF-32 and UTF-16 are easy to autodetect; Latin-1 and UTF-8 are unfortunately both common and hard to distinguish. I was thinking that the way to go would be to start out with ASCII and promote the encoding whenever something that is unequivocally Latin-1 or UTF-8 occurs, which implies a design where the encoding of a stream can change on the fly.
Autodetection is not that easy; it requires a decent chunk of sample text and possibly some statistical analysis. We are currently auto-selecting ascii or utf-8 for whole lines, which can probably be extended to utf-16, utf-32, and latin-1. However, this does not work well for single characters. I'm not sure we want the state of a stream to change under you, such that the behavior of reading from it depends on what happened to come earlier.

We should have an official approach to corrupt utf-8. One possible answer is that you get an unspecified type of string whose raw data you can still access.
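For the easy cases, detection amounts to checking byte-order marks; the hard Latin-1-vs-UTF-8 case is where trial validation comes in. A minimal sketch (guess_encoding is a made-up name, and the fallback is exactly the "invalid UTF-8 implies Latin-1" rule from above, not a statistical detector):

```julia
function guess_encoding(bytes::Vector{UInt8})
    # BOM checks; UTF-32LE must be tested before UTF-16LE (same prefix).
    length(bytes) >= 4 && bytes[1:4] == UInt8[0xFF, 0xFE, 0x00, 0x00] && return :utf32le
    length(bytes) >= 4 && bytes[1:4] == UInt8[0x00, 0x00, 0xFE, 0xFF] && return :utf32be
    length(bytes) >= 2 && bytes[1:2] == UInt8[0xFF, 0xFE] && return :utf16le
    length(bytes) >= 2 && bytes[1:2] == UInt8[0xFE, 0xFF] && return :utf16be
    all(b -> b < 0x80, bytes) && return :ascii
    return isvalid(String, bytes) ? :utf8 : :latin1
end
```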
returns a string for Char delim, otherwise an array. readuntil(s, uint8('\n')) provides a way to read a line without an encoding check. Streams are encouraged to provide an efficient readuntil(s, ::Uint8); then reading until ASCII delimiters is fast, leaving the encoding logic to higher-level layers in io.jl and string.jl. For #1792.
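That interface survives essentially unchanged in current Julia (modulo spelling: uint8('\n') is now UInt8('\n')):

```julia
io = IOBuffer("héllo\nworld\n")
readuntil(io, 0x0a)   # Vector{UInt8}: raw bytes up to '\n', no encoding check
readuntil(io, '\n')   # "world" -- the Char-delimiter path returns a String
```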
My two cents from somebody who has worked with text data in French: it is quite common to get a few invalid UTF-8 characters, for example because a content provider got the encoding of one text wrong in the middle of a very large corpus. Rather than getting an error, a warning with automatic replacement of invalid sequences would be more practical. It also makes it easier to spot the incorrect characters after loading the data: for example, if you load texts stored in XML, it is easier to find where the problem comes from in the parsed result than in the raw source.
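In today's Julia this warn-and-replace strategy can be expressed directly, since string construction tolerates invalid data and iteration yields malformed sequences as invalid Chars (the byte values below are just a made-up corrupt input):

```julia
bad = String(UInt8[0x66, 0xFF, 0x6F])  # 'f', a stray invalid byte, 'o'
if !isvalid(bad)
    @warn "replacing invalid UTF-8 sequences"
    bad = join(isvalid(c) ? c : '\ufffd' for c in bad)
end
bad  # "f\ufffdo" -- U+FFFD marks exactly where the corruption was
```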
Summary of the #5977 discussion (about …)
I just want to point out that we currently incur the overhead of an encoding check on input to select between ASCIIString and UTF8String.
As a follow-up to my comment about moving between immutable strings and raw bytes, it would be good to make sure we can create strings without doing any encoding checks in the next iteration of string types.
Ideally, we can merge ASCIIString and UTF8String and keep doing the runtime check.
You just can't get the best performance without assuming valid data. For example, if you expect 2 continuation bytes you could just fetch the next 2 bytes and do the math, but to validate you'd have to check that 2 bytes are available and that they are actually continuation bytes. That necessarily means more branches. In fact we aren't even validating to this extent now.
Validating UTF-8 is not amenable to a fast path; you need extra checks for every byte. The complexity of the required procedure is simply unreasonable to do on every character access.
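To make the branch-count point concrete, here is a sketch of the two decoding disciplines, restricted to ASCII and two-byte sequences for brevity (both function names are invented; neither is Julia's actual decoder):

```julia
# Trusting decoder: assumes the data is valid UTF-8 and just does the math.
function decode_trusting(b::Vector{UInt8}, i::Int)
    lead = b[i]
    lead < 0x80 && return Char(lead), i + 1
    cp = (UInt32(lead & 0x1F) << 6) | (b[i+1] & 0x3F)
    return Char(cp), i + 2
end

# Validating decoder: same math, plus a branch per property checked.
function decode_validating(b::Vector{UInt8}, i::Int)
    lead = b[i]
    lead < 0x80 && return Char(lead), i + 1
    0xC2 <= lead <= 0xDF || error("invalid lead byte")
    i + 1 <= length(b) || error("truncated sequence")         # availability check
    (b[i+1] & 0xC0) == 0x80 || error("bad continuation byte") # pattern check
    cp = (UInt32(lead & 0x1F) << 6) | (b[i+1] & 0x3F)
    return Char(cp), i + 2
end
```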
An update on this old issue for people who might end up here: all string types are going to be replaced with a single String type, which allows invalid data and provides functions to validate it after the fact. So I'd say this issue can be closed.
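That is indeed what happened: since Julia 0.5 there is a single String type whose constructor never validates, with explicit validation available on demand:

```julia
bytes = UInt8[0x66, 0xC3, 0x28]  # 'f' then a truncated two-byte sequence
s = String(bytes)                # construction never throws (and takes
                                 # ownership of the byte vector)
isvalid(s)                       # false -- validity is checked only on request
```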
subsumed by #16107
We probably need to change 1 or 2 behaviors when reading invalid data or unknown encodings. There are two cases: reading a Char, and things like readuntil/readline.

readuntil can be reasonably defined in terms of bytes: just read everything until a certain value. This is good because then you can at least get the data without explicit support for every encoding. Currently we might return an invalid UTF8String, from which you can get the (unaltered) data. I don't know whether that is the best approach. Maybe there should be a lower-level routine that returns a byte array. We also need functions that do the same for different fixed-width encodings (16-bit, 32-bit).

Reading a Char I don't think can be done reasonably without knowing the encoding. The best immediate change I can think of is to give an error for invalid data while trying to read a UTF-8-encoded Char.
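In current Julia, the "invalid string from which you can get the unaltered data" option is roughly what shipped: readuntil with a byte delimiter plays the lower-level role, and codeunits recovers the raw bytes (copy is needed because the String constructor takes ownership of its byte vector):

```julia
io  = IOBuffer(UInt8[0x66, 0xFF, 0x0A, 0x62])  # 'f', invalid byte, '\n', 'b'
raw = readuntil(io, 0x0A)       # Vector{UInt8}: no decoding involved
s   = String(copy(raw))         # wraps the bytes without validating them
collect(codeunits(s)) == raw    # true -- the data round-trips unaltered
```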