Range of the Char #317
Comments
I was expecting that
Good catch! Strings in Crystal are always represented in memory as UTF-8; that's why the character 0xFF is encoded as two bytes.
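For example, a quick check (a sketch against current Crystal, assuming Char#bytes returns the UTF-8 bytes):

    # 0xFF is above 0x7F, so UTF-8 needs two bytes for it:
    puts 0xFF.chr.bytes.inspect # => [195, 191]
    # plain ASCII still takes a single byte:
    puts 0x41.chr.bytes.inspect # => [65]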
I think this line of CharReader has also been affected. But we may have to be a bit more specific by checking the first two bytes: if first < 0xf4 || first == 0xf4 && second < 0x90
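The reasoning behind that condition: the largest valid 4-byte sequence is F4 8F BF BF (U+10FFFF), so a 0xF4 lead byte is only valid when the second byte stays below 0x90. A hypothetical Crystal helper just to illustrate the bound (not CharReader's actual code):

    # Illustrates only the upper bound for 4-byte sequences; overlong and
    # continuation-byte checks would still be needed separately.
    def within_unicode_range?(first : UInt8, second : UInt8)
      first < 0xf4 || (first == 0xf4 && second < 0x90)
    end

    puts within_unicode_range?(0xf4_u8, 0x8f_u8) # => true  (up to U+10FFFF)
    puts within_unicode_range?(0xf4_u8, 0x90_u8) # => false (beyond U+10FFFF)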
👍
@yous You are right. We should actually copy the algorithm of read_code_point_from_utf8() from Wikipedia:

    #include <stdio.h>

    /* Reads one UTF-8 code point from stdin; on an invalid sequence it
       pushes back the offending bytes and returns code_unit1 + 0xDC00. */
    unsigned read_code_point_from_utf8(void)
    {
        int code_unit1, code_unit2, code_unit3, code_unit4;

        code_unit1 = getchar();
        if (code_unit1 < 0x80) {
            return code_unit1;
        } else if (code_unit1 < 0xC2) {
            /* continuation or overlong 2-byte sequence */
            goto ERROR1;
        } else if (code_unit1 < 0xE0) {
            /* 2-byte sequence */
            code_unit2 = getchar();
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            return (code_unit1 << 6) + code_unit2 - 0x3080;
        } else if (code_unit1 < 0xF0) {
            /* 3-byte sequence */
            code_unit2 = getchar();
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; /* overlong */
            code_unit3 = getchar();
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
        } else if (code_unit1 < 0xF5) {
            /* 4-byte sequence */
            code_unit2 = getchar();
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xF0 && code_unit2 < 0x90) goto ERROR2; /* overlong */
            if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; /* > U+10FFFF */
            code_unit3 = getchar();
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            code_unit4 = getchar();
            if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
            return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
        } else {
            /* > U+10FFFF */
            goto ERROR1;
        }

    ERROR4:
        ungetc(code_unit4, stdin);
    ERROR3:
        ungetc(code_unit3, stdin);
    ERROR2:
        ungetc(code_unit2, stdin);
    ERROR1:
        return code_unit1 + 0xDC00;
    }

I think it catches more invalid sequences than what we have right now (and I'd like to have a spec for each invalid case).
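On specs: here is a sketch of what they could look like, with a hypothetical decode_first_char helper standing in for whatever API the implementation exposes (the name, the minimal stand-in body, and the error type are illustrative only):

    require "spec"

    # A minimal stand-in decoder just so the specs run; it implements only
    # the three checks exercised below (hypothetical, not the real CharReader).
    def decode_first_char(bytes : Bytes) : Char
      first = bytes[0]
      if first >= 0x80 && first < 0xC2
        raise ArgumentError.new("continuation byte or overlong 2-byte sequence")
      end
      if first >= 0xF0
        if first > 0xF4 || (first == 0xF4 && bytes[1] >= 0x90)
          raise ArgumentError.new("beyond U+10FFFF")
        end
      end
      # (full decoding elided; enough for the cases below)
      '\uFFFD'
    end

    describe "UTF-8 decoding" do
      it "rejects an overlong 2-byte sequence" do
        expect_raises(ArgumentError) { decode_first_char(Bytes[0xC0, 0xAF]) }
      end

      it "rejects a lone continuation byte" do
        expect_raises(ArgumentError) { decode_first_char(Bytes[0x80]) }
      end

      it "rejects sequences beyond U+10FFFF" do
        expect_raises(ArgumentError) { decode_first_char(Bytes[0xF4, 0x90, 0x80, 0x80]) }
      end
    end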
@yous Could you take a look at the last commit to see if it's ok? It's more or less Wikipedia's code with some common code refactored. I see Wikipedia's code returns an invalid code point when it encounters an invalid byte sequence, instead of raising an exception; I wonder if that's what we really want to do. Another thing that we don't do very well is this:

    puts 0x1FFFF.chr # assumes UTF-8 encoding; works, but maybe should raise

But in Ruby:

    puts 0x110000.chr(Encoding::UTF_8)

gives:

    RangeError: invalid codepoint 0x110000 in UTF-8

Do you think we should raise as well?
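If we do decide to raise, the check itself is tiny. A minimal sketch of the range check under discussion (hypothetical, assuming ArgumentError and Int#unsafe_chr; not necessarily what the stdlib should use):

    # Hypothetical checked conversion: reject anything beyond U+10FFFF.
    # (Surrogates D800-DFFF would need a separate check.)
    def checked_chr(codepoint : Int32) : Char
      if codepoint < 0 || codepoint > 0x10FFFF
        raise ArgumentError.new("0x#{codepoint.to_s(16)} out of char range")
      end
      codepoint.unsafe_chr
    end

    puts checked_chr(0x10FFFF) # prints the last valid codepoint
    checked_chr(0x110000)      # raises ArgumentError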
@asterite I think the commit is okay. For the first thing, seeing the Codepage layout, I don't think we need to allow encoding a 7-bit ASCII value using 2 bytes, so raising makes sense; see Overlong encodings for further details.
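A worked example of why overlong forms matter, using the same arithmetic as the decoder above: the two-byte sequence C0 AF would decode to 0x2F, i.e. '/', giving a second spelling of the path separator. That is exactly what the code_unit1 < 0xC2 check rejects.

    # (0xC0 << 6) + 0xAF - 0x3080 = 0x30AF - 0x3080 = 0x2F, which is '/'
    puts ((0xC0 << 6) + 0xAF - 0x3080).to_s(16) # => "2f"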
Are we already raising an error for 0x1FFFF.chr(Encoding::UTF_8)? If you are thinking about passing the encoding to chr, in Ruby:

    >> 0x80.chr.encoding
    => #<Encoding:ASCII-8BIT>
    >> 0x80.chr.bytes
    => [128]
    >> 0x80.chr(Encoding::UTF_8).encoding
    => #<Encoding:UTF-8>
    >> 0x80.chr(Encoding::UTF_8).bytes
    => [194, 128]
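For comparison, Crystal's Char carries no per-value encoding, so the equivalent check (a sketch against current Crystal) always yields the UTF-8 form:

    puts 0x80.chr.bytes.inspect # => [194, 128], always UTF-8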
Thanks for the detailed answer!

About your last comment: yes, a Char always represents a codepoint in the UTF-8 encoding, because when you ask for its bytes you get its UTF-8 representation.
Seeing this line, a Char can have its ord be at most 0x1FFFFF. But for UTF-8 the range ends at 0x10FFFF, per RFC 3629: originally it ended at 0x1FFFFF, but the range was restricted in November 2003. The Invalid byte sequences section indicates that sequences decoding to values above U+10FFFF are invalid. Also, the write_utf8 method in its sample code is the same as our each_byte, but it goes up to 0x10FFFF, not 0x1FFFFF. Does Char support UTF-16 for 0x10FFFF < ord <= 0x1FFFFF, or is this a mistake?
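As a worked check of that upper bound: U+10FFFF is exactly the largest value a 4-byte sequence may carry under RFC 3629, and its encoding starts with the F4 8F boundary discussed above (a sketch, assuming Char#each_byte yields the UTF-8 bytes):

    # U+10FFFF = 100 001111 111111 111111 in binary, spread over four bytes:
    # 11110_100 10_001111 10_111111 10_111111 -> F4 8F BF BF
    '\u{10FFFF}'.each_byte { |byte| printf("%02X ", byte) } # => F4 8F BF BF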