-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add missing rand(::AbstractRNG, ::Type{Char}) method
use simple rejection sampling over valid codepoint range
- Loading branch information
1 parent
58c16b6
commit 5986e58
Showing
2 changed files
with
15 additions
and
4 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the function should be instead:
Much simpler, and your code misses two valid code points... 0x10fffe and 0x10ffff.
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, http://www.unicode.org/faq/private_use.html says that
0x0010f7ff
and0x0010f7fd
are not characters, so should not be generated as random characters. Maybe there are more traps.5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it says they are perfectly valid code points... so Char() should allow them. There are 66 characters that are considered “non character” code points.
It just needs to disallow 0xd800-0xdfff, and >0x10ffff.
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not say that they are not codepoints. Anyway, what to choose depends entirely on if you want random codepoints or random characters. For example
nan
is not a number, and is a completely valid contents of a Float64 and still not a reasonable return value for a function giving random floating point numbers.5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the documentation of the code itself says the following: >valid Char codepoint range.
According to the Unicode standard, Julia's implementation of is_valid_char is itself incorrect.
I'll have to create an issue for that...
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would be better to fix the function in
utf8proc
which is called byis_valid_char
.5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It makes the name of the function
is_valid_char
a bit unfortunate though.5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakebolewski No, actually it would be better to do it the real Julian way, and just do it in Julia (as I've learned recently!) ;-)
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it, do we want that a sequence of random unicode chars from this function form valid unicode? Then we should exclude combining characters.
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ref: #11171
IIRC the original intention of this function was to sample things that can be turned into Unicode strings. In which case the thing we want is actually Unicode scalar values, not characters, and not code points.
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiahao Ok, thank you. Please correct: That means that a single combining character is a scalar and a wellformed sequence can have a combining character as first letter.
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mschauer that is my understanding also. There is no requirement that a valid Unicode string must be correctly decodable into a sequence of characters, only that it must be decodable into a sequence of Unicode scalar values. Otherwise you couldn't concatenate two Unicode strings like "a" and "¨" (combining umlaut) to produce "ä".
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading http://unicode.org/faq/char_combmark.html I think one can even stronger say that COMBINING DIAERESIS has a grapheme
¨
. The combination"a"*"¨"
hasä
as grapheme. So that concatenating sequences only changes the graphemes.5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mschauer you're addressing the concept of a grapheme cluster, which is also treated in Chapter 3 of the standard. It's very absorbing reading; I highly recommend it.
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiahao That is somehow the funniest statement I heard today, and you are right. :-)
5986e58
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And coming back to the issue this means that
rand(::AbstractRNG, ::Type{Char})
should indeed return a random unicode scalar value.