Commit

Add missing rand(::AbstractRNG, ::Type{Char}) method
use simple rejection sampling over valid codepoint range
jakebolewski committed May 6, 2015
1 parent 58c16b6 commit 5986e58
Showing 2 changed files with 15 additions and 4 deletions.
13 changes: 12 additions & 1 deletion base/random.jl
@@ -248,10 +248,21 @@ end
 rand(r::MersenneTwister, ::Type{Int64}) = reinterpret(Int64, rand(r, UInt64))
 rand(r::MersenneTwister, ::Type{Int128}) = reinterpret(Int128, rand(r, UInt128))

-## random complex values
+## random Complex values

 rand{T<:Real}(r::AbstractRNG, ::Type{Complex{T}}) = complex(rand(r, T), rand(r, T))

+# random Char values
+# use simple rejection sampling over valid Char codepoint range
+function rand(r::AbstractRNG, ::Type{Char})
+    while true
+        c = rand(r, 0x00000000:0x0010fffd)
+        if is_valid_char(c)
+            return reinterpret(Char,c)
+        end
+    end
+end
+
 ## Arrays of random numbers

 rand(r::AbstractRNG, dims::Dims) = rand(r, Float64, dims)
6 changes: 3 additions & 3 deletions test/random.jl
@@ -10,8 +10,8 @@ srand(0); rand(); x = rand(384);
 @test -10 <= rand(-10:-5) <= -5
 @test -10 <= rand(-10:5) <= 5
 @test minimum([rand(Int32(1):Int32(7^7)) for i = 1:100000]) > 0
-@test(typeof(rand(false:true)) == Bool)
-
+@test(typeof(rand(false:true)) === Bool)
+@test(typeof(rand(Char)) === Char)
 @test length(randn(4, 5)) == 20
 @test length(bitrand(4, 5)) == 20

@@ -292,7 +292,7 @@ for rng in ([], [MersenneTwister()], [RandomDevice()])
     rand!(rng..., BitArray(5)) ::BitArray{1}
     rand!(rng..., BitArray(2, 3)) ::BitArray{2}

-    for T in [Base.IntTypes..., Bool, Float16, Float32, Float64]
+    for T in [Base.IntTypes..., Bool, Char, Float16, Float32, Float64]
         a0 = rand(rng..., T) ::T
         a1 = rand(rng..., T, 5) ::Vector{T}
         a2 = rand(rng..., T, 2, 3) ::Array{T, 2}

16 comments on commit 5986e58

@ScottPJones
Contributor

I believe the function should instead be:

function rand(r::AbstractRNG, ::Type{Char})
    v = rand(r, 0x00000000:0x0010f7ff)
    (v < 0xd800) ? Char(v) : Char(v+0x800)
end

Much simpler, and your code misses two valid code points... 0x10fffe and 0x10ffff.
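
To make the arithmetic concrete, here is a minimal sketch of that direct mapping, written against current Julia 1.x rather than the 2015 code base. The helper names `index_to_scalar` and `rand_scalar_char` are hypothetical and chosen only for illustration; the sketch assumes the goal is a uniform draw over every code point except the surrogates 0xd800-0xdfff.

```julia
using Random

# Hypothetical helpers for illustration; not part of Base or of this commit.

# Map an index in 0:0x10f7ff onto a code point, skipping the 0x800
# surrogate code points 0xd800:0xdfff.
index_to_scalar(v::UInt32) = v < 0xd800 ? v : v + 0x00000800

# Draw one Char from the non-surrogate code points, using the supplied RNG.
function rand_scalar_char(rng::AbstractRNG)
    v = rand(rng, 0x00000000:0x0010f7ff)
    return Char(index_to_scalar(v))
end

# Boundary checks: the surrogate gap is skipped and 0x10ffff is reachable.
@assert index_to_scalar(0x0000d7ff) == 0x0000d7ff
@assert index_to_scalar(0x0000d800) == 0x0000e000
@assert index_to_scalar(0x0010f7ff) == 0x0010ffff

rng = MersenneTwister(0)
cs = [rand_scalar_char(rng) for _ in 1:10_000]
@assert all(isvalid, cs)
```

Every index maps to exactly one allowed code point, so no draw is ever rejected and the loop in the committed version is unnecessary.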

@mschauer
Contributor

Well, http://www.unicode.org/faq/private_use.html says that 0x0010fffe and 0x0010ffff are not characters, so they should not be generated as random characters. Maybe there are more traps.

@ScottPJones
Contributor

No, it says they are perfectly valid code points, so Char() should allow them. There are 66 code points that are designated as noncharacters.

It just needs to disallow 0xd800-0xdfff, and >0x10ffff.

Q: Are noncharacters invalid in Unicode strings and UTFs?

A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.
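
As a rough illustration of the FAQ's point, the following sketch (Julia 1.x; `is_noncharacter` and `is_surrogate` are hypothetical helpers, not Base functions) counts the noncharacter code points and checks that a current Julia accepts each of them as a `Char`. It assumes the usual definition of noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```julia
# Hypothetical helpers for illustration; not part of Base.

# Noncharacters: U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane.
function is_noncharacter(cp::Integer)
    cp <= 0x10ffff || return false
    return (0xfdd0 <= cp <= 0xfdef) || ((cp & 0xfffe) == 0xfffe)
end

# Surrogates are the only code points below 0x110000 that are not scalar values.
is_surrogate(cp::Integer) = 0xd800 <= cp <= 0xdfff

all_code_points = 0x00000000:0x0010ffff
noncharacters = [cp for cp in all_code_points if is_noncharacter(cp)]

@assert length(noncharacters) == 66                    # 32 + 2 * 17
@assert !any(is_surrogate, noncharacters)              # none fall in the surrogate gap
@assert all(cp -> isvalid(Char, cp), noncharacters)    # accepted as Char in Julia 1.x
@assert count(!is_surrogate, all_code_points) == 0x10f800
```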

@mschauer
Contributor

I did not say that they are not code points. Anyway, what to choose depends entirely on whether you want random code points or random characters. For example, NaN is not a number; it is a completely valid content of a Float64, yet still not a reasonable return value for a function that returns random floating-point numbers.

@ScottPJones
Contributor

However, the comment in the code itself says: "valid Char codepoint range".
According to the Unicode standard, Julia's implementation of is_valid_char is itself incorrect.
I'll have to create an issue for that...

@jakebolewski
Member Author

Yes, it would be better to fix the function in utf8proc which is called by is_valid_char.

@mschauer
Contributor

I agree. It makes the name of the function is_valid_char a bit unfortunate though.

@ScottPJones
Contributor

@jakebolewski No, actually it would be better to do it the real Julian way, and just do it in Julia (as I've learned recently!) ;-)

@mschauer
Contributor

Thinking about it, do we want a sequence of random Unicode chars from this function to form valid Unicode? Then we should exclude combining characters.

@jiahao
Member

@jiahao jiahao commented on 5986e58 May 6, 2015

Ref: #11171

IIRC the original intention of this function was to sample things that can be turned into Unicode strings. In which case the thing we want is actually Unicode scalar values, not characters, and not code points.

@mschauer
Contributor

@jiahao OK, thank you. Please correct me if I am wrong: that means a single combining character is a scalar value, and a well-formed sequence can have a combining character as its first element.

@jiahao
Member

@jiahao jiahao commented on 5986e58 May 6, 2015

@mschauer that is my understanding also. There is no requirement that a valid Unicode string must be correctly decodable into a sequence of characters, only that it must be decodable into a sequence of Unicode scalar values. Otherwise you couldn't concatenate two Unicode strings like "a" and "¨" (combining umlaut) to produce "ä".
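
A small sketch of that concatenation in Julia 1.x, assuming the standard `Unicode` library and using U+0308 COMBINING DIAERESIS as the combining mark (the variable names are only illustrative):

```julia
using Unicode

base = "a"
diaeresis = "\u0308"          # U+0308 COMBINING DIAERESIS, a string on its own
s = base * diaeresis

# The lone combining mark is a valid string holding one scalar value.
@assert isvalid(diaeresis)
@assert length(diaeresis) == 1

# The concatenation holds two scalar values but a single grapheme cluster.
@assert length(s) == 2
@assert length(collect(Unicode.graphemes(s))) == 1

# NFC normalization composes the pair into the precomposed U+00E4.
@assert Unicode.normalize(s, :NFC) == "\u00e4"
```

The string stays valid before and after concatenation; only the grouping into grapheme clusters changes.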

@mschauer
Contributor

After reading http://unicode.org/faq/char_combmark.html I think one can make an even stronger statement: COMBINING DIAERESIS on its own has the grapheme ¨, and the combination "a"*"¨" has the grapheme ä. So concatenating sequences only changes the graphemes.

@jiahao
Member

@jiahao jiahao commented on 5986e58 May 6, 2015

@mschauer you're addressing the concept of a grapheme cluster, which is also treated in Chapter 3 of the standard. It's very absorbing reading; I highly recommend it.

@mschauer
Contributor

@jiahao That is somehow the funniest statement I have heard today, and you are right. :-)

@mschauer
Contributor

And coming back to the issue, this means that rand(::AbstractRNG, ::Type{Char}) should indeed return a random Unicode scalar value.
