Add Char * String and Char * Char #22532

Merged
merged 10 commits into from
Jul 11, 2017

adamslc
Contributor

@adamslc adamslc commented Jun 25, 2017

Solves #22512. I am definitely a GitHub novice, so let me know if I made a stupid mistake somewhere...

@KristofferC
Member

Perhaps you could also add some tests of the kind 'a' * "b" * 'c' etc?
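
Tests of the kind requested here could look roughly like the following sketch. It assumes a Julia version that already includes this change (0.7 or later) and uses the `Test` standard library; the testset name is made up for illustration and is not taken from the PR.

```julia
using Test

# Assumption: Julia 0.7 or later, where Char * String and Char * Char are defined.
@testset "Char and String concatenation" begin
    @test 'a' * 'b' == "ab"
    @test 'a' * "b" * 'c' == "abc"
    @test "a" * 'b' * 'c' == "abc"
    @test 'a' * 'b' * "c" == "abc"
    @test ('a' * 'b') isa String   # the result is always a String, never a Char
end
```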

@kshyatt kshyatt added the strings "Strings!" label Jun 25, 2017
Luke Adams added 2 commits June 25, 2017 11:22
@adamslc
Contributor Author

adamslc commented Jun 25, 2017

I've added a few more tests. I tried to squash the extra commit, but I just made a horrible mess. How should I do that?

@KristofferC
Member

It's OK, it can be squashed on merge with the GitHub UI.

@@ -68,6 +68,9 @@ julia> "Hello " * "world"
```
"""
(*)(s1::AbstractString, ss::AbstractString...) = string(s1, ss...)
+(*)(c::Char, s::AbstractString) = string(c, s)
+(*)(s::AbstractString, c::Char) = string(s, c)
+(*)(c1::Char, c2::Char) = string(c1, c2)
Contributor

@pabloferz pabloferz Jun 25, 2017

We might subsume all these with

(*)(s::AbstractString, ss::Union{Char,AbstractString}...) = string(s, ss...)
(*)(c::Char, ss::Union{Char,AbstractString}...) = string(c, ss...)

and take advantage of the string(::Union{Char,String}...) method.

Contributor Author

I reduced it further to one method (see next commit)

@andyferris
Member

andyferris commented Jun 25, 2017

I was wondering about concatenating characters to make this operation more complete, e.g. char * char and things like 'a' * 'b' * "c" vs "a" * 'b' * 'c'.

Sorry I misread :)

@andyferris
Member

PS ++ rules 😜

@tkelman
Contributor

tkelman commented Jun 26, 2017

The docstring should be updated here. It isn't formatted correctly on master: signatures should be indented, not backtick-fenced.

```
*(s::AbstractString, t::AbstractString)
```
*(s::Union{Char, AbstractString}, t::Union{Char, AbstractString})
Contributor

... on the second input

@adamslc
Contributor Author

adamslc commented Jun 26, 2017

Does this need anything else?

@pabloferz
Contributor

There was a timeout on Linux 32-bit, so I restarted the build (gist backed up here: https://gist.github.com/pabloferz/4cb263e0967ba9a5c3256bc53d3619ee)


-Concatenate strings. The `*` operator is an alias to this function.
+Concatenate strings and characters. The `*` operator is an alias to this function.
Member

"[...] and characters to a [`String`](@ref)." perhaps?

Also, this is the * function, so it's an alias for itself!?

Member

I would make it like this:

"""
    *(s::Union{AbstractString, Char}, t::Union{AbstractString, Char})

Concatenate strings and/or characters, producing a [`String`](@ref). This is equivalent
to calling the [`string`](@ref) function on the arguments.
"""

Concatenate strings and/or characters, producing a [`String`](@ref). This is equivalent
to calling the [`string`](@ref) function on the arguments.

# Examples

Contributor

delete the blank line

Member

I thought we had been putting blank lines between the headers and contents in docstrings?

Member

Ah, I guess not.

Contributor

We have recently been moving towards getting rid of them everywhere.

Member

K, thanks for the heads up

```
"""
-(*)(s1::AbstractString, ss::AbstractString...) = string(s1, ss...)
+(*)(s1::Union{Char, AbstractString}, ss::Union{Char, AbstractString}...) = string(s1, ss...)
Contributor

Aside from some of the linalg code, there typically isn't a space after the comma in Union. It is probably best to stay consistent within this file and similar ones.
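
To see how a single varargs method over `Union{Char,AbstractString}` covers every combination discussed in this thread, here is a minimal sketch. The name `concat` is hypothetical and is used only so that the example does not overwrite a method in Base; the body mirrors the one-line definition in the diff.

```julia
# `concat` is a hypothetical stand-in for `*`; it is not part of Base.
# The first argument is separate from the varargs so that at least one
# argument is always required.
concat(s1::Union{Char,AbstractString}, ss::Union{Char,AbstractString}...) = string(s1, ss...)

println(concat('a', 'b'))             # Char, Char
println(concat('a', "bc"))            # Char, String
println(concat("ab", 'c'))            # String, Char
println(concat('a', "b", 'c', "d"))   # any mix, any length
```

Every call returns a `String`, because `string` is what performs the concatenation.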

@StefanKarpinski
Member

Unless there are objections, I plan to merge this in 24 hours.

@@ -56,9 +56,12 @@ sizeof(s::AbstractString) = error("type $(typeof(s)) has no canonical binary rep
eltype(::Type{<:AbstractString}) = Char

"""
-*(s::Union{Char, AbstractString}, t::Union{Char, AbstractString}...)
+*(s::Union{AbstractString, Char}, t::Union{AbstractString, Char})
Contributor

Why were the three dots removed?

@ararslan
Member

Once the ... is added back to the docstring, LGTM.

@KristofferC
Member

Just add it back (with collaborator access to the branch) and merge?

[ci skip]
@ararslan
Member

Good idea, @KristofferC. Done.

@ararslan ararslan merged commit f98b857 into JuliaLang:master Jul 11, 2017
@ararslan
Member

Thanks for the contribution, @adamslc! Nice work here.

ararslan added a commit that referenced this pull request Jul 11, 2017
@stevengj
Member

stevengj commented Jul 12, 2017

Two problems:

  • Since it defines Char * Char = String, it should also define one(::Type{Char}) = "".

  • Should probably have a specialized Char^Integer method similar to the String^Integer method.

@ararslan
Member

Regarding one, isn't it typically assumed that one(::Type{T}) has type T? Defining that method for Char would break that, and AFAIK it would be the only exception. (There may be others that I don't know about though.)

@stevengj
Member

stevengj commented Jul 12, 2017

@ararslan, no, that is not correct. e.g. one for a dimensionful quantity returns a different type. (If you want the same type, you call oneunit, which isn't defined here.)

On the other hand, it is true that "" isn't really a multiplicative identity for Char, since "" * 'x' == "x", not 'x'. That makes me think we shouldn't define one after all.
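
The identity argument can be checked directly. This is a small sketch assuming Julia 0.7 or later, where `Char * String` is defined; it only illustrates the point made above and does not define `one` for `Char`.

```julia
s = "" * 'x'

println(s == "x")               # true: the result equals the String "x"
println(s isa String)           # true: concatenation always yields a String
println(s == 'x')               # false: a String is never equal to a Char

# For String, "" does behave as a multiplicative identity:
println(one(String) == "")      # true
println("" * "abc" == "abc")    # true
```

Since `"" * 'x'` changes the type of its operand, `""` cannot be an identity element for `Char` under `*`.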

@stevengj
Member

stevengj commented Jul 12, 2017

We should definitely have a specialized ^, however. The default one isn't type-stable for Char and is grossly inefficient for this type anyway.
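
The type-stability concern can be illustrated with a simplified power function. `naive_pow` below is a hypothetical stand-in written for this example, not Base's actual fallback: any generic power built on repeated `*` returns its input unchanged for exponent 1, so a `Char` comes back as a `Char`, while larger exponents produce a `String`.

```julia
# Hypothetical, simplified generic power via repeated multiplication.
function naive_pow(x, n::Integer)
    n >= 1 || throw(DomainError(n, "exponent must be at least 1"))
    y = x
    for _ in 2:n
        y *= x        # for a Char this turns y into a String
    end
    return y
end

println(typeof(naive_pow('a', 1)))   # Char
println(typeof(naive_pow('a', 3)))   # String
println(naive_pow('a', 3))           # aaa
```

A method specialized for `Char` avoids this by always returning a `String`, whatever the exponent.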

@musm
Contributor

musm commented Jul 12, 2017

@stevengj

The following specialized version of repeat, which ^ calls, seems to work fine.

function repeat(s::Char, r::Integer)
    r < 0 && throw(ArgumentError("can't repeat a char $r times"))
    out = _string_n(r)
    ccall(:memset, Ptr{Void}, (Ptr{UInt8}, Cint, Csize_t), out, s, r)
    return out
end

There isn't much speed improvement over repeat(s::String,r::Integer)

@stevengj
Member

stevengj commented Jul 12, 2017

(That only works for isascii(s). For non-ascii I would just call string(s)^r as a fallback.)

@musm
Contributor

musm commented Jul 12, 2017

Calling repeat(string(s), 3) would allocate and makes it about twice as slow.

@stevengj
Member

stevengj commented Jul 12, 2017

@musm, I understand that, but since char^integer and string^integer are almost exclusively used for ASCII chars (mainly for repeating spaces), I think it is fine to optimize mainly the ASCII case of char^integer and leave the non-ASCII case to a slower fallback for now.
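
The "fast ASCII path plus slower fallback" idea can be sketched without Base internals such as `_string_n` or a `memset` call. The function name `repeat_char` is hypothetical, and the sketch assumes Julia 1.x; it trades some speed for using only public API.

```julia
# Hypothetical helper: ASCII characters are one byte each, so the result is
# just that byte repeated; everything else goes through String repetition.
function repeat_char(c::Char, r::Integer)
    r < 0 && throw(ArgumentError("can't repeat a character $r times"))
    if isascii(c)
        return String(fill(UInt8(c), r))   # r copies of a single byte
    end
    return string(c)^r                     # slower fallback for non-ASCII
end

println(repeat_char(' ', 4) == "    ")
println(repeat_char('α', 3) == "ααα")
```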

@StefanKarpinski
Member

To implement an efficient character repeating operator, it's sufficient to figure out what 1-4 byte pattern the character produces in UTF-8 and then copy that as many times as the character needs to be repeated. Not entirely straightforward, but not crazy to implement either.
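
A straightforward, unoptimized version of that idea might look like the sketch below. It assumes Julia 1.x and obtains the 1-4 byte pattern from `codeunits(string(c))` rather than computing it by hand; `repeat_utf8` is a hypothetical name.

```julia
# Hypothetical helper: copy the character's UTF-8 byte pattern r times.
function repeat_utf8(c::Char, r::Integer)
    r < 0 && throw(ArgumentError("can't repeat a character $r times"))
    pattern = codeunits(string(c))        # the 1-4 bytes encoding c
    n = length(pattern)
    buf = Vector{UInt8}(undef, n * r)
    for i in 0:(r - 1)
        copyto!(buf, i * n + 1, pattern, 1, n)
    end
    return String(buf)
end

println(repeat_utf8('€', 3) == "€€€")
println(ncodeunits(repeat_utf8('€', 3)))   # 9, since '€' takes 3 bytes
```

This still allocates a temporary `String` for the pattern, so it shows the byte-copying approach, not the allocation-free variant.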

@ararslan
Member

Is there something relevant already implemented in utf8proc?

@stevengj
Member

Stefan, I know it's possible, but I don't think it is worth the trouble.

@StefanKarpinski
Member

Sure, it can always be done as an optimization some time in the future.

ararslan added a commit that referenced this pull request Jul 15, 2017
ararslan added a commit that referenced this pull request Jul 15, 2017
jeffwong pushed a commit to jeffwong/julia that referenced this pull request Jul 24, 2017
@musm
Contributor

musm commented Sep 20, 2017

A while back, @ScottPJones sent me the following version, which does not allocate. I don't think he has had the time to open a PR from his branch, so I am posting it here in the hope that someone opens a PR with the change.

function repeat(c::Char, r::Integer)
    r < 0 && throw(ArgumentError("can't repeat a character $r times"))
    r == 0 && return ""
    ch = UInt(c)
    if ch < 0x80
        out = Base._string_n(r)
        ccall(:memset, Ptr{Void}, (Ptr{UInt8}, Cint, Csize_t), out, c, r)
    elseif ch < 0x800
        out = _string_n(2r)
        p16 = reinterpret(Ptr{UInt16}, pointer(out))
        u16 = ((ch >> 0x6) | (ch & 0x3f) << 0x8) % UInt16 | 0x80c0
        @inbounds for i = 1:r
            unsafe_store!(p16, u16, i)
        end
    elseif ch < 0x10000
        (0xd800 ≤ ch ≤ 0xdfff) && throw(ArgumentError("invalid character 0x$(hex(ch))"))
        out = _string_n(3r)
        p = pointer(out)
        b1 = (ch >> 0xc) % UInt8 | 0xe0
        b2 = ((ch >> 0x6) & 0x3f) % UInt8 | 0x80
        b3 = (ch & 0x3f) % UInt8 | 0x80
        @inbounds for i = 1:r
            unsafe_store!(p, b1)
            unsafe_store!(p, b2, 2)
            unsafe_store!(p, b3, 3)
            p += 3
        end
    elseif ch < 0x110000
        out = _string_n(4r)
        p32 = reinterpret(Ptr{UInt32}, pointer(out))
        u32 = ((ch >> 0x12) | ((ch >> 0x4) & 0x03f00) |
            ((ch << 0xa) & 0x3f0000) | ((ch & 0x3f) << 0x18)) % UInt32 | 0x808080f0
        @inbounds for i = 1:r
            unsafe_store!(p32, u32)
            p32 += 4
        end
    else
        throw(ArgumentError("invalid character 0x$(hex(ch))"))
    end
    return out
end
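
The bit manipulation in the 2-byte and 3-byte branches can be checked against Julia's own UTF-8 encoding. The sketch below assumes Julia 1.x and a little-endian machine, which the `UInt16` store in the function above also relies on; the variable names are chosen for this example only.

```julia
# 2-byte case: 'é' is U+00E9.
ch = UInt32('é')
u16 = ((ch >> 0x6) | (ch & 0x3f) << 0x8) % UInt16 | 0x80c0
# On a little-endian machine the low byte of u16 is stored first.
two_bytes = [u16 % UInt8, (u16 >> 8) % UInt8]
println(two_bytes == codeunits("é"))

# 3-byte case: '€' is U+20AC.
ch = UInt32('€')
b1 = (ch >> 0xc) % UInt8 | 0xe0             # leading byte, 1110xxxx
b2 = ((ch >> 0x6) & 0x3f) % UInt8 | 0x80    # continuation byte, 10xxxxxx
b3 = (ch & 0x3f) % UInt8 | 0x80             # continuation byte, 10xxxxxx
three_bytes = [b1, b2, b3]
println(three_bytes == codeunits("€"))
```

Both comparisons should print `true`, which confirms that the packed constants `0x80c0` and `0xe0`/`0x80` place the UTF-8 marker bits where the encoding expects them.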

@StefanKarpinski
Member

Thanks, PR: #23787.
