Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String constructor truncates data. #32528

Open
MasonProtter opened this issue Jul 8, 2019 · 17 comments
Open

String constructor truncates data. #32528

MasonProtter opened this issue Jul 8, 2019 · 17 comments
Labels
speculative Whether the change will be implemented is speculative strings "Strings!"
Milestone

Comments

@MasonProtter
Copy link
Contributor

So a discussion came un on slack that this behaviour

julia> x = UInt8[73, 74]
2-element Array{UInt8,1}:
 0x49
 0x4a

julia> String(x)
"IJ"

julia> x
0-element Array{UInt8,1}

is fairly surprising.

One normally expects that function calls be explicit if they mutate data and I think cosntructors like String should be held to similar standards. I see there was some discussion in #26093, and the conclusion seemed to be that since there is no type String! then naming the constructor String! doesn't make sense, especially because (according to those more knowledgable than I) nearly 100% of uses of String want the memory stealing behaviour.

I think it is worth revisiting this issue. I'd argue that even though one usually wants memory stealing, it's best to always be explicit if that is happening, either through a keyword argument, e.g. String(x, steal=true), or through a different constructor name String!.

The current behaviour seems like a real footgun to me, even if it's convenient when you know what you're doing.

@ararslan ararslan added speculative Whether the change will be implemented is speculative strings "Strings!" labels Jul 8, 2019
@StefanKarpinski
Copy link
Member

There is the issue that changing the current stealing behavior is technically breaking. It may be that no one is relying on this behavior so that it could be considered a "minor change" but we'd have to test that out. I'm guessing that the most common case is that after a byte vector has been "stolen" it is never used again so the code won't care if it's truncated or not.

@KristofferC
Copy link
Member

String(take!(io)) is the de facto way of being the performant way of getting a string out of an io object (cf the docs for String):

When possible, the memory of v will be used without copying when the String object is created. This is guaranteed to be the case for byte vectors returned by take! on a writable IOBuffer and by calls to read(io, nb). This allows zero-copy conversion of I/O data to strings.

Breaking this guarantee is to me "breaking" even if the code still (eventually) produces the same answer.

@MasonProtter
Copy link
Contributor Author

I wouldn't be surprised or even necessarily upset if this was considered too big a change for a minor release (though if possible, I'd still prefer it in a minor release). However, I think this is confusing semantics that goes against standard naming conventions and should at minimum be considered for 2.0.

Telling people that mutating functions are marked by a ! and then not following it in seemingly arbitrary cases is worse than not having the convention at all since it lures one into a false sense of security when they read code.

@JeffBezanson
Copy link
Member

It's worth remembering what we did before, which was share data (if possible) but not truncate the vector. After all, truncating the vector is not necessary. Then you just need to tell people not to mutate the vector later. Frankly I liked that a bit better, since String(x) was indeed non-mutating, and mutating a vector after turning it into a string is really rare. Of course the objection was that if you did that, it would cause hard-to-find bugs. That was "solved" by making the behavior predictably bad every time so the potential problem is easier to catch. I get the reasoning, but that's still not a philosophy I fully agree with.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jul 8, 2019

There was also an intermediate time when this would sometimes truncate the byte array and sometimes not, depending on details of how the byte array was constructed. For 2.0, this would be a great use case for @Keno's freeze/thaw stuff since you could just say that String requires a frozen vector and have the array and the string safely share memory. If someone wanted to mutate the byte array afterwards, they'd have to thaw it, which would at that point create a copy.

@yurivish
Copy link
Contributor

yurivish commented Jul 8, 2019

How much of the surprise factor here would be cured if the function were called String! instead?

@ghost
Copy link

ghost commented Jul 8, 2019

This exact behavior once cost me an hour of debugging, before I finally figured out what was going on 😄. I would much prefer if String(::Vector{UInt8}) made a copy instead, with the explicit option to move and truncate the underlying bytes. This would technically be "merely" a performance regression, though I acknowledge that it may be big enough in practice to be considered breaking.

@NHDaly
Copy link
Member

NHDaly commented Jul 9, 2019

Would it be crazy to just deprecate (and eventually delete) the String(::Array) constructor, and have String! be the only string constructor for arrays of chars? This would continue the current behavior (never copying) and just amount to a rename to make the behavior clearer?

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jul 9, 2019

That's definitely a breaking change, so unfortunately, yes, it's "crazy" for a 1.x release. We could, however, leave String, change it to never truncate and introduce String! with the current behavior of String. That has the issue that it makes String(take!(io)) copying and therefore slow.

@NHDaly
Copy link
Member

NHDaly commented Jul 9, 2019

That's definitely a breaking change, so unfortunately, yes, it's "crazy" for a 1.x release.

Ah, yes, sorry, I was predicating that on saving this for 2.0, per @MasonProtter's point.

That has the issue that it makes String(take!(io)) copying and therefore slow

Yeah, that's what I was hoping to avoid by forcing everyone to do the replacement from String to String! for those cases.

@StefanKarpinski
Copy link
Member

If we're willing to wait until 2.0 to do this then yes, we can deprecate String for String! and then later reintroduce String to always make a copy. But for 2.0, as I said above, we may well have the ability to freeze and thaw arrays and require passing a frozen byte array to String.

@MasonProtter
Copy link
Contributor Author

One thing to keep in mind with this discussion is that people often automatically generate constructors via a where T and then do T(data).

On one hand, we don't want to suddenly make that pattern less performant when applied to strings, but on the other hand, I think it's fair to say that I would be pretty baffled if I wrote something like T(data) and suddenly my data disappeared. So the freeze-thaw thing (#31630) does kinda sound like the best of both worlds.

@vtjnash
Copy link
Member

vtjnash commented Jul 11, 2019

requires a frozen vector

In Keno's proposal you get that by making a copy of the vector. So we could also just use the existing copy function, if that's what we want. In the future (independent of #31630), in some cases, we can implement a copy-elision pass. It doesn't make much difference to the compiler if you call it copy-elision or freeze-elision—in this case they're probably much the same computation since the result escapes (aka, is returned from a function). In the string-builder case, the compiler already has a really hard time figuring out types (which are always immutable). Realistically, I'm not confident the compiler will be ever able to elide this particular copy in the cases we most care about.

@tpapp
Copy link
Contributor

tpapp commented Oct 15, 2019

Can this be marked for the 2.0 milestone so that it is not missed accidentally?

@jariji
Copy link
Contributor

jariji commented May 10, 2023

String(::Vector) can't be removed in 1.x, but what about deprecate without removing it, and introduce String!?

@tpapp
Copy link
Contributor

tpapp commented May 11, 2023

Could we pretend that (or implement it that way)

struct String
    utf8_codeunits::Vector{UInt8}
end

from which String(::Vector{UInt8}) just follows as the constructor?

Not being different from any other struct that wraps a mutable container, with all the relevant consequences.

@KristofferC
Copy link
Member

KristofferC commented May 11, 2023

That's how it used to be with the issue that any mutation to the input vector (which in this case you will still have a reference to) is basically undefined behavior since strings are assumed to be immutable. The alternative is unconditionally copying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
speculative Whether the change will be implemented is speculative strings "Strings!"
Projects
None yet
Development

No branches or pull requests

10 participants