Reads of undef memory must not cause the behavior to be undefined in general. #30500
/cc @rust-lang/lang |
I'm not sure what you mean here; the fact that rustc sometimes generates a call to memcpy is mostly irrelevant to the semantics of Rust code.
The LLVM |
|
The reason for the current rule seems to be: "I don't know how to formulate the real rules so I'll simply disallow it completely." |
There isn't any fundamental need for it to be legal to write memcpy in Rust... it's part of the runtime. Granted, it would be convenient in some cases. Maybe we can add a special-case for "copying" an undef
You're not looking at this at the right level. "add" is opaque; in theory, it could involve indexing into an array using the values of the operands, which could crash the program if Anyway, trying to make promises about how exactly arithmetic is implemented leads down a path which isn't really productive. |
Whew
That's the opposite of what should be done. The formulation must be more abstract so that everything legal can be done with undef while still disallowing everything that causes the behavior to be undefined. Maybe one even has to go so far as to add a primitive type of size one that cannot be interpreted as any other type (without transmute) and has exactly one value that spans all u8 values. One might interpret it as a one-byte type with one byte of padding, but it's not really padding.
I see that adding an example did more harm than good. |
It seems like spec'ing the example of addition/bitmasks would also require expanding the notion of "undefined value" to be at the bit level, and for the language to have some idea about results of (special cases of) operators. The latter seems like a rather open ended space, with very complicated properties encodable. This isn't a blocker, but it does mean touching this requires some care. (Byte-level undef would work for that specific example, but it seems restrictive---what if |
Does the rust manual define undef at all? IIRC it only links to the llvm manual which already talks about undef at the bit level.
The example is not important. It's just supposed to show that, at the llvm level, working with undef can lead to definite results and doesn't necessarily lead to undef propagation. Actually using undef values in rust code for anything remotely complicated can easily lead to UB because of undef propagation. However, there is no reason for the manual to claim that reading undef always leads to UB which is much stricter than what llvm requires. |
Rust isn't LLVM, and we don't necessarily want to make every guarantee that it does. I'm personally not that happy that we defer to LLVM for many definitions for convenience, and I expect this to not be the case in the future e.g. if an actual spec is written. (Feel free to read it as if I said "introducing a more formal Rust undef, which is tracked at the bit level" in place of "expanding the notion ... bit level".) On the point of undefined values, you're right that we just link to LLVM's definition of undef, but we do so in the context of reading undef memory, which doesn't say anything about an in-register value as we'd have for arithmetic, so even the most pedantic reading is vague. Also, I don't recall any team discussion featuring undef values where anything other than the whole value was considered undef. Summary: this is under-spec'd and the existing underlying/assumed sense of this area at the Rust level is almost certainly not for individual bits. Of course, you're also right that using LLVM's undef can lead to definite results even without tracking bits (e.g. |
You're already de facto guaranteeing the current behavior by having it work. Significant changes cannot be made without silently (!) breaking code, which is the complete opposite of stability. Even if you had reliable normative information, undefined operations that have behaved reliably for some time cannot always be made to behave differently without causing many problems (e.g. signed integer overflow in C). But there is no reliable normative information and thus people have to rely on what works in the current implementation for just about everything. E.g. the only official information about the behavior of
This doesn't even guarantee that the returned value has anything to do with the input value. Precisely because transmute is completely unspecified, the current implementation must be treated as normative. The same applies to |
For example, the following is discouraged by a lint but does not cause any problems:

    fn f(&self) -> &Self;
    fn f_mut(&mut self) -> &mut Self {
        unsafe { transmute(self.f()) }
    }

At the same time there is other, unreliable information flying around that says that transmuting The lack of any kind of information regarding |
I strongly disagree with this. We should be seeking to specify such things, not just accepting whatever random behavior happens to work in these corners. |
There's a difference between what we specify and what we allow LLVM to assume. The reason there is a lint about transmuting We still have not really decided how much "instant aliasing death" is a thing - @thestinger preferred to have access-based aliasing rules and I think he's got a point - but we didn't specify that it was not a thing. On the
There might very well be some other specification that defines some unspecified behaviour (either to something well-defined or undefined). For example, system calls are specified by your favourite OS's documentation. At this moment, we have no plans to publicly specify everything that is needed by a stable |
This same reasoning can be used to argue that, e.g., we should never change our sorting algorithm, because it may invoke the comparator in a different order, and so forth. We've also made it clear that various low-level details are expected to change, and that authors of unsafe code (in particular) will need to track the language as it evolves. That said, we should definitely consider "common practice" when deciding what kinds of things are undefined behavior. This is only partially because of existing code -- what I am most concerned about is just that if the rules are too complex and abstract (that is, too divorced from some abstract model of how the machine operates), people won't be able to keep them in their head, and so they will write nonconforming code that does surprising things when optimized. From what I can see, C has this problem in spades. Infinite loops, TBAA, etc. all make it surprisingly hard to write "correct" C code that does anything clever. But of course people write all kinds of clever things in C, many of which are compiler issues waiting to happen. I think @mahkoh has a point that it would be nice to affirm that particular idioms (e.g., a naively written memcpy that "seems right") will work without leading to undefined behavior. I'm just not sure if that's an urgent priority: it's a rather complex equation, since we must also consider what LLVM will do (and to what extent we can control that), and so forth, and we don't want to wind up guaranteeing too much. Put another way, I am sympathetic with the aims of this RFC, but I also wonder if it would be better to try to tackle the problem of "stabilizing" unsafe code patterns in a more wholesale fashion, rather than going at it piecemeal. |
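To make the "naively written memcpy" idiom concrete, here is a minimal sketch of a byte-wise copy over a struct that contains padding. It is an illustration only, not code from this thread: it leans on `std::mem::MaybeUninit`, which was stabilized well after this discussion, and the names `Padded`, `copy_bytes`, and `copy_padded` are hypothetical. The assumption is that moving bytes around as `MaybeUninit<u8>` never asserts that a padding byte is initialized, which sidesteps the question of reading undef as a plain `u8`.

```rust
use std::mem::{size_of, MaybeUninit};

// Hypothetical example type. With repr(C), one padding byte sits
// between `a` and `b` so that `b` is 2-byte aligned.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Padded {
    a: u8,
    b: u16,
}

/// Naive byte-by-byte copy that treats every byte as possibly uninitialized.
///
/// # Safety
/// `src` must be readable and `dst` writable for `len` bytes, and the two
/// ranges must not overlap.
unsafe fn copy_bytes(dst: *mut MaybeUninit<u8>, src: *const MaybeUninit<u8>, len: usize) {
    for i in 0..len {
        // A `MaybeUninit<u8>` may hold an uninitialized byte, so this read
        // makes no claim about the contents of padding.
        unsafe { dst.add(i).write(src.add(i).read()) };
    }
}

/// Copies a `Padded` value through its raw bytes, padding included.
fn copy_padded(src: &Padded) -> Padded {
    let mut out = MaybeUninit::<Padded>::uninit();
    unsafe {
        copy_bytes(
            out.as_mut_ptr().cast::<MaybeUninit<u8>>(),
            (src as *const Padded).cast::<MaybeUninit<u8>>(),
            size_of::<Padded>(),
        );
        // Every non-padding byte of `out` now comes from a valid `Padded`.
        out.assume_init()
    }
}
```

Calling `copy_padded(&Padded { a: 1, b: 0xab00 })` should yield a value equal to its input. Whether the same loop written over plain `u8` is acceptable is exactly the question this issue raises, so the sketch should not be read as settling it.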
There is no lint against transmuting
There are many special cases for char but apart from that I don't recall anything particularly ugly. It would certainly be better if memcpy did not have to be written with u8's and then rely on LLVM to optimize it.
I disagree with your definitions. Here are mine:
With these definitions, undefined behavior is what you called unspecified behavior. I think the C++11 standard agrees with my definition:
There is a significant difference between undefined behavior and unspecified behavior so we have to agree on what we're talking about.
I don't think anything in this issue is restricted to code in a libcore. In fact, libcore doesn't contain a memcpy so I'm not sure how libcore is related to this issue. A memcpy might be written in many situations: when you write a kernel; when you need a particularly optimized memcpy; when you need a memcpy that can be inlined; etc. And transmutes are certainly used in lots of code.
I don't recall this and breaking random unsafe code seems to go completely against the rest of your stability guarantees. Please link to the text where you said this. |
More specifically: I'll be greatly surprised if you've actually said that authors of unsafe code must track the language or else their working code might break without a compiler warning or error, which is what this issue is about. |
@mahkoh I believe https://github.com/rust-lang/rfcs/blob/master/text/1122-language-semver.md#underspecified-language-semantics is the relevant bit. |
That there is no lint against something only means that there is no lint against it, but we may be forced to grandfather some way for The "memcpy hack" is basically that the representation of types is somehow both undefined and well-defined at the same moment. I don't really want to have that hack in Rust.
Compiler writers have traditionally taken "imposes no requirements" to mean that they are allowed to make the program do whatever they want in that case, which is basically equivalent to being allowed to assume that it does not happen (because if they assume wrong, something happens, which satisfies the empty set of requirements imposed).
In that case you would want to write your
That is certainly a very big problem. C's strict aliasing rules are a pretty similar rat's nest, but we need to do something to get out of it.
LLVM can already randomly break unsafe code by becoming smarter about exploiting some UB. We only reserve the right to do similar things on our side. |
Why are you forced to keep this working but are free to break
I'm not sure what you're going on about some hack. There is no hack except for what I already mentioned regarding chars.
I've not said anything contradicting this. The point was that what you categorize under "unspecified behavior" is already "undefined behavior" and that for something to be unspecified it has to be explicitly mentioned in the spec. I realize that this might be confusing so let me refer you again to the definitions of those terms in the C++11 standard.
At the same time you're telling people that matching on an empty enum is the official way to get an llvm unreachable instruction. Who made it official? Can you point me to the official documentation containing this? If anything, this is even less valid than writing your favorite memcpy in rust code.
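For readers unfamiliar with the idiom being disputed, this is roughly what "matching on an empty enum" looks like. It is a sketch with hypothetical names (`Void`, `absurd`, `unwrap_infallible`), and it only shows the language-level behavior: the match is exhaustive because the type has no values. Whether rustc lowers it to an LLVM `unreachable` instruction is a code generation detail that, as far as this thread establishes, is not documented anywhere.

```rust
/// An uninhabited type: it has no variants, so no value of it can exist.
enum Void {}

/// An empty match on `Void` is exhaustive, so this body can never run.
fn absurd(v: Void) -> ! {
    match v {}
}

/// Unwraps a `Result` whose error type is uninhabited, with no panic path.
fn unwrap_infallible<T>(r: Result<T, Void>) -> T {
    match r {
        Ok(t) => t,
        // This arm type-checks, but no `Err` value can ever be constructed.
        Err(e) => absurd(e),
    }
}
```

This version is entirely safe code. Producing a `Void` value through `transmute` or similar would be the point at which undefined behavior enters.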
Maybe people should just write them in assembly or LLVM IR.
Neither LLVM nor Rust use TBAA which is the main source of UB related to C's aliasing rules. How is this in any way related to the current discussion?
So you only reserve the right to break just about everything because just about everything is UB in rust (see above). LLVM can do this because they actually have a decent amount of documentation allowing people to write code without having to rely on UB. |
"grandfather" = be required to figure out some way to not break it because there are already programs using it and we don't want to release Rust 2.0. Maybe we will be forced to grandfather
C requires that every data structure have a representation as an array of integers (characters) in a round-trippable way, while making that representation basically "undefined" in many ways. That frustrates "symbolic" implementations. I would prefer that Rust's specification be implementable symbolically (note that this does not mean that all the code in
The C specification tries hard not to have any things that are not defined anywhere. On the other hand, the precise sequence of assembly instructions emitted by a C compiler is not defined in any place, but saying that it is something the compiler is allowed to assume makes no sense. When we improve our spec, we should try to make sure that all cases of "in that other case, behaviour is unspecified" are explicitly stated.
Rust's codegen is quite explicitly unspecified. Even intrinsics are basically "emit the designated instruction, along with all necessary wrappers", so specifying that something lowers to exactly an
The issues caused by For example, you can write

    fn f_mut(&mut self) -> &mut Self {
        unsafe { &mut *(self.f() as *const Self as *mut Self) }
    }

At least, that is supposed to be non-UB.
Rust has lifetime-based alias analysis, which has the same "utter the right incantations to guard against the evil optimizer" problems as TBAA.
Rust's specification is in a very sorry state, with many things left unspecified. Incidentally, what is important for future compatibility is not the specification but rather the stability guarantee, which explicitly says that underspecified areas are not stable and can change between releases. This means that code using these underspecified areas (including type punning) is unfortunately subject to breakage between releases. We try to avoid causing silent breakage, but we prefer that our users will be careful around these areas. |
Seems like a reasonable idea. Write a spec, break all the unspecified things you want (but not more), release 2.0. It's not like an increase of the major has to break lots of things. I'd be fine with it breaking unsafe code as long as I get a real spec in return.
It seems that if you keep every transmute of
I see what you mean but I don't consider this a hack.
Feel free to mention your concerns in #30407. While your idea is appealing on a theoretical level, I don't think this is realistic for a systems language, making it harder to write low level code (allocators, kernels, etc.) in rust. Calling stable rust a systems language is already questionable (it fails the simple test that a systems language can, theoretically, compile itself: stable rust requires language items, stable rust will never be able to compile language items; that is, a stable rust compiler cannot even theoretically be a self-hosting compiler), and this idea makes even nightly rust less systems-y. But, like I said, I see the value of your idea and I think a theoretical spec could very well keep the representation completely unspecified. But at the same time, the rustc documentation has to extend said spec to specify parts of the representation. Code that relies on such details is then of course not portable between implementations.
I'm not sure what you're saying here. Of course the compiler is allowed to assume that the sequence of assembly instructions is not defined. Otherwise it could not perform any optimizations. The C standard describes the behavior of the abstract machine. An implementation is allowed to handle the details in any way it wants as long as the observable behavior agrees with the one described in the standard.
I think there isn't really a difference between "official" and "sanctioned". As long as it's not written down somewhere, it's no more than hearsay. If such a thing has actually been discussed and agreed on, then write it down where everyone can look it up so that we can properly language lawyer once you break it.
Now we're getting somewhere! I assume that by "safely" you mean that your way is "sanctioned"? If so then I'm surprised because one would think that your way is more dangerous than the transmute.
I'd assume that, at the marked point, the borrow has been "released" and that
we've created two live mut pointers to the same address and that the second reference has an unbounded lifetime. The transmute version doesn't seem to have this problem since it goes directly from Edit: See also #30424 which is closely related.
There seem to be lots of things related to the interaction between pointers and references that are completely unspecified. It's one of the things mentioned in @aturon's link.
The text linked by @aturon uses lots of qualifiers to restrict this freedom. And references in particular are heavily specified by the following line in the documentation:
Which links to LLVM's docs which have lots of text describing noalias. |
We will find some reasonable semantics. We should try not to break code in practice, and to allow an upgrade path from what we break. Under that constraint, we should try to have the semantics as clear as possible.
Clearly we need a "low-level Rust" specification in addition to the "high-level Rust" specification. Obviously we need them - our high-level specification does not talk about ABIs at all. However, I don't see much value in allowing the C standard
rustc does not require any lang-items. I don't see how this situation is qualitatively different from libc using linked assembly files for various system call stubs.
It is basically at the level of official hearsay. We are not willing to document performance characteristics at any level beyond that. Optimizers, both ours and LLVM's, can generate whatever code they feel like as long as it functions correctly. We try to make them generate fast code for things people write, especially the "officially sanctioned" ways, but we are not willing to promise anything. As an analogy, |
Is there a point to all of this? It just seems like an excuse to complain about things. Ultimately, in the absence of an actual spec, the only thing we can go on is common sense and current behaviour. The corners are where common sense fails and current behaviour only works via luck. However, until we get a spec, there's no point in arguing about stuff like this. I'm in favour of just closing this issue unless some actionable issue is presented. We have other channels for this kind of discussion. |
I think we should have some organized place for tracking the Rust memory model mess. |
As @Aatch says, there's nothing really actionable here: spec-ing this sort of thing is the realm of an RFC, since there's design decisions to make and tradeoffs to be considered (e.g. http://www.playingwithpointers.com/problem-with-undef.html). Therefore, I'm closing. |
@huonw would you mind opening an RFC issue for this and linking it here? As Ariel said, we should have some central place to discuss this. I'd do it, but I'm on a mobile device for the next few hours. |
Okay, I opened an RFC issue for Rust needing a memory model; cc rust-lang/rfcs#1447 |
This cannot be because memcpy will read padding bytes which are undef. It's also not true in practice because in
z will be 0xab00 and not undef.