-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define: Byte #181
Comments
"abyte" or "ambyte" as a short for "abstract (machine) byte" might be a term we could use instead. |
Considering just real hardware, it isn't uncommon to have to look up what kind of "byte" we are talking about. For example, whether a byte is 8, 16 (DSPs), or N bits wide, whether a byte can be 0xUU (e.g. Itanium), whether a byte can carry provenance or other information (e.g. CHERI), etc. depends on the machine. What matters for Rust programs is what a "byte" is in the Rust abstract machine, not what it is in the hardware the binary runs on. So I would just define "byte" with the meaning of it for the abstract machine, and then see if we also need to introduce a different term to specifically talk about "hardware bytes". Is there a part of the spec in which we actually need to talk specifically about "hardware bytes" differentiating them from "abstract machine bytes"? |
Itanium's NaT are not exactly like 0xUU, I think. But I get your point. The fact remains though that the vast majority of people, and plenty of online material, equates "byte" and (FWIW, we could define a Rust Abstract Machine byte to be 8 Rust Abstract Machine bits, but then a "bit" would be 0, 1, U (uninitialized), or a fragment of a pointer.) |
Sure, but isn't this actually correct? If an That is, when we define a "byte" in the abstract machine, a I agree with you that some people will think that an The only thing we can do is explain things properly in the docs and the spec. If I'm writing safe Rust code, and see an u8, thinking that it is 8-bits that can be 0 or 1 is actually ok-ish, since I can get hit by provenance or uninitialized memory in safe Rust. The question is whether we want to introduce a different term to denote this other types of |
That's a big "if", neither of those is settled and in particular integers likely shouldn't carry provenance. |
I'm not sure doing this is worth it. I don't think it is possible to just end up with a single uninitialized bit, or to copy the provenance of a single bit.
Sure, all of this hangs together. But in the hypothetical case in which we specify that a Byte carries the information that it is a fragment of a pointer, and |
Sorry, what I mean is
This is moving towards the edge of what is known about how to specify an abstract machine that supports both pointer provenance and ptr-int-casts. I am going to switch to the LLVM Abstract Machine here because that is where at least some things are known. When doing a
|
Regarding the second option, I think I remembered what the problem was with "integer load preserves pointer fragment; doing arithmetic on the fragment returns poison": this invalidates the equations |
The situation would become a lot simpler if having no provenance were strictly worse than having some provenance. In that case, optimizing an operation defined as stripping provenance to instead preserve provenance, e.g. with load-store elimination, would be sound. But for that to be true, for any given object, either:
In C, that dichotomy seems non-viable, since any object whose address escaped is assumed to be accessible with no provenance[1] , but if the object is accessible with any provenance then it's hard to enforce in-bounds conditions, as the paper describes. I am not sure there's no way to make immediate in-bounds enforcement work, but many C programs do use pointer arithmetic to temporarily take pointers out of bounds before bringing them back in bounds, and it would be nice to support them. In Rust, on the other hand, Stacked Borrows prohibits accessing most objects with no provenance, unless a reference to them (with valid provenance) was converted to a raw pointer and the corresponding entry is still on the top of the stack. Thus, most objects satisfy condition 1. Perhaps objects converted to raw pointers could be made to satisfy condition 2? We already distinguish But in any case... isn't this all a bit academic? We can't define a memory model for Rust that won't be upheld by LLVM's actual optimizations. We could consider violations to be bugs in LLVM, but that only makes sense if there's some plan or expectation that LLVM will comply with the model in the future. In my understanding, LLVM currently lazily preserves provenance, at the cost of treating Their third option looks the most conservative, but even with that option, if we specify that (i) stripping provenance can make things go from UB to non-UB and (ii) [1] CHERI violates this guarantee, but it also arguably violates the C spec, depending on how you interpret it. In any case, some programs expect that guarantee to be upheld. |
That runs against the very point of provenance, which is to introduce extra UB by distinguishing pointers.
You don't need "out of thin air" for this -- casting a pointer to an integer and back loses provenance (assuming integers don't have provenance). So even without guessing, you can access an allocation with a pointer that has no provenance.
An interesting thought. Basically, instead of interpreting
In some sense it is, but on the other hand there is no memory model that is upheld by LLVM's actual optimizations. The conjunction of all the axioms encoded by those optimizations implies false, as is demonstrated by provenance-related miscompilations. So LLVM will have to change anyway, if it wants to claim to implement any memory model. Because of that, I also don't think any statement of the form "LLVM currently does X with provenance" is meaningful. We can only speculate about the intention of the LLVM developers, but what LLVM does is just broken. For example, we can observe that integers are subject to full GVN and arithmetic transformations and hence carry no provenance; but LLVM also optimizes ptr-int-ptr roundtrips away (both in the form of eliminating casts and load-store-elimination for integer-typed accesses) which only makes sense if provenance is preserved. Depending on which subset of LLVMs optimizations you look, "what LLVM does" is different, and it is impossible to reconcile them. Of course, ultimately whatever LLVM moves to basically has to become what Rust does. So this likely should be discussed on the LVLM side. Unfortunately, my attempts to kick off such discussions on the LVLM bugtracker have been mostly fruitless. Very few people even acknowledge the lack of an operational specification of the LLVM Abstract Machine as a serious problem. My collaborators on the LLVM semantics paper have a longer history of working with the LLVM community, but even they are saying is it very hard to do any progress -- there is a lot of opposition to losing any of the existing optimizations, even if these optimizations demonstrably lead to miscompilations. I mean, this paper is two years old at this point, their approach fixes plenty of real-world bugs, they even have all sorts of tooling to automatically verify some optimizations correct, and it requries much less radical changes to the IR than introducing a new family of All of this is the reason why I usually stay clear of these open questions. We can say and define lots of things about our abstract machine that do not depend on how exactly this question gets answered. This includes specifying the general Maybe, to support C code that e.g. takes an arbitrary buffer of memory, and compresses or encrypts it, we need a But beyond this, we basically just can't say what exactly our semantics are for when data with provenance "flows" into a value of integer type. And for now that's okay, we got bigger problems. But @gnzlbg asked about |
However I don't think this is enough to make stripping provenance a legal transformation. The provenance-free pointer will still match a different item in the borrow stack than it would have with provenance. In particular, if the borrow stack has the item with provenance above the one without, stripping provenance leads to more stuff being popped of the borrow stack, which in turn can introduce UB. |
Is there any hope of sidestepping the general LLVM bugward-compatibility problem by asking LLVM to add explicit but optional annotations for strips/preserves/has/lacks/etc provenance, so we could define and implement whatever semantics we want independently of the C/C++ headache? |
I see little hope for that happening without incurring massive performance hits because all sorts of optimizations will not understand those annotations -- and if they just ignore them without understanding them, things will be unsound. |
When this comes to consensus, or maybe before, I think an unstable /// this type is defined to be the Rust abstract machine byte.
/// (whatever that ends up being defined as)
/// see: https://github.com/rust-lang/unsafe-code-guidelines/issues/181
#[repr(transparent)]
struct Byte(MaybeUninit<u8>) |
I don't think the actual Abstract Machine byte can be reflected into the standard library. That would be more like The abstract machine byte is defined outside the language and says how e.g. an interpreter would have to represent a Rust byte. It allows explicitly inspecting whether a byte is initialized and what its provenance is. In contrast, your |
I have a hard time seeing how transforming an operation that destroys provenance into one that doesn't is invalid. There are limited ways to "observe" provenance, and, at worst, it makes some operations not UB (which is always valid as I have observed in #253), but I can't see how it could introduce UB, except where the original operation has UB when provenance was destroyed. Likewise, a compiler could treat the operation as-if it does, but remove the actual operation (this wouldn't work, for example, for miri, absent an operation to do this, as discussed above). Though I don't believe this question has been tested (one advantage to strict-aliasing, you don't have to worry about arbitrary accesses of integer types that do this). |
That depends on what you mean by "destroys provenance". If you mean removing an operation that turns a pointer with provenance into a pointer that is completely invalid to use, then that's fine since assuming no UB in the input, the pointer could not have been used afterwards. But the more usual meaning is when you have an operation that turns a pointer with provenance into a pointer of unknown provenance, because pointers of unknown provenance can be used in lots of ways because the Note that since we're talking about the language semantics here, there is no "true provenance" to fall back on when a pointer of unknown provenance is accessed. So it doesn't make sense to say that "the original operation has UB when provenance was destroyed" because if provenance is destroyed then it's valid as long as the memory being accessed is there and has the right type and so on, but nothing about where the pointer came from because we don't remember or understand how this pointer came to be, that's what unknown provenance means. Regarding sketchy integer operations on pointers, another one that recently came up on zulip was |
Destroying provenance would mean invalidating the pointer, which means that the pointer cannot be used for basically anything (it is, for all intents and purposes, the same as a pointer that has never been valid after that operation). Creating a new pointer with an unknown provenance, that could potentially be a lot wider than the original pointer, should be impossible (because then a compiler could never assume that black-box code not exceed the provenance of any pointer given to it), and I believe is under stacked borrows.
Yeah, that's a fun one (I've done it before, in assembly code, though nothing high level). Strictly speaking, that would result in an invalid pointer under the model I mentioned, though I can see how it would be desirable. It could possibly be extended to include those operations, but the wording for that would be fun. |
Under the current definition, ptr-to-int casts have exactly this "widening" behavior. I don't think it is reasonable (or at least this seems to be an axiom of the group) that we can eliminate all provenance removing operations, because programmers want to be able to do fancy bit tricks with pointers, serialize them into bytes and send them over the network, and all sorts of other things that we couldn't possibly track from within the abstract machine. It's possible to declare those things UB but I don't think that solution would fly. |
That seems dangerous. Consider the following code: extern"C"{
pub fn does_something(param: *mut usize);
}
#[repr(C)]
struct Interesting{
array: [usize;1], // prevents *pointer-interconvertibility* trick under lccc's model
exclusive_access: *mut ()
};
pub fn do_interesting_things(ptr: &mut Interesting){
let x = ptr.exclusive_access;
does_something(&mut ptr.array[0]);
ptr.exclusive_access =x;
} Under both Stacked Borrows, and the (hopefully more permissive) lccc model, it should be a valid assumption that ptr.exclusive_access is not modified (as far as I can tell #256 and #2 only deal with reads). If pointer casts can remove provenance in a way that allows the pointer to still be valid, then this definition of pub unsafe extern"C" fn does_something(ptr: *mut usize){
let ptr = std::transmute(std::transmute<usize>(ptr)) as *mut Interesting;
(*ptr).exclusive_access = ptr as *mut ();
} |
I don't think that code is valid under SB. The problem is not the casts in pub fn do_interesting_things(ptr: &mut Interesting) {
let x = ptr.exclusive_access;
let y = &mut ptr.exclusive_access as *mut _;
does_something(&mut ptr.array[0]);
ptr.exclusive_access = x;
} |
Even in that code, I'd argue shouldn't be well-defined. |
I don't deny that this is a weird situation and it really should not be allowed. However, what it reflects is really that SB is an approximation of "true provenance", and it's not clear to me that true provenance is decidable at all. It is, I think, definable, most likely using an axiomatic method similar to the C++ standard, but tracking the actual data flow may be next to impossible - and even then it is inaccurate. Consider again the byte serialization example, where the pointer value was smuggled through "the real world" and so may as well be an "out of thin air" pointer; in such cases you can't tell whether it was the good pointer or the bitwise-identical bad pointer that was smuggled because the two situations have identical execution traces, so you can't make one UB and the other not.
In SB, it is memory locations that have borrow stacks attached to them, containing essentially the set of references/pointers that can be used to access them. In your first example, the local variable The reasoning behind why my variation is not rejected by SB is that once you take a reference to |
On Fri, Dec 4, 2020 at 13:53 Mario Carneiro ***@***.***> wrote:
Even in that code, I'd argue shouldn't be well-defined. y is not given to
does_something so y shouldn't be allowed to access it.
I don't deny that this is a weird situation and it really should not be
allowed. However, what it reflects is really that SB is an approximation of
"true provenance", and it's not clear to me that true provenance is
decidable at all. It is, I think, *definable*, most likely using an
axiomatic method similar to the C++ standard, but tracking the actual data
flow may be next to impossible - and even then it is inaccurate. Consider
again the byte serialization example, where the pointer value was smuggled
through "the real world" and so may as well be an "out of thin air"
pointer; in such cases you can't tell whether it was the good pointer or
the bitwise-identical bad pointer that was smuggled because the two
situations have identical execution traces, so you can't make one UB and
the other not.
If that's allowed, then ptr should be sufficient to gain write access to
ptr.exclusive_access
In SB, it is *memory locations* that have borrow stacks attached to them,
containing essentially the set of references/pointers that can be used to
access them. In your first example, the local variable ptr: &mut
Interesting is on the borrow stack of ptr.exclusive_access, but
does_something does not have that variable, it has the local variable ptr:
*mut usize, which was never put on the borrow stack for
ptr.exclusive_access, so using it or a pointer derived from it to access
ptr.exclusive_access is UB.
The reasoning behind why my variation is not rejected by SB is that once
you take a reference to ptr.exclusive_access, the compiler loses its
confidence in the ability to predict the data flow of that value, and it is
difficult to encode any such information in the semantics in a way that
does not predict the future. C does something similar, where taking a
reference is an effectful operation on the semantics if the pointer
"escapes" through a function call. SB just does it at the ref-to-ptr
barrier instead, because this is easy to simulate.
So then assuming y isn't used by "does_something" because it isn't passed
into it is valid?
… —
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#181 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABGLD26BSXRDWRD5BJN3ROTSTEV3LANCNFSM4IHBPADA>
.
|
No, SB does not validate that reasoning in this instance. It has no idea how pointers get from point A to point B; all of that kind of tracking is done on references. Well, something like that at least; the details sometimes vary on when to erase provenance, but I happen to like this interpretation. Doing these kinds of shenanigans are to my mind the point of using pointers; if you want provenance just use references instead. |
Very interesting. I'd like to look into this matter further, and whether or not it makes sense to change it, but that could probably do with it's own issue. I would still consider a "provenance destroying operation" to not simply erase it, but destroy it entirely, and invalidate the resulting pointer. This would allow operations that do destroy provenance without prohibiting load-store elimitation. |
On top of what @digama0 said, here are some links to related LLVM/GCC issues that have code examples:
I am working on an upcoming blog post that will also hopefully help answer this question. (FWIW, in the future it'd be good to open a new issue for such questions. The question is not really about finding a concrete formal definition for "(Abstract Machine) byte", after all. Now someone discussing Abstract Machine Bytes will have to tease apart this only loosely related discussion.) |
That post is finally done. :) Here you go. It contains a concrete example for a "provenance-erasing" optimization (together with some other common optimizations) introducing a miscompilation. |
I'll make a PR to put the MiniRust definition into the glossary. |
We use the term "byte" in the glossary without defining it. See #180 (comment) and #175 for why that might not be as trivial as it might seem.
The text was updated successfully, but these errors were encountered: