Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define: Byte #181

Closed
RalfJung opened this issue Jul 26, 2019 · 29 comments · Fixed by #446
Closed

Define: Byte #181

RalfJung opened this issue Jul 26, 2019 · 29 comments · Fixed by #446
Labels
A-memory Topic: Related to memory accesses A-provenance Topic: Related to when which values have which provenance (but not which alias restrictions follow) C-terminology Category: Discussing terminology -- which term to use, how to define it, adding it to the glossary

Comments

@RalfJung
Copy link
Member

We use the term "byte" in the glossary without defining it. See #180 (comment) and #175 for why that might not be as trivial as it might seem.

@RalfJung
Copy link
Member Author

RalfJung commented Aug 9, 2019

"abyte" or "ambyte" as a short for "abstract (machine) byte" might be a term we could use instead.

@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 9, 2019

Considering just real hardware, it isn't uncommon to have to look up what kind of "byte" we are talking about. For example, whether a byte is 8, 16 (DSPs), or N bits wide, whether a byte can be 0xUU (e.g. Itanium), whether a byte can carry provenance or other information (e.g. CHERI), etc. depends on the machine.

What matters for Rust programs is what a "byte" is in the Rust abstract machine, not what it is in the hardware the binary runs on. So I would just define "byte" with the meaning of it for the abstract machine, and then see if we also need to introduce a different term to specifically talk about "hardware bytes". Is there a part of the spec in which we actually need to talk specifically about "hardware bytes" differentiating them from "abstract machine bytes"?

@RalfJung
Copy link
Member Author

RalfJung commented Aug 9, 2019

whether a byte can be 0xUU (e.g. Itanium)

Itanium's NaT are not exactly like 0xUU, I think. But I get your point.

The fact remains though that the vast majority of people, and plenty of online material, equates "byte" and u8. Some people would say a "byte" is "N bits" where N might in rare cases be different from 8, but that's not enough flexibility for the Rust Abstract Machine byte (at least if you say a bit is 0 or 1, and changing that seems worse than changing "byte"). Itanium, CHERI and friends are extremely niche architectures that barely anyone knows much about.

(FWIW, we could define a Rust Abstract Machine byte to be 8 Rust Abstract Machine bits, but then a "bit" would be 0, 1, U (uninitialized), or a fragment of a pointer.)

@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 9, 2019

The fact remains though that the vast majority of people, and plenty of online material, equates "byte" and u8.

Sure, but isn't this actually correct? If an u8 can be 0xUU and carry provenance information, then that's exactly what a byte is in the abstract machine.

That is, when we define a "byte" in the abstract machine, a u8 should be defined as a "byte" as well, and the docs for u8 should explain all of this, such that we end with the spec, API-docs, the book, etc. all in sync.

I agree with you that some people will think that an u8 is just 8 bits that can only be zero or one. That's a relatively useful model for a lot of people, e.g., a lot of C and C++ programmers think of char in this way, and they are able to get a lot of work done even though these languages have uninitialized memory, trap representation, CHAR_BITS, etc.

The only thing we can do is explain things properly in the docs and the spec. If I'm writing safe Rust code, and see an u8, thinking that it is 8-bits that can be 0 or 1 is actually ok-ish, since I can get hit by provenance or uninitialized memory in safe Rust.

The question is whether we want to introduce a different term to denote this other types of u8, that are only 0 or 1s, and can't be undef or can't carry provenance.

@hanna-kruppe
Copy link

Sure, but isn't this actually correct? If an u8 can be 0xUU and carry provenance information, then that's exactly what a byte is in the abstract machine.

That's a big "if", neither of those is settled and in particular integers likely shouldn't carry provenance.

@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 9, 2019

(FWIW, we could define a Rust Abstract Machine byte to be 8 Rust Abstract Machine bits, but then a "bit" would be 0, 1, U (uninitialized), or a fragment of a pointer.)

I'm not sure doing this is worth it. I don't think it is possible to just end up with a single uninitialized bit, or to copy the provenance of a single bit.


That's a big "if", neither of those is settled and in particular integers likely shouldn't carry provenance.

Sure, all of this hangs together. But in the hypothetical case in which we specify that a Byte carries the information that it is a fragment of a pointer, and u8 is not a Byte, how can I copy a pointer to another memory location ? AFAICT we would need to introduce another different primitive type that denotes a byte, and can be used to, e.g., implement a memcpy, because a memcpy implemented on top of u8 wouldn't copy that.

@RalfJung
Copy link
Member Author

RalfJung commented Aug 9, 2019

Sure, but isn't this actually correct? If an u8 can be 0xUU and carry provenance information, then that's exactly what a byte is in the abstract machine.

Sorry, what I mean is 0..256 -- the "abstract values" that a u8 is generally thought represent.
Notice that even if we allow uninitialized u8, that just makes it Option<0..256> -- but it still fails to account for pointer provenance.

But in the hypothetical case in which we specify that a Byte carries the information that it is a fragment of a pointer, and u8 is not a Byte, how can I copy a pointer to another memory location ? AFAICT we would need to introduce another different primitive type that denotes a byte, and can be used to, e.g., implement a memcpy, because a memcpy implemented on top of u8 wouldn't copy that.

This is moving towards the edge of what is known about how to specify an abstract machine that supports both pointer provenance and ptr-int-casts. I am going to switch to the LLVM Abstract Machine here because that is where at least some things are known. When doing a i8 load of a byte in memory that is a pointer fragment, we have a few options:

  • We strip the provenance, return just that byte of the raw address. This is what Miri will implement, once it supports this operation, because it is the easiest to implement. But this is a bad choice for an optimizing IR because it means load-store elimination is illegal: doing an i8 load of a byte followed by a store of that byte to the same location ends up stripping the provenance of whatever is stored in memory; optimizing that away to not strip provenance can introduce UB if the program relied on the provenance being gone.

    In this model, a memcpy operating bytewise with i8 strips provenance of the pointers it copies. The resulting pointer is usable, but is not carrying the same provenance as the original pointer: it is as if the original pointer was cast to an integer and back.

  • We preserve provenance. But now we have an i8 with provenance, what happens when this is used for arithmetic or anything like that? Lazily stripping provenance only when such an operation happens invalidates some arithmetic transformations, trying to preserve provenance across the arithmetic is even worse, making arithmetic UB means arithmetic has side-effects with is bad for optimizations as well... maybe the result could be poison if an input byte has provenance. I am not sure if that has been explored.

    In this model, a memcpy operating bytewise with i8 works fine.

  • To avoid these problems, and allow load-store elimination, we can instead say that a load of type i8 is UB (or yields poison) if there is provenance. We also introduce a new type (family) dN (for "data" of size N bits), with the difference being that iN carries no provenance while dN just preserves whatever provenance exists in memory. Conversely, dN does not permit any arithmetic or other operations, just loads and stores, so there is no trouble with it having provenance.

    In this model, a memcpy would use d8, and that would work fine.

@RalfJung
Copy link
Member Author

RalfJung commented Aug 9, 2019

Regarding the second option, I think I remembered what the problem was with "integer load preserves pointer fragment; doing arithmetic on the fragment returns poison": this invalidates the equations x + 0 = x and x * 1 = x.

@comex
Copy link

comex commented Aug 10, 2019

The situation would become a lot simpler if having no provenance were strictly worse than having some provenance. In that case, optimizing an operation defined as stripping provenance to instead preserve provenance, e.g. with load-store elimination, would be sound.

But for that to be true, for any given object, either:

  1. It must be UB to access that object with no provenance – e.g. to pull that object's address 'out of thin air' as an integer, cast the integer to a pointer, and use the pointer; or
  2. It must be defined behavior to access the object with any provenance. (Which means that any UB caused by violating inbounds conditions must occur beforehand or not at all.)

In C, that dichotomy seems non-viable, since any object whose address escaped is assumed to be accessible with no provenance[1] , but if the object is accessible with any provenance then it's hard to enforce in-bounds conditions, as the paper describes. I am not sure there's no way to make immediate in-bounds enforcement work, but many C programs do use pointer arithmetic to temporarily take pointers out of bounds before bringing them back in bounds, and it would be nice to support them.

In Rust, on the other hand, Stacked Borrows prohibits accessing most objects with no provenance, unless a reference to them (with valid provenance) was converted to a raw pointer and the corresponding entry is still on the top of the stack. Thus, most objects satisfy condition 1. Perhaps objects converted to raw pointers could be made to satisfy condition 2? We already distinguish offset from wrapping_offset, so from a usability perspective it's less problematic to make offset-ing out of bounds insta-UB. That doesn't alleviate the paper's motivation to defer the UB for the optimizer's sake, but perhaps there is a workaround, or perhaps we could just make all raw pointer offsetting non-inbounds (or at least somehow weaker).

But in any case... isn't this all a bit academic? We can't define a memory model for Rust that won't be upheld by LLVM's actual optimizations. We could consider violations to be bugs in LLVM, but that only makes sense if there's some plan or expectation that LLVM will comply with the model in the future. In my understanding, LLVM currently lazily preserves provenance, at the cost of treating inbounds incorrectly in some edge cases. If we're optimistic, we can expect that this will be fixed someday, somehow, and that the fixed version will uphold some formal memory model. But we don't know what that model will be. Even if we assume they'll pick from one of @RalfJung's options above – which would be a hard sell, since each of them would break existing C code, but perhaps there are truly no better alternatives – we don't know which option they'll pick.

Their third option looks the most conservative, but even with that option, if we specify that (i) stripping provenance can make things go from UB to non-UB and (ii) i8 loads strip provenance, we can get in trouble if LLVM decides on a model that satisfies (i) but not (ii).

[1] CHERI violates this guarantee, but it also arguably violates the C spec, depending on how you interpret it. In any case, some programs expect that guarantee to be upheld.

@RalfJung
Copy link
Member Author

RalfJung commented Aug 11, 2019

optimizing an operation defined as stripping provenance to instead preserve provenance, e.g. with load-store elimination, would be sound.

That runs against the very point of provenance, which is to introduce extra UB by distinguishing pointers.
The other direction could be true: you can always strip provenance, that will not introduce UB. I believe this is true for the "basic" provenance of C that remembers which allocation a pointer originally pointed to. It is not true for Stacked Borrows though, and I don't think it is true for the provenance needed for restrict in C either.

It must be UB to access that object with no provenance – e.g. to pull that object's address 'out of thin air' as an integer, cast the integer to a pointer, and use the pointer; or

You don't need "out of thin air" for this -- casting a pointer to an integer and back loses provenance (assuming integers don't have provenance). So even without guessing, you can access an allocation with a pointer that has no provenance.

In Rust, on the other hand, Stacked Borrows prohibits accessing most objects with no provenance, unless a reference to them (with valid provenance) was converted to a raw pointer and the corresponding entry is still on the top of the stack. Thus, most objects satisfy condition 1. Perhaps objects converted to raw pointers could be made to satisfy condition 2?

An interesting thought. Basically, instead of interpreting Untagged items in the borrow stack as giving permission to untagged pointers, they would be interpreted as giving permission to all pointers -- a wildcard. At a first glance, it seems like this should have no negative effect on optimizations, since all optimizations rely on controlling which tags are allowed to access some location, and once untagged pointers may do accesses, the game is over -- untagged pointers could come from anywhere.

But in any case... isn't this all a bit academic? We can't define a memory model for Rust that won't be upheld by LLVM's actual optimizations.

In some sense it is, but on the other hand there is no memory model that is upheld by LLVM's actual optimizations. The conjunction of all the axioms encoded by those optimizations implies false, as is demonstrated by provenance-related miscompilations. So LLVM will have to change anyway, if it wants to claim to implement any memory model.

Because of that, I also don't think any statement of the form "LLVM currently does X with provenance" is meaningful. We can only speculate about the intention of the LLVM developers, but what LLVM does is just broken. For example, we can observe that integers are subject to full GVN and arithmetic transformations and hence carry no provenance; but LLVM also optimizes ptr-int-ptr roundtrips away (both in the form of eliminating casts and load-store-elimination for integer-typed accesses) which only makes sense if provenance is preserved. Depending on which subset of LLVMs optimizations you look, "what LLVM does" is different, and it is impossible to reconcile them.

Of course, ultimately whatever LLVM moves to basically has to become what Rust does. So this likely should be discussed on the LVLM side. Unfortunately, my attempts to kick off such discussions on the LVLM bugtracker have been mostly fruitless. Very few people even acknowledge the lack of an operational specification of the LLVM Abstract Machine as a serious problem. My collaborators on the LLVM semantics paper have a longer history of working with the LLVM community, but even they are saying is it very hard to do any progress -- there is a lot of opposition to losing any of the existing optimizations, even if these optimizations demonstrably lead to miscompilations. I mean, this paper is two years old at this point, their approach fixes plenty of real-world bugs, they even have all sorts of tooling to automatically verify some optimizations correct, and it requries much less radical changes to the IR than introducing a new family of dN types -- and still, it doesn't seem like we are any closer to actually having support for freezing in LLVM than 2 years ago. It's very frustrating. :(

All of this is the reason why I usually stay clear of these open questions. We can say and define lots of things about our abstract machine that do not depend on how exactly this question gets answered. This includes specifying the general Value domain and notions like Byte.

Maybe, to support C code that e.g. takes an arbitrary buffer of memory, and compresses or encrypts it, we need a strip_provenance operation. This would be a NOP on real hardware, but far from a NOP on the abstract machine. Code could combine that with freeze to handle both padding and pointer values for cases like "I want to take an arbitrary buffer and do some math on it". If, in Rust or C, loading a pointer value from memory at integer type and then doing arithmetic on it is valid, the compiler would have to insert strip_provenance in plenty of places where integers with provenance could occur -- because I find it very unlikely that LLVM IR will have a semantics where such code without strip_provenance is allowed (they would have to basically remove GVN, which is not going to happen).

But beyond this, we basically just can't say what exactly our semantics are for when data with provenance "flows" into a value of integer type. And for now that's okay, we got bigger problems. But @gnzlbg asked about memcpy above, and about the relationship between the u8 type inside the abstract machine and the Byte type we use to define the abstract machine's memory interface -- in other words, he asked about provenance in integers that are not subject to any (arithmetic or otherwise) operation, and that prompted my response. I guess all of this is a long-winded way of saying we should punt on that question for now.

@RalfJung
Copy link
Member Author

An interesting thought. Basically, instead of interpreting Untagged items in the borrow stack as giving permission to untagged pointers, they would be interpreted as giving permission to all pointers -- a wildcard.

However I don't think this is enough to make stripping provenance a legal transformation. The provenance-free pointer will still match a different item in the borrow stack than it would have with provenance. In particular, if the borrow stack has the item with provenance above the one without, stripping provenance leads to more stuff being popped of the borrow stack, which in turn can introduce UB.

@Ixrec
Copy link
Contributor

Ixrec commented Aug 11, 2019

Is there any hope of sidestepping the general LLVM bugward-compatibility problem by asking LLVM to add explicit but optional annotations for strips/preserves/has/lacks/etc provenance, so we could define and implement whatever semantics we want independently of the C/C++ headache?

@RalfJung
Copy link
Member Author

I see little hope for that happening without incurring massive performance hits because all sorts of optimizations will not understand those annotations -- and if they just ignore them without understanding them, things will be unsound.

@RalfJung RalfJung added C-terminology Category: Discussing terminology -- which term to use, how to define it, adding it to the glossary A-memory Topic: Related to memory accesses A-provenance Topic: Related to when which values have which provenance (but not which alias restrictions follow) labels Aug 14, 2019
@Dante-Broggi
Copy link
Contributor

When this comes to consensus, or maybe before, I think an unstable core::mem::Byte type should be added to solidify the concept and to simplify working with raw bytes.
e.g. (as an example):

/// this type is defined to be the Rust abstract machine byte.
/// (whatever that ends up being defined as)
/// see: https://github.com/rust-lang/unsafe-code-guidelines/issues/181
#[repr(transparent)]
struct Byte(MaybeUninit<u8>)

@RalfJung
Copy link
Member Author

I don't think the actual Abstract Machine byte can be reflected into the standard library. That would be more like Option<u8>, plus taking into account provenance.

The abstract machine byte is defined outside the language and says how e.g. an interpreter would have to represent a Rust byte. It allows explicitly inspecting whether a byte is initialized and what its provenance is. In contrast, your Byte type is defined inside the language and used e.g. by low-level functions that want to be able to move around arbitrary data, but it is fundamentally impossible to test if a given byte is initialized.

@chorman0773
Copy link
Contributor

I have a hard time seeing how transforming an operation that destroys provenance into one that doesn't is invalid. There are limited ways to "observe" provenance, and, at worst, it makes some operations not UB (which is always valid as I have observed in #253), but I can't see how it could introduce UB, except where the original operation has UB when provenance was destroyed. Likewise, a compiler could treat the operation as-if it does, but remove the actual operation (this wouldn't work, for example, for miri, absent an operation to do this, as discussed above). Though I don't believe this question has been tested (one advantage to strict-aliasing, you don't have to worry about arbitrary accesses of integer types that do this).
If provenance is not destroyed, would it not make sense to allow addition and subtraction on integers "with" provenance. This would allow the fundamental implementation of ptr.wrapping_offset(i) to be defined as equivalent to (ptr as usize +i*size_of::<T>()) as *mut T. The is the model that lccc uses: "if a value is an integer value that was converted from a pointer, after some sequence of addition and subtraction such that the equivalent pointer arithmetic operation has defined behaviour, the result is the result of that equivalent operation". Multiplication is a lot more interesting, though. Standard multiplication like (ptr as usize*2) as *mut T probably would, but *1 is a lot more interesting. Likewise, (*2)/2 is also interesting, though I'd argue since the operation produces the original value it can preserve it (provided it doesn't overflow), though I'm not particularly attached to that argument.

@digama0
Copy link

digama0 commented Dec 4, 2020

I have a hard time seeing how transforming an operation that destroys provenance into one that doesn't is invalid.

That depends on what you mean by "destroys provenance". If you mean removing an operation that turns a pointer with provenance into a pointer that is completely invalid to use, then that's fine since assuming no UB in the input, the pointer could not have been used afterwards. But the more usual meaning is when you have an operation that turns a pointer with provenance into a pointer of unknown provenance, because pointers of unknown provenance can be used in lots of ways because the compiler language semantics have to be conservative and assume the access is OK, while if you put the provenance back then it might decide that the code is UB after all. So it does introduce UB to eliminate a provenance-removing operation.

Note that since we're talking about the language semantics here, there is no "true provenance" to fall back on when a pointer of unknown provenance is accessed. So it doesn't make sense to say that "the original operation has UB when provenance was destroyed" because if provenance is destroyed then it's valid as long as the memory being accessed is there and has the right type and so on, but nothing about where the pointer came from because we don't remember or understand how this pointer came to be, that's what unknown provenance means.

Regarding sketchy integer operations on pointers, another one that recently came up on zulip was p as usize & !3 or similar, for align-down (or align-up if you use (p as usize + 3) & !3 instead), which is certainly a common operation on pointers but involves more than simply linear arithmetic.

@chorman0773
Copy link
Contributor

That depends on what you mean by "destroys provenance"

Destroying provenance would mean invalidating the pointer, which means that the pointer cannot be used for basically anything (it is, for all intents and purposes, the same as a pointer that has never been valid after that operation). Creating a new pointer with an unknown provenance, that could potentially be a lot wider than the original pointer, should be impossible (because then a compiler could never assume that black-box code not exceed the provenance of any pointer given to it), and I believe is under stacked borrows.

Regarding sketchy integer operations on pointers, another one that recently came up on zulip was p as usize & !3 or similar, for align-down (or align-up if you use (p as usize + 3) & !3 instead), which is certainly a common operation on pointers but involves more than simply linear arithmetic.

Yeah, that's a fun one (I've done it before, in assembly code, though nothing high level). Strictly speaking, that would result in an invalid pointer under the model I mentioned, though I can see how it would be desirable. It could possibly be extended to include those operations, but the wording for that would be fun.

@digama0
Copy link

digama0 commented Dec 4, 2020

Creating a new pointer with an unknown provenance, that could potentially be a lot wider than the original pointer, should be impossible (because then a compiler could never assume that black-box code not exceed the provenance of any pointer given to it), and I believe is under stacked borrows.

Under the current definition, ptr-to-int casts have exactly this "widening" behavior. I don't think it is reasonable (or at least this seems to be an axiom of the group) that we can eliminate all provenance removing operations, because programmers want to be able to do fancy bit tricks with pointers, serialize them into bytes and send them over the network, and all sorts of other things that we couldn't possibly track from within the abstract machine. It's possible to declare those things UB but I don't think that solution would fly.

@chorman0773
Copy link
Contributor

Under the current definition, ptr-to-int casts have exactly this "widening" behavior

That seems dangerous. Consider the following code:

extern"C"{
pub fn does_something(param: *mut usize);
}
#[repr(C)]
struct Interesting{
    array: [usize;1], // prevents *pointer-interconvertibility* trick under lccc's model
    exclusive_access: *mut ()
};

pub fn do_interesting_things(ptr: &mut Interesting){
    let x = ptr.exclusive_access;
    does_something(&mut ptr.array[0]);
    ptr.exclusive_access =x;
}

Under both Stacked Borrows, and the (hopefully more permissive) lccc model, it should be a valid assumption that ptr.exclusive_access is not modified (as far as I can tell #256 and #2 only deal with reads). If pointer casts can remove provenance in a way that allows the pointer to still be valid, then this definition of does_something would make the assumption unsound (note that only a rust definition would permit this, in C++ using reinterpret_cast would be a no-op because of the round-trip rule, and bit_cast would be unspecified, likewise for a pointer-cast in C):

pub unsafe extern"C" fn does_something(ptr: *mut usize){
     let ptr = std::transmute(std::transmute<usize>(ptr)) as *mut Interesting;
     (*ptr).exclusive_access = ptr as *mut ();
}

@digama0
Copy link

digama0 commented Dec 4, 2020

I don't think that code is valid under SB. The problem is not the casts in does_something, but rather the subsequent access of *ptr).exclusive_access, which at that point is not accepting writes from raw pointers, because the only thing on the borrow stack for that location is ptr: &mut Interesting. To make that valid you would have to first access ptr.exclusive_access in a way that makes it plausible that a later raw pointer access could be made, for example

pub fn do_interesting_things(ptr: &mut Interesting) {
    let x = ptr.exclusive_access;
    let y = &mut ptr.exclusive_access as *mut _;
    does_something(&mut ptr.array[0]);
    ptr.exclusive_access = x;
}

@chorman0773
Copy link
Contributor

Even in that code, I'd argue shouldn't be well-defined. y is not given to does_something so y shouldn't be allowed to access it.
Also, I do not see how creating a new pointer to a subobject would alter the well-definedness. If that's allowed, then ptr should be sufficient to gain write access to ptr.exclusive_access, as ptr is Uniq, and as you say, ptr is on the borrow stack for that location.
If this is actually permitted by stacked borrows, that represents a defect in my pointer model, as it would not permit it (the pointer &mut ptr.array[0] can reach to ptr.array[0] and the immediately enclosing array, ptr.array). However, I would be suprised if it is, and certainly if that is intended to be case.

@digama0
Copy link

digama0 commented Dec 4, 2020

Even in that code, I'd argue shouldn't be well-defined. y is not given to does_something so y shouldn't be allowed to access it.

I don't deny that this is a weird situation and it really should not be allowed. However, what it reflects is really that SB is an approximation of "true provenance", and it's not clear to me that true provenance is decidable at all. It is, I think, definable, most likely using an axiomatic method similar to the C++ standard, but tracking the actual data flow may be next to impossible - and even then it is inaccurate. Consider again the byte serialization example, where the pointer value was smuggled through "the real world" and so may as well be an "out of thin air" pointer; in such cases you can't tell whether it was the good pointer or the bitwise-identical bad pointer that was smuggled because the two situations have identical execution traces, so you can't make one UB and the other not.

If that's allowed, then ptr should be sufficient to gain write access to ptr.exclusive_access

In SB, it is memory locations that have borrow stacks attached to them, containing essentially the set of references/pointers that can be used to access them. In your first example, the local variable ptr: &mut Interesting is on the borrow stack of ptr.exclusive_access, but does_something does not have that variable, it has the local variable ptr: *mut usize, which was never put on the borrow stack for ptr.exclusive_access, so using it or a pointer derived from it to access ptr.exclusive_access is UB.

The reasoning behind why my variation is not rejected by SB is that once you take a reference to ptr.exclusive_access, the compiler loses its confidence in the ability to predict the data flow of that value, and it is difficult to encode any such information in the semantics in a way that does not predict the future. C does something similar, where taking a reference is an effectful operation on the semantics if the pointer "escapes" through a function call. SB just does it at the ref-to-ptr barrier instead, because this is easy to simulate.

@chorman0773
Copy link
Contributor

chorman0773 commented Dec 4, 2020 via email

@digama0
Copy link

digama0 commented Dec 4, 2020

So then assuming y isn't used by "does_something" because it isn't passed into it is valid?

No, SB does not validate that reasoning in this instance. It has no idea how pointers get from point A to point B; all of that kind of tracking is done on references. Well, something like that at least; the details sometimes vary on when to erase provenance, but I happen to like this interpretation. Doing these kinds of shenanigans are to my mind the point of using pointers; if you want provenance just use references instead.

@chorman0773
Copy link
Contributor

No, SB does not validate that reasoning in this instance

Very interesting. I'd like to look into this matter further, and whether or not it makes sense to change it, but that could probably do with it's own issue. I would still consider a "provenance destroying operation" to not simply erase it, but destroy it entirely, and invalidate the resulting pointer. This would allow operations that do destroy provenance without prohibiting load-store elimitation.

@RalfJung
Copy link
Member Author

RalfJung commented Dec 6, 2020

I have a hard time seeing how transforming an operation that destroys provenance into one that doesn't is invalid.

On top of what @digama0 said, here are some links to related LLVM/GCC issues that have code examples:

I am working on an upcoming blog post that will also hopefully help answer this question.

(FWIW, in the future it'd be good to open a new issue for such questions. The question is not really about finding a concrete formal definition for "(Abstract Machine) byte", after all. Now someone discussing Abstract Machine Bytes will have to tease apart this only loosely related discussion.)

@RalfJung
Copy link
Member Author

I am working on an upcoming blog post that will also hopefully help answer this question.

That post is finally done. :) Here you go. It contains a concrete example for a "provenance-erasing" optimization (together with some other common optimizations) introducing a miscompilation.

@RalfJung
Copy link
Member Author

RalfJung commented Aug 8, 2023

I'll make a PR to put the MiniRust definition into the glossary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-memory Topic: Related to memory accesses A-provenance Topic: Related to when which values have which provenance (but not which alias restrictions follow) C-terminology Category: Discussing terminology -- which term to use, how to define it, adding it to the glossary
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants