Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide on zero-sized offsets and memory accesses #472

Closed
Tracked by #671
RalfJung opened this issue Oct 28, 2023 · 29 comments
Closed
Tracked by #671

Decide on zero-sized offsets and memory accesses #472

RalfJung opened this issue Oct 28, 2023 · 29 comments

Comments

@RalfJung
Copy link
Member

RalfJung commented Oct 28, 2023

We have previously discussed that we want to make offset(0) always-defined, but never FCP's this decision. We also discussed zero-sized memory accesses, without reaching a clear conclusion. We never discussed the case of zero-offset offset_from.

I propose we resolve all these questions around zero-sized offsets/accesses as follows:

  • ptr.offset(0) is always defined.
  • ptr1.offset_from(ptr2) is always allowed when ptr1.addr() == ptr2.addr().
  • Zero-sized accesses (reads and writes) are always allowed with every pointer.

We also adjust the definition of "dereferenceable for n bytes" to say that every pointer is dereferenceable for 0 bytes. This implies that any aligned non-null pointer is valid as a reference to a zero-sized type.

Together these ensure "provenance monotonicity": if something is allowed on a pointer without provenance, then adding arbitrary provenance to the pointer can never introduce UB. We also achieve that for ptr: *const (), ptr.offset(_) is allowed if and only if ptr.read() is allowed (because they are both always allowed).

For offset_from specifically, we need to deal with ptr::invalid::<u8>(A).offset_from(ptr::invalid::<u8>(A)) being allowed for A != 0 (and maybe even for A == 0, see the next section). This means it must still be allowed when we add arbitrary provenance to both pointers, including arbitrary different provenance.

If ptr::invalid returns a provenance-less pointer, then we already allow zero-sized offset/read/write on such provenance-less pointers, and offset_from as well if both pointers have no provenance and the same address, so the proposal is almost the obvious result of applying "provenance monotonicity closure" to the current semantics: keep everything allowed that we currently allow, keep ptr::invalid unchanged, and allow enough additional cases such that provenance monotonicity holds.

Null pointer

This proposal is almost, but not exactly, the provenance monotonicity closure of the current semantics. There is one further change that does not fall out of provenance monotonicity closure:

ptr::null::<()>().read() is allowed, or more generally zero-sized reads/writes with the null pointer are allowed. Those could remain forbidden without violating provenance monotonicity. We allow them for consistency with ptr::null::<T>().offset(0), which we decided to allow, and which in particular C++ (but not C) allows -- having more UB than C++ without a good reason seems like a bad call, and if offset considers 0 bytes to be "in-bounds at the null pointer", then it seems only fair that reads/writes do the same. References still must be non-null, and we assume that null is never in-bounds of a non-zero-sized allocation, so non-zero-sized accesses at null remain UB regardless of whether that memory is mapped or not. This issue is concerned with zero-sized offsets/accesses only, so changing the rules for any other kind of access is out of scope.

There is one downside to this: we can no longer infer "nonnull" from a read/write having happened on a pointer, unless we know that the size of the access was non-zero. However, we can infer "nonnull" for references in general, and we have NonNull to express non-null-ness of raw pointers, so code can still steer the compiler in the right direction if it has to. Furthermore, the size is generally known post-monomorphization, so all the non-null reasoning LLVM does based on a pointer being used for memory accesses is still valid.

Alternative proposal

All that said, there is an alternative proposal that achieves provenance monotonicity. It considers there to exist some dedicated provenance that covers "zero-sized accesses at every location". Let's call this the "zero-sized provenance". ptr::invalid (and thus ptr::null) would be changed to return a pointer with that special provenance. Int-to-ptr transmute would still yield a pointer without provenance. Zero-sized accesses are then allowed if

  • the pointer has the zero-sized provenance
  • or the pointer has a regular provenance of a live allocation and the address is in-bounds of that allocation

We haven't discussed offset or offset_from under this proposal, but presumably we'd use similar rules: offset(0) is UB on pointers without provenance but allowed on pointers with the zero-sized provenance. On pointers with regular provenance it requires the pointer to be in-bounds. offset_from would be UB if either pointer has no provenance; if they both have the zero-sized provenance then it's allowed if they have the same address; otherwise they must both have regular (non-zero-sized) provenance of some live allocation and be in-bounds of that allocation.

This proposal:

  • allows the optimization discussed in the meeting, assuming allocations cannot shrink
  • achieves provenance monotonicity
  • allows ptr::null::<T>().offset(0)
  • for ptr: *const (), allows ptr.offset(_) if and only if ptr.read() is allowed

However,

  • it achieves provenance monotonicity by making more things UB (for instance, transmute::<_, *const ()>(4usize).read()), which all else being equal seems worse than achieving it by making more things defined
  • it needs a new intrinsic to create pointers with the zero-sized provenance

There's a variant of this proposal which has not one zero-sized provenance but one such provenance for each address; in that case the pointer is in-bounds for 0 bytes only if it points to that address. This is equivalent to having a zero-sized allocation at every address exist at program startup. This model disallows even more code (for instance, ptr::invalid::<()>(4).byte_add(4).read() becomes UB).

Summary

Overall this means we have a design space of (at least) 6 models:

  • zero-sized-accesses-are-nops
  • zero-sized-provenance-global
  • zero-sized-provenance-per-address

and each of them with or without allowing zero-sized null pointer reads/writes. (I'm assuming it as a given here that we do want to allow zero-sized null pointer offsets and offset_from of the null pointer with itself.)

The two zero-sized-provenance models avoid having a "lattice" of provenance with a non-trivial bottom element: the bottom, "no provenance", just doesn't allow anything at all. OTOH it needs the new concept of a zero-sized provenance (or memory pre-initialized with many zero-sized allocations), making the theory more complicated as well. The main proposal being suggested here make the "bottom" provenance support zero-sized accesses anywhere, thus making the provenance lattice less canonical but avoiding a dedicated "zero-sized provenance".

In the meeting people generally weren't convinced by the one optimization example we have that is enabled by the extra UB in the zero-sized-provenance models (in particular since it is incompatible with shrinking allocations). Using zero-sized accesses/dereferenceability as an optimization signal is not clearly a good idea, and when it comes to UB, implicit/unintended signals are dangerous. Provenance as a concept is already deeply mysterious to many programmers; making it more complicated by introducing imaginary provenance for the zero-sized case (with the penalty of misunderstandings being UB) does not seem advisable. The example could be rewritten to use n.max(4) as the length, which would make the assumption that the length is at least 4 explicit (but of course, the programmer might not know that they have to write this to get all the optimizations).

So, because of all of that, I think we should go with the model that has the least UB. I'll wait a bit before starting FCP to see whether people have any new points they'd like to see added to this summary.

@RalfJung
Copy link
Member Author

RalfJung commented Nov 4, 2023

It's been a week and there were no comments. Let's go then.
@rfcbot merge

@rfcbot
Copy link
Collaborator

rfcbot commented Nov 4, 2023

Team member @RalfJung has proposed to merge this. The next step is review by the rest of the tagged team members:

Concerns:

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@digama0
Copy link

digama0 commented Nov 4, 2023

I have a concern about the case

ptr1.offset_from(ptr2) is always defined when ptr1.addr() == ptr2.addr().

Shouldn't this say that ptr1 and ptr2 have the same provenance, or at least are from the same allocation? Because I would assume the compiler wants to deduce that from this comparison, and IIUC this is in the documentation.

@digama0
Copy link

digama0 commented Nov 4, 2023

not sure if I know how to do this:

@rfcbot concern offset_from-different-allocations

@rfcbot
Copy link
Collaborator

rfcbot commented Nov 4, 2023

🔔 This is now entering its final comment period, as per the review above. 🔔

@chorman0773
Copy link
Contributor

Nice race condition.

@RalfJung
Copy link
Member Author

RalfJung commented Nov 5, 2023

Shouldn't this say that ptr1 and ptr2 have the same provenance, or at least are from the same allocation? Because I would assume the compiler wants to deduce that from this comparison, and IIUC this is in the documentation.

That would be equivalent to not having any special case for when the two pointers are the same. I thought if we have a special case for offset 0 then we would want a matching special case for offset_from -- but maybe that's not a good idea?

The restriction arises because:

  • It is needed to make the operation const. But in const, two pointers from different allocations can never have the same address, so this new case changes nothing.
  • (Very tentatively) it was shown in this paper that a ptrdiff operation with suitable restrictions does not "leak" the base address of a pointer, which could be helpful for analyses. I am honestly unsure if this is preserved under my proposed change, so maybe you are right and we should keep the offset_from spec unchanged...

Can we use "ptr1.offset(n) is well-defined with result ptr2" if and only if "ptr2.offset_from(ptr1) is well-defined with result n" as the defining property? I originally thought no, but I just realized this is a lot more subtle. It depends on our notion of equality on pointers. If it's "full equality including having the same provenance", then that property is already busted, since we allow ptr2.offset_from(ptr1) when both are in the same allocation even if they have different aliasing tags. So we probably want to consider "pointers with their provenance reduced to just the allocation", or some such thing? I honestly don't remember what I was thinking... But if we go with that, then what would the spec be that achieves that property? I think it has to be "they must be to the same allocation, and either at the same offset or both in-bounds". That's a weird spec though? Or maybe not, for the "no provenance" case that's already what we do! What do you all think?

But I tend to agree, we want to keep the "same-allocation" restriction.

@digama0
Copy link

digama0 commented Nov 5, 2023

Is it legal to use offset_from when the pointers have the null provenance? My initial reading was that your argument was that assuming this is the case, provenance monotonicity forces us to allow offset_from for pointers with arbitrary different provenance. But maybe we can just say that offset_from on null provenance is always UB, even when the pointers have the same value.

@RalfJung
Copy link
Member Author

RalfJung commented Nov 5, 2023

Is it legal to use offset_from when the pointers have the null provenance?

What is the null provenance? If you mean "no provenance", then the answer is "yes but only if both pointers have no provenance, and they point to the same address". I think we should keep that. Basically if we allow a pointer to be deref'd for n times the pointee size (for a non-zero-sized pointee) then ptr.offset(n).offset_from(ptr) must always return n (and never be UB), IMO. For n = 0, this implies we must allow ptr::null::<u8>().offset_from(ptr::null::<u8>()) and have it return 0.

@RalfJung
Copy link
Member Author

RalfJung commented Nov 5, 2023

I have updated the proposal to say that

  • ptr1.offset_from(ptr2) has the following requirements when ptr1.addr() == ptr2.addr():
    • either both pointers have no provenance (the provenance used by ptr::invalid and ptr::null)
    • or both pointers are derived from a pointer to the same allocation (notably, there is no inbounds requirement in this case!)

The offset_from spec remains unchanged compared to today when ptr1.addr() != ptr2.addr().

I have also removed the references to that offset/offset_from property, I don't think it holds with more complicated provenance anyway so I don't really know what I was thinking there...

@Nadrieril
Copy link
Member

Nadrieril commented Nov 5, 2023

Doesn't this break provenance monotonicity? Is "the provenance used by ptr::null" not the bottom of the provenance preorder?

@RalfJung
Copy link
Member Author

RalfJung commented Nov 5, 2023

Oh, right... we have to allow the case where one side has no provenance and the other side has some alloc's provenance.

@RalfJung
Copy link
Member Author

RalfJung commented Nov 5, 2023

Ah no I've finally re-traced the steps that led me to the original spec. We need to allow arbitrary provenance mismatches. If we start out with both pointers having no provenance, then provenance monotonicity says that we can add provenance A to one pointer and provenance B to the other pointer and the call must still be allowed.

@digama0
Copy link

digama0 commented Nov 5, 2023

So I take it you agree with my previous message? I think I just gave that argument. But AFAIK we don't need to allow offset_from when both values have no provenance, and without that the monotonicity rationale breaks down.

EDIT: Actually I see you addressed this in #472 (comment) . In which case I'm satisfied with the rationale.

@RalfJung
Copy link
Member Author

RalfJung commented Nov 5, 2023

I think I just gave that argument

Yeah sorry, I didn't understand your message properly initially.

But AFAIK we don't need to allow offset_from when both values have no provenance, and without that the monotonicity rationale breaks down.

I think we at least "need to" do that in terms of backwards compatibility. Miri and const-eval currently allow it, and we generally allow zero-sized accesses of non-null pointers without provenance arguing that "it's as if there was a zero-sized allocation there". It follows that offset_from should also work, since things are in-bounds of that zero-sized allocation.

But also beyond that it seems pretty necessary. If you want to represent an iterator as a begin and end pointer, and not allocate memory for when there are no elements, you'll use NonNull::dangling as the begin and end pointer. But we'll still want to allow using offset_from to compute how many elements are left in the iterator.

@digama0
Copy link

digama0 commented Nov 5, 2023

@rfcbot resolved offset_from-different-allocations

@rfcbot
Copy link
Collaborator

rfcbot commented Nov 5, 2023

🔔 This is now entering its final comment period, as per the review above. 🔔

@matklad
Copy link
Member

matklad commented Nov 10, 2023

Zero-sized accesses (reads and writes) are always allowed with every pointer.

An interesting thing I’ve recently learned: WebAssembly does the opposite, it traps on zero-sized out of bound accesses:

https://webassembly.github.io/spec/core/exec/instructions.html#xref-syntax-instructions-syntax-instr-memory-mathsf-memory-copy

WebAssembly/design#1482

@RalfJung
Copy link
Member Author

Awesome, thanks all!
I opened a tracking issue for the implementation of this: rust-lang/rust#117945.

@davidben
Copy link

davidben commented Feb 1, 2024

(Mostly for my own curiosity.)

Regarding the offset_from bits, wouldn't the new offset_from preconditions mean the compiler can no longer transform end2 = begin.offset(end.offset_from(begin)) into end2 = end? end2 will have begin's provenance and, a priori, the compiler may not know that end and begin share provenance. With the tighter precondition, offset_from is enough to infer this. With the new precondition, the compiler needs to know that end.offset_from(begin) != 0, which may not be generally true, otherwise we might have hit the same address case.

Not sure how important that is transform is in practice, but it isn't completely random: this is converting back and forth between the pointer pair and pointer + length representations of a slice. Or, in other words, what happens if you call iter.as_slice().iter() on a std::slice::Iter. If iter's creation was too far away, the compiler may not know end and begin were related.

That precondition and provenance monotonicity seem to just be in conflict though, if I'm understanding that property correctly. :-/ If you're always allowed to add provenance to a pointer, and the "add provenance" operation is allowed to add arbitrary provenances inconsistent with the pointer address, you might start with two danglings, add different provenance to each, and then attempt to offset_from.

@RalfJung
Copy link
Member Author

RalfJung commented Feb 3, 2024

@davidben yes that sounds about right. And yes it is unfortunate but I also don't see an alternative. We could say that offset_from on two pointers without provenance is just forbidden, i.e., they must have the same provenance and they both must have provenance. But that excludes legitimate uses involving empty slices.

This seems like a case where the "special zero-sized provenance" alternative actually has an advantage.

That said, you can still optimize this to end2 = begin.with_addr(end.addr()), which removes the arithmetic (assuming with_addr turns into some suitable primitive intrinsic).

@RalfJung
Copy link
Member Author

RalfJung commented Feb 3, 2024

I guess there is a point to be made for specifying the intrinsic behind offset_from the way we document above (so we have provenance monotonicity in the op.sem), but then in the public library function that exposes the intrinsic, add some extra library UB for when the two pointers have different provenance. That would let us change the implementation later if we found a better way.

However, what should Miri do in that case? Usually we only implement language UB. Given that it's hard to reason about provenance, that would make this problem quite easy to miss. To represent this properly we'd need some sort of intrinsic that the library function can call to inform Miri that it'd like some extra checks on the provenance...

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 21, 2024
rename ptr::invalid -> ptr::without_provenance

It has long bothered me that `ptr::invalid` returns a pointer that is actually valid for zero-sized memory accesses. In general, it doesn't even make sense to ask "is this pointer valid", you have to ask "is this pointer valid for a given memory access". We could say that a pointer is invalid if it is not valid for *any* memory access, but [the way this FCP is going](rust-lang/unsafe-code-guidelines#472), it looks like *all* pointers will be valid for zero-sized memory accesses.

Two possible alternative names emerged as people's favorites:
1. Something involving `dangling`, in analogy to `NonNull::dangling`. To avoid inconsistency with the `NonNull` method, the address-taking method could be called `dangling_at(addr: usize) -> *const T`.
2. `without_provenance`, to be symmetric with the inverse operation `ptr.addr_without_provenance()` (currently still called `ptr.addr()` but probably going to be renamed)

I have no idea which one of these is better. I read [this comment](rust-lang#117658 (comment)) as expressing a slight preference for something like the second option, so I went for that. I'm happy to go with `dangling_at` as well.

Cc `@rust-lang/opsem`
bors added a commit to rust-lang-ci/rust that referenced this issue Feb 21, 2024
rename ptr::invalid -> ptr::without_provenance

It has long bothered me that `ptr::invalid` returns a pointer that is actually valid for zero-sized memory accesses. In general, it doesn't even make sense to ask "is this pointer valid", you have to ask "is this pointer valid for a given memory access". We could say that a pointer is invalid if it is not valid for *any* memory access, but [the way this FCP is going](rust-lang/unsafe-code-guidelines#472), it looks like *all* pointers will be valid for zero-sized memory accesses.

Two possible alternative names emerged as people's favorites:
1. Something involving `dangling`, in analogy to `NonNull::dangling`. To avoid inconsistency with the `NonNull` method, the address-taking method could be called `dangling_at(addr: usize) -> *const T`.
2. `without_provenance`, to be symmetric with the inverse operation `ptr.addr_without_provenance()` (currently still called `ptr.addr()` but probably going to be renamed)

I have no idea which one of these is better. I read [this comment](rust-lang#117658 (comment)) as expressing a slight preference for something like the second option, so I went for that. I'm happy to go with `dangling_at` as well.

Cc `@rust-lang/opsem`
github-actions bot pushed a commit to rust-lang/miri that referenced this issue Feb 22, 2024
rename ptr::invalid -> ptr::without_provenance

It has long bothered me that `ptr::invalid` returns a pointer that is actually valid for zero-sized memory accesses. In general, it doesn't even make sense to ask "is this pointer valid", you have to ask "is this pointer valid for a given memory access". We could say that a pointer is invalid if it is not valid for *any* memory access, but [the way this FCP is going](rust-lang/unsafe-code-guidelines#472), it looks like *all* pointers will be valid for zero-sized memory accesses.

Two possible alternative names emerged as people's favorites:
1. Something involving `dangling`, in analogy to `NonNull::dangling`. To avoid inconsistency with the `NonNull` method, the address-taking method could be called `dangling_at(addr: usize) -> *const T`.
2. `without_provenance`, to be symmetric with the inverse operation `ptr.addr_without_provenance()` (currently still called `ptr.addr()` but probably going to be renamed)

I have no idea which one of these is better. I read [this comment](rust-lang/rust#117658 (comment)) as expressing a slight preference for something like the second option, so I went for that. I'm happy to go with `dangling_at` as well.

Cc `@rust-lang/opsem`
bors added a commit to rust-lang-ci/rust that referenced this issue May 22, 2024
…cottmcm

offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang#65108
[Tracking issue](rust-lang#117945) | [T-lang summary](rust-lang#117329 (comment))

Cc `@nikic`
bors added a commit to rust-lang-ci/rust that referenced this issue May 22, 2024
…cottmcm

offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang#65108
[Tracking issue](rust-lang#117945) | [T-lang summary](rust-lang#117329 (comment))

Cc `@nikic`
github-actions bot pushed a commit to rust-lang/miri that referenced this issue May 23, 2024
offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang/rust#65108
[Tracking issue](rust-lang/rust#117945) | [T-lang summary](rust-lang/rust#117329 (comment))

Cc `@nikic`
flip1995 pushed a commit to flip1995/rust-clippy that referenced this issue May 24, 2024
offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang/rust#65108
[Tracking issue](rust-lang/rust#117945) | [T-lang summary](rust-lang/rust#117329 (comment))

Cc `@nikic`
bors added a commit to rust-lang/rust-analyzer that referenced this issue Jun 20, 2024
offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang/rust#65108
[Tracking issue](rust-lang/rust#117945) | [T-lang summary](rust-lang/rust#117329 (comment))

Cc `@nikic`
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 15, 2024
…oli-obk

offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang#124921 (comment)).
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 15, 2024
…oli-obk

offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang#124921 (comment)).
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 15, 2024
…oli-obk

offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang#124921 (comment)).
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jul 15, 2024
Rollup merge of rust-lang#124921 - RalfJung:offset-from-same-addr, r=oli-obk

offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang#124921 (comment)).
github-actions bot pushed a commit to rust-lang/miri that referenced this issue Jul 16, 2024
offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang/rust#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang/rust#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang/rust#124921 (comment)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants