-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP documents for memory interface and value domain #175
Merged
Merged
Changes from all commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
d0c9729
add basic memory interface
RalfJung 499fa9e
begin with value domain
RalfJung 648ec02
ptr-to-byte
RalfJung ca891ac
more editing
RalfJung 168249a
tweaks
RalfJung 53b899e
address many review comments
RalfJung 91648b1
just assume a memory model is in scope
RalfJung 3785d4c
typo
RalfJung 17611a1
add NonNullU8
RalfJung 334c8b6
linebreaks
RalfJung d871de6
NonNullU8 is actually called NonZeroU8
RalfJung d63559b
cite cerberus
RalfJung File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,156 @@ | ||
# Rust Memory Interface | ||
|
||
**Note:** This document is not normative nor endorsed by the UCG WG. Its purpose is to be the basis for discussion and to set down some key terminology. | ||
|
||
The purpose of this document is to describe the interface between a Rust program and memory. | ||
This interface is a key part of the Rust Abstract Machine: it lets us separate concerns by splitting the Machine (i.e., its specification) into two pieces, connected by this well-defined interface: | ||
* The *expression/statement semantics* of Rust boils down to explaining which "memory events" (calls to the memory interface) happen in which order - expressed as calls to the methods of this interface, and reactions to its return values. | ||
This part of the specification is *pure* in the sense that it has no "state": everything that needs to be remembered from one expression evaluation to the next is communicated through memory. | ||
* The Rust *memory model* explains which interactions with the memory are legal (the others are UB), and which values can be returned by reads. | ||
A memory model is defined by implementing the memory interface. | ||
|
||
The interface shown below is also opinionated in several ways. | ||
It is not intended to be able to support *any imaginable* memory model, but rather start the process of reducing the design space of what we consider a "reasonable" memory model for Rust. | ||
For example, it explicitly acknowledges that pointers are not just integers and that uninitialized memory is special (both are true for C and C++ as well but you have to read the standard very careful, and consult [defect report responses](http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm), to see this). | ||
Another key property of the interface presented below is that it is *untyped*. | ||
This implies that in Rust, *operations are typed, but memory is not* - a key difference to C and C++ with their type-based strict aliasing rules. | ||
At the same time, the memory model provides a *side-effect free* way to turn pointers into "raw bytes", which is *not* [the direction C++ is moving towards](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf), and we might have to revisit this choice later if it turns out to not be workable. | ||
|
||
## Pointers | ||
|
||
One key question a memory model has to answer is *what is a pointer*. | ||
It might seem like the answer is just "an integer of appropriate size", but [that is not the case][pointers-complicated]. | ||
This becomes even more prominent with aliasing models such as [Stacked Borrows]. | ||
So we leave it up to the memory model to answer this question, and make `Pointer` an associated type. | ||
Practically speaking, `Pointer` will be some representation of an "address", plus [provenance] information. | ||
|
||
[provenance]: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/reference/src/glossary.md#pointer-provenance | ||
|
||
## Bytes | ||
|
||
The unit of communication between the memory model and the rest of the program is a *byte*. | ||
Again, the question of "what is a byte" is not as trivial as it might seem; beyond `u8` values we have to represent `Pointer`s and [uninitialized memory][uninit]. | ||
We define the `Byte` type as follows, where `Pointer` will later be instantiated with the `Memory::Pointer` associated type. | ||
|
||
```rust | ||
enum Byte<Pointer> { | ||
/// The "normal" case: a (frozen, initialized) integer in `0..256`. | ||
Raw(u8), | ||
/// An uninitialized byte. | ||
Uninit, | ||
/// One byte of a pointer. | ||
PtrFragment { | ||
/// The pointer of which this is a byte. | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/// That is, the byte is a fragment of this pointer. | ||
ptr: Pointer, | ||
/// Which byte of the pointer this is. | ||
/// `idx` will always be in `0..PTR_SIZE`. | ||
idx: u8, | ||
} | ||
} | ||
``` | ||
|
||
The purpose of `PtrFragment` is to enable a byte-wise representation of a `Pointer`. | ||
On a 32-bit system, the sequence of 4 bytes representing `ptr: Pointer` is: | ||
``` | ||
[ | ||
PtrFragment { ptr, idx: 0 }, | ||
PtrFragment { ptr, idx: 1 }, | ||
PtrFragment { ptr, idx: 2 }, | ||
PtrFragment { ptr, idx: 3 }, | ||
] | ||
``` | ||
|
||
Based on the `PtrToInt` trait (see below), we can turn every initialized `Byte` into an integer in `0..256`: | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
```rust | ||
impl<Pointer: PtrToInt> Byte<Pointer> { | ||
fn as_int(self) -> Option<u8> { | ||
match self { | ||
Byte::Raw(int) => Some(int), | ||
Byte::Uninit => None, | ||
Byte::PtrFragment { ptr, idx } => | ||
Some(ptr.get_byte(idx)), | ||
} | ||
} | ||
} | ||
``` | ||
|
||
## Memory interface | ||
|
||
The Rust memory interface is described by the following (not-yet-complete) trait definition: | ||
|
||
```rust | ||
/// All operations are fallible, so they return `Result`. If they fail, that | ||
/// means the program caused UB. What exactly the `UndefinedBehavior` type is | ||
/// does not matter here. | ||
type Result<T=()> = std::result::Result<T, UndefinedBehavior>; | ||
|
||
/// *Note*: All memory operations can be non-deterministic, which means that | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/// executing the same operation on the same memory can have different results. | ||
/// We also let all operations potentially mutate memory. For example, reads | ||
/// actually do change the current state when considering concurrency or | ||
/// Stacked Borrows. | ||
/// This is pseudo-Rust, so we just use fully owned types everywhere for | ||
/// symmetry and simplicity. | ||
trait Memory { | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/// The type of pointer values. | ||
type Pointer: Copy + PtrToInt; | ||
|
||
/// The size of pointer values. | ||
const PTR_SIZE: u64; | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
/// Create a new allocation. | ||
fn allocate(&mut self, size: u64, align: u64) -> Result<Self::Pointer>; | ||
|
||
/// Remove an allocation. | ||
fn deallocate(&mut self, ptr: Self::Pointer, size: u64, align: u64) -> Result; | ||
|
||
/// Write some bytes to memory. | ||
fn write(&mut self, ptr: Self::Pointer, bytes: Vec<Byte<Self::Pointer>>) -> Result; | ||
|
||
/// Read some bytes from memory. | ||
fn read(&mut self, ptr: Self::Pointer, len: u64) -> Result<Vec<Byte<Self::Pointer>>>; | ||
|
||
/// Offset the given pointer. | ||
fn offset(&mut self, ptr: Self::Pointer, offset: u64, mode: OffsetMode) | ||
-> Result<Self::Pointer>; | ||
|
||
/// Cast the given integer to a pointer. (The other direction is handled by `PtrToInt` below.) | ||
fn int_to_ptr(&mut self, int: u64) -> Result<Self::Pointer>; | ||
} | ||
|
||
/// The `Pointer` type must know how to extract its bytes, *without any access to the `Memory`*. | ||
trait PtrToInt { | ||
/// Get the `idx`-th byte of the pointer. `idx` must be in `0..PTR_SIZE`. | ||
fn get_byte(self, idx: u8) -> u8; | ||
} | ||
|
||
/// The rules applying to this pointer offset operation. | ||
enum OffsetMode { | ||
/// Wrapping offset; never UB. | ||
Wrapping, | ||
/// Non-wrapping offset; UB if it wraps. | ||
NonWrapping, | ||
/// In-bounds offset; UB if it wraps or if old and new pointer are not both | ||
/// in-bounds of the same allocation (details are specified by the memory | ||
/// model). | ||
Inbounds, | ||
} | ||
``` | ||
|
||
We will generally assume we have a particular memory model in scope, and freely refer to its `PTR_SIZE` and `Pointer` items. | ||
We will also write `Byte` for `Byte<Pointer>`. | ||
|
||
This is a very basic memory interface that is incomplete in at least the following ways: | ||
|
||
* To implement rules like "dereferencing a null, unaligned, or dangling raw pointer is UB" (even if no memory access happens), there needs to be a way to do an "alignment, bounds and null-check". | ||
* There needs to be some way to do alignment checks -- either using the above operation, or by adding `align` parameters to `read` and `write`. | ||
* To represent concurrency, many operations need to take a "thread ID" and `read` and `write` need to take an [`Ordering`]. | ||
* To represent [Stacked Borrows], there needs to be a "retag" operation, and that one will in fact be "lightly typed" (it cares about `UnsafeCell`). | ||
* Maybe we want operations that can compare pointers without casting them to integers. | ||
|
||
[pointers-complicated]: https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html | ||
[uninit]: https://www.ralfj.de/blog/2019/07/14/uninit.html | ||
[`Ordering`]: https://doc.rust-lang.org/nightly/core/sync/atomic/enum.Ordering.html | ||
[Stacked Borrows]: stacked-borrows.md |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,137 @@ | ||
# Rust Value Domain | ||
|
||
**Note:** This document is not normative nor endorsed by the UCG WG. Its purpose is to be the basis for discussion and to set down some key terminology. | ||
|
||
The purpose of this document is to describe what the set of *all possible values* is in Rust. | ||
This is an important definition: one key element of a Rust specification will be to define the [representation relation][representation] of every type. | ||
This relation relates values with lists of bytes: it says, for a given value and list of bytes, if that value is represented by that list. | ||
However, before we can even start specifying the relation, we have to specify the involved domains. | ||
`Byte` is defined as part of [the memory interface][memory-interface]; this document is about defining `Value`. | ||
|
||
[representation]: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/reference/src/glossary.md#representation | ||
[memory-interface]: memory-interface.md | ||
|
||
## Values | ||
|
||
The Rust value domain is described by the following (incomplete) type definition (using the `Pointer` type defined by [the memory interface][memory-interface]): | ||
|
||
```rust | ||
enum Value { | ||
/// A mathematical integer, used for `i*`/`u*` types. | ||
Int(BigInt), | ||
/// A Boolean value, used for `bool`. | ||
Bool(bool), | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/// A pointer value, used for (thin) references and raw pointers. | ||
Ptr(Pointer), | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/// An uninitialized value. | ||
Uninit, | ||
/// An n-tuple, used for arrays, structs, tuples, SIMD vectors. | ||
Tuple(Vec<Self>), | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/// A variant of a sum type, used for enums. | ||
Variant { | ||
idx: BigInt, | ||
data: Box<Self>, | ||
}, | ||
/// A "bag of raw bytes", used for unions. | ||
RawBag(Vec<Byte>), | ||
/* ... */ | ||
} | ||
``` | ||
|
||
The point of this type is to capture the mathematical concepts that are represented by the data we store in memory. | ||
The definition is likely incomplete, and even if it was complete now, we might expand it as Rust grows. | ||
That is okay; all previously defined representation relations are still well-defined when the domain grows, the newly added values will just not be valid for old types as one would expect. | ||
|
||
## Example value relations | ||
|
||
We show some examples for how one might want to use this `Value` domain to define the value relation for a type. | ||
|
||
### `bool` | ||
|
||
The value relation for `bool` relates `Bool(b)` with `[r]` if and only if `r.as_int() == Some(if b { 1 } else { 0 })`. | ||
(`as_int` is defined in [the memory interface][memory-interface].) | ||
|
||
### `()` | ||
|
||
The value relation for the `()` type relates the empty tuple `Tuple([])` (assuming we can use array notation to "match" on `Vec`) with the empty byte list `[]`, and that's it. | ||
|
||
### `!` | ||
|
||
The value relation for the `!` type is empty: nothing is related to anything at this type. | ||
|
||
### `#[repr(C)] struct Pair<T, U>(T, U)` | ||
|
||
The value relation for `Pair` is based on the value relations for `T` and `U`. | ||
A value `Tuple([t, u])` is represented by a list of bytes `rt ++ pad1 ++ ru ++ pad2` (using `++` for list concatenation) if: | ||
|
||
* `t` is represented by `rt` at type `T`. | ||
* `u` is represented by `ru` at type `U`. | ||
* The length of `rt ++ pad1` is equal to the offset of the `U` field. | ||
* The length of the entire list `rt ++ pad1 ++ ru ++ pad2` is equal to the size of `Pair<T, U>`. | ||
|
||
This relation specifies that values of type `Pair` are always 2-tuples (aka, pairs). | ||
It also says that the actual content of the padding bytes is entirely irrelevant, we only care to have the right number of them to "pad" `ru` to the right place and to "pad" the entire list to have the right length. | ||
So, for example when considering `Pair<u8, u16>`, the value `Tuple[42, 119]` is represented on a little-endian target by `[Raw(42), byte, Raw(119), Raw(0)]` for *any* `byte: Byte`. | ||
RalfJung marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
### `&T`/`&mut T` | ||
|
||
Reference types are tricky. | ||
But a possible value relation for sized `T` is: | ||
A value `Ptr(ptr)` is related to `[PtrFragment { ptr, idx: 0 }, ..., PtrFragment { ptr, idx: PTR_SIZE-1 }]` if `ptr` is non-NULL and appropriately aligned (defining alignment is left open for now). | ||
|
||
### `u8` | ||
|
||
For the value representation of integer types, there are two different reasonable choices. | ||
Certainly, a value `Int(i)` where `i` in `0..256` is related to `[b]` if `b.as_int() == Some(i)`. | ||
|
||
And then, maybe, we also want to additionally say that value `Uninit` is related to byte list `[Uninit]`. | ||
This essentially corresponds to saying that uninitialized memory is a valid representation of a `u8` value (namely, the uninitialized value). | ||
|
||
### `NonZeroU8` | ||
|
||
`NonZeroU8` is basically the same as `u8` except that we remove the 0, and certainly do not consider `Uninit` a valid value. | ||
A value `Int(i)` where `i` in `1..256` is related to `[b]` if `b.as_int() == Some(i)`, and that's it. | ||
|
||
### `union` | ||
|
||
The `union` type does not even try to interpret memory, so for a `union` of size `n`, the value relation says that for any byte list `bytes` of that length, `RawBag(bytes)` is related to `bytes`. | ||
(Note however that [this definition might not be implementable](https://github.com/rust-lang/unsafe-code-guidelines/issues/156).) | ||
|
||
## The role of the value representation in the operational semantics | ||
|
||
One key use of the value representation is to define a "typed" interface to memory: | ||
|
||
```rust | ||
trait TypedMemory: Memory { | ||
/// Write a value of the given type to memory. | ||
fn typed_write(&mut self, ptr: Self::Pointer, val: Value, ty: Type) -> Result; | ||
|
||
/// Read a value of the given type. | ||
fn typed_read(&mut self, ptr: Self::Pointer, ty: Type) -> Result<Value>; | ||
} | ||
``` | ||
|
||
Here, `Type` is some representation of the Rust type system (akin to [`Ty`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc/ty/type.Ty.html) in the compiler). | ||
This interface is inspired by [Cerberus](https://www.cl.cam.ac.uk/~pes20/cerberus/). | ||
|
||
We can implement `TypedMemory` for any `Memory` as follows: | ||
* For `typed_write`, pick any representation of `val` for `ty`, and call `Memory::write`. If no representation exists, we have UB. | ||
* For `typed_read`, read `ty.size()` many bytes from memory, and then determine which value this list of bytes represents. If it does not represent any value, we have UB. | ||
|
||
In particular, this defines the semantics of a "typed copy": when an assignment is executed in Rust (at some type `T`), or when an argument of type `T` is passed to a function or a return value of type `T` returned from a function, this is called a "typed copy". | ||
Its semantics is basically to do a `typed_read` at `T` followed by a `typed_write` at `T`. | ||
In particular, this means doing a "typed copy" at `T` of some memory that has no valid representation at `T` is undefined behavior! | ||
This also means that for types that have padding, the "typed copy" does not preserve the padding bytes. | ||
|
||
## Relation to validity invariant | ||
|
||
One way we *could* also use the value representation (and the author thinks this is exceedingly elegant) is to define the validity invariant. | ||
Certainly, it is the case that if a list of bytes is not related to any value for a given type `T`, then that list of bytes is *invalid* for `T` and it should be UB to produce such a list of bytes at type `T`. | ||
We could decide that this is an "if and only if", i.e., that the validity invariant for a type is exactly "must be in the value representation". | ||
For many types this is likely what we will do anyway (e.g., for `bool` and `!` and `()` and integers), but for references, this choice would mean that *validity of the reference cannot depend on what memory looks like*---so "dereferencable" and "points to valid data" cannot be part of the validity invariant for references. | ||
The reason this is so elegant is that, as we have seen above, a "typed copy" already very naturally is UB when the memory that is copied is not a valid representation of `T`. | ||
This means we do not even need a special clause in our specification for the validity invariant---in fact, the term does not even have to appear in the specification---as everything juts falls out of how a "typed copy" applies the value representation twice. | ||
|
||
Justifying the `dereferencable` LLVM attribute is, in this case, left to the aliasing model (e.g. [Stacked Borrows]), just like the `noalias` attribute. | ||
|
||
[Stacked Borrows]: stacked-borrows.md |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe
Byte
is not a good name for this, as it can be viewed as re-defining the term. I am basically starting from "a byte is what gets stored at a memory address", but other people might think "a byte is 8 bits" or "a byte is an integer in0..256
", and then this type here makes no sense. It turns out that matching all these definitions at once is just not possible, though.