Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guarantee slice representation #3775

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cramertj
Copy link
Member

@cramertj cramertj commented Feb 19, 2025

This RFC guarantees the representation of &str and &[T] to be equivalent to a repr(C) struct containing a pointer and a length.

Rendered

@cramertj cramertj force-pushed the slice-repr branch 2 times, most recently from fb67199 to 73ad005 Compare February 19, 2025 01:17
Copy link
Contributor

@zachs18 zachs18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably clearly distinguish between "slices" and "pointers/references to slices", as they are different types (and the layout of slices themselves is already defined (edit: fixed link)).

Comment on lines 61 to 70
The validity requirements for the in-memory slice representation are the same
as [those documented on `std::slice::from_raw_parts`](https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html).
Namely:

* `data` must be non-null, valid for reads for `len * mem::size_of::<T>()` many bytes,
and it must be properly aligned. This means in particular:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The validity requirements for the in-memory slice representation are the same
as [those documented on `std::slice::from_raw_parts`](https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html).
Namely:
* `data` must be non-null, valid for reads for `len * mem::size_of::<T>()` many bytes,
and it must be properly aligned. This means in particular:
The validity requirements for the in-memory representation of raw slice pointers are simply the validity requirements of the fields (i.e. initialized).
The validity requirements for the in-memory representation of slice references are the same
as [those documented on `std::slice::from_raw_parts`](https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html) for shared slice references, and [those documented on `std::slice::from_raw_parts_mut`](https://doc.rust-lang.org/std/slice/fn.from_raw_parts_mut.html) for unique slice references.
Namely:
* `data` must be non-null, valid for reads (shared reference) or writes (mutable reference) for `len * mem::size_of::<T>()` many bytes,
and it must be properly aligned. This means in particular:

Raw pointers to slices don't need to necessarily point to anything, and shared and unique references have different (but analogous) reqirements.


Also: We might not want to say "validity requirements" here, as the "validity requirements" of references, and the concept of "validity requirements" in general, are not necessarily settled: rust-lang/unsafe-code-guidelines#412 rust-lang/unsafe-code-guidelines#539

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might not want to say "validity requirements" here

Yes, "validity requirements" is an overloaded term. I'd be happy to change it here if you have suggestions for a better one!

}
```

The precise ABI of slices is not guaranteed, so `&[T]` may not be passed by-value
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wny not go ahead and define this ABI as well?

Copy link
Member

@workingjubilee workingjubilee Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our ABI for this type is not exactly the ABI of the equivalent C struct.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another way to encode a C type that has an equivalent ABI to whatever Rust does right now? Or otherwise define the ABI in a way that other languages could somehow comply.

And then promise Rust will not ever change that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular, its return ABI falls outside what can be obtained via C code on some platforms.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, thanks for explaining. That's.. really unfortunate.

One could still rigorously define the ABI to be what rustc does right now, and let callers in other languages use per-architecture inline assembly if they want to (this should be possible right?). But that's a much harder sell, and it's reasonable to leave this out of this RFC

#[repr(C)]
struct Slice<T> {
data: *const T,
len: usize,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this specific order? Why not length before pointer? Are there any plausible reasons to prefer one over the other, e.g., based on target architecture?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that our current layout algorithm is best able to exploit any niche in len if len is first in this ordering. And we currently cannot use a "double niche" on two different fields very effectively. I can work on improving this, but I cannot promise a specific result yet.

Due to various quirks of our existing APIs, the len niche would not apply to

  1. zero-sized types
  2. the "raw" version of this: *mut [T] and *const [T]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@workingjubile Why would len have a niche? I would expect data to be the field with a niche, as it is non-null for &[T].

Copy link

@hanna-kruppe hanna-kruppe Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The total size len * mem::size_of::<T>() of the slice must be no larger than isize::MAX, and adding that size to data must not “wrap around” the address space. See the safety documentation of pointer::offset.

https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, clever! Thanks for describing that-- I'll mention it in the RFC.

Copy link
Member

@scottmcm scottmcm Feb 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cramertj It's really important that the length have a niche in &[T], as has been FCPed by opsem in rust-lang/unsafe-code-guidelines#510, because that allows us to add range metadata to the lengths in every function that takes a &[T], and correspondingly lets LLVM know that, for example, (i + j)/2 on in-bounds indexes can't overflow (well, for non-ZST Ts, which TBH are the only ones that matter for loop-over-slice optimizations).

Sadly *const [T] doesn't have the niche, though, because https://doc.rust-lang.org/std/ptr/fn.slice_from_raw_parts.html was stabilized before this was realized and because casting *const [()] to *const [i32] works and doesn't change the metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, also, we want to ensure to leave space for size-and-alignment based niches in references too.

I guess for a slice there's always the "zero size" case, so that simplifies to just alignment niches, but that still means (-align, align) are all impossible. (Which simplifies to just "it's not null" for something with align-1.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as has been FCPed by opsem in rust-lang/unsafe-code-guidelines#510

Theoretically, we could have been even stricter, additionally requiring that ptr + length not overflow the address space.

@cramertj cramertj force-pushed the slice-repr branch 2 times, most recently from 5ec195d to 09a64ea Compare February 19, 2025 17:28
standard library's ability to change and take advantage of new optimization
opportunities.

# Rationale and alternatives
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crABI provides a way to pass slices to/from C without locking in our representation #3470, that is probably worth mentioning somewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a related way, maybe we could do something like guarantee the layout only in repr(C) types or in extern "C" function signatures? Which retains the flexibility to change ordering in most cases.

Copy link
Member

@workingjubilee workingjubilee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that February 2025 is the right time to stabilize this layout, but I do think it is the right time to start discussing how to do so, why we should do so, and removing any blockers to it. Even if we didn't stabilize this layout, we could provide more ways to statically resolve the layout details so that code could interact with it. It's somewhat embarrassing that ptr_metadata isn't stabilized, after all.

Nonetheless, I have a question: what code specifically do we want to enable with this choice? Is it Rust code or external over-FFI code? In the latter case, refusing to stabilize any parameter-passing or return-passing details would remain somewhat frustrating and makes this less of an improvement over the status quo.

Comment on lines +50 to +56
```rust
#[repr(C)]
struct Slice<T> {
data: *const T,
len: usize,
}
```
Copy link
Member

@workingjubilee workingjubilee Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the RFC does describe that the non-null requirements are upheld, because *const T where T: Sized lacks a "niche", the way this is written could be misread to preclude our "niches", which obviously cannot be the case because we promise niche-based transformations in certain cases. There are two major niches we may wish to be able to exploit:

  1. The well-known "null pointer" niche for data. This one is a promised transformation.
  2. A less well-known case is that if T is non-zero-sized, the size of the slice reference is bounded by isize::MAX, which means that we have a very large niche on the len as the top bit can never be set.

This RFC should probably at least mention the guaranteed niche transformations exist and their implications, just to make clear it is not contradicting or overruling them. A weaker version was described in RFC 3391 but we strengthened that decision to generalize to similarly-shaped enums, not just Result or Option, post-hoc. I documented them in this test, I don't know where to look it up in the reference: https://github.com/rust-lang/rust/blob/ed49386d3aa3a445a9889707fd405df01723eced/tests/ui/rfcs/rfc-3391-result-ffi-guarantees.rs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another niche that isn't currently exploited but may be in the future is alignment, e.g. the pointer in &[u32] is not only non-null but addresses 1, 2, and 3 are also invalid (assuming align_of::<u32>() == 4). This is a smaller niche than the length being at most isize::MAX / size_of::<T>(), but it also applies for ZSTs.

Copy link
Member

@workingjubilee workingjubilee Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, perhaps more generally we should be clear that "the representation is stable" still allows for transformations to happen "around it", particularly when it comes to ADT tag layout. This means that unsafe code that interacts with a blob of bytes that happens to be EnumType<&[T]> should be cautious. This was always the case but will only "get worse" over time.

Comment on lines +58 to +59
The precise ABI of slice references is not guaranteed, so `&[T]` may not be
passed by-value or returned by-value from an `extern "C" fn`.
Copy link
Member

@workingjubilee workingjubilee Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I don't think is well-described here, or at least I feel that based on some other comments it might be at risk of being glossed over when some people engage with this RFC, is that calling convention (AKA parameter and return passing) and in-memory layout are both very different things.

Often, "ABI" is used to describe both, because they are both technically part of the literal "application binary interface". Due to arguments and returns sometimes passing through the stack, and almost always when the number of them is large enough, in-memory layout becomes very relevant for almost every calling convention. But they're not the same.

In particular, every calling convention is a beautiful and unique snowflake. Thus, we could define the calling convention for &[T] specifically when passed over extern "C" and extern "C-unwind" without defining it in other cases, and especially without defining it for extern "Rust" (i.e. the "default" ABI).

BUT I would advise caution in promising compatibility in a very specific way like "just for extern "C"" like that, as people will misread, misinterpret, and fail to consider edge cases. We cannot solve such entirely, but we should be very careful about tentative almost-promises. It might only be appropriate to mention as a future possibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to amend the RFC to clarify this point! The sentence you highlighted seems like it covers this to me, but perhaps I need to make it bolder/brighter/a headline.

Do you have other suggestions for how I could make this more straightforward?

#[repr(C)]
struct Slice<T> {
data: *const T,
len: usize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that our current layout algorithm is best able to exploit any niche in len if len is first in this ordering. And we currently cannot use a "double niche" on two different fields very effectively. I can work on improving this, but I cannot promise a specific result yet.

Due to various quirks of our existing APIs, the len niche would not apply to

  1. zero-sized types
  2. the "raw" version of this: *mut [T] and *const [T]

Comment on lines +47 to +54
Slice references are represented with a pointer and length pair. Their in-memory
layout is the same as a `#[repr(C)]` struct like the following:

```rust
#[repr(C)]
struct Slice<T> {
data: *const T,
len: usize,
Copy link
Member

@workingjubilee workingjubilee Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wished to extend Rust to allow references to unsized types that encompass unsized types... for example, &[[T]], or more interestingly, &[dyn Trait]... then one of the likely choices we would make is to have "triple-wide" (or larger!) references, instead of this "double-wide" one. Thus we would only apply the corresponding layout this RFC describes (or any other such specified layout) in the case of where T: Sized.

I feel this "multiple metadata" possibility should be considered and either explicitly reserved as a future possibility or explicitly dismissed. I feel accepting this RFC as-is could be interpreted as foreclosing it, but there might forever be grumbling about how "it doesn't say...!" Thus if we don't want to do that, we should be clear, and if we're still open to that possibility, we should also be clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but IMO it seems difficult to imagine how &[[T]] would work-- even with triple-wide references, the items in the collection could be heterogeneous.

I suppose we could make it work if all of the elements were the same type, e.g. coercing &[String] into &[dyn Debug].

Personally, I'd be happy to restrict this RFC to specify that this only applies to &[T] where T: Sized. Would that resolve your concern?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could make the RFC compatible with &[[U]] by simply relaxing the bound to T: ?Sized:

&[[U]]
->
{ data: *const [U], len: usize }
->
{ data: { data: *const U, len: usize }, len: usize }

so I don't see how this definition of Slice<T> forecloses the multi-meta possibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but IMO it seems difficult to imagine how &[[T]] would work-- even with triple-wide references, the items in the collection could be heterogeneous.

You could either require homogeneity like you suggest, or you could have the metadata be its own pointer to a separate region of memory, sharing the lifetime of the data pointer. (I wrote a very experimental crate that does something like the latter: https://github.com/Jules-Bertholet/unsized-vec)

But since current Rust doesn’t support this, and nothing in this RFC would interfere with someday adopting either option, it’s not a concern IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could make the RFC compatible with &[[U]] by simply relaxing the bound to T: ?Sized

another valid extension is to just insert more fields for metadata between the pointer and length or change the length to be a struct-- these match how the pointer metadata APIs work more closely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another valid extension is to just insert more fields for metadata between the pointer and length or change the length to be a struct-- these match how the pointer metadata APIs work more closely.

changing the metadata (length) from usize to a struct would be rust-lang/libs-team#246


You could either require homogeneity like you suggest, or you could have the metadata be its own pointer to a separate region of memory, sharing the lifetime of the data pointer.

if we still want to have zero-cost unsizing the type &[[U]] must only be able to represent rectangular matrices:

// unsizing from &[i32; 5] to &[i32] is free
let y: &[i32] = &[1, 2, 3, 4, 5];

// unsizing a `&[[f64; 3]; 2]` to `&[[f64]]` should also not involve any allocation.
let x: &[[f64]] = &[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]];

Copy link
Member

@programmerjake programmerjake Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing the metadata (length) from usize to a struct would be rust-lang/libs-team#246

I don't really think so, that's just wrapping the metadata type in a struct whenever you're not using it as part of a pointer type. This is explicitly similar to mem::Discriminant which wraps an enum's discriminant in a struct, even though the discriminant might be a well-known type, e.g. u8 for a repr(u8) fieldless enum.

for [T] where T: Sized, the metadata type would still be usize; the metadata could be a struct or some other type when using custom extern types -- that kind of custom extern type doesn't exist in Rust yet.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both homogeneous &[dyn Trait] and rectangular &[[T]] sound useful, especially because the retangular case benefits from more optimizations.

We could consider unsizing options for more general types too like struct Foo<T>(T,T), so then a homogeneous Foo<dyn Trait>.

## Compatibility with C++ `std::span`

The largest drawback of this layout and set of validity requirements is that it
may preclude `&[T]` from being representationally equivalent to C++'s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep getting tripped up reading this section by thinking it says we're committing to something that diverges from C++. I would phrase it as something like

Because C++ does not guarantee a layout for std::span, stabilizing one for Rust now means C++ could choose to stabilize a different layout and prevent the two from ever being compatible. However, this is unlikely to ever happen because MSVC uses the same representation proposed in this RFC and has stringent stability requirements.

* Unlike Rust, `std::span` allows the `data` pointer to be `nullptr`. One
possibile workaround for this would be to guarantee that `Option<&[T]>` uses
`data: std::ptr::null(), len: 0` to represent the `None` case, making
`std::span<T>` equivalent to `Option<&[T]>` for non-zero-sized types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be worth a perf run to see if there's any measurable impact of doing this.

Comment on lines +185 to +187
Types like `*const ()` / `&()` are widely used to pass around pointers today.
We cannot make them zero-sized, and it would be surprising to make a
different choice for `&[()]`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should have warned against this and made everyone use *const u8 instead. Seems impossible to make a change like this now though.

Comment on lines +121 to +123
Either option may offer modest performance benefits for highly generic code
which happens to create empty slices of uninhabited types, but this is unlikely
to be worth the cost of maintaining a special case.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, we could opt not to guarantee this since no one seems to have much of a use case for this right now anyway.

@cramertj
Copy link
Member Author

cramertj commented Feb 19, 2025

@workingjubilee

Nonetheless, I have a question: what code specifically do we want to enable with this choice? Is it Rust code or external over-FFI code? In the latter case, refusing to stabilize any parameter-passing or return-passing details would remain somewhat frustrating and makes this less of an improvement over the status quo.

My use-case is FFI code (specifically, autogenerated C++ interop). Rust code can usually rely on calls to .as_ptr/len, so the guarantees in this RFC are much less useful in Rust. I don't particularly care whether the by-value ABI is stable as I'm happy to pass slice references by-pointer in parameter and return values.

@ehuss ehuss added the T-lang Relevant to the language team, which will review and decide on the RFC. label Feb 20, 2025
size and alignment of slices were not guaranteed. However, the Rust compiler
accepts the `repr(C)` declaration above without warning.

# Guide-level explanation
Copy link
Member

@scottmcm scottmcm Feb 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You appear to have deleted the reference-level explanation section from the template.

I would encourage you to split this back into two parts: the guide level informal explanation that talks about how it allows you to read things from C, but then also have a reference-level explanation of exactly what it's guaranteeing, without ever using the word "layout", because "layout" means too many different things to different people. (Some people think it means just std::alloc::Layout, some include field offsets, some include validity invariants, etc.)

I'm absolutely in favour of doing this RFC, see this old Zulip thread, so long as it's clearly scoped to the parts that really are uncontroversial.

Notably, I don't think that any description that include *const T is correct in a reference-level section if it applies to &[_] references, because it at least needs to be NonNull<T> -- but it also needs to be aligned, and a bunch of other things.

So I want to see a precise, positively-specified list of exactly what we're RFCing.

A quick stab at it:

  • we're only talking about {&,&mut,*const,*mut} [T] with T: Sized
  • we're only talking about platforms where sizeof(*const impl Thin) == sizeof(usize).
  • the size is 2 * sizeof(usize).
  • we're only talking about platforms where alignof(*const impl Thin) == alignof(usize).
  • the align is alignof(usize).
  • the pointer component is at offset 0, which points to the first element
  • the length component is at offset sizeof(usize), and contains the length in units of elements (not of bytes)

I don't know if we stop there, but if that's all we're committing to I think it's uncontroversial -- and I bet it's what a whole bunch of unsafe code in the ecosystem already assumes anyway, de facto, since it's been true since at least 1.1.0.

I don't know what, if anything, we want to promise or require about the validity invariant, especially since that's already defined to be different between &[i32] and *const [i32]. Perhaps this RFC should just not say anything about it -- reading a rust slice from C doesn't need to know anything about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-lang Relevant to the language team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.