
Idea: guaranteed-zero padding for niches #70230

Open · RalfJung opened this issue Mar 21, 2020 · 32 comments
Labels: A-layout (Area: Memory layout of types), C-enhancement (Category: An issue proposing an enhancement or a PR with one), C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

@RalfJung (Member) commented Mar 21, 2020

As suggested here, one option to gain more layout optimizations would be (either for all repr(Rust) types or with per-type opt-in) to use a different kind of padding.

Right now, padding bytes may contain arbitrary data, and can be omitted when copying as they are "unstable" in a typed copy. Instead, we could also imagine a kind of padding ("zero-padding") that is guaranteed to be zero. This has the disadvantage that padding bytes need to be initialized on struct creation and copied when the struct is copied, but it has the advantage that padding bytes can become "niches" for layout optimizations.
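
For illustration, here is the kind of size win at stake (a hypothetical sketch; the repr(C) is only there to pin the field offsets, and the Option size is what current rustc happens to produce, not a guarantee):

use std::mem::size_of;

#[repr(C)]
struct S {
    a: u8,  // byte 0
    b: u16, // bytes 2..4; byte 1 is padding
}

fn main() {
    assert_eq!(size_of::<S>(), 4);
    // Today: padding is arbitrary, so `Option<S>` needs a separate
    // discriminant byte (plus alignment padding).
    assert_eq!(size_of::<Option<S>>(), 6);
    // With guaranteed-zero padding, byte 1 of `S` would be a niche:
    // `None` could be encoded as a nonzero value there, keeping
    // `Option<S>` at 4 bytes.
}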

LLVM's padding is of the "unstable" kind, but we could just add explicit fields ourselves (fields that are guaranteed to be zero) when translating zero-padded types to LLVM.

Cc @eddyb

@jonas-schievink added the C-enhancement and T-compiler labels Mar 21, 2020
@eddyb (Member) commented Mar 21, 2020

> LLVM's padding is of the "unstable" kind, but we could just add explicit fields ourselves (fields that are guaranteed to be zero) when translating zero-padded types to LLVM.

LLVM has no concept of "padding" outside of "first-class aggregates", which we only use for multiple (well, 2) return values (via ScalarPair), and nothing else (no values in functions, no global initializers). And really, it's basically like loading/storing the fields individually.

As long as a type is Abi::Aggregate, and it's not copied by explicit field-by-field accesses, LLVM only sees a memcpy (so size and alignment), nothing else.

EDIT: oh, and there's passing values around in calls - that's also not LLVM, it's us. We'd have to make sure we don't apply architecture-specific call ABI rules (e.g. for extern "C" fns) that introspect the field structure of the type, and instead treat it as an opaque blob.

@RalfJung (Member Author) commented:

> LLVM has no concept of "padding" outside of "first-class aggregates", which we only use for multiple (well, 2) return values (via ScalarPair), and nothing else (no values in functions, no global initializers). And really, it's basically like loading/storing the fields individually.

I don't know what you mean. When I define a struct/union in LLVM, won't it add padding in between the fields?

LLVM is permitted to turn a copy of a struct type into a field-by-field copy, and AFAIK we had examples where it did that? This is the kind of transformation that becomes illegal with "zero-padding" (unless we somehow already know that the target memory is zero, e.g. because it is already valid at the given type).
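
Sketched in Rust terms, the two lowerings in question (a hedged illustration; the repr(C) only pins the layout):

#[repr(C)]
#[derive(Clone, Copy)]
struct S {
    a: u8,  // byte 0; bytes 1..4 are padding
    b: u32, // bytes 4..8
}

fn copy_whole(dst: &mut S, src: &S) {
    // May lower to a memcpy of all 8 bytes, padding included.
    *dst = *src;
}

fn copy_fields(dst: &mut S, src: &S) {
    // The field-by-field version a backend may rewrite the copy into,
    // skipping the padding. Legal today because padding is "unstable";
    // with zero-padding it is only legal if dst's padding is already
    // known to be zero.
    dst.a = src.a;
    dst.b = src.b;
}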

@bjorn3 (Member) commented Mar 21, 2020

What would be the overhead of the actual zeroing of the padding? Especially for cg_clif, which barely performs any memory optimizations, I imagine this would have a significant overhead. The only memory optimization cg_clif performs yields a ~15% improvement, but that optimization is currently so dumb that I think zeroing padding would defeat it. For cg_llvm it would just mean more IR to churn through with optimizations enabled (a compile-time slowdown), or a runtime slowdown without optimizations.

@RalfJung (Member Author) commented Mar 21, 2020

Assuming the backend does naive copying (i.e., it always copies the entire struct and doesn't try to be clever by only copying the fields and ignoring padding), then the only overhead is in the constructor: Struct { f1, f2 } would have to compile to code that also writes zeroes to padding between and after the fields.
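
For concreteness, roughly what such a constructor would have to do, written out by hand (a sketch assuming a repr(C) layout; not what rustc emits today, and addr_of_mut! is just this illustration's way of writing the fields):

use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

#[repr(C)]
struct S {
    f1: u8,  // byte 0; bytes 1..4 are padding
    f2: u32, // bytes 4..8
}

fn construct(f1: u8, f2: u32) -> S {
    let mut out = MaybeUninit::<S>::uninit();
    let p = out.as_mut_ptr();
    unsafe {
        // Zero the whole value first (or just the padding range 1..4),
        // then write the fields.
        p.write_bytes(0u8, 1); // count 1 = one `S`, i.e. size_of::<S>() bytes
        addr_of_mut!((*p).f1).write(f1);
        addr_of_mut!((*p).f2).write(f2);
        out.assume_init()
    }
}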

@bjorn3 (Member) commented Mar 21, 2020

> Assuming the backend does naive copying (i.e., it always copies the entire struct and doesn't try to be clever by only copying the fields and ignoring padding)

Yes

> then the only overhead is in the constructor

That is exactly what I am afraid of: a >5% overhead on compile time and, when not using optimizations, on runtime. Especially for enums it may result in a lot more stores being generated.

@bjorn3 (Member) commented Mar 21, 2020

Also, for the MIR returned by instance_mir, the constructor is desugared into individual field assignments. This means that you either have to zero the full local in the prologue, or you have to do a complex analysis to find which bytes will be overwritten later. For the latter you can't just take the field offsets, as different enum variants may have different padding.
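
To illustrate what gets lost (surface Rust standing in for the MIR; not exact MIR output):

struct Struct {
    f1: u8,
    f2: u32,
}

fn build(f1: u8, f2: u32) -> Struct {
    // One aggregate expression in surface Rust / pre-deaggregation MIR...
    let s = Struct { f1, f2 };
    // ...but after deaggregation the MIR is morally
    //     s.f1 = f1;
    //     s.f2 = f2;
    // with no remaining statement marking the padding as zero, so a backend
    // would have to reconstruct which bytes are padding.
    s
}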

@RalfJung (Member Author) commented Mar 21, 2020

I was thinking of struct padding only so far, where it should be pretty trivial to write some zeroes between the fields.

But I also don't see the issue with enums. In a constructor, the enum variant is statically known. So you just look at the fields of the initial variant, and zero the gaps between them.
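
For example (a sketch; the exact offsets are up to the compiler), every construction site statically knows its variant and therefore its gaps:

#[allow(dead_code)]
enum E {
    A(u8, u32), // gap between the u8 and the u32 fields of variant A
    B(u16),     // gap after the u16, up to the full size of E
}

fn make_a() -> E {
    // Constructing `E::A`: only variant A's padding ranges need zeroing here.
    E::A(1, 2)
}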

> That is exactly what I am afraid of: a >5% overhead on compile time and, when not using optimizations, on runtime.

That sounds like a wild guess to me -- do you have any foundation for it being 5% rather than, say, 0.5%? I'd expect the impact to be tiny, since most structs will have way more data bytes than padding bytes and most code is not running constructors.

@bjorn3 (Member) commented Mar 21, 2020

> But I also don't see the issue with enums. In a constructor, the enum variant is statically known. So you just look at the fields of the initial variant, and zero the gaps between them.

I guess zeroing during the codegen of SetDiscriminant will work. That will unnecessarily duplicate zeroing in some cases, though.

> That sounds like a wild guess to me -- do you have any foundation for it being 5% rather than, say, 0.5%? I'd expect the impact to be tiny, since most structs will have way more data bytes than padding bytes and most code is not running constructors.

Not much, but given the huge speed win of the very, very dumb memory optimization I implemented (it bails out on taken addresses, on more than one store possibly preceding a load, ...), I expect the cost to be relatively high. I can try to benchmark it for cg_clif, though.

@upsuper (Contributor) commented Mar 22, 2020

> Assuming the backend does naive copying (i.e., it always copies the entire struct and doesn't try to be clever by only copying the fields and ignoring padding), then the only overhead is in the constructor: Struct { f1, f2 } would have to compile to code that also writes zeroes to padding between and after the fields.

Another potential consideration: if reading the padding is undefined, the compiler can exploit that to vectorize some writes even when they would overwrite the padding area; if the padding is defined to be zero, such an optimization may be less feasible.

I'm not sure whether that optimization actually happens in practice, though.
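
A sketch of the kind of store widening at stake (hypothetical codegen; the repr(C) pins one trailing padding byte):

#[repr(C)]
struct S {
    a: u16, // bytes 0..2
    b: u8,  // byte 2; byte 3 is padding
}

fn set(dst: &mut S, a: u16, b: u8) {
    // Today a backend may merge these into a single 4-byte store whose
    // padding byte holds arbitrary junk, since nothing may rely on it.
    // With guaranteed-zero padding, the widened store would have to put 0
    // in byte 3, constraining (though not necessarily preventing) the trick.
    dst.a = a;
    dst.b = b;
}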

@eddyb (Member) commented Mar 22, 2020

> I don't know what you mean. When I define a struct/union in LLVM, won't it add padding in between the fields?

Only if you use those types as "first class aggregates", as in, by-value.
Oh and Abi::Scalar should be fine since we use it only when there is no padding after the scalar, so only Abi::ScalarPair and Abi::Vector lose padding (outside of non-Rust call ABIs).

When used for field access only, like we do, it's just a silly way to specify a field offset, nothing more.

llvm.memcpy calls don't see a type, they just know how many bytes to copy, and we pass the full size (aka "stride"), not the smaller unpadded size, anyway (mostly because we don't track the latter).


Also, if you look at this example's LLVM IR, this is how we lower non-unions:

#[repr(C)] // To avoid field reorder.
pub struct S(i16, i32, i64);

becomes:

%S = type { [0 x i16], i16, [1 x i16], i32, [0 x i32], i64, [0 x i64] }

Note how we've made the padding fields explicit, and filled them in even when LLVM would compute the right offsets without them? That's because LLVM has no way to specify a custom alignment.

And without tracking "alignment from LLVM leaves" for every layout, we can't easily know in the general case whether LLVM will compute the correct offset or not, so we always force it with explicit padding.

Anyway, all that "padding" is just to make LLVM compute the right offset numbers; we should seriously consider just using one zero-length array to force the alignment and one byte array for the offsets.


@RalfJung I suppose the only way LLVM knows something like S above has padding, especially in the local stack frame, is that some byte ranges are never written to.

If they were, it wouldn't matter how you expressed the offsets, and unless used by value (which you shouldn't do because support is dismal) LLVM struct (tuple) "types" are a red herring.

@RalfJung (Member Author) commented:

> Note how we've made the padding fields explicit, and filled them in even when LLVM would compute the right offsets without them? That's because LLVM has no way to specify a custom alignment.

Oh I see, so LLVM itself couldn't even do the "replace copy of struct by copy of fields skipping padding" transformation, because it does not know which fields are padding. Nothing would have to change here for guaranteed-zero padding.


Btw, with nightly features we can already "manually" zero-pad a struct, and layout optimizations can exploit that:

#![feature(rustc_attrs)]

#[rustc_layout_scalar_valid_range_start(0)]
#[rustc_layout_scalar_valid_range_end(0)]
struct ZeroPad(u8);

impl ZeroPad {
    const ZERO: ZeroPad = unsafe { ZeroPad(0) };
}

struct Test(u8, ZeroPad, u16);
fn main() {
    println!("{}", std::mem::size_of::<Option<Test>>());
}

Playground

@RustyYato (Contributor) commented:

@RalfJung We can go even simpler: just use an enum!

#[repr(u8)]
enum ZeroPad {
    Zero = 0
}
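
A quick check that the niche actually kicks in (the printed size is what current rustc produces, not a language guarantee):

#[repr(u8)]
enum ZeroPad {
    Zero = 0,
}

struct Test(u8, ZeroPad, u16);

fn main() {
    // Prints 4: ZeroPad's byte has a single valid value (0), so it serves
    // as the niche for Option's discriminant, just like the
    // rustc_layout_scalar_valid_range version above.
    println!("{}", std::mem::size_of::<Option<Test>>());
}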

@RalfJung (Member Author) commented:

@KrishnaSannasi indeed, that works just as well. Nice catch :)

@bjorn3 (Member) commented Mar 23, 2020

I am currently trying to implement zero-padding during SetDiscriminant and I already got a miscompilation for FieldPlacement::Arbitrary. This suggests that it is non-trivial to write code that finds the padding.

@RalfJung (Member Author) commented:

It should be easier for structs though, right? As mentioned before, I mostly had structs in mind, where padding is much simpler than for enums.

It may make more sense to say that Deaggregator actually has to respect zero-padding in its lowering -- with per-field initialization, the necessary information is already mostly gone and would have to be carefully reconstructed.

@bjorn3 (Member) commented Mar 23, 2020

The enum variant is as hard as the struct variant if you take the route of zeroing during SetDiscriminant. I just used layout.for_variant().

@RalfJung (Member Author) commented Mar 23, 2020

Not struct variants. Structs. As in struct Foo { field1: type1, field2: type2 }.

@bjorn3 (Member) commented Mar 23, 2020

Yes, I meant structs. "Struct variant" as in the struct variant of the zeroing codegen. I can understand the confusion, though. By the way, it turns out I was zeroing scalars, as they get FieldPlacement::Arbitrary without fields.

@RalfJung (Member Author) commented Mar 23, 2020

But there is no SetDiscriminant for structs, is there? There is none here.

@RalfJung (Member Author) commented:

Also, for your LayoutDetails troubles, this could be helpful: #69901

@bjorn3 (Member) commented Mar 23, 2020

I thought there was a SetDiscriminant to finalize the construction of every aggregate, even for structs. Something with partial initialization and borrowck. I guess I will try to zero during the function prologue instead.

@RalfJung (Member Author) commented Mar 23, 2020

I don't think this is a task for codegen. MIR has to make it clear when zeroing is necessary. That is the case when Aggregate initialization is used, but the Deaggregate path destroys that information.

@eddyb (Member) commented Mar 23, 2020

Part of the plan to get rid of the Deaggregate pass, btw, was to indicate full initialization with SetDiscriminant (which could be renamed to FinalizeVariant or something).

You could make Deaggregate keep SetDiscriminant to be able to access it in codegen.

@bjorn3 (Member) commented Mar 23, 2020

The cost of zeroing the padding seems to be less than I expected. Both during SetDiscriminant (enums only) and during the prologue (ADTs only) it doesn't seem to have a negative impact. In fact it seems to slightly improve performance. I don't know if this is just my computer warming up (the benches with padding happened to sort later alphabetically) or if it is caused by, for example, the CPU's store-to-load forwarding. More benchmarking with the final implementation method will likely be necessary.

Zeroing code for cg_clif
let variant_layout = place.layout().for_variant(fx, variant_idx); // Remove `for_variant` when zeroing during the prologue.
let mut padding_ranges = vec![];
match &variant_layout.fields {
    layout::FieldPlacement::Union(field_count) => {
        // All union fields start at offset 0, so everything past the
        // largest field is padding.
        padding_ranges.push(
            (0..*field_count)
                .map(|field| variant_layout.field(fx, field).size)
                .max()
                .unwrap_or(Size::ZERO)..place.layout().size,
        );
    }
    layout::FieldPlacement::Array { stride, count } => {
        // The tail of each element, between its size and its stride, is padding.
        for field in 0..*count as usize {
            let offset = variant_layout.fields.offset(field);
            padding_ranges.push(offset + variant_layout.field(fx, field).size..offset + *stride);
        }
    }
    layout::FieldPlacement::Arbitrary { .. } => {
        // Walk the fields in increasing-offset order; the gap between the
        // end of one field and the start of the next is padding, as is the
        // gap between the last field and the end of the layout.
        let mut last_field = None;
        for field in variant_layout.fields.index_by_increasing_offset() {
            if let Some(last_field) = last_field {
                padding_ranges.push(
                    variant_layout.fields.offset(last_field) + variant_layout.field(fx, last_field).size
                        ..variant_layout.fields.offset(field),
                );
            }
            last_field = Some(field);
        }
        if let Some(last_field) = last_field {
            padding_ranges.push(
                variant_layout.fields.offset(last_field) + variant_layout.field(fx, last_field).size
                    ..variant_layout.size,
            );
        }
        // No fields: don't zero anything. Scalars also report
        // `FieldPlacement::Arbitrary` without fields (see below).
    }
}

if let ty::Adt(_adt_def, _) = place.layout().ty.kind {
    // Only ADTs get their padding zeroed. (A possible refinement: restrict
    // to structs via `adt_def.is_struct()`.)
} else {
    // Without this check, scalars (which have no fields) would be zeroed
    // in their entirety.
    padding_ranges.clear();
}

for padding_range in padding_ranges {
    let addr = place
        .to_ptr(fx)
        .offset_i64(fx, i64::try_from(padding_range.start.bytes()).unwrap())
        .get_addr(fx);
    let size = padding_range.end.bytes() - padding_range.start.bytes();
    // Alignment guess: the greatest power of two dividing `size`, capped at 8. FIXME
    let align = std::cmp::min((size as i64 & -(size as i64)) as u64, 8u64) as u8;
    fx.bcx.emit_small_memset(fx.module.target_config(), addr, 0, size, align);
}

@bjorn3 (Member) commented Mar 23, 2020

Full benchmark details:

[BENCH RUN] ebobby/simple-raytracer
Benchmark #1: ./raytracer_cg_clif_no_zero_pad
  Time (mean ± σ):     13.009 s ±  0.853 s    [User: 12.964 s, System: 0.016 s]
  Range (min … max):   12.061 s … 14.689 s    10 runs
 
Benchmark #2: ./raytracer_cg_clif_release
  Time (mean ± σ):     10.194 s ±  0.454 s    [User: 10.162 s, System: 0.012 s]
  Range (min … max):    9.523 s … 11.052 s    10 runs
 
Benchmark #3: ./raytracer_cg_clif_zero_pad
  Time (mean ± σ):     12.320 s ±  0.416 s    [User: 12.302 s, System: 0.006 s]
  Range (min … max):   11.850 s … 13.185 s    10 runs
 
Benchmark #4: ./raytracer_cg_clif_zero_pad_arbitrary
  Time (mean ± σ):     12.618 s ±  0.684 s    [User: 12.583 s, System: 0.011 s]
  Range (min … max):   11.933 s … 13.932 s    10 runs
 
Benchmark #5: ./raytracer_cg_clif_zero_pad_prologue
  Time (mean ± σ):     11.997 s ±  0.076 s    [User: 11.976 s, System: 0.006 s]
  Range (min … max):   11.906 s … 12.155 s    10 runs
 
Benchmark #6: ./raytracer_cg_clif_zero_pad_prologue_enum_too
  Time (mean ± σ):     12.205 s ±  0.152 s    [User: 12.192 s, System: 0.006 s]
  Range (min … max):   12.096 s … 12.574 s    10 runs
 
Benchmark #7: ./raytracer_cg_clif_zero_pad_prologue_enum_too_release
  Time (mean ± σ):      9.440 s ±  0.081 s    [User: 9.425 s, System: 0.007 s]
  Range (min … max):    9.374 s …  9.656 s    10 runs
 
Summary
  './raytracer_cg_clif_zero_pad_prologue_enum_too_release' ran
    1.08 ± 0.05 times faster than './raytracer_cg_clif_release'
    1.27 ± 0.01 times faster than './raytracer_cg_clif_zero_pad_prologue'
    1.29 ± 0.02 times faster than './raytracer_cg_clif_zero_pad_prologue_enum_too'
    1.31 ± 0.05 times faster than './raytracer_cg_clif_zero_pad'
    1.34 ± 0.07 times faster than './raytracer_cg_clif_zero_pad_arbitrary'
    1.38 ± 0.09 times faster than './raytracer_cg_clif_no_zero_pad'

arbitrary => For the SetDiscriminant version: Don't ignore padding for FieldPlacement::Arbitrary.
release => Run the memory optimization.
prologue => Zero in the function prologue instead of during SetDiscriminant.
enum_too => For the prologue version: Don't ignore enums.

@workingjubilee (Member) commented:

Possibly relevant prior art, approaching guarantees of zero-initialized padding from a security perspective, and avoiding various accidents that might occur due to zero-initialized padding:
Solving Uninitialized Stack Memory on Windows

@atagunov commented:

> In fact it seems to slightly improve performance

Could it be that:

  • writing a partial cache line requires the CPU to first fetch that line from RAM and then write it back, while
  • writing a full cache line requires the CPU to do only the write?

@cuviper (Member) commented Sep 24, 2021

> Struct { f1, f2 } would have to compile to code that also writes zeroes to padding between and after the fields.

What about MaybeUninit<Struct>, does that also need to be created with zeroed padding?
If I write f1, then f2, then call any of the assume_init* methods, I expect to be in a good state.

edit: @comex also mentioned this in rust-lang/unsafe-code-guidelines#174 (comment).
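
The pattern in question, spelled out (a sketch assuming a repr(C) layout; fine today, but under a guaranteed-zero-padding repr the assume_init would be UB because the padding bytes were never initialized):

use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

#[repr(C)]
struct S {
    f1: u8,  // byte 0; bytes 1..4 are padding
    f2: u32, // bytes 4..8
}

fn build(f1: u8, f2: u32) -> S {
    let mut s = MaybeUninit::<S>::uninit();
    let p = s.as_mut_ptr();
    unsafe {
        // Every field is written, but the padding bytes are never touched.
        addr_of_mut!((*p).f1).write(f1);
        addr_of_mut!((*p).f2).write(f2);
        s.assume_init()
    }
}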

@RalfJung (Member Author) commented:

> If I write f1, then f2, then call any of the assume_init* methods, I expect to be in a good state.

This assumption would not be true for repr(GuaranteedZeroPadding).

Something has to give, we won't get more niches for free.

@cuviper (Member) commented Sep 24, 2021

> This assumption would not be true for repr(GuaranteedZeroPadding).

I see, I guess that's what you meant here:

> (either for all repr(Rust) types or with per-type opt-in)

So repr(Rust) probably isn't feasible, but a new repr could be the opt-in.

@eddyb (Member) commented Sep 30, 2021

Small nit: the #[repr(C)] and #[repr(Rust)] naming styles are exceptions because they name languages, so it should be #[repr(guaranteed_zero_padding)] instead.

But also I would expect something like #[repr(padding(zeroed))] or maybe even a more general form of #[repr(padding(fill_with(0)))].
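
To make the shape concrete (entirely hypothetical syntax; no such repr exists today):

// Opt-in per type; bytes 1..4 of S would be guaranteed-zero and therefore
// usable as a niche by enclosing enums.
#[repr(padding(zeroed))]
struct S {
    a: u8,
    b: u32,
}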

@RalfJung (Member Author) commented:

I totally meant this syntax as a strawman. :)
