Investigate memory usage of compiling the packed_simd crate #57829

Closed
hsivonen opened this issue Jan 22, 2019 · 14 comments · Fixed by #58207
Labels
A-SIMD Area: SIMD (Single Instruction Multiple Data) I-compilemem Issue: Problems and improvements with respect to memory usage during compilation.

Comments

@hsivonen
Member

Steps to reproduce

  1. Create a new crate with cargo.
  2. Add packed_simd = '0.3.1' to Cargo.toml of the new crate.
  3. Build the new crate.

Actual results

While compiling packed_simd, rustc takes more than 2 GB of RAM.

Expected results

Lower RAM usage.

Additional info

Maybe it's just the nature of packed_simd that it takes a lot of RAM to compile, and there's no bug. However, if RAM usage reached 3 GB in the future, the crate would become unbuildable on 32-bit systems. It might be worthwhile to investigate if building packed_simd has to take this much RAM or if there is an opportunity to use less RAM without adversely affecting compilation speed on systems that have plenty of RAM.

@gnzlbg
Contributor

gnzlbg commented Jan 22, 2019

cc @mw @nnethercote

Centril added the I-compilemem (Issue: Problems and improvements with respect to memory usage during compilation) and A-SIMD (Area: SIMD, Single Instruction Multiple Data) labels Jan 22, 2019
@matthiaskrgr
Member

Looks like NLL needs a lot of memory here:

   Compiling packed_simd v0.3.1
  time: 0.054; rss: 57MB	parsing
  time: 0.000; rss: 58MB	attributes injection
  time: 0.000; rss: 58MB	recursion limit
  time: 0.000; rss: 58MB	crate injection
  time: 0.000; rss: 58MB	plugin loading
  time: 0.000; rss: 58MB	plugin registration
  time: 0.005; rss: 58MB	pre ast expansion lint checks
    time: 2.550; rss: 369MB	expand crate
    time: 0.000; rss: 369MB	check unused macros
  time: 2.550; rss: 369MB	expansion
  time: 0.000; rss: 369MB	maybe building test harness
  time: 0.012; rss: 369MB	maybe creating a macro crate
  time: 0.048; rss: 370MB	creating allocators
  time: 0.036; rss: 370MB	AST validation
  time: 0.497; rss: 412MB	name resolution
  time: 0.075; rss: 412MB	complete gated feature checking
  time: 0.321; rss: 481MB	lowering ast -> hir
  time: 0.081; rss: 482MB	early lint checks
    time: 0.052; rss: 504MB	validate hir map
  time: 0.353; rss: 504MB	indexing hir
  time: 0.000; rss: 504MB	load query result cache
  time: 0.000; rss: 504MB	looking for entry point
  time: 0.000; rss: 504MB	dep graph tcx init
  time: 0.001; rss: 504MB	looking for plugin registrar
  time: 0.001; rss: 504MB	looking for derive registrar
  time: 0.019; rss: 504MB	loop checking
  time: 0.024; rss: 504MB	attribute checking
    time: 0.000; rss: 515MB	solve_nll_region_constraints(DefId(0/1:2171 ~ packed_simd[a932]::v64[0]::f32x2[0]::{{constant}}[0]))
*snip*
    time: 0.000; rss: 527MB	solve_nll_region_constraints(DefId(0/1:4611 ~ packed_simd[a932]::vSize[0]::{{impl}}[587]::from[0]::U[0]::array[0]::{{constant}}[0]))
  time: 0.636; rss: 527MB	stability checking
  time: 0.124; rss: 527MB	type collecting
  time: 0.003; rss: 527MB	outlives testing
  time: 0.019; rss: 527MB	impl wf inference
    time: 0.000; rss: 1113MB	solve_nll_region_constraints(DefId(0/1:224 ~ packed_simd[a932]::codegen[0]::shuffle[0]::{{impl}}[0]::{{constant}}[0]))
*snip*
    time: 0.000; rss: 1246MB	solve_nll_region_constraints(DefId(0/1:4867 ~ packed_simd[a932]::vPtr[0]::{{impl}}[104]::{{constant}}[0]))
  time: 9.972; rss: 1408MB	coherence checking
  time: 0.002; rss: 1408MB	variance testing
    time: 0.000; rss: 1605MB	solve_nll_region_constraints(DefId(0/1:366 ~ packed_simd[a932]::codegen[0]::v16[0]::{{impl}}[0]::NT[0]::{{constant}}[0]))
*snip*
    time: 0.000; rss: 2013MB	solve_nll_region_constraints(DefId(0/0:4027 ~ packed_simd[a932]::codegen[0]::reductions[0]::mask[0]::{{impl}}[7]::any[0]))
    time: 0.000; rss: 2013MB	solve_nll_region_constraints(DefId(0/0:4053 ~ packed_simd[a932]::codegen[0]::reductions[0]::mask[0]::{{impl}}[17]::any[0]))
  time: 5.040; rss: 2013MB	MIR borrow checking
  time: 0.000; rss: 2013MB	dumping chalk-like clauses
  time: 0.005; rss: 2013MB	MIR effect checking
  time: 0.072; rss: 2018MB	death checking
  time: 0.021; rss: 2018MB	unused lib feature checking
  time: 0.176; rss: 2019MB	lint checking
  time: 0.000; rss: 2019MB	resolving dependency formats
    time: 0.890; rss: 2055MB	write metadata
      time: 0.010; rss: 2055MB	collecting roots
      time: 0.186; rss: 2056MB	collecting mono items
    time: 0.196; rss: 2056MB	monomorphization collection
    time: 0.001; rss: 2056MB	codegen unit partitioning
    time: 0.122; rss: 2060MB	codegen to LLVM IR
    time: 0.000; rss: 2060MB	assert dep graph
    time: 0.000; rss: 2060MB	serialize dep graph
  time: 1.215; rss: 2060MB	codegen
    time: 0.056; rss: 2063MB	llvm function passes [packed_simd.smey8184-cgu.0]
    time: 0.777; rss: 2071MB	llvm module passes [packed_simd.smey8184-cgu.0]
    time: 0.798; rss: 2079MB	codegen passes [packed_simd.smey8184-cgu.0]
  time: 1.703; rss: 1539MB	LLVM passes
  time: 0.000; rss: 1540MB	serialize work products
  time: 0.017; rss: 1540MB	linking
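
To see which passes in a log like the one above account for the RSS growth, the lines can be parsed and the increase between consecutive entries ranked. The following is only a minimal sketch: `PassSample`, `parse_line`, and `largest_rss_jumps` are hypothetical helper names, not part of rustc or cargo, and the line format is assumed to be exactly `time: <secs>; rss: <n>MB<TAB><name>` as printed above. Nested entries are treated as a flat sequence, so growth is attributed to whichever line first reports the higher RSS.

```rust
/// One parsed line of the pass-timing output (hypothetical helper type).
#[derive(Debug, Clone, PartialEq)]
pub struct PassSample {
    pub seconds: f64,
    pub rss_mb: u64,
    pub name: String,
}

/// Parses a line of the assumed form `time: 0.054; rss: 57MB<TAB>parsing`.
/// Returns `None` for lines that do not match, such as `*snip*`.
pub fn parse_line(line: &str) -> Option<PassSample> {
    let rest = line.trim_start().strip_prefix("time: ")?;
    let (secs, rest) = rest.split_once("; rss: ")?;
    let (rss, name) = rest.split_once("MB")?;
    Some(PassSample {
        seconds: secs.trim().parse().ok()?,
        rss_mb: rss.trim().parse().ok()?,
        name: name.trim().to_string(),
    })
}

/// Returns up to `top` entries with the largest RSS increase (in MB)
/// relative to the previous parsed line, largest first.
pub fn largest_rss_jumps(log: &str, top: usize) -> Vec<(i64, String)> {
    let samples: Vec<PassSample> = log.lines().filter_map(parse_line).collect();
    let mut jumps: Vec<(i64, String)> = samples
        .windows(2)
        .map(|w| (w[1].rss_mb as i64 - w[0].rss_mb as i64, w[1].name.clone()))
        .collect();
    // Sort by descending RSS delta.
    jumps.sort_by(|a, b| b.0.cmp(&a.0));
    jumps.truncate(top);
    jumps
}
```

Because the log above elides many `solve_nll_region_constraints` lines with `*snip*`, running this on the pasted excerpt would lump the elided growth into the first line after each gap.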

@gnzlbg
Contributor

gnzlbg commented Jan 22, 2019

Coherence checking also takes a good chunk of memory:

time: 0.000; rss: 1246MB	solve_nll_region_constraints(DefId(0/1:4867 ~ packed_simd[a932]::vPtr[0]::{{impl}}[104]::{{constant}}[0]))
  time: 9.972; rss: 1408MB	coherence checking

although NLL is the first suspect here. I wonder why NLL uses this much memory: packed_simd is full of methods, but the great majority of them are essentially one-liners.

@memoryruins
Contributor

Reported the following spike of memory usage in #57432, which occurred after #56723

[Image: packed-simd-memory]

@mati865
Contributor

mati865 commented Jan 30, 2019

This one could be closed as a duplicate of #57432, I guess.

@gnzlbg
Contributor

gnzlbg commented Jan 30, 2019

EDIT: @mati865 you are right, these are duplicates. I thought that was a different issue, which apparently never got filed, so forget this.


original comment:

@mati865 while they are related, they are two different issues:

  • this issue is about compiling packed_simd itself, which started using much more memory recently, resulting in some builds failing for consumers (encoding-rs)

  • #57432 ("Compile time perf regression for packed-simd's max-rss") is about increased compile times and memory usage when compiling other crates when packed_simd is part of libcore (e.g. via core::simd)

@nnethercote
Contributor

I did a DHAT run. The "At t-gmax" measurement is the relevant one; it's short for "time of global max". It shows that the interning of constants within TypeFolder accounts for over 54% of the global peak:

AP 1.1.1.1.1/2 (2 children) {
  Total:     912,261,120 bytes (12.02%, 7,312.63/Minstr) in 6 blocks (0%, 0/Minstr), avg size 152,043,520 bytes, avg lifetime 103,155,024,513.33 instrs (82.69% of program duration)
  At t-gmax: 912,261,120 bytes (54.74%) in 6 blocks (0%), avg size 152,043,520 bytes
  At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
  Reads:     1,827,458,569 bytes (4.97%, 14,648.81/Minstr), 2/byte
  Writes:    844,260,160 bytes (9.59%, 6,767.54/Minstr), 0.93/byte
  Allocated at {
    #1: 0xB66BCCB: alloc (alloc.rs:72)
    #2: 0xB66BCCB: alloc (alloc.rs:148)
    #3: 0xB66BCCB: allocate_in<u8,alloc::alloc::Global> (raw_vec.rs:96)
    #4: 0xB66BCCB: with_capacity<u8> (raw_vec.rs:140)
    #5: 0xB66BCCB: new<u8> (lib.rs:66)
    #6: 0xB66BCCB: arena::DroplessArena::grow (lib.rs:346)
    #7: 0x8C1BB25: alloc_raw (lib.rs:362)
    #8: 0x8C1BB25: alloc<rustc::ty::sty::LazyConst> (lib.rs:378)
    #9: 0x8C1BB25: alloc<rustc::ty::sty::LazyConst> (lib.rs:465)
    #10: 0x8C1BB25: intern_lazy_const (context.rs:1123)
    #11: 0x8C1BB25: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_const (project.rs:423)
    #12: 0x8C1B235: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:1049)
    #13: 0x8C1B235: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:719)
    #14: 0x8C1B235: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_ty (project.rs:337)
    #15: 0x890C0D0: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:769)
    #16: 0x890C0D0: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:135)
    #17: 0x890C0D0: fold_with<rustc::ty::subst::Kind,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #18: 0x890C0D0: {{closure}}<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:328)
    #19: 0x890C0D0: call_once<(&rustc::ty::subst::Kind),closure> (function.rs:279)
    #20: 0x890C0D0: map<&rustc::ty::subst::Kind,rustc::ty::subst::Kind,&mut closure> (option.rs:414)
    #21: 0x890C0D0: next<rustc::ty::subst::Kind,core::slice::Iter<rustc::ty::subst::Kind>,closure> (mod.rs:567)
    #22: 0x890C0D0: <smallvec::SmallVec<A> as core::iter::traits::collect::Extend<<A as smallvec::Array>::Item>>::extend (lib.rs:1349)
    #23: 0x8EF9787: from_iter<[rustc::ty::subst::Kind; 8],core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>> (lib.rs:1333)
    #24: 0x8EF9787: collect<core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>,smallvec::SmallVec<[rustc::ty::subst::Kind; 8]>> (iterator.rs:1466)
    #25: 0x8EF9787: rustc::ty::subst::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::List<rustc::ty::subst::Kind<'tcx>>>::super_fold_with (subst.rs:328)
    #26: 0x8C1B183: fold_with<&rustc::ty::List<rustc::ty::subst::Kind>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #27: 0x8C1B183: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:721)
    #28: 0x8C1B183: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_ty (project.rs:337)
    #29: 0x890C0D0: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:769)
    #30: 0x890C0D0: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:135)
    #31: 0x890C0D0: fold_with<rustc::ty::subst::Kind,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #32: 0x890C0D0: {{closure}}<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:328)
    #33: 0x890C0D0: call_once<(&rustc::ty::subst::Kind),closure> (function.rs:279)
    #34: 0x890C0D0: map<&rustc::ty::subst::Kind,rustc::ty::subst::Kind,&mut closure> (option.rs:414)
    #35: 0x890C0D0: next<rustc::ty::subst::Kind,core::slice::Iter<rustc::ty::subst::Kind>,closure> (mod.rs:567)
    #36: 0x890C0D0: <smallvec::SmallVec<A> as core::iter::traits::collect::Extend<<A as smallvec::Array>::Item>>::extend (lib.rs:1349)
    #37: 0x8EF9787: from_iter<[rustc::ty::subst::Kind; 8],core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>> (lib.rs:1333)
    #38: 0x8EF9787: collect<core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>,smallvec::SmallVec<[rustc::ty::subst::Kind; 8]>> (iterator.rs:1466)
    #39: 0x8EF9787: rustc::ty::subst::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::List<rustc::ty::subst::Kind<'tcx>>>::super_fold_with (subst.rs:328)
    #40: 0x8BFE173: fold_with<&rustc::ty::List<rustc::ty::subst::Kind>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #41: 0x8BFE173: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:344)
    #42: 0x8BFE173: fold_with<rustc::ty::sty::TraitRef,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #43: 0x8BFE173: super_fold_with<rustc::ty::sty::TraitRef,rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:397)
    #44: 0x8BFE173: fold_with<core::option::Option<rustc::ty::sty::TraitRef>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #45: 0x8BFE173: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:344)
    #46: 0x8BFE173: fold_with<rustc::ty::ImplHeader,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #47: 0x8BFE173: fold<rustc::ty::ImplHeader> (project.rs:315)
    #48: 0x8BFE173: normalize_with_depth<rustc::ty::ImplHeader> (project.rs:274)
    #49: 0x8BFE173: normalize<rustc::ty::ImplHeader> (project.rs:258)
    #50: 0x8BFE173: rustc::traits::coherence::with_fresh_ty_vars (coherence.rs:107)
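
As a quick sanity check on the record above, the average block size and the heap size implied at the global peak can be recomputed from the printed figures. This is a sketch only: the constants are copied from the "At t-gmax" line, the function names are hypothetical, and the implied peak is approximate because the 54.74% share is itself rounded. It describes the heap as seen by DHAT, which is not the same quantity as the RSS figures reported elsewhere in this thread.

```rust
/// Figures copied from the "At t-gmax" line of the DHAT record above.
pub const BYTES_AT_GMAX: u64 = 912_261_120;
pub const BLOCKS_AT_GMAX: u64 = 6;
pub const SHARE_OF_PEAK_PERCENT: f64 = 54.74;

/// Average block size in bytes (integer division).
pub fn avg_block_size(bytes: u64, blocks: u64) -> u64 {
    bytes / blocks
}

/// Total heap size at the global peak implied by one record's byte count
/// and its percentage share of that peak.
pub fn implied_peak_bytes(bytes: u64, share_percent: f64) -> f64 {
    bytes as f64 / (share_percent / 100.0)
}
```

With these inputs the average block size matches the 152,043,520 bytes that DHAT prints, and the implied heap peak comes out at roughly 1.67 billion bytes.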

@nnethercote
Contributor

@eddyb @oli-obk @RalfJung Any thoughts on how to improve intern_lazy_const?

@RalfJung
Member

RalfJung commented Feb 4, 2019

Cc @eddyb

@nnethercote
Contributor

Any thoughts on how to improve intern_lazy_const?

There is an obvious problem: intern_lazy_const doesn't intern the value! And the values passed are exceedingly repetitive. Here's a histogram of the top 10, which account for 97.6% of the calls:

17886042 counts:
(  1)  5253160 (29.4%, 29.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 2 }) })
(  2)  5192895 (29.0%, 58.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 4 }) })
(  3)  3928986 (22.0%, 80.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 8 }) })
(  4)  1600916 ( 9.0%, 89.3%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 16 }) })
(  5)   719785 ( 4.0%, 93.3%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 32 }) })
(  6)   299507 ( 1.7%, 95.0%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 1 }) })
(  7)   271847 ( 1.5%, 96.5%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 64 }) })
(  8)    61636 ( 0.3%, 96.9%): Unevaluated(DefId(0/1:4735 ~ packed_simd[3c0f]::vPtr[0]::mptrx4[0]::{{constant}}[0]), [])
(  9)    61636 ( 0.3%, 97.2%): Unevaluated(DefId(0/1:4823 ~ packed_simd[3c0f]::vPtr[0]::mptrx8[0]::{{constant}}[0]), [])
( 10)    61636 ( 0.3%, 97.6%): Unevaluated(DefId(0/1:4653 ~ packed_simd[3c0f]::vPtr[0]::mptrx2[0]::{{constant}}[0]), [])

Fixing this should drastically reduce the memory usage.
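
A histogram in the shape shown above (count, share of total, cumulative share) can be produced from any stream of logged values. The sketch below is an assumption about how such a tally might be computed, not the tool actually used for the numbers in this comment; `histogram` and `Row` are hypothetical names, and the values are treated as plain strings.

```rust
use std::collections::HashMap;

/// One histogram row: (count, percent of total, cumulative percent, value).
pub type Row = (usize, f64, f64, String);

/// Counts how often each value occurs and returns rows sorted by
/// descending count, with per-row and cumulative percentages.
pub fn histogram<'a, I>(values: I) -> Vec<Row>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    let mut total = 0usize;
    for v in values {
        *counts.entry(v).or_insert(0) += 1;
        total += 1;
    }

    let mut sorted: Vec<(&str, usize)> = counts.into_iter().collect();
    // Descending by count; ties are broken by value so output is deterministic.
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));

    let mut cumulative = 0usize;
    sorted
        .into_iter()
        .map(|(value, n)| {
            cumulative += n;
            (
                n,
                100.0 * n as f64 / total as f64,
                100.0 * cumulative as f64 / total as f64,
                value.to_string(),
            )
        })
        .collect()
}
```

A heavily skewed result, where a handful of rows reach a cumulative share above 90%, is the signal that deduplicating the values would pay off.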

I tried doing the obvious thing by introducing GlobalCtxt::lazy_const_interner, heavily inspired by GlobalCtxt::layout_interner, but I couldn't get the lifetimes to work. I will try again tomorrow if nobody else beats me to it.
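
To illustrate the difference between allocating every value and actually interning it, here is a minimal, self-contained sketch. It is not rustc's implementation: `LazyConstSketch`, `Id`, `Interner`, and `AppendOnly` are hypothetical stand-ins, and the interner hands out index handles instead of arena references, which sidesteps the lifetime questions mentioned above rather than solving them.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the values being passed in.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum LazyConstSketch {
    Evaluated { bits: u128 },
    Unevaluated { def_index: u32 },
}

/// Handle to a stored value (an index, not a reference).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Id(usize);

/// Stores each distinct value once and returns the same handle for equal values.
#[derive(Default)]
pub struct Interner {
    values: Vec<LazyConstSketch>,
    lookup: HashMap<LazyConstSketch, Id>,
}

impl Interner {
    pub fn intern(&mut self, value: LazyConstSketch) -> Id {
        if let Some(&id) = self.lookup.get(&value) {
            return id; // Already stored: no new allocation.
        }
        let id = Id(self.values.len());
        self.values.push(value.clone());
        self.lookup.insert(value, id);
        id
    }

    pub fn get(&self, id: Id) -> &LazyConstSketch {
        &self.values[id.0]
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }
}

/// The non-interning behaviour for comparison: every call stores a new copy.
#[derive(Default)]
pub struct AppendOnly {
    values: Vec<LazyConstSketch>,
}

impl AppendOnly {
    pub fn alloc(&mut self, value: LazyConstSketch) -> Id {
        self.values.push(value);
        Id(self.values.len() - 1)
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }
}
```

With a workload as repetitive as the histogram above, `AppendOnly` grows with the number of calls while `Interner` grows only with the number of distinct values, at the cost of one hash lookup per call.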

nnethercote added a commit to nnethercote/rust that referenced this issue Feb 6, 2019
Currently it just unconditionally allocates it in the arena.

For a "Clean Check" build of the `packed-simd` benchmark, this
change reduces both the `max-rss` and `faults` counts by 59%; it
slightly (~3%) increases the instruction counts but the `wall-time` is
unchanged.

For the same builds of a few other benchmarks, `max-rss` and `faults`
drop by 1--5%, but instruction counts and `wall-time` changes are in the
noise.

Fixes rust-lang#57432, fixes rust-lang#57829.
@hsivonen
Member Author

hsivonen commented Feb 7, 2019

FWIW, without the in-flight fix here, a relatively small tweak to packed_simd made packed_simd uncompilable on an ARMv7 system whose /proc/meminfo says there's 3624684 kB of RAM plus some swap. (And a Chrome OS kernel; I don't know what kind of swap use policy Chrome OS applies.)

I'll test again once the fix for this issue is in nightly.

@RalfJung
Member

RalfJung commented Feb 9, 2019

This just brought down my whole system -- 16GB of RAM used to be enough to compile two rustc in parallel (with 8 jobs each), but with the current RAM consumption that does not seem to be the case any more.

bors added a commit that referenced this issue Feb 9, 2019
Make `intern_lazy_const` actually intern its argument.
Fixes #57432, fixes #57829.
@oli-obk
Contributor

oli-obk commented Feb 10, 2019

Can you try again with today's nightly?

@hsivonen
Member Author

Much better memory usage now. Thank you!

It seems it would be worthwhile to nominate this for uplift to beta, but I'm not permitted to add the tag myself.

pietroalbini pushed a commit to pietroalbini/rust that referenced this issue Feb 17, 2019
Fixes rust-lang#57432, fixes rust-lang#57829.