[wasm] Jiterpreter tracking issue #78428

Open
kg opened this issue Nov 15, 2022 · 3 comments
Labels: arch-wasm (WebAssembly architecture), area-Codegen-Interpreter-mono, tracking (This issue is tracking the completion of other related issues.)

kg commented Nov 15, 2022

The jiterpreter (#76477) has pending work needed:

  • Introduce a Jiterpreter CI lane that sets all the tiering thresholds low so that we flush out any issues with obscure interp opcodes or cold code
  • Investigate integrating jit calls directly into compiled traces
  • Investigate integrating icalls directly into compiled traces
  • Run statistics on blazor applications once the jiterpreter is integrated, to identify any instructions that need to be added
  • Remove more unnecessary transition/wrapper glue from do_jit_call and interp_entry paths, as seen here: image
    • Maintain a table of vtable slots containing interp_entry _in wrappers, then patch the vtables (design pending)
    • When generating dedicated do_jit_call routine, punch through the _out wrapper (model on the mini-generic-sharing.c code generator)
    • Optimize direct jit calls to turn the common ldloca sp + offset -> tnn.load pair into tnn_load offset
    • Optimize out passing of ftndesc arg to direct jit call wrappers, target and rgctx can be compiled in (not possible due to generic sharing)
  • Cache interpreter stack locals in wasm locals, then flush them back to the interpreter stack on exit
  • Cache non-volatile fields in wasm locals, then flush them back to the heap on exit
  • Threading support (incomplete draft to-do list)
    • Pre-grow function pointer table to a set size at startup in each thread
    • Ensure empty function pointer slots are filled with appropriate 'dummy' functions so that threads will not crash when calling them
    • When jitting a new function, RPC the wasm blob or compiled module to threads so they can register the pointer
    • Thread-safe interpreter opcode patching
    • Thread-safe do_jit_call pointer/cache updates
  • Multi-trace optimizations
    • For traces with an offset other than 0 (large ones only?) attempt to reuse other existing traces?
    • Stop compiling traces when we encounter an already-compiled trace (likely function prologue -> loop body)
    • When we encounter an already compiled trace, call it directly from the current trace
  • Heuristic improvements
    • Don't put trace entry points too close together
    • If a trace is likely to conditionally abort early in its execution, don't insert an entry point (requires interpreter to mark blocks as unlikely if they contain a throw)
    • Add 'estimated cost' value for each opcode to mintops.def that estimates cost of running it in interp
    • Add estimated cost value for each jiterpreter opcode that estimates the quality of generated wasm code
    • Instead of using trace length heuristic, only keep traces where estimated jiterpreter cost <= interp cost
    • Ensure new system keeps short high value traces like Vector128.Add-with-SIMD
    • Improve estimated jiterpreter cost by factoring in (measured on v8 and/or spidermonkey) cost of entering a trace
    • Factor in the lack of branch prediction when estimating cost of jiterpreter branches like null checks
    • Insert entry points periodically in very large basic blocks so that the jiterp can resume when a trace ends due to being too large
  • Control flow improvements
    • Basic backwards branch implementation
    • Implement CFG tracker that assembles module at the end
    • Eliminate branch block comparison(s) for forward branches
    • Eliminate branch block comparison(s) for backward branches
    • Don't generate dispatch table entries for branch targets that cannot be reached by backward branches
    • Don't generate a dispatch table if all back branches in a trace go to a single place
    • Identify cases where each back branch target is independent, and generate separate loops
    • Record each CALL_HANDLER target and use that to implement ENDFINALLY
    • When we emit an unconditional bailout, set a 'prune opcodes' flag and don't translate any unreachable opcodes after it until we hit a branch target block
    • Outline bailouts and exits to a shared return at the end of traces
    • Change all bailouts to be the form if (cond) { br bailout_block } or br_if bailout_block
  • Monitoring phase improvements
    • Tune threshold
    • Generate a mapping table from return values (we know the possible set) to executed opcode or uop count
    • Set threshold in terms of opcodes or uops
    • Discard mapping table after monitoring phase
  • Store-to-load forwarding
    • If a series of opcodes reads and then overwrites the same dreg, drop the store/load pair for the leading opcodes, e.g. a = b * 2; a = a + 1; (this turns out to make things slower in v8 for some reason, so the prototype won't land)
    • Use a wasm local instead of leave-on-stack
    • Fully optimize out stores and loads for cases where the dreg is only read once by leaving it on the wasm stack
    • Forward constants from their most recent store to load(s) that use them ([wasm] Add limited constant propagation to the jiterpreter for ldc.i4 and ldloca #99706)
  • Re-enable early trace abort with back branches active but only once a trace is long enough to justify it
  • Add typecheck-free version of stelem_ref (only possible for sealed types, must be generated in interp) ([mono] Add unchecked version of stelem_ref interpreter opcode #99829)
  • Update the msbuild targets to generate a single export arg to emcc instead of one per exported function
  • Ensure IEEE spec compliance for the f32 and f64 opcodes that rely on libc or wasm opcodes
  • Cache the this-reference (locals[0]) in a wasm local since it can't change
  • Zero region optimizations
    • Fuse null check and length check for arrays
    • Fuse null check and length check for strings
    • Fuse null check and length check for spans
    • Fuse null check and type check for MINT_CASTCLASS/MINT_ISINST
  • Interpreter integration
    • Move cpblk unrolling into interpreter superinsn pass as mint_cpblk_imm
    • Add new null-check-free versions of hot field opcodes
    • Add new information table tracking things like known not-null state per local that are exposed to jiterpreter
    • Consume information table from jiterpreter to do null check elimination
    • Optimize size of null check bitset as described in [wasm] Re-enable null check optimization for mid-method traces #84058 (comment)
    • Investigate migrating the trace generator into transform.c and doing it during the tiering process
    • If interpreter verbose is set for a method the jiterpreter should honor that
  • SIMD
  • Raise interpreter inlining limit to 30
    • Investigate raising it a bit further
  • Caching / PGO
    • Record a list of which methods are tiered in the interp so they can tier immediately on future runs
    • Record a list of which traces we compile so that we can compile them early on future runs
    • Cache jitted traces across page loads
    • Cache do_jit_call trampolines across page loads
    • Cache interp_entry wrappers across page loads
  • Make sure that call_handler/leave work correctly in the event that we bail out from a trace into the interp ([wasm] Jiterpreter implementation of CALL_HANDLER is incorrect #98577)
  • Cleanup
    • Remove most jiterp cprop once we can rely on the interpreter to do it, for correctness reasons
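
For the cost-based heuristic items above, a minimal sketch of the keep/discard decision might look like the following. The opcode names, the cost numbers and the fixed `TRACE_ENTRY_COST` are all hypothetical placeholders, not values from mintops.def or measurements from v8 or spidermonkey.

```python
# Hypothetical per-opcode cost estimates in arbitrary units.
# These tables are illustrative only; nothing here is read from mintops.def.
INTERP_COST = {"add_i4": 4, "ldfld_i4": 6, "br": 3, "v128_add": 20}
JITERP_COST = {"add_i4": 1, "ldfld_i4": 3, "br": 2, "v128_add": 2}

# Assumed fixed overhead of entering a compiled trace (would need measuring).
TRACE_ENTRY_COST = 10


def estimated_costs(opcodes):
    """Return (interp_cost, jiterp_cost) for a candidate trace."""
    interp = sum(INTERP_COST[op] for op in opcodes)
    jiterp = TRACE_ENTRY_COST + sum(JITERP_COST[op] for op in opcodes)
    return interp, jiterp


def should_keep_trace(opcodes):
    """Keep a trace only when estimated jiterpreter cost <= interp cost."""
    interp, jiterp = estimated_costs(opcodes)
    return jiterp <= interp
```

Under these made-up numbers a one-opcode SIMD trace is kept because the per-opcode saving outweighs the entry cost, while a short scalar trace is discarded. A pure trace-length threshold cannot make that distinction.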

Archived items

  • Write a custom assembler and use it to generate and inline do-jit-call and simd detect modules ( #81691 )
  • Also unroll memcpy like memset
  • Investigate possible startup time regressions
  • Investigate possible .wasm size regressions
  • Update memmove unroller to ensure it does the correct thing for overlapping src/dest
  • Enable jiterpreter jitcall and interp_entry JITs by default
  • Enable jiterpreter traces by default
  • Don't bail out for safepoints
    • Do the 'is a safepoint needed' check inline in the trace instead of in the import
  • Inline strlen into traces
  • Inline getchr_ref into traces
  • Inline getitem_span into traces
  • Inline get_element_address_with_size_ref into traces
  • Optimize out the eip local and initialization for traces containing no branches
  • Generate import section after generating function body and omit unused imports
  • Do another pass over intrinsics and superinsns to add any missing ones (like the log2 used for vectorization)
  • Remove generated opcode info table and fetch opcode info from the interpreter's tables on demand to reduce file size
  • Don't discard known not-null / known constant information when crossing branches, only branch targets
  • Migrate configuration to options.h (requires improvements to the API)
  • Verify that no debugging scenarios regress
  • Better error handling for jiterpreter runtime failures (shut them off after a handful of JIT failures to avoid spamming the console and wasting CPU time)
  • Optimize out memory.fill for common sizes (it produces an expensive function call on x86 and x64)
  • Handle jiterpreter opcodes in non-wasm interp using the same path as other unreachable opcodes
  • Fix floating point compares in jiterpreter
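
On the memmove unroller item: one way an unrolled copy can stay correct for overlapping src/dest is to issue every load before any store. The sketch below models linear memory as a bytearray. The function names and the 8-byte chunk size are assumptions made for illustration, not the jiterpreter's actual code.

```python
def unrolled_memmove(mem, dest, src, count, chunk=8):
    """Copy count bytes from src to dest, loading all chunks before storing any."""
    loaded = []
    offset = 0
    # Phase 1: perform all loads while the source region is still untouched.
    while offset < count:
        size = min(chunk, count - offset)
        loaded.append((offset, bytes(mem[src + offset:src + offset + size])))
        offset += size
    # Phase 2: perform all stores; overlap can no longer corrupt the source.
    for offset, data in loaded:
        mem[dest + offset:dest + offset + len(data)] = data


def interleaved_copy(mem, dest, src, count, chunk=8):
    """Load/store one chunk at a time, front to back.

    This is only safe when the regions do not overlap or when dest is
    below src; it is included to show the failure mode.
    """
    offset = 0
    while offset < count:
        size = min(chunk, count - offset)
        mem[dest + offset:dest + offset + size] = mem[src + offset:src + offset + size]
        offset += size
```

With dest = src + 4 and a 16-byte copy, the interleaved version overwrites part of the second source chunk before reading it, while the load-first version preserves memmove semantics.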
@kg kg added the arch-wasm WebAssembly architecture label Nov 15, 2022
@kg kg self-assigned this Nov 15, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 15, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 15, 2022

ghost commented Nov 15, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

The jiterpreter has pending work needed:

  • Migrate configuration to options.h (requires improvements to the API)
  • Enable jiterpreter features by default
  • Better error handling for jiterpreter runtime failures (shut them off after a handful of JIT failures to avoid spamming the console and wasting CPU time)
  • Run statistics on blazor applications once the jiterpreter is integrated, to identify any instructions that need to be added
  • Investigate integrating jit calls directly into compiled traces
  • Remove more unnecessary transition/wrapper glue from do_jit_call and interp_entry paths, as seen here: image
  • Threading support (incomplete draft to-do list)
    • Synchronize wasm function pointer table growth across threads
    • Ensure empty function pointer slots are filled with appropriate 'dummy' functions so that threads will not crash when calling them
    • When jitting a new function, RPC the wasm blob or compiled module to threads so they can register the pointer
    • Thread-safe interpreter opcode patching
    • Thread-safe do_jit_call pointer/cache updates
  • Caching
    • Cache jitted traces across page loads
    • Cache do_jit_call trampolines across page loads
    • Cache interp_entry wrappers across page loads
Author: kg
Assignees: kg
Labels: arch-wasm

Milestone: -

@teo-tsirpanis teo-tsirpanis added area-Codegen-meta-mono and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Nov 15, 2022
@lewing lewing added this to the 8.0.0 milestone Nov 18, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Nov 18, 2022
@kg kg changed the title from "[wasm] Tracking issue: Jiterpreter cleanup and improvements" to "[wasm] Jiterpreter tracking issue" Mar 17, 2023

ghost commented Jun 20, 2023

Tagging subscribers to this area: @BrzVlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

Issue Details

The jiterpreter (#76477) has pending work needed:

  • Introduce a Jiterpreter CI lane that sets all the tiering thresholds low so that we flush out any issues with obscure interp opcodes or cold code
  • Investigate integrating jit calls directly into compiled traces
  • Investigate integrating icalls directly into compiled traces
  • Run statistics on blazor applications once the jiterpreter is integrated, to identify any instructions that need to be added
  • Remove more unnecessary transition/wrapper glue from do_jit_call and interp_entry paths, as seen here: image
    • Maintain a table of vtable slots containing interp_entry _in wrappers, then patch the vtables (design pending)
    • When generating dedicated do_jit_call routine, punch through the _out wrapper (model on the mini-generic-sharing.c code generator)
    • Optimize direct jit calls to turn the common ldloca sp + offset -> tnn.load pair into tnn_load offset
    • Optimize out passing of ftndesc arg to direct jit call wrappers, target and rgctx can be compiled in (not possible due to generic sharing)
  • Cache interpreter stack locals in wasm locals, then flush them back to the interpreter stack on exit
  • Cache non-volatile fields in wasm locals, then flush them back to the heap on exit
  • Threading support (incomplete draft to-do list)
    • Synchronize wasm function pointer table growth across threads
    • Ensure empty function pointer slots are filled with appropriate 'dummy' functions so that threads will not crash when calling them
    • When jitting a new function, RPC the wasm blob or compiled module to threads so they can register the pointer
    • Thread-safe interpreter opcode patching
    • Thread-safe do_jit_call pointer/cache updates
  • Multi-trace optimizations
    • For traces with an offset other than 0 (large ones only?) attempt to reuse other existing traces?
    • Stop compiling traces when we encounter an already-compiled trace (likely function prologue -> loop body)
    • When we encounter an already compiled trace, call it directly from the current trace
  • Heuristic improvements
    • Don't put trace entry points too close together
    • If a trace is likely to conditionally abort early in its execution, don't insert an entry point (requires interpreter to mark blocks as unlikely if they contain a throw)
    • Identify causes of heuristic accuracy only being ~95% on S.R.T.
    • Add 'estimated cost' value for each opcode to mintops.def that estimates cost of running it in interp
    • Add estimated cost value for each jiterpreter opcode that estimates the quality of generated wasm code
    • Instead of using trace length heuristic, only keep traces where estimated jiterpreter cost <= interp cost
    • Ensure new system keeps short high value traces like Vector128.Add-with-SIMD
    • Improve estimated jiterpreter cost by factoring in (measured on v8 and/or spidermonkey) cost of entering a trace
    • Factor in the lack of branch prediction when estimating cost of jiterpreter branches like null checks
  • Control flow improvements
    • Basic backwards branch implementation
    • Implement CFG tracker that assembles module at the end
    • Eliminate branch block comparison(s) for forward branches
    • Eliminate branch block comparison(s) for backward branches
    • Don't generate dispatch table entries for branch targets that cannot be reached by backward branches
    • Don't generate a dispatch table if all back branches in a trace go to a single place
    • Identify cases where each back branch target is independent, and generate separate loops
    • Record each CALL_HANDLER target and use that to implement ENDFINALLY
    • When we emit an unconditional bailout, set a 'prune opcodes' flag and don't translate any unreachable opcodes after it until we hit a branch target block
    • Outline bailouts and exits to a shared return at the end of traces
    • Change all bailouts to be the form if (cond) { br bailout_block } or br_if bailout_block
  • Monitoring phase improvements
    • Tune threshold
    • Fix Span<byte>.Reverse regression
    • Generate a mapping table from return values (we know the possible set) to executed opcode or uop count
    • Set threshold in terms of opcodes or uops
    • Discard mapping table after monitoring phase
  • Store-to-load forwarding
    • If a series of opcodes reads and then overwrites the same dreg, drop the store/load pair for the leading opcodes, e.g. a = b * 2; a = a + 1; (this turns out to make things slower in v8 for some reason, so the prototype won't land)
    • Use a wasm local instead of leave-on-stack
    • Fully optimize out stores and loads for cases where the dreg is only read once by leaving it on the wasm stack
  • Re-enable early trace abort with back branches active but only once a trace is long enough to justify it
  • Add typecheck-free version of stelem_ref (only possible for sealed types, must be generated in interp)
  • Update the msbuild targets to generate a single export arg to emcc instead of one per exported function
  • Ensure IEEE spec compliance for the f32 and f64 opcodes that rely on libc or wasm opcodes
  • Zero region optimizations
    • Fuse null check and length check for arrays
    • Fuse null check and length check for strings
    • Fuse null check and length check for spans
    • Fuse null check and type check for MINT_CASTCLASS/MINT_ISINST
  • Interpreter migration
    • Move cpblk unrolling into interpreter superinsn pass as mint_cpblk_imm
    • Add new null-check-free versions of hot field opcodes
    • Add new information table tracking things like known not-null state per local that are exposed to jiterpreter
    • Consume information table from jiterpreter to do null check elimination
    • Optimize size of null check bitset as described in [wasm] Re-enable null check optimization for mid-method traces #84058 (comment)
    • Investigate migrating the trace generator into transform.c and doing it during the tiering process
  • SIMD
    • Implement interpreter V128 intrinsics
    • Implement PackedSimd intrinsics
    • Implement PackedSimd in interpreter or implement a jiterpreter passthrough mechanism
    • Identify and fix the simd issue that causes testResults XML truncation on CI
    • Enable interpreter V128 support on WASM by default
    • Enable PackedSimd in interpreter mode by default
    • Implement I2 and I4 shuffles
    • Use splat encoding for v128.const 0 once v8 ships optimization for it, or use an implicitly zero-initialized local
    • Optimize constant I2 and I4 shuffle vectors
    • Implement the rest of PackedSimd
  • Raise interpreter inlining limit to 30
    • Investigate raising it a bit further
  • Caching
    • Record a list of which methods are tiered in the interp so they can tier immediately on future runs
    • Record a list of which traces we compile so that we can compile them early on future runs
    • Cache jitted traces across page loads
    • Cache do_jit_call trampolines across page loads
    • Cache interp_entry wrappers across page loads
  • Interp integration
    • If interpreter verbose is set for a method the jiterpreter should honor that

Archived items

  • Write a custom assembler and use it to generate and inline do-jit-call and simd detect modules ( #81691 )
  • Also unroll memcpy like memset
  • Investigate possible startup time regressions
  • Investigate possible .wasm size regressions
  • Update memmove unroller to ensure it does the correct thing for overlapping src/dest
  • Enable jiterpreter jitcall and interp_entry JITs by default
  • Enable jiterpreter traces by default
  • Don't bail out for safepoints
    • Do the 'is a safepoint needed' check inline in the trace instead of in the import
  • Inline strlen into traces
  • Inline getchr_ref into traces
  • Inline getitem_span into traces
  • Inline get_element_address_with_size_ref into traces
  • Optimize out the eip local and initialization for traces containing no branches
  • Generate import section after generating function body and omit unused imports
  • Do another pass over intrinsics and superinsns to add any missing ones (like the log2 used for vectorization)
  • Remove generated opcode info table and fetch opcode info from the interpreter's tables on demand to reduce file size
  • Don't discard known not-null / known constant information when crossing branches, only branch targets
  • Migrate configuration to options.h (requires improvements to the API)
  • Verify that no debugging scenarios regress
  • Better error handling for jiterpreter runtime failures (shut them off after a handful of JIT failures to avoid spamming the console and wasting CPU time)
  • Optimize out memory.fill for common sizes (it produces an expensive function call on x86 and x64)
  • Handle jiterpreter opcodes in non-wasm interp using the same path as other unreachable opcodes
  • Fix floating point compares in jiterpreter
Author: kg
Assignees: kg
Labels: arch-wasm, area-Codegen-Interpreter-mono

Milestone: 8.0.0

@SamMonoRT SamMonoRT added the tracking This issue is tracking the completion of other related issues. label Oct 4, 2023
@SamMonoRT SamMonoRT changed the milestone from 8.0.0 to 9.0.0 Oct 4, 2023
@SamMonoRT

Moving tracking issues to 9.0
