Pr/batched libs #1345
… so that on OSX they will be named .so instead of .dylib. Moved new arithmetic tests into their own subdirectory arithmetic-cov so that the existing arithmetic test can run in non-batched mode, as it uses printf, which is still TBD for the batched version. Fixed bug where shader ops weren't being cleaned up unless batched JIT had occurred. Fixed compilation issue with newer LLVM versions and LLVMUtil::zero_if. Signed-off-by: Alex M. Wells <[email protected]>
OK, this is all fine for now, will merge. I know it's a work in progress and more changes are coming right behind it. I do have a few minor concerns/requests, which we can address in subsequent patches:
We asked around in the Nov.-Dec. timeframe, and there were still people using AVX machines at that time.
@sfriedmapixar Thanks, good to know.
Because the batched OSL uses native Structure of Arrays data types, we can still get a nice speedup with AVX (even though it has no gather/scatter instructions). Whether you would want to build and deploy AVX-1 in a renderer is another story, but I felt it should build and work as part of OSL. In the future, if a USE_BATCHED=... setting were added, perhaps its absence could keep the additional batched_*.cpp files from being built.
Yeah, that's what I'm thinking, Alex.
Description
Added building of ISA-specific shared libraries lib_b16_AVX512_oslexec, lib_b8_AVX2_oslexec, lib_b8_AVX512_oslexec, lib_b8_AVX_oslexec to house precompiled OSL library functions that execute over batches of 8 or 16 in SIMD for the ISA. Compiler flags for OpenMP simd code gen and ISA targets have been added for the Intel(R) C++ Compiler (ICC) and Clang (newer versions of GCC, 6+, might be possible but are untested).
Implement batched LLVM code gen for: generic function calls, useparam, compare ops, addition, subtraction, multiplication, division, modulus, assignment, component reference, construct triple, construct color, derivative extraction. Stubbed out all other code gen functions with TBD asserts.
Populate OpDescriptors with valid wide versions of the LLVM-generating routines.
Added wide_opalgebraic.cpp, which uses X-macros (instead of #define like llvm_ops.cpp) to define wide (batched) versions of OSL library functions: sqrt, inversesqrt, floor, ceil, trunc, round, sign, abs, fabs, fmod, and step.
The X-macro wrappers follow a pattern: they manufacture a target-specific library function name with enough parameter types embedded in it to uniquely identify it (vs. other versions), then declare local Wide<T> or Masked<T> wrappers that convert any void */char * parameters to references to Block<T,WidthT> data blocks of wide SOA data. An explicit OpenMP simd loop then iterates over the data lanes, extracts local scalar values from the Wide|Masked wrappers, and inlines the scalar implementation of the library function using those values. Finally, the result is written back out to the data lane inside the Wide|Masked wrapper. This paradigm allows scalar implementations to be reused inside simd loops and avoids having to use intrinsics or assembly. It also allows the same implementation to be recompiled for different target ISAs and various widths (8|16).
The build system will create a copy of each wide_*.cpp under a target- and batch-size-specific name, b(8|16)_(AVX512|AVX2|AVX)_wide*.cpp, and build it with different -D__OSL_TARGET_ISA and -D__OSL_WIDTH values, which in turn will manufacture unique function names.
Sometimes scalar algorithms/functions can be refactored to provide better performance when executing inside a SIMD loop. sfmath.h (SIMD friendly math) houses these alternative math functions, although many improvements have already been moved into OIIO as they benefit (or do no harm to) scalar code gen.
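The X-macro wrapper pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in wide_opalgebraic.cpp: the `Block`, `Wide`, and `Masked` types here are simplified stand-ins, the `SKETCH_*` macros and `sketch_b8_*` function names are hypothetical, and the name manufacturing is hard-coded to width 8 rather than derived from -D__OSL_TARGET_ISA and -D__OSL_WIDTH.

```cpp
#include <cmath>

constexpr int kWidth = 8;  // stand-in for the -D__OSL_WIDTH build define

// Structure-of-arrays block: one value per SIMD lane.
template <typename T, int WidthT>
struct Block {
    T data[WidthT];
};

// Hypothetical wrapper: views an untyped pointer as a Block<T, WidthT>.
template <typename T, int WidthT = kWidth>
class Wide {
public:
    explicit Wide(void* ptr) : m_block(*static_cast<Block<T, WidthT>*>(ptr)) {}
    T get(int lane) const { return m_block.data[lane]; }
    void set(int lane, T value) { m_block.data[lane] = value; }
private:
    Block<T, WidthT>& m_block;
};

// Hypothetical wrapper: like Wide, but writes only to lanes enabled in the mask.
template <typename T, int WidthT = kWidth>
class Masked {
public:
    Masked(void* ptr, unsigned mask)
        : m_block(*static_cast<Block<T, WidthT>*>(ptr)), m_mask(mask) {}
    void set(int lane, T value) {
        if (m_mask & (1u << lane)) m_block.data[lane] = value;
    }
private:
    Block<T, WidthT>& m_block;
    unsigned m_mask;
};

// X-macro list: each entry is (library name, scalar implementation).
#define SKETCH_UNARY_OPS(X) \
    X(sqrt, std::sqrt)      \
    X(floor, std::floor)    \
    X(ceil, std::ceil)

// One expansion defines an unmasked and a masked wide function. Each loop
// pulls a scalar out of a lane, runs the scalar implementation, and writes
// the result back to the same lane.
#define SKETCH_DEFINE_WIDE(name, scalar_fn)                          \
    extern "C" void sketch_b8_##name##_WfWf(void* r_, void* a_) {    \
        Wide<float> wr(r_);                                          \
        Wide<float> wa(a_);                                          \
        _Pragma("omp simd")                                          \
        for (int lane = 0; lane < kWidth; ++lane) {                  \
            float a = wa.get(lane);                                  \
            wr.set(lane, scalar_fn(a));                              \
        }                                                            \
    }                                                                \
    extern "C" void sketch_b8_##name##_WfWf_masked(void* r_,         \
                                                   void* a_,         \
                                                   unsigned mask) {  \
        Masked<float> wr(r_, mask);                                  \
        Wide<float> wa(a_);                                          \
        _Pragma("omp simd")                                          \
        for (int lane = 0; lane < kWidth; ++lane) {                  \
            float a = wa.get(lane);                                  \
            wr.set(lane, scalar_fn(a));                              \
        }                                                            \
    }

SKETCH_UNARY_OPS(SKETCH_DEFINE_WIDE)
```

Adding another function to the list yields both the masked and unmasked entry points without touching the loop body, and recompiling the same source with different ISA flags is what would produce the per-target variants. The `omp simd` pragma only takes effect when the compiler is invoked with OpenMP simd support enabled; otherwise the loops run as ordinary scalar loops.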
Made ShadingContext remember the ShaderGroup it just optimized. This allows symbol queries without actually JITing or executing a shader.
Improved TestShade to not actually execute the shader during setup_output_images, but to instead explicitly JIT scalar or batched version of the ShaderGroup (primarily to make sure JIT happens during the "setup" stage vs. lazily later).
Fix TestShade to explicitly set the number of OIIO worker threads to avoid overhead (and debugging confusion) of OIIO thread pools being created even when "-t 1" was requested.
Modified ShadingSystem to perform group_post_jit_cleanup (delete operations of shader group) only if both scalar and wide JITs have occurred or if RendererServices doesn't support batching. Without this change, the operations were being deleted before a batched JIT could occur.
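The cleanup condition stated above amounts to a small predicate. The sketch below only restates that sentence in code; `GroupJitState` and `may_run_post_jit_cleanup` are hypothetical names, and the actual bookkeeping inside ShadingSystem may look quite different.

```cpp
// Hypothetical per-group bookkeeping; the field names are illustrative only.
struct GroupJitState {
    bool scalar_jitted  = false;
    bool batched_jitted = false;
};

// Returns true when the shader group's ops may be deleted after a JIT.
bool may_run_post_jit_cleanup(const GroupJitState& state,
                              bool renderer_supports_batching)
{
    // With no batched path available, nothing else will need the ops.
    if (!renderer_supports_batching)
        return true;
    // Otherwise keep the ops until both code paths have been generated.
    return state.scalar_jitted && state.batched_jitted;
}
```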
Added utility macros __OSL_CONCAT, __OSL_CONCAT3, ..., __OSL_CONCAT10 to be able to easily manufacture function names.
Added macro __OSL_WIDE_PVT to give each target specific library its own namespace avoiding collisions should multiple libraries be loaded.
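A per-target namespace macro of this kind might work along the following lines. This is a sketch under assumptions: `SKETCH_WIDE_PVT` and the `pvt_b*` namespace names are hypothetical, and the two "libraries" are simulated in one translation unit by redefining the stand-in build defines, whereas the real build compiles each copy separately.

```cpp
#define SKETCH_CONCAT4_INNER(a, b, c, d) a##b##c##d
#define SKETCH_CONCAT4(a, b, c, d) SKETCH_CONCAT4_INNER(a, b, c, d)

// Expands at each point of use, so it picks up the current width and ISA.
#define SKETCH_WIDE_PVT SKETCH_CONCAT4(pvt_b, SKETCH_WIDTH, _, SKETCH_ISA)

// First "library": width 8, AVX2. Lands in namespace pvt_b8_AVX2.
#define SKETCH_WIDTH 8
#define SKETCH_ISA AVX2
namespace SKETCH_WIDE_PVT {
inline int lanes() { return SKETCH_WIDTH; }
}
#undef SKETCH_WIDTH
#undef SKETCH_ISA

// Second "library": width 16, AVX512. Lands in namespace pvt_b16_AVX512.
#define SKETCH_WIDTH 16
#define SKETCH_ISA AVX512
namespace SKETCH_WIDE_PVT {
inline int lanes() { return SKETCH_WIDTH; }
}
#undef SKETCH_WIDTH
#undef SKETCH_ISA
```

Both namespaces define a function called `lanes`, yet the definitions do not collide, which is the property that matters when several target-specific libraries are loaded into one process.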
Added sfm::negate(const T &x) with optimized implementation.
Disabled some unreferenced-function warnings for ICC and removed some unused functions from batched_analysis.cpp.
Updated BatchedBackendLLVM to match behavior of BackendLLVM by configuring its LLVM_Util based on ShadingSystem attributes.
Disabled clang-format for X-macro based building of initializer arrays to prevent clang-format from reordering the #include files.
Fix control flow in factory function TargetLibraryHelper::build to not trigger assert unnecessarily.
Limit list of OSL library functions in builtindecl_wide_xmacro to just those we have implemented so far because all functions listed must exist in the target specific library for it to successfully be loaded and resolved.
Added LLVM_Util::op_zero_if(llvm::Value *cond, llvm::Value *v) which allows its implementation to work around an LLVM issue where expensive instructions to produce the value (div, sqrt, etc) are duplicated (once with a mask, once without).
Fix bug in ShadingSystem::supports_batch_execution_at where jit_fma was being accidentally negated, causing the rest of the logic to fail.
Implement ShadingSystem::BatchedExecutor<WidthT>::jit_group.
Tests
Extended testsuite framework to look for a file named "BATCHED", which causes another run of the test with TESTSHADE_BATCHED=1.
Added new tests with BATCHED enabled for passing shaderglobal values, and increased coverage of arithmetic tests with reference images for float, color, point, vector, normal data types along with Dx Dy results.
Adopted X-macros to build OSL test shaders, which enforces coverage, unifies names, and reduces the number of unique shaders to maintain.