Pr/batched libs #1345

Merged Apr 2, 2021 (2 commits)

AlexMWells (Contributor) commented Mar 11, 2021

Description

Added building of ISA-specific shared libraries lib_b16_AVX512_oslexec, lib_b8_AVX2_oslexec, lib_b8_AVX512_oslexec, and lib_b8_AVX_oslexec to house precompiled OSL library functions that execute over batches of 8 or 16 in SIMD for the ISA. Compiler flags for OpenMP simd code gen and ISA targets have been added for Intel(r) C++ Compiler (ICC) and Clang (newer versions of GCC, 6+, might be possible but are untested).

Implement batched llvm code gen for: generic function calls, useparam, compare ops, addition, subtraction, multiplication, division, modulus, assignment, component reference, construct triple, construct color, derivative extraction. Stubbed out all other code gen functions with TBD asserts.

Populate OpDescriptors with valid wide versions of the LLVM-generating routines.

Added wide_opalgebraic.cpp, which uses X-macros (instead of #define like llvm_ops.cpp) to define wide (batched) versions of OSL library functions: sqrt, inversesqrt, floor, ceil, trunc, round, sign, abs, fabs, fmod, and step.

The X-macro wrappers follow a pattern. First, a target-specific library function name is manufactured with enough parameter types embedded in it to uniquely identify it (vs. other versions). Next, local Wide<T> or Masked<T> wrappers are declared that convert any void */char * parameters to references to Block<T,WidthT> data blocks of wide SOA data. Then an explicit OpenMP simd loop iterates over the data lanes, extracting local scalar values from the Wide|Masked wrappers and inlining the scalar implementation of the library function on those values. Finally, the result is written back out to the data lane inside the Wide|Masked wrapper.

This paradigm allows scalar implementations to be reused inside simd loops and avoids having to use intrinsics or assembly. It also allows the same implementation to be recompiled for different target ISAs and various widths (8|16). The build system copies each wide_*.cpp to a target- and batch-size-specific file named b(8|16)_(AVX512|AVX2|AVX)_wide*.cpp and builds it with different -D__OSL_TARGET_ISA and -D__OSL_WIDTH values, which in turn manufacture unique function names.

Sometimes scalar algorithms/functions can be refactored to provide better performance when executing inside a SIMD loop. sfmath.h (SIMD friendly math) houses these alternative math functions, although many improvements have already been moved into OIIO, as they benefit (or do no harm to) scalar code gen.
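
The wrapper pattern described above can be sketched roughly as follows. This is only an illustration, not the code in wide_opalgebraic.cpp: every `SKETCH_*` macro, the simplified `Block` and `Wide` types, the `_Wf_Wf` name suffix, and the `scalar_sqrt`/`scalar_sign` functions are hypothetical stand-ins, and `SKETCH_TARGET_ISA`/`SKETCH_WIDTH` merely play the role of the `-D__OSL_TARGET_ISA` and `-D__OSL_WIDTH` build defines.

```cpp
#include <cmath>

// Hypothetical stand-ins for the per-target build defines.
#ifndef SKETCH_TARGET_ISA
#    define SKETCH_TARGET_ISA AVX2
#endif
#ifndef SKETCH_WIDTH
#    define SKETCH_WIDTH 8
#endif

// Only emit the OpenMP pragma when the compiler reports OpenMP support.
#if defined(_OPENMP)
#    define SKETCH_OMP_SIMD_LOOP _Pragma("omp simd")
#else
#    define SKETCH_OMP_SIMD_LOOP
#endif

// Two-level concatenation so the defines are expanded before pasting.
#define SKETCH_CONCAT5_IMPL(a, b, c, d, e) a##b##c##d##e
#define SKETCH_CONCAT5(a, b, c, d, e) SKETCH_CONCAT5_IMPL(a, b, c, d, e)
// With the defaults above, SKETCH_LIBRARY_NAME(_sqrt_Wf_Wf) becomes
// sketch_b8_AVX2_sqrt_Wf_Wf.
#define SKETCH_LIBRARY_NAME(suffix) \
    SKETCH_CONCAT5(sketch_b, SKETCH_WIDTH, _, SKETCH_TARGET_ISA, suffix)

// Simplified SOA block: one value per lane, stored contiguously.
template <typename T, int WidthT>
struct Block {
    T data[WidthT];
};

// Simplified wrapper that views an untyped pointer as a Block<T, WidthT>.
template <typename T, int WidthT>
class Wide {
public:
    explicit Wide(void* ptr) : m_block(*static_cast<Block<T, WidthT>*>(ptr)) {}
    T& operator[](int lane) const { return m_block.data[lane]; }

private:
    Block<T, WidthT>& m_block;
};

// Hypothetical scalar implementations that get inlined into the lane loop.
inline float scalar_sqrt(float x) { return x >= 0.0f ? std::sqrt(x) : 0.0f; }
inline float scalar_sign(float x)
{
    return x < 0.0f ? -1.0f : (x == 0.0f ? 0.0f : 1.0f);
}

// Wrapper generator: manufactures the function name, wraps the raw pointers,
// loops over the lanes, runs the scalar code, and writes the result back.
#define SKETCH_WIDE_UNARY(name, scalar_fn)                                    \
    extern "C" void SKETCH_LIBRARY_NAME(_##name##_Wf_Wf)(void* r_, void* a_)  \
    {                                                                         \
        Wide<float, SKETCH_WIDTH> result(r_);                                 \
        Wide<float, SKETCH_WIDTH> a(a_);                                      \
        SKETCH_OMP_SIMD_LOOP                                                  \
        for (int lane = 0; lane < SKETCH_WIDTH; ++lane) {                     \
            float a_lane = a[lane];                                           \
            result[lane] = scalar_fn(a_lane);                                 \
        }                                                                     \
    }

// X-macro list: one entry per library function to generate.
#define SKETCH_UNARY_OPS(X) \
    X(sqrt, scalar_sqrt)    \
    X(sign, scalar_sign)

SKETCH_UNARY_OPS(SKETCH_WIDE_UNARY)
```

Rebuilding the same file with different values for the two defines would yield differently named functions, which is the property that lets one source serve several ISA and width combinations.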

Made ShadingContext remember the ShaderGroup it just optimized. This allows symbol queries without actually JITing or executing a shader.

Improved TestShade to not actually execute the shader during setup_output_images, but to instead explicitly JIT scalar or batched version of the ShaderGroup (primarily to make sure JIT happens during the "setup" stage vs. lazily later).
Fix TestShade to explicitly set the number of OIIO worker threads to avoid overhead (and debugging confusion) of OIIO thread pools being created even when "-t 1" was requested.

Modified ShadingSystem to perform group_post_jit_cleanup (which deletes the operations of the shader group) only if both scalar and wide JITs have occurred, or if RendererServices doesn't support batching. Without this change, the operations were being deleted before a batched JIT could occur.
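
The cleanup condition described above amounts to a small predicate. The sketch below is a literal reading of that sentence, not the ShadingSystem code; the `GroupJitState` struct and `ready_for_post_jit_cleanup` function are hypothetical names introduced for illustration.

```cpp
// Hypothetical record of which JIT flavors have run for a shader group.
struct GroupJitState {
    bool scalar_jitted  = false;
    bool batched_jitted = false;
};

// Returns true when the group's operations may be deleted: either both the
// scalar and the batched JIT have happened, or the renderer cannot batch at
// all, in which case no batched JIT will ever need the operations.
inline bool ready_for_post_jit_cleanup(const GroupJitState& state,
                                       bool renderer_supports_batching)
{
    return (state.scalar_jitted && state.batched_jitted)
           || !renderer_supports_batching;
}
```

With a batching-capable renderer, a group that has only been JITed in scalar form keeps its operations, so a later batched JIT still has something to compile.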

Added utility macros __OSL_CONCAT, __OSL_CONCAT3, ..., __OSL_CONCAT10 to be able to easily manufacture function names.
Added macro __OSL_WIDE_PVT to give each target specific library its own namespace avoiding collisions should multiple libraries be loaded.
Added sfm::negate(const T &x) with optimized implementation.

Disabled some unreferenced functions warnings for ICC and removed some unused functions from batched_analysis.cpp
Updated BatchedBackendLLVM to match behavior of BackendLLVM by configuring its LLVM_Util based on ShadingSystem attributes.

Disable clang-format for X-macro-based building of initializer arrays to prevent clang-format from reordering the #include files.
Fix control flow in factory function TargetLibraryHelper::build to not trigger assert unnecessarily.

Limit list of OSL library functions in builtindecl_wide_xmacro to just those we have implemented so far because all functions listed must exist in the target specific library for it to successfully be loaded and resolved.
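
A small in-process analogy may help show why every listed function has to exist. In the sketch below, a single X-macro list drives both the declarations and an initializer array of name/pointer pairs, so listing a function that is never defined would fail at link time, much as an unresolved symbol would prevent a target library from loading. All names here (`SKETCH_BUILTIN_DECLS`, `sketch_sqrt_ff`, `find_builtin`, and so on) are hypothetical, and the list is written inline rather than pulled in through an `#include`d X-macro file.

```cpp
#include <cmath>
#include <cstddef>
#include <cstring>

// The single source of truth: only functions that are actually implemented.
#define SKETCH_BUILTIN_DECLS(X) \
    X(sketch_sqrt_ff)           \
    X(sketch_floor_ff)

// First use of the list: declare each function.
#define SKETCH_DECLARE(name) extern "C" float name(float);
SKETCH_BUILTIN_DECLS(SKETCH_DECLARE)
#undef SKETCH_DECLARE

using UnaryFn = float (*)(float);

struct BuiltinEntry {
    const char* name;
    UnaryFn fn;
};

// Second use of the list: build the initializer array. Taking the address
// of each function means a listed-but-missing definition is a link error.
#define SKETCH_ENTRY(name) { #name, &name },
static const BuiltinEntry sketch_builtins[] = {
    SKETCH_BUILTIN_DECLS(SKETCH_ENTRY)
};
#undef SKETCH_ENTRY

constexpr std::size_t sketch_builtin_count
    = sizeof(sketch_builtins) / sizeof(sketch_builtins[0]);

// The definitions that the list promises.
extern "C" float sketch_sqrt_ff(float x) { return std::sqrt(x); }
extern "C" float sketch_floor_ff(float x) { return std::floor(x); }

// Look up a function by its listed name; returns nullptr if it is not listed.
inline UnaryFn find_builtin(const char* name)
{
    for (std::size_t i = 0; i < sketch_builtin_count; ++i) {
        if (std::strcmp(sketch_builtins[i].name, name) == 0)
            return sketch_builtins[i].fn;
    }
    return nullptr;
}
```

Adding an entry to the list without supplying its definition breaks the build, which is the same reason to keep the real list limited to what has been implemented.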

Added LLVM_Util::op_zero_if(llvm::Value *cond, llvm::Value *v), whose implementation can work around an LLVM issue where the expensive instructions that produce the value (div, sqrt, etc.) are duplicated (once with a mask, once without).

Fix bug in ShadingSystem::supports_batch_execution_at where jit_fma was accidentally negated, causing the rest of the logic to fail.
Implement ShadingSystem::BatchedExecutor<WidthT>::jit_group

Tests

Extended the testsuite framework to look for a file named "BATCHED", which causes another run of the test with TESTSHADE_BATCHED=1.
Added new tests with BATCHED enabled for passing shaderglobal values, and increased coverage of arithmetic tests with reference images for float, color, point, vector, and normal data types along with Dx Dy results.
Adopted X-macros to build the OSL test shaders, which enforces coverage, unifies names, and reduces the number of unique shaders to maintain.

Checklist:

  • [x] I have read the contribution guidelines.
  • [x] I have previously submitted a Contributor License Agreement.
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
  • My code follows the prevailing code style of this project.

@AlexMWells force-pushed the PR/BatchedLibs branch 3 times, most recently from 9318aa2 to 0fdfb4a (March 12, 2021 01:33)

linux-foundation-easycla bot commented Mar 12, 2021

CLA Signed

The committers are authorized under a signed CLA.

… so that on OSX they will be named .so instead of .dylib.

Moved new arithmetic tests into their own subdirectory arithmetic-cov so that the existing arithmetic test can run in non-batched mode as it uses printf which is still TBD for the batched version.
Fixed bug where shader ops weren't being cleaned up unless a batched JIT had occurred.
Fixed compilation issue with newer LLVM versions and LLVM_Util::op_zero_if.

Signed-off-by: Alex M. Wells
lgritz (Collaborator) commented Apr 2, 2021

OK, this is all fine for now, will merge. I know it's a work in progress and more changes are coming right behind it.

I do have a few minor concerns/requests, which we can address in subsequent patches:

  • I think that the set of ISA-specific dynamic libraries should be customizable via a CMake variable, including the ability to turn it off altogether and not compile any of those modules (not unlike how we have "USE_SIMD=..." taking a comma-separated set of names that determines which HW capabilities we should build for).

  • I think that we probably also want a way to disable batch shading entirely, for users who know they don't need it, or who are running on architectures where we don't expect it to be able to run. (For example, we will surely soon need to support ARM Mac varieties.)

  • I noticed that you are currently building AVX512 16-wide and 8-wide, AVX 2.0 8-wide, and AVX 8-wide. AVX-2 I understand, but I wonder if there is a realistic use case for AVX(-1)? Maybe that's something we just don't need to build by default, and if somebody needs it, they can use the config settings to enable it. I also wonder if there are enough people who would want it at all. I don't come across a lot of AVX-1 machines. Believe it or not, we still have machines in use that are SSE4.2 and have no AVX at all, but I think the next oldest batch of machines we still own are AVX-2 (and a growing number of the newer ones are AVX-512). But no AVX-1, and I suspect it may be the same for others.

@lgritz lgritz merged commit 8727d61 into AcademySoftwareFoundation:master Apr 2, 2021
sfriedmapixar (Contributor) commented

We asked around in the Nov.-Dec. timeframe, and there were still people using AVX machines at that time.

lgritz (Collaborator) commented Apr 2, 2021

@sfriedmapixar Thanks, good to know.

AlexMWells (Contributor, Author) commented Apr 2, 2021

Because batched OSL uses native Structure of Arrays (SoA) data types, we can still get a nice speedup with AVX (even though it has no gather/scatter instructions). Whether you would want to build and deploy AVX-1 in a renderer is another story, but I felt it should build and work as part of OSL.
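
To illustrate the SoA point, here is a minimal sketch contrasting the two layouts. The `Vec3`, `Vec3Block`, `FloatBlock`, `scale_aos`, and `scale_soa` names are hypothetical and far simpler than OSL's actual Block types. In the SoA version each component occupies a contiguous run of lanes, so a compiler can use plain unit-stride vector loads and stores and does not need gather/scatter.

```cpp
// Only emit the OpenMP pragma when the compiler reports OpenMP support.
#if defined(_OPENMP)
#    define SKETCH_SIMD_LOOP _Pragma("omp simd")
#else
#    define SKETCH_SIMD_LOOP
#endif

// Array of Structures: x, y, z of one point are adjacent, so reading the
// x of consecutive points is a stride-3 access.
struct Vec3 {
    float x, y, z;
};

inline void scale_aos(Vec3* points, const float* scale, int count)
{
    for (int i = 0; i < count; ++i) {
        points[i].x *= scale[i];
        points[i].y *= scale[i];
        points[i].z *= scale[i];
    }
}

// Structure of Arrays: each component is its own contiguous array of lanes.
template <int WidthT>
struct Vec3Block {
    float x[WidthT];
    float y[WidthT];
    float z[WidthT];
};

template <int WidthT>
struct FloatBlock {
    float data[WidthT];
};

// Every access in the loop body is unit-stride across lanes.
template <int WidthT>
inline void scale_soa(Vec3Block<WidthT>& points, const FloatBlock<WidthT>& scale)
{
    SKETCH_SIMD_LOOP
    for (int lane = 0; lane < WidthT; ++lane) {
        points.x[lane] *= scale.data[lane];
        points.y[lane] *= scale.data[lane];
        points.z[lane] *= scale.data[lane];
    }
}
```

Both functions compute the same result; only the memory layout, and therefore the kind of vector load the hardware needs, differs.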

In the future, if a USE_BATCHED=... setting were added, perhaps its absence could cause the additional batched_*.cpp files not to be built.

lgritz (Collaborator) commented Apr 2, 2021

Yeah, that's what I'm thinking, Alex.
