-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pr/batched rs #1276
Pr/batched rs #1276
Conversation
to store upto WidthT instances of ShaderGlobal data. The data for a batch is broken into underlying uniform and varying data structures. Uniform contains a fixed set of shader global values that are shared between all instances. Varying has Block<DataT, WidthT> members so each instance can hold a different value. The Block data structure is specialized to hold the underlying data type in a SIMD friendly Structure of Arrays (SOA) data layout but provides an array subscript interface to a proxy that can import/export the DataT in its orignal data layout. Introduce template<int WidthT> class ShadingSystem::BatchedExecutor as an interface to execute batches of shader globals through a shader network. BatchedExecutor provides similar execute methods that accept a BatchedShaderGlobals and batchSize instead of individual ShaderGlobal(s). This is just the interface, batched backend implemention to follow in subsequent pull requests. Introduce template<typename DataT, int WidthT> Wide to hold a reference to Block<DataT, WidthT> and provide access to the underlying SOA data layout. Wide can handle array types and const correctness. The Wide adapter is lightweight and can be copied by value, used as function parameter. Once batched execution finishes, the resulting symbol address can be passed to a Wide<DataT, Width> to access the resulting data in a type safe manner. Updated testshade with command line option "--batched" causing batched execution of 16 or 8 instances through the shader network and to extract the results. A later pull request will add batched interface to renderer services.
…ich has no overload for double
… during SIMD execution, BatchedTextureOptions, Masked<typename DataT, int WidthT>
Conflicts: src/include/OSL/wide.h
Conflicts: src/include/OSL/oslexec.h src/include/OSL/wide.h
…ions is_overriden_*(). Improved some comments for BatchedRendererServices.h Workaround some QT macro issues.
…its methods are never directly called. Fixed up issues with OSL_USING_DATA_WIDTH(WIDTH) macro.
…and ::first_on() Moved documentation for Wide, Masked, and other classes in wide.h from top overview to be near class definition. Fixed some typos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I've made several minor comments to consider, which can be tackled now, or in a subsequent review, or not at all. Just let me know when you want me to merge this one.
static_assert(offsetof(OIIO::TextureOptBatch, sblur) % 64 == 0, | ||
"oops unaligned wide variable"); | ||
static_assert(offsetof(OIIO::TextureOptBatch, tblur) % 64 == 0, | ||
"oops unaligned wide variable"); | ||
static_assert(offsetof(OIIO::TextureOptBatch, rblur) % 64 == 0, | ||
"oops unaligned wide variable"); | ||
static_assert(offsetof(OIIO::TextureOptBatch, swidth) % 64 == 0, | ||
"oops unaligned wide variable"); | ||
static_assert(offsetof(OIIO::TextureOptBatch, twidth) % 64 == 0, | ||
"oops unaligned wide variable"); | ||
static_assert(offsetof(OIIO::TextureOptBatch, rwidth) % 64 == 0, | ||
"oops unaligned wide variable"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't even want to change it in this PR, but just food for future thought: Would it be better to tuck away these static checks in just one .cpp module (any one will do, right?) rather than have them parsed and evaluated for every translation unit that includes the header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on one's confidence level, they could be moved into some type of unit test that builds and enforces this. But that moves intent further away from enforcement. Do we think there is a compile time cost (measurable, I'm sure its not free). Honestly the motivation for moving it out would just be so reader doesn't have to see it there.
src/include/OSL/mask.h
Outdated
#if __INTEL_COMPILER | ||
if (value_width <= 32) | ||
return _mm_popcnt_u32(m_value); | ||
else | ||
return _mm_popcnt_u64(m_value); | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a clause for gcc and clang to use the __builtin_popcount ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to add that, I didn't have all the gcc test matrix setup locally, does clang have a similar builtin? However if gcc and clang recognize the idiom, that might be more maintainable. Will have to test in godbolt.org
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified idiom is not detected, correct code though. I'll try adding case for clang, gcc, then else to the idiom
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like __builtin_popcount
works with both gcc and clang, and ought to be able to be counted on to do the best thing on any architecture that those compilers support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stand corrected, gcc and clang both recognized the idiom when targeting an ISA that supports popcount instruction. I'll update it to be compiler intrinsics anyway, with a comment to use std::bitcount for c++20 in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include a comment
// FIXME(C++20)
as a breadcrumb that will let us find those spots when it comes time to fix things up nicely for C++20.
src/include/OSL/mask.h
Outdated
#else // Linux, OS X, gcc or clang | ||
return __builtin_ctzll(m_value); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change this to an #if defined(__GNUC__) || defined(__clang__)
, then have a final #else
clause with an appropriate #error
inside, just in case anybody tries to compile this with some unforeseen compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds reasonable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How's this:
#if (defined(_WIN32) || defined(_WIN64))
unsigned long index;
_BitScanForward64(&index, m_value);
return static_cast<int>(index);
#elif defined(__GNUC__) || defined(__clang__)
return __builtin_ctzll(m_value);
#else
#error unknown compiler, update Mask<int WidthT>::first_on() to support compiler
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to build OSL in Windows, from latest changes
this line broke the build
OpenShadingLanguage\src\include\OSL/mask.h(166,20): error C3861: '__builtin_ctz': identifier not found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, I see. Hang on, I will propose a fix...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @lgritz
I can test it on Windows, any changes you apply, I am preparing Docs for building in windows, so can count on me on any OS based changes on the repo (for building in windows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1291
template<typename DataT, int WidthT> | ||
struct Masked : public pvt::MaskedImpl<DataT, WidthT> { | ||
static_assert(std::is_const<DataT>::value == false, | ||
"Masked<> is only valid when DataT is NOT const"); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that for this and other templates, maybe the overview of what they do and why should be here in doxygen-like comments? I know I saw you explain it elsewhere, but I think probably each of these classes should have their primary documentation right at their declaration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the top there is a "overview" with documentation by each class's forward declaration, I could repeat (maybe more detailed) the specific portion above each class, or move it down above each class..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think maybe just move the explanations at the top to be near each class so that if we want to generate automatic docs for them later, it will all just work. I think people mostly expect to find the main documentation for a class at the point of its declaration.
This isn't really critical, we can move it around later, but I'm just wanting to be very careful to leave all the right breadcrumbs for whoever needs to modify or use these classes next.
src/osltoy/osltoyapp.cpp
Outdated
// QT's extension foreach defines a foreach macro which interferes | ||
// with an OSL internal foreach method. So we will undefine it here | ||
#undef foreach | ||
// It is recommended any uses ofQT's foreach be migrated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ofQT -> of Qt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/osltoy/osltoymain.cpp
Outdated
// QT's extension foreach defines a foreach macro which interferes | ||
// with an OSL internal foreach method. So we will undefine it here | ||
#undef foreach | ||
// It is recommended any uses ofQT's foreach be migrated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ofQT -> of Qt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
I just updated the files incorporating the feedback so far. |
All LGTM except for the one comment about __builtin_popcount vs __builtin_popcountll. Don't worry about the Mac failure, that's unrelated to your changes and is something I'm tracking down separately. |
__builtin_popcountll(m_value); added, so should be good to go |
Add Structure of Array data layout wrappers, see <OSL/wide.h> for detailed documentation and implementation:
Add BatchedRendererServices interface for renderers to support batched execution:
It mimics RendererServices but its methods parameters are Wide and Masked parameters where underlying DataT is stored in Structure Of Arrays (SOA) layout with WidthT entries. Concrete BatchedRendererServices can implement the methods by looping over WidthT, extract per lane values from Wide & Masked parameters with val = wide_param[i], use the value to calculate/lookup a result, then assign results to Masked parameters with masked_param[i] = result. Some BatchedRendererServices methods have a "_uniform" version which can be called which code generator has proved that the parameters cannot be varying. Additional virtual methods exist to identify if a particular method is overriden, required to be able to generate calls to target specific optimized default implementation vs. calling a virtual function.
Add BatchedTextureOptions which is binary compatible with OIIO::TextureOptBatch
Add BatchedTextureOutputs which bundles result *, alpha , errormessage, bool resultHasDerivs, bool alphaHasDerivs, and a Mask into a single object which can produce correct MaskData for result(), alpha(), and message() all sharing a single Mask instance.
Workaround some QT "foreach" macro issues. It is recommended any uses of QT's foreach be migrated to use C++11 range based loops.