Add col/s throughput metric in performance reports, add vector_length(NPROMA) to gang loops #42
While testing this, I noticed problems building the C variant with GNU 11. The variable declarations in the header led to duplicate symbols there, which I fixed by cleanly moving them into separate compilation units and declaring them as |
Very nice and very happy with this as is. However, we'll probably have to do the same thing for the Loki variants, and possibly retrofit this to some of the scc-cuf (hand-rolled and auto-generated) ones. We can merge this now, as is, or wait and funnel all the other variant updates into this one. Opinions?
@reuterbal Marking this as a comment for now, but happy to turn this into approval once we decide how to proceed.
The Loki variants are covered (assuming you're talking about the col/s metric), as they also use the common lib for this. CUDA variants are of course separate and either need a subsequent PR or can be handled here after a rebase.
Oh yes, of course, sorry! I got confused by only some of the driver files being present, but, of course, we'll need to fix that annotation issue on the Loki side! Ok, cool, GTG then.
A repeatedly voiced requirement was the addition of a more objective performance metric than the MFlop/s number currently reported, which is based on a historic estimate of the FLOP count for the 100 col data set.
For this reason, an additional column reporting the col/s throughput has been added to the performance output table.
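As a rough illustration of what a col/s figure expresses, the following minimal C sketch divides the number of grid columns processed by the elapsed wall-clock time. The function name `columns_per_second` and its parameters are hypothetical and not taken from the common lib; the actual implementation may differ, for example in how timings are aggregated across threads or ranks.

```c
/* Hypothetical helper, not the code from the common lib.
 *
 * ncols      : number of grid columns processed per iteration (assumed input)
 * niter      : number of timed iterations (assumed input)
 * elapsed_ms : wall-clock time for all iterations, in milliseconds
 *
 * Returns the throughput in columns per second, or 0.0 if no valid
 * timing is available.
 */
double columns_per_second(long ncols, int niter, double elapsed_ms)
{
    if (elapsed_ms <= 0.0) {
        return 0.0; /* avoid division by zero for empty or invalid timings */
    }
    /* columns * iterations / seconds, with milliseconds converted to seconds */
    return (double)ncols * (double)niter * 1000.0 / elapsed_ms;
}
```

Unlike the MFlop/s number, this value depends only on the problem size and the measured time, so it does not rely on an estimated FLOP count.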
Also included is the addition of the `vector_length(NPROMA)` attribute on the gang loop in SCC and SCC-hoist, which is known to fix the performance degradation for NPROMA values that don't match the hardware vector length of 128 on A100. The corresponding OpenMP attribute `safelen` is not yet supported by NVHPC 22.1 (maybe later versions) and has therefore not been added yet.
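To show where such a clause sits, here is a minimal C sketch of an SCC-style loop nest with OpenACC directives: the outer loop over blocks is the gang loop and carries `vector_length`, while the inner loop over the columns of a block is the vector loop. This is not the code from the SCC variants; the function `scale_columns`, its arguments, and the trivial loop body are assumptions made purely for illustration.

```c
#include <stddef.h>

/* Illustrative kernel only: scales every element of a blocked field.
 * field   : nblocks blocks of nproma contiguous columns each (assumed layout)
 * nproma  : block width, playing the role of NPROMA
 * nblocks : number of blocks
 */
void scale_columns(double *field, int nproma, int nblocks, double factor)
{
    /* Gang loop over blocks. vector_length(nproma) requests a vector
     * length equal to the block width for this parallel region. */
    #pragma acc parallel loop gang vector_length(nproma) \
        copy(field[0:nproma * nblocks])
    for (int ibl = 0; ibl < nblocks; ++ibl) {
        /* Vector loop over the columns within one block. */
        #pragma acc loop vector
        for (int jl = 0; jl < nproma; ++jl) {
            field[(size_t)ibl * nproma + jl] *= factor;
        }
    }
}
```

When compiled without OpenACC support, the directives are ignored and the loops run serially, so the sketch produces the same results on the host.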