Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add col/s throughput metric in performance reports, add vector_length(NPROMA) to gang loops #42

Merged
merged 6 commits into from
Feb 28, 2023

Conversation

reuterbal
Copy link
Collaborator

A repeatedly voiced requirement was the addition of a more objective performance metric than the MFlop/s number currently reported, which is based on a historic estimate of the FLOP count for the 100 col data set.
For this reason, an additional column has been added to the performance output table, reporting the col/s throughput:

$ ctest -VV
...
9: Test command: /home/nabr/dwarf-p-cloudsc/nabr-col-throughput-metric/build/bin/dwarf-cloudsc-gpu-omp-scc-hoist "1" "1000" "128"
9: Environment variables:
9:  OMP_NUM_THREADS=1
9: Test timeout computed to be: 1500
9:      NUMPROC=1, NUMOMP=1, NGPTOTG=1000, NPROMA=128, NGPBLKS=8
9:  Reference MFLOP count for 100 columns :  12.48232900
9:      NUMOMP    NGPTOT  #GP-cols     #BLKS    NPROMA tid# : Time(msec)  MFlops/s     col/s
9:           1      1000      1000         1       128    0 :         46      2696     21600 @ rank#0:core#174
9:           1      1000      1000         8       128   -1 :         63      1950     15625 : TOTAL @ rank#0
9:     1 x   1      1000      1000         8       128   -1 :         63      1950     15625 : TOTAL
9:  Executing CLOUDSC-GPU, "SCC-hoist" variant
9:              Variable Dim             MinValue             MaxValue            AbsMaxErr         AvgAbsErr/GP          MaxRelErr-%
9:                 PLUDE 2D1  0.0000000000000E+00  0.1026720108982E-03  0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00
9:              PCOVPTOT 2D3  0.0000000000000E+00  0.1000000000000E+01  0.3663735981263E-14  0.9876821582822E-15  0.3706367622958E-14
9:      PRAINFRAC_TOPRFZ 1D1  0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00
9:                PFSQLF 2D3 -0.2581756360096E-05  0.4080862802487E-05  0.1719443795702E-18  0.2390299130407E-18  0.2347330463725E-11 !!!!
9:                PFSQIF 2D3 -0.2810873821405E-05  0.3887836598167E-05  0.1761828530289E-18  0.2313065286387E-18  0.1032231574843E-11 !!!!
9:               PFCQLNG 2D1 -0.3555564290808E-07  0.1413389467871E-08  0.3308722450212E-23  0.4230994440106E-23  0.0000000000000E+00
9:               PFCQNNG 2D1 -0.5285014021662E-06  0.8788562329576E-08  0.5293955920339E-22  0.2312968244934E-22  0.0000000000000E+00
9:                PFSQRF 2D3 -0.2581756360096E-05  0.4080862802487E-05  0.1719443795702E-18  0.2296847263331E-18  0.2375157255297E-11 !!!!
9:                PFSQSF 2D3 -0.2768876986308E-05  0.3860349895496E-05  0.1850766989751E-18  0.2298748684609E-18  0.1031770470511E-11 !!!!
9:               PFCQRNG 2D1 -0.3555564290808E-07  0.1413389467871E-08  0.3308722450212E-23  0.4029417633410E-23  0.0000000000000E+00
9:               PFCQSNG 2D1 -0.5285014021662E-06  0.8788562329576E-08  0.5293955920339E-22  0.2107105285572E-22  0.0000000000000E+00
9:              PFSQLTUR 2D1 -0.8115949343062E-06  0.2653456385227E-06  0.1058791184068E-21  0.3489066860269E-22  0.0000000000000E+00
9:              PFSQITUR 2D1 -0.2651690301090E-05  0.3864567769232E-06  0.2117582368136E-21  0.1205586209974E-21  0.0000000000000E+00
9:                PFPLSL 2D1  0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00
9:                PFPLSN 2D3  0.0000000000000E+00  0.2969958004108E-04  0.2846030702774E-18  0.3958588803080E-18  0.2285986820588E-12 !!!!
9:                PFHPSL 2D1 -0.0000000000000E+00 -0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00  0.0000000000000E+00
9:                PFHPSN 2D3 -0.8418345962643E+02 -0.0000000000000E+00  0.8064660050877E-12  0.1121794437263E-11  0.2285441720334E-12 !!!!
9:        TENDENCY_LOC%A 2D3 -0.2777777777778E-03  0.2631309809623E-03  0.8402566836763E-17  0.1145087841548E-17  0.2508471529207E-12 !!!!
9:        TENDENCY_LOC%Q 2D1 -0.1997382404851E-07  0.1495580073796E-07  0.1747005453712E-20  0.6706493723483E-21  0.0000000000000E+00
9:        TENDENCY_LOC%T 2D3 -0.4132200394167E-04  0.5435278446361E-04  0.4919567357653E-17  0.2210615873862E-17  0.9947677523570E-12 !!!!
9:      TENDENCY_LOC%CLD 3D1 -0.1222481036148E-07  0.9816546350973E-08  0.7773946794350E-21  0.3708336006542E-21  0.0000000000000E+00
 9/10 Test  #9: dwarf-cloudsc-gpu-omp-scc-hoist-serial .......   Passed    1.42 sec

Included is also the addition of the vector_length(NPROMA) attribute on the gang loop in SCC and SCC-hoist, for which we know that this fixes performance degradation for NPROMA values that don't match the hardware vector length 128 on A100.
The corresponding OpenMP attribute safelen is not yet supported by NVHPC 22.1 (maybe later versions), therefore not yet added.

@reuterbal reuterbal requested a review from mlange05 February 27, 2023 11:52
@reuterbal reuterbal changed the base branch from main to develop February 27, 2023 11:52
@reuterbal
Copy link
Collaborator Author

While testing this stuff I noticed problems building the C variant with GNU 11. The declaration of variables in the header lead to duplicate symbols there, which I fixed by cleanly putting them into separate compilation units and declaring them as extern in the header. Also found odd behaviour with the multi-threaded OpenMP variant which seems to work fine only as long as OMP_NUM_THREADS is not set.

@reuterbal reuterbal force-pushed the nabr-col-throughput-metric branch from 6803b4c to beffc7d Compare February 27, 2023 14:25
@reuterbal reuterbal force-pushed the nabr-col-throughput-metric branch from beffc7d to 4a13656 Compare February 27, 2023 14:52
Copy link
Collaborator

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice and very happy with this as is. However, we'll probably have to do the same thing for the loki variants, and possibly retrofit this to some of the scc-cuf (hand-rolled and auto-generated) ones. We can merge this now, as is, or wait and funnel all the other variant updates into this one. Opinions?

@reuterbal Marking this as a comment for now, but happy to turn this into approval once we decide how to proceed.

@reuterbal
Copy link
Collaborator Author

The Loki variants are covered (assuming you're talking about the col/s metric), as they also use the common lib for this. CUDA variants are of course separate and need either a subsequent PR or can be handled here after a rebase.

@mlange05
Copy link
Collaborator

Oh yes, of course, sorry! I got confused by only some of the driver files being present, but, of course, we'll need to fix that annotation issue on the Loki side! Ok, cool, GTG then.

@mlange05 mlange05 merged commit ce3b4ee into develop Feb 28, 2023
@mlange05 mlange05 deleted the nabr-col-throughput-metric branch March 1, 2023 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants