Fix Blas Autotuning #11

bjoo · 2011-02-18T13:59:08Z

Hi, a user was trying to run QUDA and came across this error:

(CUDA) too many resources requested for launch (node 0, blas_quda.cu:929)

He was trying to run a 16^4 clover lattice on a single C2050 (but in Multi-GPU mode, i.e. with wraparound QMP comms).

Probably the blas_params are not optimal: he probably did not run a make tune (the script I gave him did not include that step), so he used the default blas_params.h file. The curious thing is that on qcd10g0310, which also has C2050s, this error does not occur when I try to emulate what he is doing (I used exactly the same package for the build that I gave him).

However, I am using CUDA 3.0 and he's using 3.2.

I think in principle a make tune could fix his problem, but that makes automation really quite difficult: one needs to know/edit the lattice size in blas_test, and has to run it interactively or submit a job to a compute node on systems where there is no GPU on the interactive node.

Any ideas? Can it be done at runtime without having to wait the 15 minutes for the full BLAS tuning to go through like with make tune?

maddyscientist · 2011-02-20T15:43:30Z

This can be done at runtime, if:
1.) we only tune the kernels that we need
2.) perform the tuning when the inverter is first created, and keep the results resident after that

This is something I'm thinking about, and will work on this as soon as we have the multi-dim parallelization in shape.
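The two conditions above (tune only what is used, tune on first use and keep the result resident) can be sketched roughly as follows. This is only an illustration, not QUDA code: `LaunchParam`, `LazyTuner`, and the `Timer` callback are hypothetical names, and the callback stands in for "launch the kernel with this configuration and time it".

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical launch configuration; the real layout of blas_params.h
// is not shown in this thread.
struct LaunchParam {
  int threads;
  int blocks;
};

// Stand-in for "run the kernel with this configuration and return the
// elapsed time". A negative return value means the launch failed
// (e.g. too many resources requested).
using Timer = std::function<double(const LaunchParam &)>;

class LazyTuner {
public:
  // Returns the cached configuration for `kernel`, tuning it on first use.
  // `candidates` must be non-empty. If every candidate fails, the first
  // candidate is returned unchanged.
  const LaunchParam &get(const std::string &kernel,
                         const std::vector<LaunchParam> &candidates,
                         const Timer &time_kernel) {
    auto it = cache_.find(kernel);
    if (it != cache_.end()) return it->second; // already tuned: no re-timing

    LaunchParam best = candidates.front();
    double best_time = std::numeric_limits<double>::infinity();
    for (const auto &p : candidates) {
      const double t = time_kernel(p);
      if (t >= 0.0 && t < best_time) { // skip failed launches
        best_time = t;
        best = p;
      }
    }
    ++tuned_count_;
    return cache_.emplace(kernel, best).first->second;
  }

  // Number of kernels that have actually been tuned so far.
  std::size_t tuned_count() const { return tuned_count_; }

private:
  std::map<std::string, LaunchParam> cache_; // results stay resident
  std::size_t tuned_count_ = 0;
};
```

With something like this, a kernel that is never called costs no tuning time, and repeated calls to `get` for the same kernel name reuse the stored result.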

maddyscientist · 2011-04-01T20:19:20Z

A partial fix for this would be to add command line setting of the volumes and spin to blas_test.cu. Propagating this to "make tune" would enable much easier blas tuning, e.g.,

make tune 16 16 16 16 4

would perform a tuning run on a 16^4 lattice, for Nspin = 4.

maddyscientist · 2011-04-19T18:40:23Z

Ron has proposed that we create cached tuned blas files. If one runs at a certain volume that has already been tuned, then the cached parameters will be reused; otherwise some fallback parameters will be used that are guaranteed to work regardless of volume. This seems to me like an easy solution, and will drastically reduce the number of "make tunes" that are needed.
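A minimal sketch of the lookup-with-fallback idea, assuming the cache is keyed by the four lattice dimensions. The names `BlasParam`, `Dims`, and `TuneCache` are hypothetical, the cache here is in memory only (the proposal is for files, whose format is not specified in this thread), and the fallback values used by a caller would be placeholders rather than known-safe numbers.

```cpp
#include <array>
#include <map>

// Hypothetical tuned parameters for one blas kernel.
struct BlasParam {
  int threads;
  int blocks;
};

// Lattice dimensions X, Y, Z, T used as the cache key (an assumption).
using Dims = std::array<int, 4>;

class TuneCache {
public:
  // `fallback` should be a conservative configuration that launches
  // at any volume; choosing it is outside the scope of this sketch.
  explicit TuneCache(BlasParam fallback) : fallback_(fallback) {}

  // Record the result of a tuning run for a given volume.
  void store(const Dims &dims, BlasParam p) { cache_[dims] = p; }

  // True if this exact volume has been tuned before.
  bool is_tuned(const Dims &dims) const { return cache_.count(dims) != 0; }

  // Reuse tuned parameters when available, else the fallback.
  BlasParam lookup(const Dims &dims) const {
    auto it = cache_.find(dims);
    return it != cache_.end() ? it->second : fallback_;
  }

private:
  std::map<Dims, BlasParam> cache_;
  BlasParam fallback_;
};
```

A run at an untuned volume would then still launch with the fallback parameters, just not at peak performance, which is the trade-off the proposal accepts.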


ghost assigned rbabich Apr 19, 2011

maddyscientist mentioned this issue Jan 31, 2012

Possible memory leak ? #47

Closed

rbabich closed this as completed in aa2d301 Mar 27, 2012

fwinter mentioned this issue Nov 26, 2014

tunecache: *** buffer overflow detected *** #177

Closed

ckallidonis pushed a commit to ckallidonis/quda that referenced this issue May 8, 2018

Merge pull request lattice#11 from ETMC-QUDA/hotfix/HighMom-Loops

95ddadf

Hotfix/high mom loops

kostrzewa added a commit that referenced this issue Oct 26, 2021

Merge pull request #11 from qcdcode/feature/ndeg-twisted-clover_remov…

f73fe47

…e_PC_asym_traits closely follow how the PC twisted clover operator is called, avoiding…
