Fix Blas Autotuning #11

Closed
bjoo opened this issue Feb 18, 2011 · 3 comments


@bjoo
Member

bjoo commented Feb 18, 2011

Hi, a user was trying to run QUDA and came across this error:

(CUDA) too many resources requested for launch (node 0, blas_quda.cu:929)

He was trying to run a 16^4 clover lattice on a single C2050 (but in multi-GPU mode, i.e. with wraparound QMP comms).

The blas_params are probably not optimal: he most likely did not run make tune (the script I gave him did not include it), so he used the default blas_params.h file. The curious thing is that on qcd10g0310, also with C2050s, this error does not occur when I try to emulate what he is doing (I used exactly the same package for the build that I gave him).

However, I am using CUDA 3.0 and he is using CUDA 3.2.

I think a make tune could in principle fix his problem, but that makes automation quite difficult: one needs to know and edit the lattice size in blas_test, and the tuning has to be run interactively, or submitted as a job to a compute node on systems where the interactive node has no GPU.

Any ideas? Can it be done at runtime, without having to wait the 15 minutes that the full BLAS tuning takes with make tune?

@maddyscientist
Member

This can be done at runtime, if:
1.) we only tune the kernels that we need
2.) perform the tuning when the inverter is first created, and keep the results resident after that

This is something I'm thinking about, and I will work on it as soon as we have the multi-dim parallelization in shape.
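
To make the two conditions above concrete, here is a minimal C++ sketch of lazy, per-kernel tuning with resident results. It is not QUDA code: `LaunchParam`, `Trial`, `LazyTuner` and the candidate block sizes are hypothetical names and values chosen for illustration, and the timing callback stands in for an actual timed kernel launch. A negative return value models a failed launch such as "too many resources requested for launch", so a configuration that cannot run is never selected.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical launch configuration; QUDA's real parameters differ.
struct LaunchParam {
  int block_size;
};

// Trial callback (assumption): returns the elapsed time of one launch with
// the given block size, or a negative value if the launch failed.
using Trial = std::function<double(int)>;

class LazyTuner {
 public:
  // Returns the tuned parameter for `kernel`. The search runs only the
  // first time a kernel is requested; later calls hit the cache.
  LaunchParam get(const std::string &kernel, const Trial &trial) {
    auto it = cache_.find(kernel);
    if (it != cache_.end()) return it->second;  // already tuned

    // Illustrative candidate list, not taken from QUDA.
    const std::vector<int> candidates = {32, 64, 128, 256, 512, 1024};
    double best_time = std::numeric_limits<double>::max();
    int best_block = -1;
    for (int b : candidates) {
      const double t = trial(b);
      ++trials_run_;
      if (t < 0.0) continue;  // launch failed for this size, skip it
      if (t < best_time) {
        best_time = t;
        best_block = b;
      }
    }
    assert(best_block > 0 && "no candidate block size could be launched");

    const LaunchParam p{best_block};
    cache_[kernel] = p;  // keep the result resident for later calls
    return p;
  }

  // Total number of trial launches performed so far.
  int trials_run() const { return trials_run_; }

 private:
  std::map<std::string, LaunchParam> cache_;
  int trials_run_ = 0;
};
```

With this shape, only the kernels an inverter actually calls pay the tuning cost, and each pays it once; a second `get("axpy", ...)` returns the stored value without running any further trials.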

@maddyscientist
Member

A partial fix for this would be to add command line setting of the volumes and spin to blas_test.cu. Propagating this to "make tune" would enable much easier blas tuning, e.g.,

make tune 16 16 16 16 4

would perform a tuning run on a 16^4 lattice, for Nspin = 4.
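
A rough sketch of what the command-line handling could look like, assuming the five positional arguments in the example (X, Y, Z, T, Nspin). `TuneArgs` and `parse_tune_args` are hypothetical names, not existing blas_test.cu code, and the upper bound on each value is an arbitrary sanity limit.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical container for the tuning geometry.
struct TuneArgs {
  int dims[4];  // lattice extent in X, Y, Z, T
  int n_spin;   // number of spin components
};

// Parses "prog X Y Z T Nspin". Returns false (leaving `out` untouched)
// if the argument count is wrong or any value is not a positive integer.
bool parse_tune_args(int argc, const char *const argv[], TuneArgs &out) {
  if (argc != 6) return false;

  int v[5];
  for (int i = 0; i < 5; ++i) {
    const char *s = argv[i + 1];
    char *end = nullptr;
    const long x = std::strtol(s, &end, 10);
    // Reject empty strings, trailing junk, and out-of-range values.
    if (end == s || *end != '\0' || x <= 0 || x > (1L << 20)) return false;
    v[i] = static_cast<int>(x);
  }

  for (int i = 0; i < 4; ++i) out.dims[i] = v[i];
  out.n_spin = v[4];
  return true;
}
```

One caveat: GNU make treats bare words on its command line as targets, so the Makefile would probably have to forward the sizes to blas_test through a variable rather than positionally.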

@maddyscientist
Member

Ron has proposed that we create cached tuned blas files. If one runs at a volume that has already been tuned, the cached parameters are reused; otherwise fallback parameters are used that are guaranteed to work regardless of volume. This seems to me like an easy solution, and it will drastically reduce the number of "make tunes" that are needed.
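
The lookup logic of such a cache might look like the following C++ sketch. It only illustrates the reuse-or-fallback idea: `BlasParam`, `Volume`, `TunedCache` and every numeric value are hypothetical, and reading or writing the cached files on disk is left out.

```cpp
#include <array>
#include <cassert>
#include <map>

// Hypothetical tuned parameters for one volume.
struct BlasParam {
  int block_size;
  int grid_size;
};

// Lattice extents in X, Y, Z, T, used as the cache key.
using Volume = std::array<int, 4>;

class TunedCache {
 public:
  // `fallback` must be a conservative setting that launches at any volume.
  explicit TunedCache(BlasParam fallback) : fallback_(fallback) {}

  // Records the result of a tuning run for volume `v`.
  void store(const Volume &v, BlasParam p) { table_[v] = p; }

  // Returns the tuned parameters if `v` has been tuned, else the fallback.
  // If `was_tuned` is non-null it reports which of the two was returned.
  BlasParam lookup(const Volume &v, bool *was_tuned = nullptr) const {
    const auto it = table_.find(v);
    const bool found = (it != table_.end());
    if (was_tuned) *was_tuned = found;
    return found ? it->second : fallback_;
  }

 private:
  std::map<Volume, BlasParam> table_;
  BlasParam fallback_;
};
```

Reporting whether the fallback was used lets the caller print a warning that performance may be suboptimal and that a tuning run at this volume would help.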

@ghost ghost assigned rbabich Apr 19, 2011
ckallidonis pushed a commit to ckallidonis/quda that referenced this issue May 8, 2018
kostrzewa added a commit that referenced this issue Oct 26, 2021