multiple calls to loadGaugeQuda in minvcg branch #6
Comments
Balint, in freeCloverField() and freeGaugeField(FullGauge *cudaGauge) { ... } I do not have qmp installed on my machine so I cannot test it, but I suspect this is the reason causing your error.
I have tracked this down. The source of the bug was that in uniform precision, cudaGaugePrecise was assigned to cudaGaugeSloppy via an assignment of the form *sloppy = *precise; (e.g. at the end of loadGaugeQuda()). This copies the two structs wholesale, including the pointers inside them, so after the above call we have sloppy->gauge == precise->gauge.
When we then free precise->gauge and set precise->gauge = NULL;, sloppy->gauge is NOT NULL after that free and still points at the released memory. So subsequently calling freeGaugeField(&sloppy) will result in a double free.
The test needed for correct freeing is, in pseudocode: if( sloppy->gauge == precise->gauge ) { free only once and clear both pointers } else { free both } (see the sketch below). I have implemented this in freeGaugeQuda() and freeCloverQuda() in the minvcg branch. Plain struct assignment like *sloppy = *precise will not automagically increase a reference count on the memory, so the explicit check is needed. This fix also fixes issue 5.
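A minimal self-contained sketch of that check, assuming FullGauge carries a single device pointer called gauge (the real struct, and the actual freeGaugeQuda()/freeCloverQuda() on the minvcg branch, differ in detail):

```c
/* Sketch only: FullGauge is reduced to a single pointer and free() stands in
 * for cudaFree(); the point is the aliasing test, not the real field layout. */
#include <stdlib.h>

typedef struct { void *gauge; } FullGauge;

static FullGauge cudaGaugePrecise, cudaGaugeSloppy;

static void freeGaugeField(FullGauge *cudaGauge) {
  free(cudaGauge->gauge);          /* cudaFree() in the real code */
  cudaGauge->gauge = NULL;
}

void freeGaugeQuda(void) {
  if (cudaGaugeSloppy.gauge == cudaGaugePrecise.gauge) {
    /* uniform precision: sloppy aliases precise, so free only once
       and drop the stale alias */
    freeGaugeField(&cudaGaugePrecise);
    cudaGaugeSloppy.gauge = NULL;
  } else {
    /* mixed precision: two independent allocations, free both */
    freeGaugeField(&cudaGaugePrecise);
    freeGaugeField(&cudaGaugeSloppy);
  }
}
```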
Fixed some pointer issues
In the minvcg branch, when not using mixed precision, in multi-GPU mode, multiple calls to loadGaugeQuda can elicit the error:
QUDA error: (CUDA) invalid argument (node 0, gauge_quda.cpp:805)
Background:
We fixed issue 5 (https://github.com/lattice/quda/issues/#issue/5) in the minvcg branch by
adding freeGaugeQuda and freeCloverQuda, which can be called at the end of a solver
so that a subsequent call to loadGaugeQuda can happily re-allocate the gauge field. However, a new issue has arisen: in uniform precision, gauge and gaugeSloppy are actually pointers to the same place, and somehow after multiple calls to loadGaugeQuda one can encounter the above error. This is pernicious in an HMC-like situation, where multiple calls to loadGaugeQuda are necessary as the gauge field evolves.
An additional data point: when using a mixed precision solver (e.g. precision=SINGLE, sloppy precision=HALF), this situation does not arise, which makes me suspect that the underlying cause of this bug is the aliasing of gauge to gaugeSloppy in uniform precision.
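A minimal plain-C analogue (not QUDA code) of the suspected mechanism: assigning one struct to another copies the embedded pointer, so after the precise field is freed the sloppy copy still holds the old address, and any later free or use of it is invalid. The Gauge struct and the plain malloc/free in place of the CUDA allocations are illustrative only.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { void *gauge; } Gauge;   /* stand-in for FullGauge */

int main(void) {
  Gauge precise, sloppy;
  precise.gauge = malloc(1024);   /* "loadGaugeQuda" allocates the field      */
  sloppy = precise;               /* uniform precision: same pointer, aliased */

  free(precise.gauge);            /* "freeGaugeQuda" releases the memory...   */
  precise.gauge = NULL;           /* ...but clears only the precise copy      */

  /* sloppy.gauge still holds the stale address; freeing or reusing it now
     would be invalid, which is the failure mode suspected above */
  printf("precise.gauge = %p, sloppy.gauge = %p\n", precise.gauge, sloppy.gauge);
  return 0;
}
```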
Reproducing:
configure the minvcg branch with

./configure --enable-os=linux --enable-gpu-arch=sm_20 --disable-staggered-dirac \
  --enable-wilson-dirac --disable-domain-wall-dirac --disable-twisted-mass-dirac \
  --enable-multi-gpu --with-qmp=/home/bjoo/Devel/QCD/install/qmp/qmp2-1-6/openmpi \
  --with-mpi=/home/bjoo/Toolchain/install/openmpi-1.5
Then link chroma against this and run the t_leapfrog test with a QUDA solver in the MD using uniform precision.
NB: producing this error has so far required an external client to make multiple calls to loadGaugeQuda (e.g. chroma calling loadGaugeQuda during the MD evolution in HMC).
A small self contained test within QUDA reproducing this error (without chroma) would be desirable.
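A rough sketch of what such a test might look like, assuming the loadGaugeQuda/freeGaugeQuda interface described above; the QudaGaugeParam setup, host gauge-field allocation and QMP/MPI initialisation are placeholders that would have to follow the existing multi-GPU tests, and setup_params_and_gauge() is a hypothetical helper, not an existing QUDA function.

```c
#include <quda.h>

/* Hypothetical helper (not part of QUDA): fill gauge_param for uniform
   precision (cuda_prec == cuda_prec_sloppy) and allocate/fill the host
   gauge field, exactly as the existing tests do. */
void setup_params_and_gauge(QudaGaugeParam *gauge_param, void *gauge[4]);

int main(int argc, char **argv)
{
  /* ... initialise QMP/MPI and call initQuda() as in the existing tests ... */

  QudaGaugeParam gauge_param;
  void *gauge[4];
  setup_params_and_gauge(&gauge_param, gauge);

  for (int iter = 0; iter < 10; iter++) {
    loadGaugeQuda((void *)gauge, &gauge_param); /* allocates the device gauge  */
    /* optionally run a solve here */
    freeGaugeQuda();                            /* the free added for issue 5;
                                                   signature assumed arg-free  */
  }
  /* without the aliasing fix the second iteration is expected to fail with
     "QUDA error: (CUDA) invalid argument (gauge_quda.cpp:805)" */

  return 0;
}
```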