multiple calls to loadGaugeQuda in minvcg branch #6

bjoo · 2011-02-12T01:53:52Z

Multiple calls to loadGaugeQuda produce a CUDA "invalid argument" error.

In the minvcg branch, when running in multi-GPU mode without mixed precision, multiple calls to loadGaugeQuda can trigger the error:

QUDA error: (CUDA) invalid argument (node 0, gauge_quda.cpp:805)

Background:
We fixed issue 5 (https://github.com/lattice/quda/issues/#issue/5 ) in the minvcg branch by
adding freeGaugeQuda and freeCloverQuda calls, which can be called at the end of a solver
so that a subsequent call to loadGaugeQuda can re-allocate the gauge field. However, a new issue has arisen: in uniform precision, gauge and gaugeSloppy are actually pointers to the same place, and after multiple calls to loadGaugeQuda one can encounter the above error. This is pernicious in an HMC-like situation, where multiple calls to loadGaugeQuda are necessary as the gauge field evolves.

An additional data point: when using a mixed-precision solver (e.g. precision=SINGLE, sloppy precision=HALF), this situation does not arise, which makes me suspect that the underlying cause of this bug is the aliasing of gauge to gaugeSloppy in uniform precision.

Reproducing:
Configure the minvcg branch with:

./configure --enable-os=linux --enable-gpu-arch=sm_20 --disable-staggered-dirac
--enable-wilson-dirac --disable-domain-wall-dirac --disable-twisted-mass-dirac
--enable-multi-gpu --with-qmp=/home/bjoo/Devel/QCD/install/qmp/qmp2-1-6/openmpi
--with-mpi=/home/bjoo/Toolchain/install/openmpi-1.5

Then link Chroma against this build and run the t_leapfrog test with a QUDA solver in the MD using uniform precision.

NB: so far, producing this error has required an external client to make multiple calls to loadGaugeQuda (e.g. Chroma calling loadGaugeQuda during the MD evolution in HMC).
A small self-contained test within QUDA that reproduces this error (without Chroma) would be desirable.

gshi · 2011-02-12T19:38:52Z

Balint,

in freeCloverField(),
freeParityClover(&clover->even);
freeParityClover(&clover->odd);
the even/odd fields are not checked for NULL before they are freed, nor are they set to NULL afterwards. They could therefore be freed twice if the precision is uniform. Maybe you want to do something similar to freeGaugeField():

void freeGaugeField(FullGauge *cudaGauge) {
  if (cudaGauge->even) cudaFree(cudaGauge->even);
  if (cudaGauge->odd) cudaFree(cudaGauge->odd);
  cudaGauge->even = NULL;
  cudaGauge->odd = NULL;
}

I do not have qmp installed on my machine, so I cannot test it, but I suspect this is the cause of your error.

bjoo · 2011-02-12T21:49:28Z

I have tracked this down. The source of the bug was that in uniform precision,

cudaGaugePrecise was assigned to cudaGaugeSloppy

via an assignment of the form

*sloppy = *precise; (e.g. at the end of loadGaugeQuda() )

This copies the whole struct, including the pointers inside it.

So, for example, after the above assignment we have sloppy->gauge == precise->gauge.

When we free precise->gauge and set precise->gauge = NULL;
the corresponding operations do not happen to sloppy->gauge.

Consequently sloppy->gauge is NOT NULL after a call to
freeGaugeField(&precise);

So subsequently calling freeGaugeField(&sloppy) will result in a double free,
even if freeGaugeField checks that it is not freeing a NULL pointer. This is because
the pointer in sloppy is not NULL. This can result in undefined behaviour.
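
The aliasing problem described above can be reproduced without a GPU. The following is only a sketch: MockDevice, Gauge, freeGauge and aliasedDoubleFree are hypothetical stand-ins written for this illustration, not QUDA code, and the mock allocator merely counts frees of buffers that are no longer live instead of calling cudaFree.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <vector>

// Hypothetical stand-in for the device allocator. It keeps the storage alive
// until destruction so stale pointer values remain safe to compare, and it
// counts attempts to release a buffer that is not live (i.e. double frees).
struct MockDevice {
    std::vector<std::unique_ptr<int[]>> storage;
    std::set<int*> live;
    int invalid_frees = 0;

    int* allocate(std::size_t n) {
        storage.push_back(std::make_unique<int[]>(n));
        int* p = storage.back().get();
        live.insert(p);
        return p;
    }
    void release(int* p) {
        if (live.erase(p) == 0) ++invalid_frees;  // buffer was already released
    }
};

// Simplified stand-in for the gauge struct; only the pointer member matters.
struct Gauge { int* gauge = nullptr; };

// Null-checked free in the style of freeGaugeField().
void freeGauge(MockDevice& dev, Gauge* g) {
    if (g->gauge) dev.release(g->gauge);
    g->gauge = nullptr;
}

// Frees two aliased structs one after the other and returns the number of
// invalid frees the mock device observed.
int aliasedDoubleFree() {
    MockDevice dev;
    Gauge precise, sloppy;
    precise.gauge = dev.allocate(16);
    sloppy = precise;                       // struct copy: same pointer in both
    assert(sloppy.gauge == precise.gauge);
    freeGauge(dev, &precise);               // releases the buffer, nulls precise.gauge only
    freeGauge(dev, &sloppy);                // sloppy.gauge is stale but non-NULL
    return dev.invalid_frees;
}
```

Calling aliasedDoubleFree() returns 1: the NULL check passes for sloppy, so the same buffer is released a second time.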

The tests needed for correct freeing in pseudocode are:

if( sloppy->gauge == precise->gauge ) {
  freeGaugeField(precise);
  sloppy->gauge = NULL;
}
else {
  freeGaugeField(precise);
  freeGaugeField(sloppy);
}
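
The pseudocode above can be turned into a small host-only sketch. As before, MockDevice, Gauge, freeGauge, freeGaugePair and freePair are hypothetical names introduced for illustration; the actual freeGaugeQuda() and freeCloverQuda() in the minvcg branch may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <vector>

// Hypothetical stand-in for the device allocator (not a CUDA API).
struct MockDevice {
    std::vector<std::unique_ptr<int[]>> storage;
    std::set<int*> live;
    int invalid_frees = 0;

    int* allocate(std::size_t n) {
        storage.push_back(std::make_unique<int[]>(n));
        int* p = storage.back().get();
        live.insert(p);
        return p;
    }
    void release(int* p) {
        if (live.erase(p) == 0) ++invalid_frees;
    }
};

struct Gauge { int* gauge = nullptr; };

void freeGauge(MockDevice& dev, Gauge* g) {
    if (g->gauge) dev.release(g->gauge);
    g->gauge = nullptr;
}

// Frees a precise/sloppy pair, releasing a shared buffer only once.
void freeGaugePair(MockDevice& dev, Gauge* precise, Gauge* sloppy) {
    if (sloppy->gauge == precise->gauge) {
        freeGauge(dev, precise);
        sloppy->gauge = nullptr;   // drop the alias without a second free
    } else {
        freeGauge(dev, precise);
        freeGauge(dev, sloppy);
    }
}

struct FreeStats {
    int invalid_frees;     // double frees seen by the mock device
    std::size_t leaked;    // buffers still live after freeing the pair
};

// uniform == true aliases sloppy to precise; false gives sloppy its own buffer.
FreeStats freePair(bool uniform) {
    MockDevice dev;
    Gauge precise, sloppy;
    precise.gauge = dev.allocate(16);
    if (uniform) sloppy = precise;
    else sloppy.gauge = dev.allocate(16);
    freeGaugePair(dev, &precise, &sloppy);
    assert(!precise.gauge && !sloppy.gauge);
    return {dev.invalid_frees, dev.live.size()};
}
```

In both the uniform and the mixed-precision case, freePair reports no invalid frees and no leaked buffers.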

I have implemented this in freeGaugeQuda() and freeCloverQuda() in the minvcg branch.
One suggestion is to reference-count the device allocations, using
reference-counted smart pointers. In that case the line:

*sloppy = *precise

would automagically increase a reference count on the memory
pointed to by precise. The first free (on precise) would reduce the reference count (but not free the memory), though the pointer in precise would be set to NULL. The second free (on sloppy) would reduce the reference count to 0, free the buffer and set sloppy's pointer to NULL.
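
A minimal sketch of the reference-counting idea, assuming std::shared_ptr with a custom deleter as the smart pointer (the comment above does not name a specific type). RcGauge, allocGauge, freeGauge and refCountedFree are hypothetical names, and a host allocation stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Counts how many times the mock "device" buffer is actually released.
static int g_release_count = 0;

// Hypothetical gauge struct whose buffer is held by a reference-counted pointer.
struct RcGauge { std::shared_ptr<int> gauge; };

// Allocates a host buffer standing in for device memory. The custom deleter
// plays the role of cudaFree and runs only when the last owner lets go.
std::shared_ptr<int> allocGauge(std::size_t n) {
    return std::shared_ptr<int>(new int[n], [](int* p) {
        ++g_release_count;
        delete[] p;
    });
}

// "Free" just drops this struct's reference and leaves its pointer null.
void freeGauge(RcGauge* g) { g->gauge.reset(); }

struct RcResult {
    long owners_after_copy;      // reference count right after the struct copy
    int releases_after_first;    // buffer releases after freeing precise
    int releases_after_second;   // buffer releases after freeing sloppy
};

RcResult refCountedFree() {
    g_release_count = 0;
    RcGauge precise, sloppy;
    precise.gauge = allocGauge(16);
    sloppy = precise;                         // copy increments the count
    long owners = precise.gauge.use_count();
    freeGauge(&precise);                      // count drops, buffer survives
    int first = g_release_count;
    freeGauge(&sloppy);                       // last owner gone: buffer released
    int second = g_release_count;
    assert(!precise.gauge && !sloppy.gauge);
    return {owners, first, second};
}
```

Here refCountedFree() reports two owners after the copy, no release after the first free, and exactly one release after the second, which matches the behaviour described above.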

This fix also fixes issue 5.


fwinter mentioned this issue Oct 13, 2011

Not all device memory freed #37

Closed

mathiaswagner mentioned this issue Oct 16, 2014

hisq_paths_force_test --gauge-order milc crashes with Segmentation fault #163

Closed

fwinter mentioned this issue Nov 26, 2014

tunecache: *** buffer overflow detected *** #177

Closed

ckallidonis pushed a commit to ckallidonis/quda that referenced this issue May 8, 2018

Merge pull request lattice#6 from ETMC-QUDA/hotfix/pointer-mgnt

e78afe0

Fixed some pointer issues

This issue was closed.
