multiple calls to loadGaugeQuda in minvcg branch #6

Closed
bjoo opened this issue Feb 12, 2011 · 2 comments

@bjoo
Member

bjoo commented Feb 12, 2011

In the minvcg branch, when not using mixed precision, in multi-GPU mode, multiple calls to loadGaugeQuda can elicit the error:

QUDA error: (CUDA) invalid argument (node 0, gauge_quda.cpp:805)

Background:
We fixed issue 5 (https://github.com/lattice/quda/issues/#issue/5 ) in the minvcg branch by
adding freeGaugeQuda and freeCloverQuda calls, which can be called at the end of a solver
so that a subsequent call to loadGaugeQuda can re-allocate the gauge field. However, a new issue has arisen: in uniform precision, gauge and gaugeSloppy are actually pointers to the same place, and after multiple calls to loadGaugeQuda one can encounter the above error. This is pernicious in an HMC-like situation, where multiple calls to loadGaugeQuda are necessary as the gauge field evolves.

An additional data point: when using a mixed precision solver (e.g. precision=SINGLE, sloppy precision=HALF), this situation does not arise, which makes me suspect that the underlying cause of this bug is the aliasing of gauge to gaugeSloppy in uniform precision.

Reproducing:
configure the minvcg branch with

./configure --enable-os=linux --enable-gpu-arch=sm_20 --disable-staggered-dirac \
    --enable-wilson-dirac --disable-domain-wall-dirac --disable-twisted-mass-dirac \
    --enable-multi-gpu --with-qmp=/home/bjoo/Devel/QCD/install/qmp/qmp2-1-6/openmpi \
    --with-mpi=/home/bjoo/Toolchain/install/openmpi-1.5

Then link chroma against this and run the t_leapfrog test with a QUDA solver in the MD using uniform precision.

NB: so far, producing this error has required an external client to make multiple calls to loadGaugeQuda (e.g. chroma calling loadGaugeQuda during the MD evolution in HMC).
A small self-contained test within QUDA that reproduces this error (without chroma) would be desirable.

@gshi
Member

gshi commented Feb 12, 2011

Balint,

in freeCloverField(),
    freeParityClover(&clover->even);
    freeParityClover(&clover->odd);
the even/odd pointers are not checked for NULL before they are freed, nor are they set to NULL afterwards. In that case, they could be freed twice if the precision is uniform. Maybe you want to do something similar to freeGaugeField():

void freeGaugeField(FullGauge *cudaGauge) {
    if (cudaGauge->even) cudaFree(cudaGauge->even);
    if (cudaGauge->odd) cudaFree(cudaGauge->odd);
    cudaGauge->even = NULL;
    cudaGauge->odd = NULL;
}

I do not have qmp installed on my machine so I cannot test it, but I suspect this is the cause of your error.
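The check-then-reset pattern suggested above could be applied to the clover free roughly as follows. This is a minimal host-only sketch, not the QUDA source: the single `clover` member of `ParityClover`, the `FullClover` layout, and `deviceFree` (a `std::free` stand-in for `cudaFree`) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical layouts; the real QUDA structs are not shown in this thread.
struct ParityClover { void *clover; };
struct FullClover { ParityClover even; ParityClover odd; };

static int freeCalls = 0;  // counts how often a buffer is actually released

// Stand-in for cudaFree so the sketch runs without a GPU.
void deviceFree(void *p) { ++freeCalls; std::free(p); }

// Free only if non-null, then reset the pointer so that a second call
// through the same struct becomes a no-op.
void freeParityClover(ParityClover *c) {
    assert(c != nullptr);
    if (c->clover) deviceFree(c->clover);
    c->clover = nullptr;
}

void freeCloverField(FullClover *clover) {
    freeParityClover(&clover->even);
    freeParityClover(&clover->odd);
}
```

With this in place, calling freeCloverField twice on the same FullClover releases each buffer once. The guard only covers repeated frees through the same struct.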

@bjoo
Member Author

bjoo commented Feb 12, 2011

I have tracked this down. The source of the bug was that in uniform precision,

cudaGaugePrecise was assigned to cudaGaugeSloppy

via an assignment of the form

*sloppy = *precise; (e.g. at the end of loadGaugeQuda() )

This copies the whole struct, including the pointers inside it.

So, for example, after the above assignment we have sloppy->gauge == precise->gauge.

When we free precise->gauge and set precise->gauge = NULL,
the corresponding operations do not happen to sloppy->gauge.

Consequently, sloppy->gauge is NOT NULL after a call to
freeGaugeField(&precise);

So subsequently calling freeGaugeField(&sloppy) will result in a double free,
even if freeGaugeField checks that it is not freeing a NULL pointer, because
the pointer in sloppy is not NULL. This can result in undefined behaviour.
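The aliasing can be reproduced on the host without QUDA or a GPU. The sketch below is an illustration under stated assumptions: `FullGauge` is reduced to the single `gauge` pointer named in this comment, and `trackedAlloc` / `trackedFree` are hypothetical wrappers around `malloc` / `free` that record a second free of the same address instead of performing it.

```cpp
#include <cassert>
#include <cstdlib>
#include <set>

// Hypothetical stand-in for QUDA's FullGauge, reduced to one pointer.
struct FullGauge { void *gauge; };

static std::set<void *> live;   // addresses currently allocated
static int doubleFrees = 0;     // frees of an address that is no longer live

void *trackedAlloc(std::size_t bytes) {
    void *p = std::malloc(bytes);
    assert(p != nullptr);
    live.insert(p);
    return p;
}

void trackedFree(void *p) {
    if (live.erase(p) == 0) { ++doubleFrees; return; }  // already freed: record it
    std::free(p);
}

// Null-checking free in the style of freeGaugeField.
void freeGaugeField(FullGauge *g) {
    if (g->gauge) trackedFree(g->gauge);
    g->gauge = nullptr;
}

// Runs the naive free sequence on aliased structs and returns the number
// of double frees detected.
int naiveFreeOfAliasedFields() {
    doubleFrees = 0;
    FullGauge precise = { trackedAlloc(64) };
    FullGauge sloppy;
    sloppy = precise;           // struct copy: both now hold the same address
    freeGaugeField(&precise);   // releases the buffer, nulls precise.gauge only
    freeGaugeField(&sloppy);    // sloppy.gauge is stale but non-null
    return doubleFrees;
}
```

The null check passes for sloppy because only precise's copy of the pointer was reset, so the second call attempts to free the same address again.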

The tests needed for correct freeing in pseudocode are:

if( sloppy->gauge == precise->gauge ) {
    freeGaugeField(precise);
    sloppy->gauge = NULL;
}
else {
    freeGaugeField(precise);
    freeGaugeField(sloppy);
}
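A runnable host-only version of this test might look like the sketch below. It is not the code committed to the minvcg branch: `FullGauge` is reduced to one pointer, `deviceFree` is a `std::free` stand-in for `cudaFree`, and `freeGaugePair` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for QUDA's FullGauge, reduced to one pointer.
struct FullGauge { void *gauge; };

static int freeCalls = 0;  // counts how often a buffer is actually released

// Stand-in for cudaFree so the sketch runs without a GPU.
void deviceFree(void *p) { ++freeCalls; std::free(p); }

void freeGaugeField(FullGauge *g) {
    if (g->gauge) deviceFree(g->gauge);
    g->gauge = nullptr;
}

// Alias-aware free: if both structs point at the same buffer, release it
// once and reset the second pointer by hand.
void freeGaugePair(FullGauge *precise, FullGauge *sloppy) {
    assert(precise != nullptr && sloppy != nullptr);
    if (sloppy->gauge == precise->gauge) {
        freeGaugeField(precise);
        sloppy->gauge = nullptr;
    } else {
        freeGaugeField(precise);
        freeGaugeField(sloppy);
    }
}
```

In the uniform-precision case the buffer is released once and both pointers end up NULL. In the mixed-precision case the two distinct buffers are released separately.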

I have implemented this in freeGaugeQuda() and freeCloverQuda() in the minvcg branch.
One suggestion for the future is to reference-count allocations on the device, using
reference-counted smart pointers. In that case the line:

*sloppy = *precise

would automatically increase the reference count on the memory
pointed to by precise. The first free (on precise) would then reduce the reference count (but not free the memory), though the pointer in precise would be set to NULL. The second free (on sloppy) would reduce the reference count to 0, free the buffer and set sloppy's pointer to NULL.
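The reference-counting idea can be sketched with `std::shared_ptr` and a custom deleter. This is an illustration only, not something implemented in QUDA: the struct layout, `allocGauge`, and `DeviceDeleter` (which calls `std::free` where a real version would call `cudaFree`) are assumptions.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>

static int buffersFreed = 0;  // counts actual releases of the buffer

// Deleter standing in for cudaFree; it runs when the last owner lets go.
struct DeviceDeleter {
    void operator()(void *p) const { ++buffersFreed; std::free(p); }
};

// Hypothetical FullGauge whose pointer is reference counted.
struct FullGauge {
    std::shared_ptr<void> gauge;
};

FullGauge allocGauge(std::size_t bytes) {
    void *raw = std::malloc(bytes);  // a real version would use cudaMalloc
    assert(raw != nullptr);
    return FullGauge{ std::shared_ptr<void>(raw, DeviceDeleter()) };
}

// Drops this struct's reference and leaves its pointer null. The buffer
// is released only when the count reaches zero.
void freeGaugeField(FullGauge *g) { g->gauge.reset(); }
```

Copying one FullGauge into another then raises the count to 2. Freeing precise drops it to 1 and leaves the buffer intact. Freeing sloppy drops it to 0 and runs the deleter exactly once.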

This fix also fixes issue 5.

This issue was closed.