Not all device memory freed #37
Thanks for reporting this, Frank. I'll look into it this coming week. We're currently doing a large rewrite of much of the library (removing all global variables), which will make it much easier to prevent memory leaks.
Ok, got around to looking at this. This memory is freed when endBlas() is called, which is invoked when endQuda() is called. These buffers represent a small amount of storage used for reductions, which should have only a minor impact on calculations. What are you wanting? The option to be able to free some GPU memory, but not to do a complete endQuda()?
Thanks for looking into this. initQuda() is called at Chroma initialization time, and endQuda() in turn should be called at the end. That means some (small) amount of memory stays allocated between individual QUDA inversions, which is fine in principle. What worries me a bit is device memory fragmentation. The objects that are allocated between individual inversions, i.e. with QDP++, are rather large (propagators, etc.). These objects need contiguous memory regions, and even small allocated fragments might make it impossible to allocate such an object, so memory is not used optimally. There are ways around this: one might think of having separate memory domains on the device for small and large objects, respectively, but this is not implemented yet. One workaround that occurs to me: do you think it's safe to call initBlas/endBlas each time an inversion starts/ends? If I understand you correctly, this should make sure that these memory fragments are correctly freed before leaving the QUDA inverter.
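A minimal sketch of the call pattern being proposed here, as it might look on the Chroma side. The wrapper function, its arguments, and the header names are only illustrative assumptions; initQuda()/endQuda() would still bracket the whole Chroma run as before:

```cpp
#include <quda.h>       // invertQuda(), QudaInvertParam
#include <blas_quda.h>  // initBlas(), endBlas() (header name assumed)

// Hypothetical wrapper on the Chroma side: (re)allocate QUDA's blas reduction
// buffers only for the duration of a single solve, so that QDP++ sees as much
// contiguous device memory as possible in between inversions.
void doQudaInversion(QudaInvertParam &inv_param, void *spinorOut, void *spinorIn)
{
  initBlas();                                   // allocate the reduction buffers
  invertQuda(spinorOut, spinorIn, &inv_param);  // run the solver as usual
  endBlas();                                    // free the buffers before returning to QDP++
}
```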
Yes, it should be safe to call endBlas and then initBlas in between solvers. Of course, things will go bad if endBlas is called and a solver is then called...
Just realized that it's not so straightforward to call endBlas from Chroma. There are name clashes: e.g. "Complex" is defined in QDP and aliased to the global namespace, and you have a "Complex" type as well...
Ok, this has motivated me to do something I've been planning for a while: to create a quda namespace. As a first step, all I have done is move the blas creation/destruction functions into the namespace, e.g. quda::initBlas, etc. This is pushed to master. Can you tell me what conflicts remain, and I'll make the necessary changes to fix them? I won't move everything into the namespace quite yet, as it would take too long. This will be an evolutionary process...
blas_cuda.h still uses "Complex" from the global namespace. If you could move this declaration to your new namespace, we should be fine.
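For illustration, a minimal sketch of the kind of clash and how the namespace resolves it; the concrete typedefs are placeholders, not QUDA's or QDP++'s actual definitions:

```cpp
#include <complex>

// QUDA side: after the change, the alias lives inside namespace quda,
// so it no longer collides with anything in the global namespace.
namespace quda {
  typedef std::complex<double> Complex;   // placeholder for QUDA's type
}

// QDP++/Chroma side: keeps its own unqualified "Complex" at global scope.
typedef std::complex<float> Complex;      // placeholder for QDP's type

int main() {
  Complex a(1.0f, 2.0f);        // resolves to the QDP-side alias
  quda::Complex b(3.0, 4.0);    // QUDA's alias, explicitly qualified
  return 0;
}
```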
I've moved this to the namespace now (commit 3de6e8f). Hopefully this closes this issue.
No more name clashes! Now (with endBlas/initBlas) there is a segfault in invertQuda. It seems it's not safe to call endBlas and then initBlas again and expect everything to stay fine. I investigated this further: (at Chroma init) calling just initQuda and nothing else works fine. But calling initQuda(); endBlas(); initBlas(); then crashes in invertQuda. Any other side effects? Here is the backtrace (no debug symbols):
#0 0x00007fffe92e2130 in ?? () from /usr/lib/libcuda.so
That fixed it: No more memory leaks now! |
When using the QUDA clover inverter within Chroma, some device memory areas remain allocated after the inversion. This might be okay if QUDA were the only part of the program that accesses the GPU. However, there is ongoing work to extend QDP++ to use the GPU(s) as well. Thus, when using the QDP++ extension along with QUDA in the same Chroma run, device memory remains allocated after exiting the QUDA inverter and cannot be used in the remainder of Chroma, e.g. sink smearing, hadspec, etc.
A thin CUDA layer inserted into QUDA produced a dump of the allocation history made during a QUDA Clover inversion:
0: 0x200300000 524288 1 blas_quda.cu:108
1: 0x200380000 1048576 1 blas_quda.cu:114
2: 0x200480000 1572864 1 blas_quda.cu:120
These are the locations where cudaMalloc was called without a matching cudaFree.
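For reference, a hedged sketch of how such a thin tracking layer could be built; the wrapper names, the map, and the report format here are assumptions, not the layer actually used to produce the dump above:

```cpp
#include <cstdio>
#include <cstddef>
#include <map>
#include <string>
#include <cuda_runtime.h>

// Hypothetical wrapper: record every cudaMalloc with its call site and drop
// the record on cudaFree; anything left in the map at shutdown is a leak.
static std::map<void*, std::string> live_allocs;

static cudaError_t trackedMalloc(void **ptr, size_t size,
                                 const char *file, int line)
{
  cudaError_t err = cudaMalloc(ptr, size);
  if (err == cudaSuccess) {
    char where[256];
    std::snprintf(where, sizeof(where), "%zu bytes at %s:%d", size, file, line);
    live_allocs[*ptr] = where;
  }
  return err;
}

static cudaError_t trackedFree(void *ptr)
{
  live_allocs.erase(ptr);
  return cudaFree(ptr);
}

// Used in place of cudaMalloc so the call site is captured automatically.
#define TRACKED_MALLOC(p, s) trackedMalloc((p), (s), __FILE__, __LINE__)

// Call at program exit: prints one line per outstanding allocation
// (index, pointer, size, call site), similar to the dump above.
static void reportLeaks()
{
  int i = 0;
  for (const auto &a : live_allocs)
    std::printf("%d: %p  %s\n", i++, a.first, a.second.c_str());
}
```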
(Master branch of QUDA pulled today, Sep 30 10am CET. Single GPU version.)