CUDA memory pool with async memory allocation/deallocation #3903
Conversation
This is a bit slower than the current pool; for me with 7B it reduces TG performance by ~2%, but it may still be worth doing for the memory savings. The best solution may be to use virtual memory to expand the allocation size as needed, as explained here. Then, since we can guarantee that deallocations happen in the reverse order of allocations, allocating memory would only require increasing a head pointer, and freeing would only require decreasing it.
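A minimal sketch of that head-pointer idea (illustrative only; the names and the fixed-capacity region are assumptions, and a real implementation would grow the backing memory with CUDA virtual memory management instead of failing):

```cpp
#include <cassert>
#include <cstddef>

// Stack-like device pool: allocations bump a head offset, and frees, which are
// guaranteed to arrive in reverse order of allocations, simply move it back.
struct stack_pool {
    char * base = nullptr; // start of a reserved device address range
    size_t capacity = 0;   // bytes currently backed by physical memory
    size_t head = 0;       // offset of the next free byte
};

static void * stack_pool_alloc(stack_pool & pool, size_t size) {
    size = (size + 255) & ~size_t(255); // keep allocations 256-byte aligned
    if (pool.head + size > pool.capacity) {
        // with CUDA virtual memory management, more physical pages would be
        // mapped into the reserved range here; this sketch just fails instead
        return nullptr;
    }
    void * ptr = pool.base + pool.head;
    pool.head += size;
    return ptr;
}

static void stack_pool_free(stack_pool & pool, void * ptr, size_t size) {
    size = (size + 255) & ~size_t(255);
    pool.head -= size;
    assert(pool.base + pool.head == ptr); // frees must be in reverse order
}
```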
I am curious what the performance is with more than one GPU. Did you test with multiple GPUs?
I do not have a good way to test this with multiple GPUs.
but it may still be worth doing for the memory savings.
Why does it use less memory? Is it because of the "lookahead" over-allocation in the custom implementation?
Custom mempool:
main: n_kv_max = 2048, is_pp_shared = 0, n_gpu_layers = 999, mmq = 0
llama_print_timings: load time = 3058.06 ms

CUDA mempool:
main: n_kv_max = 2048, is_pp_shared = 0, n_gpu_layers = 999, mmq = 0
llama_print_timings: load time = 3078.71 ms
The current implementation just makes an allocation with the same size as requested, and then tries to reuse these allocations when possible. For example, with cuBLAS it will allocate enough memory to convert the first weight that is used to F16, but when we need to convert a larger weight to F16, that allocation will be too small and a new allocation will be required, nearly doubling the memory usage. The worst case is probably when evaluating with increasingly larger batch sizes: since the previous allocations for src1/dst will be too small to reuse, new buffers will be allocated on every evaluation. It's just really bad, and it gets far more use than was ever intended.
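A simplified sketch of that behaviour (not the actual llama.cpp code; the buffer count and names are assumptions): each request is served by the first cached buffer that is large enough, so steadily growing request sizes keep forcing fresh allocations while the smaller cached buffers sit unused:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

struct pool_buffer { void * ptr = nullptr; size_t size = 0; };
static pool_buffer g_pool[256]; // cached, currently unused allocations

static void * custom_pool_malloc(size_t size, size_t * actual_size) {
    for (auto & b : g_pool) {
        if (b.ptr != nullptr && b.size >= size) { // reuse only if big enough
            void * ptr = b.ptr;
            *actual_size = b.size;
            b.ptr = nullptr;
            return ptr;
        }
    }
    // no cached buffer fits: allocate a new one of exactly the requested size;
    // with steadily growing requests this path is hit on every evaluation
    void * ptr = nullptr;
    cudaMalloc(&ptr, size);
    *actual_size = size;
    return ptr;
}
```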
We can always add a toggle to switch between the two implementations.
This PR broke llama.cpp on my GTX 970 (Maxwell, compute 5.2):
I will try to add additional checks for the CUDA pool. There is a check, but for some reason your GPU can get a pool yet can't allocate from it, so I assume it is not supported by your GPU and we should fall back to the old implementation when needed. I will check it.
@cebtenzzre I added an additional check on the device properties: #3931
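A minimal sketch of that kind of device-property check (illustrative, not the actual patch in #3931): query the runtime attribute before ever using the async pool on a device:

```cpp
#include <cuda_runtime.h>

// Returns true only if the driver reports memory pool (cudaMallocAsync)
// support for this device; Maxwell cards like the GTX 970 above report 0.
static bool device_supports_mempool(int device) {
    int supported = 0;
    cudaError_t err = cudaDeviceGetAttribute(&supported, cudaDevAttrMemoryPoolsSupported, device);
    return err == cudaSuccess && supported != 0;
}
```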
cuda : use CUDA memory pool with async memory allocation/deallocation when available (ggerganov#3903)
* Using CUDA memory pools for async alloc/dealloc.
* If the CUDA device doesn't support memory pools, use the old implementation.
* Removed redundant cublasSetStream
Co-authored-by: Oleksii Maryshchenko <[email protected]>
* Revert "cuda : add ROCM aliases for CUDA pool stuff (ggerganov#3918)" This reverts commit 629f917. * Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available (ggerganov#3903)" This reverts commit d606905. ggml-ci
The custom implementation of the memory pool was changed to the CUDA device memory pool.
If the device doesn't support memory pools for some reason, it falls back to the old implementation.
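A hedged sketch of the allocation path this describes (not the literal PR code; the legacy_pool_* helpers are hypothetical stand-ins for the old custom pool, and g_use_mempool would be set at init from a check like device_supports_mempool() above):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

static bool g_use_mempool = false; // set once at init based on device support

static void * legacy_pool_malloc(size_t size) {
    void * ptr = nullptr;
    cudaMalloc(&ptr, size); // stand-in for the old custom pool
    return ptr;
}

static void legacy_pool_free(void * ptr) {
    cudaFree(ptr); // stand-in for the old custom pool
}

static void * cuda_pool_malloc(size_t size, cudaStream_t stream) {
    if (g_use_mempool) {
        void * ptr = nullptr;
        if (cudaMallocAsync(&ptr, size, stream) != cudaSuccess) {
            return nullptr; // the real code would report this via CUDA_CHECK
        }
        return ptr; // device memory pool path
    }
    return legacy_pool_malloc(size); // fallback path
}

static void cuda_pool_free(void * ptr, cudaStream_t stream) {
    if (g_use_mempool) {
        cudaFreeAsync(ptr, stream);
    } else {
        legacy_pool_free(ptr);
    }
}
```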
P.S. In CUDA_CHECK and CUBLAS_CHECK I renamed id to dev_id because of the warning "Declaration shadows a local variable".
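For illustration, this is the shape of the issue being avoided (a sketch, not the exact llama.cpp macro): the variable declared inside the macro body used to be called id, which shadowed any id already in scope at the call site; dev_id avoids the clash.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(err)                                                     \
    do {                                                                    \
        cudaError_t err_ = (err);                                           \
        if (err_ != cudaSuccess) {                                          \
            int dev_id; /* previously `id`, which shadowed callers' `id` */ \
            cudaGetDevice(&dev_id);                                         \
            fprintf(stderr, "CUDA error %d on device %d: %s\n",             \
                    (int) err_, dev_id, cudaGetErrorString(err_));          \
            exit(1);                                                        \
        }                                                                   \
    } while (0)
```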