Incorporating CUDA unified memory into Caffe #2775
So I did a few GEMM operations using unified memory, and it greatly decreases performance. I multiplied two 8196x8196 matrices; the operation took about 300ms using UVA and ~45s using unified memory. It seems like it's a no-go. It may still help Caffe users on workstations process images that would be too large for workstation-grade GPUs, but it's far too slow for research purposes. That still leaves the question of memory management on GPUs. I'll keep working on a solution...
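For reference, a benchmark along these lines can be sketched roughly as below; the 8192 matrix size, the choice of cublasSgemm, and the event timing are illustrative, not the exact code I ran:

```cpp
// Rough benchmark sketch: time one SGEMM with plain device memory
// (cudaMalloc + explicit copies) versus managed memory (cudaMallocManaged).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

static float time_sgemm(int n, bool use_managed) {
  size_t count = size_t(n) * n;
  size_t bytes = count * sizeof(float);
  std::vector<float> host(count, 1.0f);

  float *A, *B, *C;
  if (use_managed) {
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    // Make sure the device is idle before the host touches managed
    // memory (required on pre-Pascal GPUs), then initialize directly.
    cudaDeviceSynchronize();
    for (size_t i = 0; i < count; ++i) { A[i] = B[i] = 1.0f; }
  } else {
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemcpy(A, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B, host.data(), bytes, cudaMemcpyHostToDevice);
  }

  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  cublasDestroy(handle);
  cudaFree(A); cudaFree(B); cudaFree(C);
  cudaEventDestroy(start); cudaEventDestroy(stop);
  return ms;
}

int main() {
  printf("device memory:  %.1f ms\n", time_sgemm(8192, false));
  printf("managed memory: %.1f ms\n", time_sgemm(8192, true));
  return 0;
}
```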
@jmerkow while we were aware of unified memory (UM), we decided to adopt only unified addressing (UVA), since Caffe is already careful about its allocation and communication and -- as you have confirmed -- is much faster than UM, at least out of the box. For memory management, there are of course more sophisticated mechanisms than what we have now (lazy allocation and on-demand communication, plus the special case of data layer prefetching), but many use cases should be covered by a memory pool such as #2738.
Note that as a workaround one can divide an input image / volume and partially compute a net on each fragment, then collect these fragments on the host to form an intermediate feature map, and finally compute the rest of the net on the device. This is obviously hands-on, and not dynamic as you requested, but it can be done through the current interface if need be. It could be sped up by doing async host-device transfers while computing the net on a given fragment, as sketched below. Closing, since this addresses the question of UM, but please open a further issue for any other memory proposals you might have.
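To make the overlap idea concrete, here is a rough sketch; the process_fragment kernel, the two-stream layout, and the fragment sizes are made up for illustration and are not part of Caffe's interface:

```cpp
// Sketch of overlapping host<->device copies with per-fragment compute
// using two CUDA streams. process_fragment stands in for running part
// of a net on one fragment.
#include <cuda_runtime.h>

__global__ void process_fragment(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;  // placeholder for real per-fragment work
}

int main() {
  const int kFragments = 8;
  const int kFragSize = 1 << 20;          // elements per fragment (illustrative)
  const size_t bytes = kFragSize * sizeof(float);

  // Pinned host memory is required for truly asynchronous copies.
  float *h_in, *h_out;
  cudaMallocHost(&h_in, kFragments * bytes);
  cudaMallocHost(&h_out, kFragments * bytes);

  float *d_in[2], *d_out[2];
  cudaStream_t stream[2];
  for (int s = 0; s < 2; ++s) {
    cudaMalloc(&d_in[s], bytes);
    cudaMalloc(&d_out[s], bytes);
    cudaStreamCreate(&stream[s]);
  }

  for (int f = 0; f < kFragments; ++f) {
    int s = f % 2;  // alternate streams so the copy for fragment f+1
                    // overlaps the compute for fragment f
    cudaMemcpyAsync(d_in[s], h_in + size_t(f) * kFragSize, bytes,
                    cudaMemcpyHostToDevice, stream[s]);
    process_fragment<<<(kFragSize + 255) / 256, 256, 0, stream[s]>>>(
        d_in[s], d_out[s], kFragSize);
    cudaMemcpyAsync(h_out + size_t(f) * kFragSize, d_out[s], bytes,
                    cudaMemcpyDeviceToHost, stream[s]);
  }
  cudaDeviceSynchronize();  // all fragments are now collected on the host

  for (int s = 0; s < 2; ++s) {
    cudaFree(d_in[s]); cudaFree(d_out[s]); cudaStreamDestroy(stream[s]);
  }
  cudaFreeHost(h_in); cudaFreeHost(h_out);
  return 0;
}
```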
@jmerkow, you might want to take a look at this pull request: NVIDIA#11.
Thank you for the responses. However, this is abysmally slow, since it requires numerous host->device and device->host memory copies; these are by far the slowest part of the process. It would also require deallocating the bottom blobs from the GPU, and to my knowledge there is currently no way to do this. In addition, it looks like most (if not all) of the memory loaded on the GPU persists until the entire network is deallocated. Is this due to the lazy allocation? Is there a particular need to keep this data on the GPU? Do you see any problem with freeing up GPU memory after a layer's computation?
This is now a reality in CUDA 8. Are we likely to see this?
Hello,
I have been working on a number of development efforts to extend Caffe to 3D and ND. With n-dimensional images, memory management on GPUs becomes a central issue; even on K40s, one quickly runs out of memory for moderately sized images. Though I only have one PR publicly available here, a lot of my effort has gone towards managing memory on the GPU.
Currently, I am using shared blobs to store temporary items on the GPU (similar to these PRs: #2016, #2009), and I can train on very large images (Cx512x512x512), but this requires explicit copying to/from the GPU and CPU, which is slow and doesn't scale to lower-memory GPUs such as those in workstations. Ideally, this management would happen dynamically to minimize computation time.
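For illustration, the shared temporary-buffer idea boils down to something like the sketch below; the class and usage names are made up for the example, not the actual shared-blob code in my branches:

```cpp
// Hypothetical shared device scratch buffer: grows to the largest
// requested size and is reused across layers, so only one temporary
// allocation lives on the GPU at a time.
#include <cuda_runtime.h>
#include <cstddef>

class SharedScratch {
 public:
  // Returns a device pointer with at least `bytes` of space,
  // reallocating only when the request exceeds the current capacity.
  void* get(size_t bytes) {
    if (bytes > capacity_) {
      if (ptr_) cudaFree(ptr_);
      cudaMalloc(&ptr_, bytes);
      capacity_ = bytes;
    }
    return ptr_;
  }
  ~SharedScratch() { if (ptr_) cudaFree(ptr_); }

 private:
  void* ptr_ = nullptr;
  size_t capacity_ = 0;
};

// Usage: every layer borrows the same scratch space for its temporaries.
// SharedScratch scratch;                               // shared across layers
// float* col_buf = static_cast<float*>(scratch.get(col_bytes));
// ... im2col / gemm using col_buf ...
```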
Recently, CUDA's unified memory (UM) system was brought to my attention; it has been part of the CUDA SDK since version 6 (link). It seems like a large improvement towards simplifying device abstraction in Caffe, to accompany development in recent PRs: device abstraction, de-duplication of forward/backward, and OpenCL.
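For anyone unfamiliar, the appeal is that a single managed allocation is visible to both host and device, so the explicit to/from copies above disappear. A minimal sketch (note that on pre-Pascal hardware the runtime migrates pages at kernel launch and requires a synchronize before the host touches the data again):

```cpp
// Minimal unified-memory sketch: one pointer, no explicit
// cudaMemcpy between host and device.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float s) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= s;
}

int main() {
  const int n = 1 << 20;
  float* data = nullptr;
  cudaMallocManaged(&data, n * sizeof(float));  // visible to host and device

  for (int i = 0; i < n; ++i) data[i] = 1.0f;   // host writes directly

  scale<<<(n + 255) / 256, 256>>>(data, n, 3.0f);
  cudaDeviceSynchronize();                      // required before host access

  printf("data[0] = %f\n", data[0]);            // host reads directly
  cudaFree(data);
  return 0;
}
```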
My question is: is there a particular reason UM was not incorporated when Caffe moved to UVA in this PR? It may simply be that UM was only about 6 months old at the time, and development on UVA probably began before UM was common knowledge (or while it was possibly still buggy), or it could be a larger architectural issue.
If there is no particular reason why UM was avoided, I am interested in contributing to Caffe in this regard (i.e. using UM in blobs), but this would involve a major change to blobs and synced memory. For example, the gpu/cpu_data functions would need major changes or might disappear completely. Due to the scope of the changes, I wanted to get a second opinion on this and would love tips/discussion from the main dev team to help guide development.
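For discussion purposes only, here is a rough sketch of what a UM-backed synced memory might look like; the method names mirror Caffe's existing accessors, but the implementation is hypothetical rather than a proposal-ready patch:

```cpp
// Hypothetical unified-memory-backed SyncedMemory: a single managed
// allocation replaces the separate cpu_ptr_/gpu_ptr_ pair, and the
// head-state tracking (HEAD_AT_CPU / HEAD_AT_GPU) largely goes away.
#include <cuda_runtime.h>
#include <cstddef>

class UnifiedSyncedMemory {
 public:
  explicit UnifiedSyncedMemory(size_t size) : size_(size) {}
  ~UnifiedSyncedMemory() { if (ptr_) cudaFree(ptr_); }

  // Both accessors return the same managed pointer; the cpu variants
  // synchronize so the host can safely touch pages after GPU work.
  const void* cpu_data() { alloc(); cudaDeviceSynchronize(); return ptr_; }
  const void* gpu_data() { alloc(); return ptr_; }
  void* mutable_cpu_data() { alloc(); cudaDeviceSynchronize(); return ptr_; }
  void* mutable_gpu_data() { alloc(); return ptr_; }
  size_t size() const { return size_; }

 private:
  void alloc() {  // lazy allocation, as in the existing SyncedMemory
    if (!ptr_) cudaMallocManaged(&ptr_, size_);
  }
  void* ptr_ = nullptr;
  size_t size_;
};
```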
--Jameson