
Incorporating cuda unified memory into caffe #2775

Closed
jmerkow opened this issue Jul 16, 2015 · 5 comments


jmerkow commented Jul 16, 2015

Hello,

I have been working on a number of development efforts to extend Caffe to 3D and N-D. With n-dimensional images, memory management on GPUs becomes a central issue; even on K40s, one quickly runs out of memory for moderately sized images. Though I only have one PR publicly available here, a lot of my effort has gone toward managing memory on the GPU.
Currently, I am using shared blobs to store temporary items on the GPU (similar to PRs #2016 and #2009), and I can train on very large images (Cx512x512x512), but this requires explicit copying between the GPU and CPU, which is slow and doesn't scale to lower-memory GPUs such as those in workstations. Ideally, this management would happen dynamically to minimize computation time.

Recently, CUDA's unified memory (UM) system was brought to my attention; it has been part of the CUDA SDK since version 6 (link). It seems like a large step toward simplifying device abstraction in Caffe, complementing development in recent PRs: device abstraction, de-duplication of forward/backward, and OpenCL.
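
To illustrate the appeal, here is a minimal toy sketch of what managed memory changes (an assumption-laden example, not Caffe's SyncedMemory): a single managed allocation is visible to both host and device, so explicit to_cpu()/to_gpu() style copies disappear and the driver migrates pages on demand.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= a;
}

int main() {
  const int n = 1 << 20;
  float* data = nullptr;
  cudaMallocManaged(&data, n * sizeof(float));  // one pointer, usable from host and device

  for (int i = 0; i < n; ++i) data[i] = 1.0f;   // host writes directly, no staging buffer

  scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // device reads/writes the same pointer
  cudaDeviceSynchronize();                         // required before the host touches it again

  printf("data[0] = %f\n", data[0]);               // host reads directly, no cudaMemcpy
  cudaFree(data);
  return 0;
}
```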

My question is: is there a particular reason UM was not incorporated when Caffe moved to UVA in this PR? It may simply be that UM was only about six months old at the time, and development on UVA probably began before UM was widely known (or while it was possibly still buggy), or it could be a larger architectural issue.

If there is no particular reason UM was avoided, I am interested in contributing to Caffe in this regard (i.e., utilizing UM in blobs), but this would involve a major change to blobs and synced memory. For example, the gpu_data/cpu_data functions would need major changes or might disappear completely. Due to the scope of the changes, I wanted to get a second opinion and would love tips/discussion from the main dev team to help guide development.

--Jameson


jmerkow commented Jul 17, 2015

So I did a few gemm operations using unified memory, and it greatly decreases performance. I multiplied two 8196x8196 matrices; the operation took about 300 ms using UVA and ~45 s using unified memory. It seems like a no-go. It may still help Caffe consumers on workstations process images that would be too large for workstation-grade GPUs, but it is far too slow for research purposes.
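
For reference, here is a rough sketch of the shape of that comparison (not the exact benchmark; the setup below is an assumption based on the numbers above): time cublasSgemm on explicitly allocated device memory versus on cudaMallocManaged memory that was last touched on the host, so its pages must migrate when the gemm runs.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Time one SGEMM call with CUDA events (error checking omitted for brevity).
static float time_sgemm(cublasHandle_t h, int n, const float* A,
                        const float* B, float* C) {
  const float alpha = 1.0f, beta = 0.0f;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  const int n = 8196;  // matrix size mentioned above
  const size_t bytes = size_t(n) * n * sizeof(float);
  cublasHandle_t handle;
  cublasCreate(&handle);
  std::vector<float> host(size_t(n) * n, 1.0f);

  // Case 1: explicit device allocations plus explicit copies (UVA-style).
  float *dA, *dB, *dC;
  cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
  cudaMemcpy(dA, host.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, host.data(), bytes, cudaMemcpyHostToDevice);
  printf("explicit device memory: %.1f ms\n", time_sgemm(handle, n, dA, dB, dC));

  // Case 2: managed allocations initialized on the host, so the gemm pays
  // for page migration to the device.
  float *mA, *mB, *mC;
  cudaMallocManaged(&mA, bytes); cudaMallocManaged(&mB, bytes); cudaMallocManaged(&mC, bytes);
  for (size_t i = 0; i < size_t(n) * n; ++i) { mA[i] = 1.0f; mB[i] = 1.0f; }
  printf("managed memory:         %.1f ms\n", time_sgemm(handle, n, mA, mB, mC));

  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  cudaFree(mA); cudaFree(mB); cudaFree(mC);
  cublasDestroy(handle);
  return 0;
}
```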

That still leaves the question of memory management on GPUs. I'll keep working on a solution...

@shelhamer (Member)

@jmerkow while we were aware of unified memory (UM), we decided to adopt only unified addressing (UVA), since Caffe is already careful about its allocation and communication and, as you have confirmed, is much faster than UM, at least out of the box.

For memory management, there are of course more sophisticated mechanisms than what we have now (lazy allocation and on-demand communication, plus the special case of data layer prefetching), but many use cases should be covered by a memory pool such as #2738.

It still may help caffe consumers on workstations to process images that would be too large for workstation-grade GPUs

Note that as a workaround one can divide an input image / volume and partially compute the net on each fragment, then collect these fragments on the host into an intermediate feature map, and then compute the rest of the net on that map back on the device. This is obviously hands-on, and not dynamic as you requested, but it can be done through the current interface if it needs to be. It could be sped up by doing async host-device transfers while computing the net on a given fragment.
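
A minimal sketch of that overlap, with hypothetical names and a placeholder kernel rather than Caffe's actual API: double-buffer fragments on the device and copy the next fragment on one stream while the partial net runs on the current one on another stream. The host fragment buffers should be pinned (cudaMallocHost) so the async copies can actually overlap with compute.

```cpp
#include <cuda_runtime.h>

// Placeholder standing in for the partial-net forward pass on one fragment.
__global__ void partial_net_kernel(const float* frag, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = frag[i];  // a real net would run its layers here
}

void process_fragments(float* const* h_frags, float* const* h_outs,
                       int num_frags, int n) {
  cudaStream_t copy_stream, compute_stream;
  cudaStreamCreate(&copy_stream);
  cudaStreamCreate(&compute_stream);

  float *d_in[2], *d_out[2];
  for (int b = 0; b < 2; ++b) {
    cudaMalloc(&d_in[b], n * sizeof(float));
    cudaMalloc(&d_out[b], n * sizeof(float));
  }

  // Prime the pipeline with the first fragment.
  cudaMemcpyAsync(d_in[0], h_frags[0], n * sizeof(float),
                  cudaMemcpyHostToDevice, copy_stream);
  cudaStreamSynchronize(copy_stream);

  for (int i = 0; i < num_frags; ++i) {
    const int cur = i % 2, nxt = (i + 1) % 2;
    // Stage the next fragment while this one is being computed.
    if (i + 1 < num_frags)
      cudaMemcpyAsync(d_in[nxt], h_frags[i + 1], n * sizeof(float),
                      cudaMemcpyHostToDevice, copy_stream);
    partial_net_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(
        d_in[cur], d_out[cur], n);
    // Copy this fragment's result back, then wait on both streams before
    // reusing the buffers.
    cudaMemcpyAsync(h_outs[i], d_out[cur], n * sizeof(float),
                    cudaMemcpyDeviceToHost, compute_stream);
    cudaStreamSynchronize(compute_stream);
    cudaStreamSynchronize(copy_stream);
  }

  for (int b = 0; b < 2; ++b) { cudaFree(d_in[b]); cudaFree(d_out[b]); }
  cudaStreamDestroy(copy_stream);
  cudaStreamDestroy(compute_stream);
}
```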

Closing, since this addresses the question of UM, but please open a further issue for any other memory proposals you might have.

@lukeyeager (Contributor)

That still leaves the question of memory management on GPUs. I'll keep working on a solution...

@jmerkow, you might want to take a look at this pull request: NVIDIA#11.
It utilizes this library: https://github.com/NVIDIA/cnmem.


jmerkow commented Jul 17, 2015

Thank you for the responses.
@shelhamer, I have a follow-up question before I work on a new proposal. First, I'll explain my current approach and its issues.
I am already doing something similar to what you suggested.
Breaking up the operation is trivial for layers without spatial dims; for vision layers it's easiest to break up by output channel/input channel, since that is easiest to copy with row-major indexing.
This works great for all the vision layers that are within-channel operations, such as pooling.
And it's easy to split up convolution forward by output channel and produce one 'top channel' at a time.
However, this doesn't get you very far with convolution, since the bottom's column buffer and the bottom itself occupy the majority of the GPU memory.
To break up by input channel, you need to accumulate the top over input channels. This reduces the memory needed for the original bottom data and decreases the size of the im2col result.
(For backward, it is pretty much the same operation, except you produce the bottom_diff by accumulating across the output channels.)

However, this is abysmally slow, since it requires numerous host->device and device->host memory copies; these are by far the slowest part of the process.
I'm thinking a better approach is to produce the entire im2col matrix, remove the original bottom from GPU memory, then perform the gemm operation with as many rows of the im2col result as possible at a time, incrementing/shifting until done. This would reduce the memcpy operations significantly.
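
Roughly, the chunked gemm would look something like this (a sketch with assumed names and Caffe-style row-major shapes, not working layer code): with the full im2col buffer resident and the bottom freed, run the forward gemm over row chunks of the column buffer and accumulate into the top, so each gemm call only touches one chunk.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <algorithm>

// Row-major shapes: weights [c_out x k_total], col [k_total x spatial_out],
// top [c_out x spatial_out], where k_total = c_in * kernel_h * kernel_w.
void conv_forward_chunked(cublasHandle_t handle,
                          const float* d_weights, const float* d_col,
                          float* d_top, int c_out, int k_total,
                          int spatial_out, int chunk_rows) {
  const float alpha = 1.0f;
  for (int k0 = 0; k0 < k_total; k0 += chunk_rows) {
    const int kc = std::min(chunk_rows, k_total - k0);
    // First chunk overwrites the top (beta = 0); later chunks accumulate.
    const float beta = (k0 == 0) ? 0.0f : 1.0f;
    // Row-major: top += weights[:, k0:k0+kc] * col[k0:k0+kc, :], expressed in
    // cuBLAS column-major form as top^T = col_chunk^T * weights_chunk^T.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                spatial_out, c_out, kc,
                &alpha,
                d_col + size_t(k0) * spatial_out, spatial_out,
                d_weights + k0, k_total,
                &beta, d_top, spatial_out);
  }
}
```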

This would require deallocating the bottom from the GPU, and to my knowledge there is currently no way to do this. In addition, it looks like most (if not all) of the memory loaded onto the GPU persists until the entire network is deallocated. Is this due to the lazy allocation? There isn't a particular need to keep this data on the GPU, is there? Do you see any problem with freeing GPU memory after a layer's computation?
A memory pool doesn't address how and when to allocate GPU memory for operations; this would still need to be handled, although a pool would certainly be useful for automatically maximizing memory usage.


evolu8 commented Oct 19, 2017

This is now a reality in CUDA 8. Are we likely to see this?
