Switch to Unified Virtual Address memory copies #555

Merged
8 commits merged from uva-memory into BVLC:dev on Jul 4, 2014

Conversation

@shelhamer
Member

CUDA Unified Virtual Addressing (UVA) makes host-device, device-host, and device-device communication transparent by distinguishing the cases through the virtual addresses of the pointers. Switching to this mode is intended as a useful abstraction for parallelism, so that blob data can be transferred through the same interface regardless of source and destination.

Here all cudaMemcpy calls are switched to the cudaMemcpyDefault mode used by virtual addressing.

A counter-argument is that this makes host / device communication less explicit and perhaps confusing. However, the pointers and our practice of {cpu,gpu} prefixes keep this clear. Provided we continue in the direction of device abstraction, the need to explicitly reference cpu or gpu operation should go away on its own, with the exception of data layers and future host / device parallelism.

To standardize the interface, all memcpy calls are replaced by caffe_copy and all memset calls by caffe_set and caffe_gpu_set, except in SyncedMem where it's awkward. Note that this melds caffe_gpu_copy into caffe_copy now that addressing is virtual.
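The heart of the change can be sketched as a single typed wrapper around cudaMemcpy in cudaMemcpyDefault mode (a minimal sketch, not the literal Caffe source; error checking via CUDA_CHECK omitted):

#include <cuda_runtime.h>

// One typed copy for every case: with UVA, cudaMemcpyDefault infers
// host-to-device, device-to-host, device-to-device, or host-to-host
// from the virtual addresses of the pointers themselves.
template <typename Dtype>
void caffe_copy(const int N, const Dtype* X, Dtype* Y) {
  if (X != Y) {
    // Caffe wraps this call in its CUDA_CHECK error macro.
    cudaMemcpy(Y, X, sizeof(Dtype) * N, cudaMemcpyDefault);
  }
}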

@jeffdonahue
Contributor

Cool, looks good

> Follow-up question: should all memcpy be replaced with caffe_copy (in its cpu / gpu variations)?

Yep, I think we should only ever call memcpy/cudaMemcpy inside the caffe_copy wrappers, both to abstract away the couple of details (count * sizeof(Dtype)) and to have function args in the "standard" order (count, input_ptr, output_ptr). Perhaps we could make the caffe_copys (and maybe all the math functions) inline, since that jump is the only possible performance hit and it's literally a one-line wrapper, but I really doubt it matters. Similarly, we might still have some memsets that should probably all be replaced with caffe_set.
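For concreteness, the wrapper pattern being described might look like this (a sketch; the names match the thread but the body is illustrative, not the real caffe_set):

#include <cstring>

// Standard argument order (count, value, output_ptr); the wrapper
// hides the count * sizeof(Dtype) detail mentioned above.
template <typename Dtype>
inline void caffe_set(const int N, const Dtype alpha, Dtype* Y) {
  if (alpha == 0) {
    std::memset(Y, 0, sizeof(Dtype) * N);  // all-zero bit pattern shortcut
    return;
  }
  for (int i = 0; i < N; ++i) {
    Y[i] = alpha;
  }
}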

@shelhamer
Member Author

^ Agreed. I'll take care of these in this PR and we can benchmark against dev. I doubt it will matter at all.

@shelhamer
Member Author

How about moving caffe_{set,copy} to common instead of math_functions? Seems more natural than including math_functions.hpp all over the place. Just kidding, that might bring back that awful CUDA / boost issue we had during the boost integration since we'd need the gpu set_kernel.

@sguada
Contributor

sguada commented Jun 28, 2014

@shelhamer I vote for moving all memcpy calls to caffe_copy and all memset calls to caffe_set.

@shelhamer
Member Author

All the memcpy calls for CPU and GPU have been replaced by caffe_copy. Note there is no more caffe_gpu_copy! Virtual addressing takes care of it.

Performance as measured by net_speed_benchmark.bin is the same (or ~50 ms faster, which is to say the same).

@shelhamer
Member Author

All the memset calls are now caffe_set / caffe_gpu_set, except in SyncedMem because it has no type.
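The GPU variant needs a typed fill kernel, which is exactly what SyncedMem's untyped void* buffer rules out. A minimal sketch (the kernel and launch configuration are illustrative, not Caffe's exact code):

#include <cuda_runtime.h>

// A typed fill kernel: SyncedMem has no Dtype, so it cannot use this.
template <typename Dtype>
__global__ void set_kernel(const int n, const Dtype alpha, Dtype* y) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    y[i] = alpha;
  }
}

template <typename Dtype>
void caffe_gpu_set(const int N, const Dtype alpha, Dtype* Y) {
  if (alpha == 0) {
    cudaMemset(Y, 0, sizeof(Dtype) * N);  // typeless zero fill is fine
    return;
  }
  const int threads = 512;
  set_kernel<Dtype><<<(N + threads - 1) / threads, threads>>>(N, alpha, Y);
}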

@jeffdonahue
Contributor

Will it always be ok to use cudaMemcpy for a host to host copy (i.e. a regular memcpy)? Isn't it possible we'd want to support building without CUDA at some point (e.g. to support OpenCL)?

@shelhamer
Member Author

There is a need for a CPU-only build and for abstracting whatever sort of device Caffe is executing on (for OpenCL support as you mentioned).

I did a cudaMemcpy here out of laziness -- we presently assume CUDA, so it seemed fine. I don't think this adds much to the burden of abstracting from CUDA when it's time. Instead of singling out the host-host copy, I'd rather have whatever CPU / GPU split solution comes out of the device-abstraction branch handle it. Or we could have ifdef guards for CPU-only or OpenCL builds, just like in mkl_alternate.hpp.

What do you think?

In include/caffe/util/math_functions.hpp:

@@ -59,6 +59,8 @@ void caffe_gpu_axpby(const int N, const Dtype alpha, const Dtype* X,
 template <typename Dtype>
 void caffe_copy(const int N, const Dtype *X, Dtype *Y);
 
+void caffe_copy(const size_t N, const void *X, void *Y);
@jeffdonahue
Contributor


I'm a little worried about this overloading -- it seems a little hard to figure out whether this generic version or the Dtype version is being called, since each argument of the function signature is a static_cast away from the Dtype version. And if you accidentally call this one, the size of the copy will probably be wrong. The Google style guide recommends only overloading functions when it's very clear which version will be called [1]. I'd prefer if this had some other name (caffe_void_copy? not sure).

[1] http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Function_Overloading
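To make the hazard concrete (a hypothetical snippet, not code from the PR): template deduction fails for a void* argument, so the untyped overload is silently selected and N is interpreted as bytes rather than elements:

#include <cstring>

// Typed overload: N counts elements.
template <typename Dtype>
void caffe_copy(const int N, const Dtype* X, Dtype* Y) {
  std::memcpy(Y, X, sizeof(Dtype) * N);
}

// Untyped overload: N counts bytes.
void caffe_copy(const size_t N, const void* X, void* Y) {
  std::memcpy(Y, X, N);
}

void example(const float* src, float* dst, const void* raw, int n) {
  caffe_copy(n, src, dst);  // typed version: copies n * sizeof(float) bytes
  caffe_copy(n, raw, dst);  // untyped version: copies only n bytes --
                            // 4x too few if n was meant as a float count
}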

@shelhamer
Member Author


At one point I had the void version called "caffe_transfer" since it is only used by SyncedMem to go from CPU to GPU and back. Perhaps that's still not a good name, since it is a copy and the source stays where it is.

I'll try to think of a name, or if you come up with one you like then commit it.


@shelhamer
Member Author


Decided on caffe_memcpy to distinguish it from caffe_copy and because it has exactly the same purpose as normal memcpy.
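Putting the name together with the ifdef idea floated above, a sketch (CPU_ONLY is an assumed build flag, not something this PR adds):

// Same contract as memcpy (N is a byte count), with the copy routed
// through UVA when CUDA is available. A sketch, not the PR's code.
#ifdef CPU_ONLY
#include <cstring>
inline void caffe_memcpy(const size_t N, const void* X, void* Y) {
  std::memcpy(Y, X, N);  // host-to-host is the only case without a GPU
}
#else
#include <cuda_runtime.h>
inline void caffe_memcpy(const size_t N, const void* X, void* Y) {
  cudaMemcpy(Y, X, N, cudaMemcpyDefault);  // UVA infers the direction
}
#endif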

@jeffdonahue
Contributor

K, I agree the cudaMemcpy seems reasonable.

@shelhamer
Member Author

Alright, this is done. @jeffdonahue @sguada please take a look for merge.

@jeffdonahue
Contributor

nice, I like the name caffe_memcpy. LGTM

@kloudkl
Contributor

kloudkl commented Jun 29, 2014

In the newly created branch device-abstraction, there are two methods in Device that can be simplified similarly.

template<typename Dtype>
void Device<Dtype>::copy(const int N, const Dtype *X, Dtype *Y);

template<typename Dtype>
void Device<Dtype>::copy_from_cpu(const int N, const Dtype *X, Dtype *Y);

The implementations of copy in the CPU and GPU device subclasses use caffe_copy or caffe_gpu_copy, and those of copy_from_cpu use caffe_copy or CUDA_CHECK(cudaMemcpy(...)).
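Under UVA both methods could indeed collapse into one. A hypothetical sketch against that branch's Device interface (class body abbreviated; not code from either branch):

// Forward declaration of the UVA-aware wrapper from this PR.
template <typename Dtype>
void caffe_copy(const int N, const Dtype* X, Dtype* Y);

template <typename Dtype>
class Device {
 public:
  // copy_from_cpu becomes redundant: the same call handles
  // CPU->GPU, GPU->CPU, GPU->GPU, and CPU->CPU transfers.
  void copy(const int N, const Dtype* X, Dtype* Y) {
    caffe_copy(N, X, Y);
  }
};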

shelhamer added 8 commits July 3, 2014 17:14
Host / device copies are distinguished by the virtual address of the pointers instead of explicit memcpy modes.

Do all memory copies by `cudaMemcpy` in UVA mode so that the same `caffe_copy()` interface works for all transfers. `cudaMemcpy()` is used in lieu of BLAS copies because they do not understand UVA. Drop the now unnecessary `caffe_gpu_copy()` since location of the pointers is now irrelevant to the interface.

...except for `SyncedMem` since it has no type.
shelhamer added a commit that referenced this pull request Jul 4, 2014
Switch to Unified Virtual Address memory copies
@shelhamer shelhamer merged commit aef3de7 into BVLC:dev Jul 4, 2014
@shelhamer shelhamer deleted the uva-memory branch July 4, 2014 00:27
@shelhamer shelhamer mentioned this pull request Jul 4, 2014
@kloudkl kloudkl mentioned this pull request Jul 7, 2014
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Switch to Unified Virtual Address memory copies
RazvanRanca pushed a commit to RazvanRanca/caffe that referenced this pull request Nov 4, 2014
Switch to Unified Virtual Address memory copies