Switch to Unified Virtual Address memory copies #555

Merged
8 commits merged from uva-memory into BVLC:dev on Jul 4, 2014

Conversation

@shelhamer
Member

CUDA Unified Virtual Addressing (UVA) makes host-device, device-host, and device-device communication transparent by distinguishing the cases through the virtual addresses of the pointers. Switching to this mode is intended as a useful abstraction for parallelism, so that blob data can be transferred through the same interface regardless of source and destination.

Here all cudaMemcpy calls are switched to the cudaMemcpyDefault mode used by virtual addressing.

A counter-argument is that this makes host / device communication less explicit and perhaps confusing. However, the pointers and our practice of {cpu,gpu} prefixes keep this clear. Provided we continue in the direction of device abstraction, the need to explicitly reference cpu or gpu operation should go away on its own, with the exception of data layers and future host / device parallelism.

To standardize the interface, all memcpy calls are replaced by caffe_copy and all memset calls by caffe_set and caffe_gpu_set, except in SyncedMem where it's awkward. Note that this melds caffe_gpu_copy into caffe_copy now that addressing is virtual.
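The heart of the change can be sketched as a single typed wrapper around cudaMemcpy in cudaMemcpyDefault mode (a minimal sketch, not the literal Caffe source; error checking via CUDA_CHECK omitted):

#include <cuda_runtime.h>

// One typed copy for every case: with UVA, cudaMemcpyDefault infers
// host-to-device, device-to-host, device-to-device, or host-to-host
// from the virtual addresses of the pointers themselves.
template <typename Dtype>
void caffe_copy(const int N, const Dtype* X, Dtype* Y) {
  if (X != Y) {
    // Caffe wraps this call in its CUDA_CHECK error macro.
    cudaMemcpy(Y, X, sizeof(Dtype) * N, cudaMemcpyDefault);
  }
}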

@jeffdonahue
Contributor

Cool, looks good

> Follow-up question: should all memcpy be replaced with caffe_copy (in its cpu / gpu variations)?

Yep, I think we should only ever call memcpy/cudaMemcpy inside the caffe_copy wrappers, both to abstract away the couple of details (count * sizeof(Dtype)) and to have function args in the "standard" order (count, input_ptr, output_ptr). Perhaps we could make the caffe_copys (and maybe all the math functions) inline, since that jump is the only possible performance hit and it's literally a one-line wrapper, but I really doubt it matters. Similarly, we might still have some memsets that should probably all be replaced with caffe_set.
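For concreteness, the wrapper pattern being described might look like this (a sketch; the names match the thread but the body is illustrative, not the real caffe_set):

#include <cstring>

// Standard argument order (count, value, output_ptr); the wrapper
// hides the count * sizeof(Dtype) detail mentioned above.
template <typename Dtype>
inline void caffe_set(const int N, const Dtype alpha, Dtype* Y) {
  if (alpha == 0) {
    std::memset(Y, 0, sizeof(Dtype) * N);  // all-zero bit pattern shortcut
    return;
  }
  for (int i = 0; i < N; ++i) {
    Y[i] = alpha;
  }
}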

@shelhamer
Member Author

^ Agreed. I'll take care of these in this PR and we can benchmark against dev. I doubt it will matter at all.

@shelhamer
Member Author

How about moving caffe_{set,copy} to common instead of math_functions? Seems more natural than including math_functions.hpp all over the place. Just kidding, that might bring back that awful CUDA / boost issue we had during the boost integration since we'd need the gpu set_kernel.

@sguada
Contributor

sguada commented Jun 28, 2014

@shelhamer I vote for moving all memcpy calls to caffe_copy and all memset calls to caffe_set.

@shelhamer
Member Author

All the memcpy calls for CPU and GPU have been replaced by caffe_copy. Note there is no more caffe_gpu_copy! Virtual addressing takes care of it.

Performance as measured by net_speed_benchmark.bin is the same (or ~50 ms faster, which is to say the same).

@shelhamer
Member Author

All the memset calls are now caffe_set / caffe_gpu_set, except in SyncedMem because it has no type.
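The GPU variant needs a typed fill kernel, which is exactly what SyncedMem's untyped void* buffer rules out. A minimal sketch (the kernel and launch configuration are illustrative, not Caffe's exact code):

#include <cuda_runtime.h>

// A typed fill kernel: SyncedMem has no Dtype, so it cannot use this.
template <typename Dtype>
__global__ void set_kernel(const int n, const Dtype alpha, Dtype* y) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    y[i] = alpha;
  }
}

template <typename Dtype>
void caffe_gpu_set(const int N, const Dtype alpha, Dtype* Y) {
  if (alpha == 0) {
    cudaMemset(Y, 0, sizeof(Dtype) * N);  // typeless zero fill is fine
    return;
  }
  const int threads = 512;
  set_kernel<Dtype><<<(N + threads - 1) / threads, threads>>>(N, alpha, Y);
}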

@jeffdonahue
Contributor

Will it always be ok to use cudaMemcpy for a host to host copy (i.e. a regular memcpy)? Isn't it possible we'd want to support building without CUDA at some point (e.g. to support OpenCL)?

@shelhamer
Member Author

There is a need for a CPU-only build and for abstracting whatever sort of device Caffe is executing on (for OpenCL support as you mentioned).

I did a cudaMemcpy here out of laziness -- we presently assume CUDA, so it seemed fine. I don't think this adds much to the burden of abstracting from CUDA when it's time. Instead of singling out the host-host copy, I'd rather have whatever CPU / GPU split solution comes out of the device-abstraction branch handle it. Or we could have ifdef guards for CPU-only or OpenCL builds, just like in mkl_alternate.hpp.

What do you think?

In include/caffe/util/math_functions.hpp:

@@ -59,6 +59,8 @@ void caffe_gpu_axpby(const int N, const Dtype alpha, const Dtype* X,
 template <typename Dtype>
 void caffe_copy(const int N, const Dtype *X, Dtype *Y);
 
+void caffe_copy(const size_t N, const void *X, void *Y);
@jeffdonahue
Contributor


I'm a little worried about this overloading -- it seems a little hard to figure out whether this generic version or the Dtype version is being called, since each argument of the function signature is a static_cast away from the Dtype version. And if you accidentally call this one, the size of the copy will probably be wrong. The Google style guide recommends only overloading functions when it's very clear which version will be called [1]. I'd prefer if this had some other name (caffe_void_copy? not sure).

[1] http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Function_Overloading
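To make the hazard concrete (a hypothetical snippet, not code from the PR): template deduction fails for a void* argument, so the untyped overload is silently selected and N is interpreted as bytes rather than elements:

#include <cstring>

// Typed overload: N counts elements.
template <typename Dtype>
void caffe_copy(const int N, const Dtype* X, Dtype* Y) {
  std::memcpy(Y, X, sizeof(Dtype) * N);
}

// Untyped overload: N counts bytes.
void caffe_copy(const size_t N, const void* X, void* Y) {
  std::memcpy(Y, X, N);
}

void example(const float* src, float* dst, const void* raw, int n) {
  caffe_copy(n, src, dst);  // typed version: copies n * sizeof(float) bytes
  caffe_copy(n, raw, dst);  // untyped version: copies only n bytes --
                            // 4x too few if n was meant as a float count
}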

@shelhamer
Member Author


At one point I had the void version called "caffe_transfer" since it is only used by SyncedMem to go from CPU to GPU and back. Perhaps that's still not a good name, since it is a copy and the source stays where it is.

I'll try to think of a name, or if you come up with one you like then commit it.


@shelhamer
Member Author


Decided on caffe_memcpy to distinguish it from caffe_copy and because it has exactly the same purpose as normal memcpy.
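Putting the name together with the ifdef idea floated above, a sketch (CPU_ONLY is an assumed build flag, not something this PR adds):

// Same contract as memcpy (N is a byte count), with the copy routed
// through UVA when CUDA is available. A sketch, not the PR's code.
#ifdef CPU_ONLY
#include <cstring>
inline void caffe_memcpy(const size_t N, const void* X, void* Y) {
  std::memcpy(Y, X, N);  // host-to-host is the only case without a GPU
}
#else
#include <cuda_runtime.h>
inline void caffe_memcpy(const size_t N, const void* X, void* Y) {
  cudaMemcpy(Y, X, N, cudaMemcpyDefault);  // UVA infers the direction
}
#endif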

@jeffdonahue
Contributor

K, I agree the cudaMemcpy seems reasonable.

@shelhamer
Member Author

Alright, this is done. @jeffdonahue @sguada please take a look for merge.

@jeffdonahue
Contributor

nice, I like the name caffe_memcpy. LGTM

@kloudkl
Contributor

kloudkl commented Jun 29, 2014

In the newly created branch device-abstraction, there are two methods in Device that can be simplified similarly.

template<typename Dtype>
void Device<Dtype>::copy(const int N, const Dtype *X, Dtype *Y);

template<typename Dtype>
void Device<Dtype>::copy_from_cpu(const int N, const Dtype *X, Dtype *Y);

The implementations of copy in the CPU and GPU device subclasses use caffe_copy or caffe_gpu_copy, and those of copy_from_cpu use caffe_copy or CUDA_CHECK(cudaMemcpy(...)).
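Under UVA both methods could indeed collapse into one. A hypothetical sketch against that branch's Device interface (class body abbreviated; not code from either branch):

// Forward declaration of the UVA-aware wrapper from this PR.
template <typename Dtype>
void caffe_copy(const int N, const Dtype* X, Dtype* Y);

template <typename Dtype>
class Device {
 public:
  // copy_from_cpu becomes redundant: the same call handles
  // CPU->GPU, GPU->CPU, GPU->GPU, and CPU->CPU transfers.
  void copy(const int N, const Dtype* X, Dtype* Y) {
    caffe_copy(N, X, Y);
  }
};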

shelhamer added 8 commits July 3, 2014 17:14
Host / device copies are distinguished by the virtual address of the pointers instead of explicit memcpy modes.

Do all memory copies by `cudaMemcpy` in UVA mode so that the same `caffe_copy()` interface works for all transfers. `cudaMemcpy()` is used in lieu of BLAS copies because they do not understand UVA. Drop the now unnecessary `caffe_gpu_copy()` since location of the pointers is now irrelevant to the interface.

...except for `SyncedMem` since it has no type.
shelhamer added a commit that referenced this pull request Jul 4, 2014
Switch to Unified Virtual Address memory copies
@shelhamer shelhamer merged commit aef3de7 into BVLC:dev Jul 4, 2014
@shelhamer shelhamer deleted the uva-memory branch July 4, 2014 00:27
@shelhamer shelhamer mentioned this pull request Jul 4, 2014
@kloudkl kloudkl mentioned this pull request Jul 7, 2014
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Switch to Unified Virtual Address memory copies
RazvanRanca pushed a commit to RazvanRanca/caffe that referenced this pull request Nov 4, 2014
Switch to Unified Virtual Address memory copies