@Fabian : how to plug your convolutional layer into other frameworks? #4155
@hughperkins I do not provide benchmarks yet, as the autotuning part is not completely done. Without autotuning, it performs a little better than OpenCL im2col on AMD hardware, and about the same as CUDA im2col on nVidia hardware (with both the OpenCL and CUDA backends). As for using it in other frameworks:
struct LibDNNConfig {
LibDNNConfig() :
in_shape(3, 1),
out_shape(3, 1),
kernel(1, 1),
pad(1, 0),
stride(1, 1),
dilation(1, 0)
{}
device* dev_ptr = nullptr;
std::vector<int_tp> in_shape;
std::vector<int_tp> out_shape;
std::vector<int_tp> kernel;
std::vector<int_tp> pad;
std::vector<int_tp> stride;
std::vector<int_tp> dilation;
int_tp group = 1;
bool bias_term = false;
bool fast_unsafe_math = false;
bool weights_backward = true;
bool bias_backward = true;
libdnnConvolutionWeightAlgo_t wgalgo = LIBDNN_CONVOLUTION_WG_ALGO_DIRECT;
};
The framework is expected to fill out this struct and pass it to libDNN for initialization.
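For illustration, filling it out for an ordinary 3x3, stride-1, pad-1 convolution might look roughly like the sketch below. The shape convention (channels, height, width, with the batch size passed separately to Forward) and the my_device handle are assumptions, not confirmed by the snippet above; dilation and the remaining flags are left at their defaults.
LibDNNConfig config;
config.dev_ptr   = my_device;                         // hypothetical device wrapper supplied by the host framework
config.in_shape  = std::vector<int_tp>{64, 32, 32};   // assumed (channels, height, width); batch size goes to Forward()
config.out_shape = std::vector<int_tp>{128, 32, 32};
config.kernel    = std::vector<int_tp>{3, 3};
config.pad       = std::vector<int_tp>{1, 1};
config.stride    = std::vector<int_tp>{1, 1};
config.group     = 1;
config.bias_term = true;
config.weights_backward = true;
config.bias_backward    = true;
config.wgalgo    = LIBDNN_CONVOLUTION_WG_ALGO_DIRECT; // see the weight-algorithm enum further down the thread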
void Forward(const Dtype* bottom_data, const Dtype* weight,
const Dtype* bias,
Dtype* top_data, int_tp batch_size);
void Backward(bool prop_down_data, bool prop_down_weights,
const Dtype* top_data, const Dtype* top_diff,
const Dtype* weight, Dtype* weight_diff,
const Dtype* bias, Dtype* bias_diff,
const Dtype* bottom_data, Dtype* bottom_diff,
int_tp batch_size);
void Tune(Dtype* top_data, Dtype* top_diff,
Dtype* weight, Dtype* weight_diff,
Dtype* bias, Dtype* bias_diff,
Dtype* bottom_data, Dtype* bottom_diff,
int_tp batch_size);
Here Dtype* is expected to be cl_mem (which is itself a pointer type, _cl_mem*) on OpenCL platforms, and a device memory pointer on CUDA hardware.
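If that is right, a forward call on the OpenCL side would look roughly like this sketch, where the cl_mem handles (themselves pointers) are simply reinterpreted to satisfy the Dtype* signature; the libdnn object and the buffer names are placeholders, since the class constructor is not shown above:
cl_mem bottom_buf, weight_buf, bias_buf, top_buf;  // allocated elsewhere with clCreateBuffer
libdnn.Forward(reinterpret_cast<float*>(bottom_buf),
               reinterpret_cast<float*>(weight_buf),
               reinterpret_cast<float*>(bias_buf),   // presumably only used when bias_term is set
               reinterpret_cast<float*>(top_buf),
               batch_size);
// On the CUDA backend the same parameters would instead be plain device
// pointers, e.g. obtained from cudaMalloc, passed through directly.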
Ok. Some questions:
What does it rely on ViennaCL for? If it is removed, what are the implications?
Presumably your framework works only on cl_mem's, right? So you could factorize this interface, I guess, to strip the device-independence wrapper in the outer layer, then pass just the cl_mem and offset to the lower-level layer? On a related topic, how do I compile it? Does it need any fancy C++11 compilers and stuff? Or can it be built on e.g. msvc10, gcc-4, and so on?
Ok. Couple of questions:
Whilst I hate to give up my own conv optimization efforts, I don't seem to be doing much work on optimization lately, not having an AMD card :-P , and anyway, there seems no point in having a zillion redundant AMD/OpenCL conv implementations, so I'm reasonably up for just switching to yours, and then plausibly I can contribute to optimizing yours. Oh, one other question: my current frameworks all actually run on NVIDIA too, despite its OpenCL 1.1-ness. Will I be able to use your conv layer on NVIDIA? Or should I have like
(Note: I'm not saying you should support anything other than AMD, I'm just posing the question. I reckon it's probably better to have one really super insanely fast conv library for AMD, than a kind of jack-of-all-trades slow-everywhere conv library :-P If you only support AMD, I'm fine with
(Thinking about how to make it easy for me to use your library on AMD platforms, and build my library anyway on non-AMD platforms, I'm kind of ... well, I'm jumping ahead, because I don't know your answers yet :-D , but ... :
@hughperkins
The thing with dilated convolutions in all these applications is that they are not very efficient (with im2col GEMM) at the moment, as they use large amounts of convolution buffer memory. My kernels also attempt to solve that, as there is no cuDNN alternative for it yet.
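For a rough sense of the scale involved (illustrative numbers, not from this thread): the per-image im2col buffer holds channels * kernel_h * kernel_w * out_h * out_w values, and dilated layers in dense-prediction networks keep the output at full resolution, so that buffer grows quickly.
// Illustrative im2col buffer size for one dilated 3x3 layer at full resolution:
size_t channels = 512, kernel_h = 3, kernel_w = 3, out_h = 64, out_w = 64;
size_t buffer_elems = channels * kernel_h * kernel_w * out_h * out_w;
// = 512 * 9 * 64 * 64 = 18,874,368 values, about 72 MB per image in fp32,
// before multiplying by the batch size.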
Ok. For Linux this is probably ok. Actually I have two libraries:
cltorch is only really targeted at Linux and Mac, for 99.9% of users, I think. DeepCL is heavily targeted at Windows users, using Python 2.7 and Python 3.4. Therefore DeepCL needs to be buildable using msvc2008 and msvc2010. Or actually, I build it with msvc2010, and the Python bit in msvc2008 or msvc2010, depending on the Python version, and link those together. The Windows bit, i.e. DeepCL, sounds complicated, so we could put that to one side for now. However, you could perhaps think about providing Windows 32-bit and 64-bit binaries, in an automated, sustainable way? Then it would be easy for me to migrate DeepCL to use greentea convolutions too. And on the subject of 'greentea convolutions', I only really want the convolutional layer. I don't really want to bring the entirety of OpenCL Caffe into my projects :-) To what extent could it be possible to factorize the convolutional implementation into e.g. a separate repo, so it is fairly lightweight to clone and build?
Ok, sounds good. Sounds quite NVIDIA-heavy actually... I think it'd be good to have an R9 Fury and an R9 390x in those benchmarks. I'm trying to persuade my AMD contact(s) to provide some for OpenCL/hcc library development, but no success so far :-)
Hmmm. Are you saying your library handles both CUDA and OpenCL? That sounds ... heavy... I guess that if I feed it a Wait... when you say you test on NVIDIA, you mean you test the OpenCL version on NVIDIA, right? I'm not interested in anything that can't handle receiving the data as a
Ok.
sounds good
sounds good
Ok. Actually, groups are generally available. At least, Torch supports them. It's possible dilation is supported by Torch too, and I just didn't notice yet :-)
PS Why aren't you testing/optimizing on the R9 Fury and R9 390x? If I can somehow persuade my AMD contact(s) to provide a box with an R9 Fury or four in it, could that be useful to you? Not saying I can do that, but the more people who could use such a box to do useful work, the stronger the case I can make :-D
(By the way, I think it'd be good for you to factorize the OpenCL convolutional layer into a separate repo, out of Caffe, so issues and discussions on it are all together. Otherwise we'll have to cram everything into enormous single threads, as happened earlier for your and Robert's pull requests, which both became a little crazy :-P ) Edit: oh, I already said this :-D
@hughperkins The CUDA part can of course be disabled at compile time; same for the OpenCL part. nVidia chips can be tuned with both OpenCL and CUDA kernels; however, my tests so far show they are almost equal (though NVRTC/NVCC takes longer to compile than ViennaCL/OpenCL). Cross compiling is not very "heavy": all it needs is some "#define" declarations to preprocess-replace OpenCL functions with CUDA ones, and then compile and execute with NVRTC instead of ViennaCL (you can look this up in the existing source code, it's already there). There will be stand-alone binaries and sources from Caffe, stripped of the unnecessary Caffe elements. But Caffe is a very neat framework to tune the kernels in initially. The plan for development is the following:
Feel free to comment if that seems reasonable :)
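As an aside on the cross-compiling point above, the #define substitution could look roughly like this (an illustrative sketch; the exact macro set in the Caffe/libDNN sources may differ). It gets prepended to the OpenCL-style kernel string before compiling it with NVRTC:
// Map OpenCL kernel-language constructs onto their CUDA equivalents.
#define __kernel extern "C" __global__
#define __global                              // global address-space qualifier: not needed in CUDA
#define __local __shared__
#define barrier(x) __syncthreads()
#define get_global_id(n)  (blockIdx.x * blockDim.x + threadIdx.x)  // sketch covers dimension 0 only
#define get_local_id(n)   (threadIdx.x)
#define get_group_id(n)   (blockIdx.x)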
@naibaf7 Let us set up an email group.
Yes, a Google group could be interesting for this thread, also for OpenCV. /cc @mtamburrano @vpisarev
@naibaf7 @hughperkins I'm also adding @edgarriba because we will probably want to experiment with the libdnn engine in tiny-cnn during GSoC.
Hi Fabian, Ok, sounds good. Please let me know when you reach each of the following two milestones:
The first one (plausibly without binaries actually) is a pre-requisite for switching cltorch to use your convolutional implementation. The second one is a pre-requisite for switching DeepCL to use your convolutional implementation.
Yes, I need something hosted though. A card without anywhere to put it is just extra weight to carry in my backpack :-D
Yes, that's kind of what I would like :-)
Ok
Sure. And I think Caffe is a good framework to tune the kernels in, both now and in the future. Having said that, for example, in the Torch project, the code is factorized into multiple repos like:
Having the code factorized like this makes mixing and matching very easy. It's one reason why porting Torch to get AlexNet working on OpenCL was only somewhere around 6-12 weeks of work :-)
I have completed the 1st optimized version of backPropWeights, which is 500x faster than DeepCL's original one. I am not a CNN expert, but I have 10 years' experience in GPU performance analysis and tuning.
Hi @fsword73 That sounds humblingly impressive :-) I'll try this out this weekend and report my results. By the way, is this optimization general to e.g. NVIDIA, or specific to AMD? I actually strongly prefer optimizations that give really excellent performance on AMD, even if they're not that useful elsewhere, but .... I only have NVIDIA cards to run on :-P So if it is general, I can try it easily; otherwise I will have to ... I'm not sure.... I've been poking AMD for like a year and a half to get one single cloud AMD GPU, but not getting very far so far :-P
Hi Fabian, I'm writing a paper about cltorch, which I plan to throw onto arXiv. I have a section about how cltorch compares with other OpenCL implementations, in which I'm citing your OpenCL Caffe implementation. Question: what algorithm(s) are you using? I believe originally you were using im2col + ViennaCL BLAS? And currently?
The current version supports:
However, I do not have up-to-date benchmarks of all variants at the moment... that'd be a huge effort to go through.
Whoa, cool :-O Can you clarify which of these will be used when the data is stored in OpenCL buffers, i.e. referenced via
When you say 'direct convolution', do you mean that it is using im2col + GEMM, but with the im2col part not carried out in series with the GEMM, and instead somehow at the same time as the GEMM, so that the memory usage is much smaller than im2col + GEMM? In any case, I haven't explained it very well (probably because I don't fully understand it, to be fair), but is it the algorithm detailed in "cuDNN: Efficient Primitives for Deep Learning"?
What hardware do these target? It sounds like they are Intel-specific? Is it only for Intel CPUs, or also for Intel HD integrated GPUs? How does selection between these algorithms take place?
/cc @nyanp |
@hughperkins Intel prepared the algorithms to also work on other hardware; however, they are mainly targeted at Intel GPUs, such as Iris Pro. im2col + cBLAS is targeted at CPUs; however, it uses the network kernels (including im2col) in OpenCL mode for more parallelism than standard Caffe. The memory (cl_mem structs) is mapped to host pointers in order to call the cBLAS functions. "Direct convolution (implicit GEMM)" is, as you point out, similar to the cuDNN paper, though I don't know how similar since I don't know their exact code. My version uses no additional memory in the current variants (atomic and direct weight update). I might do another version with reduction weight update, which will use a little extra memory.
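The mapping trick described here could look roughly like the following sketch (my own assumptions about buffer names and sizes, not libDNN's actual code), using standard OpenCL and cBLAS calls:
#include <CL/cl.h>
#include <cblas.h>

// Expose three cl_mem buffers as host pointers, run a CPU GEMM on them,
// then hand them back to OpenCL. A, B, C are row-major m x k, k x n, m x n.
void mapped_sgemm(cl_command_queue queue,
                  cl_mem buf_a, cl_mem buf_b, cl_mem buf_c,
                  int m, int n, int k) {
  cl_int err;
  float* a = static_cast<float*>(clEnqueueMapBuffer(
      queue, buf_a, CL_TRUE, CL_MAP_READ,
      0, sizeof(float) * m * k, 0, nullptr, nullptr, &err));
  float* b = static_cast<float*>(clEnqueueMapBuffer(
      queue, buf_b, CL_TRUE, CL_MAP_READ,
      0, sizeof(float) * k * n, 0, nullptr, nullptr, &err));
  float* c = static_cast<float*>(clEnqueueMapBuffer(
      queue, buf_c, CL_TRUE, CL_MAP_WRITE,
      0, sizeof(float) * m * n, 0, nullptr, nullptr, &err));
  // The mapped pointers are ordinary host pointers, so any CPU BLAS works here.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k, 1.0f, a, k, b, n, 0.0f, c, n);
  clEnqueueUnmapMemObject(queue, buf_a, a, 0, nullptr, nullptr);
  clEnqueueUnmapMemObject(queue, buf_b, b, 0, nullptr, nullptr);
  clEnqueueUnmapMemObject(queue, buf_c, c, 0, nullptr, nullptr);
}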
Yes, with all these backends, engine selection is becoming quite strategic.
Fabian, Cool. Thanks! :-) Question: for AMD hardware, running say VGG A, which implementation(s) will tend to be used and/or produce the fastest timings? By the way, am I right in thinking the current convnet-benchmarks timings for greentea were using im2col + ViennaCL BLAS? Hugh
@hughperkins I could run the convnet-benchmarks on a (sadly thermally limited) W9100 if you want, and post the results here. I can't do Titan X benchmarks right now, and the GTX 980 sadly doesn't have enough memory for the official benchmark.
@hughperkins
typedef enum {
// Stack the batch update into one GEMM block
// (deterministic, 1 kernel call)
// Serializes the batch and may therefore underuse
// the GPU's compute units.
LIBDNN_CONVOLUTION_WG_ALGO_DIRECT = 0,
// Use multiple GEMM blocks in parallel and update weights atomically
// (non-deterministic, 1 kernel call, not supported on all devices)
// Parallelizes the batch and therefore has higher GPU utilization.
LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC = 1
} libdnnConvolutionWeightAlgo_t;
typedef enum {
// Transform data before GEMM (load, im2col, gemm, store)
// This method is suitable for convolutions with similar
// spatial input == output sizes, but can become inefficient
// if input >> output (with large strides and kernels).
LIBDNN_CONVOLUTION_BW_ALGO_IM2COL = 0,
// Transform data after GEMM (load, gemm, col2im, store)
// Sometimes faster than im2col method, but uses
// atomic operations and is not deterministic.
LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC = 1
} libdnnConvolutionBackwardAlgo_t;
I also asked @soumith to do the convnet benchmarks on it. The configuration is not autotuned for the Titan X yet, but it should give a general hint as to whether I'm on the right track or not.
Fabian, Nice! Actually, I believe your convolutions are fast. The only thing I really need is to ensure that using your convolutional implementation won't make building my own frameworks really painful :-) Actually, if you reckon it's possible for me to add your convolutions now, without much pain, one option could be to fork https://github.com/hughperkins/clnn , and show what it would look like with your convolutions slotted in. Otherwise, I guess what I'd need is for the convolution implementation to be something like:
I'm tempted to say it should accept tensors in the exact same format that Torch provides, i.e. same layout, and with similar metadata (sizes, strides, offset). But, actually, the time to rearrange the tensor is probably small compared to convolution time? E.g. I think Scott Gray dim-shuffles the tensors around anyway for Winograd, e.g. soumith/convnet-benchmarks#93 (comment). So I don't think the actual tensor layout and metadata of your library are too critical, perhaps? But it should accept
In other news, for your implicit_gemm algorithm, as far as citing it in my cltorch paper goes: how does this work in terms of GEMM? Do you have a custom GEMM implementation? Or do you use a BLAS underneath? Or something else?
@hughperkins Yeah, I can provide a standalone soon, maybe also with more tensor format input options. The implicit GEMM is loosely based on GEMM samples from here: and is also autotunable, together with additional convolution-specific parameters.
Ah, interesting. Thanks! :-)
Ok, sounds good :-) I think the standalone bit is more important than the tensor input options. Copying stuff should be really fast compared to convolution.
Sounds good :-)
Nice plan!
@hughperkins FYI, Caffe and Torch have the same data layout, no need to dimshuffle.
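Both frameworks use packed, row-major NCHW storage (assuming contiguous tensors), so, as a generic illustration rather than code from either project, the offset of element (n, c, h, w) is computed the same way in both:
// Offset of element (n, c, h, w) in a packed NCHW buffer of shape (N, C, H, W):
size_t offset = ((size_t(n) * C + c) * H + h) * W + w;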
True, but per the above, |
@hughperkins
See also SG14 (ISO C++) on memory access.
Fabian, I'm pretty sure I remember saying I don't need any specific layout or offset :-) Whatever is fastest, use that, and I'll
Things I do need, though, are things that make cross-platform builds easy, and specifically something like:
@bhack Can you provide me with your email address please? E.g. send me an invite on LinkedIn?
@hughperkins Done. |
@hughperkins
Thanks! I should check this out. Right now, I'm busy squirrelling away on neoncl, but I will take a look soon :-)
@naibaf7 Is Dtype* cl_mem or float* / double*? Because I see https://github.com/naibaf7/libdnn/blob/master/src/libdnn.cpp#L2055-L2056
@bhack It's not the prettiest variant, but the improvement I am going to use here depends on what solution will be used for virtual pointers in Caffe, unfortunately. And that depends on my collaboration partners. Tell me when tiny_cnn is ready for testing with libDNN. In that case I will help you fix the remaining bugs interfacing with OpenCL, should any such issues still exist.
Can float/double and cl_mem be used at the same time here: https://github.com/naibaf7/libdnn/blob/master/src/libdnn.cpp#L1674?
cl_mem only on an OpenCL device, and float/double pointers only on CUDA.
But is it not constrained to float or double by https://github.com/naibaf7/libdnn/blob/master/src/libdnn.cpp#L2055-L2056?
cl_mem is in fact an alias for a pointer type (_cl_mem*).
OK, so I need to cast the cl_mem to a float or double pointer before calling Forward.
Integration started here: https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/kernels/libdnn_conv2d_kernel.h. ViennaCL reports that the memory object is not valid. So if the cl_mem-to-float casting is correct, there is probably a problem with sharing the context.
I will have a look at it, maybe I can help you a bit :) |
@naibaf7 thanks for the support. Note that the context and device are created here: https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/backend_dnn.h#L87-L94 The code right now is totally inefficient, but as a proof of concept to check whether it works, it is okay.
For standalone integrations of LibDNN, refer to: https://github.com/naibaf7/libdnn. |
Hi Fabian @naibaf7, how do I plug your convolutional layer into other frameworks? What format does it expect the tensor data to be in? Can I just send in a cl_mem object in a certain format? Presumably I'd need to provide a bunch of metadata too, like dimension sizes, strides, offset? Do you have any benchmarks comparing performance on AMD hardware versus e.g. clBLAS + im2col? And for any other hardware?
If your convolutional implementations are consistently better, there seems no point in duplicating effort, and I might just switch my libraries to use yours :-) It should be easy to build though, cross-platform etc., ideally...