Batching improvements for GEMM/TRSM operators and full MKL usage docs. #8846
Conversation
src/operator/linalg_impl.h
Outdated
linalg_check_batch_size(A.size(0), B.size(0), C.size(0)); \
check_gemm(A[0], B[0], C[0], alpha, beta, tA, tB); \
using namespace mshadow::cuda; \
int ngrid = std::min(kMaxGridNum, \
ngrid is not needed anymore
using namespace mshadow::cuda; \
int ngrid = std::min(kMaxGridNum, \
    static_cast<int>((A.size(0) + kBaseThreadNum - 1) / kBaseThreadNum)); \
linalgCollectBatchOffsetsGPU<<<ngrid, kBaseThreadNum, 0, mshadow::Stream<gpu>::GetStream(s)>>> \
Please remove the function linalgCollectBatchOffsetsGPU from the file, as it should not be needed anymore.
Done.
@@ -1,3 +1,21 @@
# Full MKL Installation |
You should mention the purpose of doing so, i.e. that this will enable MKL for all operators in the linalg-namespace.
What about this piece of code (I guess it is still in the config):
ifeq ($(USE_BLAS), mkl)
USE_LAPACK = 0
endif
Guess this has to be changed as well.
And unfortunately this does not work exactly as planned. With the suggested settings, a user would get MKL for blas/lapack, but at the same time setting USE_MKL2017=0 would internally switch off the use of MKLML for a lot of NN operators. Setting USE_MKL2017=0 was just a shortcut for our experiments with the linalg operators.
Ideally, the mechanism should work such that the user just sets USE_BLAS=mkl and that is it. In addition, they can set USE_MKL2017=1 and then some other operators will also start using MKL's NN functions.
Okay, I believe I changed the config file to adhere to the behavior we want.
I see no config changes now except the removal of the piece of code above. Is this enough? I.e., have you verified that specifically the variant
USE_BLAS=mkl
USE_MKL2017=1
works?
@@ -103,8 +103,8 @@ void linalg_batch_gemm<cpu, DType>(const Tensor<cpu, 3, DType>& A, const Tensor<
LINALG_CPU_GEMM(sgemm, float)
LINALG_CPU_GEMM(dgemm, double)

LINALG_CPU_BATCH_GEMM(float)
LINALG_CPU_BATCH_GEMM(double)
LINALG_XPU_BATCH_GEMM(cpu, float)
This causes an error in the amalgamation build, since there MSHADOW_USE_CBLAS==0 && MSHADOW_USE_MKL==0. As a consequence, the dummy stubs starting at line 84 will be generated, but they are named "CPU_GEMM" rather than "XPU_GEMM". So you may have to rename these dummy stubs so that a call to LINALG_XPU_BATCH_GEMM(cpu, ...) correctly generates them instead.
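For illustration only, a minimal sketch of how such a renamed stub could look, modeled on the existing dummy CPU stubs in the file; the argument list is inferred from the linalg_batch_gemm signature above and may differ from the actual PR code:

```cpp
// Hypothetical fallback stub (not the PR's exact code): when neither CBLAS nor
// MKL is available, generate the dummy under the XPU_BATCH name so that
// LINALG_XPU_BATCH_GEMM(cpu, float) still expands to a valid definition that
// simply aborts at runtime.
#define LINALG_XPU_BATCH_GEMM(xpu, DType) \
template<> inline \
void linalg_batch_gemm<xpu, DType>(const Tensor<xpu, 3, DType>& A, \
                                   const Tensor<xpu, 3, DType>& B, \
                                   const Tensor<xpu, 3, DType>& C, \
                                   DType alpha, DType beta, \
                                   bool tA, bool tB, Stream<xpu>* s) { \
  LOG(FATAL) << "linalg_batch_gemm requires CBLAS or MKL."; \
}
```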
Changed, let's see if it builds correctly.
@asmushetzel Any updates?
Apologies for the excess merge commits; I'm not sure how to clean those up.
Force-pushed from af13b04 to 2f2b48b
Force-pushed from 5612ba8 to 35e70d2
Hey, this was rebased against master just before pushing. I'm not sure why the pr-merge isn't building correctly; is this alright?
The pr-merge error seems fishy, i.e. not related to your changes. There should never be any non-checked-in changes in a build.
Force-pushed from 2336e7d to 72826a7
MKL_README.md
Outdated
1.1 Set ADD_LDFLAGS=-L<path/to/mkl/lib/folder> (ex. ADD_LDFLAGS=-L/opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/lib)
1.1 Set ADD_CFLAGS=-L<path/to/mkl/include/folder> (ex. ADD_CFLAGS=-L/opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/include)
Guess you mean the "-I" flag.
The examples I have there are what I ran successfully, using "-L" not "-l". I'm not sure exactly which things would need to be included if I used "-l".
Force-pushed from ee67aa9 to 8214cd9
Hmm, the pr-head build failed with: docker: Error response from daemon: create nvidia_driver_384.111: VolumeDriver.Create: internal error, check logs for details. I don't think that has to do with my changes, but is there anything I can do to help find/fix whatever the issue is? Apart from that failure, once this passes I think these changes are good to go in.
* Changed GEMM operator to use gemmStridedBatch CUDA implementation when CUDA is version 8 or higher, otherwise to just do batching manually.
* Changed TRSM operator to not use the CUDA batching functionality as it's slower for large matrices. Instead do batching manually.
* Added instructions for using a full MKL installation instead of just MKL2017
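For context, a hedged sketch of the "do batching manually" idea for TRSM at the cuBLAS level; the handle plumbing, triangle/side settings, and packed layout below are assumptions for illustration, not the exact operator code:

```cpp
#include <cublas_v2.h>

// Manual TRSM batching sketch: for the large matrices targeted here, a plain
// loop of cublasStrsm calls was observed to beat the batched variant
// (cublasStrsmBatched). A holds `batch` packed m x m triangular matrices,
// B holds `batch` packed m x n right-hand sides and is solved in place.
void batch_trsm_sketch(cublasHandle_t handle, int m, int n,
                       const float* A, float* B, int batch, float alpha) {
  for (int i = 0; i < batch; ++i) {
    cublasStrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, m, n, &alpha,
                A + static_cast<size_t>(i) * m * m, m,
                B + static_cast<size_t>(i) * m * n, m);
  }
}
```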
Force-pushed from 8214cd9 to 94d477e
@marcoabreu Marco, this seems like a CI problem. Any advice on how Eric can get this going again?
Looks like something else went wrong this time; the build has been hanging for 18 hours now on most of the GPU builds. Is there an easy way to restart/rerun the CI tests?
The nvidia_driver_384 issue was due to a security patch for Spectre. We reinstalled all slaves yesterday evening and it should be working now. To trigger a new build, just make a new commit. The old build will time out after 24h.
Cool, thanks Marco. I thought it might be related to the Spectre change. Just pushed up a new commit; hopefully it doesn't get the same hang this time.
IMO, this can be integrated now.
Description
During some benchmarking I discovered that the CUDA-internal batching implementations for the trsm and gemm operators were slow for large matrices. For gemm, I found that the gemmStridedBatched implementation is faster at all matrix sizes, so we should use it when possible (CUDA 8+). Otherwise, since most use cases for these operators involve relatively large matrices, we use a simple for loop over the batch instead of the dedicated batched CUDA implementation.
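As a rough illustration of the dispatch described above (a hedged sketch at the cuBLAS level, not the operator's exact code; the dense packed layout and column-major leading dimensions are assumptions):

```cpp
#include <cuda.h>        // CUDA_VERSION
#include <cublas_v2.h>

// Multiply `batch` pairs of packed column-major matrices:
// C[i] = alpha * A[i] * B[i] + beta * C[i], with A[i] m x k, B[i] k x n, C[i] m x n.
void batch_gemm_sketch(cublasHandle_t handle, int m, int n, int k,
                       const float* A, const float* B, float* C,
                       int batch, float alpha, float beta) {
#if CUDA_VERSION >= 8000
  // CUDA 8+: one strided-batched call covers the whole batch; since the
  // matrices are densely packed, the stride is each matrix's element count.
  cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                            A, m, static_cast<long long>(m) * k,
                            B, k, static_cast<long long>(k) * n, &beta,
                            C, m, static_cast<long long>(m) * n, batch);
#else
  // Older CUDA: fall back to a simple loop of plain GEMM calls, which is
  // competitive with the batched API for the large matrices targeted here.
  for (int i = 0; i < batch; ++i) {
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                A + static_cast<size_t>(i) * m * k, m,
                B + static_cast<size_t>(i) * k * n, k, &beta,
                C + static_cast<size_t>(i) * m * n, m);
  }
#endif
}
```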
Also added instructions for how to compile with a full MKL installation instead of just the MKL2017 subset.
Checklist
Essentials
make lint
Changes
Comments
Reviewers