Conversation
Please fix the test_numpy_op.test_np_transpose test.
Can you add more description of how you are achieving the faster transpose?
Here numpy transpose calls the same TransposeImpl function that I have already handled; see the function definition.
Will the GPU side be accelerated?
@sxjscience Not yet! GPU work will follow once the CPU side is accelerated.
@ChaiBapchya From your performance table, I also noticed that int32 has performance very close to int64 (if I am right, int64 here means using large-tensor support) for the non-MKL 2D transpose op in the master repo. Is there any performance regression with 3D or 4D input shapes?
Yes. @wuxun-zhang
Yes, please check why your code broke their unit test.
Force-pushed from d79cb87 to 0f28c7d
Can you add test cases and update the benchmark, since the previous error was not caught?
Added the test case to catch the previous bug. @zhreshold @sxjscience
@ChaiBapchya Also paste the results of the new unittest run here.
What is the performance for input sizes that are not divisible by the block size?
Script used - https://gist.github.com/ChaiBapchya/65a41eccf6c09a98b5a5012a211c820e
@sxjscience 10000x10000 is not divisible by 32. Is there any specific input shape that you want benchmarked?
@anirudh2290 Can you merge this PR?
```diff
@@ -2840,6 +2840,13 @@ def test_transpose():
     assert_allclose(np.transpose(x.asnumpy()), y.asnumpy())


@with_seed()
def test_larger_transpose():
```
Why does this say larger_transpose? Is it a bug with specific sizes?
@ChaiBapchya I think we should merge the test with the current tests of transpose.
The current test_transpose uses sizes smaller than the block size.
This PR uses a 32x32 block size (and the edge case didn't get exercised by the existing set of tests).
Hence this test was added to catch it.
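For reference, a minimal sketch of what such an edge-case test could look like (illustrative only; the test name and shape here are hypothetical, not necessarily the exact test added in this PR):

```python
import numpy as np
import mxnet as mx
from numpy.testing import assert_allclose

def test_transpose_larger_than_block():
    # Hypothetical shape: larger than (and not a multiple of) the 32x32 block,
    # so both the full-tile path and the peeled edge loops get exercised.
    x = mx.nd.random.uniform(shape=(50, 51))
    y = mx.nd.transpose(x)
    assert_allclose(np.transpose(x.asnumpy()), y.asnumpy())
```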
Can we add this to the current tests of transpose as @sxjscience suggests? Otherwise LGTM.
Thanks for the contribution! Merging to avoid another CI round. Please merge the two transpose tests into one in your next PR.
* 2d transpose naive
* omp pragma
* omp pragma unroll
* blocksize
* make it 2d tile
* loop peeling
* better loop peeling
* redundancy
* removed bool
* removing excess for loops, memory save
* fix internal forloop
* remove commented code, lint fix
* Trigger notification
* explain params, indent fix, explain blocksize
* fix p,n and reduce for loop computation j+a,i+b
* kernel
* gpu thread 1
* remove gpu implementation
* fix internal for loop
* unittest to catch the previous error
* optimizations
* microsoft cpp doesn't support omp collapse
Description
Faster 2D transpose
How to achieve this
The existing implementation leverages mshadow's expression templates.
However, we found that this is unnecessary and the operation can be optimized.
The principle behind this implementation is to derive maximum utilization from the L1 cache. Most L1 caches have a size >= 32 KB, so the aim is to transpose the 2D matrix on a per-block basis.
Block size calculation
Total size (to be utilized) = 32 KB (2^15 bytes)
Largest size of a single element of any dtype <= 8 bytes (2^3)
Number of elements = 2^12 (2^15 / 2^3)
Block size = 2^6 x 2^6 (64 x 64)
But we can also leverage unrolling of the for loops (for parallelization):
Block size = 2^5 x 2^5 (32 x 32), with 4 pragma-unrolled for loops
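A conceptual Python/NumPy sketch of the tiling idea (the actual kernel is C++ with OpenMP and pragma-unrolled loops; this sketch only illustrates the per-block traversal and the handling of edges whose sizes are not multiples of the block size):

```python
import numpy as np

def blocked_transpose(src, blocksize=32):
    """Conceptual sketch of the tiled 2D transpose.

    Each blocksize x blocksize tile is transposed while it stays resident in
    L1 cache; the min() bounds handle partial tiles at the right/bottom edges
    (shapes not divisible by the block size).
    """
    m, n = src.shape
    dst = np.empty((n, m), dtype=src.dtype)
    for i in range(0, m, blocksize):
        for j in range(0, n, blocksize):
            i_end = min(i + blocksize, m)
            j_end = min(j + blocksize, n)
            # The real kernel writes these elements with unrolled scalar loops.
            dst[j:j_end, i:i_end] = src[i:i_end, j:j_end].T
    return dst
```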
Checklist
Essentials
Changes
Flags used
[✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]
Performance
Tested with DEBUG off (for consistency with the release pip package)
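The full benchmark script is in the gist linked above; a minimal sketch of the timing approach (assuming a default-context NDArray and forcing synchronization with wait_to_read) is:

```python
import time
import mxnet as mx

# Illustrative timing of a single 2D transpose; shape matches the discussion above.
x = mx.nd.random.uniform(shape=(10000, 10000))
x.wait_to_read()   # make sure the input is materialized before timing
start = time.time()
y = mx.nd.transpose(x)
y.wait_to_read()   # block until the asynchronous op finishes
print("2D transpose took {:.4f} s".format(time.time() - start))
```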