[MXNET-1446] Quantization: intgemm matrix multiply wrappers (#17559)
This pull request adds wrappers to the intgemm matrix multiplication library: https://github.com/kpu/intgemm . A performance comparison with DNNL (aka MKL-DNN) is at kpu/intgemm#59. The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The purpose of this issue is to add similar functionality to Sockeye: awslabs/sockeye#771 . Quantized Sockeye performance is 2.95x as fast. One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.
intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.
On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction on every multiply by exploiting the fact that most neural network parameters are near 0.
Because x86 only offers an unsigned * signed instruction and most people want signed * signed, there are two strategies one can take:
1. Add 128 to the data so it is now unsigned. But that biases the output. DNNL calculates this bias on the fly by summing the weights, then subtracts it out during the GEMM. intgemm calculates this bias in advance, which can then be subtracted from the bias term with no overhead at runtime (see the sketch below). A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
2. Emulate signed * signed by normalizing the sign bit into the second argument. This requires extra instructions in the hot loop but keeps the accumulator small, so it is less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.
Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2. Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.
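A minimal NumPy sketch of the strategy 1 bias correction (illustration only, not intgemm's API; array names and sizes are made up for the example):

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-127, 128, size=(4, 64)).astype(np.int32)  # quantized activations, signed int8 range
W = rng.integers(-127, 128, size=(64, 8)).astype(np.int32)  # quantized weights, signed int8 range

U = A + 128                            # shift activations so the unsigned * signed instruction applies
bias_correction = 128 * W.sum(axis=0)  # computed once per weight matrix, one value per output column

# (A + 128) @ W == A @ W + 128 * column_sums(W), so subtracting the precomputed
# correction recovers the signed * signed result with no extra work per call.
assert np.array_equal(U @ W - bias_correction, A @ W)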
@kpuatamazon @leezu Ever since this commit got merged, the job time for the centos-cpu CI job has increased by over 45 minutes, as can be seen from the build time trend: Build 2207 (this commit) took 107 minutes, whereas Build 2206 (the parent commit) took only 45 minutes. The increase comes from the Tests stage "Python3: CentOS 7 CPU". Looking at the logs, it is clear that the same tests are taking more time to complete after this commit. Do you think it is because of the introduced wrappers to the intgemm library?
(Before/after per-test duration logs omitted.)
Interesting. None of these tests should be running intgemm.
I don't see a corresponding change on unix-cpu: https://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/unix-cpu/job/master/buildTimeTrend where 2213 was the change (which failed embarrassingly due to a non-deterministic test that I fixed). That suggests something weird about CentOS 7.
Nor is there a corresponding change in the v1.x branch: https://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/unix-cpu/job/v1.x/buildTimeTrend around builds 97 or 103. So it seems peculiar to the master branch.
There is a difference in how I'm doing the testing: in master, @pytest.mark.parametrize sweeps over the sizes and lets the framework know I have many small tests, whereas v1.x does for loops that comprise one small test (sketched below). One hypothesis could be that the intgemm test is running in contention with these other tests, causing them to take longer, but the intgemm test doesn't run long to begin with: about 11 seconds if left alone.
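Roughly, the difference between the two test styles looks like this (illustrative placeholder only, not the actual MXNet test code):

import pytest

SIZES = [(1, 8), (8, 8), (8, 64), (64, 64)]

# master: pytest sees one small test per size and can schedule them across the -n workers
@pytest.mark.parametrize("rows,cols", SIZES)
def test_multiply_parametrized(rows, cols):
    assert rows * cols > 0  # stand-in for the real multiply-and-compare check

# v1.x: a single small test that sweeps all sizes in a for loop
def test_multiply_looped():
    for rows, cols in SIZES:
        assert rows * cols > 0  # stand-in for the real multiply-and-compare check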
I'm setting up a CentOS environment to test on but can be slow to respond because this is part-time for me.
@kpuatamazon you can reuse the CI CentOS environment via
python ci/build.py --run-only --platform centos7_cpu /work/runtime_functions.sh unittest_centos7_cpu
If you start making changes to runtime_functions.sh, you'll need to remove --run-only. If you do that, you can improve Docker caching by adding the --no-pull --cache-intermediate options to avoid pulling the CI cache and to enable the local intermediate Docker build cache.
@kpuatamazon I tried to reproduce these numbers locally on a c5.18xl EC2 instance (the same type used by CI), but did not see any regression. Below are the three main testing modules and the commands used; their before/after latency numbers on EC2 vs CI were attached as tables (omitted here).
NOT serial, NOT operator tests:
OMP_NUM_THREADS=18 python -m pytest -m 'not serial' -k 'not test_operator' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
NOT serial, operator tests:
MXNET_ENGINE_TYPE=NaiveEngine OMP_NUM_THREADS=18 python -m pytest -m 'not serial' -k 'test_operator' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
Serial, ALL tests:
python -m pytest -m serial --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
As per @mseth10's findings, it is evident that something is different in the CI setup. So I tried two different local builds, with and without the USE_INTGEMM flag, inside a CentOS CPU Docker container identical to the one used in our CI. Without the INTGEMM flag the total time taken is 13:22 minutes; with the INTGEMM flag the total time taken is 38:50 minutes. Perhaps the build with USE_INTGEMM is slowing down test runs in centos-cpu.
@leezu @kpuatamazon any thoughts as to why enabling the USE_INTGEMM flag would cause a slowdown?
Can you reproduce the slowdown while only looking at a single test? Or does the slowdown only occur when running the whole test suite?
@leezu I ran the test test_gluon.py::test_slice_pooling2d_slice_pooling2d:
with USE_INTGEMM=ON, time taken = 00:09:46
with USE_INTGEMM=OFF, time taken = 00:02:06
The results are consistent with both devtoolset-7 and devtoolset-8.
I've a few hypotheses about this. One way to narrow it down: comment out the || MXNET_USE_INTGEMM == 1 condition (i.e. make it /* || MXNET_USE_INTGEMM == 1 */). That will break the intgemm tests, but if the other ones go faster, then we know what's up; this is the only thing I've touched in core MXNet.
@access2rohit has been coaching me on getting a CI environment set up. As you may know, I'm part-time on this and expect to be in on Friday to look into it more.
I believe it's OpenMP. Let's move to #19502