RFC: Optimize calculation of dot product for AVX #954
Conversation
On an Intel Xeon server the new code is significantly faster. The results are less impressive on a notebook with a Core i7, where it is also more difficult to reproduce a timing test. That's why I suggest independent tests by other people.
I gave it a try with eurotext.tif with -l eng and san001.tif with -l hin. I did not find much difference between the "optimized code" and the code from the earlier commit, though there are variations in different runs of each from the command line. See the attached log files with numbers. As has been previously noted, the legacy engine is faster for -l eng and the LSTM engine is faster for -l hin.
Er, what's with the low-level SSE/AVX programming instead of trusting tools like …? BTW, how is alignment at 32-byte boundaries achieved, and has this been verified? I was blinded by all the other uses of "align", so I didn't find any uses relating to memory alignment...
As far as I know, Tesseract currently does not assume 32-byte alignment, but tests the alignment at execution time and chooses different code paths for aligned and unaligned arrays of double values.
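For readers less familiar with the intrinsics involved, a minimal sketch of that pattern, assuming AVX is available; this is illustrative only and not the code from this PR:

```cpp
#include <immintrin.h>
#include <cstdint>

// Inner kernel: the same loop is used for both code paths, only the load
// instruction differs depending on whether both arrays are 32-byte aligned.
static double DotProductAVXImpl(const double* u, const double* v, int n,
                                bool aligned) {
  __m256d sum = _mm256_setzero_pd();
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    __m256d a = aligned ? _mm256_load_pd(u + i) : _mm256_loadu_pd(u + i);
    __m256d b = aligned ? _mm256_load_pd(v + i) : _mm256_loadu_pd(v + i);
    sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));
  }
  alignas(32) double lanes[4];
  _mm256_store_pd(lanes, sum);
  double result = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < n; ++i) result += u[i] * v[i];  // scalar tail
  return result;
}

double DotProductAVX(const double* u, const double* v, int n) {
  // Test both pointers at run time and pick the aligned or unaligned path.
  const bool both_aligned = reinterpret_cast<std::uintptr_t>(u) % 32 == 0 &&
                            reinterpret_cast<std::uintptr_t>(v) % 32 == 0;
  return DotProductAVXImpl(u, v, n, both_aligned);
}
```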
Just a note: …
@stweil Has anybody verified that this differentiation actually does some good? If you leave it up to chance, I don't think the odds are in your favour, especially since both arguments have to be aligned at the same time: the current code does not choose load instructions independently for each parameter. So if it does make a difference, you had better make sure it is not just a lucky once in a while. One parameter aligned is actually the worst case, because you don't necessarily have to start at the beginning of a vector for the SIMD instructions, so what matters is whether the arguments are mutually aligned or not. @amitdo I don't think that the laggard should set the pace, considering how relatively easy the pragma is to use.
@rfschtkt, it looks like one parameter is (mostly?) aligned while the other parameter is only aligned in one of four steps (it increases in steps of sizeof(double) == 8). As far as I can see, the code could also be modified to handle one aligned parameter with the other parameter unaligned, as sketched below. I'm still not sure how much faster the aligned access is. As reported by @Shreeshrii, some CPU models don't show increased speed while others show significantly faster execution. Memory caches also play an important role. That's why I want to offer several implementations in Tesseract, so users can choose which one is best for their environment.
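A hedged sketch of that modification, again assuming AVX; it differs from the sketch above only in testing each pointer separately, and is not the code from this PR:

```cpp
#include <immintrin.h>
#include <cstdint>

double DotProductMixedAlignment(const double* u, const double* v, int n) {
  // Decide once per argument whether it is 32-byte aligned.
  const bool u_aligned = reinterpret_cast<std::uintptr_t>(u) % 32 == 0;
  const bool v_aligned = reinterpret_cast<std::uintptr_t>(v) % 32 == 0;
  __m256d sum = _mm256_setzero_pd();
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    // Pick the aligned or unaligned load independently for each argument.
    __m256d a = u_aligned ? _mm256_load_pd(u + i) : _mm256_loadu_pd(u + i);
    __m256d b = v_aligned ? _mm256_load_pd(v + i) : _mm256_loadu_pd(v + i);
    sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));
  }
  alignas(32) double lanes[4];
  _mm256_store_pd(lanes, sum);
  double result = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < n; ++i) result += u[i] * v[i];  // scalar tail
  return result;
}
```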
Do you have any evidence that such differentiation might be useful beyond the purpose of experimentation? I would expect cache-oblivious code and techniques like blocking/strip-mining to benefit all architectures more or less equally. It doesn't seem very efficient to try to reinvent the wheel here, but I haven't yet looked into issues like the licensing associated with using an external library with Tesseract, so...
See https://github.com/RRZE-HPC/DDOT-Bench and the associated article for more research on this. They also have code which can be used freely.
How about #983? The idea is to rely primarily on OpenMP 4.0's simd where available, then try explicit code for specific implementations, then fall back to serial execution. If you're confident that you have a better implementation than OpenMP, this order can of course be changed, but I don't think this is the case yet.
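For reference, the OpenMP approach amounts to a single pragma; a minimal sketch (not the actual code from #983):

```cpp
// Let the compiler vectorize the reduction; no intrinsics and no explicit
// alignment handling needed. Without OpenMP support the pragma is ignored
// and the loop simply runs serially.
double DotProductSimd(const double* u, const double* v, int n) {
  double sum = 0.0;
#pragma omp simd reduction(+ : sum)
  for (int i = 0; i < n; ++i) {
    sum += u[i] * v[i];
  }
  return sum;
}
```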
Could you share some details on how you test performance here? Could this be reframed as a unit test that immediately rejects a suboptimal implementation choice? My conjecture is that OpenMP wipes the floor with (relatively speaking) naive use of SSE/AVX intrinsics, and that you only need the latter if you are stuck with Visual C++ (for now).
Typically I use these test scenarios for the OCR process (…).
Recently I added a test scenario for training. It is based on issue #961 provided by @Shreeshrii. I also started writing a standalone test to measure the effects of cache size, but that is still unfinished.
What is the status of this PR? At least the merge conflicts need to be resolved....
The conflicts are resolved now. I suggest keeping the PR open until more people have reported their timing test results.
Another test – Haswell i5 3.2 GHz with AVX2 on macOS High Sierra. Tesseract compiled with clang. Without this PR:
With this PR:
The output text files were identical. So no improvement, unfortunately. The training data was the "fast" files.
That confirms my previous results and also those reported by @Shreeshrii. My new code obviously only improves the performance on server CPUs (Intel Xeon). Cache sizes and memory bandwidth might differ for such CPUs.
So, should I merge this PR?
No, I still want to get more timing results. Then I plan to send a patch which allows users to choose from several methods of calculating the dot product. I'll add RFC to the title of the pull request here to make the status clearer.
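Such a user-selectable dot product could look roughly like the following sketch; the function names, the "auto"/"avx" strings and the placeholder bodies are illustrative assumptions, not Tesseract's actual interface:

```cpp
#include <string>

typedef double (*DotProductFn)(const double* u, const double* v, int n);

// Stand-in implementations; in the real code these would be the generic C++
// version and the SSE/AVX versions.
static double DotProductGeneric(const double* u, const double* v, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += u[i] * v[i];
  return sum;
}

static double DotProductAVX(const double* u, const double* v, int n) {
  return DotProductGeneric(u, v, n);  // placeholder body for this sketch
}

// Pick an implementation from a user setting, falling back to the generic
// code when the requested one is unavailable on the current CPU.
static DotProductFn ChooseDotProduct(const std::string& requested,
                                     bool have_avx) {
  if ((requested == "avx" || requested == "auto") && have_avx) {
    return DotProductAVX;
  }
  return DotProductGeneric;
}
```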
This improves readability and reduces the code size from 317 to 288 bytes. Signed-off-by: Stefan Weil <[email protected]>
Fix this gcc warning:
arch/dotproductavx.cpp:93:38: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
The new code no longer needs conditional compilation. Signed-off-by: Stefan Weil <[email protected]>
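For context, this warning typically comes from reading the lanes of a __m256d through a reinterpret-cast pointer; a hedged sketch of the kind of change that avoids it (not necessarily the exact change in this commit):

```cpp
#include <immintrin.h>

static double HorizontalSum(__m256d v) {
  // Type-punned variant that triggers -Wstrict-aliasing:
  //   return ((double*)&v)[0] + ((double*)&v)[1] + ...;
  // Storing the vector into a plain array instead is well defined.
  alignas(32) double lanes[4];
  _mm256_store_pd(lanes, v);
  return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```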
This improves the performance significantly. Signed-off-by: Stefan Weil <[email protected]>
@stweil: what should we do with this PR regarding the 4.0 release?
That depends on the planned release date. I still want to allow the user to select the code used for the calculation of the dot product, but I won't finish that in the next two weeks, so I am afraid it will have to wait for a 4.1 release.
This pull request is now obsolete, as newer code gives much better results with AVX and the best models.
The same test on a virtual machine with AVX shows no improvement from the code in PR #954, but a slightly faster result with the new optimized dot product.
Test results on a MacBook Pro also show only a small improvement with the new dot product code.
@stweil Please add the training test to the …