
RFC: Optimize calculation of dot product for AVX #954

Closed
stweil wants to merge 3 commits from the dotproduct branch

Conversation

@stweil (Member) commented May 26, 2017

No description provided.

@stweil (Member Author) commented May 26, 2017

On an Intel Xeon server the new code is significantly faster.

The results are less impressive on a notebook with a Core i7, where it is also more difficult to reproduce a timing test. That's why I suggest independent tests by other people.

@Shreeshrii (Collaborator)

AppVeyor build failed

@Shreeshrii (Collaborator) commented May 27, 2017

I gave it a try with eurotext.tif with -l eng and san001.tif with -l hin. I did not find much difference between the 'optimized code' and the code from the earlier commit, though there are variations between different runs of each from the command line.

See attached log files with numbers.
test-san001.txt
test-eurotext.txt

As has been previously noted, the legacy engine is faster for -l eng and the LSTM engine is faster for -l hin.

@rfschtkt (Contributor) commented Jun 6, 2017

Er, what's with the low-level SSE/AVX programming instead of trusting tools like #pragma omp simd and even optimised higher-level libraries, unless it is to leave ample opportunity for further optimisation?

BTW, how is alignment at 32-byte boundaries achieved, and has this been verified? I was blinded by all the other uses of "align", so I didn't find any uses relating to memory alignment...
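
For reference, the portable alternative mentioned here would look roughly like the following. This is a minimal sketch, not code from this PR or from Tesseract, and it needs to be compiled with OpenMP support (e.g. -fopenmp or -fopenmp-simd with gcc/clang).

// Minimal sketch of a dot product vectorized with OpenMP 4.0 SIMD
// directives instead of hand-written SSE/AVX intrinsics. Illustrative
// only; the compiler generates vector code for the target architecture.
double DotProductOMP(const double* u, const double* v, int n) {
  double total = 0.0;
#pragma omp simd reduction(+ : total)
  for (int k = 0; k < n; ++k) {
    total += u[k] * v[k];
  }
  return total;
}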

@stweil (Member Author) commented Jun 6, 2017

As far as I know, #pragma omp simd is not restricted to Intel CPUs, so it could also improve the generated code for ARM and other architectures.

Tesseract currently does not assume 32-byte alignment, but tests the alignment at execution time and chooses different code paths for aligned and unaligned arrays of double values.
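
A minimal sketch of that kind of runtime dispatch, assuming for brevity that n is a multiple of four (illustrative only; the PR's actual code differs in detail):

#include <immintrin.h>
#include <cstdint>

// Sketch: choose aligned or unaligned AVX loads at run time, as described
// above. One __m256d register holds four doubles, hence the 32-byte check.
static double DotProductAVXSketch(const double* u, const double* v, int n) {
  const bool aligned = reinterpret_cast<std::uintptr_t>(u) % 32 == 0 &&
                       reinterpret_cast<std::uintptr_t>(v) % 32 == 0;
  __m256d sum = _mm256_setzero_pd();
  for (int k = 0; k < n; k += 4) {
    // Aligned loads may be faster on some CPUs; unaligned loads always work.
    __m256d a = aligned ? _mm256_load_pd(u + k) : _mm256_loadu_pd(u + k);
    __m256d b = aligned ? _mm256_load_pd(v + k) : _mm256_loadu_pd(v + k);
    sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));
  }
  double lanes[4];
  _mm256_storeu_pd(lanes, sum);  // horizontal sum of the four partial sums
  return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}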

@amitdo (Collaborator) commented Jun 6, 2017

Just a note: #pragma omp simd is not supported with MSVC.

@rfschtkt (Contributor) commented Jun 6, 2017

@stweil Has anybody verified that this differentiation actually does some good? If you leave alignment up to chance, I don't think the odds are in your favour, especially since both arguments have to be aligned at the same time (the current code does not choose load instructions independently for each parameter). So if it does make a difference, you had better make sure it happens more than just once in a while when you're lucky. Only one parameter being aligned is actually the worst case: the SIMD instructions don't have to start at the beginning of a vector, so what matters is whether the arguments are mutually aligned or not.
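
In code, the mutual-alignment test described above would amount to something like this hypothetical check (u and v being the two input arrays):

#include <cstdint>

// Hypothetical sketch: neither array needs to start on a 32-byte boundary
// by itself. If both share the same offset modulo 32, a few scalar
// iterations can bring them to an aligned position simultaneously.
bool MutuallyAligned(const double* u, const double* v) {
  return reinterpret_cast<std::uintptr_t>(u) % 32 ==
         reinterpret_cast<std::uintptr_t>(v) % 32;
}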

@amitdo I don't think that the laggard should set the pace, considering how relatively easy the pragma is to use.

@stweil (Member Author) commented Jun 6, 2017

@rfschtkt, it looks like one parameter is (mostly?) aligned, while the other parameter is aligned in only one of four steps (it advances in steps of sizeof(double) == 8 bytes). As far as I can see, the code could also be modified to handle one aligned parameter with the other one unaligned.

I'm still not sure how much faster the aligned access is. As reported by @Shreeshrii, some CPU models show no speed increase, while others show significantly faster execution. Memory caches also play an important role.

That's why I want to offer several implementations in Tesseract, so users can choose whichever one is best for their environment.

@rfschtkt (Contributor) commented Jun 6, 2017

Do you have any evidence that such differentiation might be useful beyond the purpose of experimentation? I would expect cache-oblivious code and techniques like blocking/strip-mining to benefit all architectures more or less equally. It doesn't seem very efficient to reinvent the wheel here, but I haven't yet looked into issues such as the licensing implications of using an external library with Tesseract, so...

@stweil (Member Author) commented Jun 6, 2017

See https://github.com/RRZE-HPC/DDOT-Bench and the associated article for more research on this. They also have code which can be used freely.

@rfschtkt (Contributor) commented Jun 7, 2017

How about #983? The idea is to rely primarily on OpenMP 4.0's simd where available, then to try explicit code for specific implementations, and finally to fall back to serial execution. If you're confident that you have a better implementation than OpenMP, this order can of course be changed, but I don't think that is the case yet.
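
A rough sketch of that selection order, reusing the hypothetical helpers from the sketches above (my assumptions, not the actual code from #983):

// Rough sketch of the proposed priority: OpenMP simd where available,
// explicit intrinsics next, plain serial code as the last resort.
double DotProduct(const double* u, const double* v, int n) {
#if defined(_OPENMP) && _OPENMP >= 201307  // OpenMP >= 4.0 provides 'simd'
  return DotProductOMP(u, v, n);
#elif defined(__AVX__)
  return DotProductAVXSketch(u, v, n);
#else
  double total = 0.0;
  for (int k = 0; k < n; ++k) total += u[k] * v[k];
  return total;
#endif
}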

@rfschtkt (Contributor) commented Jun 9, 2017

Could you share some details on how you test performance here? Could this be reframed as a unit test that immediately rejects a suboptimal implementation choice? My conjecture is that OpenMP wipes the floor with (relatively speaking) naive use of SSE/AVX intrinsics, and that you only need the latter if you are stuck with Visual C++ (for now).

@stweil (Member Author) commented Jun 9, 2017

Typically I use these test scenarios for the OCR process (tesseract executable; all images are available at https://digi.bib.uni-mannheim.de/~stweil/tesseract/):

# OCR of a very large image (performance test with focus on OCR).
tesseract 0604.jp2 /tmp/0604-eng -l eng
tesseract 0604.jp2 /tmp/0604-frk -l frk
# OCR of a very small image (Valgrind, performance test with focus on pre OCR steps).
tesseract hello.png /tmp/hello

Recently I added a test scenario for training. It is based on issue #961, reported by @Shreeshrii.

I started writing a standalone test to measure the effects of cache size, but it is still unfinished.

@zdenop (Contributor) commented Sep 11, 2017

What is the status of this PR? At the very least, the conflicts need to be resolved...

@stweil (Member Author) commented Sep 11, 2017

The conflicts are resolved now. I suggest keeping the PR open until more people have reported their timing test results.

@jbarlow83 commented Feb 23, 2018

Another test: Haswell i5 3.2 GHz with AVX2 on macOS High Sierra, Tesseract compiled with clang.

Without this PR:

$ time api/tesseract --tessdata-dir ./tessdata pr954/0604.jp2 /tmp/0604-eng.nopatch -l eng
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
       76.96 real        71.53 user         1.15 sys

$ time api/tesseract --tessdata-dir ./tessdata pr954/0604.jp2 /tmp/0604-frk.nopatch -l frk
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
       96.80 real        92.14 user         1.09 sys

With this PR:

$ time api/tesseract --tessdata-dir ./tessdata pr954/0604.jp2 /tmp/0604-eng -l eng
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
       74.31 real        70.25 user         1.06 sys

$ time api/tesseract --tessdata-dir ./tessdata pr954/0604.jp2 /tmp/0604-frk -l frk
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
       99.83 real        92.59 user         1.25 sys

The output text files were identical.

So no improvement, unfortunately.

The traineddata files used were the "fast" ones.

@stweil (Member Author) commented Feb 24, 2018

That confirms my previous results and also those reported by @Shreeshrii. My new code evidently improves performance only on server CPUs (Intel Xeon). Cache sizes and memory bandwidth might differ on such CPUs.

@zdenop (Contributor) commented Feb 24, 2018

So, should I merge this PR?

@stweil (Member Author) commented Feb 24, 2018

No, I still want to get more timing results. Then I plan to send a patch which allows users to choose from several methods of calculating the dot product. I'll add "RFC" to the title of this pull request to make its status clearer.

@stweil changed the title from "Optimize calculation of dot product for AVX" to "RFC: Optimize calculation of dot product for AVX" on Feb 24, 2018
stweil added 3 commits May 3, 2018 16:54
This improves readability and reduces the code size from 317 to 288 bytes.

Signed-off-by: Stefan Weil <[email protected]>
Fix this gcc warning:

arch/dotproductavx.cpp:93:38: warning:
 dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]

The new code no longer needs conditional compilation.

Signed-off-by: Stefan Weil <[email protected]>
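
The usual cure for that warning is to copy the register contents out with memcpy or an intrinsic store instead of dereferencing a casted pointer. A hypothetical sketch, not the actual commit:

#include <cstring>
#include <immintrin.h>

// Hypothetical sketch: read the low lane of a SIMD register without a
// type-punned pointer cast. memcpy is well-defined under the aliasing
// rules, and compilers optimize it to a plain register move.
double LowLane(__m128d v) {
  double d;
  std::memcpy(&d, &v, sizeof(d));
  return d;
}
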
This improves the performance significantly.

Signed-off-by: Stefan Weil <[email protected]>
@zdenop (Contributor) commented Oct 13, 2018

@stweil: what should we do with this PR regarding the 4.0 release?

@stweil (Member Author) commented Oct 13, 2018

That depends on the planned release date. I still want to allow the user to select the code used for calculating the dot product, but I won't finish that within the next two weeks, so I am afraid it will have to wait for a 4.1 release.

@zdenop added this to the 4.1.0 milestone Oct 13, 2018
@stweil (Member Author) commented Feb 20, 2019

This pull request is now obsolete, as newer code gives much better results with AVX and the best models.

# Git master.
stweil@ub-backup:~/src/github/tesseract-ocr/tesseract$ time -p bin/ndebug/x86_64-linux-gnu/src/api/tesseract /home/stweil/src/github/tesseract-ocr/tesseract/issues/837/0604.jp2 /tmp/out -l best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-19-g3da8 with Leptonica
real 243.19
user 242.28
sys 0.86

# Code from PR #954.
stweil@ub-backup:~/src/github/tesseract-ocr/tesseract$ time -p bin/ndebug/x86_64-linux-gnu/src/api/tesseract /home/stweil/src/github/tesseract-ocr/tesseract/issues/837/0604.jp2 /tmp/out -l best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-19-g3da8 with Leptonica
real 229.33
user 228.68
sys 0.64

# New optimized dot product for AVX double.
stweil@ub-backup:~/src/github/tesseract-ocr/tesseract$ time -p bin/ndebug/x86_64-linux-gnu/src/api/tesseract /home/stweil/src/github/tesseract-ocr/tesseract/issues/837/0604.jp2 /tmp/out -l best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-19-g3da8 with Leptonica
real 180.58
user 182.01
sys 1.45

# Float dot product.
stweil@ub-backup:~/src/github/tesseract-ocr/tesseract$ time -p bin/ndebug/x86_64-linux-gnu/src/api/tesseract /home/stweil/src/github/tesseract-ocr/tesseract/issues/837/0604.jp2 /tmp/out -l best/eng -c dotproduct_kahan_float_mode=1
Tesseract Open Source OCR Engine v4.1.0-rc1-19-g3da8 with Leptonica
real 138.02
user 137.13
sys 0.88

# Integer dot product with fast model.
stweil@ub-backup:~/src/github/tesseract-ocr/tesseract$ time -p bin/ndebug/x86_64-linux-gnu/src/api/tesseract /home/stweil/src/github/tesseract-ocr/tesseract/issues/837/0604.jp2 /tmp/out -l fast/eng                                 
Tesseract Open Source OCR Engine v4.1.0-rc1-19-g3da8 with Leptonica
real 68.83
user 67.99
sys 0.83

@stweil closed this Feb 20, 2019
@stweil (Member Author) commented Feb 20, 2019

The same test on a virtual machine with AVX shows no improvement from the code in PR #954, but a slightly faster result with the new optimized dot product.

# Git master (4 threads).
debian@development:~/src/github/tesseract-ocr/tesseract$ time -p bin/ndebug/x86_64-linux-gnu/src/api/tesseract issue/837/0604.jp2 /tmp/out -l tessdata_best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-7-gb3bd with Leptonica
real 133.55
user 408.48
sys 2.86

# Git master.
debian@development:~/src/github/tesseract-ocr/tesseract$ time -p bin/ndebug/x86_64-linux-gnu/src/api/tesseract issue/837/0604.jp2 /tmp/out -l tessdata_best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-7-gb3bd with Leptonica
real 332.21
user 331.27
sys 0.91

# Code from PR #954.
debian@development:~/src/github/tesseract-ocr/tesseract$ time bin/ndebug/x86_64-linux-gnu/src/api/tesseract issue/837/0604.jp2 /tmp/out2 -l tessdata_best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-7-gb3bd with Leptonica
real 334.51
user 333.58
sys  0.91

# New optimized dot product for AVX double.
debian@development:~/src/github/tesseract-ocr/tesseract$ time -p bin/ndebug/x86_64-linux-gnu/src/api/tesseract issue/837/0604.jp2 /tmp/out3 -l tessdata_best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-7-gb3bd with Leptonica
real 326.12
user 325.13
sys 0.98

@stweil (Member Author) commented Feb 20, 2019

Test results on a MacBook Pro also show only a small improvement with the new dot product code.

# Git master.
macbook-pro:tesseract stweil$ time -p bin/ndebug/native/src/api/tesseract issue/837/0604.jp2 /tmp/out -l tessdata_best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-7-gb3bd with Leptonica
real 268.74
user 266.67
sys 1.23

# New optimized dot product for AVX double.
edv-macbook-pro:tesseract stweil$ time -p bin/ndebug/native/src/api/tesseract issue/837/0604.jp2 /tmp/out -l tessdata_best/eng
Tesseract Open Source OCR Engine v4.1.0-rc1-7-gb3bd with Leptonica
real 262.81
user 261.20
sys 1.09

@Shreeshrii (Collaborator)

> Recently I added a test scenario for training. It is based on issue #961, reported by @Shreeshrii.
>
> I started writing a standalone test to measure the effects of cache size, but it is still unfinished.

@stweil Please add the training test to the unittests so that the script is easy to find. Thanks.

@stweil deleted the dotproduct branch on February 21, 2019
@amitdo added the RFC label on Mar 21, 2021