RFC: Optimize calculation of dot product for AVX #954
Conversation
On an Intel Xeon server the new code is significantly faster. The results are less impressive on a notebook with a Core i7, where it is also more difficult to reproduce a timing test. That's why I suggest independent tests by other people.
I gave it a try with eurotext.tif with -l eng and san001.tif with -l hin. I did not find much difference between the "optimized code" and the code from the earlier commit, though there are variations in different runs of each from the command line. See the attached log files with numbers. As has been previously noted, the legacy engine is faster for -l eng and the LSTM engine is faster for -l hin.
Er, what's with the low-level SSE/AVX programming instead of trusting tools like …? BTW, how is alignment at 32-byte boundaries achieved, and has this been verified? I was blinded by all the other uses of "align", so I didn't find any uses relating to memory alignment...
As far as I know, Tesseract currently does not assume 32-byte alignment, but tests the alignment at execution time and chooses different code paths for aligned and unaligned arrays of double values.
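For readers less familiar with the intrinsics involved, a minimal sketch of that pattern, assuming AVX is available; this is illustrative only and not the code from this PR:

```cpp
#include <immintrin.h>
#include <cstdint>

// Inner kernel: the same loop is used for both code paths, only the load
// instruction differs depending on whether both arrays are 32-byte aligned.
static double DotProductAVXImpl(const double* u, const double* v, int n,
                                bool aligned) {
  __m256d sum = _mm256_setzero_pd();
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    __m256d a = aligned ? _mm256_load_pd(u + i) : _mm256_loadu_pd(u + i);
    __m256d b = aligned ? _mm256_load_pd(v + i) : _mm256_loadu_pd(v + i);
    sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));
  }
  alignas(32) double lanes[4];
  _mm256_store_pd(lanes, sum);
  double result = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < n; ++i) result += u[i] * v[i];  // scalar tail
  return result;
}

double DotProductAVX(const double* u, const double* v, int n) {
  // Test both pointers at run time and pick the aligned or unaligned path.
  const bool both_aligned = reinterpret_cast<std::uintptr_t>(u) % 32 == 0 &&
                            reinterpret_cast<std::uintptr_t>(v) % 32 == 0;
  return DotProductAVXImpl(u, v, n, both_aligned);
}
```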
Just a note: …
@stweil Has anybody verified that this differentiation actually does some good? If you leave it up to chance, I don't think the odds are in your favour, especially since both arguments have to be aligned at the same time: the current code does not choose load instructions independently for each parameter. So if it does make a difference, you had better make sure it is not just a lucky once in a while. One parameter aligned is actually the worst case, because you don't necessarily have to start at the beginning of a vector for the SIMD instructions, so what matters is whether the arguments are mutually aligned or not. @amitdo I don't think that the laggard should set the pace, considering how relatively easy the pragma is to use.
@rfschtkt, it looks like one parameter is (mostly?) aligned while the other parameter is only aligned in one of four steps (it increases in steps of sizeof(double) == 8). As far as I can see, the code could also be modified to handle one aligned parameter with the other parameter unaligned, as sketched below. I'm still not sure how much faster the aligned access is. As reported by @Shreeshrii, some CPU models don't show increased speed while others show significantly faster execution. Memory caches also play an important role. That's why I want to offer several implementations in Tesseract, so users can choose which one is best for their environment.
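A hedged sketch of that modification, again assuming AVX; it differs from the sketch above only in testing each pointer separately, and is not the code from this PR:

```cpp
#include <immintrin.h>
#include <cstdint>

double DotProductMixedAlignment(const double* u, const double* v, int n) {
  // Decide once per argument whether it is 32-byte aligned.
  const bool u_aligned = reinterpret_cast<std::uintptr_t>(u) % 32 == 0;
  const bool v_aligned = reinterpret_cast<std::uintptr_t>(v) % 32 == 0;
  __m256d sum = _mm256_setzero_pd();
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    // Pick the aligned or unaligned load independently for each argument.
    __m256d a = u_aligned ? _mm256_load_pd(u + i) : _mm256_loadu_pd(u + i);
    __m256d b = v_aligned ? _mm256_load_pd(v + i) : _mm256_loadu_pd(v + i);
    sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));
  }
  alignas(32) double lanes[4];
  _mm256_store_pd(lanes, sum);
  double result = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < n; ++i) result += u[i] * v[i];  // scalar tail
  return result;
}
```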
Do you have any evidence that such differentiation might be useful beyond the purpose of experimentation? I would expect cache-oblivious code and techniques like blocking/strip-mining to benefit all architectures more or less equally. It doesn't seem very efficient to try to reinvent the wheel here, but I haven't yet looked into issues like the licensing associated with using an external library with Tesseract, so...
See https://github.com/RRZE-HPC/DDOT-Bench and the associated article for more research on this. They also have code which can be used freely.
How about #983? The idea is to rely primarily on OpenMP 4.0's simd where available, then try explicit code for specific implementations, then fall back to serial execution. If you're confident that you have a better implementation than OpenMP, this order can of course be changed, but I don't think this is the case yet.
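For reference, the OpenMP approach amounts to a single pragma; a minimal sketch (not the actual code from #983):

```cpp
// Let the compiler vectorize the reduction; no intrinsics and no explicit
// alignment handling needed. Without OpenMP support the pragma is ignored
// and the loop simply runs serially.
double DotProductSimd(const double* u, const double* v, int n) {
  double sum = 0.0;
#pragma omp simd reduction(+ : sum)
  for (int i = 0; i < n; ++i) {
    sum += u[i] * v[i];
  }
  return sum;
}
```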
Could you share some details on how you test performance here? Could this be reframed as a unit test that immediately rejects a suboptimal implementation choice? My conjecture is that OpenMP wipes the floor with (relatively speaking) naive use of SSE/AVX intrinsics, and that you only need the latter if you are stuck with Visual C++ (for now).
Typically I use these test scenarios for the OCR process (…).
Recently I added a test scenario for training. It is based on issue #961 provided by @Shreeshrii. I also started writing a standalone test to measure the effects of cache size, but that is still unfinished.
What is the status of this PR? At least the merge conflicts need to be resolved....
The conflicts are resolved now. I suggest keeping the PR open until more people have reported their timing test results.
Another test – Haswell i5 3.2 GHz with AVX2 on macOS High Sierra. Tesseract compiled with clang. Without this PR:
With this PR:
The output text files were identical. So no improvement, unfortunately. The training data was the "fast" files.
That confirms my previous results and also those reported by @Shreeshrii. My new code obviously only improves the performance on server CPUs (Intel Xeon). Cache sizes and memory bandwidth might differ for such CPUs.
So, should I merge this PR?
No, I still want to get more timing results. Then I plan to send a patch which allows users to choose from several methods of calculating the dot product. I'll add RFC to the title of the pull request here to make the status clearer.
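Such a user-selectable dot product could look roughly like the following sketch; the function names, the "auto"/"avx" strings and the placeholder bodies are illustrative assumptions, not Tesseract's actual interface:

```cpp
#include <string>

typedef double (*DotProductFn)(const double* u, const double* v, int n);

// Stand-in implementations; in the real code these would be the generic C++
// version and the SSE/AVX versions.
static double DotProductGeneric(const double* u, const double* v, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += u[i] * v[i];
  return sum;
}

static double DotProductAVX(const double* u, const double* v, int n) {
  return DotProductGeneric(u, v, n);  // placeholder body for this sketch
}

// Pick an implementation from a user setting, falling back to the generic
// code when the requested one is unavailable on the current CPU.
static DotProductFn ChooseDotProduct(const std::string& requested,
                                     bool have_avx) {
  if ((requested == "avx" || requested == "auto") && have_avx) {
    return DotProductAVX;
  }
  return DotProductGeneric;
}
```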
This improves readability and reduces the code size from 317 to 288 bytes. Signed-off-by: Stefan Weil <[email protected]>
Fix this gcc warning:
arch/dotproductavx.cpp:93:38: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
The new code no longer needs conditional compilation. Signed-off-by: Stefan Weil <[email protected]>
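For context, this warning typically comes from reading the lanes of a __m256d through a reinterpret-cast pointer; a hedged sketch of the kind of change that avoids it (not necessarily the exact change in this commit):

```cpp
#include <immintrin.h>

static double HorizontalSum(__m256d v) {
  // Type-punned variant that triggers -Wstrict-aliasing:
  //   return ((double*)&v)[0] + ((double*)&v)[1] + ...;
  // Storing the vector into a plain array instead is well defined.
  alignas(32) double lanes[4];
  _mm256_store_pd(lanes, v);
  return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```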
This improves the performance significantly. Signed-off-by: Stefan Weil <[email protected]>
@stweil: what should we do with this PR regarding the 4.0 release?
That depends on the planned release date. I still want to allow the user to select the code used for the calculation of the dot product, but I won't finish that in the next two weeks, so I am afraid it will have to wait for a 4.1 release.
This pull request is now obsolete, as newer code gives much better results with AVX and the best models.
The same test on a virtual machine with AVX shows no improvement from the code in PR #954, but a slightly faster result with the new optimized dot product.
Test results on a MacBook Pro also show only a small improvement with the new dot product code.
@stweil Please add the training test to the …