cache performance and inline functions #1035
I haven't measured cache performance in detail as you have; I've mostly
measured real-time performance under various conditions and speculated
as to the cause ;-) The detail you provide is very interesting.
One thing I have noticed is that performance of OpenMP varies a lot
according to CPU load, so you will get very poor performance trying to run
multiple tesseracts in parallel. A major performance improvement is
possible under such circumstances if you pin OpenMP threads to cores, which
can be done using an environment variable:
https://www.nas.nasa.gov/hecc/support/kb/using-intel-openmp-thread-affinity-for-pinning_285.html
I think what happens is that OpenMP or the OS moves the threads around
the cores when the machine is busy, for no good reason other than wanting
to run other processes, and that kills cache performance.
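As a concrete sketch of the pinning approach (assuming a GCC/libgomp build; GOMP_CPU_AFFINITY is libgomp-specific, while the Intel runtime described at the NASA link uses KMP_AFFINITY instead; the image name is a placeholder):

```shell
# Pin four OpenMP threads to cores 0-3 before launching tesseract, so
# the OS scheduler cannot migrate them and evict the per-core caches.
export OMP_NUM_THREADS=4
export GOMP_CPU_AFFINITY="0-3"
tesseract image.png stdout
```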
The weight matrices are mostly small enough to fit in core-level cache, but
if they get moved, it will be much slower. The effect may be more intense
with the smaller models that I have coming soon.
I have often wondered whether it would work more reliably faster if
re-written to not use Open MP, but use low-level threads instead. That
would be a lot of work, and may not pay off, so I haven't done it.
Alternatively, if you want to run it on multiple images in parallel using
the parallel command, you might just get better performance by turning off
OpenMP entirely and running it single-threaded.
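A sketch of that setup (assuming a libgomp-based build; setting OMP_NUM_THREADS=1 avoids spawning OpenMP worker threads without rebuilding):

```shell
# One single-threaded tesseract per image: GNU Parallel saturates the
# cores with independent processes instead of OpenMP threads.
export OMP_NUM_THREADS=1
parallel tesseract {} stdout ::: *.png
```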
I agree that some of the (larger) code in functions.h should be de-inlined.
I have a question on your opening statement. When you say I/O bound, I
assume you mean between the core and main memory, not I/O to disk?
…On Wed, Jul 12, 2017 at 6:46 PM, Justin Hotchkiss Palermo < ***@***.***> wrote:
From general testing, it appears as though I'm I/O bound while using
tesseract, and not CPU bound. I checked out the CPU Cache performance
(profiled with valgrind and perf, but perf's output is easiest to read) and
found that my machines have a high number of cache misses, and even a high
number of iTLB misses. After inspecting the code, I also noticed that
almost every function was defined as 'inline'. The high rate of iTLB misses
surprised me. Usually iTLB misses are reduced after running a program for a
longer period of time, but increasing the run time (for instance with an
image with a lot more text) saw my iTLB miss rate increase, perhaps
indicating that I am filling my iTLB, which is rare.
The iTLB misses worry me, especially because running a program for a
longer period of time typically reduces them (one inevitably has cache/tlb
misses when executing something that the CPU has not seen, but once it is
in the cache, it should stay in the cache...). iTLB misses are expensive on
Intel CPUs. I have used OpenMP in the past, but I'm not sure how it handles
an iTLB miss per 'thread'. If each thread hits a miss and hits a page fault
exception, and it's non blocking (which makes sense), then it'd mean that
each thread would cause a page walk for the same instructions. Not sure if
that's what happens, but I'd love to know what does happen (or a book or
resource) if someone knows more about how OpenMP handles them.
I have seen code styled similarly at my last job, where the main part of
the program was written over 25 years ago. At the time I expressed surprise
(and alarm) to a co-worker, who explained that compilers used to be
terrible at attempting to inline functions, and it wasn't until GCC 4 that
there was a decent attempt to improve them. GCC 4 was released in 2005.
(The version of GCC 4 you are using now has many updates and improvements,
and is different from the GCC 4 in 2005, so GCC 4 is not that 'outdated',
but it was the first major time where there were improvements.) So before
2005, best practice was to write code similar to how Tesseract identifies
almost all functions as 'inline'.
The 'inline' specifier carries some interesting aspects too, and was more
'important' to use with C, since C++ defaults to 'inline' when... (From
https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Inline.html#Inline
<http://url>)
As required by ISO C++, GCC considers member functions defined within the
body of a class to be marked inline even if they are not explicitly
declared with the inline keyword.
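A minimal illustration of that rule (toy code, not from Tesseract): a member function defined inside the class body is treated as inline with no keyword, while a free function at namespace scope needs the explicit specifier to get the same treatment.

```cpp
// Toy example: 'twice' is implicitly inline because it is defined
// inside the class body (the ISO C++ rule quoted above).
struct Scaler {
  int twice(int x) { return 2 * x; }
};

// A free function defined in a header must say 'inline' explicitly to
// get the same allow-multiple-definitions linkage.
inline int twice_free(int x) { return 2 * x; }
```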
Also, GCC can ignore 'inline' as it sees fit. The -O2 optimization option
includes -finline-small-functions, which inlines functions that do not
generate additional code. -O3 includes -finline-functions, which is more or
less the same as writing 'inline' before every function, and only ignoring
it if GCC decides that it may reduce performance. -O2 is in the default
configure options, so adding 'inline' before each function is similar to
just compiling with '-finline-functions', except that it is slightly more
explicit.
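To see which of those decisions the compiler actually makes, one can recompile with the warning flag enabled (an illustrative command line; the real build goes through configure/make, and the path is an assumption):

```shell
# -Winline reports every call site where an 'inline' request was not
# honored; -O2 implies -finline-small-functions, and -O3 adds
# -finline-functions on top of it.
g++ -O2 -Winline -c lstm/functions.cpp -o /dev/null
```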
There are many downsides to defining a function as 'inline', but the one
people usually seem to notice is increased compile time. There's plenty of
information available on Google/GNU's documentation about inline, the
history of inline, and slight differences between inline in g++ and GNU C
back in the day.
The biggest example of what I'd call (modern) "overactive inline use" in
tesseract is probably in these files:
https://github.com/tesseract-ocr/tesseract/blob/master/lstm/functions.h
https://github.com/tesseract-ocr/tesseract/blob/master/lstm/functions.cpp
functions.cpp has two lines of code, with almost everything in functions.h.
This is also an example of compiler output with default options, i.e.
keeping '-O2' and all of the different functions 'inlined' (with -Winline:
https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Inline.html#Inline ),
defined as:
Using -Winline warns when a function marked inline could not be
substituted, and gives the reason for the failure.
functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
inline double Logistic(double x) {
^~~~~~~~
functions.h:64:37: note: called from here
if (x < 0.0) return 1.0 - Logistic(-x);
~~~~~~~~^~~~
functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
inline double Logistic(double x) {
^~~~~~~~
functions.h:64:37: note: called from here
if (x < 0.0) return 1.0 - Logistic(-x);
~~~~~~~~^~~~
functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
inline double Logistic(double x) {
^~~~~~~~
functions.h:64:37: note: called from here
if (x < 0.0) return 1.0 - Logistic(-x);
~~~~~~~~^~~~
functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
inline double Logistic(double x) {
^~~~~~~~
functions.h:64:37: note: called from here
if (x < 0.0) return 1.0 - Logistic(-x);
~~~~~~~~^~~~
functions.h:45:15: warning: inlining failed in call to ‘double tesseract::Tanh(double)’: call is unlikely and code size would grow [-Winline]
inline double Tanh(double x) {
^~~~
functions.h:46:28: note: called from here
if (x < 0.0) return -Tanh(-x);
~~~~^~~~
As a test on my target machines, I copied my 'tesseract' repository,
removed all (easy to remove...) 'inline' functions, shifted the definition
to the .cpp file instead of the .h file (leaving the declaration in .h as
is standard practice), and compiled a separate version, installing it with
a different prefix. This allowed me to have multiple versions of Tesseract
on the same machine, and to run tests with all of them.
Next, I made a script. Fairly simple, utilizing GNU Parallel
CITATION:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
The simple bash scripts look like this:
#!/bin/bash
# CONTROL SCRIPT
TESS=/usr/bin/tesseract
F_LOC=(location of my test images)
parallel $TESS $F_LOC/{1} stdout ::: img1.jpg img2.png img3.jpeg
and
#!/bin/bash
# TEST SCRIPT
TESS=/home/hotchkiss/usr/bin/tesseract
F_LOC=(location of my test images)
parallel $TESS $F_LOC/{1} stdout ::: img1.jpg img2.png img3.jpeg
I'd run them ~50 times, 'randomly' meaning I randomly chose which
version to run (as random as /dev/urandom can afford with two options),
and collected the average run times and percentages as an aggregate using perf.
Each run was after a fresh reboot, with no other programs running other
than X, i3 (my window manager), and my default kernel + kernel modules
(less than 110MB when running 'free -m').
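The counters below can be gathered with a perf invocation along these lines (illustrative; './test_tess' stands in for the test script above, and the event list mirrors the stats shown):

```shell
# Collect the cache/TLB/branch counters reported in the results.  perf
# time-multiplexes the hardware events, which is why each line of the
# output carries a sampling percentage in parentheses.
perf stat -e task-clock,cycles,instructions,branches,branch-misses \
  -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
  -e iTLB-loads,iTLB-load-misses,dTLB-loads,dTLB-load-misses \
  ./test_tess
```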
Results look something like:
Performance counter stats for 'test_tess':
54913.737862 task-clock:u (msec) # 6.757 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
136,385 page-faults:u # 0.002 M/sec
161,251,587,471 cycles:u # 2.936 GHz (29.05%)
154,963,746,273 instructions:u # 0.96 insn per cycle (36.65%)
35,080,176,770 branches:u # 638.823 M/sec (42.77%)
306,913,327 branch-misses:u # 0.87% of all branches (41.68%)
50,274,555,586 L1-dcache-loads:u # 915.519 M/sec (27.81%)
5,075,515,864 L1-dcache-load-misses:u # 10.10% of all L1-dcache hits (19.72%)
450,004,836 LLC-loads:u # 8.195 M/sec (17.70%)
23,036,525 LLC-load-misses:u # 10.24% of all LL-cache hits (21.92%)
<not supported> L1-icache-loads:u
11,674,082 L1-icache-load-misses:u (28.02%)
48,676,934,856 dTLB-loads:u # 886.425 M/sec (21.31%)
1,322,784 dTLB-load-misses:u # 0.00% of all dTLB cache hits (20.86%)
264,354 iTLB-loads:u # 0.005 M/sec (17.36%)
690,732 iTLB-load-misses:u # 261.29% of all iTLB cache hits (21.42%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
8.127386606 seconds time elapsed
As expected, with no compiler optimizations and all 'inline' identifiers
removed, performance was significantly worse (8% to 10%). Also as
expected with modern compilers (gcc 6.3 and gcc 8.0... I haven't run 7.x
stable yet, although I could tomorrow), performance seems to be about
the same, and (perhaps, though it requires more testing to be definite)
better on a CPU with smaller cache/TLB sizes.
I used GNU Parallel to ensure that things were running in parallel,
similar to what I will be using in production and what (hopefully)
everyone else is doing too. I checked what others have posted, and it
seems as though most are doing the same.
I am posting this for a few reasons.
1. How are you collecting cache and performance information, is it
standardized, and how would you like any statistics or data to be
submitted/displayed?
2. Is it alright if I remove and submit code without 'inline' declared
on every function?
3. Should I submit code with 'inline' removed from functions others
have written?
4. What are your (other developers') target machines like, and are my
results and thought process similar?
Personally, I'm not a fan of manually marking all functions as inline, but
I'm also (quite) used to re-writing code/libraries to optimize performance
for my use case(s).
I can wait on posting any code changes until @theraysmith
<https://github.com/theraysmith> has the new beta tag out, though. I'm
not sure how these types of issues are handled on GitHub either. I've
spent most of my time working at an FFRDC and haven't submitted much
FOSS code publicly because of it, but I can post what I've done here,
which is great. Typically, I'd either write my own tests and submit them
along with compiled code (and a 'NO MERGE' pull request), or follow
already-provided tests and submit according to them; however, I haven't seen any.
What do you all prefer to use, and would this be something you're
interested in? I absolutely abhor, detest, and strongly oppose programming
for the compiler, but I feel like these changes would actually do the
opposite, and allow the compiler to take over.
I'm also profiling functions for more detailed optimization with
cachegrind, and I'll absolutely post those after the changes have been
submitted. Again, it's less of a 'logic' change, and in some cases more of
a logic 'reordering' (like order of conditional if tests, what to
inline/skip/spend more time on and the like).
Another option that might be decent, if it turns out that removing
'inline' does hurt performance on CPUs with large cache/TLB spaces, is
to do something similar to what the Linux kernel has: an "Allow gcc to
uninline functions marked 'inline'" option (under 'Kernel Hacking'),
because they have run into the same issue.
------------------------------
Environment
- *Tesseract Version*: 4.0x (latest -dev)
- *Platform*: Linux [hostname] 4.12.0 #1
<#1> SMP Tue Jul 11
14:56:49 EDT 2017 x86_64 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
GenuineIntel GNU/Linux
(and, specs not with me right now, but posting anyway, an Intel Atom,
can post later).
tested with:
gcc version 8.0.0 20170711 (experimental) (GCC)
gcc version 6.3.0 (Gentoo 6.3.0 p1.0)
I can test with gcc 7.x later if need be. I haven't switched my
development system's stable version of gcc to 7 yet, even though I still
test the experimental svn branch.
--
Ray.
Regarding OpenMP thread affinity control:
GCC supports OpenMP; 3.1 added …, 4.0 added …
I'm just processing a larger number of images and run Tesseract on 4 Xeon servers of different ages with (16 + 16 + 32 + 8) tesseract processes using GNU Parallel. The binaries were built without OpenMP to avoid the overhead related to multithreading and to allow optimal use of the available CPU cores (one process per core). This setting produced 150,000 hOCR files during the last 10 days (example).
Regarding OpenMP thread affinity: I think that the problem won't be fixed by binding OpenMP threads to a fixed CPU core, nor will other thread libraries help. The current Tesseract simply creates too many threads, even if a single process is started, because it uses interleaved …
but you can limit the number of threads being used.
I'm afraid you can limit it only by recompilation (either without OpenMP or with modified source code).
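For reference, the autotools build has a switch for compiling OpenMP out entirely (a sketch; exact steps depend on the checkout):

```shell
# Build Tesseract with OpenMP disabled, so it never spawns worker
# threads regardless of runtime environment variables.
./configure --disable-openmp
make
```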
Well, we can change the code... (if we can prove the change will really make things better)
Did you try …
No, I did not – obviously I missed that environment variable. I only had tried …
Great! I just discovered this variable an hour ago :-)
https://github.com/tesseract-ocr/tesseract/search?q=omp+path%3Asrc%2Flstm Maybe we should remove some of the …
tesseract-ocr/tesstrain#259 (comment) I'm OK with disabling OpenMP by default in 5.0.1 and 4.1.4. I still think OpenMP in Tesseract can be improved (future work).