cache performance and inline functions #1035

Open
hotchkiss87 opened this issue Jul 13, 2017 · 12 comments

@hotchkiss87
Contributor

hotchkiss87 commented Jul 13, 2017

From general testing, it appears that I'm I/O bound while using tesseract rather than CPU bound. I looked at CPU cache performance (profiled with both valgrind and perf; perf's output is the easier to read) and found that my machines have a high number of cache misses, and even a high number of iTLB misses. After inspecting the code, I also noticed that almost every function is defined as 'inline'. The high rate of iTLB misses surprised me. Usually iTLB misses drop off the longer a program runs, but increasing the run time (for instance with an image containing far more text) made my iTLB miss rate increase, perhaps indicating that I am overflowing the iTLB, which is rare.

The iTLB misses worry me, especially because running a program for a longer period of time typically reduces them (one inevitably takes cache/TLB misses when executing something the CPU has not seen, but once it is in the cache, it should stay in the cache...). iTLB misses are expensive on Intel CPUs. I have used OpenMP in the past, but I'm not sure how it handles an iTLB miss per 'thread'. If each thread takes the miss itself and the resulting page walks are non-blocking (which would make sense), then every thread would trigger a page walk for the same instructions. I'm not sure that's what actually happens, but I'd love to know what does (or a book or resource), if someone knows more about how OpenMP handles this.

I have seen similarly styled code at my last job, where the main part of the program was written over 25 years ago. At the time I expressed surprise (and alarm) to a co-worker, who explained that compilers used to be terrible at inlining functions on their own, and that GCC made no decent attempt at it until GCC 4, released in 2005. (The GCC 4 you might use today has had many updates and improvements since 2005, so GCC 4 itself is not that 'outdated', but it was the first major step forward.) So before 2005, best practice was to write code the way Tesseract does, marking almost every function 'inline'.

The 'inline' specifier carries some interesting aspects too, and was more 'important' to use with C, since C++ already treats some functions as inline by default. From https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Inline.html#Inline:

As required by ISO C++, GCC considers member functions defined within the body of a class to be marked inline even if they are not explicitly declared with the inline keyword.

Also, GCC can ignore 'inline' as it sees fit. The -O2 optimization level includes -finline-small-functions, which inlines functions whose bodies are smaller than the expected call overhead, so the code does not grow. -O3 includes -finline-functions, which considers every function for inlining; that is more or less the same as writing 'inline' before every function and having GCC ignore it wherever it decides inlining would hurt performance. -O2 is in the default configure options, so adding 'inline' before each function is similar to just compiling with -finline-functions, except slightly more explicit.
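
A minimal illustration of these rules (a made-up class, not Tesseract code):

// inline_demo.h -- hypothetical example, not from Tesseract.
class Accumulator {
 public:
  // Defined inside the class body: implicitly inline per ISO C++,
  // even without the keyword.
  void Add(double x) { total_ += x; ++count_; }

  // Only declared here; the definition lives in a .cpp file and is
  // emitted once, as ordinary out-of-line code.
  double Mean() const;

 private:
  double total_ = 0.0;
  int count_ = 0;
};

// A free function marked 'inline': the keyword is a substitution hint
// plus an ODR exemption. Under -O2 GCC still decides for itself whether
// substituting the body is actually profitable.
inline double Square(double x) { return x * x; }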

There are many downsides to marking a function 'inline', but the one people usually notice first is increased compile time. There's plenty of information in GNU's documentation (and elsewhere online) about inline, the history of inline, and the slight differences between inline in g++ and GNU C back in the day.

The biggest example of what I'd call (modern) "overactive inline use" in tesseract is probably in these files:
https://github.com/tesseract-ocr/tesseract/blob/master/lstm/functions.h
https://github.com/tesseract-ocr/tesseract/blob/master/lstm/functions.cpp

functions.cpp has two lines of code, with almost everything in functions.h.

Below is an example of compiler output with the default options, i.e. keeping -O2 and all of the functions 'inlined', compiled with -Winline (https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Inline.html#Inline), which is documented as:

Using -Winline warns when a function marked inline could not be substituted, and gives the reason for the failure.

functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
 inline double Logistic(double x) {
               ^~~~~~~~
functions.h:64:37: note: called from here
   if (x < 0.0) return 1.0 - Logistic(-x);
                             ~~~~~~~~^~~~
[the same Logistic warning is repeated three more times]
functions.h:45:15: warning: inlining failed in call to ‘double tesseract::Tanh(double)’: call is unlikely and code size would grow [-Winline]
 inline double Tanh(double x) {
               ^~~~
functions.h:46:28: note: called from here
   if (x < 0.0) return -Tanh(-x);
                        ~~~~^~~~

As a test on my target machines, I copied my 'tesseract' repository, removed all of the (easy to remove...) 'inline' specifiers, moved the definitions into the .cpp files instead of the .h files (leaving the declarations in the .h, as is standard practice), and compiled a separate version, installed under a different prefix. This allowed me to keep multiple versions of Tesseract on the same machine and run tests with all of them.
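
The shape of that change, using Logistic from lstm/functions.h as the example (body elided, since only the placement matters here):

// Before -- functions.h: definition lives in the header, marked inline,
// so every translation unit that includes it gets its own copy to consider.
inline double Logistic(double x) {
  if (x < 0.0) return 1.0 - Logistic(-x);
  // ... rest of the body ...
}

// After -- functions.h: declaration only.
double Logistic(double x);

// After -- functions.cpp: the definition, compiled exactly once; with LTO
// the compiler can still inline it across translation units where it pays.
double Logistic(double x) {
  if (x < 0.0) return 1.0 - Logistic(-x);
  // ... rest of the body ...
}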

Next, I made a pair of simple bash scripts using GNU Parallel (O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47). They look like this:

#!/bin/bash
# CONTROL SCRIPT
TESS=/usr/bin/tesseract
F_LOC=/path/to/test/images  # location of my test images
parallel $TESS $F_LOC/{1} stdout ::: img1.jpg img2.png img3.jpeg

and

#!/bin/bash
# TEST SCRIPT
TESS=/home/hotchkiss/usr/bin/tesseract
F_LOC=/path/to/test/images  # location of my test images
parallel $TESS $F_LOC/{1} stdout ::: img1.jpg img2.png img3.jpeg

I ran them ~50 times, choosing which script to run at random (as random as /dev/urandom can make a two-way choice), and collected the average run times and percentages as an aggregate using perf. Each run was done after a fresh reboot, with no other programs running besides X, i3 (my window manager), and my default kernel + kernel modules (less than 110MB according to 'free -m').

Results look something like:

 Performance counter stats for 'test_tess':

      54913.737862      task-clock:u (msec)       #    6.757 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
           136,385      page-faults:u             #    0.002 M/sec                  
   161,251,587,471      cycles:u                  #    2.936 GHz                      (29.05%)
   154,963,746,273      instructions:u            #    0.96  insn per cycle           (36.65%)
    35,080,176,770      branches:u                #  638.823 M/sec                    (42.77%)
       306,913,327      branch-misses:u           #    0.87% of all branches          (41.68%)
    50,274,555,586      L1-dcache-loads:u         #  915.519 M/sec                    (27.81%)
     5,075,515,864      L1-dcache-load-misses:u   #   10.10% of all L1-dcache hits    (19.72%)
       450,004,836      LLC-loads:u               #    8.195 M/sec                    (17.70%)
        23,036,525      LLC-load-misses:u         #   10.24% of all LL-cache hits     (21.92%)
   <not supported>      L1-icache-loads:u                                           
        11,674,082      L1-icache-load-misses:u                                       (28.02%)
    48,676,934,856      dTLB-loads:u              #  886.425 M/sec                    (21.31%)
         1,322,784      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   (20.86%)
           264,354      iTLB-loads:u              #    0.005 M/sec                    (17.36%)
           690,732      iTLB-load-misses:u        #  261.29% of all iTLB cache hits   (21.42%)
   <not supported>      L1-dcache-prefetches:u                                      
   <not supported>      L1-dcache-prefetch-misses:u                                   

       8.127386606 seconds time elapsed

As expected, with compiler optimizations disabled and all of the 'inline' specifiers removed, performance was worse by a significant amount (8% to 10%). Also as expected with modern compilers (gcc 6.3 and gcc 8.0... I haven't run 7.x stable yet, although I could tomorrow), the optimized build without 'inline' performs about the same, and (perhaps, although it requires more testing to be definite) better on a CPU with smaller cache/TLB sizes.

I used GNU Parallel to ensure that things were running in parallel, similar to how I will run Tesseract in production, and how (hopefully) everyone else runs it too. From what others have posted, it seems most people are doing the same.

What I'm asking

  1. How are you collecting cache and performance information, is it standardized, and how would you like any statistics or data to be submitted/displayed?
  2. Is it alright if I remove 'inline' from every function and submit the code?
  3. Should I submit code with 'inline' removed from functions others have written?
  4. What are your (other developers') target machines like, and are my results and thought process similar?

Personally, I'm not a fan of manually marking all functions as inline, but I'm also (quite) used to re-writing code/libraries to optimize performance for my use case(s).

I can wait to post any code changes until @theraysmith has the new beta tag out. I'm also not sure how these types of issues are usually handled on GitHub; I've spent most of my time working at an FFRDC and because of that haven't submitted much FOSS code publicly, so being able to post this work is great. Typically I'd either write my own tests and submit them along with the code (and a 'NO MERGE' pull request), or follow already-provided tests, but I haven't seen any here. What do you all prefer to use, and is this something you're interested in? I absolutely abhor, detest, and strongly oppose programming for the compiler, but I feel these changes would actually do the opposite, and let the compiler take over.

I'm also profiling functions for more detailed optimization with cachegrind, and I'll absolutely post those results after these changes have been submitted. Again, it's less a 'logic' change and in some cases more a logic 'reordering' (like the order of conditional tests, or what to inline/skip/spend more time on, and the like).
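
A trivial illustration of the kind of reordering I mean (hypothetical names, not actual Tesseract code):

// Before: the expensive test is evaluated on every call.
if (ExpensiveClassify(blob) && blob.IsRareCase()) {
  HandleRareCase(blob);
}

// After: the cheap, highly selective test short-circuits first, so
// ExpensiveClassify() runs only on the rare candidates.
// (Only valid when the two tests are independent and side-effect free.)
if (blob.IsRareCase() && ExpensiveClassify(blob)) {
  HandleRareCase(blob);
}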

Another option that might be decent, if it turns out that removing 'inline' does hurt performance on CPUs with large cache/TLB sizes, is to do something similar to the Linux kernel: it has an "Allow gcc to uninline functions marked 'inline'" option (under 'Kernel hacking') because they ran into the same issue.


Environment

  • Tesseract Version: 4.0x (latest -dev)
  • Platform: Linux [hostname] 4.12.0 #1 SMP Tue Jul 11 14:56:49 EDT 2017 x86_64 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz GenuineIntel GNU/Linux
    (also tested on an Intel Atom; I don't have its specs with me right now, but I can post them later).

tested with:
gcc version 8.0.0 20170711 (experimental) (GCC)
gcc version 6.3.0 (Gentoo 6.3.0 p1.0)
I can test with gcc 7.x later if need be. I haven't switched my development system's stable version of gcc to 7 yet, even though I still test the experimental svn branch.

@theraysmith
Contributor

theraysmith commented Jul 13, 2017 via email

@amitdo
Collaborator

amitdo commented Jul 13, 2017

Regarding OpenMP thread affinity control:

https://www.nas.nasa.gov/hecc/support/kb/using-intel-openmp-thread-affinity-for-pinning_285.html

KMP_AFFINITY is specific to Intel's OpenMP runtime.

GCC supports GOMP_CPU_AFFINITY
https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html#GOMP_005fCPU_005fAFFINITY

OpenMP 3.1 added OMP_PROC_BIND
https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPROC_005fBIND.html#OMP_005fPROC_005fBIND

OpenMP 4.0 added OMP_PLACES
https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPLACES.html#OMP_005fPLACES
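
If anyone wants to verify what the runtime actually resolved those variables to, a minimal check (my sketch, not from Tesseract) looks like this:

#include <cstdio>
#include <omp.h>

// Build with: g++ -fopenmp check_bind.cpp
int main() {
  // Prints the proc-bind policy the runtime resolved from
  // OMP_PROC_BIND / OMP_PLACES; omp_proc_bind_false (0) means no binding.
  std::printf("proc-bind policy: %d\n", static_cast<int>(omp_get_proc_bind()));
  return 0;
}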

@stweil
Member

stweil commented Jul 13, 2017

What are your (other developer's) target machines like, and are my results and thought process similar?

I'm just processing a larger number of images, running Tesseract on 4 Xeon servers of different ages with (16 + 16 + 32 + 8) tesseract processes under GNU Parallel. The binaries were built without OpenMP to avoid the overhead related to multithreading and to allow optimal use of the available CPU cores (one process per core). This setup produced 150,000 hOCR files during the last 10 days (example).

@stweil
Member

stweil commented Jul 13, 2017

Regarding OpenMP thread affinity: I don't think the problem will be fixed by binding OpenMP threads to a fixed CPU core, nor will other thread libraries help. The current Tesseract simply creates too many threads, even if only a single process is started, because it uses interleaved #pragma omp parallel statements. As soon as more than one process is running, things get even worse, because the number of threads then exceeds the number of CPU cores and scheduling overhead occurs.
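
Schematically, the pattern is something like this (a sketch of the structure, not the actual Tesseract source):

#include <omp.h>

// Each helper opens its own parallel region; a forward pass calls them
// over and over. Within one process libgomp reuses its thread pool, but
// every tesseract process keeps its own pool, so N processes on C cores
// can leave roughly N x C runnable threads competing for the CPUs.
void Step1(float* data, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) data[i] *= 2.0f;  // stand-in for real work
}

void Step2(float* data, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) data[i] += 1.0f;  // stand-in for real work
}

void Forward(float* data, int n, int timesteps) {
  for (int t = 0; t < timesteps; ++t) {  // per timestep: regions interleave
    Step1(data, n);
    Step2(data, n);
  }
}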

@amitdo
Collaborator

amitdo commented Jul 13, 2017

because the number of threads then will exceed the number of CPU cores

but you can limit the number of threads being used.

@stweil
Member

stweil commented Jul 13, 2017

I'm afraid you can limit it only by recompilation (either without OpenMP or with modified source code).

@amitdo
Collaborator

amitdo commented Jul 13, 2017

Well, we can change the code...

(if we can prove the change will really make things better)

@amitdo
Collaborator

amitdo commented Jul 13, 2017

Did you try OMP_THREAD_LIMIT=1 tesseract...?

@stweil
Member

stweil commented Jul 13, 2017

No, I did not – obviously I missed that environment variable. I had only tried OMP_NUM_THREADS. Setting OMP_THREAD_LIMIT=1 indeed stops thread creation. Many thanks. Now I no longer have to build special binaries without OpenMP.
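
For reference, a minimal sketch that shows the variable being honored:

#include <cstdio>
#include <omp.h>

// Build with: g++ -fopenmp check_limit.cpp
// With OMP_THREAD_LIMIT=1 set, this prints "limit: 1" and "team: 1";
// unset, the team is normally one thread per core.
int main() {
  std::printf("limit: %d\n", omp_get_thread_limit());
  #pragma omp parallel
  {
    #pragma omp single
    std::printf("team: %d\n", omp_get_num_threads());
  }
  return 0;
}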

@amitdo
Collaborator

amitdo commented Jul 13, 2017

Great!

I just discovered this variable an hour ago :-)
https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html

@amitdo
Collaborator

amitdo commented Dec 27, 2021

@stweil,

Regarding OpenMP thread affinity: I don't think the problem will be fixed by binding OpenMP threads to a fixed CPU core, nor will other thread libraries help. The current Tesseract simply creates too many threads, even if only a single process is started, because it uses interleaved #pragma omp parallel statements. As soon as more than one process is running, things get even worse, because the number of threads then exceeds the number of CPU cores and scheduling overhead occurs.

https://github.com/tesseract-ocr/tesseract/search?q=omp+path%3Asrc%2Flstm

Maybe we should remove some of the #pragma omp parallel for directives to prevent the interleaving issue?

@amitdo
Collaborator

amitdo commented Jan 7, 2022

tesseract-ocr/tesstrain#259 (comment)

I'm OK with disabling OpenMP by default in 5.0.1 and 4.1.4.

I still think OpenMP in Tesseract can be improved (future work).
