Performance loss without export OMP_NUM_THREADS=1
#596
Comments
Hi Remy, thanks for the investigation. I get a slowdown with hyper-threading (HT) on. I have 12 physical CPU cores, and with OMP_NUM_THREADS=12 performance is good. I'm not sure how OpenMP threads behave on the Intel Core i9-13950H, which seems to have 8 performance cores and 16 efficient cores.
Thank you very much for your feedback. Here is what I get on my system:
I see no significant difference for OMP_NUM_THREADS <= 8 and for 9 <= OMP_NUM_THREADS <= 24. Should I try to install FINUFFT from the GitHub sources until the next release? PS: is …
I'm not sure whether the current master fixes the OpenMP hyper-threading slowdown. Could you try installing finufft==2.2.0 to see whether you get a slowdown with threads >= 28? You could also install finufft from the GitHub master branch source to see whether you still get the slowdown. If finufft==2.2.0, finufft==2.3.0 and the GitHub master branch all give a similar slowdown with the default (hyper-threaded) number of OpenMP threads, then for now you may need to disable hyper-threading and choose your best number of OpenMP threads (perhaps simply set the number of OpenMP threads to the number of physical cores, say 24 in your case).
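Finding the best number of OpenMP threads amounts to timing the same transform under several OMP_NUM_THREADS values. A minimal sketch, assuming one fresh Python process per setting is acceptable: the helper names are hypothetical, and the placeholder workload in BENCH should be replaced with the actual finufft benchmark.

```python
import os
import subprocess
import sys

# Placeholder workload: swap in the finufft call you actually want to time
# (time only the transform, not the import). The child prints the thread
# count it saw and the elapsed wall time.
BENCH = """
import os, time
t0 = time.perf_counter()
sum(i * i for i in range(200_000))
print(os.environ.get("OMP_NUM_THREADS"), time.perf_counter() - t0)
"""


def run_with_omp_threads(code: str, n_threads: int) -> str:
    """Run `code` in a fresh interpreter with OMP_NUM_THREADS=n_threads.

    A new process is used because the OpenMP runtime reads OMP_NUM_THREADS
    when it initialises; editing os.environ in a process that has already
    loaded OpenMP does not change its thread count.
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def sweep(thread_counts):
    """Return {n_threads: elapsed_seconds}, one fresh process per count."""
    timings = {}
    for n in thread_counts:
        seen, elapsed = run_with_omp_threads(BENCH, n).split()
        if int(seen) != n:
            raise RuntimeError(f"child saw OMP_NUM_THREADS={seen}, expected {n}")
        timings[n] = float(elapsed)
    return timings


if __name__ == "__main__":
    # e.g. compare small counts, the physical-core count and the full HT count
    for n, t in sweep([1, 8, 12, 24, 32]).items():
        print(f"OMP_NUM_THREADS={n:>2}: {t * 1e3:.1f} ms")
```

Running each setting in its own process keeps the comparison honest, since the thread count cannot be reliably changed through the environment once OpenMP is loaded.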
PR #586 might already fix the package issue (see commit 53515f7). It's not in 2.3.0, but it should be in the next release, 2.3.1.
Sorry, I misunderstood your first post; I thought this PR was related to an upcoming fix for the OMP thread issue.
I still observe the OMP slowdown with finufft==2.2.0 and finufft==2.3.0. I tried to install from the GitHub master branch with pip install . from the …
Disabling hyper-threading and/or manually setting the OMP_NUM_THREADS variable is fine for me, but I am using finufft as a dependency of a package that I developed (see here), so I am looking for a more automatic way to handle this until a fix is available. I am thinking about manually setting OMP_NUM_THREADS by adding

# temporary fix for FINUFFT issue #596: force OMP_NUM_THREADS=1
import os
os.environ["OMP_NUM_THREADS"] = "1"

at the beginning of the …

# temporary fix for FINUFFT issue #596: force OMP_NUM_THREADS = number of physical cores
# (must run before finufft is imported, since OpenMP reads the variable when it starts up)
import os
import psutil
# psutil.cpu_count(logical=False) can return None, so fall back to the logical count
os.environ["OMP_NUM_THREADS"] = str(psutil.cpu_count(logical=False) or os.cpu_count() or 1)

Do you have a better idea?
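One possible refinement of this workaround, sketched as an assumption rather than a fix proposed in the thread: only fill in OMP_NUM_THREADS when the user has not already set it, and fall back to the logical CPU count when psutil is missing or cannot report physical cores. The function names are hypothetical.

```python
import os


def physical_core_count() -> int:
    """Best-effort physical core count, falling back to logical CPUs, then 1."""
    try:
        import psutil  # optional third-party dependency
        n = psutil.cpu_count(logical=False)  # may return None on some platforms
    except ImportError:
        n = None
    return n or os.cpu_count() or 1


def default_omp_threads() -> str:
    """Set OMP_NUM_THREADS only if it is not already set; return its value.

    Must run before finufft (or anything else that loads the OpenMP
    runtime) is imported, otherwise the value may have no effect.
    """
    return os.environ.setdefault("OMP_NUM_THREADS", str(physical_core_count()))


if __name__ == "__main__":
    print("OMP_NUM_THREADS =", default_omp_threads())
    # import finufft  # only after the environment variable is in place
```

Using setdefault means a user who explicitly exports OMP_NUM_THREADS keeps control, while everyone else gets the physical-core default.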
You should be able to pass an … One small remark: the test case you are showing is quite small, so it's perhaps not surprising that it doesn't show a big speedup when increasing the number of threads from 1 to 24. Multithreading can only really show benefits for transforms with larger grids or larger numbers of nonuniform points.
Yes, agreed. The Python interface kwargs should support …
I tried …
In my case, installation works fine, but the import raises the error mentioned above.
Thank you very much for this suggestion. I will use a decorator to change the default value of the … Please let me know when a fix lands in a future release so that I can remove my temporary fix. Thank you again.
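The decorator approach can be sketched generically: wrap a function so that a keyword argument gets a default only when the caller does not supply one. This assumes the finufft Python functions accept an nthreads option as a keyword argument (the parameter name is cut off above); fake_transform is a hypothetical stand-in so the sketch runs without finufft installed.

```python
import functools


def with_default_kwargs(**defaults):
    """Decorator factory: fill in keyword arguments the caller did not pass.

    Explicit keyword arguments from the caller always win over `defaults`.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for key, value in defaults.items():
                kwargs.setdefault(key, value)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical stand-in for a finufft entry point; with finufft one would
# wrap e.g. finufft.nufft3d1 as with_default_kwargs(nthreads=...)(finufft.nufft3d1).
def fake_transform(x, eps=1e-6, **opts):
    return {"eps": eps, **opts}


safe_transform = with_default_kwargs(nthreads=1)(fake_transform)

if __name__ == "__main__":
    print(safe_transform([0.0]))              # nthreads default injected
    print(safe_transform([0.0], nthreads=8))  # explicit value is kept
```

Because the default is applied per call rather than through the environment, it works even if the OpenMP runtime has already been initialised.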
Most probably it's because you are in …
You were right, thank you! I was able to install from the GitHub master branch. I still observe a slight slowdown with OMP_NUM_THREADS > 24 (~40 ms/pass), but this is much better than what I described with finufft v2.3.0 (~8000 ms/pass). My temporary fix still gives better performance for the moment.
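Per-pass figures such as ~40 ms versus ~8000 ms are easier to compare, and pass-to-pass variability easier to spot, when every pass is timed individually. A minimal sketch with hypothetical helper names; the lambda is a placeholder for whichever finufft call is being benchmarked.

```python
import statistics
import time


def time_passes(fn, passes: int = 5):
    """Call `fn()` `passes` times and return each wall-clock duration in ms."""
    durations = []
    for _ in range(passes):
        t0 = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - t0) * 1e3)
    return durations


def summarize(durations_ms):
    """Min / median / max, to expose pass-to-pass variability."""
    return {
        "min": min(durations_ms),
        "median": statistics.median(durations_ms),
        "max": max(durations_ms),
    }


if __name__ == "__main__":
    # Placeholder workload; wrap the finufft call under test in a lambda instead.
    ms = time_passes(lambda: sum(i * i for i in range(100_000)), passes=5)
    for i, t in enumerate(ms, start=1):
        print(f"pass {i}: {t:.1f} ms")
    print(summarize(ms))
```

Reporting the median alongside min and max keeps a single fast or slow outlier pass from dominating the comparison.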
Glad the hyper-threading slowdown is alleviated on the master branch, though I'm not sure which change brought the improvement. On the CPU side, it is mostly @mreineck's code templatizing and cleanup. Maybe @mreineck also improved the OpenMP scheduling?
It could be that it is using a different, more recent OpenMP library. Also, compiling from source enables -march=native, which can further tailor the code to the system used. As per the docs, we recommend installing from source for best performance :)
@lu1and10 it seems that even in the CPU code, now that everything is vectorized, we are memory bound. Hence, hyper-threading might actually impair performance, as it puts more pressure on the memory controller. Might it be worth using only physical cores with OpenMP? We might also need to tweak the automatic nthreads heuristic, as it seems too aggressive in the number of threads it picks.
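The idea raised here, capping OpenMP threads at the physical core count and being less aggressive for small problems, could look roughly like the following. The function name, the work measure (nonuniform points plus grid modes) and the 100 000 threshold are all hypothetical; this is not the heuristic finufft actually uses.

```python
def choose_nthreads(n_points: int, n_modes: int, physical_cores: int,
                    min_work_per_thread: int = 100_000) -> int:
    """Hypothetical thread-count heuristic (not finufft's actual rule).

    - never exceed the physical core count (no hyper-threads), since the
      vectorized kernels are suspected to be memory bound;
    - give each thread at least `min_work_per_thread` units of work
      (nonuniform points + grid modes), so small transforms stay serial.
    """
    work = n_points + n_modes
    by_work = work // min_work_per_thread
    return max(1, min(physical_cores, by_work))


if __name__ == "__main__":
    # A small 50^3 grid stays single-threaded; a large problem uses all cores.
    print(choose_nthreads(n_points=10_000, n_modes=50**3, physical_cores=24))
    print(choose_nthreads(n_points=10_000_000, n_modes=256**3, physical_cores=24))
```

Any real threshold would have to come from benchmarks on the target machines; the value here only illustrates the shape of such a rule.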
Agreed. I wonder for what problem sizes hyper-threading actually helps; if it doesn't, simply defaulting to the number of physical cores may be fine.
I'm not aware of changing anything related to multithreading, so it's a pleasant surprise that things are working better on the master branch :-)
Indeed it's a good surprise :-) However, I found that the issue reported above is actually very dependent on the values of …
Hi,
I am doing some benchmarks on a new laptop and have isolated some particular settings that lead to significant performance losses depending on whether export OMP_NUM_THREADS=1 is used before launching Python. Could you please try to reproduce the following?

Install FINUFFT in a fresh virtual environment using pip
Note the comment mentioning a potential missing dependency in the pip installation of finufft (but this is not really what brought me here).
Benchmarking issue
Open a Python console from the terminal and copy-paste the following.
Here is what I get (it is very slow except for pass 1 in this run, and the observed execution time can vary quite a lot from one pass to another):
Now open a terminal and type

export OMP_NUM_THREADS=1
python

then copy-paste the benchmarking code above again. Here is what I get (much faster and repeatable execution times):
Additional comments
Without export OMP_NUM_THREADS=1, changing dtype = np.float32 into dtype = np.float64 drops the computation time to roughly 40 ms per pass.
Changing (N1, N2, N3), for instance to N1 = N2 = N3 = 50, leads to roughly 40 ms per pass with dtype = np.float32 and without export OMP_NUM_THREADS=1.
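When comparing float32 and float64 runs, the nonuniform points and the strengths should share one precision. A sketch of an input generator, assuming a 3D type-1 transform with points in [-pi, pi) and assuming finufft selects single or double precision from these dtypes; the function name, seed and sizes are hypothetical.

```python
import numpy as np


def make_inputs(n_points: int, dtype=np.float32, seed: int = 0):
    """Random 3D type-1 NUFFT inputs with matching precision.

    Coordinates are drawn uniformly in [-pi, pi) and cast to the real
    `dtype`; strengths use the complex type of the same precision
    (complex64 for float32, complex128 for float64).
    """
    rng = np.random.default_rng(seed)
    real = np.dtype(dtype)
    cplx = np.result_type(real, np.complex64)
    x, y, z = (rng.uniform(-np.pi, np.pi, n_points).astype(real) for _ in range(3))
    c = (rng.standard_normal(n_points) + 1j * rng.standard_normal(n_points)).astype(cplx)
    return x, y, z, c


if __name__ == "__main__":
    for dt in (np.float32, np.float64):
        x, y, z, c = make_inputs(100_000, dt)
        print(x.dtype, c.dtype)
```

The arrays could then be passed to a call such as finufft.nufft3d1(x, y, z, c, (N1, N2, N3)); that call is shown for illustration only and is not taken from the original benchmark.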
Environment