removed MAX_USEFUL_NTHREADS; rebuilt docs

flatironinstitute · Dec 5, 2020 · b998d49
1 parent 9ee5544
commit b998d49
Showing 6 changed files with 8 additions and 8 deletions.
diff --git a/CHANGELOG b/CHANGELOG
@@ -8,6 +8,8 @@ V 2.0.2 (12/5/20)
   critical to atomic add_wrapped_subgrid() operations; thanks Rob Blackwell.
 * Increased heuristic t1 spreader max_subproblem_size, faster in 2D, 3D, and
   allowed this and the above atomic threshold to be controlled as nufft_opts.
+* Removed MAX_USEFUL_NTHREADS from defs.h and all code, for simplicity, since
+  large thread number now scales better.
 * multithreaded one-mode accuracy test in C++ tests, t1 & t3, for faster tests.
 
 V 2.0.1 (10/6/20)

diff --git a/docs/opts.rst b/docs/opts.rst
@@ -128,7 +128,7 @@ Diagnostic options
 Algorithm performance options
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-**nthreads**: Number of threads to use. This sets the number of threads FINUFFT will use in FFTW, bin-sorting, and spreading/interpolation steps. This number of threads also controls the batch size for vectorized transforms (ie ``ntr>1`` :ref:`here <c>`). Setting ``nthreads=0`` uses all threads available (up to an internal maximum that has been chosen based on performance; see ``MAX_USEFUL_NTHREADS`` in ``include/defs.h``). For repeated small problems it can be advantageous to use a small number, such as 1.
+**nthreads**: Number of threads to use. This sets the number of threads FINUFFT will use in FFTW, bin-sorting, and spreading/interpolation steps. This number of threads also controls the batch size for vectorized transforms (ie ``ntr>1`` :ref:`here <c>`). Setting ``nthreads=0`` uses all threads available. For repeated small problems it can be advantageous to use a small number, such as 1.
 
 **fftw**: FFTW planner flags. This number is simply passed to FFTW's planner;
 the flags are documented `here <http://www.fftw.org/fftw3_doc/Planner-Flags.html#Planner-Flags>`_.

diff --git a/docs/trouble.rst b/docs/trouble.rst
@@ -40,7 +40,7 @@ If FINUFFT is slow (eg, less than $10^6$ nonuniform points per second), here is
 
 - Try printing debug output to see step-by-step progress by FINUFFT. Do this by setting ``opts.debug`` to 1 or 2 then looking at the timing information.
 
-- Try reducing the number of threads either externally or via ``opts.nthreads``, perhaps down to 1 thread, to make sure you are not having collisions between threads, or slowdown due to thread overheads. Hyperthreading (more threads than physical cores) rarely helps much. Thread collisions are possible if large problems are run with a large number of (say more than 64) threads. We added the constant ``MAX_USEFUL_NTHREADS`` in ``include/defs.h`` to address this in the vectorized (stacked) inputs case. Another case causing slowness is very many repetitions of small problems; see ``test/manysmallprobs`` which exceeds $10^7$ points/sec with one thread via the guru interface, but can get ridiculously slower with many threads; see https://github.com/flatironinstitute/finufft/issues/86
+- Try reducing the number of threads, either those available via OpenMP, or via ``opts.nthreads``, perhaps down to 1 thread, to make sure you are not having collisions between threads, or slowdown due to thread overheads. Hyperthreading (more threads than physical cores) rarely helps much. Thread collisions are possible if large problems are run with a large number of (say more than 64) threads. Another case causing slowness is very many repetitions of small problems; see ``test/manysmallprobs`` which exceeds $10^7$ points/sec with one thread via the guru interface, but can get ridiculously slower with many threads; see https://github.com/flatironinstitute/finufft/issues/86
 
 - Try setting a crude tolerance, eg ``tol=1e-3``. How many digits do you actually need? This has a big effect in higher dimensions, since the number of flops scales like $(\log 1/\epsilon)^d$, but not quite as big an effect as this scaling would suggest, because in higher dimensions the flops/RAM ratio is higher.
 

diff --git a/finufft-manual.pdf b/finufft-manual.pdf
diff --git a/include/defs.h b/include/defs.h
@@ -8,6 +8,7 @@
 // use types intrinsic to finufft interface (FLT, CPX, BIGINT, etc)
 #include <dataTypes.h>
 
+
 // ------------- Library-wide algorithm parameter settings ----------------
 
 // Library version (is a string)
@@ -28,9 +29,6 @@
 // Increase this if you need >1TB RAM... (used only in common.cpp)
 #define MAX_NF    (BIGINT)1e11
 
-// Max number of useful threads for setting default blksize (depends on NUMA)
-#define MAX_USEFUL_NTHREADS 24
-
 
 
 // ---------- Global error/warning output codes for the library ---------------

diff --git a/src/finufft.cpp b/src/finufft.cpp
@@ -574,9 +574,9 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT* n_modes, int iflag,
   p->fftSign = (iflag>=0) ? 1 : -1;         // clean up flag input
 
   // choose overall # threads...
-  int nthr = min(MY_OMP_GET_MAX_THREADS(), MAX_USEFUL_NTHREADS);   // limit it
+  int nthr = MY_OMP_GET_MAX_THREADS();      // use as many as OMP gives us
   if (p->opts.nthreads>0)
-    nthr = p->opts.nthreads;                // user override (no limit)
+    nthr = p->opts.nthreads;                // user override (no limit or check)
   p->opts.nthreads = nthr;                  // store actual # thr planned for
 
   // choose batchSize for types 1,2 or 3... (uses int ceil(b/a)=1+(b-1)/a trick)
@@ -627,7 +627,7 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT* n_modes, int iflag,
   if (type==1 || type==2) {
 
     int nthr_fft = nthr;    // give FFTW same as overall
-    // should limit max # threads here too? or set equal to batchsize?
+    // *** should set equal to batchsize?
     // *** put in logic for setting FFTW # thr based on o.spread_thread?
     FFTW_INIT();           // only does anything when OMP=ON for >1 threads
     FFTW_PLAN_TH(nthr_fft); // " (not batchSize since can be 1 but want mul-thr)
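
Note on the src/finufft.cpp hunks: the thread-count choice after this commit, and the integer-ceiling trick named in the batchSize comment, can be sketched in isolation as below. This is a minimal standalone illustration, not the library's code. The helper names `available_threads`, `choose_nthreads`, and `ceil_div` are hypothetical, and the `_OPENMP` fallback to 1 thread is an assumption about how `MY_OMP_GET_MAX_THREADS()` behaves in a non-OpenMP build.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not a FINUFFT function): number of threads the
// runtime offers. Assumes a build without OpenMP should report 1.
inline int available_threads() {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;  // single-threaded build
#endif
}

// Hypothetical helper mirroring the post-commit logic in the hunk above:
// a positive request is used unchanged (no limit or check); otherwise
// all available threads are used, with no MAX_USEFUL_NTHREADS cap.
inline int choose_nthreads(int requested, int available) {
  return (requested > 0) ? requested : available;
}

// Integer ceiling of b/a for positive ints, written as 1+(b-1)/a,
// which is the trick named in the batchSize comment.
inline int ceil_div(int b, int a) {
  return 1 + (b - 1) / a;
}
```

For example, `choose_nthreads(0, available_threads())` corresponds to leaving `opts.nthreads=0`, while `choose_nthreads(128, 8)` returns 128 even though only 8 threads are available, which is the "no limit or check" behaviour noted in the new comment. How the ceiling is applied to compute batchSize is not shown in this diff, so `ceil_div` is given only as the generic identity.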