diff --git a/CHANGELOG b/CHANGELOG
index 39d66efeb..d515ce790 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -8,6 +8,8 @@ V 2.0.2 (12/5/20)
   critical to atomic add_wrapped_subgrid() operations; thanks Rob Blackwell.
 * Increased heuristic t1 spreader max_subproblem_size, faster in 2D, 3D, and
   allowed this and the above atomic threshold to be controlled as nufft_opts.
+* Removed MAX_USEFUL_NTHREADS from defs.h and all code, for simplicity, since
+  large thread number now scales better.
 * multithreaded one-mode accuracy test in C++ tests, t1 & t3, for faster tests.
 
 V 2.0.1 (10/6/20)
diff --git a/docs/opts.rst b/docs/opts.rst
index a737928f5..ce0d2c9c0 100644
--- a/docs/opts.rst
+++ b/docs/opts.rst
@@ -128,7 +128,7 @@ Diagnostic options
 
 Algorithm performance options
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-**nthreads**: Number of threads to use. This sets the number of threads FINUFFT will use in FFTW, bin-sorting, and spreading/interpolation steps. This number of threads also controls the batch size for vectorized transforms (ie ``ntr>1`` :ref:`here `). Setting ``nthreads=0`` uses all threads available (up to an internal maximum that has been chosen based on performance; see ``MAX_USEFUL_NTHREADS`` in ``include/defs.h``). For repeated small problems it can be advantageous to use a small number, such as 1.
+**nthreads**: Number of threads to use. This sets the number of threads FINUFFT will use in FFTW, bin-sorting, and spreading/interpolation steps. This number of threads also controls the batch size for vectorized transforms (ie ``ntr>1`` :ref:`here `). Setting ``nthreads=0`` uses all threads available. For repeated small problems it can be advantageous to use a small number, such as 1.
 
 **fftw**: FFTW planner flags. This number is simply passed to FFTW's planner; the flags are documented `here `_.
diff --git a/docs/trouble.rst b/docs/trouble.rst
index 7bfb52141..61daa5859 100644
--- a/docs/trouble.rst
+++ b/docs/trouble.rst
@@ -40,7 +40,7 @@ If FINUFFT is slow (eg, less than $10^6$ nonuniform points per second), here is
 - Try printing debug output to see step-by-step progress by FINUFFT. Do this by setting ``opts.debug`` to 1 or 2 then looking at the timing information.
 
-- Try reducing the number of threads either externally or via ``opts.nthreads``, perhaps down to 1 thread, to make sure you are not having collisions between threads, or slowdown due to thread overheads. Hyperthreading (more threads than physical cores) rarely helps much. Thread collisions are possible if large problems are run with a large number of (say more than 64) threads. We added the constant ``MAX_USEFUL_NTHREADS`` in ``include/defs.h`` to address this in the vectorized (stacked) inputs case. Another ase causing slowness is very many repetitions of small problems; see ``test/manysmallprobs`` which exceeds $10^7$ points/sec with one thread via the guru interface, but can get ridiculously slower with many threads; see https://github.com/flatironinstitute/finufft/issues/86
+- Try reducing the number of threads, either those available via OpenMP, or via ``opts.nthreads``, perhaps down to 1 thread, to make sure you are not having collisions between threads, or slowdown due to thread overheads. Hyperthreading (more threads than physical cores) rarely helps much. Thread collisions are possible if large problems are run with a large number of (say more than 64) threads. Another case causing slowness is very many repetitions of small problems; see ``test/manysmallprobs`` which exceeds $10^7$ points/sec with one thread via the guru interface, but can get ridiculously slower with many threads; see https://github.com/flatironinstitute/finufft/issues/86
 
 - Try setting a crude tolerance, eg ``tol=1e-3``. How many digits do you actually need? This has a big effect in higher dimensions, since the number of flops scales like $(\log 1/\epsilon)^d$, but not quite as big an effect as this scaling would suggest, because in higher dimensions the flops/RAM ratio is higher.
diff --git a/finufft-manual.pdf b/finufft-manual.pdf
index c0601a77a..80dffd4f6 100644
Binary files a/finufft-manual.pdf and b/finufft-manual.pdf differ
diff --git a/include/defs.h b/include/defs.h
index 96d19154c..31fa64ac1 100644
--- a/include/defs.h
+++ b/include/defs.h
@@ -8,6 +8,7 @@
 
 // use types intrinsic to finufft interface (FLT, CPX, BIGINT, etc)
 #include 
+
 // ------------- Library-wide algorithm parameter settings ----------------
 
 // Library version (is a string)
@@ -28,9 +29,6 @@
 
 // Increase this if you need >1TB RAM... (used only in common.cpp)
 #define MAX_NF (BIGINT)1e11
-// Max number of useful threads for setting default blksize (depends on NUMA)
-#define MAX_USEFUL_NTHREADS 24
-
 
 
 // ---------- Global error/warning output codes for the library ---------------
diff --git a/src/finufft.cpp b/src/finufft.cpp
index c95928acf..657d3f146 100644
--- a/src/finufft.cpp
+++ b/src/finufft.cpp
@@ -574,9 +574,9 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT* n_modes, int iflag,
   p->fftSign = (iflag>=0) ? 1 : -1;         // clean up flag input
 
   // choose overall # threads...
-  int nthr = min(MY_OMP_GET_MAX_THREADS(), MAX_USEFUL_NTHREADS);  // limit it
+  int nthr = MY_OMP_GET_MAX_THREADS();      // use as many as OMP gives us
   if (p->opts.nthreads>0)
-    nthr = p->opts.nthreads;                // user override (no limit)
+    nthr = p->opts.nthreads;                // user override (no limit or check)
   p->opts.nthreads = nthr;                  // store actual # thr planned for
 
   // choose batchSize for types 1,2 or 3... (uses int ceil(b/a)=1+(b-1)/a trick)
@@ -627,7 +627,7 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT* n_modes, int iflag,
   if (type==1 || type==2) {
 
     int nthr_fft = nthr;      // give FFTW same as overall
-    // should limit max # threads here too? or set equal to batchsize?
+    // *** should set equal to batchsize?
     // *** put in logic for setting FFTW # thr based on o.spread_thread?
     FFTW_INIT();              // only does anything when OMP=ON for >1 threads
     FFTW_PLAN_TH(nthr_fft);   // " (not batchSize since can be 1 but want mul-thr)
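For reference, here is a minimal sketch of how a user now controls threading from the caller's side, following the updated advice in ``docs/opts.rst`` and ``docs/trouble.rst``. It assumes the v2.0.x double-precision simple interface (``finufft1d1``, ``nufft_opts``, ``finufft_default_opts``) declared in ``finufft.h``; the problem sizes and point distribution are illustrative only and not taken from the library's tests.

```c++
// Sketch: choosing FINUFFT thread count via opts.nthreads (illustrative sizes).
#include "finufft.h"
#include <vector>
#include <complex>
#include <cmath>
using namespace std;

int main() {
  const double PI = 3.14159265358979323846;
  int64_t M = 100000, N = 1000;            // # nonuniform pts, # output modes
  vector<double> x(M);                     // nonuniform points in [-pi,pi)
  vector<complex<double>> c(M), F(N);      // strengths, output coefficients
  for (int64_t j = 0; j < M; ++j) {
    x[j] = PI * cos((double)j);            // some arbitrary point distribution
    c[j] = complex<double>(sin((double)j), cos(2.0 * (double)j));
  }

  nufft_opts opts;                         // options struct, v2.0.x naming
  finufft_default_opts(&opts);
  opts.nthreads = 0;                       // 0 = all threads OpenMP reports
                                           //     (no internal cap now that
                                           //      MAX_USEFUL_NTHREADS is gone)
  // opts.nthreads = 1;                    // often faster for many small problems
  opts.debug = 1;                          // per-step timings, as in trouble.rst

  // 1D type-1 transform at tolerance 1e-6
  int ier = finufft1d1(M, x.data(), c.data(), +1, 1e-6, N, F.data(), &opts);
  return ier;
}
```

With the internal cap removed, callers who previously relied on the implicit clamp to 24 threads on large machines should set ``opts.nthreads`` (or limit threads externally, e.g. via ``OMP_NUM_THREADS``) explicitly if they see slowdowns at high thread counts.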