removed MAX_USEFUL_NTHREADS; rebuilt docs
ahbarnett committed Dec 5, 2020
1 parent 9ee5544 commit b998d49
Showing 6 changed files with 8 additions and 8 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG
@@ -8,6 +8,8 @@ V 2.0.2 (12/5/20)
critical to atomic add_wrapped_subgrid() operations; thanks Rob Blackwell.
* Increased heuristic t1 spreader max_subproblem_size, faster in 2D, 3D, and
allowed this and the above atomic threshold to be controlled as nufft_opts.
* Removed MAX_USEFUL_NTHREADS from defs.h and all code, for simplicity, since
large thread number now scales better.
* multithreaded one-mode accuracy test in C++ tests, t1 & t3, for faster tests.

V 2.0.1 (10/6/20)
2 changes: 1 addition & 1 deletion docs/opts.rst
@@ -128,7 +128,7 @@ Diagnostic options
Algorithm performance options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**nthreads**: Number of threads to use. This sets the number of threads FINUFFT will use in FFTW, bin-sorting, and spreading/interpolation steps. This number of threads also controls the batch size for vectorized transforms (ie ``ntr>1`` :ref:`here <c>`). Setting ``nthreads=0`` uses all threads available (up to an internal maximum that has been chosen based on performance; see ``MAX_USEFUL_NTHREADS`` in ``include/defs.h``). For repeated small problems it can be advantageous to use a small number, such as 1.
**nthreads**: Number of threads to use. This sets the number of threads FINUFFT will use in FFTW, bin-sorting, and spreading/interpolation steps. This number of threads also controls the batch size for vectorized transforms (ie ``ntr>1`` :ref:`here <c>`). Setting ``nthreads=0`` uses all threads available. For repeated small problems it can be advantageous to use a small number, such as 1.

**fftw**: FFTW planner flags. This number is simply passed to FFTW's planner;
the flags are documented `here <http://www.fftw.org/fftw3_doc/Planner-Flags.html#Planner-Flags>`_.
2 changes: 1 addition & 1 deletion docs/trouble.rst
@@ -40,7 +40,7 @@ If FINUFFT is slow (eg, less than $10^6$ nonuniform points per second), here is

- Try printing debug output to see step-by-step progress by FINUFFT. Do this by setting ``opts.debug`` to 1 or 2 then looking at the timing information.

- Try reducing the number of threads either externally or via ``opts.nthreads``, perhaps down to 1 thread, to make sure you are not having collisions between threads, or slowdown due to thread overheads. Hyperthreading (more threads than physical cores) rarely helps much. Thread collisions are possible if large problems are run with a large number of (say more than 64) threads. We added the constant ``MAX_USEFUL_NTHREADS`` in ``include/defs.h`` to address this in the vectorized (stacked) inputs case. Another ase causing slowness is very many repetitions of small problems; see ``test/manysmallprobs`` which exceeds $10^7$ points/sec with one thread via the guru interface, but can get ridiculously slower with many threads; see https://github.com/flatironinstitute/finufft/issues/86
- Try reducing the number of threads, either those available via OpenMP, or via ``opts.nthreads``, perhaps down to 1 thread, to make sure you are not having collisions between threads, or slowdown due to thread overheads. Hyperthreading (more threads than physical cores) rarely helps much. Thread collisions are possible if large problems are run with a large number of (say more than 64) threads. Another ase causing slowness is very many repetitions of small problems; see ``test/manysmallprobs`` which exceeds $10^7$ points/sec with one thread via the guru interface, but can get ridiculously slower with many threads; see https://github.com/flatironinstitute/finufft/issues/86

- Try setting a crude tolerance, eg ``tol=1e-3``. How many digits do you actually need? This has a big effect in higher dimensions, since the number of flops scales like $(\log 1/\epsilon)^d$, but not quite as big an effect as this scaling would suggest, because in higher dimensions the flops/RAM ratio is higher.

Binary file modified finufft-manual.pdf
Binary file not shown.
4 changes: 1 addition & 3 deletions include/defs.h
@@ -8,6 +8,7 @@
// use types intrinsic to finufft interface (FLT, CPX, BIGINT, etc)
#include <dataTypes.h>


// ------------- Library-wide algorithm parameter settings ----------------

// Library version (is a string)
@@ -28,9 +29,6 @@
// Increase this if you need >1TB RAM... (used only in common.cpp)
#define MAX_NF (BIGINT)1e11

// Max number of useful threads for setting default blksize (depends on NUMA)
#define MAX_USEFUL_NTHREADS 24



// ---------- Global error/warning output codes for the library ---------------
6 changes: 3 additions & 3 deletions src/finufft.cpp
@@ -574,9 +574,9 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT* n_modes, int iflag,
p->fftSign = (iflag>=0) ? 1 : -1; // clean up flag input

// choose overall # threads...
int nthr = min(MY_OMP_GET_MAX_THREADS(), MAX_USEFUL_NTHREADS); // limit it
int nthr = MY_OMP_GET_MAX_THREADS(); // use as many as OMP gives us
if (p->opts.nthreads>0)
nthr = p->opts.nthreads; // user override (no limit)
nthr = p->opts.nthreads; // user override (no limit or check)
p->opts.nthreads = nthr; // store actual # thr planned for

// choose batchSize for types 1,2 or 3... (uses int ceil(b/a)=1+(b-1)/a trick)
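
The effect of the thread-count change in the hunk above can be sketched in isolation. This is a minimal illustration, not FINUFFT code: `resolve_nthreads_old` and `resolve_nthreads_new` are hypothetical helper names, `omp_max` stands in for the value returned by `MY_OMP_GET_MAX_THREADS()`, and `requested` stands in for `opts.nthreads`. The cap of 24 is the value of the removed `MAX_USEFUL_NTHREADS` define.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins, not part of the FINUFFT API:
//   omp_max   ~ value of MY_OMP_GET_MAX_THREADS()
//   requested ~ opts.nthreads (0 means "let the library choose")

// Before this commit: the default was capped at MAX_USEFUL_NTHREADS (24).
inline int resolve_nthreads_old(int omp_max, int requested) {
  assert(omp_max >= 1);
  const int kMaxUsefulNthreads = 24;            // define removed from defs.h
  int nthr = std::min(omp_max, kMaxUsefulNthreads);
  if (requested > 0) nthr = requested;          // user override (no limit)
  return nthr;
}

// After this commit: the default is whatever OpenMP reports.
inline int resolve_nthreads_new(int omp_max, int requested) {
  assert(omp_max >= 1);
  int nthr = omp_max;                           // use as many as OMP gives us
  if (requested > 0) nthr = requested;          // user override (no limit or check)
  return nthr;
}
```

Under these assumptions, a machine reporting 64 OpenMP threads with `nthreads=0` would previously have planned for 24 threads and now plans for 64. An explicit positive `nthreads` is passed through unchanged in both versions.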
@@ -627,7 +627,7 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT* n_modes, int iflag,
if (type==1 || type==2) {

int nthr_fft = nthr; // give FFTW same as overall
// should limit max # threads here too? or set equal to batchsize?
// *** should set equal to batchsize?
// *** put in logic for setting FFTW # thr based on o.spread_thread?
FFTW_INIT(); // only does anything when OMP=ON for >1 threads
FFTW_PLAN_TH(nthr_fft); // " (not batchSize since can be 1 but want mul-thr)
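
The batchSize comment in the first `src/finufft.cpp` hunk mentions the integer identity ceil(b/a) = 1+(b-1)/a. A small sketch of that identity for positive integers follows. `ceil_div` is a hypothetical helper name, and how FINUFFT applies the result when choosing the batch size is not shown in this diff.

```cpp
#include <cassert>

// Ceiling of b/a for positive integers, using only integer arithmetic.
// Written in the 1+(b-1)/a form quoted in the source comment.
inline int ceil_div(int b, int a) {
  assert(a > 0 && b > 0);   // the identity as written assumes positive inputs
  return 1 + (b - 1) / a;   // integer division truncates toward zero
}
```

For example, `ceil_div(10, 4)` gives 3 and `ceil_div(8, 4)` gives 2, matching the real-valued ceilings of 2.5 and 2.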