Merge remote-tracking branch 'upstream/master' into user_fftw_lock

blackwer · Sep 17, 2024 · 3de4ed5
2 parents 8594b65 + 7438845
commit 3de4ed5
Showing 118 changed files with 2,096 additions and 881 deletions.
diff --git a/CHANGELOG b/CHANGELOG
@@ -1,8 +1,19 @@
 List of features / changes made / release notes, in reverse chronological order.
 If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).
 
-V 2.3.1
-
+Master (9/10/24)
+
+* reduced roundoff error in a[n] phase calc in CPU onedim_fseries_kernel().
+   #534 (Barnett).
+* GPU code type 1,2 also reduced round-off error in phases, to match CPU code;
+  rationalized onedim_{fseries,nuft}_* GPU codes to match CPU (Barbone, Barnett)
+* Added type 3 in 1D, 2D, and 3D, in the GPU library cufinufft. PR #517, Barbone
+  - Removed the CPU fseries computation (used for benchmark, no longer needed)
+  - Added complex arithmetic support for cuda_complex type
+  - Added tests for type 3 in 1D, 2D, and 3D and cuda_complex arithmetic
+  - Minor fixes on the GPU code:
+    a) removed memory leaks in case of errors
+    b) renamed maxbatchsize to batchsize
 * Add options for user-provided FFTW locker (PR548, Blackwell). These options can be
 used to prevent crashes when a user is creating/destroying FFTW plans and
 FINUFFT plans in threads simultaneously.
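
The locker idea in the entry above can be sketched in isolation. This is a minimal illustration, not the FINUFFT API: the struct `demo_lock_opts`, its fields, and the functions `demo_lock`, `demo_unlock`, `demo_make_plan`, and `demo_run_threads` are all hypothetical names chosen for this example, and the real option names are not shown in this diff. It assumes the library calls a user-supplied lock/unlock pair around its non-thread-safe planning step.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for locker-related option fields (names are assumptions).
struct demo_lock_opts {
  void (*lock_fun)(void *)   = nullptr; // called before the planning step
  void (*unlock_fun)(void *) = nullptr; // called after the planning step
  void *lock_data            = nullptr; // opaque pointer handed to both callbacks
};

// User-side callbacks: the opaque pointer is a std::mutex owned by the caller.
inline void demo_lock(void *data) {
  assert(data != nullptr);
  static_cast<std::mutex *>(data)->lock();
}
inline void demo_unlock(void *data) {
  assert(data != nullptr);
  static_cast<std::mutex *>(data)->unlock();
}

// Stand-in for a planner that touches shared, non-thread-safe state.
inline void demo_make_plan(const demo_lock_opts &opts, int &shared_counter) {
  if (opts.lock_fun) opts.lock_fun(opts.lock_data);
  ++shared_counter; // plain int: only safe because the user lock is held
  if (opts.unlock_fun) opts.unlock_fun(opts.lock_data);
}

// Create "plans" from several threads, all sharing one user-owned mutex.
inline int demo_run_threads(int nthreads, int plans_per_thread) {
  std::mutex user_mutex;
  demo_lock_opts opts;
  opts.lock_fun   = demo_lock;
  opts.unlock_fun = demo_unlock;
  opts.lock_data  = &user_mutex;

  int counter = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < nthreads; ++t)
    workers.emplace_back([&] {
      for (int i = 0; i < plans_per_thread; ++i) demo_make_plan(opts, counter);
    });
  for (auto &w : workers) w.join();
  return counter;
}
```

The point of passing the mutex through an opaque pointer is that the same mutex can also guard the user's own FFTW plan creation and destruction elsewhere in the program, so both code paths serialize on one lock.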

diff --git a/docs/devnotes.rst b/docs/devnotes.rst
@@ -54,6 +54,8 @@ Developer notes
 
 * CMake compiling on linux at Flatiron Institute (Rusty cluster): We have had a report that if you want to use LLVM, you need to ``module load llvm/16.0.3`` otherwise the default ``llvm/14.0.6`` does not find ``OpenMP_CXX``.
 
+* Note to developers using nvcc: building with debug symbols causes a stack overflow that is detected neither at compile time nor at runtime. It only shows up when ns>=10 and dim=3; for ns<10 or dim<3, one can use -G and debug the code with cuda-gdb. To avoid it, do not build with debug symbols; ``--generate-line-info`` might work instead (not tested). As a side note, compute-sanitizer does not detect the issue.
+
 * Testing cufinufft (for FI, mostly):
 
 .. code-block:: sh

diff --git a/docs/opts.rst b/docs/opts.rst
@@ -22,10 +22,10 @@ Recall how to do this from C++:
 .. code-block:: C++
 
   // (... set up M,x,c,tol,N, and allocate F here...)
-  finufft_opts* opts;
-  finufft_default_opts(opts);
-  opts->debug = 1;
-  int ier = finufft1d1(M,x,c,+1,tol,N,F,opts);
+  finufft_opts opts;
+  finufft_default_opts(&opts);
+  opts.debug = 1;
+  int ier = finufft1d1(M,x,c,+1,tol,N,F,&opts);
 
 This setting produces more timing output to ``stdout``.
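
The reason for the change in this hunk is that the old snippet declared a pointer without any storage behind it, so filling in defaults wrote through an uninitialized pointer. A minimal sketch of the corrected pattern follows, using a hypothetical `demo_opts` struct and hypothetical `demo_default_opts`, `demo_transform`, and `demo_call_with_debug` functions in place of the real FINUFFT types and calls:

```cpp
#include <cassert>

// Hypothetical options struct standing in for finufft_opts.
struct demo_opts {
  int debug;
  int nthreads;
};

// Fills every field with a default; requires valid storage to write into.
inline void demo_default_opts(demo_opts *o) {
  assert(o != nullptr);
  o->debug    = 0;
  o->nthreads = 0;
}

// Stand-in for a transform call that only reads the options.
inline int demo_transform(const demo_opts *o) { return o ? o->debug : -1; }

inline int demo_call_with_debug() {
  demo_opts opts;           // the struct itself lives on the stack
  demo_default_opts(&opts); // pass its address so the defaults land in real storage
  opts.debug = 1;           // override one field after the defaults are set
  return demo_transform(&opts);
}
```

Setting the defaults first and overriding individual fields afterwards keeps the caller correct even if new fields are added to the struct later.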
 

diff --git a/docs/performance.rst b/docs/performance.rst
@@ -35,8 +35,10 @@ To generate the below results, run ``bench.py`` from the ``perftest`` directory.
 This requires the Python packages ``numpy``, ``pandas`` and ``matplotlib``.
 The script assumes a bash-like shell, and may not work on Windows.
 
+The last set of benchmarks compares FFTW and DUCC on an FFT-bound problem. DUCC is expected to outperform FFTW in 2D and 3D problems; in 1D, however, FFTW is expected to be faster.
+
 .. warning::
-    The script ``bench.py`` clones FINUFFT into the current directory, then switches between various branches and builds them. Thus DO NOT RUN the script from inside your FINUFFT git directory as it will mess up the git directory and fail! Instead move the script into a clean directory and run there.
+    The script ``bench.py`` clones FINUFFT into the current directory, then switches between various branches and builds them. Thus DO NOT RUN the script from inside your FINUFFT git directory as it will mess up the git directory and fail! Instead move the script into a clean directory outside the git checkout and run there.
 
 1D Transforms
 ---------------------------------------------
@@ -86,7 +88,7 @@ Type 3
 .. image:: pics/320x320x1-type-3-upsamp2.00-precf-thread1.png
 .. image:: pics/320x320x1-type-3-upsamp2.00-precd-thread1.png
 
-2D transforms Multi-Threaded
+2D transforms Multi-Threaded (float32)
 ---------------------------------------------
 
 Type 1
@@ -104,20 +106,34 @@ Type 3
 .. image:: pics/320x320x1-type-3-upsamp1.25-precf-thread32.png
 .. image:: pics/320x320x1-type-3-upsamp2.00-precf-thread32.png
 
-3D transforms Multi-Threaded (float32)
----------------------------------------------
+3D transforms Multi-Threaded (float64)
+--------------------------------------
+
+Type 1
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. image:: pics/192x192x128-type-1-upsamp1.25-precd-thread32.png
+.. image:: pics/192x192x128-type-1-upsamp2.00-precd-thread32.png
+
+Type 2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. image:: pics/192x192x128-type-2-upsamp1.25-precd-thread32.png
+.. image:: pics/192x192x128-type-2-upsamp2.00-precd-thread32.png
+
+Type 3
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. image:: pics/192x192x128-type-3-upsamp1.25-precd-thread32.png
+
+3D FFT benchmarks
+-----------------
 
 Type 1
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. image:: pics/192x192x128-type-1-upsamp1.25-precf-thread32.png
-.. image:: pics/192x192x128-type-1-upsamp2.00-precf-thread32.png
+.. image:: pics/250x250x250-type-1-upsamp2.00-precd-thread1.png
 
 Type 2
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. image:: pics/192x192x128-type-2-upsamp1.25-precf-thread32.png
-.. image:: pics/192x192x128-type-2-upsamp2.00-precf-thread32.png
+.. image:: pics/250x250x250-type-2-upsamp2.00-precd-thread1.png
 
 Type 3
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. image:: pics/192x192x128-type-3-upsamp1.25-precf-thread32.png
-.. image:: pics/192x192x128-type-3-upsamp2.00-precf-thread32.png
+.. image:: pics/250x250x250-type-3-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/10000x1x1-type-1-upsamp1.25-precd-thread1.png b/docs/pics/10000x1x1-type-1-upsamp1.25-precd-thread1.png
diff --git a/docs/pics/10000x1x1-type-1-upsamp1.25-precf-thread1.png b/docs/pics/10000x1x1-type-1-upsamp1.25-precf-thread1.png
diff --git a/docs/pics/10000x1x1-type-1-upsamp2.00-precd-thread1.png b/docs/pics/10000x1x1-type-1-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/10000x1x1-type-1-upsamp2.00-precf-thread1.png b/docs/pics/10000x1x1-type-1-upsamp2.00-precf-thread1.png
diff --git a/docs/pics/10000x1x1-type-2-upsamp1.25-precd-thread1.png b/docs/pics/10000x1x1-type-2-upsamp1.25-precd-thread1.png
diff --git a/docs/pics/10000x1x1-type-2-upsamp1.25-precf-thread1.png b/docs/pics/10000x1x1-type-2-upsamp1.25-precf-thread1.png
diff --git a/docs/pics/10000x1x1-type-2-upsamp2.00-precd-thread1.png b/docs/pics/10000x1x1-type-2-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/10000x1x1-type-2-upsamp2.00-precf-thread1.png b/docs/pics/10000x1x1-type-2-upsamp2.00-precf-thread1.png
diff --git a/docs/pics/10000x1x1-type-3-upsamp1.25-precd-thread1.png b/docs/pics/10000x1x1-type-3-upsamp1.25-precd-thread1.png
diff --git a/docs/pics/10000x1x1-type-3-upsamp1.25-precf-thread1.png b/docs/pics/10000x1x1-type-3-upsamp1.25-precf-thread1.png
diff --git a/docs/pics/10000x1x1-type-3-upsamp2.00-precd-thread1.png b/docs/pics/10000x1x1-type-3-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/10000x1x1-type-3-upsamp2.00-precf-thread1.png b/docs/pics/10000x1x1-type-3-upsamp2.00-precf-thread1.png
diff --git a/docs/pics/192x192x128-type-1-upsamp1.25-precd-thread16.png b/docs/pics/192x192x128-type-1-upsamp1.25-precd-thread16.png
diff --git a/docs/pics/192x192x128-type-1-upsamp1.25-precd-thread32.png b/docs/pics/192x192x128-type-1-upsamp1.25-precd-thread32.png
diff --git a/docs/pics/192x192x128-type-1-upsamp1.25-precf-thread32.png b/docs/pics/192x192x128-type-1-upsamp1.25-precf-thread32.png
diff --git a/docs/pics/192x192x128-type-1-upsamp2.00-precd-thread16.png b/docs/pics/192x192x128-type-1-upsamp2.00-precd-thread16.png
diff --git a/docs/pics/192x192x128-type-1-upsamp2.00-precd-thread32.png b/docs/pics/192x192x128-type-1-upsamp2.00-precd-thread32.png
diff --git a/docs/pics/192x192x128-type-1-upsamp2.00-precf-thread32.png b/docs/pics/192x192x128-type-1-upsamp2.00-precf-thread32.png
diff --git a/docs/pics/192x192x128-type-2-upsamp1.25-precd-thread16.png b/docs/pics/192x192x128-type-2-upsamp1.25-precd-thread16.png
diff --git a/docs/pics/192x192x128-type-2-upsamp1.25-precd-thread32.png b/docs/pics/192x192x128-type-2-upsamp1.25-precd-thread32.png
diff --git a/docs/pics/192x192x128-type-2-upsamp1.25-precf-thread32.png b/docs/pics/192x192x128-type-2-upsamp1.25-precf-thread32.png
diff --git a/docs/pics/192x192x128-type-2-upsamp2.00-precd-thread16.png b/docs/pics/192x192x128-type-2-upsamp2.00-precd-thread16.png
diff --git a/docs/pics/192x192x128-type-2-upsamp2.00-precd-thread32.png b/docs/pics/192x192x128-type-2-upsamp2.00-precd-thread32.png
diff --git a/docs/pics/192x192x128-type-2-upsamp2.00-precf-thread32.png b/docs/pics/192x192x128-type-2-upsamp2.00-precf-thread32.png
diff --git a/docs/pics/192x192x128-type-3-upsamp1.25-precd-thread16.png b/docs/pics/192x192x128-type-3-upsamp1.25-precd-thread16.png
diff --git a/docs/pics/192x192x128-type-3-upsamp1.25-precd-thread32.png b/docs/pics/192x192x128-type-3-upsamp1.25-precd-thread32.png
diff --git a/docs/pics/192x192x128-type-3-upsamp1.25-precf-thread32.png b/docs/pics/192x192x128-type-3-upsamp1.25-precf-thread32.png
diff --git a/docs/pics/192x192x128-type-3-upsamp2.00-precd-thread16.png b/docs/pics/192x192x128-type-3-upsamp2.00-precd-thread16.png
diff --git a/docs/pics/192x192x128-type-3-upsamp2.00-precd-thread32.png b/docs/pics/192x192x128-type-3-upsamp2.00-precd-thread32.png
diff --git a/docs/pics/192x192x128-type-3-upsamp2.00-precf-thread32.png b/docs/pics/192x192x128-type-3-upsamp2.00-precf-thread32.png
diff --git a/docs/pics/250x250x250-type-1-upsamp2.00-precd-thread1.png b/docs/pics/250x250x250-type-1-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/250x250x250-type-2-upsamp2.00-precd-thread1.png b/docs/pics/250x250x250-type-2-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/250x250x250-type-3-upsamp2.00-precd-thread1.png b/docs/pics/250x250x250-type-3-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/320x320x1-type-1-upsamp1.25-precd-thread1.png b/docs/pics/320x320x1-type-1-upsamp1.25-precd-thread1.png
diff --git a/docs/pics/320x320x1-type-1-upsamp1.25-precf-thread1.png b/docs/pics/320x320x1-type-1-upsamp1.25-precf-thread1.png
diff --git a/docs/pics/320x320x1-type-1-upsamp1.25-precf-thread16.png b/docs/pics/320x320x1-type-1-upsamp1.25-precf-thread16.png
diff --git a/docs/pics/320x320x1-type-1-upsamp1.25-precf-thread32.png b/docs/pics/320x320x1-type-1-upsamp1.25-precf-thread32.png
diff --git a/docs/pics/320x320x1-type-1-upsamp2.00-precd-thread1.png b/docs/pics/320x320x1-type-1-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/320x320x1-type-1-upsamp2.00-precf-thread1.png b/docs/pics/320x320x1-type-1-upsamp2.00-precf-thread1.png
diff --git a/docs/pics/320x320x1-type-1-upsamp2.00-precf-thread16.png b/docs/pics/320x320x1-type-1-upsamp2.00-precf-thread16.png
diff --git a/docs/pics/320x320x1-type-1-upsamp2.00-precf-thread32.png b/docs/pics/320x320x1-type-1-upsamp2.00-precf-thread32.png
diff --git a/docs/pics/320x320x1-type-2-upsamp1.25-precd-thread1.png b/docs/pics/320x320x1-type-2-upsamp1.25-precd-thread1.png
diff --git a/docs/pics/320x320x1-type-2-upsamp1.25-precf-thread1.png b/docs/pics/320x320x1-type-2-upsamp1.25-precf-thread1.png
diff --git a/docs/pics/320x320x1-type-2-upsamp1.25-precf-thread16.png b/docs/pics/320x320x1-type-2-upsamp1.25-precf-thread16.png
diff --git a/docs/pics/320x320x1-type-2-upsamp1.25-precf-thread32.png b/docs/pics/320x320x1-type-2-upsamp1.25-precf-thread32.png
diff --git a/docs/pics/320x320x1-type-2-upsamp2.00-precd-thread1.png b/docs/pics/320x320x1-type-2-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/320x320x1-type-2-upsamp2.00-precf-thread1.png b/docs/pics/320x320x1-type-2-upsamp2.00-precf-thread1.png
diff --git a/docs/pics/320x320x1-type-2-upsamp2.00-precf-thread16.png b/docs/pics/320x320x1-type-2-upsamp2.00-precf-thread16.png
diff --git a/docs/pics/320x320x1-type-2-upsamp2.00-precf-thread32.png b/docs/pics/320x320x1-type-2-upsamp2.00-precf-thread32.png
diff --git a/docs/pics/320x320x1-type-3-upsamp1.25-precd-thread1.png b/docs/pics/320x320x1-type-3-upsamp1.25-precd-thread1.png
diff --git a/docs/pics/320x320x1-type-3-upsamp1.25-precf-thread1.png b/docs/pics/320x320x1-type-3-upsamp1.25-precf-thread1.png
diff --git a/docs/pics/320x320x1-type-3-upsamp1.25-precf-thread16.png b/docs/pics/320x320x1-type-3-upsamp1.25-precf-thread16.png
diff --git a/docs/pics/320x320x1-type-3-upsamp1.25-precf-thread32.png b/docs/pics/320x320x1-type-3-upsamp1.25-precf-thread32.png
diff --git a/docs/pics/320x320x1-type-3-upsamp2.00-precd-thread1.png b/docs/pics/320x320x1-type-3-upsamp2.00-precd-thread1.png
diff --git a/docs/pics/320x320x1-type-3-upsamp2.00-precf-thread1.png b/docs/pics/320x320x1-type-3-upsamp2.00-precf-thread1.png
diff --git a/docs/pics/320x320x1-type-3-upsamp2.00-precf-thread16.png b/docs/pics/320x320x1-type-3-upsamp2.00-precf-thread16.png
diff --git a/docs/pics/320x320x1-type-3-upsamp2.00-precf-thread32.png b/docs/pics/320x320x1-type-3-upsamp2.00-precf-thread32.png
diff --git a/docs/users.rst b/docs/users.rst
@@ -35,7 +35,7 @@ and also add them to GitHub's Used By feature):
 #. `EM-Align <https://github.com/ShkolniskyLab/emalign>`_: Aligning rotation, reflection, and translation between volumes (density maps) in cryo-electron microscopy, from Shkolnisky Lab at Tel Aviv.
 
 #. `spinifel <https://gitlab.osti.gov/mtip/spinifel>`_: Uses the multitiered iterative phasing (M-TIP) algorithm for single particle X-ray diffraction imaging, on CPU/GPU, from the ExaFEL project at LBNL/DOE.
-   
+
 #. `sinctransform <https://github.com/hannahlawrence/sinctransform>`_: C++ and MATLAB codes to evaluate sums of the sinc and sinc^2 kernels between arbitrary nonuniform points in 1,2, or 3 dimensions, by Hannah Lawrence (2017 summer intern at Flatiron).
 
 #. `fsinc <https://github.com/gauteh/fsinc>`_:  Gaute Hope's fast sinc transform and interpolation Python package.
@@ -46,18 +46,22 @@ and also add them to GitHub's Used By feature):
 
 #. `TRIQS CTINT <https://github.com/TRIQS/ctint>`_: continuous-time interaction-expansion solver, by N. Wentzell and O. Parcollet (Flatiron Institute, part of platform for interacting quantum systems).
 
+#. `cunuSHT <https://github.com/Sebastian-Belkner/cunuSHT>`_: GPU accelerated spherical harmonic transforms from nonuniform samples (arbitrary pixelizations), by S. Belkner and coauthors. https://arxiv.org/abs/2406.14542
+
+#. `FReSCO <https://github.com/martiniani-lab/FReSCo>`_: Fast reciprocal-space correlator, by Aaron Shih, Mathias Kasiulis, and Stefano Martiniani. This uses thousands of calls to all three transform types in 2D or 3D, to iteratively adjust nonuniform points until their Fourier transforms match a desired function. Physics Mag. article and movie: https://physics.aps.org/articles/v17/134
+
 
 Other wrappers to (cu)FINUFFT
 ------------------------------
-   
+
 #. `FINUFFT.jl <https://github.com/ludvigak/FINUFFT.jl>`_: a `julia <https://julialang.org/>`_ language wrapper by Ludvig af Klinteberg, Libin Lu, and others, now using pure Julia, and fully featured (rather than via Python). This is itself wrapped by `AbstractNFFTs.jl` in `NFFT.jl <https://juliamath.github.io/NFFT.jl/dev/performance/>`_.
 
 #. `TensorFlow NUFFT <https://github.com/mrphys/tensorflow-nufft>`_: a wrapper to the differentiable machine learning Python tool TensorFlow, for the CPU (via FINUFFT) and GPU (via cuFINUFFT). By Javier Montalt Tordera (UCL).
 
 #. `JAX bindings to (cu)FINUFFT <https://github.com/dfm/jax-finufft>`_: a wrapper to the differentiable machine learning Python tool JAX. Directly exposes the FINUFFT library to JAX's XLA backend, as well as implementing differentiation rules for the transforms. By Dan Foreman-Mackey (CCA).
-   
+
 #. `PyTorch wrapper to (cu)FINUFFT <https://flatironinstitute.github.io/pytorch-finufft>`_:  a wrapper to the differentiable machine learning Python tool PyTorch. By Michael Eickenberg and Brian Ward (CCM).
-   
+
 
 Research output using (cu)FINUFFT
 ---------------------------------
@@ -92,14 +96,14 @@ For the latest see: Google Scholar `FINUFFT citations <https://scholar.google.co
 #. A. Harness, S. Shaklan, P. Willems, N. J. Kasdin, K. Balasubramanian, V. White, K. Yee, P. Dumont, R. Muller, S. Vuong, M. Galvin,
    "Optical experiments and model validation of perturbed starshade designs," Proc. SPIE 11823, Techniques and Instrumentation for Detection of Exoplanets X, 1182312 (1 September 2021); https://doi.org/10.1117/12.2595409
 
-#. Chang, P., Pienaar, E., & Gebbie, T. (2020). "Malliavin--Mancino Estimators Implemented with Nonuniform Fast Fourier Transforms." SIAM J. Sci. Comput. 42(6), B1378–B1403. https://doi.org/10.1137/20m1325903 
+#. Chang, P., Pienaar, E., & Gebbie, T. (2020). "Malliavin--Mancino Estimators Implemented with Nonuniform Fast Fourier Transforms." SIAM J. Sci. Comput. 42(6), B1378–B1403. https://doi.org/10.1137/20m1325903
 
 #. Heisenberg voxelization (HVOX) for interferometry of spherical sky maps in radio-astronomy, by Kashani, Simeoni, et al. (2023) https://arxiv.org/abs/2306.06007 https://github.com/matthieumeo/hvox
 
 #. Sriramkrishnan Muralikrishnan at the Jülich Supercomputing Centre is running cufinufft on 6144 A100 GPUs (the NERSC-9 supercomputer), for a particle-in-Fourier method for plasma simulations. https://pasc23.pasc-conference.org/presentation/?id=msa167&sess=sess154
 
 #. Related to that, FINUFFT is being used for a better-converging Fourier approach to the Immersed Boundary method of Peskin and his group at NYU. Zhe Chen and Charles Peskin, https://arxiv.org/abs/2302.08694
-   
+
 #. Pei R, Askham T, Greengard L, Jiang S (2023). "A fast method for imposing periodic boundary conditions on arbitrarily-shaped lattices in two dimensions." J. Comput. Phys. 474, 111792. https://doi.org/10.1016/j.jcp.2022.111792 Uses FINUFFT for plane wave sums.
 
 #. Dylan Green, JR Jamora, and Anne Gelb (2023). "Leveraging joint sparsity in 3D synthetic aperture radar imaging," Appl. Math. Modern Chall. 1, 61-86. https://doi.org/10.3934/ammc.2023005 Uses 3D transforms between $N=201^3$ modes (voxels) and $M=313300$ data points. As they state, "...the computational cost of each method heavily depends on the NUFFT algorithm used."
@@ -119,8 +123,8 @@ Papers influenced by other aspects of FINUFFT:
 
 1. NFFT.jl: Generic and Fast Julia Implementation of the Nonequidistant Fast Fourier Transform, by Tobias Knopp, Marija Boberg, Mirco Grosser (2022). https://arxiv.org/abs/2208.00049  They use our blocked spreading and piecewise polynomial ideas, and beat our type 1 and 2 performance by a factor of up to 1.7 in multithreaded cases. Code is dimension-independent but very abstract (two levels of meta-programming, I believe).
 
-   
-   
+
+
 Some citations to FINUFFT that do not appear to be actual users
 ---------------------------------------------------------------
 
@@ -133,5 +137,5 @@ Some citations to FINUFFT that do not appear to be actual users
 #. https://arxiv.org/abs/1912.09746
 
 #. https://arxiv.org/abs/2010.05295
-      
+
 Now too many to track by hand... please see Google Scholar search linked above.
diff --git a/include/cufinufft/common.h b/include/cufinufft/common.h
@@ -7,31 +7,37 @@
 #include <finufft_errors.h>
 #include <finufft_spread_opts.h>
 
-#include <complex.h>
+#include <complex>
 
 namespace cufinufft {
 namespace common {
 template<typename T>
-__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f,
-                                       cuDoubleComplex *a, T *fwkerhalf1, T *fwkerhalf2,
+__global__ void fseries_kernel_compute(int nf1, int nf2, int nf3, T *f, T *a,
+                                       T *fwkerhalf1, T *fwkerhalf2, T *fwkerhalf3,
+                                       int ns);
+template<typename T>
+__global__ void cu_nuft_kernel_compute(int nf1, int nf2, int nf3, T *f, T *z, T *kx,
+                                       T *ky, T *kz, T *fwkerhalf1, T *fwkerhalf2,
                                        T *fwkerhalf3, int ns);
 template<typename T>
-int cufserieskernelcompute(int dim, int nf1, int nf2, int nf3, T *d_f,
-                           cuDoubleComplex *d_a, T *d_fwkerhalf1, T *d_fwkerhalf2,
-                           T *d_fwkerhalf3, int ns, cudaStream_t stream);
+int fseries_kernel_compute(int dim, int nf1, int nf2, int nf3, T *d_f, T *d_phase,
+                           T *d_fwkerhalf1, T *d_fwkerhalf2, T *d_fwkerhalf3, int ns,
+                           cudaStream_t stream);
+template<typename T>
+int nuft_kernel_compute(int dim, int nf1, int nf2, int nf3, T *d_f, T *d_z, T *d_kx,
+                        T *d_ky, T *d_kz, T *d_fwkerhalf1, T *d_fwkerhalf2,
+                        T *d_fwkerhalf3, int ns, cudaStream_t stream);
 template<typename T>
 int setup_spreader_for_nufft(finufft_spread_opts &spopts, T eps, cufinufft_opts opts);
 
 void set_nf_type12(CUFINUFFT_BIGINT ms, cufinufft_opts opts, finufft_spread_opts spopts,
                    CUFINUFFT_BIGINT *nf, CUFINUFFT_BIGINT b);
+
 template<typename T>
-void onedim_fseries_kernel(CUFINUFFT_BIGINT nf, T *fwkerhalf, finufft_spread_opts opts);
-template<typename T>
-void onedim_fseries_kernel_precomp(CUFINUFFT_BIGINT nf, T *f, std::complex<double> *a,
+void onedim_fseries_kernel_precomp(CUFINUFFT_BIGINT nf, T *f, T *a,
                                    finufft_spread_opts opts);
 template<typename T>
-void onedim_fseries_kernel_compute(CUFINUFFT_BIGINT nf, T *f, std::complex<double> *a,
-                                   T *fwkerhalf, finufft_spread_opts opts);
+void onedim_nuft_kernel_precomp(T *f, T *zout, finufft_spread_opts opts);
 
 template<typename T>
 std::size_t shared_memory_required(int dim, int ns, int bin_size_x, int bin_size_y,
@@ -41,8 +47,8 @@ template<typename T>
 void cufinufft_setup_binsize(int type, int ns, int dim, cufinufft_opts *opts);
 
 template<typename T, typename V>
-auto cufinufft_set_shared_memory(V *kernel, const int dim,
-                                 const cufinufft_plan_t<T> &d_plan) {
+int cufinufft_set_shared_memory(V *kernel, const int dim,
+                                const cufinufft_plan_t<T> &d_plan) {
   /**
    * WARNING: this function does not handle cuda errors. The caller should check them.
    */