Exploit structure of upsampled input data when doing FFTs #304

blackwer · 2023-06-28T17:25:11Z

There may be other ways than pruning to speed up the FFT. In a unifom-to-nonuniform 2D NUFFT with, say, an oversampling factor of 2, you only need to transform over the first axis for half of the array, since the other half of the array contains vectors that are identically zero. This saves 25% of FFT cost without having to use complicated hand-crafted algorithms. In 3D, you have to transform a quarter of the array along the first axis and half of the array along the second axis, saving an even larger fraction of FFT time

Originally posted by @mreineck in #293 (comment)

I've implemented a POC for the suggestion by @mreineck that shows typical speedups ~1.5 on 2D data vs the naive way with MKL. I'm not doing upsampling quite the way it's done in finufft, but the answer should correct up to a phase shift as far as I can tell. I don't have a lot of cycles to work on this right now, but this seems like a very easy way to get a huge performance bump in higher dimensions.

#include <cstring>
#include <cstdio>
#include <random>
#include <tuple>
#include <utility>

#include <fftw3.h>
#include <omp.h>

std::pair<fftw_complex *, fftw_complex *> initialize_arrays(int n_inner[2], int n_tot[2]) {
    std::random_device rand_dev;
    std::mt19937 generator(1);
    std::uniform_real_distribution<double> distr(-1.0, 1.0);

    int N = n_tot[0] * n_tot[1];
    fftw_complex *in = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *out = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * N);

    std::memset(in, 0, n_tot[0] * n_tot[1] * sizeof(fftw_complex));
    for (int i = 0; i < n_inner[0]; ++i)
        for (int j = 0; j < n_inner[1]; ++j)
            for (int k = 0; k < 2; ++k)
                in[i * n_tot[0] + j][k] = distr(generator);

    std::memset(out, 0, n_tot[0] * n_tot[1] * sizeof(fftw_complex));
    return std::make_pair(in, out);
}

double eval_naive(int n[2], fftw_complex *in, fftw_complex *out) {
    int n_threads;
#pragma omp parallel
    n_threads = omp_get_num_threads();
    fftw_plan_with_nthreads(n_threads);
    fftw_plan p = fftw_plan_dft(2, n, in, out, FFTW_FORWARD, FFTW_MEASURE);
    double t_elapsed = omp_get_wtime();

    fftw_execute(p);
    fftw_destroy_plan(p);
    return omp_get_wtime() - t_elapsed;
}

double eval_clever(int n_actual[2], int n_upsampled[2], fftw_complex *in, fftw_complex *out) {
    int n_threads;
#pragma omp parallel
    n_threads = omp_get_num_threads();
    fftw_plan_with_nthreads(n_threads);

    fftw_plan p_cols;
    {
        int rank = 1;               // we're doing 1d transforms of columns first
        int n[] = {n_upsampled[0]}; // each column has n_upsampled[0] rows
        int howmany = n_actual[1];  // we only need to tranform the original columns
        int idist = 1, odist = 1;
        int istride = n_upsampled[1], ostride = n_upsampled[1];
        int *oembed = NULL, *iembed = NULL;
        p_cols = fftw_plan_many_dft(rank, n, howmany, in, iembed, istride, idist, out, oembed, ostride, odist,
                                    FFTW_FORWARD, FFTW_MEASURE);
    }
    fftw_plan p_rows;
    {
        int rank = 1;               // we're doing 1d transforms of rows now
        int n[] = {n_upsampled[1]}; // each row has n_upsampled[1] columns to transform
        int howmany = n_upsampled[0];  // we need to tranform ALL rows
        int idist = n_upsampled[1], odist = n_upsampled[1];
        int istride = 1, ostride = 1;
        int *oembed = NULL, *iembed = NULL;
        p_rows = fftw_plan_many_dft(rank, n, howmany, out, iembed, istride, idist, out, oembed, ostride, odist,
                                    FFTW_FORWARD, FFTW_MEASURE);
    }
    double t_elapsed = omp_get_wtime();

    fftw_execute(p_cols);
    fftw_execute(p_rows);


    fftw_destroy_plan(p_cols);
    fftw_destroy_plan(p_rows);
    return omp_get_wtime() - t_elapsed;
}

int main(int argc, char *argv[]) {
    constexpr int n_runs = 10;
    int n_per_dim = 500;
    if (argc > 1)
        n_per_dim = std::atof(argv[1]);
    constexpr int dim = 2;
    constexpr double upsamp = 2.0;

    int n_inner[dim];
    int n_upsampled[dim];
    for (int i = 0; i < dim; ++i) {
        n_inner[i] = n_per_dim;
        n_upsampled[i] = upsamp * n_inner[i];
    }
    int N_UPSAMP = n_upsampled[0] * n_upsampled[1];

    auto [in_naive, out_naive] = initialize_arrays(n_inner, n_upsampled);
    auto [in_clever, out_clever] = initialize_arrays(n_inner, n_upsampled);

    double t_naive = 0.0;
    for (int i = 0; i < n_runs; ++i) {
        std::memset(out_naive, 0, N_UPSAMP * sizeof(fftw_complex));
        t_naive += eval_naive(n_upsampled, in_naive, out_naive);
    }

    double t_clever =  0.0;
    for (int i = 0; i < n_runs; ++i) {
        std::memset(out_clever, 0, N_UPSAMP * sizeof(fftw_complex));
        t_clever += eval_clever(n_inner, n_upsampled, in_clever, out_clever);
    }

    printf("%.4f\t%.4f\n", t_naive, t_naive / n_runs);
    printf("%.4f\t%.4f\n", t_clever, t_clever / n_runs);
    printf("%.4f\n", t_naive / t_clever);

    fftw_free(in_naive);
    fftw_free(in_clever);
    fftw_free(out_naive);
    fftw_free(out_clever);
    return 0;
}

% g++ upsample_test.cpp -lmkl_rt -fopenmp -std=c++17 -g -O3
% OMP_NUM_THREADS=1 taskset -c 0 ./a.out 2000
1.6956  0.1696
1.1748  0.1175
1.4433

The text was updated successfully, but these errors were encountered:

mreineck · 2023-06-30T14:42:48Z

I pushed a demo for the partial FFTs as part of the #287 PR.

turbotage · 2023-12-28T17:59:04Z

Any news here? Seems like an amazing performance boost! Are there any limitations as to why this shouldn't work well both for cufinufft and finufft or certain upsampling factors?

ahbarnett · 2023-12-29T16:00:55Z

No news; too busy w/ 2.2 & various deadlines. The complication is its interaction with the vectorized interface (batching in FFTW via plan_many_dfts...) It will only make a difference for upsampfac=2 for the CPU. To be honest, cufft is so fast that for the GPU it won't make much difference. (Unless maybe you have timing breakdowns in your MRI application that says otherwise... let us know). Happy new year, Alex

…

On Thu, Dec 28, 2023 at 12:59 PM turbotage ***@***.***> wrote: Any news here? Seems like an amazing performance boost! Are there any limitations as to why this shouldn't work well both for cufinufft and finufft or certain upsampling factors? — Reply to this email directly, view it on GitHub <#304 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACNZRSVDWVNTSGZMGKPPONTYLWXPHAVCNFSM6AAAAAAZXN2DCKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZRGM4DGMRRGQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

-- *-------------------------------------------------------------------~^`^~._.~' |\ Alex Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942

DiamonDinoia · 2024-07-17T15:44:24Z

I think #463 achieves this for DUCC. I do not think supporting FFTW in the same way is worth.

ahbarnett · 2024-07-17T16:00:26Z

I disagree, that the project of FFTW exploiting sparsity is still interesting, and has potential 2x in 3D, when FFT-dominated. I am delighted with the ducc0 performance, but we can't throw away FFTW completely...

…

On Wed, Jul 17, 2024 at 11:44 AM Marco Barbone ***@***.***> wrote: I think #463 <#463> achieves this for DUCC. I do not think supporting FFTW in the same way is worth. — Reply to this email directly, view it on GitHub <#304 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACNZRSWEAS5C7VO5RBWOP63ZM2GO5AVCNFSM6AAAAABLA5AZQGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZTGYZTEMBRGU> . You are receiving this because you commented.Message ID: ***@***.***>

-- *-------------------------------------------------------------------~^`^~._.~' |\ Alex Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942

DiamonDinoia · 2024-08-01T16:43:50Z

To clarify. I do not thing adding support for sparsity in fftw is worth the maintainability tradeoff. I am not advocating for dropping fftw. This can be reopened in the future in case there a strong need, or can be moved to discussions for future reference.

mreineck mentioned this issue Jun 30, 2023

Tech demo: switch from FFTW to pocketfft/ducc.fft #287

Closed

blackwer mentioned this issue Sep 29, 2023

Seemingly random memory-related errors while running FINUFFT inside of OpenMP parallel section #346

Closed

ahbarnett added the enhancement label Dec 21, 2023

flatironinstitute locked and limited conversation to collaborators Aug 1, 2024

ahbarnett converted this issue into discussion #510 Aug 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

Exploit structure of upsampled input data when doing FFTs #304

Exploit structure of upsampled input data when doing FFTs #304

blackwer commented Jun 28, 2023 •

edited

Loading

mreineck commented Jun 30, 2023

turbotage commented Dec 28, 2023

ahbarnett commented Dec 29, 2023 via email

DiamonDinoia commented Jul 17, 2024

ahbarnett commented Jul 17, 2024 via email

DiamonDinoia commented Aug 1, 2024

This issue was moved to a discussion.

This issue was moved to a discussion.

Exploit structure of upsampled input data when doing FFTs #304

Exploit structure of upsampled input data when doing FFTs #304

Comments

blackwer commented Jun 28, 2023 • edited Loading

mreineck commented Jun 30, 2023

turbotage commented Dec 28, 2023

ahbarnett commented Dec 29, 2023 via email

DiamonDinoia commented Jul 17, 2024

ahbarnett commented Jul 17, 2024 via email

DiamonDinoia commented Aug 1, 2024

This issue was moved to a discussion.

blackwer commented Jun 28, 2023 •

edited

Loading