
GPU performance improvements #488

Merged (50 commits) on Aug 2, 2024

Conversation

DiamonDinoia
Collaborator

@DiamonDinoia DiamonDinoia commented Jul 17, 2024

Possible improvements to GPU performance are:

  • bin size determined by the available shared memory
  • using integer and float32 arithmetic instead of the more expensive float64
  • fixed minor issues in CMake
  • updated the Horner coefficients to use the same values as the CPU version

#481 summarizes the achieved performance.

@DiamonDinoia DiamonDinoia mentioned this pull request Jul 17, 2024
8 tasks
@DiamonDinoia DiamonDinoia added this to the 3.0 milestone Jul 17, 2024
@DiamonDinoia DiamonDinoia mentioned this pull request Jul 17, 2024
6 tasks
@DiamonDinoia DiamonDinoia removed this from the 2.3 milestone Jul 17, 2024
Member

@lu1and10 lu1and10 left a comment


Looks good to me, just left some minor comments in the files.

CMakeLists.txt: two review comments (outdated, resolved)
int *d_idxnupts = d_plan->idxnupts;
thrust::sequence(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M);
RETURN_IF_CUDA_ERROR
thrust::sort(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M,
Member

Will the thrust sort also be faster than the current bin sort in 2D and 3D? Though sorting only takes a few percent of the runtime in 2D and 3D.

One thing to note is that thrust sort (which most probably calls cub sort) creates a workspace during sorting, so GPU memory usage may spike a little, while the current bin sort's memory is all managed by ourselves.
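For readers following along, the index-sort pattern in the snippet above can be sketched on the host with the standard library. This is only an analogue, not the cufinufft code: std::iota and std::sort stand in for thrust::sequence and thrust::sort, and the bin-id comparator is a hypothetical stand-in for the comparator truncated in the diff.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sort point indices by their bin id, leaving the bin array itself untouched.
// Host-side analogue of thrust::sequence + thrust::sort on idxnupts.
std::vector<int> sorted_indices(const std::vector<int> &bins) {
    std::vector<int> idx(bins.size());
    std::iota(idx.begin(), idx.end(), 0);  // like thrust::sequence
    std::sort(idx.begin(), idx.end(),      // like thrust::sort with a comparator
              [&](int a, int b) { return bins[a] < bins[b]; });
    return idx;
}
```

The in-place permutation of indices is what keeps the point coordinates themselves unsorted; only the index array is reordered.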

throw std::runtime_error(cudaGetErrorString(err));
}
// use 1/6 of the shared memory for the binsize
shared_mem_per_block /= 6;
Member

Is this 1/6 heuristic from perf-test experiments, or from some theory?
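For context, the heuristic under discussion just reserves a fraction of the per-block shared memory as the bin budget. A minimal sketch of how a 1D bin size could be derived from that budget; the function name, the (ns + 1) padding term, and the clamp are illustrative assumptions, not the actual cufinufft code. Only the divide-by-6 mirrors the diff.

```cpp
#include <cstddef>

// Illustrative only: derive a 1D bin size from the shared-memory budget.
// budget = smem/6 mirrors the `shared_mem_per_block /= 6;` line above;
// the (ns + 1) padding for the kernel half-width is an assumed detail.
int bin_size_from_smem(std::size_t smem_bytes, std::size_t elem_bytes, int ns) {
    std::size_t budget = smem_bytes / 6;  // the 1/6 heuristic
    long bin = static_cast<long>(budget / elem_bytes) - (ns + 1);
    return bin > 1 ? static_cast<int>(bin) : 1;  // never report an empty bin
}
```

For example, with 48 KiB of shared memory, 8-byte elements, and ns = 7, this would give a bin of 1016 points.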

src/ker_horner_allw_loop.inc: review comment (outdated, resolved)
@lu1and10 lu1and10 mentioned this pull request Jul 30, 2024
7 tasks
const T *x, const T *y, const cuda_complex<T> *c, cuda_complex<T> *fw, int M, int ns,
int nf1, int nf2, T es_c, T es_beta, T sigma, const int *idxnupts) {
#if ALLOCA_SUPPORTED
auto ker = (T *)alloca(sizeof(T) * ns * 3);
Collaborator Author

I need to fix the *3 here

Member

@blackwer blackwer left a comment


What a beast of a PR. LGTM. A lot to take in but everything seems OK on the surface.

Did you notice any significant improvements from the reduced memory pressure with alloca? I don't especially love dealing with VLAs, but it's probably OK here if there's an obvious advantage, especially if CUDA keeps supporting it in future specs.

@janden
Collaborator

janden commented Aug 1, 2024

Sorry I haven't had a chance to look at this yet. Will go through it tomorrow.

@DiamonDinoia
Collaborator Author

> What a beast of a PR. LGTM. A lot to take in but everything seems OK on the surface.
>
> Did you notice any significant improvements you could achieve with the reduced memory pressure from alloca? I don't especially love dealing with VLAs, but it's probably OK here if there's an obvious advantage, especially if this remains in cuda support for future specs.

Hi Robert,

Thanks for the review. alloca makes a small difference, but I think it is worth keeping since registers/stack are quite precious on the GPU. We are limited by shared memory more than registers at the moment, so it is not a huge improvement. If it becomes unmaintainable we can pull it out, but NVIDIA will likely not drop support for it.
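The alloca-versus-fixed-buffer trade-off being discussed can be sketched as follows. This is a host-side C++ sketch under stated assumptions: the compiler gating, the fixed fallback bound of 16, and the uniform-weight fill are all illustrative, not cufinufft's actual kernel-evaluation code.

```cpp
#include <cstdlib>
#if defined(__GNUC__) || defined(__clang__)
  #include <alloca.h>
  #define ALLOCA_SUPPORTED 1
#else
  #define ALLOCA_SUPPORTED 0
#endif

// Fill a runtime-sized kernel-weight buffer without touching the heap.
// With alloca the buffer is sized exactly to ns on the stack; otherwise
// fall back to a compile-time upper bound (16 here, an assumed value).
double sum_uniform_weights(int ns) {
#if ALLOCA_SUPPORTED
    double *ker = static_cast<double *>(alloca(sizeof(double) * ns));
#else
    double ker_fixed[16];  // assumed upper bound on ns
    double *ker = ker_fixed;
#endif
    for (int i = 0; i < ns; ++i) ker[i] = 1.0 / ns;  // stand-in for real weights
    double s = 0.0;
    for (int i = 0; i < ns; ++i) s += ker[i];
    return s;
}
```

The exact-size allocation is what saves stack space relative to a worst-case fixed array, which is the "registers/stack is precious" point made above.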

@DiamonDinoia
Collaborator Author

DiamonDinoia commented Aug 1, 2024

I added a 1.25 upsampfac unit test since the review; no new features.

Collaborator

@janden janden left a comment


Looks great! Thanks for doing this. Just have a few questions and comments here and there.

Review comments (resolved) in:

  • src/cuda/1d/spread1d_wrapper.cu (two comments)
  • devel/gen_all_horner_C_code.m (two comments)
  • include/cufinufft/impl.h (outdated)
  • include/cufinufft/utils.h
  • src/cuda/3d/spread3d_wrapper.cu (outdated)
  • src/cuda/common.cu (three comments, two outdated)
@DiamonDinoia DiamonDinoia requested a review from janden August 2, 2024 21:55
@janden
Collaborator

janden commented Aug 2, 2024

Good here as far as I'm concerned. Nice work!

@DiamonDinoia DiamonDinoia merged commit b3c2be7 into flatironinstitute:master Aug 2, 2024
20 of 21 checks passed