cuda: update interface to take 64-bit M #411

janden · 2023-12-27T23:17:26Z

For compatibility with FINUFFT.

This is quite an invasive patch and I'm not sure it's doing the right thing in several places (in particular, I'm not certain that CUDA is happy taking int64_t for block dimensions, etc.). That being said, it is compiling and tests are passing, so that's something.

ahbarnett · 2023-12-27T23:38:50Z

This breaks the cufinufft interface, so is the plan to go into 2.2.1 ? 2.3?

janden · 2023-12-27T23:49:17Z

This breaks the cufinufft interface, so is the plan to go into 2.2.1 ? 2.3?

Right. The plan was to get this in before the 2.2.0 release (see discussion in #255). But perhaps that is too late now?

In terms of breaking the interface, it's quite a benign breakage (so we may not need to wait until version 3.0). From what I understand, it will break the ABI (since we have int64_ts where we previously had int), but code will still compile as expected (compilers will silently upconvert int to int64_t where necessary). That being said, I'm no C guru, so someone tell me if I'm wrong here.

ahbarnett · 2023-12-27T23:50:10Z

Thanks. Ah, I just read your response - hoping you're in USA not Sweden :).
Could you include the docs and CHANGELOG changes that go along with this. Maybe this could be a 2.2.1, but no need for pre-Lanczos. Stability is key not matching the FINUFFT interface, for that...

janden · 2023-12-27T23:57:25Z

Thanks. Ah, I just read your response - hoping you're in USA not Sweden :).

Heh. In Sweden actually… This is prime work time!

Could you include the docs and CHANGELOG changes that go along with this.

Done.

Maybe this could be a 2.2.1, but no need for pre-Lanczos. Stability is key not matching the FINUFFT interface, for that...

Yeah I'm thinking we wait on this. Would like to get @blackwer's eyes on this before I do anything.

blackwer · 2023-12-28T01:02:51Z

I don't see how this could cause any problems, so it looks good to me. Since it's not a pointer type and we're increasing the size so it won't cause overflows. I falsely said in another post that this had to do with cufft having an int32 size limit, but this is the number of non-uniform points, so it's not constrained by the fft like the number of modes. Unless I'm missing something...

It is technically an API change so semantic versioning would probably tell us to wait until 2.3 for this if it doesn't make it into 2.2. That said, given the automatic type casting I'm inclined to ignore that since it really shouldn't ever break anything (unless someone is linking 2.2 while using 2.2.1 headers or decided to make their own bindings). I personally vote we just go ahead and merge it but I understand the reticence there

janden · 2023-12-28T11:00:30Z

this is the number of non-uniform points, so it's not constrained by the fft like the number of modes.

Right. That was my thinking too.

Still, I'm a bit concerned about how this interacts with dim3 and kernel launches where we assign M to variables that should be 32-bit, like here:

finufft/src/cuda/2d/spread2d_wrapper.cu

Lines 125 to 135 in 37dbab2

    
           threadsPerBlock.x = 16; 
        
           threadsPerBlock.y = 1; 
        
           blocks.x = (M + threadsPerBlock.x - 1) / threadsPerBlock.x; 
        
           blocks.y = 1; 
        
           if (d_plan->opts.gpu_kerevalmeth) { 
        
               for (int t = 0; t < blksize; t++) { 
        
                   spread_2d_nupts_driven<T, 1> 
        
                       <<<blocks, threadsPerBlock, 0, stream>>>(d_kx, d_ky, d_c + t * M, d_fw + t * nf1 * nf2, M, ns, nf1, nf2, 
        
                                                                es_c, es_beta, sigma, d_idxnupts, pirange); 
        
                   RETURN_IF_CUDA_ERROR 
        
               }

From what I understand, dim3 is a tuple of three unsigned ints, so the compiler should complain here, but it's not. Is it silently truncating? If so, we'd expect problems if M was more than 16*2^64. Will run some tests to find out (should be unit testing these large-M problems anyhow).

It is technically an API change so semantic versioning would probably tell us to wait until 2.3 for this if it doesn't make it into 2.2. That said, given the automatic type casting I'm inclined to ignore that since it really shouldn't ever break anything (unless someone is linking 2.2 while using 2.2.1 headers or decided to make their own bindings). I personally vote we just go ahead and merge it but I understand the reticence there

Would tend to agree, but we can discuss later. As I said elsewhere, no need to rush this.

ahbarnett · 2024-04-23T19:56:48Z

Can you add a 1d test with M>2.2e9, which becomes part of the CI for cufinufft? (at 1e9 NU pt/sec, it should take only 3 secs!).

Thanks! Alex

Tips: keep the cufft size < int32)
For the analogous CPU tester, see: perftest/big2d2f.cpp

janden · 2024-05-09T15:50:34Z

I've been working on testing this yesterday and today. Some notes:

The code as written doesn't work due to the (silent) overflows when assigning int64_t to int in a few places. This can be fixed, but requires more work.
The set of possible problem sizes is actually quite limited on current hardware due to the limit on GPU memory. So far, I'm able to run 4e9 points in single precision without sorting. On the GPU, this takes up approximately 4e9 * 4 bytes for the locations, 4e9 * 8 bytes for the strengths and 4e9 * 4 bytes for idxnupts, which comes out to 60 GB. (Actually, the last one should be 4e9 * 8 so it can properly index the points, so the total comes out to 75 GB.)
Running the tests is actually very slow (more than 3 seconds). The vast majority of the time is not taken up by the cufinufft call, but generating the data and checking the accuracy of the output (on the CPU). This can of course be sped up using multithreading.

So I guess the question here is whether it's worth the effort (do we foresee GPU memory going into triple or quadruple digits anytime soon?) to finish the conversion. It is doable, of course, but would require some time and increase complexity of the code.

ahbarnett · 2024-05-09T17:57:09Z

Interesting. Actually int (signed) overflows at 2^31 so 2.2e9 exceeds it (you can half your RAM total). But, I would be fine having the GPU code be int-only. Since idxnupts is 4 rather than 8 bytes, it allows the user more RAM for other stuff. Best, Alex

…

On Thu, May 9, 2024 at 11:50 AM Joakim Andén ***@***.***> wrote: I've been working on testing this yesterday and today. Some notes: - The code as written doesn't work due to the (silent) overflows when assigning int64_t to int in a few places. This can be fixed, but requires more work. - The set of possible problem sizes is actually quite limited on current hardware due to the limit on GPU memory. So far, I'm able to run 4e9 points in single precision without sorting. On the GPU, this takes up approximately 4e9 * 4 bytes for the locations, 4e9 * 8 bytes for the strengths and 4e9 * 4 bytes for idxnupts, which comes out to 60 GB. (Actually, the last one should be 4e9 * 8 so it can properly index the points, so the total comes out to 75 GB.) - Running the tests is actually very slow (more than 3 seconds). The vast majority of the time is not taken up by the cufinufft call, but generating the data and checking the accuracy of the output (on the CPU). This can of course be sped up using multithreading. So I guess the question here is whether it's worth the effort (do we foresee GPU memory going into triple or quadruple digits anytime soon?) to finish the conversion. It is doable, of course, but would require some time and increase complexity of the code. — Reply to this email directly, view it on GitHub <#411 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACNZRSQKJKULFET5D7ENR33ZBOLOBAVCNFSM6AAAAABBE3ADAWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBSHEZTKOJYG4> . You are receiving this because you commented.Message ID: ***@***.***>

-- *-------------------------------------------------------------------~^`^~._.~' |\ Alex Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942

janden · 2024-05-31T08:51:56Z

Following discussion on Tuesday, we're scrapping the effort to transition the internals to handle 64-bit integers for the number of points. This is such a limited use case that it's not worth it at this point (may be relevant once we start having GPUs with more than 80 GB of memory).

As a result, we're going to do the same thing as for the number of modes. That is, we allow 64-bit integers, but check whether these will fit into 32-bit integers and if so cast (otherwise, we error).

While the interface changes, we still won't allow more than 2e9 points (for now) since this will complicate the code quite a bit with little tangible benefits (transforms these size are currently out of range for most GPUs in terms of memory consumption).

janden · 2024-06-11T07:27:51Z

Had to update Jenkins slightly to install dependencies without cache (pip install --no-cache-dir) to get rid of this error:

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 a884af6cb594e89b0669964b47e1ae6fb13ae6d8c580db3426d08ee2ca84ecd4
             Got        d738f15fb0b4bacd71f10fe8ccbb537cdd1c05c0c018b0dc4dcbd41afbc41fcb

Not sure why this worked (a fresh Docker image should have no cache, correct?), but it got rid of the error message.

ahbarnett · 2024-06-24T17:50:08Z

@janden do you need reviews from people on this before we bring in? you'll have to remind them re the final decision:
#411 (comment)

janden · 2024-06-25T16:53:00Z

@janden do you need reviews from people on this before we bring in? you'll have to remind them re the final decision: #411 (comment)

Yes I'll tag @blackwer.

blackwer

It looks good to me after the typo gets fixed

CHANGELOG

include/cufinufft.h

janden force-pushed the cu_int64_convert branch from 338b5e2 to b4e1c4e Compare December 27, 2023 23:33

janden mentioned this pull request Dec 27, 2023

finufft/cufinufft interface rationalization (C) #255

Closed

janden force-pushed the cu_int64_convert branch from 91412f7 to 37dbab2 Compare December 27, 2023 23:59

janden force-pushed the cu_int64_convert branch from 37dbab2 to d07b2cf Compare May 31, 2024 08:50

janden force-pushed the cu_int64_convert branch from d07b2cf to 3d9f5a2 Compare May 31, 2024 10:03

janden force-pushed the cu_int64_convert branch from 3d9f5a2 to 6f30fbd Compare June 11, 2024 06:50

janden and others added 3 commits June 11, 2024 08:51

cuda+py: update py interface to deal with 64-bit M

96f450f

cuda: update CHANGELOG wrt new setpts API

6f30fbd

janden force-pushed the cu_int64_convert branch from 98d5de6 to 7d740ef Compare June 11, 2024 07:07

ci: no cache for Jenkins

7d740ef

janden marked this pull request as ready for review June 11, 2024 07:26

janden requested a review from blackwer June 25, 2024 16:53

blackwer reviewed Jun 25, 2024

View reviewed changes

CHANGELOG Outdated Show resolved Hide resolved

include/cufinufft.h Outdated Show resolved Hide resolved

janden added 2 commits June 25, 2024 19:47

docs: fix typos in CHANGELOG

3c0ab9a

cuda: remove const qualifiers for setpts

0ce5642

blackwer merged commit 000f8dc into flatironinstitute:master Jun 25, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuda: update interface to take 64-bit M #411

cuda: update interface to take 64-bit M #411

janden commented Dec 27, 2023 •

edited

Loading

ahbarnett commented Dec 27, 2023

janden commented Dec 27, 2023

ahbarnett commented Dec 27, 2023 •

edited

Loading

janden commented Dec 27, 2023

blackwer commented Dec 28, 2023

janden commented Dec 28, 2023

ahbarnett commented Apr 23, 2024 •

edited

Loading

janden commented May 9, 2024

ahbarnett commented May 9, 2024 via email

janden commented May 31, 2024

janden commented Jun 11, 2024

ahbarnett commented Jun 24, 2024

janden commented Jun 25, 2024

blackwer left a comment

cuda: update interface to take 64-bit M #411

cuda: update interface to take 64-bit M #411

Conversation

janden commented Dec 27, 2023 • edited Loading

ahbarnett commented Dec 27, 2023

janden commented Dec 27, 2023

ahbarnett commented Dec 27, 2023 • edited Loading

janden commented Dec 27, 2023

blackwer commented Dec 28, 2023

janden commented Dec 28, 2023

ahbarnett commented Apr 23, 2024 • edited Loading

janden commented May 9, 2024

ahbarnett commented May 9, 2024 via email

janden commented May 31, 2024

janden commented Jun 11, 2024

ahbarnett commented Jun 24, 2024

janden commented Jun 25, 2024

blackwer left a comment

Choose a reason for hiding this comment

janden commented Dec 27, 2023 •

edited

Loading

ahbarnett commented Dec 27, 2023 •

edited

Loading

ahbarnett commented Apr 23, 2024 •

edited

Loading