Return device_count=0 in error case #285

Merged (3 commits, Dec 8, 2023)
Conversation

@td-mpcdf (Contributor) commented Dec 1, 2023

To me it seems more natural for the get_device_count function to simply return 0 in the case of cudaErrorNoDevice and cudaErrorInsufficientDriver. Handling this case should happen at the application level, but at the moment the gtensor code aborts.
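A minimal sketch of the proposed behavior, assuming the CUDA backend and gtensor's gtGpuCheck error-handling macro mentioned below (the exact gtensor function name and signature may differ):

#include <cuda_runtime.h>

inline int get_device_count()
{
  int device_count;
  cudaError_t code = cudaGetDeviceCount(&device_count);
  if (code == cudaErrorNoDevice || code == cudaErrorInsufficientDriver) {
    // No usable GPU or driver: report 0 and let the application decide
    return 0;
  }
  gtGpuCheck(code); // standard error handling for any other failure
  return device_count;
}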

@bd4 (Contributor) commented Dec 1, 2023

I think this is a reasonable change and won't break anything. @germasch, what do you think?

Regarding the implementation, you can use gtGpuCheck(code) to trigger standard error handling if the code is not success or no-device (or anything else you want to special-case). This would also need to be implemented for SYCL so the behavior is consistent.

@td-mpcdf do you imagine writing code that would fall back to running on the CPU if the device count is 0?

@td-mpcdf (Contributor, Author) commented Dec 1, 2023

We mainly need it to print a meaningful error message that the user can understand and act on. Currently, if you run a GPU-built CUDA code on a CPU partition, you get an error message about an insufficient driver version, which is not that helpful to a user. With the proposed changes one could print something more meaningful.
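For illustration, application-level code along these lines becomes possible once the function reports 0 instead of aborting (the message text is just an example):

int device_count = get_device_count(); // the gtensor call discussed above
if (device_count == 0) {
  fprintf(stderr, "No usable GPU found. Did you submit this GPU build to a CPU partition?\n");
  exit(EXIT_FAILURE);
}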

@bd4 (Contributor) left a comment

Needs fallbacks and SYCL implementation. I can do the SYCL implementation as a separate PR if you don't have a machine with SYCL, just let me know.

fprintf(stderr, "Did you start the job on a CPU partition?\n");
device_count = 0;
break;
case cudaSuccess: break;
@bd4 (Contributor) commented:

Need a fallback case that calls gtGpuCheck(code) for all other errors.
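Roughly this shape (a sketch for the CUDA backend; the surrounding function is assumed from the diff context):

switch (code) {
case cudaErrorNoDevice:
case cudaErrorInsufficientDriver:
  fprintf(stderr, "Did you start the job on a CPU partition?\n");
  device_count = 0;
  break;
case cudaSuccess:
  break;
default:
  gtGpuCheck(code); // standard error handling for all other errors
  break;
}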

/* Silently set the return value to 0 */
device_count = 0;
break;
case hipSuccess: break;
@bd4 (Contributor) commented:

Also needs fallback with gtGpuCheck(code).
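That is, the HIP switch would gain the analogous default case, e.g. (sketch):

default:
  gtGpuCheck(code); // standard error handling for all other errors
  break;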

@td-mpcdf (Contributor, Author) commented Dec 8, 2023

Needs fallbacks and SYCL implementation. I can do the SYCL implementation as a separate PR if you don't have a machine with SYCL, just let me know.

I do not have a SYCL system, so if you could do it, that would be nice. The fallback code has been added now.

@bd4 (Contributor) commented Dec 8, 2023

Clang-format wants to put gtGpuCheck on one line, but that breaks the macro :P. I'm trying to figure out the least disruptive solution here. I'm thinking of maybe changing it to an if/else if/else clause.
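For reference, a sketch of the if/else form (assumed from the discussion; not necessarily the exact committed code):

if (code == cudaErrorNoDevice) {
  device_count = 0;
} else if (code == cudaErrorInsufficientDriver) {
  fprintf(stderr, "Error in cudaGetDeviceCount: %d (%s)\n", code,
          cudaGetErrorString(code));
  fprintf(stderr, "Did you start the job on a CPU partition?\n");
  device_count = 0;
} else if (code != cudaSuccess) {
  gtGpuCheck(code); // braced form keeps the macro on its own line
}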

@bd4 (Contributor) commented Dec 8, 2023

I think part of the problem is that gtensor dies on errors in get_device_count, which doesn't give the application a chance to interpret the error and decide whether it should die or do something else.

work around limitations in gtGpuCheck macro
} else if (code == cudaErrorInsufficientDriver) {
fprintf(stderr, "Error in cudaGetDeviceCount: %d (%s)\n", code,
cudaGetErrorString(code));
fprintf(stderr, "Did you start the job on a CPU partition?\n");
@bd4 (Contributor) commented:

This is kind of an app-specific message, but given that we have no way for the application to define its own message, I guess I am OK with it. It really begs the question of whether we should propagate certain errors up to the caller.

@germasch (Contributor) commented Dec 8, 2023

Well, I'd say, as usual: if you make your life too easy at first, it comes back to bite you eventually.

I think I was kinda aware that we're not really putting any work into error handling, and actually, a lot of the time in HPC I think that's just fine (e.g., if you get an error from a kernel execution, it's not likely there's a way you can recover, so you might as well abort right there).

But in a case like this, I agree that it's not gtensor that should make the decision about what to do; it should be left to the application (GENE). I guess I'm not opposed to the current PR as a band-aid, but it'd be nice to handle this better in the future, which probably means finding a way to return an error code (potentially breaking the API). An alternative solution for the issue at hand might be some additional function that one could call first to make sure that CUDA (or whatever backend) is set up correctly. Actually, how about returning -1 when an error occurs, as opposed to 0 if there's just no actual device? That way, the application could at least be a bit more explicit about what went wrong.
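As a sketch of that last idea (illustrative only; not what this PR implements):

#include <cuda_runtime.h>

inline int get_device_count()
{
  int device_count;
  cudaError_t code = cudaGetDeviceCount(&device_count);
  if (code == cudaErrorNoDevice) {
    return 0; // no GPU present, but the setup is otherwise fine
  }
  if (code != cudaSuccess) {
    return -1; // driver or configuration problem; the caller can report it
  }
  return device_count;
}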

A potential problem to be mindful of: someone might mean to run GENE on the GPU partition, but due to some config/build issue the CUDA versions don't match, and so GENE could end up using, e.g., just 8 CPU cores per node rather than the 8 GPUs it can't access. So I think it's important to still abort (in GENE, which I think is what you're suggesting), rather than just running on CPU, which would very inefficiently eat up computing time...

--Kai

@bd4 (Contributor) commented Dec 8, 2023

This PR at least allows the application to raise some kind of helpful error to the user if the GPU count is 0 and it's trying to run in GPU mode. I think we merge this and refine further in the future. It could raise an exception, or we could define gtensor error codes that are standardized across the backends (HIP and CUDA are largely similar, but SYCL tends to use exceptions).
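Purely as a sketch of that error-code idea (names are illustrative; nothing like this exists in gtensor yet):

// Hypothetical backend-agnostic status codes; each backend would map its
// native errors (cudaError_t, hipError_t, SYCL exceptions) onto these.
enum class gt_error {
  success = 0,         // call succeeded, count is valid
  no_device,           // no GPU present (e.g. cudaErrorNoDevice)
  insufficient_driver, // driver too old or missing
  other                // anything else; consult the backend-specific error
};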

@bd4 (Contributor) commented Dec 8, 2023

Regarding negative numbers as errors, I think that is reasonable, and we could have error codes for the negative numbers. However, 0 for the special case of no device or no driver seems reasonable to me. My vote is to merge as is and refine in a future PR.

@bd4 merged commit 473e685 into wdmapp:main on Dec 8, 2023. 18 checks passed.