Add option not to abort on cuda malloc errors #1083

Open
WilliamTambellini opened this issue Jan 23, 2025 · 3 comments

@WilliamTambellini
Contributor

WilliamTambellini commented Jan 23, 2025

As of today, ggml force-aborts the process whenever there is a CUDA malloc failure, e.g.:

#2  0x00007f99c75ca66e in ggml_abort.cold () 
#3  0x00007f99c7b57882 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () 
#4  0x00007f99c7b5ae80 in ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#5  0x00007f99c7b648a6 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#6  0x00007f99c7b6a2e8 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#7  0x00007f99c7b21715 in ggml_backend_sched_graph_compute_async () from /home/nbuild/pub/xmt/latest/lib/libsdl-xnn-ggml.so

This is not ideal in some production contexts where we need a controlled way to return an OOM error and exit/reload/resume/skip gracefully.
Would you mind if I make the following changes?

  • add an option (e.g. GGML_NO_ABORT_ON_OOM) to skip the abort on malloc failures
  • return GGML_STATUS_ALLOC_FAILED to the upper calls in the stack (ggml_cuda_mul_mat, ...) if cuda_malloc fails (a rough sketch is below)

Note:

  • by default, ggml would still behave the same as today: abort in all cases
  • this would apply only to malloc failures; ggml would still abort in all other cases.
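
For illustration, a rough sketch of what I have in mind (ggml_cuda_try_malloc is a hypothetical name, not an existing ggml function; the real patch would go through the existing CUDA allocation/check paths):

#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include "ggml.h" // enum ggml_status, GGML_ABORT

// Hypothetical helper, sketch only: attempt a device allocation and report the
// failure to the caller instead of aborting when GGML_NO_ABORT_ON_OOM is defined.
static enum ggml_status ggml_cuda_try_malloc(void ** ptr, size_t size) {
    cudaError_t err = cudaMalloc(ptr, size);
    if (err == cudaSuccess) {
        return GGML_STATUS_SUCCESS;
    }
#ifdef GGML_NO_ABORT_ON_OOM
    // proposed behavior: surface the OOM so callers (ggml_cuda_mul_mat, ...,
    // ggml_backend_cuda_graph_compute) can return GGML_STATUS_ALLOC_FAILED
    fprintf(stderr, "ggml_cuda: cudaMalloc of %zu bytes failed: %s\n",
            size, cudaGetErrorString(err));
    return GGML_STATUS_ALLOC_FAILED;
#else
    // current behavior: abort the process
    GGML_ABORT("cudaMalloc failed");
#endif
}

The call sites in the CUDA backend would then propagate GGML_STATUS_ALLOC_FAILED up to ggml_backend_cuda_graph_compute instead of calling ggml_abort.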

Best
W.

@slaren
Member

slaren commented Jan 24, 2025

The goal of returning an error instead of crashing is good, we definitely need to do that. The problem is that much of the existing code does not check the error code returned by the graph_compute functions, so making this change could cause existing code to start failing silently in strange ways. I imagine that's where the idea of gating this behind a compile-time switch comes from. Ideally we would at least fix the existing code in llama.cpp to make sure that it always checks for errors, but if the code is good and ensures that the CUDA backend never ends up in an inconsistent state due to this, the change as proposed here would still be good enough to merge.
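
For example, the kind of caller-side check that is currently missing looks roughly like this (sketch only; compute_with_oom_check is just a placeholder name):

#include "ggml.h"
#include "ggml-backend.h"

// Sketch: check the status returned by graph_compute instead of assuming it
// always succeeds, and give the application a controlled OOM path.
static bool compute_with_oom_check(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
    enum ggml_status status = ggml_backend_sched_graph_compute(sched, graph);
    if (status == GGML_STATUS_ALLOC_FAILED) {
        // e.g. free caches, shrink the batch, or reload the model and retry
        return false;
    }
    return status == GGML_STATUS_SUCCESS;
}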

@WilliamTambellini
Contributor Author

Thanks Diego.
Preparing a minimal PR for early review, if you don't mind.

@WilliamTambellini
Contributor Author

OK Diego (@slaren), ready for a first review please:
#1110
Thanks
