RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) #725
Comments
Hi @vivekpandian08, this sounds very much like a problem I had. You can find the description and the fix here: You can either:
I hope that helps.
Hi @jmorlock, Thank you for sharing! I appreciate the reference to #722 and the options for handling this issue. Currently, I’m using 56 latent factors for my model. Thanks again for the help!
Hi @vivekpandian08, from a theoretical point of view you should select the parameter set for which your model performance is optimal. From a practical point of view, if you are stuck with the current version of implicit that has the bug I explained, you must select the number of factors so that the integer overflow does not occur:
You can still do a hyperparameter search as explained above, but now with this maximum as the upper bound for the number of factors. If the optimal value is below that number, you are in luck. I hope that helps.
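To make the upper bound concrete, here is a minimal sketch. It assumes the overflow happens when the element count of a factor matrix (rows × factors) no longer fits in a signed 32-bit integer; the exact condition is described in #722, so treat this as an approximation. The helper names `overflows_int32` and `max_safe_factors` are made up for illustration and are not part of implicit.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def overflows_int32(n_rows: int, factors: int) -> bool:
    """Return True if n_rows * factors no longer fits in a signed 32-bit int."""
    return n_rows * factors > INT32_MAX


def max_safe_factors(n_users: int, n_items: int) -> int:
    """Largest factor count that keeps both factor matrices within int32 range.

    Assumption: the larger of the two dimensions is the one that overflows first.
    """
    return INT32_MAX // max(n_users, n_items)


# Sizes reported in this issue.
n_users, n_items = 50_000_000, 360_000

print(overflows_int32(n_users, 56))          # True: 2.8e9 elements exceed INT32_MAX
upper = max_safe_factors(n_users, n_items)
print(upper)                                 # 42

# Restrict a (hypothetical) hyperparameter grid to the safe range.
candidates = [f for f in (16, 32, 48, 64, 96, 128) if f <= upper]
print(candidates)                            # [16, 32]
```

Under that assumption, 56 factors with 50 million users is above the limit, which would be consistent with the illegal memory access reported here.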
Hi @jmorlock, Thank you for the detailed explanation! I was already aware of the theoretical approach to finding the optimal number of latent factors, but your practical method for avoiding integer overflow is really helpful. Setting an upper boundary by calculating based on the matrix size makes perfect sense, especially with the current constraints in the implicit version. Thanks again!
Hi @jmorlock, I’m trying to clone the repository https://github.com/jmorlock/implicit and build it locally, but I’m encountering an error at this step: [13/34] Generating CXX source implicit/cpu/_als.cxx I’ve already uninstalled the existing implicit library from my environment to avoid conflicts. Could you provide any guidance on how to resolve this issue? Are there any specific dependencies or configurations I might be missing? Thanks in advance for your help!
Hi @vivekpandian08, sorry for the late reply. I am not sure whether I can help you with this error. But I can tell you what I did in order to build implicit.
In case of success you will find a I hope this helps.
Description:
I encountered a RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) while running the implicit.gpu.als model on a large dataset. The error may be related to memory handling or CUDA library compatibility issues.
System Information:
Dataset size:
Number of users: 50 million
Number of items: 360,000
GPU: NVIDIA A100 (40 GB)
Memory Usage: Approximately 13,943 MiB / 40,960 MiB
CUDA Version: 12.4
Library Versions:
implicit: latest (0.7.2)
torch: 2.5.1
Issue Details: When running the model, the following error occurs:
RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196)
-->model.fit(weighted_matrix)
-->self.solver.least_squares(Cui, X, _YtY, Y, self.cg_steps)
This error happens consistently on my large dataset. The GPU has sufficient available memory (about 13,943 MiB is used out of 40,960 MiB). I have attempted the following troubleshooting steps:
Steps to Reproduce:
Expected Behavior: The model should train successfully on the A100 GPU without raising a CUDA error.
Actual Behavior: The CUDA error interrupts training, and the model cannot proceed further.
Additional Notes: This issue may relate to handling large datasets or to CUDA 12.4 compatibility with implicit.gpu. Any insights on possible fixes or workarounds would be greatly appreciated!
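As a rough sanity check on the memory figures above, the size of the factor matrices can be estimated by hand. This sketch assumes the factors are stored as dense float32 matrices (an assumption, not confirmed in this thread) and ignores the sparse input matrix and any solver workspace. The helper `factor_matrix_bytes` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: factors are stored as 32-bit floats


def factor_matrix_bytes(n_rows: int, factors: int,
                        bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Size in bytes of a dense n_rows x factors matrix."""
    return n_rows * factors * bytes_per_value


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Sizes and factor count reported in this issue.
user_bytes = factor_matrix_bytes(50_000_000, 56)
item_bytes = factor_matrix_bytes(360_000, 56)

print(f"user factors: {to_gib(user_bytes):.2f} GiB")
print(f"item factors: {to_gib(item_bytes):.2f} GiB")
print(f"total:        {to_gib(user_bytes + item_bytes):.2f} GiB of 40 GiB")
```

If the assumption holds, the factor matrices take roughly 10.5 GiB, well under the 40 GiB available, which suggests the failure is not a plain out-of-memory condition.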