CUDA Error #57
Is this on a local machine?
Yes. I have rebooted both machines and reinstalled/recompiled all my Julia packages to clear any bad versions. The two machines (one a laptop, the other a desktop) have different NVIDIA GPUs and drivers, so I do not think it is a driver issue, though that can never be ruled out. I will try messing with the drivers in the meantime.
What version of CUDA.jl are you using? Can I see the result of [...]? I have seen many different problems resulting in Code 8 errors, including out-of-memory errors (are you sure you have enough memory on your GPU to accommodate your network?) and bugs in CUDA.jl (AlphaZero.jl is a stress test for CUDA.jl).
PS: I love the idea of using AlphaZero on Tetris!
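For reference, a minimal way to gather that information from the Julia REPL (both calls are standard Pkg / CUDA.jl API):

```julia
# Report the installed CUDA.jl version plus the driver/toolkit the package sees.
using Pkg, CUDA

Pkg.status("CUDA")    # version of CUDA.jl in the active environment
CUDA.versioninfo()    # CUDA driver, toolkit, and visible GPU devices
```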
This is from my laptop, so it does not have a lot of memory, but the GPU on my desktop has 8 GB. Based on the stack trace, I'm pretty sure I am using the Flux backend. I had tried to reduce the memory usage by reducing the number of boards stored, not the network size. I can try that next.
I see nothing wrong with your [...]. There are scripts in the [...]. Note that it is also possible that what you are observing comes from a problem with CUDA.jl, as I have seen this happen in the past.
This is 100% due to memory constraints on the GPU, and I agree with the suggestion to lower the batch size. How much VRAM do you have? I'm not sure, but I'd assume the size of your vectorized states affects memory usage as well. Recently I've been trying to get the most out of both my CPU and GPU, and in my experience it's typically very much a trial-and-error balancing act.
Another thing to note is that AlphaZero.jl appears to preallocate all available GPU VRAM, so the reported VRAM usage is not a good way to measure what the network actually needs.
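Because of that pooling behaviour, nvidia-smi is not a reliable gauge; a minimal way to check actual usage from inside Julia (all four calls are part of CUDA.jl's public API):

```julia
using CUDA

CUDA.memory_status()     # live vs. cached allocations inside CUDA.jl's pool
CUDA.available_memory()  # bytes the driver can still hand out on this device
CUDA.total_memory()      # total device memory in bytes
GC.gc(); CUDA.reclaim()  # release cached pool memory back to the driver
```

If this shows the pool nearly full right before a checkpoint evaluation starts, lowering the batch size as suggested above is the natural next step.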
Thank you all for your help so far. I just need to get some simple results this week, so I'll run this on the CPU for now, but I will be back in a couple of weeks to work through this and then maybe set up a PR for Tetris.
If you want to get results on CPU, you probably need to simplify the problem somehow (for example by looking at a smaller grid). I suspect that original Tetris is too complicated for AlphaZero to learn the game in a reasonable amount of time without a GPU. That being said, I may be wrong here. In any case, you will need to use a much smaller network if you want to train your agent on CPU. |
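To make "much smaller network" concrete, here is a hedged sketch: the hyperparameter names follow the ResNet configuration used in AlphaZero.jl's connect-four example, so they may need adapting to the version used in this repo, and the numbers are only a starting point.

```julia
using AlphaZero

# A deliberately small ResNet for CPU experiments: fewer residual blocks and
# filters shrink both the parameter count and the per-batch activation memory.
netparams = AlphaZero.NetLib.ResNetHP(
  num_filters = 32,              # the connect-four example uses 128
  num_blocks = 3,                # the connect-four example uses 5
  conv_kernel_size = (3, 3),
  num_policy_head_filters = 16,
  num_value_head_filters = 16,
  batch_norm_momentum = 0.1)
```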
@Pandabear314 One thing you may also want to do is to update all dependencies using |
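Assuming the standard Pkg workflow is what was meant here (the exact command is cut off above), a minimal sketch:

```julia
# Update every dependency in the active environment, then restart Julia so the
# new versions (in particular CUDA.jl) are actually loaded.
using Pkg
Pkg.update()
Pkg.status()   # confirm the resolved versions
```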
@SheldonCurtiss was correct that the batch size was the culprit behind my running out of VRAM; everything runs correctly once I reduce it. Also, the PR may take some time, as I will have to reformulate how Tetris is run by AlphaZero: my current implementation does not learn, but I have a few ideas left to try.
While attempting to use AlphaZero for Tetris, I keep running into this error when running on the GPU. I have reproduced it on two separate machines, and it happens consistently when launching a checkpoint evaluation. I am wondering if someone has insight into what might be causing this.
Repo:
https://gitlab.com/samdickinson314/tetrisai
include("runner.jl")