'invalid device context' and other concurrency issues #261
Having many bound threads means we can hit the limit on the number of OS threads [1]. Also, I am concerned that the [...]. Perhaps have a single thread, created via `forkOS`, that all CUDA actions are sent to?

[1] Due to (surprise!) finaliser-related bugs in the llvm backend, we hit this limit pretty quickly running nofib.
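A minimal sketch of what such a dedicated bound thread could look like, assuming a plain channel of queued IO actions; the `Worker` type and the `newWorker`/`submit` functions below are illustrative only and not part of any existing package:

```haskell
import Control.Concurrent      ( forkOS )
import Control.Concurrent.Chan ( Chan, newChan, readChan, writeChan )
import Control.Concurrent.MVar ( newEmptyMVar, putMVar, takeMVar )
import Control.Monad           ( forever, join )

-- One bound OS thread that executes every action sent to it, in order.
-- If all CUDA calls go through 'submit', the driver always sees the same
-- OS thread, and hence the same thread-local context.
newtype Worker = Worker (Chan (IO ()))

newWorker :: IO Worker
newWorker = do
  queue <- newChan
  _     <- forkOS (forever (join (readChan queue)))
  return (Worker queue)

-- Run an action on the worker thread and wait for its result.
-- Note: exceptions thrown by 'action' are not propagated back here;
-- a real implementation would have to deal with that.
submit :: Worker -> IO a -> IO a
submit (Worker queue) action = do
  result <- newEmptyMVar
  writeChan queue (action >>= putMVar result)
  takeMVar result
```

The reply below points out some problems with funnelling everything through a single bound thread.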
I think we only need to have as many bound threads as [...]. I'm not sure having a single bound thread that is sent actions to execute would work out so well. Firstly, as you say, things like [...]. I have a very simple example here that highlights the issue at hand:

```haskell
module Main where

import Foreign.CUDA.Driver.Context hiding ( device )
import Foreign.CUDA.Driver.Device ( device, initialise )
import Control.Concurrent ( forkOn, forkIO, forkOS, yield )
import Control.Concurrent.MVar
main = do
  var <- newEmptyMVar
  forkIO $ do
    initialise []
    dev <- device 0
    ctx <- create dev []
    yield
    pop
    putMVar var ()
  readMVar var
```

If you compile this with `-threaded`, you can see the 'invalid device context' error from the title: after the `yield` the unbound thread may be scheduled onto a different OS thread, so the `pop` no longer sees the context that `create` made current.
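As a rough illustration of the fix being discussed in this thread (a sketch only, not the library's actual implementation): moving the forked computation onto a bound thread, either by swapping `forkIO` for `forkOS` or by wrapping the body in `runInBoundThread`, keeps it on one OS thread, so the context created before the `yield` is still current afterwards.

```haskell
module Main where

import Foreign.CUDA.Driver.Context hiding ( device )
import Foreign.CUDA.Driver.Device  ( device, initialise )
import Control.Concurrent          ( forkOS, yield )
import Control.Concurrent.MVar

main :: IO ()
main = do
  var <- newEmptyMVar
  _   <- forkOS $ do     -- bound thread: it cannot migrate between OS
    initialise []        -- threads, so the context made current by
    dev <- device 0      -- 'create' is still current when 'pop' runs
    _   <- create dev []
    yield
    _   <- pop
    putMVar var ()
  readMVar var
```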
Are there any particular examples you can think of where enough calls to [...]?
This also stops the finalizer from having to make CUDA API calls. See AccelerateHS/accelerate#261.
I should also add that if we find that we still run out of OS threads, even when they're not used for finalizers, we could possibly overcome that by managing our own thread pool. It still isn't ideal, but it would at least remove that particular problem.
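Extending the single-worker idea earlier in the thread, such a pool could simply fork a fixed number of bound workers over one shared queue. Again only a sketch with hypothetical names (`newBoundPool` does not exist in the library):

```haskell
import Control.Concurrent      ( forkOS )
import Control.Concurrent.Chan ( Chan, newChan, readChan )
import Control.Monad           ( forever, join, replicateM_ )

-- Hypothetical: n bound worker threads draining one shared queue of actions,
-- so the number of OS threads stays fixed no matter how much work arrives.
newBoundPool :: Int -> IO (Chan (IO ()))
newBoundPool n = do
  queue <- newChan
  replicateM_ n (forkOS (forever (join (readChan queue))))
  return queue
```

Work would be submitted by writing an action into the queue, paired with an `MVar` for the result, exactly as in the single-worker sketch above.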
You know, I think these are all fixed now; I have not seen these problems in a long time.
This was talked about in #227 and in AccelerateHS/accelerate-cuda#11, but I figured it was best to create a dedicated ticket for it. I suspect that this is also strongly related to #260.
The good news is I think I have found the problem. The bad news is that the only solution I have come up with has a performance penalty and requires rethinking a few things. Basically, we never ensure that all our calls to the CUDA API are made from bound threads. Because CUDA depends on thread-local state (the "context"), we should be using `forkOS` instead of `forkIO`. This is unfortunate, as a context switch between bound threads is significantly more expensive than between `forkIO` threads.

The other problem this brings up is CUDA API calls in finalizers. Finalizers are also not run in bound threads. The simplest solution to this is to use `runInBoundThread` in the finalizers themselves, but given the number of finalizers firing now, I think it would be a significant performance hit to do that. I believe that by caching resources (like events) I can get it so that the only time finalizers will need to be bound is at program exit, where the performance cost is not such a problem. I think that would be a better solution.

Unless anyone has a better way of solving this (looking mostly at you @tmcdonell), I'm going to go ahead and make the necessary changes.
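To make the `runInBoundThread` option concrete, a minimal sketch of attaching a finalizer that is forced onto a bound thread before it makes any CUDA calls; the `attachBoundFinalizer` helper is hypothetical, not part of the library:

```haskell
import Control.Concurrent ( runInBoundThread )
import Foreign.Concurrent ( newForeignPtr )
import Foreign.ForeignPtr ( ForeignPtr )
import Foreign.Ptr        ( Ptr )

-- Hypothetical helper: run the release action on a bound thread, so any
-- CUDA calls it makes see a valid thread-local context. Requires -threaded.
attachBoundFinalizer :: Ptr a -> IO () -> IO (ForeignPtr a)
attachBoundFinalizer ptr release =
  newForeignPtr ptr (runInBoundThread release)
```

This is exactly the per-finalizer overhead described above, which is why caching resources so that finalizers rarely need to touch the API is the preferred route.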