Start cuBLAS backend support. #18
Conversation
- Refactored common codegen logic into a new `DslGenBase` trait. TODO: reduce code duplication between code generators and drivers. Defining a base trait for code generators makes sense.
Relevant info for code review:
```
$ ./snippet asdf
[1] 1914 bus error (core dumped)  ./snippet asdf
$ gdb snippet
...
Thread 1 "snippet" received signal SIGBUS, Bus error.
Snippet (x0=<optimized out>) at snippet.cpp:144
144         float x32 = x30 - x31;
```
Force-pushed from 936c2b6 to 6e6f063.
- `Backend` now defines a default `dot` method that dispatches to separate `v*v`, `m*v`, `m*m` methods (see the sketch below).
- Implement the `dot` methods for cuBLAS:
  - `v*v`: `cublasSdot`
  - `m*v`: `cublasSgemv`
  - `m*m`: `cublasSgemm`
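For concreteness, here is a minimal sketch of that dispatch. The `Tensor` stand-in, the rank-based dispatch, and the per-case method names are hypothetical; the actual signatures live in this PR.

```scala
// Minimal stand-in for the project's Tensor class; only the rank matters here.
case class Tensor(rank: Int)

trait Backend {
  // Default `dot` dispatches on the operand ranks.
  def dot(x: Tensor, y: Tensor): Tensor = (x.rank, y.rank) match {
    case (1, 1) => vectorVectorDot(x, y) // cuBLAS: cublasSdot
    case (2, 1) => matrixVectorDot(x, y) // cuBLAS: cublasSgemv
    case (2, 2) => matrixMatrixDot(x, y) // cuBLAS: cublasSgemm
    case _      => sys.error(s"unsupported ranks for dot: ${x.rank}, ${y.rank}")
  }

  def vectorVectorDot(x: Tensor, y: Tensor): Tensor
  def matrixVectorDot(x: Tensor, y: Tensor): Tensor
  def matrixMatrixDot(x: Tensor, y: Tensor): Tensor
}
```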
The cuBLAS test suite is currently disabled (`isGPUAvailable` is set to false); otherwise, Travis CI will fail. TODO:

- Implement `isGPUAvailable` to actually detect whether GPU codegen is possible (one possible approach is sketched below).
- Factor test utility methods into a common base class.
- Set up GPU CI.
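For the first TODO, a hedged sketch of a detection check that shells out to `nvidia-smi`. This is only one possible mechanism; the command name and failure modes are assumptions, not what the repo does.

```scala
import scala.sys.process._

// True iff `nvidia-smi` is on the PATH and exits successfully,
// a reasonable proxy for "a usable NVIDIA GPU is present".
def isGPUAvailable: Boolean =
  try Seq("nvidia-smi").! == 0
  catch { case _: java.io.IOException => false }
```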
`withBackend` explicitly demarcates code that should be run on a different backend. It transfers inputs/results between backends automatically. Design info: feiwang3311#8 (comment)
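As a rough, self-contained illustration of the demarcation idea (all names are hypothetical and the transfer logic is elided; the real design is in the linked discussion):

```scala
trait Backend
object BackendCPU    extends Backend
object BackendCublas extends Backend

var backend: Backend = BackendCPU

// Run `body` with `b` as the active backend, restoring the previous backend
// afterwards. Input/result transfers would be inserted at the two boundaries.
def withBackend[T](b: Backend)(body: => T): T = {
  val saved = backend
  backend = b
  try body finally backend = saved
}
```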
These comments bloat the code. Links to the cuBLAS API reference still exist for each method.
Force-pushed from 53481e1 to b4a5187.
I wanted to try it on my iMac, but it seems like there is no CUDA for Mojave yet ...
This looks like a great start! Couple of comments:

- Let's separate out the `withBackend` functionality into a separate PR. I think we need to get a basic GPU-only version working first, and I expect getting all the transfer logic right with multiple backends will take several rounds of debugging; perhaps we want to consider alternative designs as well.
- Instead of globally rewiring the codegen for `ArrayNew`, I'd propose to introduce a `malloc`-like operation in the backend trait, and some way of freeing memory in a coarse-grained way (`freeAll` or something like this). This makes memory management for Tensors a responsibility of the CPU/GPU backend.
- The code duplication between the codegen back-ends is indeed unfortunate. It seems like the GPU-enabled codegen templates only need to add things, so I'd propose to refactor this with proper extension points, i.e. something like `includes += "<cuda_runtime.h>"` (see the sketch after this list).
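A minimal sketch of what such extension points might look like, assuming hypothetical generator traits built around a shared mutable `includes` list:

```scala
import java.io.PrintWriter
import scala.collection.mutable.ListBuffer

// The shared template owns the extension point ...
trait DslGenBase {
  val includes = ListBuffer("<stdio.h>", "<stdlib.h>")
  def emitIncludes(out: PrintWriter): Unit =
    includes.foreach(h => out.println(s"#include $h"))
}

// ... and the GPU-enabled generator only adds to it.
trait DslGenCublas extends DslGenBase {
  includes += "<cuda_runtime.h>"
  includes += "<cublas_v2.h>"
}
```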
```scala
// We can use a similar memory pool technique with `cudaMallocManaged`.
val arrType = remap(a.m)
stream.println(arrType + "* " + quote(sym) + "; " + getCudaMallocManagedString(quote(sym), quote(n), arrType))
// stream.println(arrType + "* " + quote(sym) + "; " + getCudaMallocString(quote(sym), quote(n), arrType))
```
Rewiring `NewArray` globally is problematic for various reasons. I think a better way would be to have explicit allocation operations for GPU. I'm also concerned about the performance of CUDA's Unified Memory layer.
This makes complete sense. Adding a `mallocTensor` operation in `Backend` seems like a good approach. `mallocTensor` returns `Rep[Array[Float]]` and should replace invocations of `NewArray[Float]`.

I'll try to flesh this out.
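Roughly along these lines, with `Rep` stubbed so the sketch stands alone (in the project it is the LMS staged type; all names here are tentative):

```scala
object MallocTensorSketch {
  type Rep[T] = T // stand-in for the LMS Rep[T]

  trait Backend {
    // Replaces direct `NewArray[Float](n)` invocations in Tensor code.
    def mallocTensor(n: Rep[Int]): Rep[Array[Float]]
  }

  // CPU backend: plain array allocation. A cuBLAS backend would instead
  // stage a cudaMalloc/cudaMallocManaged call in the generated code.
  class BackendCPU extends Backend {
    def mallocTensor(n: Rep[Int]): Rep[Array[Float]] = new Array[Float](n)
  }
}
```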
```cpp
cublasHandle_t handle;

int main(int argc, char *argv[]) {
  CUBLAS_CALL(cublasCreate(&handle));
  // ...
```
These don't have to be in the codegen template, but could be explicitly invoked by the cuBLAS backend trait.
Could you please clarify the design you have in mind?
I was thinking of explicit `init` and `cleanup` methods in the backend interface.
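For example, something along these lines in the backend trait (a sketch; the comments describe what a cuBLAS implementation would do):

```scala
trait Backend {
  def init(): Unit    // cuBLAS backend: create the cublasHandle_t
  def cleanup(): Unit // cuBLAS backend: destroy the handle
}
```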
I wonder how to hook up the `init`/`cleanup` methods defined in `Backend` with the codegen template. AFAICT, the `DslDriverXXX` trait (which contains `DslGenXXX`) and the `TensorExp` trait (which contains `Backend`) are decoupled, so I'm not sure how to invoke `Backend`-defined methods in `DslGenXXX`, where the codegen templates are defined.

Perhaps I'm missing something; do you have an idea of how to hook things up?
Ah, I didn't mean that the code generator would invoke these methods, but rather that they should be called from the frontend somewhere. Does that help clear things up?
I'm sorry, I'm still a bit confused. Do you mean something like this?
```scala
val foo = new DslDriverCublas[String, Unit] with TensorExp {
  backend = new BackendCublas

  @virtualize
  def snippet(x: Rep[String]): Rep[Unit] = {
    backend.init()
    // ...
    Tensor.assertEqual(...)
    backend.cleanup()
  }
}
```
If so, I don't quite believe this is the best approach. I would argue backend setup/cleanup are implementation details that should be hidden from DSL users if possible. Also, if `backend.init()` initializes `cublasHandle_t handle`, then it really should be performed before any computation in the snippet. Burdening DSL users with that responsibility is suboptimal.
I wasn't necessarily thinking users should call these, but for example `init`/`reset` could ideally be hidden in a `withGPU { ... }` construct, or even run as part of the `DslDriver` class before and after invoking `snippet`.
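For instance, a minimal sketch of such a construct, reusing the lifecycle hooks above (`withGPU` itself is hypothetical):

```scala
trait Backend { def init(): Unit; def cleanup(): Unit }

// Hides the lifecycle calls from DSL users: `body` runs between init and cleanup.
def withGPU[T](backend: Backend)(body: => T): T = {
  backend.init()
  try body finally backend.cleanup()
}
```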
In general, my point here is that codegen templates should be as minimal as possible, and we should define the application logic as far as possible using metaprogramming.
I see, that makes sense - thanks for the clarification.
Along those lines, a straightforward idea is to create a function wrapping `snippet` and compiling that function during codegen:
```scala
// `wrapper` could use a better name.
def wrapper(x: Rep[String]): Rep[Unit] = {
  backend.init()
  val result = snippet(x)
  backend.cleanup()
  result
}

// During code generation, compile the wrapper function instead of `snippet`:
lazy val code: String = {
  val source = new java.io.StringWriter()
  codegen.emitSource(wrapper, "Snippet", new java.io.PrintWriter(source))
  source.toString
}
```
If this seems reasonable (and it works), I'll implement it.
Note: this `wrapper` does add complexity to the simple mental model of "the `snippet` function is compiled, WYSIWYG". It would be good to make `setup` and `cleanup` idempotent if possible, in case DSL users call the functions themselves within `snippet`.
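If idempotency is ever needed, a guard flag would be one way to get it (a sketch under assumed names, not the PR's implementation):

```scala
class BackendCublas {
  private var initialized = false

  // Only the first call actually creates the handle; later calls are no-ops.
  def init(): Unit = if (!initialized) {
    initialized = true
    // ... emit cublasCreate(&handle) here ...
  }
}
```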
That looks good, yes. Let's not worry about idempotency at this point.
Add `withBackend` in a separate PR for separation of concerns.
A first step towards #8.
- `dot` for `BackendCublas`; refactor `Backend` trait.
- `withBackend` device placement function.