
Start cuBLAS backend support. #18

Merged (11 commits) on Oct 7, 2018

Conversation

dan-zheng (Collaborator)

A first step towards #8.

  • Add cuBLAS/cuDNN code generators and drivers.
  • Implement dot for BackendCublas, refactor Backend trait.
  • Add cuBLAS code generation test.
  • Implement withBackend device placement function.

  • Refactored common codegen logic into a new `DslGenBase` trait.

TODO: reduce code duplication between code generators and drivers.
Defining a base trait for code generators makes sense.
@dan-zheng dan-zheng requested review from GSAir and feiwang3311 October 6, 2018 17:32
dan-zheng (Collaborator, Author) commented Oct 6, 2018

Relevant info for code review:

  • There's a lot of code duplication in the code generators and DSL drivers. Let's tackle that in follow-up commits.
  • The cuBLAS tests fail on my GPU machine for some reason; if someone else has a GPU, please try running the tests. I'm not sure why this occurs, since cudaMallocManaged is used to allocate the tensors:
$ ./snippet asdf
[1]    1914 bus error (core dumped)  ./snippet asdf
$ gdb snippet
...
Thread 1 "snippet" received signal SIGBUS, Bus error.
Snippet (x0=<optimized out>) at snippet.cpp:144
144	float x32 = x30 - x31;
  • It would be nice to set up GPU CI, but I don't think there are free services providing it. One solution adopted by other codebases is Jenkins CI + a personal GPU machine.

- `Backend` now defines a default `dot` method that dispatches to
separate v*v, m*v, m*m methods.
- Implement `dot` methods for cuBLAS.
  - v*v: cublasSdot
  - m*v: cublasSgemv
  - m*m: cublasSgemm
The cuBLAS test suite is currently disabled (`isGPUAvailable` is set to
false). Otherwise, Travis CI will fail.

TODO:
- Implement `isGPUAvailable` to actually detect whether GPU codegen is
possible.
- Factor test utility methods into a common base class.
- Set up GPU CI.
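The rank-based dispatch described above could look roughly like the following sketch. This is an illustration, not the PR's actual code: the method names (`vectorVectorDot`, etc.) and the `rank` field are assumptions, and `Tensor` stands in for the project's tensor type.

```scala
// Hypothetical sketch of the default `dot` dispatch in the Backend trait.
// Names and signatures here are illustrative, not the PR's actual API.
trait Backend {
  def vectorVectorDot(x: Tensor, y: Tensor): Tensor
  def matrixVectorDot(a: Tensor, x: Tensor): Tensor
  def matrixMatrixDot(a: Tensor, b: Tensor): Tensor

  // Default `dot` dispatches on operand ranks; concrete backends
  // (e.g. BackendCublas) only implement the three primitives.
  def dot(x: Tensor, y: Tensor): Tensor = (x.rank, y.rank) match {
    case (1, 1) => vectorVectorDot(x, y)  // cuBLAS: cublasSdot
    case (2, 1) => matrixVectorDot(x, y)  // cuBLAS: cublasSgemv
    case (2, 2) => matrixMatrixDot(x, y)  // cuBLAS: cublasSgemm
    case _      => sys.error("dot requires ranks (1,1), (2,1), or (2,2)")
  }
}
```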
`withBackend` explicitly demarcates code that should be run on a
different backend. It transfers inputs/results between backends
automatically.

Design info: feiwang3311#8 (comment)
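A minimal sketch of the intended shape of `withBackend`, assuming a mutable `backend` variable in scope (as used elsewhere in this PR); the input/result transfer logic is elided:

```scala
// Hypothetical sketch: run `body` under a different backend, restoring
// the original afterwards. Host<->device transfers are elided here.
def withBackend[T](b: Backend)(body: => T): T = {
  val saved = backend
  backend = b
  try body              // inputs/results would be transferred around `body`
  finally backend = saved
}
```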
These comments bloat the code. Links to the cuBLAS API reference are still included for each method.
TiarkRompf (Collaborator)

I wanted to try it on my iMac but it seems like there is no CUDA for Mojave yet ...

@feiwang3311 feiwang3311 self-assigned this Oct 6, 2018
TiarkRompf (Collaborator) left a comment

This looks like a great start! Couple of comments:

  • Let's separate out the withBackend functionality into a separate PR. I think we need to get a basic GPU-only version working first and I expect getting all the transfer logic right with multiple backends will take several rounds of debugging, and perhaps we want to consider alternative designs as well.
  • Instead of globally rewiring the codegen for ArrayNew, I'd propose to introduce a malloc-like operation in the backend trait, and some way of freeing memory in a coarse-grained way (freeAll or something like this). This makes memory management for Tensors a responsibility of the CPU/GPU backend.
  • The code duplication between the codegen back-ends is indeed unfortunate. It seems like the GPU-enabled codegen templates only need to add things, so I'd propose to refactor this with proper extension points (i.e. something like `includes += "<cuda_runtime.h>"`).
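The extension-point idea might look something like the sketch below, where the base generator owns a mutable list of headers and GPU-enabled generators only append to it. The trait and member names are illustrative, not the project's actual codegen API:

```scala
// Hypothetical sketch of an extension point for codegen headers.
trait DslGenC {
  val includes = scala.collection.mutable.ListBuffer(
    "<stdio.h>", "<stdlib.h>")
  def emitIncludes(out: java.io.PrintWriter): Unit =
    includes.foreach(h => out.println(s"#include $h"))
}

// The GPU generator only adds headers; it duplicates no template logic.
trait DslGenCublas extends DslGenC {
  includes += "<cuda_runtime.h>"
  includes += "<cublas_v2.h>"
}
```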

// We can use a similar memory pool technique with `cudaMallocManaged`.
val arrType = remap(a.m)
stream.println(arrType + "* " + quote(sym) + "; " + getCudaMallocManagedString(quote(sym), quote(n), arrType))
// stream.println(arrType + "* " + quote(sym) + "; " + getCudaMallocString(quote(sym), quote(n), arrType))
Collaborator
Rewiring NewArray globally is problematic for various reasons. I think a better way would be to have explicit allocation operations for GPU. I'm also concerned about performance of CUDA's Unified Memory layer.

dan-zheng (Collaborator, Author)

This makes complete sense. Adding a mallocTensor operation in Backend seems like a good approach. mallocTensor returns Rep[Array[Float]] and should replace invocations of NewArray[Float].

I'll try to flesh this out.
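Combined with the `freeAll` suggestion above, the backend-owned allocation primitive might be sketched as follows; the signatures are an assumption for illustration:

```scala
// Hypothetical sketch: a backend-provided allocation primitive that
// replaces direct `NewArray[Float]` calls in Tensor code.
trait Backend {
  // Allocate a Float array on this backend. The CPU backend could use the
  // existing memory pool; the cuBLAS backend would emit cudaMallocManaged.
  def mallocTensor(size: Rep[Int]): Rep[Array[Float]]

  // Coarse-grained cleanup, per the `freeAll` suggestion above.
  def freeAll(): Unit
}
```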

cublasHandle_t handle;

int main(int argc, char *argv[]) {
CUBLAS_CALL(cublasCreate(&handle));
TiarkRompf (Collaborator), Oct 6, 2018

These don't have to be in the codegen template, but could be explicitly invoked by the cuBLAS backend trait.

dan-zheng (Collaborator, Author)

Could you please clarify the design you have in mind?

Collaborator

I was thinking explicit init and cleanup methods in the backend interface.

dan-zheng (Collaborator, Author)

I wonder how to hook up the Backend-defined init/cleanup methods with the codegen template. AFAICT, the DslDriverXXX trait (which contains DslGenXXX) and the TensorExp trait (which contains Backend) are decoupled, so I'm not sure how to invoke Backend-defined methods in DslGenXXX, where the codegen templates are defined.

Perhaps I'm missing something; do you have an idea of how to hook things up?

Collaborator

Ah, I didn't mean that the code generator would invoke these methods, but rather that they should be called from the frontend somewhere. Does that help clear things up?

dan-zheng (Collaborator, Author)

I'm sorry, I'm still a bit confused. Do you mean something like this?

val foo = new DslDriverCublas[String, Unit] with TensorExp {
  backend = new BackendCublas

  @virtualize
  def snippet(x: Rep[String]): Rep[Unit] = {
    backend.init()
    // ...
    Tensor.assertEqual(...)
    backend.cleanup()
  }
}

If so, I don't quite believe this is the best approach. I would argue backend setup/cleanup are implementation details that should be hidden from DSL users if possible. Also, if backend.init() initializes cublasHandle_t handle, then it really should be performed before any computation in the snippet. Burdening DSL users with that responsibility is suboptimal.

TiarkRompf (Collaborator), Oct 7, 2018

I wasn't necessarily thinking users should call these; for example, init/reset could ideally be hidden in a withGPU { ... } construct, or even run as part of the DslDriver class before and after invoking snippet.

Collaborator

In general, my point here is that codegen templates should be as minimal as possible, and we should define the application logic as far as possible using metaprogramming.

dan-zheng (Collaborator, Author), Oct 7, 2018

I see, that makes sense - thanks for the clarification.

Along those lines, a straightforward idea is to create a function wrapping snippet and compiling that function during codegen:

// `wrapper` could use a better name.
def wrapper(x: Rep[String]): Rep[Unit] = {
  backend.init()
  val result = snippet(x)
  backend.cleanup()
  result
}

// During code generation:
lazy val code: String = {
  val source = new java.io.StringWriter()
  // Compile wrapper function.
  codegen.emitSource(wrapper, "Snippet", new java.io.PrintWriter(source))
  source.toString
}

If this seems reasonable (and it works), I'll implement it.

Note: this wrapper does add complexity to the simple mental model of "the snippet function is compiled, WYSIWYG". It would be good to make setup and cleanup idempotent if possible, in case DSL users call the functions themselves within snippet.

Collaborator

That looks good, yes. Let's not worry about idempotency at this point.

Add `withBackend` in a separate PR for separation of concerns.
@feiwang3311 feiwang3311 merged commit e30cd86 into feiwang3311:master Oct 7, 2018