
Start cuBLAS backend support. #18

Merged (11 commits) on Oct 7, 2018

Conversation

dan-zheng (Collaborator)

A first step towards #8.

  • Add cuBLAS/cuDNN code generators and drivers.
  • Implement dot for BackendCublas, refactor Backend trait.
  • Add cuBLAS code generation test.
  • Implement withBackend device placement function.

  • Refactored common codegen logic into a new `DslGenBase` trait.

TODO: reduce code duplication between code generators and drivers.
Defining a base trait for code generators makes sense.
@dan-zheng dan-zheng requested review from GSAir and feiwang3311 October 6, 2018 17:32
dan-zheng (Collaborator, Author) commented Oct 6, 2018

Relevant info for code review:

  • There's a lot of code duplication in the code generators and DSL drivers. Let's tackle that in follow-up commits.
  • The cuBLAS tests fail on my GPU machine for some reason; if someone else has a GPU, please try running the tests. I'm not sure why this occurs, since cudaMallocManaged is used to allocate the tensors:
$ ./snippet asdf
[1]    1914 bus error (core dumped)  ./snippet asdf
$ gdb snippet
...
Thread 1 "snippet" received signal SIGBUS, Bus error.
Snippet (x0=<optimized out>) at snippet.cpp:144
144	float x32 = x30 - x31;
  • It would be nice to set up GPU CI, but I don't think there are free services providing it. One solution adopted by other codebases is Jenkins CI + a personal GPU machine.

- `Backend` now defines a default `dot` method that dispatches to
separate v*v, m*v, m*m methods.
- Implement `dot` methods for cuBLAS.
  - v*v: cublasSdot
  - m*v: cublasSgemv
  - m*m: cublasSgemm
The cuBLAS test suite is currently disabled (`isGPUAvailable` is set to
false). Otherwise, Travis CI will fail.

TODO:
- Implement `isGPUAvailable` to actually detect whether GPU codegen is
possible.
- Factor test utility methods into a common base class.
- Set up GPU CI.
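The rank-based dispatch described above could look roughly like the following sketch. This is an illustration, not the PR's actual code: the method names (`vectorVectorDot`, etc.) and the `rank` field are assumptions, and `Tensor` stands in for the project's tensor type.

```scala
// Hypothetical sketch of the default `dot` dispatch in the Backend trait.
// Names and signatures here are illustrative, not the PR's actual API.
trait Backend {
  def vectorVectorDot(x: Tensor, y: Tensor): Tensor
  def matrixVectorDot(a: Tensor, x: Tensor): Tensor
  def matrixMatrixDot(a: Tensor, b: Tensor): Tensor

  // Default `dot` dispatches on operand ranks; concrete backends
  // (e.g. BackendCublas) only implement the three primitives.
  def dot(x: Tensor, y: Tensor): Tensor = (x.rank, y.rank) match {
    case (1, 1) => vectorVectorDot(x, y)  // cuBLAS: cublasSdot
    case (2, 1) => matrixVectorDot(x, y)  // cuBLAS: cublasSgemv
    case (2, 2) => matrixMatrixDot(x, y)  // cuBLAS: cublasSgemm
    case _      => sys.error("dot requires ranks (1,1), (2,1), or (2,2)")
  }
}
```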
`withBackend` explicitly demarcates code that should be run on a
different backend. It transfers inputs/results between backends
automatically.

Design info: feiwang3311#8 (comment)
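A minimal sketch of the intended shape of `withBackend`, assuming a mutable `backend` variable in scope (as used elsewhere in this PR); the input/result transfer logic is elided:

```scala
// Hypothetical sketch: run `body` under a different backend, restoring
// the original afterwards. Host<->device transfers are elided here.
def withBackend[T](b: Backend)(body: => T): T = {
  val saved = backend
  backend = b
  try body              // inputs/results would be transferred around `body`
  finally backend = saved
}
```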
These comments bloat the code. Links to the cuBLAS API reference are still included for each method.
TiarkRompf (Collaborator)

I wanted to try it on my iMac but it seems like there is no CUDA for Mojave yet ...

@feiwang3311 feiwang3311 self-assigned this Oct 6, 2018
TiarkRompf (Collaborator) left a comment

This looks like a great start! Couple of comments:

  • Let's separate out the withBackend functionality into a separate PR. I think we need to get a basic GPU-only version working first and I expect getting all the transfer logic right with multiple backends will take several rounds of debugging, and perhaps we want to consider alternative designs as well.
  • Instead of globally rewiring the codegen for ArrayNew, I'd propose to introduce a malloc-like operation in the backend trait, and some way of freeing memory in a coarse-grained way (freeAll or something like this). This makes memory management for Tensors a responsibility of the CPU/GPU backend.
  • The code duplication between the codegen back-ends is indeed unfortunate. It seems like the GPU-enabled codegen templates only need to add things, so I'd propose to refactor this with proper extension points (i.e. something like `includes += "<cuda_runtime.h>"`).
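The extension-point idea might look something like the sketch below, where the base generator owns a mutable list of headers and GPU-enabled generators only append to it. The trait and member names are illustrative, not the project's actual codegen API:

```scala
// Hypothetical sketch of an extension point for codegen headers.
trait DslGenC {
  val includes = scala.collection.mutable.ListBuffer(
    "<stdio.h>", "<stdlib.h>")
  def emitIncludes(out: java.io.PrintWriter): Unit =
    includes.foreach(h => out.println(s"#include $h"))
}

// The GPU generator only adds headers; it duplicates no template logic.
trait DslGenCublas extends DslGenC {
  includes += "<cuda_runtime.h>"
  includes += "<cublas_v2.h>"
}
```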

// We can use a similar memory pool technique with `cudaMallocManaged`.
val arrType = remap(a.m)
stream.println(arrType + "* " + quote(sym) + "; " + getCudaMallocManagedString(quote(sym), quote(n), arrType))
// stream.println(arrType + "* " + quote(sym) + "; " + getCudaMallocString(quote(sym), quote(n), arrType))
Collaborator
Rewiring NewArray globally is problematic for various reasons. I think a better way would be to have explicit allocation operations for GPU. I'm also concerned about performance of CUDA's Unified Memory layer.

dan-zheng (Collaborator, Author)

This makes complete sense. Adding a mallocTensor operation in Backend seems like a good approach. mallocTensor returns Rep[Array[Float]] and should replace invocations of NewArray[Float].

I'll try to flesh this out.
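Combined with the `freeAll` suggestion above, the backend-owned allocation primitive might be sketched as follows; the signatures are an assumption for illustration:

```scala
// Hypothetical sketch: a backend-provided allocation primitive that
// replaces direct `NewArray[Float]` calls in Tensor code.
trait Backend {
  // Allocate a Float array on this backend. The CPU backend could use the
  // existing memory pool; the cuBLAS backend would emit cudaMallocManaged.
  def mallocTensor(size: Rep[Int]): Rep[Array[Float]]

  // Coarse-grained cleanup, per the `freeAll` suggestion above.
  def freeAll(): Unit
}
```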

cublasHandle_t handle;

int main(int argc, char *argv[]) {
CUBLAS_CALL(cublasCreate(&handle));
TiarkRompf (Collaborator), Oct 6, 2018

These don't have to be in the codegen template, but could be explicitly invoked by the cuBLAS backend trait.

dan-zheng (Collaborator, Author)

Could you please clarify the design you have in mind?

Collaborator

I was thinking explicit init and cleanup methods in the backend interface.

dan-zheng (Collaborator, Author)

I wonder how to hook up the Backend-defined init/cleanup methods with the codegen template. AFAICT, the DslDriverXXX trait (which contains DslGenXXX) and the TensorExp trait (which contains Backend) are decoupled, so I'm not sure how to invoke Backend-defined methods in DslGenXXX, where the codegen templates are defined.

Perhaps I'm missing something; do you have an idea of how to hook things up?

Collaborator

Ah, I didn't mean that the code generator would invoke these methods, but rather that they should be called from the frontend somewhere. Does that help clear things up?

dan-zheng (Collaborator, Author)

I'm sorry, I'm still a bit confused. Do you mean something like this?

val foo = new DslDriverCublas[String, Unit] with TensorExp {
  backend = new BackendCublas

  @virtualize
  def snippet(x: Rep[String]): Rep[Unit] = {
    backend.init()
    // ...
    Tensor.assertEqual(...)
    backend.cleanup()
  }
}

If so, I don't quite believe this is the best approach. I would argue backend setup/cleanup are implementation details that should be hidden from DSL users if possible. Also, if backend.init() initializes cublasHandle_t handle, then it really should be performed before any computation in the snippet. Burdening DSL users with that responsibility is suboptimal.

TiarkRompf (Collaborator), Oct 7, 2018

I wasn't necessarily thinking users should call these; for example, init/reset could ideally be hidden in a withGPU { ... } construct, or even run as part of the DslDriver class before and after invoking snippet.

Collaborator

In general, my point here is that codegen templates should be as minimal as possible, and we should define the application logic as far as possible using metaprogramming.

dan-zheng (Collaborator, Author), Oct 7, 2018

I see, that makes sense - thanks for the clarification.

Along those lines, a straightforward idea is to create a function wrapping snippet and compiling that function during codegen:

// `wrapper` could use a better name.
def wrapper(x: Rep[String]): Rep[Unit] = {
  backend.init()
  val result = snippet(x)
  backend.cleanup()
  result
}

// During code generation:
lazy val code: String = {
  val source = new java.io.StringWriter()
  // Compile wrapper function.
  codegen.emitSource(wrapper, "Snippet", new java.io.PrintWriter(source))
  source.toString
}

If this seems reasonable (and it works), I'll implement it.

Note: this wrapper does add complexity to the simple mental model of "the snippet function is compiled, WYSIWYG". It would be good to make setup and cleanup idempotent if possible, in case DSL users call the functions themselves within snippet.

Collaborator

That looks good, yes. Let's not worry about idempotency at this point.

Add `withBackend` in a separate PR for separation of concerns.
@feiwang3311 feiwang3311 merged commit e30cd86 into feiwang3311:master Oct 7, 2018