Accelerated Backends (CPU & GPU) #8
For cuBLAS/cuDNN support, I think I'll start with the naive implementation of allocating all tensors on GPU memory. This is the shortest path to testing GPU code generation. However, this essentially breaks all operations that aren't defined with GPU memory in mind: a robust solution for GPU support needs to reconcile GPU ops with CPU-only operations, and redefining many ops for GPU support greatly increases the surface area of the implementation.
@jmd1011 had the idea of running all ops on the CPU by default, and only using GPU ops within an explicitly demarcated section of code (e.g. a training loop). I feel like this design is facilitated by the flexible backend-swapping mechanism sketched below.

This approach leads to a better separation of concerns: rather than handling arbitrary mixings of CPU and GPU ops (which effectively requires each op to worry about the device allocation of its arguments and result), only "chunks" of CPU and GPU code are handled (ops assume tensors all live on either CPU or GPU). This means that the backend-swapping code is responsible for handling "copying tensors between devices" (rather than every single op).

// Adapted from mnistCNN.scala.
val mnist = new DslDriverC[String, Unit] with TensorExp {
  def snippet(a: Rep[String]): Rep[Unit] = {
    // The backend is initially CPU (`var backend: BackendNative`).
    val data = new DataLoader("mnist")
    ...
    // Start training loop. Generate GPU ops!
    backend = new BackendCudnn
    for (epoch <- 0 until epochCount: Rep[Range]) {
      data foreach { (input: Tensor, target: Rep[Int]) =>
        // It's nice to have a way to print values within the training loop.
        // Some ad-hoc mechanism for communication would be good.
        // Strawman syntax:
        // `printf("Loss: %f\n", loss.toCPU())`
        ...
      }
    }
    // Change backend back to CPU.
    backend = new BackendNative
    printf(...)
  }
}

This idea seems similar to "device placement" in TensorFlow:

results = []
a = tf.get_variable("a", ...)
b = tf.get_variable("b", ...)
# GPU 0 performs matmul.
with tf.device('/gpu:0'):
    results.append(tf.matmul(a, b))
# GPU 1 performs addition.
with tf.device('/gpu:1'):
    results.append(a + b)
# TensorFlow handles copying tensors between devices.
with tf.device('/cpu:0'):
    sum = tf.add_n(results)

Here's the equivalent feature in Swift for TensorFlow. It should be possible to implement a similar API in Lantern.

Original incomplete prototype:

// Not sure what the type of `f` should be. Any tips?
def withBackend(b: Backend, f: ??? -> ???) = {
  val originalBackend = backend
  // Copy tensors to the new backend.
  // Question: what tensors need to be copied?
  // Answer: the ones that are passed as arguments to `f`.
  // Change the backend (i.e. codegen target).
  backend = b
  // Call `f`.
  val result = f(...)
  // Copy `result` to the old backend, then reset the backend.
  backend = originalBackend
}

// Revised based on @GSAir's suggestion below.
def withBackend[T, U](b: Backend, input: T)(f: T => U) = {
  val originalBackend = backend
  // Transfer input to the new backend.
  transferBetweenBackends(originalBackend, b, input)
  // Change the backend (i.e. codegen target), then call `f`.
  backend = b
  val result = f(input)
  // Transfer `result` to the old backend, then reset the backend.
  transferBetweenBackends(b, originalBackend, result)
  backend = originalBackend
  // Return the (transferred) result.
  result
}
// Usage:
def withGPU[T, U](input: T)(f: T => U) = withBackend(BackendCudnn, input)(f)
// Type-inference: `withGPU[Tensor, Tensor]` below.
withGPU(Tensor.ones(2, 3)) { x => x + x }

If you have feedback or related ideas, please share!
For the type of `f`, you may want to be flexible:

def withBackend[T, U](b: Backend, input: T)(f: T => U) = {
  ...
}

The currying form allows you to do:

withBackend[Int, Unit](CPU, 0) { in =>
  printf("%d\n", in)
}
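To illustrate why the curried, two-parameter-list form helps with type inference, here is a minimal, self-contained sketch of the backend-swapping pattern in plain Scala (no LMS). `Backend`, `transfer`, and the global `backend` variable here are simplified stand-ins for illustration, not Lantern's actual definitions.

```scala
// Hypothetical stand-ins for illustration only.
sealed trait Backend
case object CPU extends Backend
case object GPU extends Backend

object BackendDemo {
  // Globally visible "current backend", mirroring `var backend` in the driver.
  var backend: Backend = CPU

  // Placeholder for device-to-device transfer; the real logic would copy tensors.
  def transfer(from: Backend, to: Backend, what: Any): Unit =
    println(s"transfer $what: $from -> $to")

  // Curried form: `input` fixes `T`, so the closure's parameter type is inferred.
  def withBackend[T, U](b: Backend, input: T)(f: T => U): U = {
    val original = backend
    transfer(original, b, input)
    backend = b
    val result = f(input)
    transfer(b, original, result)
    backend = original
    result
  }

  def main(args: Array[String]): Unit = {
    // No explicit type arguments needed: T = Int and U = Int are inferred.
    val y = withBackend(GPU, 21) { x => x * 2 }
    println(y) // 42
  }
}
```

Because `input` appears in the first parameter list, `T` is already fixed when the closure in the second list is type-checked, so call sites rarely need explicit type arguments.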
`withBackend` explicitly demarcates code that should be run on a different backend. It transfers inputs/results between backends automatically. Design info: feiwang3311#8 (comment)
I propose to change the cuDNN backend into a cuBLAS+cuDNN backend. cuDNN by itself only defines high-level NN operations, like convolutions and activation functions. A cuBLAS+cuDNN backend can use cuBLAS for low-level linear algebra primitives and cuDNN for optimized high-level NN ops.
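To make the proposed split concrete, here is a hedged sketch of how a combined backend might route ops: low-level linear algebra to cuBLAS (`cublasSgemm`) and high-level NN ops to cuDNN (`cudnnActivationForward`). The object name and the emit-a-C-string style are assumptions for illustration only, not Lantern's actual backend interface.

```scala
// Hypothetical sketch of the cuBLAS+cuDNN split; names and signatures are assumptions.
// For illustration, each op returns the C call the backend would emit.
object BackendCublasCudnnSketch {
  // cuBLAS handles low-level linear algebra, e.g. a column-major SGEMM for matmul.
  def matmul(a: String, b: String, m: Int, n: Int, k: Int): String =
    s"cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, $m, $n, $k, &alpha, $a, $m, $b, $k, &beta, out, $m);"

  // cuDNN handles optimized high-level NN ops, e.g. an activation forward pass.
  def relu(x: String): String =
    s"cudnnActivationForward(cudnn, reluDesc, &alpha, xDesc, $x, &beta, yDesc, out);"

  def main(args: Array[String]): Unit = {
    println(matmul("A", "B", 2, 4, 3))
    println(relu("X"))
  }
}
```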
Rationale here: feiwang3311#8 (comment)
- Move GPU test utilities to `LanternFunSuite`.
- Add cuDNN test.
Rationale here: feiwang3311#8 (comment)
- Move GPU test utilities to `LanternFunSuite`.
- Improve CUDA/cuBLAS/cuDNN error messages.
  - Example: "cuBLAS error occurred: 7 (lantern-snippet.cpp:150)"
- Add cuDNN test.
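As a sketch of the error-message improvement, the generated C++ could include a check macro that reports the status code plus file and line. The `CUBLAS_CALL` macro below is an assumption about how this might be emitted (shown as a preamble string a codegen pass could inject), not necessarily Lantern's actual macro.

```scala
// Hypothetical sketch, not Lantern's actual codegen: a C preamble that could be
// emitted into the generated file so every cuBLAS call reports its status code
// and source location, e.g. "cuBLAS error occurred: 7 (lantern-snippet.cpp:150)".
object CublasErrorCheckSketch {
  val preamble: String =
    """#define CUBLAS_CALL(f) { \
      |  cublasStatus_t stat = (f); \
      |  if (stat != CUBLAS_STATUS_SUCCESS) { \
      |    fprintf(stderr, "cuBLAS error occurred: %d (%s:%d)\n", stat, __FILE__, __LINE__); \
      |    exit(1); \
      |  } \
      |}""".stripMargin

  def main(args: Array[String]): Unit = println(preamble)
}
```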
@dan-zheng should this line be comparing `this.shape(1)` with `that.shape(0)`?
Thanks for the catch! Fixed in db0a80f.
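For context, the rule being checked is the usual matrix-multiply conformability condition: for a product of an (m × k) and a (k × n) matrix, the inner dimensions must agree, i.e. `this.shape(1)` must equal `that.shape(0)`. A minimal sketch of such a check (hypothetical names, not Lantern's actual code):

```scala
// Hypothetical helper: shape rule for a 2-D matrix product.
object MatmulShapeCheck {
  final case class Shape2(rows: Int, cols: Int)

  def dotShape(a: Shape2, b: Shape2): Shape2 = {
    // (m x k) dot (k x n) is defined only when a.cols == b.rows,
    // i.e. this.shape(1) must equal that.shape(0).
    require(a.cols == b.rows, s"dimension mismatch: ${a.cols} vs ${b.rows}")
    Shape2(a.rows, b.cols) // result shape is (m x n)
  }

  def main(args: Array[String]): Unit =
    println(dotShape(Shape2(2, 3), Shape2(3, 4))) // Shape2(2, 4)
}
```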
Absolutely makes sense. The use case I had in mind was cuBLAS without cuDNN, but that's covered with the combined backend.
FYI: I added a concrete todo list to the issue description.
CUDA backend todos:

High priority:
- Activation functions (`relu`, `tanh`, `sigmoid`).
- Reduction ops (`.sum`). Consider `cudnnReduceTensor`, which supports many reduction operations (sum/product/avg/argmax/norm).
- Loss functions (`nll`, `mse`). `nll`: currently, the GPU impl uses the CPU impl + copying. Revisit when custom kernel generation is possible.
- Specialized elementwise kernels. Elementwise ops currently go through a generic dispatch path (`gpu_binary_kernel` -> `launch_kernel` -> `elementwise_kernel`). We should be able to generate simpler, specialized kernels that potentially don't do bounds checking. This will likely improve performance (though it might not matter too much for models).
- Custom kernel generation: see `UninlinedFunctionOps` in snek. Basically, dupe LMS `Lambda` infrastructure, add a `gpuFunc` DSL function mimicking `fun`.
- GPU memory management: `myGpuMalloc` uses a memory arena, much like `myMalloc` on CPU. However, this produced some errors. A more robust approach that prevents memory leaks is necessary for training loops.
  - `CUDA_CALL(cudaFree(gpuMallocAddr))` produces `CUDA error occurred: invalid device pointer`. This doesn't manifest in GPU tests containing `backend = BackendCPU()` because `backend` isn't reset, so the `cudaFree` cleanup is never called.
  - The arena also produced `CUDA error occurred: misaligned address`. Using `cudaMalloc` fixed the issue.
- Add a `Tensor` helper method to produce `cudnnTensorDescriptor_t`. Currently, `cudnnTensorDescriptor_t` is constructed on-demand at call sites, which is inefficient/redundant.
- `Descriptor` wrappers. Descriptors are currently created via `unchecked` calls, which leads to name conflicts; the current workaround is to use ad-hoc block scopes: `{ cudnnTensorDescriptor_t x_desc; ... }`. (See the sketch below, after the todo list.)

Medium priority:
Other potential back-ends to look into: TVM, XLA, ...
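Here is a rough sketch of the descriptor-helper idea from the todo list above: generate each `cudnnTensorDescriptor_t` inside its own block scope with a caller-chosen name, so repeated `unchecked`-style emissions don't collide. The helper name, the `CUDNN_CALL`-style macro, and the emit-a-C-string approach are assumptions for illustration, not Lantern's actual code.

```scala
// Hypothetical sketch: emit the C code for a 4-D NCHW cuDNN tensor descriptor
// inside its own block scope, with a caller-chosen name to avoid name conflicts.
object CudnnDescriptorSketch {
  def tensorDescriptor4d(name: String, n: Int, c: Int, h: Int, w: Int): String =
    s"""{
       |  cudnnTensorDescriptor_t $name;
       |  CUDNN_CALL(cudnnCreateTensorDescriptor(&$name));
       |  CUDNN_CALL(cudnnSetTensor4dDescriptor(
       |      $name, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, $n, $c, $h, $w));
       |  // ... use $name in a cuDNN call here ...
       |  CUDNN_CALL(cudnnDestroyTensorDescriptor($name));
       |}""".stripMargin

  def main(args: Array[String]): Unit =
    println(tensorDescriptor4d("x_desc", 1, 3, 32, 32))
}
```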