
Add explicit tensor data malloc via NewTensor. #21

Merged: 1 commit into feiwang3311:master on Oct 8, 2018

Conversation

@dan-zheng (Collaborator) commented on Oct 8, 2018

  • NewTensor is a backend-defined memory allocation operation.
    • Use NewTensor instead of NewArray for allocating tensor data.
  • Add TensorNew, TensorApply, TensorUpdate operations.
    • For Scala and C code generators, these forward to ArrayNew,
      ArrayApply, and ArrayUpdate. The GPU code generator implements
      its own codegen logic for these ops using CUDA primitives/kernels
      (a rough sketch follows the list below).

TODO: Investigate TensorApply and TensorUpdate for GPU.

  • TensorApply currently launches a 1-block, 1-thread arrayApply
    kernel. This is probably not ideal.
  • TensorUpdate currently doesn't work.
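
A rough sketch of the forwarding idea (illustrative only, not the exact code in this PR; `Rep`, `Base`, `ArrayOps`, and `NewArray` are assumed to come from LMS):

trait TensorDataOps extends Base with ArrayOps {
  // CPU-style backends (Scala, C) simply forward to the existing array ops;
  // a GPU backend would override these with CUDA-specific implementations.
  def tensorNew[T: Manifest](n: Rep[Int]): Rep[Array[T]] = NewArray[T](n)
  def tensorApply[T: Manifest](a: Rep[Array[T]], i: Rep[Int]): Rep[T] = a(i)
  def tensorUpdate[T: Manifest](a: Rep[Array[T]], i: Rep[Int], v: Rep[T]): Rep[Unit] = a.update(i, v)
}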

Addresses feedback from @TiarkRompf.

}
// override def boolean_and(lhs: Rep[Boolean], rhs: Rep[Boolean])(implicit pos: SourceContext): Rep[Boolean] = __ifThenElse(lhs, rhs, unit(false))

implicit def repArrayToTensorOps[T: Manifest](a: Rep[Array[T]]) = new TensorOpsCls(a)
@dan-zheng (Collaborator, Author) commented on the line above:

I'm concerned that repArrayToTensorOps overpowers repArrayToArrayOps during implicit resolution.
If so, I think that means all ArrayApply and ArrayUpdate ops are rewritten to TensorApply and TensorUpdate.

To avoid this, it seems necessary to distinguish Rep[Array[Float]] allocated by TensorNew from other Rep[Array[Float]]. Is this possible/desirable (perhaps via type annotation)? @TiarkRompf
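
As a self-contained toy (plain Scala, not the actual LMS traits) of what I'm worried about: if the tensor conversion is defined in a trait derived from the one that defines the array conversion, I believe implicit resolution treats it as more specific and always picks it.

import scala.language.implicitConversions

// Stand-in for LMS's Rep: an opaque wrapper, so `a(i)` only works via an implicit view.
case class Rep[T](x: T)

class ArrayOpsCls(a: Rep[Array[Float]])  { def apply(i: Int) = { println("ArrayApply");  a.x(i) } }
class TensorOpsCls(a: Rep[Array[Float]]) { def apply(i: Int) = { println("TensorApply"); a.x(i) } }

trait ArrayOpsLike {
  implicit def repArrayToArrayOps(a: Rep[Array[Float]]): ArrayOpsCls = new ArrayOpsCls(a)
}
trait TensorOpsLike extends ArrayOpsLike {
  implicit def repArrayToTensorOps(a: Rep[Array[Float]]): TensorOpsCls = new TensorOpsCls(a)
}

object Demo extends TensorOpsLike {
  def main(args: Array[String]): Unit = {
    val a = Rep(Array(1.0f, 2.0f))
    a(0)  // prints "TensorApply": the conversion from the derived trait wins
  }
}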

@dan-zheng (Collaborator, Author) commented:

Merging to unblock progress, happy to address feedback later.
I hope this is a step in the right direction.

@dan-zheng dan-zheng merged commit f0f68bd into feiwang3311:master Oct 8, 2018
@TiarkRompf (Collaborator) commented:

I anticipate that we'll need to generate custom CUDA at some point, but for now we should do something simpler:

  1. only access GPU data in bulk, never element-wise -- this removes the need for Apply and Update
  2. if data needs to be shipped to the CPU, make it an explicit transfer operation (array.transferToCPU()).

For this, we don't need IR nodes for TensorNew, TensorApply, TensorUpdate; a simple unchecked("..") should suffice. I'd also prefer to call it something like NewGPUArray, since it returns an Array, not a Tensor object.
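
Roughly, a sketch of what I have in mind (untested; it assumes the DSL mixes in LMS's UncheckedOps, and `gpuMalloc` stands for a thin cudaMalloc wrapper that the CUDA backend would provide):

// No new IR nodes: raw CUDA snippets are spliced into the generated C code via `unchecked`.
def NewGPUArray(n: Rep[Int]): Rep[Array[Float]] =
  unchecked[Array[Float]]("(float*)gpuMalloc(", n, " * sizeof(float))")

// Bulk transfer instead of element-wise Apply/Update on GPU-backed arrays.
def transferToCPU(gpu: Rep[Array[Float]], cpu: Rep[Array[Float]], n: Rep[Int]): Rep[Unit] =
  unchecked[Unit]("cudaMemcpy(", cpu, ", ", gpu, ", ",
                  n, " * sizeof(float), cudaMemcpyDeviceToHost)")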

I do expect that we'll evolve this design but I'd like to do the simplest thing that enables us to run benchmarks.

@dan-zheng (Collaborator, Author) commented on Oct 8, 2018

@TiarkRompf Are you suggesting the following design?

  • Do not make tensor data allocation backend-dependent. Always allocate tensors on CPU.
  • Implement transfer operations to copy tensors between backends.

Example programs:

def snippet(): Unit = {
  // backend: BackendCublas

  // Allocate tensor on CPU.
  var x = Tensor.ones(2, 2)
  // Doing `x+x` right now would result in an error because `x` lives on CPU
  // but codegen produces cuBLAS ops.

  // Transfer tensor to GPU.
  x = x.transferToGPU()
  // Now, `x+x` is valid.
  val y = x + x
}

def snippet2(): Unit = {
  // backend: BackendCublas
  val x = Tensor.ones(2, 2)
  // We can use `withGPU` to eliminate manual transfer operations.
  // However, doing `x+x` here outside `withGPU` still produces an error.
  withGPU(x) {
    val y = x + x
  }
}

I propose an alternative design (nearly implemented):

  • Make tensor data allocation backend-dependent, to be consistent with tensor operations. This means that it is always valid to perform tensor ops, regardless of the current backend.
  • Rename IR nodes to TensorDataNew, TensorDataApply, TensorDataUpdate. These nodes forward operations to the current backend.
  • Rename NewTensor to NewTensorData. NewTensorData[T] should be used instead of NewArray[T] in tensor op implementations to initialize tensor data.
  • Whether or not to fully support TensorDataApply and TensorDataUpdate on GPU is an orthogonal question. Since "Apply" doesn't work on GPU yet, we can throw a compile-time error if it's invoked. This is also necessary in the first design because it's possible to perform tensor(i) in a withGPU block.

Example program:

def snippet(): Unit = {
  // backend: BackendCublas

  // Allocate tensor on GPU. Doing `x+x` is valid right now.
  val x = Tensor.ones(2, 2)
  val y = x + x

  withCPU(y) {
    // Transfer `y` to CPU and perform ops (`+` and `print`).
    val z = y + y
    z.print()
  }
}

TensorFlow's programming model is similar to the second design:

with tf.device('/device:GPU:2'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)

PyTorch's programming model is similar to the first design, but it eliminates the potential error in the example because ops are dispatched based on the device placement of tensors. This is actually a more robust model than either of the two designs.

x = Tensor(...) # CPU
x + x # Performed on CPU.
x = x.cuda() # GPU
x + x # Performed on GPU.

I suppose the differences between the two designs don't matter for benchmarks, because most tensor programs start on the CPU (for data loading, etc.) and then compute on the GPU.

The two designs differ only when the backend is not the CPU. A small limitation of the first design is that it's not possible to allocate memory directly on the GPU: data must always be copied from the CPU, even when the copy isn't actually needed.

I prefer the second design because there are fewer holes in the abstraction. If you want to push for the first design for simplicity/pragmatism, I'm fine with that and will implement it. Please let me know.

Edit: there's a critical implementation difficulty in the second design, so I'll abandon it and pivot to the first design.

@TiarkRompf (Collaborator) commented:

> @TiarkRompf Are you suggesting the following design?
>
>   • Do not make tensor data allocation backend-dependent. Always allocate tensors on CPU.
>   • Implement transfer operations to copy tensors between backends.

Slightly different -- what I'm suggesting is to think in two layers, Arrays and Tensors:

  • Arrays can be allocated on CPU (standard NewArray) or GPU (NewArrayGPU)
  • Arrays can be transferred between CPU and GPU using explicit calls
  • CPU and GPU arrays share the same type Rep[Array[T]] just like pointers in C (T*). Apply and Update will cause bus errors for GPU-backed arrays, which we accept, since we assume that higher-level APIs do the right thing
  • Tensor storage is back-end dependent, i.e., the Tensor backend trait decides whether to allocate the backing arrays on the CPU or GPU
  • The GPU Tensor back-end takes care to only access backing arrays through GPU kernels, and may transfer data to the CPU, e.g., for printing

Does that make sense?

I do think the design with device scopes (withCPU and withGPU) has merit, but it's more complex, and the additional complexity doesn't seem to be on the critical path for benchmarks right now.

@dan-zheng (Collaborator, Author) commented on Oct 8, 2018

Aha, that makes sense!

Previously, I was under the misunderstanding that tensor math operations and allocation operations are somehow distinct. In reality, they're not: both should be backend-defined. Tensor constructors like Tensor.fill and Tensor.fromData should call backend.mallocArray, which allocates memory either on the CPU or the GPU.
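
In code, the take-away is roughly this (a simplified sketch in my own words; the real Backend trait will have more members, and the Tensor constructor shown here is just illustrative):

// Allocation is backend-defined, exactly like the math ops.
trait Backend {
  def mallocArray[T: Manifest](length: Rep[Int]): Rep[Array[T]]
  // ... dot products, matmul, etc. ...
}

trait BackendCPU extends Backend {
  def mallocArray[T: Manifest](length: Rep[Int]): Rep[Array[T]] = NewArray[T](length)
}

trait BackendCublas extends Backend {
  // Allocates on the device instead, e.g. via `unchecked` around a cudaMalloc wrapper.
  def mallocArray[T: Manifest](length: Rep[Int]): Rep[Array[T]] = ???
}

trait TensorDsl {
  def backend: Backend   // the currently selected backend (CPU or cuBLAS)

  // Tensor constructors go through the backend rather than calling NewArray directly.
  def fill(value: Rep[Float], dims: Int*): Tensor = {
    val data = backend.mallocArray[Float](unit(dims.product))
    // ... initialize `data` with `value` using backend ops ...
    new Tensor(data, dims: _*)
  }
}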

The complexity of the second design above is entirely avoidable. Thanks for your clarifications!

@TiarkRompf TiarkRompf added this to the M1 milestone Oct 17, 2018