Add explicit tensor data malloc via NewTensor #21

Conversation
Review comment on:

```scala
implicit def repArrayToTensorOps[T: Manifest](a: Rep[Array[T]]) = new TensorOpsCls(a)
```
I'm concerned that `repArrayToTensorOps` overpowers `repArrayToArrayOps` during implicit resolution. If so, I think that means all `ArrayApply` and `ArrayUpdate` ops are rewritten to `TensorApply` and `TensorUpdate`. To avoid this, it seems necessary to distinguish `Rep[Array[Float]]` allocated by `TensorNew` from other `Rep[Array[Float]]`. Is this possible/desirable (perhaps via type annotation)? @TiarkRompf
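One possible direction, shown as a rough self-contained sketch rather than the PR's code: give tensor-backing arrays their own wrapper type, so only values produced by the tensor allocator pick up tensor ops during implicit resolution. `TensorData` and `tensorNew` are hypothetical names, and `Rep` is stubbed as an identity alias instead of LMS's actual type.

```scala
import scala.language.implicitConversions

object TensorArraySketch {
  type Rep[T] = T  // stub; in LMS this would be the staged representation type

  // Wrapper marking arrays that were allocated as tensor storage.
  final case class TensorData[T](arr: Rep[Array[T]])

  def tensorNew(size: Int): TensorData[Float] = TensorData(new Array[Float](size))

  class TensorOpsCls[T](a: TensorData[T]) {
    def apply(i: Rep[Int]): Rep[T] = a.arr(i)                  // would lower to TensorApply
    def update(i: Rep[Int], v: Rep[T]): Unit = a.arr(i) = v    // would lower to TensorUpdate
  }

  // Only TensorData values pick up tensor ops; a plain Rep[Array[T]] keeps ordinary ArrayOps.
  implicit def tensorDataToTensorOps[T](a: TensorData[T]): TensorOpsCls[T] =
    new TensorOpsCls(a)

  def main(args: Array[String]): Unit = {
    val data = tensorNew(4)
    data(0) = 1.0f
    println(data(0))   // 1.0, via the tensor ops; ordinary arrays are untouched
  }
}
```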
- `NewTensor` is a backend-defined memory allocation operation.
- Use `NewTensor` instead of `NewArray` for allocating tensor data.
- Add `TensorNew`, `TensorApply`, `TensorUpdate` operations.
- For Scala and C code generators, these forward to `ArrayNew`, `ArrayApply`, and `ArrayUpdate`. GPU code generator implements its own codegen logic for these ops using CUDA primitives/kernels.
- TODO: Investigate `TensorApply` and `TensorUpdate` for GPU.
  - `TensorApply` currently launches a 1-block, 1-thread `arrayApply` kernel. This is probably not ideal.
  - `TensorUpdate` currently doesn't work.
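A stubbed illustration of the forwarding behavior described above, not the PR's actual LMS codegen traits: the `Node` hierarchy and emitted strings here are simplified stand-ins. CPU-targeting generators reuse array codegen for the tensor ops, while the GPU generator overrides allocation and apply with CUDA-flavoured output.

```scala
object TensorCodegenSketch {
  sealed trait Node
  case class TensorNew(size: Int)                              extends Node
  case class TensorApply(arr: String, idx: String)             extends Node
  case class TensorUpdate(arr: String, idx: String, v: String) extends Node

  // Scala/C-style generators forward the tensor ops to plain array ops.
  trait CpuGen {
    def emit(n: Node): String = n match {
      case TensorNew(size)       => s"new Array[Float]($size)"   // forwards to ArrayNew
      case TensorApply(a, i)     => s"$a($i)"                    // forwards to ArrayApply
      case TensorUpdate(a, i, v) => s"$a($i) = $v"               // forwards to ArrayUpdate
    }
  }

  // The GPU generator emits CUDA-flavoured code instead; TensorUpdate falls through,
  // mirroring the "currently doesn't work" note above.
  trait GpuGen extends CpuGen {
    override def emit(n: Node): String = n match {
      case TensorNew(size)   => s"CUDA_CALL(cudaMalloc(&data, $size * sizeof(float)))"
      case TensorApply(a, i) => s"arrayApply<<<1, 1>>>($a, $i, &result)"  // 1-block, 1-thread kernel
      case other             => super.emit(other)
    }
  }

  def main(args: Array[String]): Unit = {
    val cpu = new CpuGen {}
    val gpu = new GpuGen {}
    println(cpu.emit(TensorNew(16)))   // new Array[Float](16)
    println(gpu.emit(TensorNew(16)))   // CUDA_CALL(cudaMalloc(&data, 16 * sizeof(float)))
  }
}
```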
Merging to unblock progress, happy to address feedback later.
I anticipate that we'll need to generate custom CUDA at some point, but for now we should do something simpler: we don't need IR nodes for `TensorNew`, `TensorApply`, and `TensorUpdate`; a simple `unchecked("..")` should suffice. I'd also prefer to call it something like `NewGPUArray`, since it returns an Array, not a Tensor object. I do expect that we'll evolve this design, but I'd like to do the simplest thing that enables us to run benchmarks.
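For concreteness, a rough sketch of the `unchecked`-based approach; the exact LMS `unchecked` signature is an assumption and is stubbed here so the snippet stands alone. `NewGPUArray` returns a plain `Rep[Array[Float]]` whose allocation is spliced into the generated code as a raw `cudaMalloc` call, with no new IR nodes.

```scala
object NewGPUArraySketch {
  type Rep[T] = T  // stub for the staged representation type

  // Stand-in for LMS `unchecked`: in real LMS these fragments would be spliced
  // verbatim into the generated C/CUDA; here the stub just prints them.
  def unchecked[T](frags: Any*): Rep[T] = {
    println(frags.mkString)
    null.asInstanceOf[T]
  }

  // Allocate a device-resident float array of `size` elements.
  def NewGPUArray(size: Rep[Int]): Rep[Array[Float]] =
    unchecked[Array[Float]](
      "({ float* d = NULL; CUDA_CALL(cudaMalloc((void**)&d, ", size, " * sizeof(float))); d; })")

  def main(args: Array[String]): Unit = {
    NewGPUArray(1024)   // prints the CUDA fragment that would be emitted
  }
}
```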
@TiarkRompf Are you suggesting the following design?

Example programs:

```scala
def snippet() {
  // backend: BackendCublas
  // Allocate tensor on CPU.
  var x = Tensor.ones(2, 2)
  // Doing `x+x` right now would result in an error because `x` lives on CPU
  // but codegen produces cuBLAS ops.

  // Transfer tensor to GPU.
  x = x.transferToGPU()
  // Now, `x+x` is valid.
  val y = x + x
}
```
```scala
def snippet2() {
  // backend: BackendCublas
  val x = Tensor.ones(2, 2)
  // We can use `withGPU` to eliminate manual transfer operations.
  // However, doing `x+x` here outside `withGPU` still produces an error.
  withGPU(x) {
    val y = x + x
  }
}
```

I propose an alternative design (nearly implemented):
Example program:

```scala
def snippet() {
  // backend: BackendCublas
  // Allocate tensor on GPU. Doing `x+x` is valid right now.
  val x = Tensor.ones(2, 2)
  val y = x + x
  withCPU(y) {
    // Transfer `y` to CPU and perform ops (`+` and `print`).
    val z = y + y
    z.print()
  }
}
```

TensorFlow's programming model is similar to the second design:

```python
with tf.device('/device:GPU:2'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
```

PyTorch's programming model is similar to the first design, but it eliminates the potential error in the example because ops are generated based on the device placement of tensors. This is actually a more robust model than either of the two designs.

```python
x = Tensor(...)  # CPU
x + x            # Performed on CPU.
x = x.cuda()     # GPU
x + x            # Performed on GPU.
```

I suppose the differences between the two designs don't matter for benchmarks, because most tensor programs start on the CPU (to do data loading, etc.) and then compute on the GPU. The two designs differ only when the backend is not the CPU. A small limitation of the first design is that it's not possible to allocate memory directly on the GPU: a copy from the CPU is always required, even when it isn't needed. I prefer the second design because there are fewer holes in the abstraction. If you want to push for the first design for simplicity/pragmatism, I'm fine with that and will implement it. Please let me know.

Edit: there's a critical implementation difficulty in the second design, so I'll abandon it and pivot to the first design.
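For concreteness, under the first design a helper like `transferToGPU` (the name from the example above; not actual Lantern code) would presumably lower to a device allocation plus a host-to-device copy. A stubbed sketch, assuming the `unchecked`-style splicing discussed earlier:

```scala
object TransferToGPUSketch {
  type Rep[T] = T  // stub for the staged representation type

  // Stand-in for LMS `unchecked`: prints the code fragments a generator would splice verbatim.
  def unchecked[T](frags: Any*): Rep[T] = {
    println(frags.mkString)
    null.asInstanceOf[T]
  }

  // `hostName` stands in for the symbol the code generator would print for the host array.
  def transferToGPU(hostName: String, size: Rep[Int]): Rep[Array[Float]] =
    unchecked[Array[Float]](
      "({ float* d = NULL; CUDA_CALL(cudaMalloc((void**)&d, ", size, " * sizeof(float))); ",
      "CUDA_CALL(cudaMemcpy(d, ", hostName, ", ", size,
      " * sizeof(float), cudaMemcpyHostToDevice)); d; })")

  def main(args: Array[String]): Unit = {
    transferToGPU("x_data", 4)   // prints the fragment for copying a 2x2 tensor to the device
  }
}
```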
Slightly different -- what I'm suggesting is to think in two layers, Arrays and Tensors.

Does that make sense? I do think the design with device scopes (withCPU and withGPU) has merit, but it's more complex, and the additional complexity doesn't seem on the critical path for benchmarks right now.
Aha, that makes sense! Previously I had a misunderstanding: that tensor math operations and allocation operations were somehow distinct. In reality, they're not at all; both (including tensor constructors like `Tensor.ones`) should be backend-defined. The complexity of the second design above is entirely avoidable. Thanks for your clarifications!
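To make that layering concrete, here is a minimal, self-contained sketch (hypothetical names, plain Scala rather than Lantern's LMS-based code) in which both the tensor constructor and the `+` operator delegate to whichever backend is installed:

```scala
object BackendSketch {
  // Both allocation and math ops live behind the backend interface.
  trait Backend {
    def mallocData(size: Int): Array[Float]   // e.g. NewArray on CPU, NewGPUArray on GPU
    def add(a: Tensor, b: Tensor): Tensor
  }

  object BackendCPU extends Backend {
    def mallocData(size: Int): Array[Float] = new Array[Float](size)
    def add(a: Tensor, b: Tensor): Tensor =
      new Tensor(a.data.zip(b.data).map { case (x, y) => x + y }, a.shape)
  }

  // The Tensor front end never allocates or computes on its own.
  class Tensor(val data: Array[Float], val shape: Seq[Int]) {
    def +(that: Tensor)(implicit b: Backend): Tensor = b.add(this, that)
  }

  object Tensor {
    def ones(shape: Int*)(implicit b: Backend): Tensor = {
      val t = new Tensor(b.mallocData(shape.product), shape)
      java.util.Arrays.fill(t.data, 1.0f)
      t
    }
  }

  // A BackendCublas would swap in GPU allocation and a cuBLAS-backed `add`.
  implicit val backend: Backend = BackendCPU

  def main(args: Array[String]): Unit = {
    val x = Tensor.ones(2, 2)
    val y = x + x
    println(y.data.mkString(", "))   // 2.0, 2.0, 2.0, 2.0
  }
}
```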
Addresses feedback from @TiarkRompf.