Feature Request: plug-in support for new devices #4359
Comments
A few questions about the feature request. By "dynamically", I assume you are referring to compile-time but self-contained code changes? If you mean at runtime via dynamic module loading, I don't know offhand how to do this, but I think we can figure it out. How do you wish to handle the kernel code? Should I plan to have new hardware supported by developers modifying the core TensorFlow code?
Yes, I do mean self-contained code changes: a user should be able to use an existing binary install of TensorFlow that doesn't contain the custom device code, but then call a function to "load a device" and then another function to "load the kernels" for that device. The latter is already possible for existing devices via the tf.load_op_library() mechanism, so theoretically something similar could be done for a new tf.load_device(). I'll answer your other question on SO, but I don't think the answer there will be instructive for new devices. Every device has its own execution model, and so the way the device code is written needs to take into account the execution model. On CPU, the ThreadPoolDevice is just a device that uses Eigen's multi-threaded CPU support to implement operations. On GPU, we have to manage streams and execution order carefully because GPU devices are asynchronous, not synchronous. If you told us more about the execution model of your hardware, we might be able to suggest something more relevant -- it's likely that copying the CPU or the GPU way of doing things is not the right solution.
I need to talk with my supervisor before giving too many details. Here is the public info, but as a high-level understanding, the chip is basically a cluster computer reduced to fit on a single chip. The many DSPs need to work together to do anything useful.

I didn't know about the operation interface! It's pretty awesome and I definitely think that is what I want to build. It would seem that, at a minimum, a developer would need to write an Allocator, Device, DeviceFactory, and DeviceContext. This would give a non-functional device, because there are no kernels registered to it.

As I was developing, I noticed that some kernels seem to be core functions, like ConstOp, AssignOp, NoOp, etc., that are needed for other things to work. It would seem that a user wouldn't want to code these explicitly, as they are kind of obvious and redundant. Do you think these can/should be automatically built into the framework so that every device at least has these working out of the box?
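For concreteness, here is a rough sketch of the scaffolding those pieces imply. All names ("KPU", KpuDevice, KpuDeviceFactory) are hypothetical, and the exact base-class constructors and factory signatures vary between TensorFlow versions, so treat this as an outline rather than a drop-in implementation.

```cpp
// Hypothetical skeleton for a pluggable device type "KPU".
#include <vector>
#include "tensorflow/core/common_runtime/device.h"
#include "tensorflow/core/common_runtime/device_factory.h"
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/lib/strings/strcat.h"
#include "tensorflow/core/public/session_options.h"

namespace tensorflow {

class KpuDevice : public Device {
 public:
  KpuDevice(const SessionOptions& options, const DeviceAttributes& attrs)
      : Device(options.env, attrs, cpu_allocator()) {}

  // Return the allocator that hands out memory visible to this device.
  Allocator* GetAllocator(AllocatorAttributes attr) override {
    return cpu_allocator();  // placeholder; a real device supplies its own
  }

  // Block until all work queued on the device has finished.
  Status Sync() override { return Status::OK(); }
};

class KpuDeviceFactory : public DeviceFactory {
 public:
  Status CreateDevices(const SessionOptions& options, const string& name_prefix,
                       std::vector<Device*>* devices) override {
    DeviceAttributes attrs;
    attrs.set_name(strings::StrCat(name_prefix, "/KPU:0"));  // naming revisited below
    attrs.set_device_type("KPU");
    // A real factory also fills in memory limit, locality, and a nonzero incarnation.
    devices->push_back(new KpuDevice(options, attrs));
    return Status::OK();
  }
};

REGISTER_LOCAL_DEVICE_FACTORY("KPU", KpuDeviceFactory);

}  // namespace tensorflow
```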
Yes, those four pieces are the minimal requirements just to get the basics working, and then you'd have to register kernels for every op that is supported. You're right that some ops are probably trivially implementable as long as some of the basics above are implemented. We'd have to think about how to 'auto register kernels' for all devices. However, for things like NoOp, it shouldn't be too hard: it would just be an include and a registration, roughly like the sketch below.
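The original snippet did not survive formatting; a minimal sketch of such a registration, with the "KPU" device name as a hypothetical placeholder, would look like this:

```cpp
// Sketch: registering the stock NoOp kernel for a hypothetical "KPU" device.
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/kernels/no_op.h"

namespace tensorflow {
REGISTER_KERNEL_BUILDER(Name("NoOp").Device("KPU"), NoOp);
}  // namespace tensorflow
```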
How do I add headers to be installed into the TensorFlow include path? And how about the .proto files? I need to "compile" them to headers and have them installed too, since I need several framework and common_runtime headers to compile device code. I tried to look through the Bazel files, but couldn't find anything obvious.

EDIT: I tried a workaround of just setting my include path to the root of the Git repo, but ran into further problems.

EDIT 2: Also, the Eigen files don't appear to have header guards and cause recursive includes.
FYI, I got a fake device working with contexts, factories, kernels and all, so I will begin to try to make it self-contained. I posted various questions to StackOverflow; those are more educational. The library problem above is more important for the purposes of this issue. Thanks a bunch!
I got all the code down to one file, with no changes to the main code base except for the items below. I list my solution ideas, but they involve modification of the main code, so I wanted to make sure it's okay.

To get rid of this one, I was thinking of using the existing registration, but that means exposing this function. I could just make the underlying …

If there is Python access to the same interface as above, then I can just use that. I am a fan of less configuration.
I think if you name your device "/device:KPU:0" instead of just "/KPU:0", it should work without any additional edits. "/cpu:0" and "/gpu:0" were shortcuts before we realized it was helpful to have the "/device" qualifier like we do for the other parts of a full device name. Can you give that a try and let me know?
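Continuing the earlier hypothetical KpuDeviceFactory sketch, this means the attributes built in CreateDevices() would use the fully qualified form (fragment only, names still hypothetical):

```cpp
// Inside the hypothetical KpuDeviceFactory::CreateDevices() from the sketch above:
// use the fully qualified "/device:<TYPE>:<id>" form rather than "/KPU:0".
attrs.set_name(strings::StrCat(name_prefix, "/device:KPU:0"));
attrs.set_device_type("KPU");
```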
Yup that worked!
Once I figure out how to get those includes and compile it outside of TF, I will submit a pull request with documentation à la "Adding an Op". After that, I would like to discuss how to best support my particular hardware. My company probably wants this to be private, as I will need to tell you details, so would you be willing to do it over email and not as a GitHub issue?
Adding @petewarden
I have been manually adding headers to my include path, and I need quite a few of them. Even after manually adding symlinks for all of those headers into the TensorFlow source, I still get an error referencing ProtoDebugString().
ProtoDebugString() is generated by tensorflow/tools/proto_text, which is a bit like protoc but for a minimal-footprint class instead of a full protobuf one.
Thanks. Do you know how I can get it recognized when I am compiling it into an external module? I am currently running …
Hello, I think I should introduce myself. I am working for Graphcore, a UK startup making a graph processor / accelerator. It seems that I am traveling the same path as Aidan. I have a vaguely functioning device, although I was stuck on the 'having to edit device_name_utils.cc and device.py' problem. It would indeed be very useful to be able to build a dynamically linked device/kernels module.

I have isolated all of my device/kernel code under third_party. Does this make sense? My device depends on an external library, and so I have also added some code to the workspace.bzl file. With only one workspace for the whole of TensorFlow, there doesn't seem to be a way around this. Is there something I have missed about Bazel that would let me avoid that?

Cheers
I saw the StackOverflow conversation about DeviceContexts. Can you clarify the difference between Devices and DeviceContexts? Why would I not store device-specific information in my Device class? For instance, handles to device-specific structures, mapped memory, etc.
@DavidNorman If you look back a couple of posts, I had the same problem. The trick is calling your device something like "device:FPU"; that is what I do in my pull request. Also, @DavidNorman, were you able to get your code to compile as a .so file? I have trouble with the includes. Or are you compiling into the TensorFlow main body?
@DavidNorman With regards to DeviceContexts, I forget where I heard it, but I think @vrv told me that GPUs have multiple contexts which handle different parts of computation, like memory access, pushing kernels onto the device, etc. Also, if you could put both of your questions on StackOverflow, I will answer them there for better accessibility in the future, e.g. my question about compilation.
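For reference, a DeviceContext is mostly a per-kernel/per-stream hook, including the host-to-device copy paths. A minimal sketch is below; the class name is hypothetical and the exact virtual signatures vary across TensorFlow versions.

```cpp
#include "tensorflow/core/framework/device_base.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/lib/core/stringpiece.h"

namespace tensorflow {

// Hypothetical context: the Device can hand different contexts to different
// kernels (e.g. carrying the command queue or stream each kernel should use).
class KpuDeviceContext : public DeviceContext {
 public:
  void CopyCPUTensorToDevice(const Tensor* cpu_tensor, Device* device,
                             Tensor* device_tensor,
                             StatusCallback done) const override {
    // Enqueue a host-to-device transfer on this context's queue, then:
    done(Status::OK());
  }

  void CopyDeviceTensorToCPU(const Tensor* device_tensor, StringPiece tensor_name,
                             Device* device, Tensor* cpu_tensor,
                             StatusCallback done) override {
    // Enqueue a device-to-host transfer on this context's queue, then:
    done(Status::OK());
  }
};

}  // namespace tensorflow
```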
@aidan-plenert-macdonald Thanks for the info. I do not have a separate .so file at the moment (or a .dylib, as it would be on my Mac). I have a few more hurdles to get past before I try to make that happen, I think. The requirement for an external library means I have to modify workspace.bzl, so I'm not sure there is too much point unless I can figure out how to avoid that. I have posted this question on StackOverflow.
@vrv I was wondering if you have any further information about adding the includes. We just need those to be sorted out.
I really don't know :( Going to assign this one to @keveman (I'll be OOO soon for a week).
Got it!! Simple one-liner in a BUILD file.
@DavidNorman Can I ask how you are assigning compute resources to nodes in the TF compute graph? Are you using the Device to control the hardware, or is this done with contexts? Do you have similar problems with allocation? Do you know anything about TF's ability to do automatic resource allocation? I believe TF can auto-assign devices to operations.

@vrv Is there any way to bind Allocators to specific device contexts?
@aidan-plenert-macdonald: Typically an Allocator manages the memory for an entire device. In the GPU case, there is one allocator for each GPU device, but a GPU device has hundreds to thousands of "cores", each of which has access to the global memory that the allocator is responsible for allocating and that programs are responsible for using properly. CUDA additionally has ways in its programming model to do local fine-grained sharing of memory among cores, but that is specific to the CUDA programming model and we don't really touch it, since it's not part of the global memory pool.

I suspect that your device is really more like many loosely coupled cores with message passing, which is a model we haven't really optimized for. We would normally treat each 'core' as a separate device, and then your memcopy primitives (device to device) would use whatever high-performance inter-node communication primitives you wanted.

Alternatively, if you'd rather treat your entire processor as a single "device", it might still be possible: allocator.h has an AllocatorAttributes structure, which is an opaque value blob, and we've made the top 8 bits of that value device-specific. OpKernel has a GetAllocator() function that takes allocator attributes, so it might be possible for you to have the DeviceContext contain information that an OpKernel can use to set the appropriate bits of the AllocatorAttributes, and then you'd implement DeviceBase::GetAllocator() to return a different allocator based on the top 8 bits of the AllocatorAttributes. If you used all 8 bits as an allocator id, you'd be able to address 256 different allocators in a single device.

Without knowing too much about your device, I'm not sure which approach is better, but those might help you make progress.
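A sketch of that second approach follows. The class and member names are hypothetical, and this is not how the stock CPU/GPU devices work; it only illustrates dispatching on the device-specific upper bits of AllocatorAttributes.

```cpp
#include <vector>
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/framework/device_base.h"
#include "tensorflow/core/platform/env.h"

namespace tensorflow {

// Hypothetical single-device view of a many-core chip: the device-specific
// upper 8 bits of AllocatorAttributes select one of up to 256 allocators.
class KpuDeviceBase : public DeviceBase {
 public:
  explicit KpuDeviceBase(Env* env) : DeviceBase(env) {}

  Allocator* GetAllocator(AllocatorAttributes attr) override {
    const uint32 allocator_id = attr.value >> 24;  // top 8 bits: device-specific
    return allocators_.at(allocator_id);
  }

 private:
  std::vector<Allocator*> allocators_;  // e.g. one per core or memory region
};

}  // namespace tensorflow
```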
@RichDubielzig not sure I understand the maximum code size requirement. Care to elaborate? Where in the API would you expect to see this?
I would expect to see code-size considerations taken into account in the generation of kernel code from LLVM IR. The instruction memory for individual cores in Knureon is quite small, and it looks like XLA could JIT kernel code that would overrun program memory. Is there a mechanism to break up a block of IR into smaller pieces?
@RichDubielzig This functionality is not supported by the current CPU/GPU backends; however, if you plan to add a custom backend for your architecture, you could implement it there.
@eliben Do you mean implement it in the main TensorFlow, or just in our library? Can you recommend a good place where it might fit in nicely?
I'm afraid I'm having trouble understanding how to handle synchronization and asynchronous kernel resources when developing in TensorFlow using TensorFlow objects. Here is my problem: the Knureon system can be thought of as a large pool of completely independent cores which can be assigned operations at any time. So, for example, given a data flow with three operations A, B, and C (e.g. A: m = a * b): if I have 128 available cores, then I might request A to run on 64 cores, B to run on another 64 cores, and then I would need to wait for one of the first two operations to complete before I can run C on 64 cores.

To this end, I have created a little launch queue object that can hold computation requests until compute resources are available, and then run them. My naive OpKernel Compute() implementation is sketched below. Note that it employs a Knureon-specific command queue object which can be used both to launch operations and to pend on running ones. Multiple queues are allowed to exist and run in parallel on a single Knureon system.
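The original pseudocode did not survive formatting; the following is a reconstruction of its shape. Everything in the knureon:: namespace (Queue, MakeMatMulTask) is a hypothetical stand-in for the vendor API, not a real library.

```cpp
#include "tensorflow/core/framework/op_kernel.h"
#include "knureon/queue.h"  // hypothetical vendor command-queue API

namespace tensorflow {

class KnureonMatMulOp : public OpKernel {
 public:
  explicit KnureonMatMulOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    // ... validate inputs and allocate outputs as usual ...
    knureon::Queue q;  // the "Q" referred to below: launches work and pends on it
    q.Launch(/*cores=*/64, knureon::MakeMatMulTask(ctx));  // waits for free cores
    q.Wait();  // blocks this Executor thread until the operation completes
  }
};

}  // namespace tensorflow
```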
The problem should be apparent: I don't want TensorFlow to sit around waiting on Q if there are other operations that can be deployed right away on available resources. The alternative appears to be to use an asynchronous OpKernel, along the lines of the sketch below.
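Again reconstructing the elided snippet, using the same hypothetical vendor API, the async variant would look roughly like this:

```cpp
#include <memory>
#include "tensorflow/core/framework/op_kernel.h"
#include "knureon/queue.h"  // hypothetical vendor command-queue API

namespace tensorflow {

class KnureonMatMulAsyncOp : public AsyncOpKernel {
 public:
  explicit KnureonMatMulAsyncOp(OpKernelConstruction* ctx) : AsyncOpKernel(ctx) {}

  void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
    // ... validate inputs and allocate outputs as usual ...
    auto q = std::make_shared<knureon::Queue>();
    q->Launch(/*cores=*/64, knureon::MakeMatMulTask(ctx));
    // Return immediately; signal TensorFlow only when the hardware finishes,
    // so the Executor thread can keep dispatching other ready ops meanwhile.
    q->OnCompletion([q, done]() { done(); });
  }
};

}  // namespace tensorflow
```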
But I am not sure this is the right approach, as it seems to be reserved for high-latency operations such as receiving over a network. I've been over and through the code, and unfortunately this has only increased my confusion. I see that the GPU uses a StreamExecutor, but the presence of all the ThenXxxx() functions in the Stream prototype makes me suspect that it is not what I want.

I have also noticed that OpKernel Compute() methods can be called in parallel from concurrent Executor threads. So do I even need to sweat about parallel execution at all? When an OpKernel's Compute() method is invoked, am I guaranteed that there will be no benefit to running asynchronously, because all the other OpKernels managed by my thread's Executor have data dependencies on my operation?

Thank you in advance, and my apologies for rambling. I've had to spend a few days figuring out how to phrase this question in a coherent manner, and I'm not sure I've met my goal, so if you need anything clarified please let me know.
Hi! Like many others, I am really interested in defining my own devices. I read this thread as well as some others. Is there any documentation available for implementing new devices (either in the TensorFlow source code or as a separate module)? @RichDubielzig and @aidan-plenert-macdonald were working on a guide, but at the moment that one document is all I could find. Is there anything more recent (targeting the 1.0 release)? Thank you!
@CUinNYC Indeed, that doc is a great attempt to show how it's done, and that's the basic idea of how the scaffolding works, but the implementation details of every new hardware device mean it takes care to figure out exactly how to implement the device code. We're working with a few external folks, such as those in this thread, to provide some early guidance on how it's done, and at some point we'll have some nice examples to point at (beyond, say, our CPU and GPU implementations).

@RichDubielzig: The dataflow model of TF execution means that any operations that can be run at a given time will be run by the executor (assuming there are enough inter-op-parallelism threads to run them). So yes, it is possible to have multiple "matmul" nodes running at once if both can be run at the same time. On CPU, we use Eigen for a lot of the computation, and each op has a shared threadpool on which to execute, so those threads will typically always be busy if there's lots of op work to do (though I admit it probably won't be super efficient to context switch all the time, but I digress). On GPU, we use GPU streams to enqueue asynchronous ops: execution of an OpKernel just queues the work to be done, and we use streams to enforce execution order on the actual device. @hawkinsp might have more to say here.
@vrv I really appreciate the initial documentation. Thanks to it, I successfully created a "fake CPU" (really easy indeed). It required minor changes because of the evolution of the TensorFlow code (r1.0), so an external-module approach would be more portable. Let me know if you need help or feedback on a preliminary documentation. In particular, I am interested in how to allocate memory for accelerators; my goal is to reduce memory copies as much as possible. If anybody has comments or examples, I would really appreciate that.
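On the allocation question, the Allocator interface itself is small. Below is a minimal sketch of a device allocator that hands out host-and-device-visible memory so staging copies can be avoided; the knureon:: calls are hypothetical stand-ins for a vendor allocation API.

```cpp
#include <string>
#include "tensorflow/core/framework/allocator.h"
#include "knureon/memory.h"  // hypothetical vendor allocation API

namespace tensorflow {

class KnureonAllocator : public Allocator {
 public:
  string Name() override { return "knureon"; }

  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    // Return memory mapped into both host and device address spaces, so
    // kernels can consume tensors without an extra staging copy.
    return knureon::AllocShared(alignment, num_bytes);
  }

  void DeallocateRaw(void* ptr) override { knureon::FreeShared(ptr); }
};

}  // namespace tensorflow
```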
Just following up on my question: I am seeing results with this approach.
Asked a new question on StackOverflow about ConstTensor, which doesn't map to the DMAHelper::base() function.
Posted another question on StackOverflow regarding my confusion about DeviceContext and when it needs to be used. I am revisiting the issue because it turns out we have a smaller limit than I thought on open queues in the system, and I'm wondering if I should …
Another issue we have run into in attempting to debug XLA: it doesn't seem like we are able to exercise any backend compiler when running in a debugger. The issue is here:
@RichDubielzig I've been told that XLA-related development discussion has been moved to https://groups.google.com/forum/#!forum/xla-dev -- you might get more help / feedback there now.
Asked a new question related to this on StackOverflow; will also post to the xla-dev forum: https://stackoverflow.com/questions/44271775/can-tensorflow-xla-be-supported-by-plug-ins
It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
We need to be able to support new devices in TensorFlow without requiring edits to the TensorFlow source code or a rebuilt binary.
This requires a few changes. Among many others: