Running with DeepSpeech (TensorFlow OpenCL/ComputeCpp) #31
VC4CL is using

220 secs on a fully powered machine is very long and will be much longer on a weak Raspberry Pi. Whether it will take that long, I don't know (see below).

No, there currently is no such option. If you have the OpenCL source code (or any intermediate version) as a file (e.g. the

Using VC4CL, there are two things that could take very long:

To further analyse what goes wrong, there are a few options:

If you can point me to the kernel files or send me the intermediate compilation results (
Thanks for the quick reply @doe300. I'm not sure I can get my hands on the OpenCL source code, since it's being generated by the ComputeCpp layer. Is there any way to dump it somehow? That way I could share it with you for sure. I don't seem to be able to get any new file

I'm also giving a try to a very small model (the protobuf file is 54 kB), but so far it got blocked at the same level. The latest TensorFlow logging trace is

I'm going to have a look at
Okay, after a few minutes, nothing at all seems to kick in on the GPU:

I'm also not seeing anything in
Rebuilding
This could be because the compilation results are cached. At least, that is what your first post suggests.
The

Is this log the whole log? If so, then it seems to hang before it actually executes the kernel.
That's the full log filtered on

Just to make it clear,

Just to be extra-cautious, I even unplugged the power for a few minutes ... Any hint on what kind of extra debug I could add / hack into vc4cl to see why it hangs? I'm already running builds of VC4C and VC4CL with
Ah okay, I misunderstood that. If this is all the VC4CL log, then I currently do not know where the problem is. Both likely cases I mentioned earlier are now excluded, and it looks more like a problem in the VC4CL runtime. Are there instructions I can use to build the program and debug/test it?
@doe300 I can share binaries and data with you; building DeepSpeech with OpenCL is quite painful. I wanted to give your CircleCI binaries a try, but somehow there's no artifact available, at least for the master branch :-(.
@doe300 You should have that in your mailbox (the one used for your git commits) :-)
I've added some more debugging into the queue handler:

And then it somehow loops forever on an empty queue.
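For reference, such a queue handler is usually a worker loop of roughly this shape (a hypothetical sketch, not the actual VC4CL code); the symptom described above corresponds to the loop waking up repeatedly while the queue stays empty, i.e. the event that should unblock the client never gets enqueued or signalled:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

struct Event;  // stand-in for the driver's event type

std::mutex queueMutex;
std::condition_variable queueCondition;
std::deque<Event *> eventQueue;
bool running = true;

void queueHandler() {
  while (running) {
    Event *event = nullptr;
    {
      std::unique_lock<std::mutex> lock(queueMutex);
      // wake up when an event is queued or the handler is shut down
      queueCondition.wait(lock, [] { return !eventQueue.empty() || !running; });
      if (!eventQueue.empty()) {
        event = eventQueue.front();
        eventQueue.pop_front();
      }
    }
    if (event != nullptr) {
      // process the event: run the kernel, update the event status,
      // fire callbacks, wake up anyone blocked in clWaitForEvents, ...
    }
  }
}
```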
I will take a look at the execution in the next few days. The output and your assessment of it sound like a discrepancy between the program and the VC4CL library (or within the VC4CL library) regarding the handling of events...
@doe300 Thanks! It's interesting, because I do remember also having to deal with some weirdness related to events and/or the handling of some error cases in the past, when I was testing with the Intel Neo driver: somehow, TensorFlow or the ComputeCpp layer (I cannot remember which one) was handling exception cases in an unexpected way, and this made tracing / profiling OpenCL in this stack configuration impossible. If you have any hint on where I should poke around in vc4cl to try and find out what is going on, do not hesitate :-).
Just found out I only ran

Hacking deeper inside

This means that as of now, only

I just got

But in the end, this is not helping :(
So, current debugging gets me trapped inside the pthread mutex on the
@doe300 So, from further investigation, it seems TensorFlow is properly using the ComputeCpp library to push work to the GPU, and then it sits in

I will try to continue finding out what's going on inside VC4CL and why this call to
Not sure yet why, but disabling

After that, I have more CPU activity. But I'm still unsure what's going on.
In the second version, since it allocated a few buffers, it looks like it may now be running a kernel.

This buffer size looks about right for kernel code.
Yes, this was my thought as well, but as far as I can tell, nothing shows up on
Thanks for looking that deep into the issue, especially the pointer to
@doe300 Removing the
@doe300 Complement to my previous answer: somehow, after killing some locks (MUCH BAD) and mutexes (BAD), I've got one run to complete. It failed, though, but that seems not to be because of vc4cl. Looks like there are some deadlock issues to address though :(
Setting the locking issues aside, I have been able to catch some tensorflow-level errors, and now I'm hitting a vc4c compilation error, which is kind of good :) :
Right, I need to find a way to dump the source code being passed
The easiest thing at this stage is likely to grab it from the VC4CL side - if you look at `clCreateProgramWithBinaries`, there's a `const char **` argument, which will be the kernels. Since you already have debugging set up, this will be easier than the alternative (which would be trying to find the correct files that are crashing, which will also all only exist in /tmp, somewhere!)

ETA: I hope you don't mind me popping in here, we've been following this issue with interest! :)
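Assuming the kernels arrive as source strings through the standard `clCreateProgramWithSource` entry point (rather than as precompiled binaries), a hypothetical helper like the following, called at the top of VC4CL's implementation of that entry point, would append whatever the ComputeCpp runtime passes in to a file for offline inspection:

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <cstring>

// Hypothetical debugging helper, not part of VC4CL: dump every source string
// handed to the program-creation entry point so the kernels can be inspected
// (and recompiled with VC4C) outside the running process.
static void dumpProgramSources(cl_uint count, const char **strings,
                               const size_t *lengths) {
  std::FILE *f = std::fopen("/tmp/dumped_kernels.cl", "a");
  if (f == nullptr)
    return;
  for (cl_uint i = 0; i < count; ++i) {
    // Per the OpenCL spec, a NULL lengths array or a zero entry means the
    // corresponding string is NUL-terminated.
    size_t len = (lengths != nullptr && lengths[i] != 0)
                     ? lengths[i]
                     : std::strlen(strings[i]);
    std::fwrite(strings[i], 1, len, f);
  }
  std::fclose(f);
}
```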
Thanks, I'll investigate this path. I was also on the verge of swapping
So to dump the kernel, there are a few possibilities in VC4C with only slight modifications needed:
So I found the reason for the blocking:

I looked into the OpenCL 1.2 specification, which states that if the status has already been reached, the callback still needs to be triggered. I adapted the code accordingly (using the same behaviour as intel/beignet and pocl). Now the test command you sent me fails with:

Looks like the error is thrown here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/graph/subgraph.cc#L134, which I don't know anything about.
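A minimal sketch of the required behaviour (not the actual VC4CL code, and simplified to plain C++ containers): if the event has already reached or passed the status a callback is registered for, the callback must fire immediately instead of never being called.

```cpp
#include <CL/cl.h>
#include <mutex>
#include <vector>

struct Callback {
  cl_int triggerStatus;
  void (CL_CALLBACK *fn)(cl_event, cl_int, void *);
  void *userData;
};

struct EventState {
  std::mutex lock;
  cl_int status = CL_QUEUED;  // lower values mean "further along" (CL_COMPLETE == 0)
  std::vector<Callback> callbacks;
};

void setEventCallback(cl_event event, EventState &state, cl_int triggerStatus,
                      void (CL_CALLBACK *fn)(cl_event, cl_int, void *),
                      void *userData) {
  bool fireNow = false;
  cl_int currentStatus;
  {
    std::lock_guard<std::mutex> guard(state.lock);
    currentStatus = state.status;
    if (currentStatus <= triggerStatus)
      fireNow = true;  // status already reached or passed: fire immediately
    else
      state.callbacks.push_back({triggerStatus, fn, userData});
  }
  if (fireNow)
    fn(event, currentStatus, userData);  // invoke outside the lock to avoid deadlocks
}
```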
Thanks @doe300, this error is fallout from a missing TensorFlow op in the binary I sent you. I've rebuilt libdeepspeech.so including the Snapshot op, and I'll send it to you 😀. I'm sorry about that, but it slipped through: I did not need that op when testing on my Intel GPU and only found out after hacking the locks. You should have received the new .so :-)
Now, independent of the VC4C front-end used, the compilation fails with an error due to the kernel using double, which is not supported (it also uses 64-bit integers, which are not supported either, except in some edge cases).
@doe300 Which 64-bit instructions? The vectorized version I've sent you should be okay with the model I sent earlier today (the previous model indeed had some leftovers); there's no 64-bit in it, and it fails like this:

Or did I misunderstand something?
Unfortunately some of the kernels are quite large and might well end up using lots of registers. I don't have any specific recommendations for reducing the number of registers the kernels use, but I can try to find out tomorrow.
No, I did. But for the vectorized version I get the same error, that some
Ok, I just want to make sure that this

@DuncanMcBain I'm checking

Right now, I'm able to dump the offending kernel's LLVM-IR bitcode, and there are lines like (in the disassembly):

I suspect this is what you suggested to look at, then find the matching

For example, this seems like one of the SPIR kernels containing the offending
Good work guys! The name of the kernel is a type. In the case of Eigen kernels, it will be the type of the function object passed to the parallel_for call. I believe the mangling of the name should be roughly the same in that case, though I am having trouble decoding it at the moment.
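In SYCL terms, "the name of the kernel is a type" looks like this; a minimal hypothetical example (not DeepSpeech code), where the class `vector_scale` is the kernel name that ends up, mangled, in the generated SPIR module. Eigen kernels are named by the functor type passed to the internal parallel_for instead of a hand-written class:

```cpp
#include <CL/sycl.hpp>
#include <vector>

class vector_scale;  // kernel name type, never instantiated

int main() {
  std::vector<float> data(64, 1.0f);
  {
    cl::sycl::queue q;
    cl::sycl::buffer<float, 1> buf(data.data(), cl::sycl::range<1>(data.size()));
    q.submit([&](cl::sycl::handler &cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
      // the template argument is the type that identifies this kernel
      cgh.parallel_for<vector_scale>(cl::sycl::range<1>(data.size()),
                                     [=](cl::sycl::id<1> i) { acc[i] *= 2.0f; });
    });
  }  // buffer destructor copies the data back to the vector
  return 0;
}
```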
Could we be lucky? Just found this: https://stackoverflow.com/questions/44557876/a-puzzle-on-spir-mangling-on-type-size-t

Reading the SPIR doc at https://www.khronos.org/registry/SPIR/specs/spir_spec-2.0.pdf page 37 (annex A31), it would look like

There's something close here: https://github.com/doe300/VC4CLStdLib/blob/b17db7b38d84aa461042e3cbfd0a6df90d1e3020/include/opencl-c.h#L11704

Also, if the

Reading the mangling rules of the SPIR specification, I'm really questioning myself about the

Another interesting finding: https://github.com/google/clspv/blob/8e13814d0fd80ab8c89bbddb4c0f77949dbf7ea5/lib/ReplaceOpenCLBuiltinPass.cpp#L973-L976
Yeah, there is the problem. I do have the
@doe300 One thing I'm asking myself is whether
True, then this could be a bug in LLVM?!
@doe300 That's what I'm starting to think. Maybe @DuncanMcBain can shed some light there? As far as I can tell, ComputeCpp relies on some SVN snapshot of LLVM after 3.8.0. Looking at 0.7.0 and 0.8.0, it's actually confusing me even more:

So, from ComputeCpp 0.7.0:
So, LLVM 3.8 generates

Let's try something ...
@doe300 With the hack documented above, I'm hitting some register issues as well:

Is that the same issue you mentioned?
That would explain why I never had this error before, since I usually compile with only one LLVM version.

And using 2. and 3. together generates the problem...

Yes, it is.
@doe300 Yes, I stopped using SPIRV-LLVM. Actually, ComputeCpp seems to use some SVN snapshot after the 3.8.0 release, but obviously before the 3.9.0 change that does the mangling we want. My RPi3 is using 3.9 from the Raspbian repos.
Hi @lissyx, I'm a colleague of @DuncanMcBain, and I work specifically on the ComputeCpp compiler. I briefly looked through the discussion of the vload and mangling issue in this thread. ComputeCpp outputs SPIR 1.2 modules, which means that builtins are mangled using the Itanium mangling rules as implemented by the Khronos-modified clang 3.2. Your driver seems to expect the upstream 3.9 mangling. So, as you noticed, `_Z6vload4jPKU3AS1f` is SPIR 1.2 compliant, but upstream clang 3.9 will mangle it as `_Z6vload4jPU3AS1Kf`. To answer the last comment: compute++ is based on LLVM 3.9, but it produces SPIR modules, not LLVM 3.9 modules, so builtins are mangled according to the SPIR 1.2 specification.
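Breaking the two names quoted above down side by side makes the difference visible: the only divergence is whether the const qualifier is emitted before or after the vendor-extended address-space qualifier.

```
vload4(uint, const __global float *)

SPIR 1.2 (Khronos clang 3.2):  _Z6vload4 j P K U3AS1 f
upstream clang 3.9:            _Z6vload4 j P U3AS1 K f

j     -> unsigned int
P     -> pointer to ...
K     -> const
U3AS1 -> vendor-extended qualifier "AS1", i.e. the __global address space
f     -> float
```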
Thanks @Naghasan for the info. Let's see if I understand it correctly:

Continuing that thought:
@Naghasan I might be missing something, but according to A.1 and A.3 in https://www.khronos.org/registry/SPIR/specs/spir_spec-1.2.pdf, I'm not able to understand how
@doe300 Yes, but you could still encounter some issues with the metadata (as they changed a bit). Another way would be to have a module that wraps those functions and redirects them to what the driver understands, so something like this:

When consuming a SPIR module, you can then link the wrapper module into the user one. That should allow you to maintain both manglings.

@lissyx That's a good point, I think I lost track of what is said in the spec and what has become de-facto supported mistakes ...
I added wrapper functions for the SPIR mangling on top of the LLVM mangling (more accurately: function aliases), and now the vector version passes the normalization and optimization steps and runs out of RAM in code generation. BTW, the program takes up more than 1h of processor time up to this point (on a Raspberry Pi 3B+).
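For illustration, such an alias could look roughly like this (a hypothetical C-style sketch using the vload4 names discussed above; the real fix lives at the LLVM level inside VC4C/VC4CLStdLib and keeps the __global address-space qualifier, which plain C++ cannot express):

```cpp
// Vector type stand-in for OpenCL's float4.
typedef float float4 __attribute__((vector_size(16)));

// vload4 implementation, assumed to exist in the standard library under the
// upstream clang 3.9 mangling of vload4(uint, const __global float *).
extern "C" float4 _Z6vload4jPU3AS1Kf(unsigned int offset, const float *ptr) {
  float4 result = {ptr[offset * 4 + 0], ptr[offset * 4 + 1],
                   ptr[offset * 4 + 2], ptr[offset * 4 + 3]};
  return result;
}

// The same builtin under its SPIR 1.2 mangling, aliased to the function
// above so that SPIR modules produced by ComputeCpp link against it.
extern "C" float4 _Z6vload4jPKU3AS1f(unsigned int, const float *)
    __attribute__((alias("_Z6vload4jPU3AS1Kf")));
```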
@doe300 Thanks! I'm getting to the same point, with register allocation failing. I might try and see if I can help there. However, I'm clearly way below 1h of processing time on a RPi3B; it's strange that it takes that much time for you.

EDIT: Ok, 1h of user time, across the 4 cores, so ~16m real. That's a lot, but not surprising given what I saw on my laptop, on the Intel GPU.
To fix the register issues, we would either need to write a much smarter register allocator, which could be hard given the hardware characteristics (unless I did something very stupid ;)), or implement register spilling.
Thanks for all of your efficient fixes @doe300. The linked issue doe300/VC4C#60 is quite clear about the challenges. I'm afraid I don't know the hardware at all, even though I'd be glad to help on the register spilling. Even if it's very slow, I'd be happy if we could get inference working on the GPU :-)
@doe300 Sorry, I got pulled away from that work; have you been able to make progress on register spilling? I'll be away for one month starting early August, but I might be able to work on something in the meantime. I understand the (big picture of the) issues described in issue #60 about latencies, etc., but I lack understanding of how to technically perform the spilling itself.
No, I have not. I am still looking for some strategy to determine which locals to spill (and over which span) to get a minimal, or at least small, number of spills. If you have any ideas on that, they are very welcome. The spilling itself is not the problem (unless there are so many spills that they no longer fit into the VPM buffer...)
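One classic starting point for the "which locals to spill" question is a Belady-style "furthest next use" heuristic. The sketch below only illustrates that idea under assumed data structures; it is not what VC4C implements:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Each live local carries the sorted instruction indices at which it is used.
struct LiveLocal {
  int id;
  std::vector<std::size_t> usePositions;  // ascending
};

// Pick the local whose next use after the current instruction lies furthest
// away; spilling it frees a register for the longest time.
int chooseSpillCandidate(const std::vector<LiveLocal> &liveLocals,
                         std::size_t currentInstruction) {
  int candidate = -1;
  std::size_t furthestNextUse = 0;
  for (const LiveLocal &local : liveLocals) {
    std::size_t nextUse = std::numeric_limits<std::size_t>::max();
    for (std::size_t pos : local.usePositions) {
      if (pos > currentInstruction) {
        nextUse = pos;  // first use after the current instruction
        break;
      }
    }
    if (nextUse > furthestNextUse) {
      furthestNextUse = nextUse;
      candidate = local.id;
    }
  }
  return candidate;  // -1 if no locals are live
}
```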
So, I guess nobody has had time to hack on that :-). I'll try to test this hardware again with the simpler, streaming-oriented model that we landed a few weeks ago. If we are lucky, it will put less pressure on the registers. Also, this new model should (finally) be optimizable by TFLite, so maybe we can hope for even less register pressure.
@doe300 Awesome work man 🥇. Thank you for this awesome repo. Today I got a fully functional
I'm currently trying to assess what expectations we can have of this setup for DeepSpeech, relying on TensorFlow with ComputeCpp. I have been able to cross-build the driver, and most of the `TestVC4C` tests do run (properly or not). That means I can see `clang` doing its job and compiling some CL code. The GPU is also visible to `computecpp_info`.

Now, I'm trying to run our code on top of that. So far it has not been very successful, but in an unexpected way: as documented in codeplaysoftware/computecpp-sdk#117 (comment), ComputeCpp does see the GPU and makes use of it. But then, monitoring the system, it sits with the `deepspeech` process at 100%. I don't see much of `clang` running, but I did spot a process `llvm-spirv /tmp/vc4c-EgnXeW /dev/stdin` being run. The file `/tmp/vc4c-EgnXeW` seems to be of non-zero size, and there is no error while running, so I don't know if something is going wrong. The OpenCL kernels might be big (too big for the current limitations? I'm not sure how to check that), and/or the project might still be too young?

As a comparison, we are able to run the same stack with the Intel Neo driver on my laptop (i7-8650U) using the GPU. The first run with the Intel driver compiles the OpenCL code and caches it on disk, and this takes ~220 secs. The resulting `cl_cache` for Intel is:

I've let `deepspeech` with the `VC4` driver run for ~120m without any visible output or error: is it possible our code is too compute-intensive for now and it's expected to take that much time? Or could there be some silent error happening and breaking something?

As far as I could tell from the available docs / wiki, there is nothing (e.g., an env variable) that could be used to get a bit more information at runtime. I'm a bit reluctant to try a debug build, considering how slow things already are with a release build, but if that can provide useful feedback, I'd be glad to give it a try.