Running with DeepSpeech (TensorFlow OpenCL/ComputeCpp) #31

Open · lissyx opened this issue May 23, 2018 · 112 comments

lissyx commented May 23, 2018

I'm currently trying to assess what we can expect from this setup for DeepSpeech, relying on TensorFlow with ComputeCpp. I have been able to cross-build the driver, and most of the TestVC4C tests do run (properly or not). That means I can see clang doing its job and compiling some CL code.

The GPU is also visible to computecpp_info.

Now, I'm trying to run our code on top of that. So far it's not very successful, but in an unexpected way: as documented in codeplaysoftware/computecpp-sdk#117 (comment), ComputeCpp does see the GPU and makes use of it. But then, monitoring the system, the deepspeech process is sitting at 100% CPU.

I don't see much of clang running, but I did spot a process llvm-spirv /tmp/vc4c-EgnXeW /dev/stdin being run. The file /tmp/vc4c-EgnXeW seems to be of non-zero size. But there is no error when running, so I don't know if something is going on.

The OpenCL kernels might be big (too big for the current limitations? I'm not sure how to check that), and/or the project might still be too young?

As a comparison, we are able to run with the same stack on the Intel Neo driver on my laptop (i7-8650U), using the GPU. The first run with the Intel driver compiles the OpenCL code and can cache it on disk, and this takes ~220 secs to compile.

Resulting cl_cache for Intel is:

$ LC_ALL=C ll cl_cache/
total 16M
drwxr-xr-x 1 alex alex  700 May 23 16:31 .
drwxr-xr-x 1 alex alex  222 May 23 16:28 ..
-rw-r--r-- 1 alex alex  73K May 23 16:31 34dcc67199396f96.cl_cache
-rw-r--r-- 1 alex alex 734K May 23 16:30 3b459da76e3bb7c4.cl_cache
-rw-r--r-- 1 alex alex 6.7M May 23 16:30 3d76036abef34a33.cl_cache
-rw-r--r-- 1 alex alex 206K May 23 16:31 66637feca75f4aba.cl_cache
-rw-r--r-- 1 alex alex 443K May 23 16:30 6ad895b3f2af8633.cl_cache
-rw-r--r-- 1 alex alex 728K May 23 16:31 6d252b514c3b944f.cl_cache
-rw-r--r-- 1 alex alex 165K May 23 16:30 8a505eb1718a2ecd.cl_cache
-rw-r--r-- 1 alex alex 425K May 23 16:30 98e4482263ced159.cl_cache
-rw-r--r-- 1 alex alex 523K May 23 16:30 9e55731fa2c7901e.cl_cache
-rw-r--r-- 1 alex alex 124K May 23 16:30 a493ba4fcff0b7d1.cl_cache
-rw-r--r-- 1 alex alex 122K May 23 16:29 b1a69f78813be187.cl_cache
-rw-r--r-- 1 alex alex 108K May 23 16:31 d2e0ea42ff6b24ac.cl_cache
-rw-r--r-- 1 alex alex 344K May 23 16:31 ec50bb7158952ee2.cl_cache
-rw-r--r-- 1 alex alex 5.2M May 23 16:31 f4146d630607b157.cl_cache

I've let deepspeech run with the VC4 driver for ~120m without any visible output or error: is it possible our code is too compute-intensive for now and it's expected to take that much time? Or could there be some silent error happening and breaking something?

As much as I could read of the docs / wiki available, I could not find anything (e.g., an env variable) that could be used to get a bit more information at runtime. I'm a bit reluctant to try a debug build, considering how slow things are already with a release build, but if that can provide useful feedback, I'd be glad to give it a try.

doe300 commented May 24, 2018

I don't see much of clang running, but I did spot a process llvm-spirv /tmp/vc4c-EgnXeW /dev/stdin being run. The file /tmp/vc4c-EgnXeW seems to be of non-zero size. But there is no error when running, so I don't know if something is going on.

VC4CL is using clang to convert OpenCL to LLVM-IR and, in your configuration, also llvm-spirv for the LLVM-IR to SPIR-V conversion. The remaining compilation is done in VC4C itself, so no sub-processes will be shown for this step.

The OpenCL kernels might be big (too big for the current limitations? I'm not sure how to check that), and/or the project might still be too young?

As a comparison, we are able to run with the same stack on the Intel Neo driver on my laptop (i7-8650U), using the GPU. The first run with the Intel driver compiles the OpenCL code and can cache it on disk, and this takes ~220 secs to compile.

220 secs on a fully powered machine is very long and will be much longer on a weak Raspberry Pi. Whether it will take that long here, I don't know (see below).

As much as I could read of the docs / wiki available, I could not find anything (e.g., an env variable) that could be used to get a bit more information at runtime.

No, there currently is no such option. If you have the OpenCL source code (or any intermediate version) as a file (e.g. the /tmp/vc4c-xxxxxx file), you can run the standalone VC4C compiler and generate a ton of debug information. But built into the VC4CL runtime, it will stay silent (except for errors).

Using VC4CL, there are 2 things that could take very long:

  1. Trying to compile a very large kernel with VC4C
  2. Executing an erroneous kernel on the GPU. This hangs the GPU and will result in a time-out. If the error is not caught, it will time out on every single kernel work-group, which can take a long time.

To further analyse what goes wrong, there are a few options:

  1. Case 1 can be analyzed by trying to compile the kernels/intermediate code with the standalone VC4C compiler.
  2. Case 2 can be analyzed by running the additional tool provided in the VC4CL project, v3d_profile, which shows the GPU usage. If it is high, then either the GPU is really doing work (just enough that it takes very long), or the GPU hangs somewhere, which is more likely.

If you can point me to the kernel files or send me the intermediate compilation results (/tmp/vc4c-xxxxxx files), I can also take a look at the problem.

doe300 self-assigned this May 24, 2018
lissyx commented May 24, 2018

Thanks for the quick reply @doe300. I'm not sure I can get my hands on the OpenCL source code; it's being generated by the ComputeCpp layer. Is there any way to dump it somehow? That way I could share it with you for sure. I don't seem to be able to get any new file /tmp/vc4c-xxx.

I'm also giving a try to a very small model (the protobuf file is 54k), but so far it gets blocked at the same point. The latest TensorFlow logging trace is Instantiating kernel for node: bidirectional_rnn/fw/basic_lstm_cell/bias = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1976] values: 0.000547592354 -0.00127190584 -0.00106085732...>, _device="/job:localhost/replica:0/task:0/device:SYCL:0"](), if that can be any hint.

I'm going to have a look at v3d_profile, see if it helps getting an insight :-)

lissyx commented May 24, 2018

Okay, after a few minutes, nothing at all seems to kick in on the GPU: v3d_profile has all the columns at 0, and the output from v3d_info is not very different:

$ sudo /usr/local/bin/v3d_info 
V3D Info:

Mailbox Info:
      Firmware Revision:        5ad4e898
            Board Model:               0
         Board Revision:          a22082
            MAC Address:        c8eb27b8
           Board Serial:        3ac827be 0
             ARM Memory:       994050048 Bytes (948 MB)
    VideoCore IV Memory:        79691776 Bytes (76 MB)
       Clock Rate (ARM):            1200 MHz (600 to 1200 MHz)
      Clock Rate (Core):             400 MHz (250 to 400 MHz)
       Clock Rate (V3D):             300 MHz (250 to 300 MHz)
       Clock Rate (PWM):               0 MHz (0 to 500 MHz)
        SoC Temperature:              56 C (max 85 C)
V3D Status Register Info:
           V3D revision:               1
            HDR support:              no
             Semaphores:              16
                   QPUs:              12
                 Slices:               3
        VPM Memory size:              12 KB
          VPM User size:               4 KB
          Program queue:           0/0/0 requests/completed/in queue
                 Errors:                
Testing maximum single allocation size:
Maximum single allocation: 39845888 bytes (38 MB)

I'm also not seeing anything in /tmp/, but I re-verified and TestVC4C does compile properly, from all I can tell. Checking with htop, I see some VC4CL Queue Han thread listed with some (light) activity. There are two other threads, one being the main process, taking 100% of CPU. Attaching strace on both reveals:

  • main process blocked on a futex()
  • other thread constantly calling sched_yield()

lissyx commented May 24, 2018

Rebuilding vc4cl with -DBUILD_DEBUG=ON, this is where I'm getting to:

$ sudo TF_CPP_MIN_VLOG_LEVEL=1 ./deepspeech ~/tmp/deepspeech/models/output_graph-n_hidden_1.pb ~/tmp/deepspeech/models/alphabet.txt ~/tmp/deepspeech/audio/ -t 2>&1 | grep VC4
[VC4CL] get extension function address: clIcdGetPlatformIDsKHR
[VC4CL] get extension function address: clGetPlatformInfo
[VC4CL] base=0x3fc00000, mem=0x76fc4000
[VC4CL] V3D base: 0x76fc4000
[VC4CL] Tracking live-time of object: cl_context
[VC4CL] Starting queue handler thread...
[VC4CL] Tracking live-time of object: cl_command_queue
[VC4CL] Mailbox file descriptor opened: 4
[VC4CL] Mailbox buffer before:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x00000000
[VC4CL] 0008: 0x00030012
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x00000004
[VC4CL] 0014: 0x00000001
[VC4CL] 0018: 0x00000004
[VC4CL] 001c: 0x01528be8
[VC4CL] Mailbox buffer after:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x80000000
[VC4CL] 0008: 0x00030012
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x80000004
[VC4CL] 0014: 0x00000000
[VC4CL] 0018: 0x00000004
[VC4CL] 001c: 0x01528be8
[VC4CL] Mailbox buffer before:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x00000000
[VC4CL] 0008: 0x00010006
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x00000000
[VC4CL] 0014: 0x00000081
[VC4CL] 0018: 0x7fffffff
[VC4CL] 001c: 0x01528be8
[VC4CL] Mailbox buffer after:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x80000000
[VC4CL] 0008: 0x00010006
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x80000008
[VC4CL] 0014: 0x3b400000
[VC4CL] 0018: 0x04c00000
[VC4CL] 001c: 0x01528be8
[VC4CL] Mailbox request: succeeded
[VC4CL] Mailbox buffer before:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x00000000
[VC4CL] 0008: 0x00010006
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x00000000
[VC4CL] 0014: 0x6dbb6d33
[VC4CL] 0018: 0x04c00000
[VC4CL] 001c: 0x7e806e5c
[VC4CL] Mailbox buffer after:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x80000000
[VC4CL] 0008: 0x00010006
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x80000008
[VC4CL] 0014: 0x3b400000
[VC4CL] 0018: 0x04c00000
[VC4CL] 001c: 0x7e806e5c
[VC4CL] Mailbox request: succeeded
[VC4CL] Mailbox buffer before:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x00000000
[VC4CL] 0008: 0x00010006
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x00000000
[VC4CL] 0014: 0x6dbb6d33
[VC4CL] 0018: 0x00000001
[VC4CL] 001c: 0x00000000
[VC4CL] Mailbox buffer after:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x80000000
[VC4CL] 0008: 0x00010006
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x80000008
[VC4CL] 0014: 0x3b400000
[VC4CL] 0018: 0x04c00000
[VC4CL] 001c: 0x00000000
[VC4CL] Mailbox request: succeeded
[VC4CL] Tracking live-time of object: cl_mem
[VC4CL] Mailbox buffer before:
[VC4CL] 0000: 0x00000024
[VC4CL] 0004: 0x00000000
[VC4CL] 0008: 0x0003000c
[VC4CL] 000c: 0x0000000c
[VC4CL] 0010: 0x0000000c
[VC4CL] 0014: 0x00000030
[VC4CL] 0018: 0x00001000
[VC4CL] 001c: 0x0000000c
[VC4CL] 0020: 0x01528be8
[VC4CL] Mailbox buffer after:
[VC4CL] 0000: 0x00000024
[VC4CL] 0004: 0x80000000
[VC4CL] 0008: 0x0003000c
[VC4CL] 000c: 0x0000000c
[VC4CL] 0010: 0x80000004
[VC4CL] 0014: 0x0000000b
[VC4CL] 0018: 0x00001000
[VC4CL] 001c: 0x0000000c
[VC4CL] 0020: 0x01528be8
[VC4CL] Mailbox buffer before:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x00000000
[VC4CL] 0008: 0x0003000d
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x00000004
[VC4CL] 0014: 0x0000000b
[VC4CL] 0018: 0x01528be8
[VC4CL] 001c: 0x0000000b
[VC4CL] Mailbox buffer after:
[VC4CL] 0000: 0x00000020
[VC4CL] 0004: 0x80000000
[VC4CL] 0008: 0x0003000d
[VC4CL] 000c: 0x00000008
[VC4CL] 0010: 0x80000004
[VC4CL] 0014: 0xbeae9000
[VC4CL] 0018: 0x01528be8
[VC4CL] 001c: 0x0000000b
[VC4CL] base=0x3eae9000, mem=0x76fc3000
[VC4CL] Allocated 48 bytes of buffer: handle 11, device address 3199111168, host address 0x76fc3000
[VC4CL] Tracking live-time of object: cl_event

doe300 commented May 24, 2018

I don't seem to be able to get any new file /tmp/vc4c-xxx.

This could be because the compilation results are cached. At least that is what your first post suggests.

Checking with htop, I see some VC4CL Queue Han thread listed with some (light) activity. There are two other threads, one being the main process, taking 100% of CPU. Attaching strace on both reveals:

  • main process blocked on a futex()
  • other thread constantly calling sched_yield()

The VC4CL Queue Handler is the thread actually communicating with the GPU. The main thread uses a condition variable to wait for it, but should do this without utilizing 100% CPU.

Is this log the whole log? If so, then it seems to hang before it actually executes the kernel.
Did you reboot your Raspberry Pi before testing this? Because when the GPU is screwed up once, it sometimes does not recover (except when powered off).

lissyx commented May 24, 2018

That's the full log, filtered on VC4. I already tried rebooting several times, for the same reason you mention, because I suspected that might be the case.

Just to make it clear, cl_cache in my first post refers to Intel's Neo OpenCL driver's caching feature, nothing at the deepspeech, tensorflow, or computecpp level.

Just to be extra cautious, I even unplugged power for a few minutes ... Any hint on what kind of extra debugging I could add / hack into vc4cl to see why it hangs? I'm already running builds of VC4C and VC4CL with -DBUILD_DEBUG=ON.

doe300 commented May 24, 2018

Just to make it clear, cl_cache in my first post refers to Intel's Neo OpenCL driver's caching feature

Ah okay, I misunderstood that.

If this is all the VC4CL log, then I currently do not know where the problem is. Both likely cases I mentioned earlier are now excluded, and it looks more like a problem in the VC4CL runtime.
For further analysis, the exact location where the program hangs and which OpenCL API calls were made before it will have to be determined.

Are there instructions I can use to build the program and debug it/test it out?

lissyx commented May 24, 2018

@doe300 I can share binaries and data with you; building deepspeech with OpenCL is quite painful. I wanted to give your CircleCI binaries a try, but somehow there's no artifact available, at least for the master branch :-(.

lissyx commented May 24, 2018

@doe300 You should have that in your mailbox (the one used for your git commits) :-)

lissyx commented May 24, 2018

I've added some more debugging into the queue handler:

[VC4CL] base=0x3eaed000, mem=0x76f08000
[VC4CL] Allocated 48 bytes of buffer: handle 7, device address 3199127552, host address 0x76f08000
[VC4CL] Tracking live-time of object: cl_event
[VC4CL] PUSH event: 0x6a3fe73c
[VC4CL] POP event: 
[VC4CL] POP event: 0x6abfecf8
[VC4CL] Event updated 
[VC4CL] HAS event->action 
[VC4CL] event status: 0
[VC4CL] POP event: 
[VC4CL] POP empty event: 
[VC4CL] POP returned empty event: 

And then it loops somehow forever on empty queue.

doe300 commented May 24, 2018

I will take a look at the execution in the next few days. The output and your assessment of it sound like a discrepancy between the program and the VC4CL library (or within the VC4CL library) regarding the handling of events...

lissyx commented May 24, 2018

@doe300 Thanks! It's interesting, because I do remember also having to deal with some weirdness related to events and/or the handling of some error cases in the past, when I was testing with the Intel Neo driver: somehow, TensorFlow or the ComputeCpp layer (I cannot remember which one) was handling exception cases in an unexpected way, and this made tracing / profiling OpenCL in this stack configuration impossible.

If you have any hint on where I should poke around in vc4cl to try and find some hint about what is going on, do not hesitate :-).

lissyx commented May 24, 2018

Just found out I only ran TestVC4C. I'm running TestVC4CL as well :-).
Looks like TestVC4CL allows me to reproduce some of the issues, e.g. ./TestVC4CL --executions does seem to block, doing nothing.

Hacking deeper inside vc4c::runProcess(), I could somewhat improve the execution of ./TestVC4CL --executions by changing the stop condition on the while loop: it seems it's not able to properly detect the child process exiting and/or stdout/stderr being done. I'll need to check that thoroughly.

This means that as of now, only ./TestVC4CL --extensions still somehow blocks with no useful output.

I just got ./TestVC4CL --extensions unblocked: it looks like the lock on counterAccessMutex is already held by the call to VC4CL_FUNC(clCreatePerformanceCounterVC4CL) when reaching the statement PerformanceCounter* counter = newOpenCLObject<PerformanceCounter>(counter_type, counterIndex);, which then does a new PerformanceCounter(...).
The constructor of PerformanceCounter then tries to acquire the same counterAccessMutex lock, so the thread deadlocks against itself.

But in the end, this is not helping :(
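For illustration, here is a minimal sketch of the self-deadlock pattern described above (generic C++, not the actual VC4CL code): a non-recursive mutex is locked by the caller and then locked again, on the same thread, inside a constructor the caller invokes.

#include <mutex>

std::mutex counterAccessMutex;  // name borrowed from the description above

struct PerformanceCounter
{
    PerformanceCounter()
    {
        // Second lock on the same thread: undefined behaviour / deadlock
        // for a non-recursive std::mutex.
        std::lock_guard<std::mutex> guard(counterAccessMutex);
    }
};

PerformanceCounter* createCounter()
{
    std::lock_guard<std::mutex> guard(counterAccessMutex);  // first lock
    return new PerformanceCounter();                        // hangs here
}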

lissyx commented May 24, 2018

So, current debugging gets me trapped inside a pthread mutex on the libComputeCpp.so side. I'm not sure there is anything actionable on the VC4CL side for the moment :)

lissyx commented May 24, 2018

@doe300 So, from further investigation, it seems TensorFlow properly uses the ComputeCpp library to push work to the GPU, and then it sits in wait_and_throw() waiting for all elements in the queue to be processed. More details here: codeplaysoftware/computecpp-sdk#117 (comment)

I will try to continue finding out what's going on inside VC4CL and why this call to wait_and_throw() waits forever. If you have any suggestions on where I should concentrate, I'm all ears :)

lissyx commented May 25, 2018

Not sure yet why, but disabling eventAvailable.wait_for(lock, WAIT_DURATION); in src/queue_handler.cpp gets me further ... :

pi@rpi3-opencl-20180518:~/deepspeech $ sudo ./deepspeech ~/tmp/deepspeech/models/output_graph-n_hidden_1.pb ~/tmp/deepspeech/models/alphabet.txt ~/tmp/deepspeech/audio/ -t 2>&1 | grep -i allocated
[VC4CL] Allocated 48 bytes of buffer: handle 67, device address 3198869504, host address 0x76f1e000
pi@rpi3-opencl-20180518:~/deepspeech $ sudo ./deepspeech ~/tmp/deepspeech/models/output_graph-n_hidden_1.pb ~/tmp/deepspeech/models/alphabet.txt ~/tmp/deepspeech/audio/ -t 2>&1 | grep -i allocated
[VC4CL] Allocated 48 bytes of buffer: handle 68, device address 3198865408, host address 0x76f94000
[VC4CL] Allocated 48 bytes of buffer: handle 69, device address 3198861312, host address 0x76f93000
[VC4CL] Allocated 16 bytes of buffer: handle 70, device address 3198857216, host address 0x76f92000
[VC4CL] Allocated 48 bytes of buffer: handle 71, device address 3198853120, host address 0x76f91000
[VC4CL] Allocated 48 bytes of buffer: handle 72, device address 3198849024, host address 0x76f90000
[VC4CL] Allocated 16 bytes of buffer: handle 73, device address 3198844928, host address 0x76f8f000
[VC4CL] Allocated 1984 bytes of buffer: handle 74, device address 3198836736, host address 0x76f8e000
[VC4CL] Allocated 16 bytes of buffer: handle 75, device address 3198832640, host address 0x76f8d000
[VC4CL] Allocated 16 bytes of buffer: handle 76, device address 3198828544, host address 0x76f8c000
pi@rpi3-opencl-20180518:~/deepspeech $ 

After that, I have more CPU activity. But I'm still unsure what's going on.
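For context, a generic sketch of the usual condition-variable pattern (my own illustration, not the actual VC4CL queue_handler code): a wait_for with a predicate re-checks the queue on every wakeup, so an event pushed just before the wait starts is not lost, and the handler does not need to busy-spin either.

#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

std::mutex queueMutex;
std::condition_variable eventAvailable;
std::deque<int> eventQueue;  // stand-in for the real event type

bool popEvent(int& out)
{
    std::unique_lock<std::mutex> lock(queueMutex);
    // Wakes up when an event is available or after the timeout, whichever comes first.
    if(!eventAvailable.wait_for(lock, std::chrono::milliseconds(10),
                                [] { return !eventQueue.empty(); }))
        return false;  // timed out, queue still empty
    out = eventQueue.front();
    eventQueue.pop_front();
    return true;
}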

doe300 commented May 25, 2018

In the second version, since it allocated a few buffers, it looks like it may now be running a kernel.

[VC4CL] Allocated 1984 bytes of buffer: handle 74, device address 3198836736, host address 0x76f8e000

This buffer size looks about right for kernel code.

lissyx commented May 25, 2018

Yes, that was my thought as well, but as far as I can tell, nothing shows up in v3d_profile. It is somehow stuck again.

doe300 commented May 25, 2018

Thanks for looking that deep into the issue, especially the pointer to eventAvailable.wait_for(lock, WAIT_DURATION) helps a lot. I will investigate this issue in depth tomorrow.

lissyx commented May 25, 2018

@doe300 Removing the lock there, it seems I can (randomly) get further sometimes. I got one execution where all the allocations finished and were deallocated :)

$ sudo ./deepspeech ~/tmp/deepspeech/models/output_graph-n_hidden_1.pb ~/tmp/deepspeech/models/alphabet.txt ~/tmp/deepspeech/audio/ -t 2>&1 | grep -i allocated
[VC4CL] Allocated 48 bytes of buffer: handle 124, device address 3198623744, host address 0x76fa3000
[VC4CL] Allocated 48 bytes of buffer: handle 125, device address 3198619648, host address 0x76fa2000
[VC4CL] Allocated 16 bytes of buffer: handle 126, device address 3198615552, host address 0x76fa1000
[VC4CL] Allocated 48 bytes of buffer: handle 127, device address 3198611456, host address 0x76fa0000
[VC4CL] Allocated 48 bytes of buffer: handle 128, device address 3198607360, host address 0x76f9f000
[VC4CL] Allocated 16 bytes of buffer: handle 129, device address 3198603264, host address 0x76f9e000
[VC4CL] Allocated 1984 bytes of buffer: handle 130, device address 3198595072, host address 0x76f9d000
[VC4CL] Allocated 16 bytes of buffer: handle 131, device address 3198590976, host address 0x76f9c000
[VC4CL] Allocated 16 bytes of buffer: handle 132, device address 3198586880, host address 0x76f9b000
[VC4CL] Allocated 16 bytes of buffer: handle 133, device address 3198582784, host address 0x76f9a000
[VC4CL] Allocated 16 bytes of buffer: handle 134, device address 3198578688, host address 0x76f99000
[VC4CL] Allocated 16 bytes of buffer: handle 135, device address 3198574592, host address 0x6cbce000
[VC4CL] Allocated 16 bytes of buffer: handle 136, device address 3198570496, host address 0x6cbcd000
[VC4CL] Allocated 16 bytes of buffer: handle 137, device address 3198566400, host address 0x6cbcc000
[VC4CL] Allocated 16 bytes of buffer: handle 138, device address 3198562304, host address 0x6cbcb000
[VC4CL] Allocated 128 bytes of buffer: handle 139, device address 3198558208, host address 0x6cbca000
[VC4CL] Allocated 128 bytes of buffer: handle 140, device address 3198554112, host address 0x6cbc9000
[VC4CL] Allocated 16 bytes of buffer: handle 141, device address 3198550016, host address 0x6cbc8000
[VC4CL] Deallocated 48 bytes of buffer: handle 124, device address 3198623744, host address 0x76fa3000
[VC4CL] Deallocated 48 bytes of buffer: handle 125, device address 3198619648, host address 0x76fa2000
[VC4CL] Deallocated 16 bytes of buffer: handle 126, device address 3198615552, host address 0x76fa1000
[VC4CL] Deallocated 48 bytes of buffer: handle 127, device address 3198611456, host address 0x76fa0000
[VC4CL] Deallocated 48 bytes of buffer: handle 128, device address 3198607360, host address 0x76f9f000
[VC4CL] Deallocated 16 bytes of buffer: handle 129, device address 3198603264, host address 0x76f9e000
[VC4CL] Deallocated 1984 bytes of buffer: handle 130, device address 3198595072, host address 0x76f9d000
[VC4CL] Deallocated 16 bytes of buffer: handle 131, device address 3198590976, host address 0x76f9c000
[VC4CL] Deallocated 16 bytes of buffer: handle 132, device address 3198586880, host address 0x76f9b000
[VC4CL] Deallocated 16 bytes of buffer: handle 133, device address 3198582784, host address 0x76f9a000
[VC4CL] Deallocated 16 bytes of buffer: handle 134, device address 3198578688, host address 0x76f99000
[VC4CL] Deallocated 16 bytes of buffer: handle 135, device address 3198574592, host address 0x6cbce000
[VC4CL] Deallocated 16 bytes of buffer: handle 136, device address 3198570496, host address 0x6cbcd000
[VC4CL] Deallocated 16 bytes of buffer: handle 137, device address 3198566400, host address 0x6cbcc000
[VC4CL] Deallocated 16 bytes of buffer: handle 138, device address 3198562304, host address 0x6cbcb000
[VC4CL] Deallocated 128 bytes of buffer: handle 139, device address 3198558208, host address 0x6cbca000
[VC4CL] Deallocated 128 bytes of buffer: handle 140, device address 3198554112, host address 0x6cbc9000
[VC4CL] Deallocated 16 bytes of buffer: handle 141, device address 3198550016, host address 0x6cbc8000
[VC4CL] Allocated 48 bytes of buffer: handle 141, device address 3198623744, host address 0x76fa3000

lissyx commented May 25, 2018

@doe300 Complement to my previous answer: somehow, after killing some locks (MUCH BAD) and mutexes (BAD), I've got one run to complete. It failed, though, but that (seems) not to be because of vc4cl. Looks like there are some deadlock issues to address, though :(

lissyx commented May 25, 2018

With the locking issues out of the way, I have been able to catch some tensorflow-level errors, and now I'm hitting a vc4c compilation error, which is kind of good :) :

[E] Fri May 25 15:41:17 2018:  (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xb1 [0x6dc37b96]
[E] Fri May 25 15:41:17 2018:  (2) /usr/local/lib/libVC4CC.so.1.2 : vc4c::spirv2qasm::SPIRVParser::parse(vc4c::Module&)+0x3c1 [0x6dddbf4c]
[E] Fri May 25 15:41:17 2018:  (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0xbb [0x6dc38cf8]
[E] Fri May 25 15:41:17 2018:  (4) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x193 [0x6dc3939c]
[E] Fri May 25 15:41:17 2018:  (8) /home/pi/deepspeech/libComputeCpp.so : cl::sycl::detail::program::build_current_program(std::string, bool)+0x19f [0x70b8f54c]
[E] Fri May 25 15:41:17 2018:  (9) /home/pi/deepspeech/libComputeCpp.so : cl::sycl::detail::program::build(unsigned char const*, unsigned int, std::string, bool)+0x47 [0x70b8f7c8]
[E] Fri May 25 15:41:17 2018:  (10) /home/pi/deepspeech/libComputeCpp.so : cl::sycl::detail::context::create_program_for_binary(std::shared_ptr<cl::sycl::detail::context> const&, unsigned char const*, int, bool)+0x11b [0x70b76310]
[E] Fri May 25 15:41:17 2018:  (11) /home/pi/deepspeech/libComputeCpp.so : cl::sycl::program::create_program_for_kernel_impl(std::string, unsigned char const*, int, char const* const*, std::shared_ptr<cl::sycl::detail::context>, bool)+0x87 [0x70b8a4ec]
[E] Fri May 25 15:41:17 2018:  (12) /home/pi/deepspeech/libdeepspeech.so : +0x1357564 [0x71f3c564]
[E] Fri May 25 15:41:17 2018:  (13) /home/pi/deepspeech/libdeepspeech.so : +0x13569e8 [0x71f3b9e8]
[E] Fri May 25 15:41:17 2018:  (14) /home/pi/deepspeech/libdeepspeech.so : +0x1354830 [0x71f39830]
[E] Fri May 25 15:41:17 2018:  (15) /home/pi/deepspeech/libdeepspeech.so : +0x135453c [0x71f3953c]
[E] Fri May 25 15:41:17 2018:  (16) /home/pi/deepspeech/libdeepspeech.so : +0x1354184 [0x71f39184]
[E] Fri May 25 15:41:17 2018:  (17) /home/pi/deepspeech/libdeepspeech.so : +0x1353ee4 [0x71f38ee4]
[E] Fri May 25 15:41:17 2018:  (18) /home/pi/deepspeech/libdeepspeech.so : +0x1353d44 [0x71f38d44]
[E] Fri May 25 15:41:17 2018:  (19) /home/pi/deepspeech/libdeepspeech.so : +0x118c7c4 [0x71d717c4]
[E] Fri May 25 15:41:17 2018:  (20) /home/pi/deepspeech/libdeepspeech.so : +0x3fc8d34 [0x74badd34]
[E] Fri May 25 15:41:17 2018:  (21) /home/pi/deepspeech/libdeepspeech.so : +0x3fc9b6c [0x74baeb6c]
[E] Fri May 25 15:41:17 2018:  (22) /home/pi/deepspeech/libdeepspeech.so : +0x4112a88 [0x74cf7a88]
[E] Fri May 25 15:41:17 2018:  (23) /home/pi/deepspeech/libdeepspeech.so : +0x4112034 [0x74cf7034]
[E] Fri May 25 15:41:17 2018:  (24) /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 : +0x9cd44 [0x704bed44]
[E] Fri May 25 15:41:17 2018: Compiler threw exception: Parser: Unsupported operation: 

2018-05-25 15:41:17.393737: I ./tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocator_name: "device:SYCL" }
terminate called after throwing an instance of 'cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)7, cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)6, cl::sycl::exception> >'

lissyx commented May 25, 2018

Right, I need to find a way to dump the source code being passed to llvm-spirv, but this will have to wait until next Monday.

DuncanMcBain commented May 25, 2018

The easiest thing at this stage is likely to grab it from the VC4CL side - if you look at clCreateProgramWithBinary, there's a const unsigned char ** argument, which will be the kernel binaries. Since you already have debugging set up, this will be easier than the alternative (which would be trying to find the correct files that are crashing, which will also all only exist in /tmp, somewhere!)

ETA: I hope you don't mind me popping in here, we've been following this issue with interest! :)
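A rough sketch of that interception idea (the helper name and output path are my own, illustrative choices, not actual VC4CL code): inside the driver's clCreateProgramWithBinary entry point, every binary blob the SYCL runtime hands over could be written to a file, so it can later be fed to the standalone VC4C compiler for analysis.

#include <cstdio>
#include <CL/cl.h>

// Hypothetical helper to be called from the clCreateProgramWithBinary
// implementation: dumps each provided binary (here: SPIR bitcode) to /tmp.
static void dumpProgramBinaries(cl_uint numDevices, const size_t* lengths,
                                const unsigned char** binaries)
{
    for(cl_uint i = 0; i < numDevices; ++i)
    {
        char name[64];
        std::snprintf(name, sizeof(name), "/tmp/computecpp-kernel-%u.spir", i);
        if(FILE* f = std::fopen(name, "wb"))
        {
            std::fwrite(binaries[i], 1, lengths[i], f);
            std::fclose(f);
        }
    }
}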

lissyx commented May 25, 2018

Thanks, I'll investigate this path. I was also on the verge of swapping llvm-spirv with a shell script to intercept the stdin stream 😎. But next Monday 😁

doe300 commented May 26, 2018

So to dump the kernel, there are a few possibilities in VC4C with only slight modifications needed:

  1. If you remove the deletion of the file in the destructor in VC4C in src/precompilation/TemporaryFile.cpp, you will get the temporary files for every intermediate compilation step.
  2. When VC4CL is built in debug mode, it dumps the whole memory buffer for the kernel execution itself into /tmp/vc4cl-dump.bin just before running the first work-group. So if the program gets to this point, the output can be analysed for code which hangs the GPU.

doe300 commented May 26, 2018

So I found the reason for the blocking:

  • The program registers an event callback on finishing the memory copy and waits for the callback to be triggered before continuing with anything
  • VC4CL only triggered the callback if it was registered before the status for which it was registered was reached

I looked into the OpenCL 1.2 specification, which states that if the status has already been reached, the callback still needs to be triggered. I adapted the code accordingly (using the same behaviour as intel/beignet and pocl).
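For reference, a minimal host-side illustration of the behaviour the specification requires (my own sketch, not the actual test case; the queue, buffer, and transfer that produce the event are omitted): even if the event is already CL_COMPLETE when clSetEventCallback is called, the callback must still fire.

#include <cstdio>
#include <CL/cl.h>

static void CL_CALLBACK onComplete(cl_event, cl_int status, void*)
{
    std::printf("callback fired, status = %d\n", status);
}

void registerLateCallback(cl_event event)
{
    // Make sure the command associated with 'event' has already finished ...
    clWaitForEvents(1, &event);
    // ... and only then register the callback. Per OpenCL 1.2 it must still be
    // invoked, even though CL_COMPLETE was reached before the registration.
    clSetEventCallback(event, CL_COMPLETE, &onComplete, nullptr);
}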

Now the test command you sent me fails with:

Error running session: Not found: FetchOutputs node logits_output_node: not found
terminate called after throwing an instance of 'cl::sycl::exception'

Looks like the error is thrown here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/graph/subgraph.cc#L134, which I don't know anything about.

lissyx commented May 26, 2018

Thanks @doe300, this error is the fallout of a missing tensorflow op in the binary I sent you. I've rebuilt libdeepspeech.so including the Snapshot op, and I'll send it to you 😀. I'm sorry about that, but it slipped through: I did not need that op when testing on my Intel GPU and only found out after hacking the locks.

You should have received the new .so :-)

doe300 commented May 26, 2018

Now, independent of the VC4C front-end used, the compilation fails with a compilation error due to the kernel using double, which is not supported (it also uses 64-bit integers, which are also not supported, except in some edge cases).
This is due to the hardware being 32-bit, and implementing software-based support for 64-bit types is not that easy.

lissyx commented May 29, 2018

@doe300 Which 64-bit instructions? The vectorized version I've sent you should be okay with the model I sent earlier today (the previous model had some leftovers indeed); there's no 64-bit type inside, and it fails like this:

[VC4CL] Compilation error: Normalizer: Not normalized instruction found: <4 x f32> %tmp.38341 = <4 x f32> vload4jPKU3AS1f(i32 0, f32* %tmp.38340)

Or I misunderstood something?

@DuncanMcBain

Unfortunately some of the kernels are quite large and might well end up using lots of registers. I don't have any specific recommendations for reducing the number of registers the kernels use, but I can try to find out tomorrow.

doe300 commented May 29, 2018

Or I misunderstood something?

No, I did. But for the vectorized version I get the same error that some vload4 overload is missing.

lissyx commented May 29, 2018

No, I did. But for the vectorized version I get the same error that some vload4 overload is missing.

Ok, I just want to make sure that this vload4 is NOT an instance of a 64-bit error. I have to continue digging into the tensorflow sycl build to find where that is coming from, but I'm a bit lost.

@DuncanMcBain I'm checking compute++ but I don't see any way to dump some OpenCL code. I know that the C++ code is ingested and directly output as SPIR, but isn't there a way to instruct the compiler to dump some OpenCL?

Right now, I'm able to dump the offending kernel's LLVM-IR bitcode, and there are lines like this (in the disassembled version):

define hidden spir_kernel void @SYCL_struct_Eigen__TensorSycl__ExecExprFunctorKernel_const_class_Eigen__TensorAssignOp_class_Eigen__TensorMap_class_Eigen__Tensor_float__2__1__int___16__MakePointer___const_class_Eigen__TensorContractionOp_const_class_Eigen__array_struct_Eigen__IndexPair_int___1___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer_______struct_utility__tuple__Tuple_struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_______struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_write___struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____false_(i8 addrspace(1)*, i32, i8 addrspace(1)*, i32, i32) #1 {

I suspect this is what you suggested to look at, then find the matching tensorflow/core/kernel code. But first, the mangling seems different from what C++ uses, and it looks like a bunch of templates. Would there be any tool to demangle that?

For example, this seems to be one of the SPIR kernels containing the offending vload4 call:

define hidden spir_kernel void @SYCL_struct_Eigen__TensorSycl__ExecExprFunctorKernel_const_class_Eigen__TensorAssignOp_class_Eigen__TensorMap_class_Eigen__Tensor_float__2__1__int___16__MakePointer___const_class_Eigen__TensorContractionOp_const_class_Eigen__array_struct_Eigen__IndexPair_int___1___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer_______struct_utility__tuple__Tuple_struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_______struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_write___struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____true_(i8 addrspace(1)*, i32, i8 addrspace(1)*, i32, i32, i32) {
  %7 = tail call spir_func i32 @_Z13get_global_idj(i32 0) #0
  %8 = tail call spir_func i32 @_Z13get_global_idj(i32 1) #0
  %9 = tail call spir_func i32 @_Z13get_global_idj(i32 2) #0
  %10 = tail call spir_func i32 @_Z15get_global_sizej(i32 1) #0
  %11 = tail call spir_func i32 @_Z15get_global_sizej(i32 2) #0
  %12 = mul i32 %10, %7
  %13 = add i32 %12, %8
  %14 = mul i32 %13, %11
  %15 = add i32 %14, %9
  %16 = icmp slt i32 %15, %5
  br i1 %16, label %17, label %40

; <label>:17:                                     ; preds = %6
  %18 = bitcast i8 addrspace(1)* %0 to float addrspace(1)*
  %19 = lshr i32 %1, 2
  %20 = getelementptr inbounds float, float addrspace(1)* %18, i32 %19
  %21 = bitcast i8 addrspace(1)* %2 to float addrspace(1)*
  %22 = lshr i32 %3, 2
  %23 = getelementptr inbounds float, float addrspace(1)* %21, i32 %22
  %24 = shl nsw i32 %15, 2
  %25 = getelementptr inbounds float, float addrspace(1)* %23, i32 %24
  %26 = tail call spir_func <4 x float> @_Z6vload4jPKU3AS1f(i32 0, float addrspace(1)* %25), !noalias !326
  %27 = getelementptr inbounds float, float addrspace(1)* %20, i32 %24
  tail call spir_func void @_Z7vstore4Dv4_fjPU3AS1f(<4 x float> %26, i32 0, float addrspace(1)* %27)
  %28 = shl i32 %5, 2
  %29 = add nsw i32 %15, %28
  %30 = icmp slt i32 %29, %4
  br i1 %30, label %31, label %40

; <label>:31:                                     ; preds = %17
  br label %32

; <label>:32:                                     ; preds = %32, %31
  %33 = phi i32 [ %37, %32 ], [ %29, %31 ]
  %34 = getelementptr inbounds float, float addrspace(1)* %23, i32 %33
  %35 = load float, float addrspace(1)* %34, align 4, !tbaa !315
  %36 = getelementptr inbounds float, float addrspace(1)* %20, i32 %33
  store float %35, float addrspace(1)* %36, align 4, !tbaa !315
  %37 = add nsw i32 %33, %5
  %38 = icmp slt i32 %37, %4
  br i1 %38, label %32, label %39, !llvm.loop !331

; <label>:39:                                     ; preds = %32
  br label %40

; <label>:40:                                     ; preds = %39, %17, %6
  ret void
}

@DuncanMcBain

Good work guys!
Unfortunately, there is no way to dump the OpenCL C code corresponding to the kernels - it doesn't exist! We output directly to SPIR (or SPIR-V, or PTX). I suppose it might technically be possible, but we have done no work in that direction, and I'm not a compiler guy, so I don't know if it would ever work. But I don't see why not.

The name of the kernel is a type. In the case of Eigen kernels, it will be the type of the function objects passed in to the parallel_for call. I believe the mangling of the name should be roughly the same in that case, though I am having trouble decoding it at the moment.
All that being said, ExecExprFunctorKernel is probably the place to look. I'm afraid I don't know a lot about the internals of Eigen - I've only tried debugging it once or twice and tried not to go too deep!

lissyx commented May 30, 2018

Could we be lucky? Just found this: https://stackoverflow.com/questions/44557876/a-puzzle-on-spir-mangling-on-type-size-t

Reading the SPIR doc at https://www.khronos.org/registry/SPIR/specs/spir_spec-2.0.pdf page 37 (annex A.3), it would look like _Z6vload4jPKU3AS1f is supposed to be vload4((unsigned int), (const __global float *)). Mangling rules are the same for 1.2 as well: https://www.khronos.org/registry/SPIR/specs/spir_spec-1.2.pdf

There's something close here: https://github.com/doe300/VC4CLStdLib/blob/b17db7b38d84aa461042e3cbfd0a6df90d1e3020/include/opencl-c.h#L11704

Also, if the K moves a little bit, then there's a match:

grep '_Z6vload4jPU3AS1Kf' VC4CLStdLib.ll
define spir_func <4 x float> @_Z6vload4jPU3AS1Kf(i32 %offset, float addrspace(1)* nocapture readonly %ptr) local_unnamed_addr #7 {

Reading the mangling rules of the SPIR specification, I'm really questioning myself about the PKU3AS1f vs PU3AS1Kf.
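For illustration, this is how I read the two names under those rules (my own decomposition; both describe the same C signature):

_Z6vload4 | j | P K U3AS1 f  ->  vload4(unsigned int, const __global float *)   (const qualifier K written before the address-space extension)
_Z6vload4 | j | P U3AS1 K f  ->  vload4(unsigned int, const __global float *)   (const qualifier K written after the address-space extension)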

Another interesting finding: https://github.com/google/clspv/blob/8e13814d0fd80ab8c89bbddb4c0f77949dbf7ea5/lib/ReplaceOpenCLBuiltinPass.cpp#L973-L976

doe300 commented May 30, 2018

Also, if the K moves a little bit, then there's a match:

Yeah, there is the problem. I do have the float4 vload4(size_t, const __global float*) overload in the standard-library implementation (the implementation is here), but it is converted to a mangled name that differs only in the position of the K, hence it does not fit. But I have no idea why.
Google seems to do something different for exactly this one overload, but they don't state why.

lissyx commented May 30, 2018

@doe300 One thing I'm asking myself is whether _Z6vload4jPKU3AS1f is valid at all. The SPIR 1.2 and 2.0 specs' mangling rules in annex A.3 would not completely agree with that, because they distinguish P<builtin-type> from PU3ASN<builtin-type>. So a PK would fall into the first category, but then what <builtin-type> would U3AS1f be? There's no U, and there is only a u for a vendor extended type.

doe300 commented May 30, 2018

True, then this could be a bug in LLVM?!

lissyx commented May 30, 2018

@doe300 That's what I'm starting to think. Maybe @DuncanMcBain can shed some light there? As far as I can tell, ComputeCpp relies on some SVN snapshot of LLVM post 3.8.0.

Looking at 0.7.0 and 0.8.0, it's actually confusing me more:

$ ag -f 'vload4jPKU3AS1f' ../ComputeCpp-CE-0.*.0-Ubuntu-16.04-x86_64/include/ 
../ComputeCpp-CE-0.7.0-Ubuntu-16.04-x86_64/include/SYCL/spir_load_store_builtins.h
4448:extern "C" cl::sycl::vec<float, 4>::vector_t _Z6vload4jPKU3AS1f(
4470:  vec.set_data(_Z6vload4jPKU3AS1f(offset, ptr));

$ ag -f 'vload4jPU3AS1Kf' ../ComputeCpp-CE-0.*.0-Ubuntu-16.04-x86_64/include/ 
../ComputeCpp-CE-0.7.0-Ubuntu-16.04-x86_64/include/SYCL/opencl_math_builtins.hpp
78609:::cl::sycl::detail::__sycl_vector<::cl::sycl::cl_float, 4> _Z6vload4jPU3AS1Kf(
82927:  return ::_Z6vload4jPU3AS1Kf(offset, p);

../ComputeCpp-CE-0.8.0-Ubuntu-16.04-x86_64/include/SYCL/sycl_device_builtins.h
128480:::cl::sycl::detail::__sycl_vector<::cl::sycl::cl_float, 4> _Z6vload4jPU3AS1Kf(
136141:      ::_Z6vload4jPU3AS1Kf(offset, p));

So, from ComputeCpp 0.7.0:

extern "C" cl::sycl::vec<float, 4>::vector_t _Z6vload4jPKU3AS1f(                
    int32_t,                                                                    
    const cl::sycl::multi_ptr<                                                  
        float, cl::sycl::access::address_space::global_space>::pointer_t); 

lissyx commented May 30, 2018

$ clang-5.0 -cc1 -triple spir-unknown-unknown -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-all -Wno-gcc-compat -x cl -emit-llvm-bc -o VC4CLStdLib-5.0.bc VC4CLStdLib/include/VC4CLStdLib.h 2>/dev/null 
$ clang-3.8 -cc1 -triple spir-unknown-unknown -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-all -Wno-gcc-compat -x cl -emit-llvm-bc -o VC4CLStdLib-3.8.bc VC4CLStdLib/include/VC4CLStdLib.h 2>/dev/null 
$ clang-3.9 -cc1 -triple spir-unknown-unknown -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-all -Wno-gcc-compat -x cl -emit-llvm-bc -o VC4CLStdLib-3.9.bc VC4CLStdLib/include/VC4CLStdLib.h 2>/dev/null 
$ llvm-dis-3.8 VC4CLStdLib-3.8.bc
$ llvm-dis-3.9 VC4CLStdLib-3.9.bc
$ llvm-dis-5.0 VC4CLStdLib-5.0.bc
$ ll VC4CLStdLib-*.bc
-rw-r--r-- 1 alex alex 1,2M mai   30 11:28 VC4CLStdLib-3.8.bc
-rw-r--r-- 1 alex alex 1,8M mai   30 11:28 VC4CLStdLib-3.9.bc
-rw-r--r-- 1 alex alex 2,1M mai   30 11:28 VC4CLStdLib-5.0.bc
$ grep -E '_Z6vload4jPU3AS1Kf|_Z6vload4jPKU3AS1f' VC4CLStdLib-*.ll
VC4CLStdLib-3.8.ll:define spir_func <4 x float> @_Z6vload4jPKU3AS1f(i32 %offset, float addrspace(1)* nocapture readonly %ptr) #7 {
VC4CLStdLib-3.9.ll:define spir_func <4 x float> @_Z6vload4jPU3AS1Kf
_Z6vload4jPU3AS1Kf(i32 %offset, float addrspace(1)* nocapture readonly %ptr) local_unnamed_addr #7 {
VC4CLStdLib-5.0.ll:define spir_func <4 x float> @_Z6vload4jPU3AS1Kf(i32 %offset, float addrspace(1)* nocapture readonly %ptr) local_unnamed_addr #7 {

So, LLVM 3.8 generates _Z6vload4jPKU3AS1f but starting with 3.9 it's _Z6vload4jPU3AS1Kf.

Let's try something ...

diff --git a/include/_vector.h b/include/_vector.h
index 4f90f95..6ce8c03 100644
--- a/include/_vector.h
+++ b/include/_vector.h
@@ -210,6 +210,11 @@ VECTOR_LOAD(uint)
 VECTOR_LOAD(int)
 VECTOR_LOAD(float)
 
+float4 _Z6vload4jPKU3AS1f(size_t offset, const __global float * ptr)
+{
+       return *((const __global float4 *)(ptr + offset * 4));
+}
+
 VECTOR_STORE(uchar)
 VECTOR_STORE(char)
 VECTOR_STORE(ushort)

lissyx commented May 30, 2018

@doe300 With the hack documented above, I'm hitting some register issue as well:

[VC4CL] base=0x3de1e000, mem=0x6aaa5000
[VC4CL] Allocated 193648 bytes of buffer: handle 1022, device address 3185696768, host address 0x6aaa5000
[VC4CL] API call: cl_int clEnqueueWriteBuffer(cl_command_queue 0x1024784, cl_mem 0x6370160c, cl_bool 0, size_t 0, size_t 193648, void* 0x11ecc20, cl_uint 0, const cl_event* 0, cl_event* 0x688fe9f8)
[VC4CL] Tracking live-time of object: cl_event
[VC4CL] API call: cl_int clRetainEvent(cl_event 0x6370174c)
[VC4CL] API call: cl_int clSetEventCallback(cl_event 0x6370174c, cl_int 0, void(CL_CALLBACK*)(cl_event event, cl_int event_command_exec_status, void* user_data) 0x688febfc, void* 0)
[VC4CL] API call: cl_int clFlush(cl_command_queue 0x1024784)
[VC4CL] API call: cl_int clGetEventInfo(cl_event 0x6370174c, cl_event_info 4563, size_t 4, void* 0x688fecbc, size_t* 0)
[VC4CL] API call: cl_int clGetEventInfo(cl_event 0x6370174c, cl_event_info 4563, size_t 4, void* 0x688fec3c, size_t* 0)
[VC4CL] API call: cl_int clReleaseEvent(cl_event 0x6370174c)
[VC4CL] API call: cl_int clReleaseEvent(cl_event 0x6370174c)
[VC4CL] Releasing live-time of object: cl_event
[VC4CL] API call: cl_program clCreateProgramWithBinary(cl_context 0x1023cd4, cl_uint 1, const cl_device_id* 0x63806300, const size_t* 0x63806330, const unsigned char** 0x63806320, cl_int* 0x63806310, cl_int* 0x660fde68)
[VC4CL] Tracking live-time of object: cl_program
[VC4CL] API call: cl_int clBuildProgram(cl_program 0x638ca884, cl_uint 1, const cl_device_id* 0x63806300, const char* "-x spir -spir-std=1.2 ", void(CL_CALLBACK*)(cl_program program, void* user_data) 0x660fde48, void* 0)
[VC4CL] Linking complete with status: 0
[VC4CL] Compiling source with: -x spir -spir-std=1.2 
[VC4CL] Compilation error: Label/Register Mapping: Failed to assign local to ANY register: i32 %tmp.38074
[VC4CL] Compilation complete with status: -11
[VC4CL] Compilation log: [E] Wed May 30 10:15:01 2018: Background worker threw error: std::bad_alloc
[E] Wed May 30 10:15:01 2018: While running worker task: Code Generator for: SYCL_struct_Eigen__TensorSycl__internal__KernelConstructorNoshared_class_Eigen__TensorContractionOp_const_class_Eigen__array_struct_Eigen__IndexPair_int___1___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer_____float__float__float__struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_Eigen__RangeAccess_cl__sycl__access__mode__read_write___int__class_Eigen__array_int__1___class_Eigen__array_int__1___class_Eigen__array_int__1___false__false__true__64__64__1__4__4__16__16__0__0__struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_Eigen__SyclKernelDevice__false__true_
[E] Wed May 30 10:15:01 2018: Background worker threw error: std::bad_alloc
[E] Wed May 30 10:15:01 2018: While running worker task: Code Generator for: SYCL_struct_Eigen__TensorSycl__internal__GeneralTensorVector_class_Eigen__TensorContractionOp_const_class_Eigen__array_struct_Eigen__IndexPair_int___1___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer_____float__float__float__struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____class_cl__sycl__accessor_float__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__local__cl__sycl__access__placeholder__false_t___struct_Eigen__RangeAccess_cl__sycl__access__mode__write___class_cl__sycl__accessor_unsigned_char__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__global_buffer__cl__sycl__access__placeholder__false_t___int__class_Eigen__array_int__1___class_Eigen__array_int__1___class_Eigen__array_int__1___false__false__false__16__256__16__16__1__16__1__struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_Eigen__SyclKernelDevice__false__false__true_
[E] Wed May 30 10:15:03 2018: Background worker threw error: std::bad_alloc
[E] Wed May 30 10:15:03 2018: While running worker task: Code Generator for: SYCL_struct_Eigen__TensorSycl__internal__GeneralTensorVector_class_Eigen__TensorContractionOp_const_class_Eigen__array_struct_Eigen__IndexPair_int___1___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer_____float__float__float__struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____class_cl__sycl__accessor_float__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__local__cl__sycl__access__placeholder__false_t___struct_Eigen__RangeAccess_cl__sycl__access__mode__write___class_cl__sycl__accessor_unsigned_char__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__global_buffer__cl__sycl__access__placeholder__false_t___int__class_Eigen__array_int__1___class_Eigen__array_int__1___class_Eigen__array_int__1___false__false__false__16__256__16__16__1__16__1__struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_Eigen__SyclKernelDevice__false__true__true_
[E] Wed May 30 10:15:04 2018: Background worker threw error: std::bad_alloc
[E] Wed May 30 10:15:05 2018: While running worker task: Code Generator for: SYCL_struct_Eigen__TensorSycl__internal__GeneralTensorVector_class_Eigen__TensorContractionOp_const_class_Eigen__array_struct_Eigen__IndexPair_int___1___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer_____float__float__float__struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____class_cl__sycl__accessor_float__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__local__cl__sycl__access__placeholder__false_t___struct_Eigen__RangeAccess_cl__sycl__access__mode__write___class_cl__sycl__accessor_unsigned_char__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__global_buffer__cl__sycl__access__placeholder__false_t___int__class_Eigen__array_int__1___class_Eigen__array_int__1___class_Eigen__array_int__1___false__false__false__16__256__16__16__1__16__1__struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_Eigen__SyclKernelDevice__false__false__false_
[E] Wed May 30 10:15:05 2018: Background worker threw error: std::bad_alloc
[E] Wed May 30 10:15:06 2018: While running worker task: Code Generator for: SYCL_struct_Eigen__TensorSycl__internal__GenericPartialReduction_float__int__class_cl__sycl__accessor_unsigned_char__1__cl__sycl__access__mode__read__cl__sycl__access__target__global_buffer__cl__sycl__access__placeholder__false_t___class_cl__sycl__accessor_unsigned_char__1__cl__sycl__access__mode__write__cl__sycl__access__target__global_buffer__cl__sycl__access__placeholder__false_t___struct_Eigen__RangeAccess_cl__sycl__access__mode__write___class_cl__sycl__accessor_float__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__local__cl__sycl__access__placeholder__false_t___struct_Eigen__internal__SumReducer_float___struct_Eigen__TensorSycl__internal__ReductionPannel_int__16__16___true__true__false_
[E] Wed May 30 10:15:08 2018: Background worker threw error: std::bad_alloc
[E] Wed May 30 10:15:08 2018: While running worker task: Code Generator for: SYCL_struct_Eigen__TensorSycl__internal__GeneralTensorVector_class_Eigen__TensorContractionOp_const_class_Eigen__array_struct_Eigen__IndexPair_int___1___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer_____float__float__float__struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____class_cl__sycl__accessor_float__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__local__cl__sycl__access__placeholder__false_t___struct_Eigen__RangeAccess_cl__sycl__access__mode__write___class_cl__sycl__accessor_unsigned_char__1__cl__sycl__access__mode__read_write__cl__sycl__access__target__global_buffer__cl__sycl__access__placeholder__false_t___int__class_Eigen__array_int__1___class_Eigen__array_int__1___class_Eigen__array_int__1___false__true__true__16__256__16__16__1__16__1__struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____struct_Eigen__SyclKernelDevice__false__false__true_
[E] Wed May 30 10:15:09 2018: Background worker threw error: std::bad_alloc
[E] Wed May 30 10:15:10 2018: While running worker task: Code Generator for: SYCL_struct_Eigen__TensorSycl__internal__KernelConstructorNoshared_class_Eigen__TensorContractionOp_const_class_Eigen__array_struct_Eigen__IndexPair_int___1___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_float__2__1__int___16__MakePointer_____float__float__float__struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_utility__tuple__Tuple_struct_Eigen__DSizes_int__2_____struct_Eigen__RangeAccess_cl__sycl__access__mode__read_write___int__class_Eigen__array_int__1___class_Eigen__array_int__1___class_Eigen__array_int__1___false__t
[VC4CL] API call: cl_int clGetProgramBuildInfo(cl_program 0x638ca884, cl_device_id 0x104da10, cl_program_build_info 4483, size_t 0, void* 0, size_t* 0x660fd98c)
[VC4CL] API call: cl_int clGetProgramBuildInfo(cl_program 0x638ca884, cl_device_id 0x104da10, cl_program_build_info 4483, size_t 8193, void* 0x61400018, size_t* 0)
[VC4CL] API call: cl_int clReleaseProgram(cl_program 0x638ca884)
[VC4CL] Releasing live-time of object: cl_program
terminate called after throwing an instance of 'cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)7, cl::sycl::detail::exception_implementation<(cl::sycl::detail::exception_types)6, cl::sycl::exception> >'

Is that the same issue you mentioned?

doe300 commented May 30, 2018

That would explain why I never had this error before; I usually compile with only one LLVM version.
So we have:

  1. SPIRV-LLVM (which you don't use right now, right?) based on 3.6, which probably uses the wrong mangling
  2. The LLVM used to compile the ComputeCpp kernels to SPIR(V), which is LLVM 3.9 if I understand correctly
  3. The LLVM used by VC4C to link the modules (and compile the VC4CL std-lib module), which seems to have a version > 3.9 on your machine.

And using 2. and 3. together generates the problem...

Is that the same issue you mentioned?

Yes it is

lissyx commented May 30, 2018

@doe300 Yes, I stopped using SPIRV-LLVM. Actually, ComputeCpp seems to use some SVN snapshot after the 3.8.0 release, but obviously before the 3.9.0 change that does the mangling we want. My RPi3 is using 3.9 from the Raspbian repos.

@Naghasan

Hi @lissyx,

I'm a colleague of @DuncanMcBain, and I work specifically on the ComputeCpp compiler. I briefly looked through the discussion of this issue about the vload and mangling issue.

ComputeCpp outputs SPIR 1.2 modules, which means that builtins are mangled using the Itanium mangling rules as implemented by the Khronos-modified clang 3.2. Your driver seems to expect the upstream 3.9 mangling. So, as you noticed, _Z6vload4jPKU3AS1f is SPIR 1.2 compliant, but the upstream 3.9 clang will mangle it as _Z6vload4jPU3AS1Kf.
So you are right in your comment #31 (comment)

To answer the last comment, compute++ is based on LLVM 3.9, but it produces SPIR modules not LLVM 3.9 modules. So builtins will be mangled according to the SPIR 1.2 specification.

doe300 commented May 30, 2018

Thanks @Naghasan for the info. Let's see if I understand it correctly:
If I manage to force VC4C to use the SPIR mangling, it should be completely compatible with ComputeCpp?

Continuing that thought:
For code directly compiled with VC4C it doesn't matter; everything is compiled with the same arguments and therefore the same mangling (like before).
This would mean that VC4CL clCreateProgramWithBinary/clCreateProgramWithILKHR no longer supports "default" LLVM IR (in the version of the LLVM used by VC4C), but instead supports SPIR (independent of the LLVM version generating the SPIR code), which would actually be more deterministic.

lissyx commented May 30, 2018

@Naghasan I might be missing something, but according to A.1 and A.3 in https://www.khronos.org/registry/SPIR/specs/spir_spec-1.2.pdf I'm not able to understand how PKU3AS1f is valid :-). That's the SPIR 1.2 doc, which, to my understanding, is what you say ComputeCpp follows?

@Naghasan

@doe300 Yes, but you could still encounter some issues with the metadata (as they changed a bit). Another way would be to have a module that wraps those functions and redirects them to what the driver understands, so something like this:

float4 _Z6vload4jPKU3AS1f(....) {
  return _Z6vload4jPU3AS1Kf(....);
}

When consuming a SPIR module, you can then link the wrapper module to the user one. That should allow you to maintain both manglings.

@lissyx That's a good point, I think I lost track of what is said in the spec and what has become de-facto supported mistakes ...
FYI this is the list of mangled builtins https://github.com/KhronosGroup/SPIR-Tools/wiki/SPIR-1.2-built-in-functions/c25612ec5539636758cd63974759be4f33452590. It does not mean this is correct w.r.t. the spec, but that's what is understood by SPIR compliant drivers. Unfortunately, the spec is completely inactive now, so I don't expect any clarifications or fixes to appear.

doe300 added a commit to doe300/VC4CLStdLib that referenced this issue May 31, 2018
doe300 commented May 31, 2018

I added wrapper-functions for SPIR mangling to LLVM mangling (more accurately: function aliases), and now the vector version passes the normalization and optimization steps and runs out of RAM in code generation.
doe300/VC4C#102 fixes that problem too, resulting in the same register allocation errors as the scalar version.

BTW, the program takes up more than 1h of processor time to get to this point (on a Raspberry Pi 3B+).

lissyx commented Jun 1, 2018

@doe300 Thanks! I'm getting to the same point, with register allocation failing. I might try and see if I can help there. However, I'm clearly at way less than 1h of processing time on an RPi3B; it's strange that it takes that much time for you.

EDIT: Ok, 1h of user time across the 4 cores, so 16m real. That's a lot, but not surprising given what I saw on my laptop, on the Intel GPU.

doe300 commented Jun 1, 2018

To fix the register issues, we would either need to write a much smarter register allocator, which could be hard given the hardware characteristics (unless I did something very stupid ;)).
The probably easier and more efficient way would be to add register spilling. The spilling itself is not very hard, and I think I have the code for it lying around somewhere.
The greater problem is how to determine which registers to spill and for which duration, so that we keep the number of spills to a minimum. See also doe300/VC4C#60.
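For what it's worth, here is a generic illustration of one classic spill-selection heuristic (furthest next use, Belady-style); this is not VC4C code, and the nextUse map (instruction index of each candidate's next use) is an assumed input:

#include <map>
#include <string>

// Pick the live local whose next use is furthest away: spilling it keeps a
// register free for the longest stretch and tends to minimise reload traffic.
std::string pickSpillCandidate(const std::map<std::string, unsigned>& nextUse)
{
    std::string best;
    unsigned furthest = 0;
    for(const auto& entry : nextUse)
    {
        if(entry.second >= furthest)
        {
            furthest = entry.second;
            best = entry.first;
        }
    }
    return best;
}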

lissyx commented Jun 5, 2018

Thanks for all of your efficient fixes @doe300. The linked issue doe300/VC4C#60 is quite clear about the challenges; I'm afraid I don't know the hardware at all, even though I'd be glad to help on the register spilling. Even if it's very slow, I'd be happy if we could get inference working on the GPU :-)

lissyx commented Jul 10, 2018

@doe300 Sorry, I got pulled away from that work; have you been able to make progress on register spilling? I'll be away for one month starting early August, but I might be able to work on something in the meantime. I understand the (big picture of the) issues described in issue #60 about latencies, etc., but I lack understanding of how to technically perform the spilling itself.

doe300 commented Jul 10, 2018

No, I have not. I am still looking for some strategy to determine which locals to spill (and over which span) to get the minimum/a small number of spills. If you have any ideas on that, they are very welcome.

The spilling itself is not the problem (unless there are too many spills to fit into the VPM buffer anymore...).

lissyx commented Sep 12, 2018

So, I guess nobody got time to hack on that :-). I'll try to test this hardware again with the simpler, streaming-oriented model that we landed a few weeks ago. If we are lucky, it will put less pressure on the registers. Also, this new model should (finally) be optimizable by TFLite, so maybe we can hope for even less pressure.

abhiTronix commented Jan 20, 2019

@doe300 Awesome work, man 🥇. Thank you for this awesome repo. Today I got fully functional OpenCL on my Raspberry Pi 3 and achieved full 1080p rendering with ffmpeg. I also got almost 2x the framerate on my OpenCV algorithms. Thanks a lot for bringing this up.

lissyx commented Jun 24, 2019

@doe300 So, it's been a long time since I've been able to secure some time to play around with your code. I see that it is still an active project, but issue #60 is still open. Do you think it's worth it for me to try again?

doe300 removed the bug label Sep 4, 2019