Implement register spilling #60

doe300 · 2018-03-14T17:04:35Z

We need to implement register spilling to be able to support more complex kernels.

If the total size of the spilled registers (times 12 QPUs!) is small enough, we could store them in the VPM and avoid accessing main memory. Access to the spilled registers would still need to be synchronized via the hardware mutex.
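
A rough back-of-the-envelope check of "small enough" could look like the sketch below. It assumes one spilled register is a full 16-element vector of 32-bit values (64 bytes) and that each of the 12 QPUs needs its own copy. The VPM capacities passed in the usage lines (4 KB and 12 KB) are assumptions for illustration only, and the function names are hypothetical.

```python
BYTES_PER_REGISTER = 16 * 4  # 16 SIMD elements, 32 bit each
NUM_QPUS = 12                # every QPU needs its own spill area


def max_spilled_registers(vpm_bytes, num_qpus=NUM_QPUS):
    """Number of registers per QPU that fit into the given VPM area."""
    return vpm_bytes // (BYTES_PER_REGISTER * num_qpus)


def fits_in_vpm(num_spilled, vpm_bytes, num_qpus=NUM_QPUS):
    """True if spilling `num_spilled` registers on every QPU fits into the VPM area."""
    return num_spilled * BYTES_PER_REGISTER * num_qpus <= vpm_bytes


# Assumed capacities, not taken from this thread:
print(max_spilled_registers(4 * 1024))
print(max_spilled_registers(12 * 1024))
```

Any VPM area already used for regular memory access would have to be subtracted from `vpm_bytes` first.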

The actual problem with this implementation is not the spilling/loading of locals, but determining the minimum number of registers to spill.
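
To make the "how many to spill" question concrete, here is a minimal sketch of one common heuristic, the spill choice used by linear-scan register allocation: sweep over the live ranges in order of their start and, whenever more locals are live than there are registers, spill the one whose range ends last. This is not the algorithm used in this project. It ignores the split into register files A/B and accumulators, and `choose_spills`, the interval format, and the example values are all hypothetical.

```python
def choose_spills(intervals, num_registers):
    """Pick locals to spill so that at most `num_registers` are live at once.

    intervals: dict mapping a local's name to its (start, end) live range,
               both ends inclusive.
    Returns the names of the spilled locals in the order they were chosen.
    """
    active = []   # (end, name) of locals currently held in registers
    spilled = []
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Drop locals whose live range ended before this one starts
        active = [(e, n) for (e, n) in active if e >= start]
        active.append((end, name))
        if len(active) > num_registers:
            # Too many live locals: spill the one that stays live the longest
            victim = max(active)
            active.remove(victim)
            spilled.append(victim[1])
    return spilled


live_ranges = {"a": (0, 10), "b": (1, 3), "c": (2, 4), "d": (5, 6)}
print(choose_spills(live_ranges, num_registers=2))
```

In this example, spilling the long-lived `a` is enough, because `b`, `c` and `d` never need more than two registers at the same time.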

(see doe300/VC4CL#24)

nomaddo · 2018-04-06T06:33:40Z

Note:
When spilling registers, cache incoherence between VPM and TMU accesses will be a problem.
DMA loads and stores don't use the L2 cache, but TMU loads do.

Example (assembled by py-videocore):

import numpy as np

from videocore.assembler import qpu
from videocore.driver import Driver

@qpu
def dma_store(asm, preload):
    mov(ra0, uniform)
    mov(ra1, uniform)

    shl(r0, element_number, 2)
    iadd(r0, r0, ra0)

    if preload:  # ATTENTION: the value of `preload` changes the result
        # we don't use the value loaded here
        mov(tmu0_s, r0)
        nop(sig='load tmu0')

    setup_vpm_write()
    mov(vpm, element_number)

    setup_dma_store(nrows=1)
    start_dma_store(ra0) # store the value
    wait_dma_store()

    mov(tmu0_s, r0)
    nop(sig='load tmu0')  # load the value that was just stored into the buffer

    setup_vpm_write()
    mov(vpm, r4)

    setup_dma_store(nrows=1)
    start_dma_store(ra1)  # store into result
    wait_dma_store()

    exit()
	
with Driver() as drv:
    print('----- enable preload -----')
    buffer = drv.alloc(16, 'uint32')
    buffer[:] = 0
    result = drv.alloc(16, 'uint32')
    result[:] = 0

    print('[Before]')
    print(buffer)
    print(result)

    drv.execute(
        n_threads=1,
        program=drv.program(dma_store, True),
        uniforms=[buffer.address, result.address]
    )

    print('[After]')
    print(buffer)
    print(result)
	
with Driver() as drv:
    print('----- disable preload -----')
    buffer = drv.alloc(16, 'uint32')
    buffer[:] = 0
    result = drv.alloc(16, 'uint32')
    result[:] = 0

    print('[Before]')
    print(buffer)
    print(result)

    drv.execute(
        n_threads=1,
        program=drv.program(dma_store, False),
        uniforms=[buffer.address, result.address]
    )

    print('[After]')
    print(buffer)
    print(result)
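
The mechanism described above can be illustrated without hardware by a toy model in plain Python. It is only a sketch under the assumptions stated in this comment: DMA stores write directly to RAM and neither update nor invalidate the cache, while TMU loads go through the cache. The class `ToyMemory` and the function `run` are hypothetical names, and the values the model produces are not measured hardware output.

```python
class ToyMemory:
    """RAM plus a cache that only the TMU path uses."""

    def __init__(self, size):
        self.ram = [0] * size
        self.l2 = {}  # address -> cached value

    def dma_store(self, addr, values):
        # DMA bypasses the cache: RAM changes, cached copies stay as they are
        for offset, value in enumerate(values):
            self.ram[addr + offset] = value

    def tmu_load(self, addr, count):
        # TMU reads through the cache: a miss fills the entry from RAM
        loaded = []
        for a in range(addr, addr + count):
            if a not in self.l2:
                self.l2[a] = self.ram[a]
            loaded.append(self.l2[a])
        return loaded


def run(preload):
    """Mirror the steps of dma_store() above on the toy model."""
    mem = ToyMemory(32)
    buffer_addr, result_addr = 0, 16
    if preload:
        mem.tmu_load(buffer_addr, 16)            # value unused, but now cached
    mem.dma_store(buffer_addr, list(range(16)))  # store the element numbers
    loaded = mem.tmu_load(buffer_addr, 16)       # read them back via the TMU
    mem.dma_store(result_addr, loaded)           # store what was read into result
    return mem.ram[result_addr:result_addr + 16]


print('preload enabled: ', run(True))
print('preload disabled:', run(False))
```

In the model, the preloaded run reads back the stale zeros from the cache, while the run without preload sees the freshly stored element numbers.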

doe300 · 2018-04-06T16:08:31Z

Yeah, I think this is also the problem in #30

doe300 added the enhancement label Mar 14, 2018

doe300 mentioned this issue Mar 29, 2018

Enhance general porpose optimization #55

Merged

doe300 added help wanted optimization related to an optimization step labels Apr 7, 2018

doe300 mentioned this issue Apr 17, 2018

Speed-up memory access #15

Closed

doe300 mentioned this issue Jun 1, 2018

Running with DeepSpeech (TensorFlow OpenCL/ComputeCpp) doe300/VC4CL#31

Open
