Implement register spilling #60

Open
doe300 opened this issue Mar 14, 2018 · 2 comments
Labels: enhancement, help wanted, optimization (related to an optimization step)

Comments

@doe300
Owner

doe300 commented Mar 14, 2018

We need to implement register spilling to be able to support more complex kernels.

If the total size of the spilled registers (times 12 QPUs!) is small enough, we could store them in the VPM and avoid accessing memory. Access to the spilled registers would still need to be synchronized via the hardware mutex.
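As a rough back-of-the-envelope check of the "small enough" condition: a QPU register holds 16 SIMD lanes of 32 bits, i.e. 64 bytes, so one spilled local costs 768 bytes across 12 QPUs. The sketch below only does that arithmetic. The VPM budget is an assumption supplied by the caller (the usable VPM size depends on configuration and is not stated in this issue), and the function names are illustrative, not part of VC4C.

```python
SIMD_LANES = 16      # lanes per QPU register
BYTES_PER_LANE = 4   # 32-bit elements
NUM_QPUS = 12        # from the comment above


def spill_bytes(num_spilled_registers, num_qpus=NUM_QPUS):
    """Bytes needed to hold the spilled registers of all QPUs."""
    return num_spilled_registers * SIMD_LANES * BYTES_PER_LANE * num_qpus


def fits_in_vpm(num_spilled_registers, vpm_bytes_available, num_qpus=NUM_QPUS):
    """True if the spill area fits into the assumed VPM budget.

    vpm_bytes_available is a hypothetical budget chosen by the caller.
    """
    return spill_bytes(num_spilled_registers, num_qpus) <= vpm_bytes_available
```

For example, with an assumed budget of 4096 bytes, `fits_in_vpm(5, 4096)` is `True` (3840 bytes) and `fits_in_vpm(6, 4096)` is `False` (4608 bytes).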

The hard part of this implementation is not spilling/loading the locals, but determining the minimum number of registers to spill.
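One possible starting point for estimating which locals to spill is a linear-scan style heuristic over live intervals: walk the intervals by start position and, whenever more locals are live than there are registers, spill the one whose interval ends last. This is a sketch of that general heuristic, not what VC4C does; `choose_spills`, the interval representation, and the register count are all hypothetical, and the result is not guaranteed to be minimal once live ranges can be split.

```python
def choose_spills(intervals, num_registers):
    """Pick locals to spill so at most num_registers are live at once.

    intervals: dict mapping local name -> (start, end), with start
    inclusive and end exclusive (hypothetical instruction indices).
    Returns the names of the spilled locals in the order chosen.
    """
    active = []   # (end, name) of locals currently kept in registers
    spilled = []
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Drop locals whose live interval ended before this one starts.
        active = [(e, n) for (e, n) in active if e > start]
        active.append((end, name))
        if len(active) > num_registers:
            # Too many live locals: spill the one that lives longest.
            victim = max(active)
            active.remove(victim)
            spilled.append(victim[1])
    return spilled
```

With `{"a": (0, 10), "b": (1, 3), "c": (2, 4)}` and two registers, the long-lived `a` is chosen as the only spill.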

(see doe300/VC4CL#24)

@nomaddo
Collaborator

nomaddo commented Apr 6, 2018

Note:
When spilling registers, cache incoherence between VPM (DMA) and TMU accesses will be a problem.
DMA loads and stores do not go through the L2 cache, but TMU loads do.

Example (assembled by py-videocore):

import numpy as np

from videocore.assembler import qpu
from videocore.driver import Driver

@qpu
def dma_store(asm, preload):
    mov(ra0, uniform)
    mov(ra1, uniform)

    shl(r0, element_number, 2)
    iadd(r0, r0, ra0)

    if preload: # ATTENTION: the value of `preload` changes the result
        # we don't use the value loaded here
        mov(tmu0_s, r0)
        nop(sig='load tmu0')

    setup_vpm_write()
    mov(vpm, element_number)

    setup_dma_store(nrows=1)
    start_dma_store(ra0) # store the value
    wait_dma_store()

    mov(tmu0_s, r0)
    nop(sig='load tmu0') # load the value that was just stored into the buffer

    setup_vpm_write()
    mov(vpm, r4)

    setup_dma_store(nrows=1)
    start_dma_store(ra1) # store into result
    wait_dma_store()

    exit()
	
with Driver() as drv:
    print('----- enable preload -----')
    buffer = drv.alloc(16, 'uint32')
    buffer[:] = 0
    result = drv.alloc(16, 'uint32')
    result[:] = 0

    print('[Before]')
    print(buffer)
    print(result)

    drv.execute(
        n_threads=1,
        program=drv.program(dma_store, True),
        uniforms=[buffer.address, result.address]
    )

    print('[After]')
    print(buffer)
    print(result)
	
with Driver() as drv:
    print('----- disable preload -----')
    buffer = drv.alloc(16, 'uint32')
    buffer[:] = 0
    result = drv.alloc(16, 'uint32')
    result[:] = 0

    print('[Before]')
    print(buffer)
    print(result)

    drv.execute(
        n_threads=1,
        program=drv.program(dma_store, False),
        uniforms=[buffer.address, result.address]
    )

    print('[After]')
    print(buffer)
    print(result)
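The effect the example is meant to show can be pictured with a toy model, assuming the behaviour described in the note: TMU loads fill and read an L2 cache, while DMA stores write straight to memory without invalidating it. Everything here (`ToyMemorySystem`, `run`, the single-word cache entries) is a hypothetical simplification and says nothing about real cache line sizes or eviction.

```python
class ToyMemorySystem:
    """Toy model: TMU reads are cached, DMA writes bypass the cache."""

    def __init__(self, size):
        self.ram = [0] * size
        self.l2 = {}  # address -> cached value (never invalidated)

    def tmu_load(self, addr):
        # First access fills the cache; later accesses hit the cached copy.
        if addr not in self.l2:
            self.l2[addr] = self.ram[addr]
        return self.l2[addr]

    def dma_store(self, addr, value):
        # Writes go directly to RAM and leave any cached copy untouched.
        self.ram[addr] = value


def run(preload):
    """Mirror the kernel's steps: optional preload, DMA store, TMU load."""
    mem = ToyMemorySystem(16)
    if preload:
        mem.tmu_load(0)   # value is discarded, but the cache is now filled
    mem.dma_store(0, 42)
    return mem.tmu_load(0)
```

In this model `run(True)` returns the stale `0`, while `run(False)` returns the freshly stored `42`.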

@doe300
Owner Author

doe300 commented Apr 6, 2018

Yeah, I think this is also the problem in #30
