Enhance `combineLoadingLiterals` #67

nomaddo · 2018-03-28T06:55:20Z

The output for deepCL/backpropweights.cl can be improved by reducing the number of ldi instructions.
For example, the constant 256 is loaded 69 times.
If we can combine these loads into one instruction, a significant speedup can be achieved.
https://gist.github.com/nomaddo/220b867143eff68f2c0f83f9188ab382

As a smaller example, the output for the following code can be improved:

kernel void f(global float * a) {
  for (int i = 0; i < 4; i++){
    a[i] += 129;
  }
}

// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'f' with 37 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items)
// label: %start_of_function
or ra0, unif, unif
// label: %tmp.0
nop.never 
or tmu0s, ra0, ra0
nop.load_tmu0.never 
ldi r0, 1124139008                       // should be fused
fadd r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 4, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add tmu0s, ra0, 4 (4)
nop.load_tmu0.never 
ldi r1, 1124139008                       // should be fused
fadd r0, r4, r1
or vpm, r0, r0
add tmu0s, ra0, 8 (8)
nop.load_tmu0.never 
fadd r0, r4, r1
or vpm, r0, r0
add tmu0s, ra0, 12 (12)
nop.load_tmu0.never 
ldi r0, 1124139008                       // should be fused
fadd r0, r4, r0
or vpm, r0, r0
or vpw_addr, ra0, ra0
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33 // to %start_of_function
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never
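
For readers decoding the listing: the immediate 1124139008 is the IEEE-754 single-precision bit pattern of 129.0, the constant added in the kernel. A quick check in Python (an illustration only, not part of VC4C; the helper name `float_bits` is made up here):

```python
import struct


def float_bits(value: float) -> int:
    """Return the 32-bit IEEE-754 pattern of a float as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


print(float_bits(129.0))       # 1124139008
print(hex(float_bits(129.0)))  # 0x43010000
```

So all three marked ldi instructions load the same value.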


nomaddo · 2018-04-03T03:01:31Z

Currently, combineLoadingLiterals is limited: the replacement only works within the previous 6 instructions, as specified by ACCUMULATOR_THRESHOLD_HINT in canReplaceLiteralLoad.

I think the following scheme is better:

1. Combine all load instructions in a basic block if possible.
2. Estimate the number of registers needed (functions from register allocation can probably be reused?).
3. Split the lifetimes of variables by inserting load instructions if that number exceeds the available registers.
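
To make the effect of the distance limit concrete, here is a rough sketch of window-limited load fusion on a toy instruction list. This is not VC4C's implementation: the tuple-based IR, the function name `combine_loads`, and the rule that distance is measured from the load that was kept are all assumptions for illustration. It also assumes that a register written by ldi is never written by any other instruction in the block.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (opcode, destination, operands)


def combine_loads(block: List[Instr], threshold: int) -> List[Instr]:
    """Drop an ldi when the same literal was loaded at most `threshold`
    instructions earlier, and redirect its readers to the earlier register.
    Assumes ldi destinations are not written by any other instruction."""
    last_load: Dict[str, Tuple[str, int]] = {}  # literal -> (register, position)
    rename: Dict[str, str] = {}                 # dropped register -> kept register
    out: List[Instr] = []
    for pos, (op, dest, args) in enumerate(block):
        if op == "ldi":
            hit = last_load.get(args[0])
            if hit is not None and pos - hit[1] <= threshold:
                rename[dest] = hit[0]           # fuse: reuse the earlier load
                continue
            last_load[args[0]] = (dest, pos)
            out.append((op, dest, args))
        else:
            out.append((op, dest, tuple(rename.get(a, a) for a in args)))
    return out


example: List[Instr] = [
    ("ldi", "c0", ("129.0",)),
    ("fadd", "x0", ("t0", "c0")),
    ("ldi", "c1", ("129.0",)),
    ("fadd", "x1", ("t1", "c1")),
    ("ldi", "c2", ("129.0",)),
    ("fadd", "x2", ("t2", "c2")),
]

print(len(combine_loads(example, 2)), len(combine_loads(example, 1000)))  # 5 4
```

With a small window the third load survives because it is too far from the first one; with a large window a single ldi remains, at the cost of keeping its register live for the whole block.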

nomaddo · 2018-04-03T07:31:57Z

Tested in https://github.com/nomaddo/VC4C/tree/addOption.
With --Xthreshold=1000, the number of output lines for deepCL/backpropweights.cl drops from 2302 to 1940.

With a larger threshold, register allocation fails because there are not enough registers.

doe300 · 2018-04-03T08:15:38Z

That's impressive. Do you have any numbers on compilation time slow-down?

nomaddo · 2018-04-03T08:28:46Z

It takes longer, but I don't think that is a problem.
I will measure the other test cases in deepCL.

time	time(opt)
4.26 s	6.48 s

nomaddo@nomaddo-AS:~/VC4C$ time ./build/VC4C -Dcl_clang_storage_class_specifiers -DSIGMOID=1 -DgInPerBlock=16 -DgOutPerBlock=16 -DgNumFilters=16 -DgFilterSize=16 -DgHalfFilterSize=8 -DgFilterSizeSquared=256 -DgPadZeros=0 -DgPoolingSize=16 -DgNumPlanes=16 -DgInputPlanes=16 -DgNumInputPlanes=16 -DgMargin=8 -DgNumOutputPlanes=16 -DinputRow=0 -DoutputRow=0 -DgInputSize=32 -DgInputSizeSquared=1024 -DgOutputSize=32 -DgOutputSizeSquared=1024 -DgNumStripes=2 -DgInputStripeOuterSize=2 -DgInputStripeInnerSize=2 -DgInputStripeMarginSize=0 -DgOutputStripeSize=16 -DgOutputStripeNumRows=12 -DgWorkgroupSize=12  -DgEven=0 -DgPixelsPerThread=128 --Xthreshold=1000 --quiet -o /tmp/hoge testing/deepCL/backpropweights.cl
threshold=1000

real	0m5.961s
user	0m6.480s
sys	0m2.300s
nomaddo@nomaddo-AS:~/VC4C$ time ./build/VC4C -Dcl_clang_storage_class_specifiers -DSIGMOID=1 -DgInPerBlock=16 -DgOutPerBlock=16 -DgNumFilters=16 -DgFilterSize=16 -DgHalfFilterSize=8 -DgFilterSizeSquared=256 -DgPadZeros=0 -DgPoolingSize=16 -DgNumPlanes=16 -DgInputPlanes=16 -DgNumInputPlanes=16 -DgMargin=8 -DgNumOutputPlanes=16 -DinputRow=0 -DoutputRow=0 -DgInputSize=32 -DgInputSizeSquared=1024 -DgOutputSize=32 -DgOutputSizeSquared=1024 -DgNumStripes=2 -DgInputStripeOuterSize=2 -DgInputStripeInnerSize=2 -DgInputStripeMarginSize=0 -DgOutputStripeSize=16 -DgOutputStripeNumRows=12 -DgWorkgroupSize=12  -DgEven=0 -DgPixelsPerThread=128 --quiet -o /tmp/hoge testing/deepCL/backpropweights.cl

real	0m3.875s
user	0m4.264s
sys	0m2.432s

nomaddo · 2018-04-03T08:52:55Z

Comparison between --Xthreshold=1000 and --Xthreshold=6 (same as the default value).

filename	line-num	line-num (opt)	time	time (opt)
BackpropWeightsScratch.cl	1767	1703	4.38s	4.46s
BackpropWeightsScratchLarge.cl	2550	2344	6.908s	8.128s
PoolingBackwardGpuNaive.cl	---	failed	---	---
SGD.cl	80	80	3.516s	3.568s
activate.cl	340	338	3.552s	3.676s
addscalar.cl	57	57	3.440s	3.576s
applyActivationDeriv.cl	127	127	3.520s	3.464s
backpropweights.cl	2301	1939	4.316s	7.672s
backpropweights_byrow.cl	966	954	3.968s	4.040s
backward.cl	2288	1925	4.312s	6.144s
backward_cached.cl	1752	1688	4.364s	4.440s
copy.cl	224	224	3.552s	3.552s
dropout.cl	161	161	3.492s	3.500s
forward1.cl	3024	2627	5.656s	8.572s
forward_byinputplane.cl	3031	2844	5.512s	8.016s
forward_fc.cl	1043	930	3.876s	3.912s
forwardfc_workgroupperfilterplane.cl	13	13	3.536s	3.516s
inv.cl	75	75	3.492s	3.456s
per_element_add.cl	---	failed	---	---
per_element_mult.cl	61	61	3.504s	3.496s
pooling.cl	2279	1915	4.268s	6.280s
reduce_segments.cl	124	123	3.576s	3.580s
sqrt.cl	61	61	3.608s	3.544s
squared.cl	56	56	3.464s	3.560s
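
To read the table in relative terms, the following sketch computes lines saved and extra compile time for a few rows copied from above. It assumes the "(opt)" columns are the --Xthreshold=1000 runs; the helper name `summarize` is made up here.

```python
# (line-num, line-num (opt), time in s, time (opt) in s), copied from the table
rows = {
    "backpropweights.cl": (2301, 1939, 4.316, 7.672),
    "forward1.cl": (3024, 2627, 5.656, 8.572),
    "forward_fc.cl": (1043, 930, 3.876, 3.912),
    "SGD.cl": (80, 80, 3.516, 3.568),
}


def summarize(lines, lines_opt, secs, secs_opt):
    """Return (lines saved, percent of lines saved, extra compile seconds)."""
    saved = lines - lines_opt
    return saved, round(100.0 * saved / lines, 1), round(secs_opt - secs, 3)


for name, row in rows.items():
    saved, percent, extra = summarize(*row)
    print(f"{name}: {saved} lines saved ({percent}%), {extra:+.3f} s")
```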

doe300 · 2018-04-03T08:57:42Z

Thanks. That is a difference I think we can well live with.
From the table it looks like the duration of the optimization is directly linked to the number of instructions saved. So there should be no case where the optimization runs long without any reduced execution time.

nomaddo · 2018-04-03T09:22:45Z

I think we can change the default value (from 6 to 100, for example?).
For better optimization, I am wondering how to improve this further:

- Optimize by specifying the parameter (that is, tuning by hand).
- Implement a smarter feature (automatic adjustment using an estimate of register pressure).

@doe300 Any suggestion?

doe300 · 2018-04-03T09:38:07Z

> 6 to 100, for example

I doubt we can do this in general, since the parameter is used at several places, where some of them would fail to compile a lot sooner.
I came up with 6 by manually testing a few values until my sample code compiled correctly. If anyone has an idea how to auto-detect the "perfect" value, that would be great. Otherwise, I think we'll need to re-test a few values and set a new hint manually.
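
The manual re-testing described here could be scripted roughly as below. This is only a sketch: `compiles` is a hypothetical callback that would run the compiler on the sample code with a given threshold and report success, and `fake_compiles` is a stand-in so the example runs on its own. Every candidate is tried rather than bisected, because nothing in this thread shows that compilation success is monotonic in the threshold.

```python
from typing import Callable, Iterable, Optional


def largest_working_threshold(candidates: Iterable[int],
                              compiles: Callable[[int], bool]) -> Optional[int]:
    """Return the largest candidate for which `compiles` succeeds, or None."""
    working = [t for t in candidates if compiles(t)]
    return max(working) if working else None


# Stand-in predicate for illustration only: pretend register allocation
# fails for thresholds above 150. Real behaviour has to be measured.
def fake_compiles(threshold: int) -> bool:
    return threshold <= 150


print(largest_working_threshold([6, 50, 100, 200, 1000], fake_compiles))  # 100
```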

nomaddo · 2018-04-03T10:31:34Z

Thanks. For now, it seems best to only add an option that lets users control the threshold.

doe300 · 2018-04-07T10:33:17Z

@nomaddo , is this resolved?

nomaddo · 2018-04-09T05:02:14Z

Not yet. As this issue has a huge impact on performance, I will look for a better way to improve it.

nomaddo · 2018-04-11T19:06:02Z

I think this can be done with life-range analysis. After fusing the ldi instructions, if the number of registers in use exceeds the number of available registers, we "forget" variables marked as ldi by inserting another ldi.

Example

To keep things simple, assume we have only 5 registers.

Example code:

ldi a, 1000
iadd a, a, b
iadd c, a, d
imul24 e, f, c
iadd g, a, c
...
... 
ldi a, 1000
iadd a, a, e
...
...  instructions using b, c, d, e

Rename variables to make them read-only where possible.

Now a is read-only except for its first assignment:

ldi a, 1000
iadd a2, a, b
iadd c, a2, d
imul24 e, f, c
iadd g, a2, c
... 
... // replace a with a2
...  instructions using b, c, d, e
ldi a, 1000
iadd a2, a, e
...
... // replace a with a2
...  instructions using b, c, d, e, f
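
The renaming step can be sketched on the same toy example. This is an illustration under assumptions, not code from VC4C: the tuple-based IR, the function name `make_literals_read_only`, and the naming rule (append "2" to the register name, as in the listing above) are all made up here, and the sketch does not check for name collisions.

```python
from typing import Dict, List, Set, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (opcode, destination, operands)


def make_literals_read_only(block: List[Instr]) -> List[Instr]:
    """Redirect non-ldi writes to a register that holds a literal to a new
    name, so the literal register is only ever written by ldi."""
    literal_regs: Set[str] = set()  # registers that were written by ldi
    alias: Dict[str, str] = {}      # original name -> name holding the new value
    out: List[Instr] = []
    for op, dest, args in block:
        if op == "ldi":
            literal_regs.add(dest)
            alias.pop(dest, None)   # later reads of dest see the literal again
            out.append((op, dest, args))
            continue
        args = tuple(alias.get(a, a) for a in args)
        if dest in literal_regs:
            alias[dest] = dest + "2"  # hypothetical naming rule
            dest = alias[dest]
        out.append((op, dest, args))
    return out


original: List[Instr] = [
    ("ldi", "a", ("1000",)),
    ("iadd", "a", ("a", "b")),
    ("iadd", "c", ("a", "d")),
    ("imul24", "e", ("f", "c")),
    ("iadd", "g", ("a", "c")),
    ("ldi", "a", ("1000",)),
    ("iadd", "a", ("a", "e")),
]

for op, dest, args in make_literals_read_only(original):
    print(op, ", ".join((dest,) + args))
```

After this pass both ldi instructions write a and nothing else does, which is what makes the second one safe to drop in the next step.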

Fuse all possible ldi

ldi a, 1000
iadd a2, a, b
iadd c, a2, d
imul24 e, f, c
iadd g, a2, c
... 
... 
// ldi a, 1000
iadd a2, a, e
...
...  instructions using b, c, d, e

Run life-range analysis to count register usage, labelling the variables loaded by ldi (a hint for the next step, which forgets variables).

ldi a, 1000                 // b, d, f
iadd a2, a, b               // a(ldi), b, d, f
iadd c, a2, d               // a(ldi), a2, b, d, f
imul24 e, f, c              // a(ldi), b, c, d, f
iadd g, a2, c               // a(ldi), a2, b, c, d, e       !!! exceeds the number of registers !!!
... 
... 
iadd a2, a, e               // a(ldi), b, c, d, e
...
...  instructions using b, c, d, e

If register usage exceeds the maximum number of registers, insert an ldi to shorten the life-range of the variable.

ldi a, 1000                 // b, d, f
iadd a2, a, b               // b, d, f
iadd c, a2, d               // a2, b, d, f
imul24 e, f, c              // b, c, d, f
iadd g, a2, c               // a2, b, c, d, e       !!! does not exceed the number of registers !!!
... 
... 
ldi a, 1000
iadd a2, a, e               // a(ldi), b, c, d, e
...
...  instructions using b, c, d, e

doe300 · 2018-04-11T19:32:22Z

This is a great idea.
One note though: Since we have 2 register-banks, we should set the limit for this pass not to the actual number of registers, but to some smaller value to prevent more register-association failures.
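
The proposal, including a limit set below the real register count, can be sketched as follows. This is a toy model and not VC4C code: the tuple-based IR, the helper names, the greedy strategy, and the example block are all assumptions for illustration. Pressure is approximated by the number of live registers before each instruction.

```python
from typing import Dict, List, Set, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (opcode, destination, operands)


def uses(instr: Instr) -> Set[str]:
    """Registers read by an instruction; the operand of ldi is a literal."""
    op, _, args = instr
    return set() if op == "ldi" else set(args)


def live_in_sets(block: List[Instr], live_out: Set[str]) -> List[Set[str]]:
    """Backward liveness: registers live just before each instruction."""
    live = set(live_out)
    result = []
    for instr in reversed(block):
        live = (live - {instr[1]}) | uses(instr)
        result.append(live)
    result.reverse()
    return result


def max_pressure(block: List[Instr], live_out: Set[str]) -> int:
    return max(len(s) for s in live_in_sets(block, live_out) + [set(live_out)])


def limit_pressure(block: List[Instr], literals: Dict[str, str],
                   limit: int, live_out: Set[str]) -> List[Instr]:
    """Greedily re-insert `ldi reg, literal` before a use of reg while doing
    so lowers the peak pressure and the peak is still above `limit`."""
    block = list(block)
    improved = True
    while improved and max_pressure(block, live_out) > limit:
        improved = False
        for pos, instr in enumerate(block):
            for reg in sorted(uses(instr) & set(literals)):
                reload = ("ldi", reg, (literals[reg],))
                candidate = block[:pos] + [reload] + block[pos:]
                if max_pressure(candidate, live_out) < max_pressure(block, live_out):
                    block, improved = candidate, True
                    break
            if improved:
                break
    return block


fused: List[Instr] = [
    ("ldi", "a", ("1000",)),
    ("iadd", "x", ("a", "b")),
    ("iadd", "y", ("x", "b")),
    ("imul24", "z", ("x", "y")),
    ("iadd", "w", ("z", "b")),
    ("iadd", "v", ("a", "w")),
]

print(max_pressure(fused, {"v"}))                        # 4
split = limit_pressure(fused, {"a": "1000"}, 3, {"v"})
print(max_pressure(split, {"v"}), split[5])              # 3 ('ldi', 'a', ('1000',))
```

Passing a `limit` smaller than the physical register count leaves headroom for the two register banks, at the price of a few extra ldi instructions.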

nomaddo added the enhancement label Mar 28, 2018

nomaddo self-assigned this Apr 3, 2018

nomaddo mentioned this issue Apr 6, 2018

Add optimization option --fcombine-load-threshold=XXX #77

Merged

doe300 added the optimization related to an optimization step label Apr 7, 2018
