Combine DMA loads #146

long-long-float · 2020-07-26T14:34:46Z

Implemented a combiner of DNA loads (see #144)

TODO

Create new file of ValueExpr and related functions.
Support other vloadn functions than vload16.
Write unit tests.
Support a case that multiple types of loading are in one block.
~~Support variable offsets at first argument of vloadn.~~
- The offset is limited to 4096 bytes (MPITCHB). However there are no method the compiler to know the real offset value.
Use Expression instead of ValueExpr.
Write the documentation

Example

Three loads (vload16) are combined to one load by combineDMALoads.

#define TYPE float
#define VTYPE float16

__kernel void dma_loads(int width, int height, __global TYPE *in, __global TYPE *out)
{
    for (int y = 1; y < height - 1; y++) {
        // These are not combined because a variable is used at offset (Future work).
        /* VTYPE up   = vload16(y - width, in);
        VTYPE center = vload16(y, in);
        VTYPE down = vload16(y + width + 1, in); */

        // These are combined.
        VTYPE up     = vload16(y - 1, in);
        VTYPE center = vload16(y,     in);
        VTYPE down   = vload16(y + 1, in);

        VTYPE r = (
            up                                               / (VTYPE)(3) +
            center                                           / (VTYPE)(3) +
            down                                             / (VTYPE)(3));

        vstore16(r, y, out);
    }
}

; snip
or -, mutex_acq, mutex_acq
ldi vpr_setup, vdr_setup(rows: 3, columns: 16 words, address: h32(0,0))
ldi vpr_setup, vdr_setup(memory pitch: 64 bytes)
add vpr_addr, r0, ra4
or r0, ra0, ra0
add r2, r0, 1 (1)
or -, vpr_wait, vpr_wait
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpr_setup, vpm_setup(num: 3, size: 16 words, stride: 1 rows, address: h32(0))
or r1, vpm, vpm
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
or r0, vpm, vpm
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
or ra2, vpm, vpm
or mutex_rel, 1 (1), 1 (1)

(A code without combineDMALoads)

; snip
or -, mutex_acq, mutex_acq
ldi vpr_setup, vdr_setup(rows: 1, columns: 16 words, address: h32(0,0))
ldi vpr_setup, vdr_setup(memory pitch: 0 bytes)
add vpr_addr, ra3, r0
or r0, ra1, ra1
add ra0, r0, 1 (1)
shl r0, r0, 6 (6)
or -, vpr_wait, vpr_wait
ldi vpr_setup, vpm_setup(num: 1, size: 16 words, stride: 1 rows, address: h32(0))
or r1, vpm, vpm
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpr_setup, vdr_setup(rows: 1, columns: 16 words, address: h32(0,0))
ldi vpr_setup, vdr_setup(memory pitch: 0 bytes)
add vpr_addr, ra3, r0
or -, vpr_wait, vpr_wait
ldi vpr_setup, vpm_setup(num: 1, size: 16 words, stride: 1 rows, address: h32(0))
or r3, vpm, vpm
or mutex_rel, 1 (1), 1 (1)
or r0, ra0, ra0
shl r0, r0, 6 (6)
or -, mutex_acq, mutex_acq
ldi vpr_setup, vdr_setup(rows: 1, columns: 16 words, address: h32(0,0))
ldi vpr_setup, vdr_setup(memory pitch: 0 bytes)
add vpr_addr, ra3, r0
ldi rep_all|r5, 1051372203
or -, vpr_wait, vpr_wait
ldi vpr_setup, vpm_setup(num: 1, size: 16 words, stride: 1 rows, address: h32(0))
or r2, vpm, vpm
or mutex_rel, 1 (1), 1 (1)

doe300

Just some high-level comments for now

src/periphery/VPM.h

doe300 · 2020-08-19T05:42:42Z

src/optimization/ValueExpr.h

+{
+    namespace optimizations
+    {
+        class ValueExpr


Is there a specific reason you did not use Expression here?

I didn't know this class. I'll try to use it.

doe300 · 2020-08-19T05:43:22Z

src/normalization/Normalizer.cpp

+        auto kernels = module.getKernels();
+        for(Method* kernelFunc : kernels)
+        {
+            optimizations::combineDMALoads(module, *kernelFunc, config);


Does this code have to be run before the normalization steps (e.g. before the memory accesses are rewritten)?

Yes. It is easy to combine vloadn, but a function inlineMethods replaces vloadn. So it should be run before inlineMethods. (I don't think it's not appropriate that an optimization process is in normalization steps.)

Can you please add this to a comment

src/optimization/Combiner.cpp

doe300 · 2020-08-23T07:52:51Z

To fix the build error, you will need to rebase on the latest master, I kind of screwed up there...

long-long-float · 2020-08-29T11:03:01Z

@doe300 I have a question. I want to create a variable typed i8*. But when I execute following program, I got an error General: Internal error: Cannot create complex type without a complex type!.
TYPE_INT8.getPointerType() seems to be nullptr. How can I do it?

auto in = assign(inIt, DataType(TYPE_INT8.getPointerType()), "%in") = UNIFORM_REGISTER;

doe300 · 2020-08-30T07:41:30Z

Well, getPointerType() extracts the pointer-type specific information of the current type, which a TYPE_INT8 does not have.
To create a pointer type from a type, you will need to call createPointerType(TYPE_INT_8) on your current Module or Method object, so:

auto bytePointer = method.createPointerType(TYPE_INT_8);

This is necessary, since "complex types" require additional data, which is stored not in the type object itself, but in the module (think like a symbol table for these data types) while all information for "simple types" is stored in the type object itself.

long-long-float · 2020-11-01T11:18:37Z

Evaluation of peformance

3x3 smoothing filter
Input: 800x533 24bit RGBA image

Without combineDMALoads: GPU: 0.4144 sec
With    combineDMALoads: GPU: 0.3454 sec

Speed-up of about 16%.

#define I(x, y) ((y) * width / 4 + (x))
__kernel void stencil(int width, int height, __global uchar *in, __global uchar *out)
{
    for (int x = 0; x < width / 4; x++) {
        for (int y = 1; y < height - 1; y++) {
            uchar4 left4  = (x > 0) ? vload4(I(x, y) * 4 - 1, in) : (uchar4)(0);
            uchar4 right4 = (x < (width / 4 - 1)) ? vload4(I(x, y) * 4 + 4, in) : (uchar4)(0);

            size_t idx = I(x, y);
            size_t yy = 800 / 4;
            uchar16 up     = vload16(idx - yy, in);
            uchar16 center = vload16(idx,      in);
            uchar16 down   = vload16(idx + yy, in);
            uchar16 left   = (uchar16)(left4, center.s01234567, center.s89AB);
            uchar16 right  = (uchar16)(center.s456789AB, center.sCDEF, right4);

            uchar16 r = (
                up     / (uchar16)(5) +
                left   / (uchar16)(5) +
                center / (uchar16)(5) +
                right  / (uchar16)(5) +
                down   / (uchar16)(5));

            vstore16(r, I(x, y), out);

        }
    }
}

long-long-float · 2020-11-01T11:21:37Z

@doe300 The work is finished. Please review changes.

doe300 · 2020-11-02T05:52:41Z

src/Expression.h

@@ -109,6 +109,9 @@ namespace vc4c
        // A fake operation to indicate an unsigned multiplication
        static constexpr OpCode FAKEOP_UMUL{"umul", 132, 132, 2, false, false, FlagBehavior::NONE};

+        static constexpr OpCode FAKEOP_MUL{"mul", 132, 132, 2, false, false, FlagBehavior::NONE};


Are these signed or unsigned? Can you please name them correctly and add a comment?

doe300 · 2020-11-02T05:53:57Z

src/normalization/Normalizer.cpp

+        auto kernels = module.getKernels();
+        for(Method* kernelFunc : kernels)
+        {
+            optimizations::combineDMALoads(module, *kernelFunc, config);


Can you please add this to a comment

doe300 · 2020-11-02T05:55:14Z

src/optimization/Combiner.cpp

+    }
+}
+
+SubExpression iiToExpr(const Value& value, const LocalUser* inst)


What is an ii? Please make the function name more expressive

doe300 · 2020-11-02T05:57:17Z

src/optimization/Combiner.cpp

+    }
+};
+
+void expandExpression(const SubExpression& subExpr, ExpandedExprs& expanded)


Is there a reason why you don't return the result, but instead have an output-parameter?

doe300 · 2020-11-02T05:59:41Z

src/optimization/Combiner.cpp

+                }
+                else
+                {
+                    expanded.push_back(


Use emplace_back instead of push_back, same below

doe300 · 2020-11-02T06:14:21Z

src/optimization/Combiner.cpp

+                                    {
+                                        // TODO: gather these instructions in one mutex lock
+                                        it = method.vpm->insertLockMutex(it, true);
+                                        assign(it, output) = VPM_IO_REGISTER;


If we use the "scratch" area, then this does not work, since the "scratch" area might be overridden by another QPU. For this to work, we would need to use an own VPM area per QPU.

doe300 · 2020-11-02T06:14:59Z

src/periphery/VPM.cpp

@@ -1169,15 +1170,17 @@ VPWDMASetup VPMArea::toWriteDMASetup(DataType elementType, uint8_t numRows) cons
    return setup;
 }

-VPRGenericSetup VPMArea::toReadSetup(DataType elementType, uint8_t numRows) const
+VPRGenericSetup VPMArea::toReadSetup(DataType elementType/*, uint8_t numRows*/) const


Why did you remove this?

doe300 · 2020-11-02T06:15:42Z

src/periphery/VPM.cpp

@@ -1186,7 +1189,10 @@ VPRGenericSetup VPMArea::toReadSetup(DataType elementType, uint8_t numRows) cons
    // if we can pack into a single row, do so. Otherwise set stride to beginning of next row
    const uint8_t stride =
        canBePackedIntoRow() ? 1 : static_cast<uint8_t>(TYPE_INT32.getScalarBitCount() / type.getScalarBitCount());
-    VPRGenericSetup setup(getVPMSize(type), stride, numRows, calculateQPUSideAddress(type, rowOffset, 0));
+
+    if (numRows_ >= 16) numRows_ = 1;


I think this is an error which we can't simply rewrite, but need to fail on. Or am I mistaken?

doe300 · 2020-11-02T06:17:08Z

test/TestOptimizationSteps.cpp

+        periphery::VPRDMASetup expectedDMASetup(dmaSetupMode, vectorType.getVectorWidth() % 16, numOfLoads, vpitch, 0);
+        periphery::VPRGenericSetup expectedVPRSetup(vprSize, vprStride, numOfLoads, 0);
+
+        inputMethod.dumpInstructions();


These should probably be removed

doe300 · 2020-11-02T06:18:21Z

test/TestOptimizationSteps.cpp

+
+        testCombineDMALoadsSub(module, inputMethod, config, Float16);
+    }
+


Please also add some negative tests, e.g. vloadn instructions accessing different sources or with different offsets

long-long-float force-pushed the combine-dma-loads branch from bb55411 to 1d4d720 Compare August 15, 2020 11:11

doe300 reviewed Aug 19, 2020

View reviewed changes

long-long-float force-pushed the combine-dma-loads branch from d21bd2b to c1a1378 Compare August 23, 2020 08:05

long-long-float force-pushed the combine-dma-loads branch from eefdda8 to 9704c72 Compare September 13, 2020 09:38

long-long-float force-pushed the combine-dma-loads branch from 56521ff to c2e2e31 Compare September 30, 2020 07:37

long-long-float added 22 commits October 4, 2020 14:26

Add first implementation

ad654ba

Implement checking the equal difference

b83a38a

Format

77150c6

Add implementation for other cases

76952cf

Fix to process at each blocks

a5fd0d3

Fix a little

57f4bff

Use memory pitch from the difference

a5a1397

Fix a little

408477a

Finished implementation

0a46290

Fixed a little

7a31e86

Care types other than uchar

70d1d85

Fix

fd9fed0

Create new files of ValueExpr

c39687c

Move combineDMALoads to Combiner

2d18ead

Add a test of CombineDMALoads

c7358c3

Move makeValueBinaryOpFromLocal to ValueExpr.*

a055fbe

Fix a test to check multiple types of vloadn

6ca4768

Support vloadn other than vload16

4bb82e9

Remove deep nests

86cf23b

Add test cases and fix

efebd92

Fix test codes

59cd141

Fix offset of vloadn

28de130

long-long-float added 6 commits October 4, 2020 14:26

Fix a test to check the generic setup

3b3bfb7

Remove unnecessary method

e96cd41

Revert SPIRVHelper.h and .cpp

99961e7

Fix to use Expression instead of ValueExpr and remove it

90c50ca

Fix calcValueExpr to take ExpandedExprs

6237397

Write the documentation of combineDMALoads

998e468

long-long-float force-pushed the combine-dma-loads branch from 583cd2a to 998e468 Compare October 11, 2020 08:21

long-long-float added 2 commits October 22, 2020 18:08

Implement replaceLocalToExpr

6e5fafa

Treat or as add

414e448

long-long-float changed the title ~~WIP: Combine DMA loads~~ Combine DMA loads Nov 1, 2020

doe300 requested changes Nov 2, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combine DMA loads #146

Combine DMA loads #146

long-long-float commented Jul 26, 2020 •

edited

Loading

doe300 left a comment

doe300 Aug 19, 2020

long-long-float Aug 23, 2020

doe300 Aug 19, 2020

long-long-float Aug 23, 2020

doe300 Nov 2, 2020

doe300 commented Aug 23, 2020

long-long-float commented Aug 29, 2020

doe300 commented Aug 30, 2020

long-long-float commented Nov 1, 2020 •

edited

Loading

long-long-float commented Nov 1, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020

doe300 Nov 2, 2020


		testCombineDMALoadsSub(module, inputMethod, config, Float16);
		}

Combine DMA loads #146

Are you sure you want to change the base?

Combine DMA loads #146

Conversation

long-long-float commented Jul 26, 2020 • edited Loading

TODO

Example

doe300 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

doe300 commented Aug 23, 2020

long-long-float commented Aug 29, 2020

doe300 commented Aug 30, 2020

long-long-float commented Nov 1, 2020 • edited Loading

long-long-float commented Nov 1, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

long-long-float commented Jul 26, 2020 •

edited

Loading

long-long-float commented Nov 1, 2020 •

edited

Loading