Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine DMA loads #146

Open
wants to merge 30 commits into
base: master
Choose a base branch
from

Conversation

long-long-float
Copy link
Contributor

@long-long-float long-long-float commented Jul 26, 2020

Implemented a combiner of DNA loads (see #144)

TODO

  • Create new file of ValueExpr and related functions.
  • Support other vloadn functions than vload16.
  • Write unit tests.
  • Support a case that multiple types of loading are in one block.
  • Support variable offsets at first argument of vloadn.
    • The offset is limited to 4096 bytes (MPITCHB). However there are no method the compiler to know the real offset value.
  • Use Expression instead of ValueExpr.
  • Write the documentation

Example

Three loads (vload16) are combined to one load by combineDMALoads.

#define TYPE float
#define VTYPE float16

__kernel void dma_loads(int width, int height, __global TYPE *in, __global TYPE *out)
{
    for (int y = 1; y < height - 1; y++) {
        // These are not combined because a variable is used at offset (Future work).
        /* VTYPE up   = vload16(y - width, in);
        VTYPE center = vload16(y, in);
        VTYPE down = vload16(y + width + 1, in); */

        // These are combined.
        VTYPE up     = vload16(y - 1, in);
        VTYPE center = vload16(y,     in);
        VTYPE down   = vload16(y + 1, in);

        VTYPE r = (
            up                                               / (VTYPE)(3) +
            center                                           / (VTYPE)(3) +
            down                                             / (VTYPE)(3));

        vstore16(r, y, out);
    }
}
; snip
or -, mutex_acq, mutex_acq
ldi vpr_setup, vdr_setup(rows: 3, columns: 16 words, address: h32(0,0))
ldi vpr_setup, vdr_setup(memory pitch: 64 bytes)
add vpr_addr, r0, ra4
or r0, ra0, ra0
add r2, r0, 1 (1)
or -, vpr_wait, vpr_wait
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpr_setup, vpm_setup(num: 3, size: 16 words, stride: 1 rows, address: h32(0))
or r1, vpm, vpm
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
or r0, vpm, vpm
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
or ra2, vpm, vpm
or mutex_rel, 1 (1), 1 (1)

(A code without combineDMALoads)

; snip
or -, mutex_acq, mutex_acq
ldi vpr_setup, vdr_setup(rows: 1, columns: 16 words, address: h32(0,0))
ldi vpr_setup, vdr_setup(memory pitch: 0 bytes)
add vpr_addr, ra3, r0
or r0, ra1, ra1
add ra0, r0, 1 (1)
shl r0, r0, 6 (6)
or -, vpr_wait, vpr_wait
ldi vpr_setup, vpm_setup(num: 1, size: 16 words, stride: 1 rows, address: h32(0))
or r1, vpm, vpm
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpr_setup, vdr_setup(rows: 1, columns: 16 words, address: h32(0,0))
ldi vpr_setup, vdr_setup(memory pitch: 0 bytes)
add vpr_addr, ra3, r0
or -, vpr_wait, vpr_wait
ldi vpr_setup, vpm_setup(num: 1, size: 16 words, stride: 1 rows, address: h32(0))
or r3, vpm, vpm
or mutex_rel, 1 (1), 1 (1)
or r0, ra0, ra0
shl r0, r0, 6 (6)
or -, mutex_acq, mutex_acq
ldi vpr_setup, vdr_setup(rows: 1, columns: 16 words, address: h32(0,0))
ldi vpr_setup, vdr_setup(memory pitch: 0 bytes)
add vpr_addr, ra3, r0
ldi rep_all|r5, 1051372203
or -, vpr_wait, vpr_wait
ldi vpr_setup, vpm_setup(num: 1, size: 16 words, stride: 1 rows, address: h32(0))
or r2, vpm, vpm
or mutex_rel, 1 (1), 1 (1)

Copy link
Owner

@doe300 doe300 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some high-level comments for now

src/periphery/VPM.h Show resolved Hide resolved
{
namespace optimizations
{
class ValueExpr
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason you did not use Expression here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know this class. I'll try to use it.

auto kernels = module.getKernels();
for(Method* kernelFunc : kernels)
{
optimizations::combineDMALoads(module, *kernelFunc, config);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this code have to be run before the normalization steps (e.g. before the memory accesses are rewritten)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It is easy to combine vloadn, but a function inlineMethods replaces vloadn. So it should be run before inlineMethods. (I don't think it's not appropriate that an optimization process is in normalization steps.)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add this to a comment

src/optimization/Combiner.cpp Outdated Show resolved Hide resolved
@doe300
Copy link
Owner

doe300 commented Aug 23, 2020

To fix the build error, you will need to rebase on the latest master, I kind of screwed up there...

@long-long-float
Copy link
Contributor Author

@doe300 I have a question. I want to create a variable typed i8*. But when I execute following program, I got an error General: Internal error: Cannot create complex type without a complex type!.
TYPE_INT8.getPointerType() seems to be nullptr. How can I do it?

auto in = assign(inIt, DataType(TYPE_INT8.getPointerType()), "%in") = UNIFORM_REGISTER;

@doe300
Copy link
Owner

doe300 commented Aug 30, 2020

Well, getPointerType() extracts the pointer-type specific information of the current type, which a TYPE_INT8 does not have.
To create a pointer type from a type, you will need to call createPointerType(TYPE_INT_8) on your current Module or Method object, so:

auto bytePointer = method.createPointerType(TYPE_INT_8);

This is necessary, since "complex types" require additional data, which is stored not in the type object itself, but in the module (think like a symbol table for these data types) while all information for "simple types" is stored in the type object itself.

@long-long-float
Copy link
Contributor Author

long-long-float commented Nov 1, 2020

Evaluation of peformance

  • 3x3 smoothing filter
  • Input: 800x533 24bit RGBA image
Without combineDMALoads: GPU: 0.4144 sec
With    combineDMALoads: GPU: 0.3454 sec

Speed-up of about 16%.

#define I(x, y) ((y) * width / 4 + (x))
__kernel void stencil(int width, int height, __global uchar *in, __global uchar *out)
{
    for (int x = 0; x < width / 4; x++) {
        for (int y = 1; y < height - 1; y++) {
            uchar4 left4  = (x > 0) ? vload4(I(x, y) * 4 - 1, in) : (uchar4)(0);
            uchar4 right4 = (x < (width / 4 - 1)) ? vload4(I(x, y) * 4 + 4, in) : (uchar4)(0);

            size_t idx = I(x, y);
            size_t yy = 800 / 4;
            uchar16 up     = vload16(idx - yy, in);
            uchar16 center = vload16(idx,      in);
            uchar16 down   = vload16(idx + yy, in);
            uchar16 left   = (uchar16)(left4, center.s01234567, center.s89AB);
            uchar16 right  = (uchar16)(center.s456789AB, center.sCDEF, right4);

            uchar16 r = (
                up     / (uchar16)(5) +
                left   / (uchar16)(5) +
                center / (uchar16)(5) +
                right  / (uchar16)(5) +
                down   / (uchar16)(5));

            vstore16(r, I(x, y), out);

        }
    }
}

@long-long-float long-long-float changed the title WIP: Combine DMA loads Combine DMA loads Nov 1, 2020
@long-long-float
Copy link
Contributor Author

@doe300 The work is finished. Please review changes.

@@ -109,6 +109,9 @@ namespace vc4c
// A fake operation to indicate an unsigned multiplication
static constexpr OpCode FAKEOP_UMUL{"umul", 132, 132, 2, false, false, FlagBehavior::NONE};

static constexpr OpCode FAKEOP_MUL{"mul", 132, 132, 2, false, false, FlagBehavior::NONE};
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these signed or unsigned? Can you please name them correctly and add a comment?

auto kernels = module.getKernels();
for(Method* kernelFunc : kernels)
{
optimizations::combineDMALoads(module, *kernelFunc, config);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add this to a comment

}
}

SubExpression iiToExpr(const Value& value, const LocalUser* inst)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is an ii? Please make the function name more expressive

}
};

void expandExpression(const SubExpression& subExpr, ExpandedExprs& expanded)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you don't return the result, but instead have an output-parameter?

}
else
{
expanded.push_back(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use emplace_back instead of push_back, same below

{
// TODO: gather these instructions in one mutex lock
it = method.vpm->insertLockMutex(it, true);
assign(it, output) = VPM_IO_REGISTER;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use the "scratch" area, then this does not work, since the "scratch" area might be overridden by another QPU. For this to work, we would need to use an own VPM area per QPU.

@@ -1169,15 +1170,17 @@ VPWDMASetup VPMArea::toWriteDMASetup(DataType elementType, uint8_t numRows) cons
return setup;
}

VPRGenericSetup VPMArea::toReadSetup(DataType elementType, uint8_t numRows) const
VPRGenericSetup VPMArea::toReadSetup(DataType elementType/*, uint8_t numRows*/) const
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this?

@@ -1186,7 +1189,10 @@ VPRGenericSetup VPMArea::toReadSetup(DataType elementType, uint8_t numRows) cons
// if we can pack into a single row, do so. Otherwise set stride to beginning of next row
const uint8_t stride =
canBePackedIntoRow() ? 1 : static_cast<uint8_t>(TYPE_INT32.getScalarBitCount() / type.getScalarBitCount());
VPRGenericSetup setup(getVPMSize(type), stride, numRows, calculateQPUSideAddress(type, rowOffset, 0));

if (numRows_ >= 16) numRows_ = 1;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an error which we can't simply rewrite, but need to fail on. Or am I mistaken?

periphery::VPRDMASetup expectedDMASetup(dmaSetupMode, vectorType.getVectorWidth() % 16, numOfLoads, vpitch, 0);
periphery::VPRGenericSetup expectedVPRSetup(vprSize, vprStride, numOfLoads, 0);

inputMethod.dumpInstructions();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should probably be removed


testCombineDMALoadsSub(module, inputMethod, config, Float16);
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add some negative tests, e.g. vloadn instructions accessing different sources or with different offsets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants