-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine DMA loads #146
base: master
Are you sure you want to change the base?
Combine DMA loads #146
Conversation
bb55411
to
1d4d720
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some high-level comments for now
src/optimization/ValueExpr.h
Outdated
{ | ||
namespace optimizations | ||
{ | ||
class ValueExpr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason you did not use Expression
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know this class. I'll try to use it.
auto kernels = module.getKernels(); | ||
for(Method* kernelFunc : kernels) | ||
{ | ||
optimizations::combineDMALoads(module, *kernelFunc, config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this code have to be run before the normalization steps (e.g. before the memory accesses are rewritten)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It is easy to combine vloadn
, but a function inlineMethods
replaces vloadn
. So it should be run before inlineMethods
. (I don't think it's not appropriate that an optimization process is in normalization steps.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add this to a comment
To fix the build error, you will need to rebase on the latest master, I kind of screwed up there... |
d21bd2b
to
c1a1378
Compare
@doe300 I have a question. I want to create a variable typed auto in = assign(inIt, DataType(TYPE_INT8.getPointerType()), "%in") = UNIFORM_REGISTER; |
Well,
This is necessary, since "complex types" require additional data, which is stored not in the type object itself, but in the module (think like a symbol table for these data types) while all information for "simple types" is stored in the type object itself. |
eefdda8
to
9704c72
Compare
56521ff
to
c2e2e31
Compare
583cd2a
to
998e468
Compare
Evaluation of peformance
Speed-up of about 16%. #define I(x, y) ((y) * width / 4 + (x))
__kernel void stencil(int width, int height, __global uchar *in, __global uchar *out)
{
for (int x = 0; x < width / 4; x++) {
for (int y = 1; y < height - 1; y++) {
uchar4 left4 = (x > 0) ? vload4(I(x, y) * 4 - 1, in) : (uchar4)(0);
uchar4 right4 = (x < (width / 4 - 1)) ? vload4(I(x, y) * 4 + 4, in) : (uchar4)(0);
size_t idx = I(x, y);
size_t yy = 800 / 4;
uchar16 up = vload16(idx - yy, in);
uchar16 center = vload16(idx, in);
uchar16 down = vload16(idx + yy, in);
uchar16 left = (uchar16)(left4, center.s01234567, center.s89AB);
uchar16 right = (uchar16)(center.s456789AB, center.sCDEF, right4);
uchar16 r = (
up / (uchar16)(5) +
left / (uchar16)(5) +
center / (uchar16)(5) +
right / (uchar16)(5) +
down / (uchar16)(5));
vstore16(r, I(x, y), out);
}
}
} |
@doe300 The work is finished. Please review changes. |
@@ -109,6 +109,9 @@ namespace vc4c | |||
// A fake operation to indicate an unsigned multiplication | |||
static constexpr OpCode FAKEOP_UMUL{"umul", 132, 132, 2, false, false, FlagBehavior::NONE}; | |||
|
|||
static constexpr OpCode FAKEOP_MUL{"mul", 132, 132, 2, false, false, FlagBehavior::NONE}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these signed or unsigned? Can you please name them correctly and add a comment?
auto kernels = module.getKernels(); | ||
for(Method* kernelFunc : kernels) | ||
{ | ||
optimizations::combineDMALoads(module, *kernelFunc, config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add this to a comment
} | ||
} | ||
|
||
SubExpression iiToExpr(const Value& value, const LocalUser* inst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is an ii? Please make the function name more expressive
} | ||
}; | ||
|
||
void expandExpression(const SubExpression& subExpr, ExpandedExprs& expanded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why you don't return the result, but instead have an output-parameter?
} | ||
else | ||
{ | ||
expanded.push_back( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use emplace_back
instead of push_back
, same below
{ | ||
// TODO: gather these instructions in one mutex lock | ||
it = method.vpm->insertLockMutex(it, true); | ||
assign(it, output) = VPM_IO_REGISTER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use the "scratch" area, then this does not work, since the "scratch" area might be overridden by another QPU. For this to work, we would need to use an own VPM area per QPU.
@@ -1169,15 +1170,17 @@ VPWDMASetup VPMArea::toWriteDMASetup(DataType elementType, uint8_t numRows) cons | |||
return setup; | |||
} | |||
|
|||
VPRGenericSetup VPMArea::toReadSetup(DataType elementType, uint8_t numRows) const | |||
VPRGenericSetup VPMArea::toReadSetup(DataType elementType/*, uint8_t numRows*/) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove this?
@@ -1186,7 +1189,10 @@ VPRGenericSetup VPMArea::toReadSetup(DataType elementType, uint8_t numRows) cons | |||
// if we can pack into a single row, do so. Otherwise set stride to beginning of next row | |||
const uint8_t stride = | |||
canBePackedIntoRow() ? 1 : static_cast<uint8_t>(TYPE_INT32.getScalarBitCount() / type.getScalarBitCount()); | |||
VPRGenericSetup setup(getVPMSize(type), stride, numRows, calculateQPUSideAddress(type, rowOffset, 0)); | |||
|
|||
if (numRows_ >= 16) numRows_ = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an error which we can't simply rewrite, but need to fail on. Or am I mistaken?
periphery::VPRDMASetup expectedDMASetup(dmaSetupMode, vectorType.getVectorWidth() % 16, numOfLoads, vpitch, 0); | ||
periphery::VPRGenericSetup expectedVPRSetup(vprSize, vprStride, numOfLoads, 0); | ||
|
||
inputMethod.dumpInstructions(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should probably be removed
|
||
testCombineDMALoadsSub(module, inputMethod, config, Float16); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add some negative tests, e.g. vloadn instructions accessing different sources or with different offsets
Implemented a combiner of DNA loads (see #144)
TODO
ValueExpr
and related functions.vloadn
functions thanvload16
.Support variable offsets at first argument ofvloadn
.MPITCHB
). However there are no method the compiler to know the real offset value.Expression
instead ofValueExpr
.Example
Three loads (
vload16
) are combined to one load bycombineDMALoads
.(A code without
combineDMALoads
)