
Sensible Default values? #12

Closed
kronocharia opened this issue Feb 12, 2015 · 2 comments

Comments

@kronocharia

"You should think about what "sensible default" means: the TBB user guide gives some guidance in the form of a "rule of thumb". Your rule of thumb should probably take into account the size of the inner loop: remember that the work is O(n^2), and the work within each original parfor iteration is O(n). But on the other side, you always want enough tasks to keep the processors occupied (no matter how many there are), so you can't set the chunk size to K=n."

The guide suggests timing to make sure tasks run for at least several thousand clock cycles, but presumably that wasn't what we were meant to do?

and later on

"As before, if HPCE_FFT_LOOP_K is not set, choose a sensible default based on your analysis of the scaling with n, and/or experiments. Though remember, it should be a sensible default for all machines (even those with 64 cores)."

I'm not sure how to evaluate the performance in order to tune these 'sensible defaults' for all kinds of machines.

@m8pple
Contributor

m8pple commented Feb 12, 2015

Exactly how many instructions a given chunk of code takes is going to vary
from machine to machine and compiler to compiler, so you can't be exact
without timing.

But, you can look at a piece of code and make a decent estimate. So for
example, if I look at:

float *wibble = ...;
for(unsigned i=0; i<n; i+=K){
    for(unsigned j=i; j<i+K; j++){ // assuming K is a factor of n
        wibble[j] = wibble[j]*wibble[j];
    }
}

Then I guess that the inner loop is a compare, branch, load, multiply, and a store.
Given this code is probably vectorisable,
and a super-scalar architecture can issue multiple instructions at once, I might play it safe
and assume one instruction per iteration. So in that case, I might choose K=16384 as a
reasonable point to stop splitting.
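
(For concreteness, here is roughly how that kind of choice maps onto TBB. This is just a sketch, with wibble and n as the same placeholders as above; the third argument of blocked_range is the grain size, and simple_partitioner makes TBB stop subdividing once a sub-range has at most K iterations.)

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void square_all(float *wibble, unsigned n, unsigned K)
{
    // K is the grain size: simple_partitioner stops splitting once a
    // sub-range has at most K iterations, so it plays the role of the
    // "stop splitting" threshold discussed above.
    tbb::parallel_for(
        tbb::blocked_range<unsigned>(0u, n, K),
        [=](const tbb::blocked_range<unsigned> &r){
            for(unsigned j=r.begin(); j<r.end(); j++){
                wibble[j] = wibble[j]*wibble[j];
            }
        },
        tbb::simple_partitioner()
    );
}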

If I had something like:

float split(int begin, int n)
{
    if(n==1){
        return leaf(begin);             // base case: do the work for one index
    }else{
        float a=split(begin, n/2);      // recurse on each half...
        float b=split(begin+n/2, n/2);
        return branch(a,b);             // ...then combine the two results
    }
}

then I just need to estimate what value of n (i.e. end-begin) I think will result
in just over the threshold. If I look at leaf and it takes about 16 instructions,
I could make the approximation that for n=1024 I have about the right amount
of work. If I thought that branch took a lot of time, I might go crazy and apply
the master theorem, but it really isn't needed.
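
(A sketch of what that cutoff might look like using tbb::parallel_invoke; leaf and branch are the same placeholder functions as above, and K=1024 is just the estimate from this paragraph, not a prescribed value.)

#include <tbb/parallel_invoke.h>

float split_par(int begin, int n)
{
    const int K = 1024;   // threshold estimated above (~16 instructions per leaf)
    if(n==1){
        return leaf(begin);
    }else if(n<=K){
        // Below the threshold: plain serial recursion, no task overhead
        float a=split_par(begin, n/2);
        float b=split_par(begin+n/2, n/2);
        return branch(a,b);
    }else{
        // Above the threshold: each half is big enough to be its own task
        float a, b;
        tbb::parallel_invoke(
            [&]{ a=split_par(begin, n/2); },
            [&]{ b=split_par(begin+n/2, n/2); }
        );
        return branch(a,b);
    }
}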

As you vary K from K=1 up to K=n (i.e. n/K=1), you usually find there is quite a large region
where K gives good speedup. Whether a task has 1e4 or 1e6 instructions
doesn't matter much; what matters is whether it has only 1e2 instructions, or whether there are only
two tasks when you have 8 CPUs.
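
(If you want to see where that region is experimentally, a rough sweep over K is usually enough. A sketch, using the earlier squaring loop as a stand-in for the real kernel:)

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const unsigned n = 1u<<24;
    std::vector<float> wibble(n, 1.0f);

    // Sweep K over powers of four and print the wall-clock time for each run.
    for(unsigned K=1; K<=n; K*=4){
        auto t0=std::chrono::high_resolution_clock::now();
        tbb::parallel_for(
            tbb::blocked_range<unsigned>(0u, n, K),
            [&](const tbb::blocked_range<unsigned> &r){
                for(unsigned j=r.begin(); j<r.end(); j++){
                    wibble[j]=wibble[j]*wibble[j];
                }
            },
            tbb::simple_partitioner()
        );
        auto t1=std::chrono::high_resolution_clock::now();
        std::printf("K=%u : %.6f s\n", K, std::chrono::duration<double>(t1-t0).count());
    }
    return 0;
}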

@m8pple m8pple closed this as completed Feb 12, 2015
@kronocharia
Author

It seems like the parfor K keeps getting better the larger it is. This was from K=1 to 5096, with the trendlines for larger K further to the right of the graph. This can't be correct?
[attached screenshot: screen shot 2015-02-12 at 19 40 53]
