OMP accelerated loops causing massive slow-down #1063
Interesting. Note that the QUDA code is also buggy as |
What's the usual size of |
Also, as |
Thanks for pointing out the thread-local bug. I had it coded properly, then deleted it all, and when I added the quarantined code back in I copied it sloppily :( I did have the following boolean guards in there at one point:

// Continue for the other columns
if ((n_kr - (i + 1)) > 64) {
#pragma omp parallel for schedule(static, 32)
  for (int j = i + 1; j < n_kr; j++) {
    Complex tempp = R[i][j];
    R[i][j] -= (T11 * tempp + T12 * R[i + 1][j]);
    R[i + 1][j] -= (T21 * tempp + T22 * R[i + 1][j]);
  }
} else {
  for (int j = i + 1; j < n_kr; j++) {
    Complex tempp = R[i][j];
    R[i][j] -= (T11 * tempp + T12 * R[i + 1][j]);
    R[i + 1][j] -= (T21 * tempp + T22 * R[i + 1][j]);
  }
}

(and similar for the other loops) as a heuristic way of ensuring OpenMP offload only when it might be worth it. It seemed to work well on my laptop, but here in QUDA any OMP pragmas destroy the loop speed.
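As an aside, the same size heuristic can be written without duplicating the loop body by using OpenMP's if clause; a minimal sketch using the variable names from the snippet above (the 64-element threshold is just the value used there, not a tuned number):

// Sketch: the if() clause lets the runtime fall back to a single thread when the
// remaining columns are too few to amortise the parallel-region overhead.
#pragma omp parallel for schedule(static, 32) if ((n_kr - (i + 1)) > 64)
for (int j = i + 1; j < n_kr; j++) {
  Complex tempp = R[i][j];
  R[i][j] -= (T11 * tempp + T12 * R[i + 1][j]);
  R[i + 1][j] -= (T21 * tempp + T22 * R[i + 1][j]);
}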
BTW, for a decent calculation size, |
I did wonder if it was data ordering. On my laptop, I use |
A possibility might be that your Eigen-based code is simply very slow, so it scales well with the number of threads.
@kostrzewa These plots are gorgeous! I tested on two architectures: Haswell and Power9. I actually resolved the issue on Haswell because there was a proliferation of background jobs running during the OMP tests. Once they had all calmed down (and I used an all-important

// Continue for the other columns
#ifdef _OPENMP
#pragma omp parallel for schedule(static, 32)
#endif
  for (int j = i + 1; j < n_kr; j++) {
    Complex tempp = R[i][j];
    R[i][j] -= (T11 * tempp + T12 * R[i + 1][j]);
    R[i + 1][j] -= (T21 * tempp + T22 * R[i + 1][j]);
  }
}

// Rotate R and V, i.e. H->RQ. V->VQ
// Loop over columns of upper Hessenberg
for (int j = 0; j < n_kr - 1; j++) {
  if (abs(R11[j]) > tol) {
    // Loop over the rows, up to the sub-diagonal element i=j+1
#ifdef _OPENMP
#pragma omp parallel
    {
#pragma omp for schedule(static, 32) nowait
#endif
      for (int i = 0; i < j + 2; i++) {
        Complex tempp = R[i][j];
        R[i][j] -= (R11[j] * tempp + R12[j] * R[i][j + 1]);
        R[i][j + 1] -= (R21[j] * tempp + R22[j] * R[i][j + 1]);
      }
#ifdef _OPENMP
#pragma omp for schedule(static, 32) nowait
#endif
      for (int i = 0; i < n_kr; i++) {
        Complex tempp = Q[i][j];
        Q[i][j] -= (R11[j] * tempp + R12[j] * Q[i][j + 1]);
        Q[i][j + 1] -= (R21[j] * tempp + R22[j] * Q[i][j + 1]);
      }
#ifdef _OPENMP
    } // the closing brace of the parallel region must be guarded too, or the non-OMP build breaks
#endif
  }
}

I've yet to observe this speed-up on Power9. I'm wondering if there is some equivalent |
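In case it helps on the Power9 side, a small affinity check like the following (plain OpenMP 4.5 runtime calls, nothing QUDA-specific) makes it easy to see where the threads actually end up, i.e. whether OMP_PLACES / OMP_PROC_BIND or the launcher's binding flags are doing what one expects:

// Affinity check, assuming an OpenMP 4.5 runtime: prints where each thread is bound
// so the effect of OMP_PLACES / OMP_PROC_BIND (or the job launcher's binding flags) is visible.
#include <cstdio>
#include <omp.h>

int main() {
  printf("max threads = %d, places = %d, proc_bind = %d\n",
         omp_get_max_threads(), omp_get_num_places(), (int)omp_get_proc_bind());
#pragma omp parallel
  {
#pragma omp critical
    printf("thread %2d of %2d -> place %d\n",
           omp_get_thread_num(), omp_get_num_threads(), omp_get_place_num());
  }
  return 0;
}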
So, on Summit's Power9, the command to use is:
I tested with
N_CPU=2: ~56 secs
N_CPU=4: ~44 secs
N_CPU=8: ~42 secs
N_CPU=16: 35 secs
N_CPU=32: ~75 secs
So using N_CPU=16 gives the best speed-up. In contrast, the baseline Eigen implementation with LAPACKE acceleration, using 1 CPU and 16 CPUs, gives:
EIGEN N_CPU=16: ~66 secs
So it looks like the best speed-up comes from using the host OMP-accelerated code, 16 CPUs, and a QR tolerance set to one order of magnitude smaller than the eigensolver tolerance.
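For anyone repeating the scan, the N_CPU sweep above can also be driven from a single binary rather than separate job steps; a sketch under the assumption that the solver respects omp_set_num_threads() (do_solve() is a placeholder, not a real QUDA call):

// Sketch of the thread-count sweep, assuming the solver honours omp_set_num_threads().
#include <chrono>
#include <cstdio>
#include <omp.h>

static void do_solve() { /* placeholder for the actual eigensolver call */ }

int main() {
  for (int n_cpu : {1, 2, 4, 8, 16, 32}) {
    omp_set_num_threads(n_cpu);
    auto t0 = std::chrono::steady_clock::now();
    do_solve();
    double dt = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    printf("N_CPU=%2d: %.1f secs\n", n_cpu, dt);
  }
  return 0;
}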
Cheers, these are essentially one-liners in ggplot2 (https://ggplot2.tidyverse.org/). I'm happy that you've found scaling after all!
In the IRAM eigensolver there is some quarantined code in the qrIteration function that uses OMP. When activated it will cause a 100x slowdown on the loop. The routine was prototyped in laptop code
https://github.com/cpviolator/VOATOL/blob/master/include/algoHelpers.h#L470
and gives good speed-up there. The QUDA code is here:
https://github.com/lattice/quda/blob/feature/arnoldi/lib/eig_iram.cpp#L272
Not sure what the problem is as the #pragmas are pretty straightforward.
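For what it's worth, a standalone sketch (not QUDA code; n_kr, the matrix contents and the Complex typedef are made-up stand-ins) that times the same style of rotation loop with and without the pragma is one way to check how much of the slowdown is simply per-iteration parallel-region overhead on short loops:

// Standalone sketch, not QUDA code: times a Givens-style column update with and
// without OpenMP to expose parallel-region overhead on short loops. n_kr, the
// matrix contents and the Complex typedef are assumptions, not values from QUDA.
#include <chrono>
#include <complex>
#include <cstdio>
#include <vector>

using Complex = std::complex<double>;

int main() {
  const int n_kr = 128;    // assumed Krylov space size
  const int iters = 20000; // repeat the tiny loop so the timing is measurable
  const int i = 1;
  std::vector<std::vector<Complex>> R(n_kr, std::vector<Complex>(n_kr, Complex(1.0, 0.5)));
  const Complex T11(0.3, 0.1), T12(0.2, -0.1), T21(-0.1, 0.2), T22(0.4, 0.0);

  // One column update: distinct j touch distinct elements, so the loop is race-free.
  auto body = [&](int j) {
    Complex temp = R[i][j];
    R[i][j] -= (T11 * temp + T12 * R[i + 1][j]);
    R[i + 1][j] -= (T21 * temp + T22 * R[i + 1][j]);
  };

  auto time_it = [&](auto &&loop) {
    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; it++) loop();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
  };

  double t_serial = time_it([&] {
    for (int j = i + 1; j < n_kr; j++) body(j);
  });
  double t_omp = time_it([&] {
#pragma omp parallel for schedule(static, 32)
    for (int j = i + 1; j < n_kr; j++) body(j);
  });

  printf("serial: %.4f s  openmp: %.4f s  ratio: %.1fx\n", t_serial, t_omp, t_omp / t_serial);
  return 0;
}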