How to handle asynchronous computation #245
The QMCPACK CUDA code is dead. My CUDA code does not depend on a single stream for synchronization. It does depend on being able to construct transfers (most important) and evaluations (not very important) from more than one trialWF worth of data. The synchronization pattern is going to depend on the architecture, on whether you have bothered to write a device specialization for a particular operation, and on which other operations have been specialized.
SPOs, Determinants, and Jastrows should not have to know each other's implementations, nor should the logic for synchronizing between them be spread through the object hierarchy. Looking to the future, I prefer something like the following. Reading the input(s) should produce a parameters object for each QMC run.
// in the driver, anything with the semantics of std::async could be used
driver<Async>::do_ratioGrad
{
//crowd is in scope
det = crowd_wf[0].wfc[i];
spo = crowd_wfc_spo_map[det.id];
//crowd_calc_location(spo) returns a device tag for that spo
std::future<int> fut_spo_eval = likeSTDAsync(crowd_calc_location(spo), launch_type, multi_func<SPOType,DEVICE>.evaluate(spo, positions, iels, crowd_els, ions, crowd_v, crowd_g, crowd_h));
fut_spo_eval.get();
std::future<std::vector<ValueType>> fut_ratio_grad = likeSTDAsync(crowd_calc_location(dets), launch_type, multi_func<DetType,DEVICE>.ratioGrad(dets, crowd_v, crowd_g, iat));
}
driver<Sync>::do_ratioGrad
{
//crowd is in scope
multi_func<SPOType, DEVICE>.evaluate(spo, positions, iels, crowd_els, ions, crowd_v, crowd_g, crowd_h);
multi_func<DetType, DEVICE>.ratioGrad(dets, crowd_v, crowd_g, iat);
}
template<class SPOType, Device DT = CPU>
class multiFunc {
//default implementation
void evaluate(SPOType spo, auto positions, auto iels, auto crowd_els, auto ions, auto& crowd_v, auto& crowd_g, auto& crowd_h) {
//or parallel block construct of your choice
for(size_t i = 0; i < crowd_v.size(); ++i)
spo.evaluate(positions[i], crowd_v[i], crowd_g[i], crowd_h[i]);
}
};
You actually don't need the driver specialization if you make likeSTDAsync default to a blocking synchronous evaluation.
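As a rough illustration, a likeSTDAsync that defaults to blocking-on-get execution could be a thin wrapper over std::async. The names likeSTDAsync and DeviceTag and their signatures below are illustrative assumptions, not existing miniQMC/QMCPACK API:

#include <future>
#include <utility>

// Hypothetical device tag; a real one would identify a GPU, stream, or queue.
struct DeviceTag { int id = 0; };

// Sketch: forward the callable to std::async. With std::launch::deferred the
// work simply runs on the calling thread when the future's get() is called,
// which reproduces fully synchronous behaviour; passing std::launch::async
// (or dispatching on the DeviceTag) recovers real concurrency.
template <class Callable, class... Args>
auto likeSTDAsync(DeviceTag /*where*/, std::launch policy, Callable&& f, Args&&... args)
{
  return std::async(policy, std::forward<Callable>(f), std::forward<Args>(args)...);
}

With such a default, the driver<Sync> and driver<Async> paths could share one implementation and differ only in the launch policy they pass in.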
If we pursue either of these, we need to be very careful of the trade-offs associated with moving to this model of programming. New programmers coming to the code are likely to have to learn to reason about these constructs for the first time. Also, we will have to be incredibly sure that our unit tests / integration tests are robust enough to catch the sort of race conditions that may not occur for every ordering of the evaluations. Are we totally convinced that the speed-up gained from this programming model is worth the other costs?
@lshulen Good point. Totally agree. MiniQMC may be a good place to play, but for QMCPACK step 1 should involve the absolute minimum complexity and therefore minimum asynchronicity, possibly none. Only when that is working and we see a clear and significant benefit to a more complex and capable implementation should we move forward.
My rough idea is to incorporate asynchronous computation while hiding the details at the lowest possible level. Since we have limited confidence in applying a tasking programming model to the whole code, the following code may hopefully achieve sufficient asynchronous behaviour and performance.
When we compute the trial wavefunction, the call sequence is
TrialWF->ratioGrad()
{
TrialWF->WFC[0]->ratioGrad(iel) //determinant
{
SPO->evaluate(iel);
getInvRow(psi_inv);
dot(spo_v, psi_inv);
}
TrialWF->WFC[1]->ratioGrad(iel); //Jastrow
}
Instead, we separate ratioGrad into two parts: the async launching part and the wait.
TrialWF->ratioGrad()
{
TrialWF->WFC[0]->ratioGradLaunchAsync(iel) //determinant
{
SPO->evaluateLaunchAsync(iel);
getInvRowLaunchAsync(psi_inv);
}
TrialWF->WFC[1]->ratioGradLaunchAsync(iel); //Jastrow
/// finish launching async calls of all the WFCs
TrialWF->WFC[0]->ratioGrad(iel) //determinant
{
SPO->evaluate(iel); // wait completion inside
getInvRow(psi_inv); // wait completion inside
dot(spo_v, psi_inv);
}
TrialWF->WFC[1]->ratioGrad(iel); //Jastrow
}
This is similar to what we have in the CUDA code, but I'm expanding it to allow working through levels if necessary. CUDA or OpenMP offload can be hidden beneath. In the case of CUDA, the delayed update engine and the SPO can use different streams to maximize asynchronous concurrent execution, whereas the QMCPACK CUDA code relies on a single stream to enforce synchronization. The SPO can also be OpenMP offload, with the asynchronous control self-contained. If necessary, TrialWF->ratioGrad can also be split into ratioGradLaunchAsync and ratioGrad, which can be called by the driver.
Any piece not needing async remains unchanged.
Pros: we explicitly control the dependencies.
Cons: we explicitly control the waits instead of leaving that to the runtime.
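A minimal sketch of how the LaunchAsync/wait split might be kept self-contained inside a component, assuming std::future as the completion handle; the class and member names below are illustrative, and a CUDA implementation could instead hold a cudaStream_t and synchronize on it in the wait:

#include <future>
#include <vector>

class SPOSetExample
{
  std::future<void> pending_;   // handle to the in-flight evaluation
  std::vector<double> spo_v_;   // values produced by the evaluation

public:
  // Part 1: launch the work and return immediately.
  void evaluateLaunchAsync(int iel)
  {
    pending_ = std::async(std::launch::async, [this, iel] {
      (void)iel; // ... compute single-particle orbitals for electron iel into spo_v_ ...
    });
  }

  // Part 2: the regular call waits for completion before its result is used.
  const std::vector<double>& evaluate(int /*iel*/)
  {
    if (pending_.valid())
      pending_.get();           // block only if the launched work has not finished
    return spo_v_;
  }
};

The driver-visible interface stays the same; callers that never invoke evaluateLaunchAsync get the original synchronous behaviour.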