No binary reproducibility with tuning turned on #182
Maybe I should add that calls are made to both the multi-shift CG and the GCR, in each case with the Clover operator. |
Strange. One thing that might be the issue: When tuning there is one last call. Is that call used for the further calculation or is the Kernel called once more with the result of the tuning? |
This is definitely a bug and probably indicates an oversight in the definition of a preTune() or postTune() somewhere. Frank: If you have the patience (or the script-fu), deleting a single line at a time from tunecache.tsv should let you zero in on the problem function. |
Frank, can you also just post the tunecache file that the run created? |
Hi All, on the other hand, in a 4-GPU system with other users, perhaps the tunings come out differently from run to run. Mike mentioned that perhaps doing the reductions with the special algorithm would avoid this. This doesn't mean there is not a bug; it's just that I have seen this elsewhere. Best, Balint |
I did a 5th run (for completeness I am posting the 4th run again). (rm tunecache) We see the same pattern as before in runs 1-3. We're not settling to the same Delta H as before, but I stress that we don't necessarily have to (this depends on how QUDA tunes local reductions, local meaning within 1 MPI process). I compared the tunecache files after each run. They are identical! Thus, once generated after the 4th run, the tunecache file is not altered anymore. Notice however that Delta H does change from run 4 to run 5. I have no explanation for that. https://www.dropbox.com/s/u8tqfsipo2nbabv/tunecache_1.tsv This seems weird to me. It looks as if after the tunecache file was read in the 4th run QUDA decides to re-tune a reduction kernel, while in the 5th and 6th runs (reading the same tunecache file) it decides not to do so and goes with the cached values. This doesn't make sense to me. |
Frank, does the (rm tunecache.tsv) line in your explanation mean that the 4th run started without an existing tunecache.tsv? |
Mathias, the 4th run started with no tunecache file present. Given what I wrote, I don't see how this can be misunderstood. You correctly asserted that tuning was done in the 4th run and further concluded that this is the reason why Delta H changed in the 5th run. Please bear with me and share your line of argument, because now it's me who doesn't follow. |
I think one has to distinguish between two types of tuning: one that affects binary reproducibility and one that doesn't. Both, of course, have an impact on performance, but the latter has an impact on performance only. An example of the latter would be tuning a saxpy or Dslash operation; an example of the former would be searching for the optimal hierarchy of recursive reductions for an operation like 'norm2'. In such operations the outcome of tuning is affected by the non-associativity of floats due to rounding errors. If I look through the entries in the tunecache file and search for entries that look like reduction operations I find things like 12x24x24x16 N4quda22HeavyQuarkResidualNormI7double37double2S2_EE vol=110592,stride=110592,precision=8 96 1 1 60 If QUDA tunes reduction operations by searching for the highest-performance reduction scheme, by testing different orders of writing to shared memory, varying the number of elements per reduction step, etc., and if performance varies from run to run, then tuning will inevitably lead to unpredictable rounding errors in the result. These rounding errors will propagate in the MD and lead to differences in Delta H -- for sure. On the other hand, if QUDA determines the hierarchy based on available shared memory only (keeping the order when writing to shared memory), e.g. making maximal use of that memory in order to reduce the number of kernel calls, then tuning should have no impact on reproducibility. In that case I don't understand the differences in Delta H. |
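To illustrate the point about reduction tuning and non-associativity, here is a minimal, self-contained sketch (plain C++, not QUDA code): merely changing the block size of a two-level sum, which is exactly the kind of parameter an auto-tuner varies, typically changes the last bits of the result.

```cpp
// Toy illustration: the same numbers summed with a different two-level
// blocking can differ in the last bits, because floating-point addition
// is not associative.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Sum 'data' in contiguous blocks of 'block' elements, then sum the partial
// results, mimicking a hierarchical reduction with a tunable block size.
double blocked_sum(const std::vector<double> &data, std::size_t block)
{
  std::vector<double> partial;
  for (std::size_t i = 0; i < data.size(); i += block) {
    double s = 0.0;
    const std::size_t end = std::min(i + block, data.size());
    for (std::size_t j = i; j < end; ++j) s += data[j];
    partial.push_back(s);
  }
  double total = 0.0;
  for (double p : partial) total += p;
  return total;
}

int main()
{
  // Values of widely varying magnitude make the rounding differences visible.
  std::vector<double> data;
  for (int i = 0; i < (1 << 20); ++i) data.push_back(std::sin(i) * std::pow(10.0, i % 8));

  std::printf("block  64: %.17g\n", blocked_sum(data, 64));
  std::printf("block 256: %.17g\n", blocked_sum(data, 256)); // typically differs in the last bits
  return 0;
}
```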
Sorry, you were completely clear. I was only confused by your 'I have no explanation for that' and wanted to overcome my confusion. Anyway, to go on: what might work is to sort the kernels by type (i.e. copy, blas, reduction, dslash, other) and remove them group-wise, or do a binary search, always forcing some of the kernels to be retuned (by providing a tunecache with those kernels removed). That requires some runs, but maybe with the suggested grouping we can track it down in 3 or 4 runs. |
Regarding the types of Kernels: I completely agree with you. |
Mathias, precisely not! Tuning has finished after the 4th run. This can be drawn from the fact that the tunecache file does not change anymore. And even then, Delta H changes in the 5th run. This is contradictory to me. Again: tuning was active in the 4th run. (This is obvious as no cache file was present.) One would assume that no further tuning happens in the 5th run. The fact that the cache file remains unaltered supports this. However, Delta H changes again in the 5th run! This seems to tell us that there was some tuning in the 5th run, but that this tuning result was never written out. |
Just checked with MILC (HISQ) and although to a lesser extent than in Frank's example I see similar effects over 3 runs (first run tuned): delta S = 3.353286e-01 delta S = 3.353285e-01 delta S = 3.353285e-01 |
I took the fully settled tunecache (the one after the 4th run) as a basis. (This file is available at https://www.dropbox.com/s/u8tqfsipo2nbabv/tunecache_1.tsv). The tuning information for the individual kernels is located in this file from line 4 to line 47. Last night a script went through the file, removing one of the tuning lines at a time, and running the same trajectory twice, logging the Delta H's (a rough sketch of such a helper follows this comment). That is, in each run QUDA found a tunecache file identical to the original one except for one removed line. What we consider a bug here is when the Delta H from the 2nd run differs from the 1st run. The first number gives the line which was removed, the 2nd and 3rd numbers the Delta H's, and the 4th number the difference. 4 -0.156215933383464 -0.156215933383464 0 Thus, there are 7 kernels which do not behave themselves when they are re-tuned: 12x24x24x16 N4quda11axpyCGNorm2I7double26float26float4EE vol=110592,stride=110592,precision=4 128 1 1 102 1 1 2048 # 80.62 Gflop/s, 161.23 GB/s, tuned Mon Dec 8 11:36:39 2014 It looks to me like one of the following happens:
|
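A rough sketch of the kind of helper described above. It only generates the per-variant tune caches; the trajectory reruns would still be driven by a job script. It assumes, consistent with the linked file, that the first three lines are header/metadata, and the output file names are hypothetical.

```cpp
// Hypothetical helper: write one copy of the tune cache per kernel entry,
// each with that single entry removed, so a job script can rerun the
// trajectory against every variant.
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
  std::ifstream in("tunecache.tsv");
  std::vector<std::string> lines;
  for (std::string l; std::getline(in, l);) lines.push_back(l);

  for (std::size_t drop = 3; drop < lines.size(); ++drop) { // 0-based index 3 == line 4
    std::ostringstream name;
    name << "tunecache_without_line_" << drop + 1 << ".tsv";
    std::ofstream out(name.str());
    for (std::size_t i = 0; i < lines.size(); ++i)
      if (i != drop) out << lines[i] << '\n';
  }
  return 0;
}
```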
Thanks! Looks like as soon as any reduction is tuned in the active run it alters the result. |
I've just pushed a minor fix. I noticed that the TuneKey for the caxpbypzYmbwcDotProductUYNormY kernel listed single precision twice (prec=4), instead of being a combination of both 4 and 8, since this is a double-single precision kernel. I don't think this affects this bug, but I mention it for completeness since the tune cache will now change slightly with this latest push fixing this. At the same time, I made the backup and restoration cleaner, as it now saves the entire field using the actual allocation size (before, it used a hack to work out this size, which was put in as a workaround for Tesla compilation). Anyway, it's probably worth retesting with respect to this bug: 979b748 One other thing that shouldn't affect reproducibility but should be mentioned: the auto-tuning is switched off by default in the library, and is switched on when the inverter is first called. However, when the device interface is called, e.g., when using QDPJIT, then the prior loadGaugeQuda and loadCloverQuda interface functions also use kernels (that are not tuned by default). Thus, if one does the usual sequence of loadGaugeQuda(), loadCloverQuda(), invertQuda() more than once, the following happens:
At the end of invertQuda, if there are any changes to the tune cache, then it will be dumped to disk. Since the gauge and clover copy routines will not do auto-tuning until their second invocation, the tune cache will be updated after both invertQuda calls. Rather than changing the interface, to rectify this, quda::setTuning(QUDA_TUNE_YES) should be called prior to loadGaugeQuda(). This manually switches on the tuning, ensuring that all kernels are tuned by the time the first invertQuda() is complete. |
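A rough sketch of the suggested call order, for illustration only: parameter setup is elided, the exact signatures may differ between QUDA versions, and quda::setTuning() may live in an internal header (e.g. tune_quda.h) rather than quda.h.

```cpp
#include <quda.h>

// Sketch: switch tuning on before the device-interface load calls so that
// their copy kernels are tuned on the first invocation, not the second.
void solve_with_early_tuning(void *gauge, void *clover, void *clover_inv,
                             void *x, void *b,
                             QudaGaugeParam *gauge_param, QudaInvertParam *inv_param)
{
  quda::setTuning(QUDA_TUNE_YES);            // manually enable auto-tuning up front

  loadGaugeQuda(gauge, gauge_param);         // copy kernels now get tuned here...
  loadCloverQuda(clover, clover_inv, inv_param);

  invertQuda(x, b, inv_param);               // ...so the cache dumped here is complete
}
```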
As you already suggested, your changes didn't fix the issue. With QUDA (979b): With QUDA (979b) and turning on QUDA tuning before loadGauge/Clover: |
As this issue seems to be tricky: Do we have any idea whether this also appears on single GPU runs? |
I could read through the code, but one thing I just thought of: how is tuning handled in a multi-GPU setup? I assume only one MPI rank takes care of creating the tunecache? But do all ranks run the tuning? And if so, do they all use the same tune result? So, the scenario that I think of is tuning with 2 GPUs, where GPU0 ends up with block size 128 and GPU1 with a different block size. For the tunecache, 128 is written to disk. So, in the run without tuning we run with block size 128 on GPU0 and GPU1. Mike, I guess you can answer that without digging through the code? |
I think this is it. Good deduction. Ron, can you comment on this, since it was you who wrote this? |
I just looked through the tunelaunch function and did not see any communication. |
Having the different processes communicate to ensure the same block is used throughout is definitely something that should be done. However, there is something we have to be careful about here. When doing domain decomposition, each GPU is solving a system independently of the others. The pathological case here is when one GPU doesn't even do any local solve (e.g., when doing DD on a point source, for the first few iterations some local domains will have zero support and so never enter the solver loop). So the GPUs can be executing different kernels simultaneously, and so we cannot rely on being able to globally synchronize (which is why the global sums are switched off when doing tuning in tune.cpp, line 333). |
Very important point. Maybe we need some kind of locking and let only one GPU do the tuning? |
I'm pretty certain we (you) have nailed the issue; there's no question in my mind that this is a weakness that needs to be addressed. In terms of the DD issue, we need a solution that is asynchronous and deterministic (these two things usually don't go hand in hand!). When a kernel is tuned for the first time (globally) we need to ensure that the result is broadcast everywhere once complete, so that when the same kernel is called elsewhere for the first time we use the same value. Off the top of my head, a clear solution isn't coming to me, short of using one-sided communication. Any ideas? |
Asynchronous really makes it complicated. Nothing I like comes to my mind right now. |
Unless an easy solution presents itself, I think we can be clear that this issue isn't going to be fixed for 0.7.0, as I want to release this very soon. |
I am just testing a hack which will break down in the asynchronous case. Anyhow, if that works with MILC and also in Frank's case, we understand the issue, and since this has been around for a while (it should also be in the 0.6 release) we can go on. |
Sounds like a plan. As long as it doesn't cause a hang for DD (which is something Balint and I battled for about a month when the auto-tuner was first introduced, before root-causing it to this divergence of execution between GPUs). |
I don't want to get my hack into a release version. I assume it will cause DD to hang. |
So, I forced some communication after measuring the execution time in tune.cpp by using comm_time = elapsed_time; comm_allreduce(&comm_time); elapsed_time = float(comm_time/comm_size()); Frank just confirmed that this actually resolves the issues he found. rm tunecache Delta H = -0.147376751251613 After HMC trajectory call: time= 535.843235 secs Delta H = -0.147376751251613 After HMC trajectory call: time= 388.536862 secs But as Mike mentioned we need a communication scheme that also works for DD. I suggest we add a comment to the README and create a new issue to discuss how to implement a non-blocking communication. |
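For illustration, the workaround above in context: a hedged sketch, not the exact tune.cpp change, assuming QUDA's comm_allreduce() performs a global sum on a double and comm_size() returns the number of ranks (the header name is also an assumption). As noted in the thread, a blocking reduction like this is expected to break down when ranks tune different kernels (the DD case).

```cpp
#include <comm_quda.h>   // header name assumed for the comm_* wrappers

// Average the measured kernel time over all processes so that every rank
// sees the same number and therefore picks the same launch configuration.
double globally_averaged_time(double elapsed_time)
{
  double comm_time = elapsed_time;
  comm_allreduce(&comm_time);            // sum of the per-rank timings
  return comm_time / comm_size();        // use the average on every rank
}
```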
Mathias, can you take care of this? Thanks. |
There's a related problem that we should fix at the same time, described in this comment in tune.cpp:
//FIXME: We should really check to see if any nodes have tuned a kernel that was not also tuned on node 0, since as things ...
|
Good one. I read the comment a while ago but forgot about it. I will include it in the follow up issue. |
Added comment in README. |
Qudanauts,
I am not seeing binary reproducibility when QUDA tuning is turned on. I may be wrong, but as far as I understand one should get the same result (talking Delta H here) as long as one repeats the trajectory on the same machine partition and makes sure that every MPI rank gets the same coordinate within the machine grid across runs. (This eliminates possible differences due to non-associative floats when it comes to adding numbers across nodes.)
I tested my assumption and it seems to hold but only if QUDA has finished tuning. To clarify: I run a short trajectory with only 1 (quite) large step and have set Dslash tuning on:
1traj, 1step, QUDA (latest master, 1d31cbb), tune on:
rm tunecache
1st run: Delta H = -0.154441864483488 After HMC trajectory call: time= 534.147396 secs
2nd run: Delta H = -0.149310405935012 After HMC trajectory call: time= 386.239995 secs
3rd run: Delta H = -0.149310405935012 After HMC trajectory call: time= 386.19768 secs
rm tunecache
4th run: Delta H = -0.153873271329758 After HMC trajectory call: time= 536.969036 secs
I am repeating the trajectory here to check Chroma + QDP-JIT/NVVM + QUDA correctness. You can see that once tuning has settled after the 1st run, Delta H stays constant from the 2nd to the 3rd run. Removing the tunecache file, and thus forcing QUDA to tune again, has an impact on Delta H: the Delta H value from the 4th run seems uncorrelated to all previous ones.
Thus, it seems I get binary reproducibility only after QUDA doesn't tune anymore (since all kernels are already tuned). I believe this is the only change between e.g. 3rd and 4th run.
This was a 24^3x64 lattice on a 4 x K40m machine (1x1x1x4). I use the MPI rank number to determine the device number for QUDA. Since QMP bases the calculation of the node coordinate on the rank number too, these runs are completely comparable and should reproduce the same Delta H.
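A minimal sketch of the device assignment just described, assuming one MPI rank per GPU and four GPUs per node (as in the 1x1x1x4 runs); this is not the actual Chroma/QDP-JIT code.

```cpp
#include <mpi.h>
#include <quda.h>

// Derive the QUDA device number from the MPI rank, one rank per GPU.
void init_quda_from_rank()
{
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int gpus_per_node = 4;      // 4 x K40m in the runs above
  initQuda(rank % gpus_per_node);   // device number derived from the MPI rank
}
```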