further twisted mass clover convergence issues #474
We found similar performance issues on Jureca. The workaround we were using was to spread the GPUs across several nodes, i.e., if you ask for 4 GPUs, take four nodes, 1 GPU per node. This looks really counterintuitive, because I'd expect intra-node communication to be much faster than inter-node, but for us it works.
Can you try that, using one GPU per node, and report back? My feeling is that there is something wrong with Jureca.
Hmm, well the performance is not really the problem: when the tunecache is enabled, I get around 750 Gflop/s in single precision on one node on a 24^3x48 lattice. Performance goes up a little moving to a 32^3x64 lattice and significantly moving to a 48^3x96 lattice. Using just one GPU per node is quite inefficient though, isn't it? If I understand correctly, accounting is always for the whole node. I can try of course, but my chief concern is the failure to converge. I also see occasional lockups, but I haven't caught one yet with high-verbosity output enabled, so I don't know where and how it locks up yet.
Yes it is. Hence our woes.
By the way, setting QUDA_RESOURCE_PATH to somewhere will save you some tuning time, since the tunecache is stored there and reused between runs.
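For context, QUDA writes its tunecache (and profiling output) to the directory given by QUDA_RESOURCE_PATH, so pointing it at a persistent location avoids re-tuning on every run. A minimal sketch of setting this from the host application before QUDA is initialized; the path and the helper function are hypothetical, and the same effect is achieved with `export QUDA_RESOURCE_PATH=...` in the job script:

```c
#include <stdlib.h>   /* setenv */

/* Hypothetical helper: point QUDA at a persistent tunecache directory.
 * Must be called before initQuda() so the library picks it up. */
static void setup_quda_tuning_cache(void)
{
  /* Do not overwrite a value already exported by the job script. */
  setenv("QUDA_RESOURCE_PATH", "/work/myproject/quda_tunecache", 0);
}
```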
No, I will have to try. I find it a bit surprising that the inversions behave admirably on our local cluster but so badly on Jureca... Do I just set QUDA_RESOURCE_PATH to a writable directory?
Thanks. I just disabled it for testing purposes in these two runs to understand what's going on, since we saw in the past that there are some issues when the tuning is enabled (and it seems to retune at weird times).
Even for your success case the reliable updates seem to increase the residual by several orders of magnitude. (Maybe you can plot the residual vs. iteration count for both cases and do the same for your runs on K20.)
Here as well, when I set QUDA_TUNE_NO and do not set QUDA_RESOURCE_PATH, the inversions go through more frequently.
@kostrzewa Can you try disabling the peer-to-peer communication? I've seen one machine (a quad K80 system at JLab) where peer-to-peer communication gave the wrong answer for reasons I've yet to fully determine (I suspect a BIOS or driver issue). To do so, you need to export/setenv the corresponding environment variable.
Also, can you tell me the NVIDIA driver version that is running on Jureca? This might help us work out why there appears to be a Jureca-specific issue.
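The exact environment variable did not survive in the comment above; to the best of my knowledge the switch in QUDA is QUDA_ENABLE_P2P, but treat the name as an assumption and check the documentation of the QUDA version in use. A sketch of disabling it from the host code rather than the job script:

```c
#include <stdlib.h>   /* setenv */

/* Disable GPU peer-to-peer transfers for a debugging run.
 * The variable name QUDA_ENABLE_P2P is an assumption here; the
 * equivalent job-script line would be `export QUDA_ENABLE_P2P=0`.
 * Must run before initQuda()/initCommsGridQuda() are called. */
static void disable_quda_p2p(void)
{
  setenv("QUDA_ENABLE_P2P", "0", 1);  /* overwrite any existing value */
}
```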
More on this: it's possible that the autotuner is getting something wrong. This can happen for a couple of reasons.
To isolate between these two cases, you should set the relevant environment variables and compare the two runs. I imagine we should be able to track this issue down fairly easily. What @AlexVaq reports above about dynamical clover inversions behaving better than static clover inversions suggests to me that the problem is likely not peer-to-peer related, and more likely a hidden bug in the twisted-clover save/restore or a tuning degeneracy.
Just echoing what @mathiaswagner reported. It would be good to try increasing the … Also, can you post the tunecache.tsv, profile.tsv and profile_async.tsv files that are generated when QUDA_RESOURCE_PATH is set?
Also, a general comment on performance: it looks like you're using double-single solvers here with no reconstruction. You will gain more performance if you switch to double-half as well as using reconstruct for the sloppy operator (12 or 18). Of course let's get a handle on the solver stability first though.
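For reference, the precision and reconstruction of the sloppy (inner) operator are controlled through fields of QudaGaugeParam and QudaInvertParam. A minimal sketch of the double-half, reconstruct-12 setup suggested here, using the field names as I recall them from the QUDA interface header (only the relevant fields are shown, everything else must still be filled in):

```c
#include <quda.h>

/* Sketch: double outer precision, half sloppy precision, and 12-real
 * reconstruction for the sloppy gauge field. */
void set_mixed_precision_params(QudaGaugeParam *gauge_param,
                                QudaInvertParam *inv_param)
{
  gauge_param->cuda_prec          = QUDA_DOUBLE_PRECISION;
  gauge_param->cuda_prec_sloppy   = QUDA_HALF_PRECISION;
  gauge_param->reconstruct        = QUDA_RECONSTRUCT_NO;   /* 18: no compression  */
  gauge_param->reconstruct_sloppy = QUDA_RECONSTRUCT_12;   /* 12-real compression */

  inv_param->cuda_prec            = QUDA_DOUBLE_PRECISION;
  inv_param->cuda_prec_sloppy     = QUDA_HALF_PRECISION;
  inv_param->clover_cuda_prec        = QUDA_DOUBLE_PRECISION;
  inv_param->clover_cuda_prec_sloppy = QUDA_HALF_PRECISION;
}
```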
To begin answering your questions, here's the output of …
These inversions are done with twisted boundary conditions (in all directions) which we enforce by pre-multiplying the gauge fields with the respective phases before passing them to QUDA. As far as I understand, reconstruction cannot be used in this case. Also in our more usual case, where we use twisted boundary conditions in time only to produce anti-periodic quark field boundary conditions in time, it is not clear to me if having "-1" boundary conditions in the valence sector and twisted boundary conditions in the sea might have ill effects. For this reason we presently do all our calculations without reconstruction.
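As a generic illustration of the phase pre-multiplication described above (a sketch under one common convention, not tmLQCD's actual routine; the data layout is hypothetical): each link in direction mu is multiplied by a constant U(1) factor exp(i*pi*theta[mu]/L[mu]), so that the fermion field effectively satisfies psi(x + L_mu e_mu) = exp(i*pi*theta[mu]) psi(x); theta = 1 in the time direction reproduces antiperiodic temporal boundaries.

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>

/* Sketch only: multiply every link in direction mu by the constant
 * phase exp(i*pi*theta[mu]/L[mu]), spreading the boundary phase
 * uniformly over the lattice extent. links[mu] is a hypothetical
 * layout: volume sites times 9 complex doubles per 3x3 colour matrix. */
void apply_twisted_phases(double _Complex *links[4],
                          const double theta[4], const int L[4],
                          size_t volume)
{
  const double pi = acos(-1.0);
  for (int mu = 0; mu < 4; mu++) {
    const double _Complex phase = cexp(I * pi * theta[mu] / L[mu]);
    for (size_t i = 0; i < volume * 9; i++)
      links[mu][i] *= phase;
  }
}
```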
I had a look at the code and noticed that twisted-clover uses the general framework for tuning dslash operators, and the preTune and postTune functions are as defined in dslash_quda.cuh.
Although this could still mean that there is something particularly wrong with twisted-clover and the tuning, it would be very strange indeed.
This could be something to consider as a future feature. I suppose you can always reconstruct the standard link and add the phase at the end, without a big penalty in performance, but this has not been implemented yet.
@kostrzewa: There is probably a workaround for that. We also do reconstruction for staggered, where we need to take into account the staggered phases and even an arbitrary phase due to the U(3) symmetry of the long links. Anyhow, I am most interested in the convergence behavior in your K20 runs (see my comment above). Do you have something you can share?
I just did what you suggested. In the first invocation, tuning was done for the first inversion (up quark, which Mario sets to …). In the second invocation, using the tuning results of the first invocation, both inversions go through fine and the reliable update does not trigger an increase in the residual. I'm using dynamic clover and it seems that it doesn't really make a difference. I attach the two logfiles, together with the tuning results from the first invocation (I'm afraid these also include results for the L=32 lattice size, I forgot to reset).
I need to set this up but will probably do so today. The jobs that I referred to as not being problematic on K20 use the same configurations and operator and the same number of devices (4), but different sources and twisted boundary conditions in time only. I should mention though that the "failing" jobs we are discussing here also have twisted boundary conditions in time only. I will only attempt the jobs for non-zero momenta once this is resolved.
That's excellent news and would be much appreciated.
OK, the fact that the solve goes through fine the second time, after tuning has been done, is indicative of something being wrong with the autotuning. One would think that the error occurs when the reliable update is done, but I see that only BLAS kernels are tuned then on the second run and not dslash kernels. Looking into this...
OK, I'm backtracking from my previous statement. The fact that it works after tuning but not during isn't necessarily indicative of the tuning getting something wrong, since between the tuning run and the post-tuning run we don't have exact reproducibility when running on multiple GPUs (#182, #199). What also confuses the issue is that with tuning switched off, the reliable update saw a large jump in the residual norm (although not large enough to trigger a failure). Still thinking about this!
@kostrzewa After some late-night insight, I have made a first attempt at fixing the reproducibility issue (#199) when running with tuning enabled on multiple GPUs. E.g.,
before: CG: 2916 iterations, <r,r> = 8.596637031253607e-08, |r|/|b| = 9.336447639050856e-08 (tuning run)
after: CG: 2916 iterations, <r,r> = 8.592158592432650e-08, |r|/|b| = 9.334015399767549e-08 (tuning run)
(At present the solution I've employed doesn't give reproducibility with domain-decomposition solvers, but that's not really a concern here.) Can you run your code using the feature/multi-gpu-reproducible branch? This will help us diagnose the problem you are seeing, where the answer converges after tuning but not during tuning. With my latest changes, I can definitively state that this should not happen unless there is an underlying bug.
I also find this problem weird. A quick inspection of the code doesn't reveal anything, and we never had serious problems with this in the past.
Could it be some very tricky race condition? For example, maybe I didn't call cudaDeviceSynchronize() after some kernels. In the past there were bugs of that kind in the covariant derivative and the contraction code that appeared very rarely and randomly, and since I wrote half of the twisted-clover code, I might have made a similar mistake.
@AlexVaq There should, by design, be no …
@kostrzewa I don't think you tested with peer-to-peer disabled yet. Can you try this as well to see how it affects things? I don't think it can cause anything to go wrong, but one thing I realised while implementing the reproducible multi-GPU tuning just now is that the policy tuning* could end up with a different result on each GPU. This is fixed in my new branch, since after any given tuning takes place, the tune cache is broadcast from process 0 to ensure all processes use the same policy. Not that it should affect the computation.
*whether to do dslash halos as a single kernel for all dimensions at the end or as separate kernels for each dimension as communications finish
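To make the policy-broadcast idea concrete, here is a minimal sketch of the general technique, not QUDA's actual implementation, under the assumption that the tune cache can be serialized to a byte buffer: once rank 0 has finished tuning, the serialized cache is broadcast so every process ends up with the same launch parameters and dslash policy.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helpers assumed to exist for this sketch. */
char *serialize_tunecache(int *len);            /* returns malloc'd buffer */
void  deserialize_tunecache(const char *buf, int len);

/* Broadcast the serialized tune cache from rank 0 so that all ranks
 * use identical kernel launch parameters / dslash policies. */
void broadcast_tunecache(MPI_Comm comm)
{
  int rank, len = 0;
  char *buf = NULL;

  MPI_Comm_rank(comm, &rank);
  if (rank == 0) buf = serialize_tunecache(&len);

  MPI_Bcast(&len, 1, MPI_INT, 0, comm);          /* share the buffer size */
  if (rank != 0) buf = malloc(len);
  MPI_Bcast(buf, len, MPI_CHAR, 0, comm);        /* share the cache itself */

  if (rank != 0) deserialize_tunecache(buf, len);
  free(buf);
}
```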
I've tested this now (dynamic clover + peer-to-peer disabled): two_invocation_dynamic_clover_nop2p.zip. So unless something changed on Jureca, which I cannot guarantee, it seems that this was it... I will proceed by disabling dynamic_clover and trying again, just to remove one possible variable.
Disabling dynamic clover does not seem to make a difference, but I did experience a lockup during a call of …
@kostrzewa Thanks for this data. I'll take another look at the peer-to-peer code, though all my tests have shown it to be robust. I haven't tested it with twisted clover, though that shouldn't make a difference. Something to look at on my forthcoming 12-hour flight 😄 If you have time, I would still like you to run a test with my new branch, as this will help with further diagnosis.
@AlexVaq We really need to make the clover and twisted-clover testing more robust in the QUDA unit tests. How about we simply compute the clover term(s) in QUDA and copy this back for the CPU dslash?
Jureca runs Slurm and some custom thread/process pinning (AFAIK); could this have an effect which is not taken into account? As I mentioned, NUMA affinity does not work on Jureca either.
Will do so right now.
I'll knock one up. On holiday now for the next 10 days, but this might be something I do at some point.
I'm seriously suspecting hardware/driver/software issues on Jureca. I keep getting hard lockups in various places, independent of the various possible combinations of branch/compile/runtime options that we have discussed.
Yes, that's the next step. I might ask the JSC people to run the code on all 68 nodes to see if there's a systematic problem... So far, I've experienced lockups on three different nodes.
@maddyscientist Despite my using the reproducible branch, it seems that the inversion histories are not, in fact, reproducible. Am I doing something wrong or missing some kind of run-time parameter?
Not at all? If you tune multiple times the results may differ. If you tune in run 1, the next runs (run 2 and later) should give the same result. Run 1 may differ. With the reproducible branch, run 1 should no longer differ, unless there is a bug in that branch.
I don't think you're doing anything wrong: the fact that it isn't reproducible is good data, and is likely an indicator that the tuning per se is not to blame for these issues and there is either a bug or machine issues.
Some more information: thanks to input from JSC, I've now been able to compile the entire software stack using gcc 4.9.3 only (thus not mixing icc and gcc). In this setup, I still experienced the residual increase and aborted solve. I think we can thus exclude the compiler as the culprit.
As far as the lockups are concerned, I may have identified that the most frequent issue is our call of … to construct the clover field on the device. So far, 90% or so of all lockups happen here.
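The function name did not survive in the comment above; the call in question is presumably the clover-construction entry point of the QUDA interface. A heavily hedged sketch of how an interface can ask QUDA to build the clover field on the device (whether the NULL-pointer convention shown here is exactly what tmLQCD uses, and whether it applies to the QUDA version in question, are assumptions):

```c
#include <quda.h>
#include <stddef.h>

/* Sketch: ask QUDA to construct the clover field (and its inverse) on
 * the device from the previously loaded gauge field, instead of
 * uploading host-side clover data. inv_param must already be fully
 * populated (clover precisions, kappa, clover coefficient, ...). */
void build_clover_on_device(QudaInvertParam *inv_param)
{
  loadCloverQuda(NULL, NULL, inv_param);
}
```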
Do the lockups occur if tuning is disabled?
Yes, they also occur when tuning is disabled (QUDA_TUNE_NO is set and QUDA_RESOURCE_PATH is not defined). I was further able to set up the computation on the K20 cluster and I experience issues here too when tuning is enabled: no lockups, but NaNs. This was with v0.7.2, however, so it's not a fair test. I will have time to look back into this after the 23rd of June or so.
Can you give me the full K20 machine specifications: CUDA driver, toolkit version and compiler version? This hang is very disturbing, since I've never seen it and cannot reproduce it.
@AlexVaq: I have almost finished the host clover kernel that will allow for real testing of the Wilson clover action. From this I imagine you can trivially extend it to twisted clover.
@kostrzewa Can you confirm that …
I will only be able to check this directly after the 23rd or so. If …
@kostrzewa That's ok, your use of …
Sorry, I seem to have spoken too soon, I just experienced two more lockups...
Also, in contrast to what I saw before, the final true residual is now apparently wrong.
In the last line of the solver output, the true <r,r> is computed using the tmLQCD operator. In this run, P2P was enabled. I also experience lockups in different places now. With P2P disabled, the code locks up during operator creation and, in a different run, at the fifth CG iteration. Would it be useful for you to try the exact same code on one of your machines with one of our gauge configurations? Cheers
So, it gives the wrong result? That's pretty serious. I might be able to try …
@AlexVaq I've sent you the paths by e-mail, let me know if there's a problem accessing them.
Thanks, I can access the configurations. I’ll have a look at it as soon as I can. Let’s see if I can find some spot later today...
I'm also happy to reproduce the issue locally if you give me instructions on how to build the code. Also, can you check whether the QUDA-internal dslash_test and invert_test are working on Jureca? (With …)
Looks like the lack of clover convergence was a stupid bug I created whilst adding my reference clover dslash changes (I introduced a bug in the clover inversion). Fixed in #483; @AlexVaq I've assigned this to you to merge and close (@mathiaswagner is on holiday).
@kostrzewa Hopefully your convergence issues should now be taken care of. The only thing left that is worrisome is the lockups, which I've never reproduced. I think the best thing there is for me to match your workflow on my workstation here, so I can try and reproduce this.
@kostrzewa Looking at the comments above, it looks like the lockups have only been seen on Jureca, is that correct?
Yes, that's correct.
Do you know what version of Linux is running on Jureca? We've seen some codes hang with multi-GPU running on RHEL/CentOS 6.6 owing to a bug in the kernel (https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64).
Another thing I would suggest: if you can run interactively on Jureca, you could run the code until it hangs, then attach gdb to the hung process and get a stack trace to see where it's hung.
@kostrzewa Can you test this workflow on the feature/memory-pool branch? I've just noticed that the initial autotuner state was not initialized. This could lead to some processes autotuning and others not tuning, with the state not being set until the first call to the linear solver. While I don't think this would cause a hang, since the tuning is a purely local process, it would be good to rule this out as the source of the problem. I've also simplified the process of enabling the autotuner: tuning is now enabled by default, unless the environment variable … is set to disable it.
We resolved this a while ago.
I would like to request some help, if possible, with pinpointing the origin of convergence issues that I see using twisted mass clover on the Jureca K80 nodes at Juelich Supercomputing Centre. I don't see any of these issues on our local K20 GPU nodes in Bonn. I've compiled 97672d3 of the develop branch and use QUDA through the latest commit of the master branch of etmc/tmLQCD.
I invert a stochastic time-slice source on a twisted mass clover lattice and this goes through without problems (although the residual goes up for a number of iterations). This propagator is then used as a source for a sequential inversion.
I've attached two log files, one of which shows an aborted inversion of the sequential propagator while the other one is successful (on the same configuration and using the same parameters). Completely disabling tuning seems to increase the probability of success, but leads to very suboptimal performance (that is, when I don't provide a QUDA_RESOURCE_PATH and set QUDA_TUNE_NO, the inversions go through more frequently).
fail.log.zip
success.log.zip
One point of potential importance is that NUMA affinity does not work on Jureca, so I currently compile QUDA with this functionality disabled. In addition, JSC deploys CUDA with the Intel compiler only, which might be relevant. The specific versions are given as:
CUDA/7.5.18
MVAPICH2/2.2b-GDR
Intel/2015.3.187-GCC-4.9.3-2.25
@maddyscientist I haven't gotten around to reorganising the tmLQCD interface for QUDA to not reuse the parameter struct. This is irrelevant here, however, because the executable is called multiple times in each job and the parameter struct is thus always "clean".