further twisted mass clover convergence issues #474

Closed
kostrzewa opened this issue May 29, 2016 · 59 comments

@kostrzewa (Member) commented May 29, 2016

I would like to request some help, if possible, with pinpointing the origin of convergence issues that I see using twisted mass clover on the Jureca K80 nodes at Juelich Supercomputing Center. I don't see any of these issues on our local K20 GPU nodes in Bonn. I've compiled 97672d3 of the develop branch and use QUDA through the latest commit of the master branch of etmc/tmLQCD.

I invert a stochastic time-slice source on a twisted mass clover lattice and this goes through without problems (although the residual goes up for a number of iterations). This propagator is then used as a source for a sequential inversion.

I've attached two log files, one of which shows an aborted inversion of the sequential propagator while the other one is successful (on the same configuration and using the same parameters). Completely disabling tuning seems to increase the probability of success, but leads to very suboptimal performance (that is, when I don't provide a QUDA_RESOURCE_PATH and set QUDA_TUNE_NO, the inversions go through more frequently).

fail.log.zip
success.log.zip

One point of potential importance might be the fact that the NUMA affinity does not work on Jureca, so I currently compile QUDA with this functionality disabled. In addition, JSC deploys CUDA with the Intel compiler only, which might be relevant. The specific versions are given as:

CUDA/7.5.18
MVAPICH2/2.2b-GDR
Intel/2015.3.187-GCC-4.9.3-2.25

@maddyscientist I haven't gotten around to reorganising the tmLQCD interface for QUDA to not reuse the parameter struct. This is irrelevant here, however, because the executable is called multiple times in each job and the parameter struct is thus always "clean".

@AlexVaq (Member) commented May 29, 2016 via email

@kostrzewa (Member Author)

Hmm, well the performance is not really the problem, when the tunecache is enabled, I get around 750 Gflop/s in single precision on one node on a 24^3x48 lattice. Performance goes up a little moving to a 32^3x64 lattice and significantly moving to a 48^3x96 lattice. Using just one gpu per node is quite inefficient though, isn't it? If I understand correctly, accounting is always for the whole node.

I can try of course, but my chief concern is the failure to converge. I also see occasional lockups, but I haven't caught one yet with high verbosity output enabled, so I don't know where and how it locks up yet.

@AlexVaq (Member) commented May 30, 2016

Using just one gpu per node is quite inefficient though, isn't it? If I understand correctly, accounting is always for the whole node.

Yes it is. Hence our woes.

I can try of course, but my chief concern is the failure to converge. I also see occasional lockups, but I haven't caught one yet with high verbosity output enabled, so I don't know where and how it locks up yet.

Yes, I see that there is something wrong when doing the reliable update, because the residual increases by two orders of magnitude (in the success scenario; in the failure scenario, it simply exits). Are you using dynamical clover inversion? I've found it to be much more stable, and it inverts the clover term in a different way. I don't yet understand why it should work better (actually, I would expect it to work worse because it will invert the sloppy clover, not the full-precision one), but usually it does.

By the way, setting QUDA_RESOURCE_PATH to somewhere will save you some computer time by caching the tuning.

@kostrzewa (Member Author)

Are you using dynamical clover inversion?

No, I will have to try. I find it a bit surprising that the inversions behave admirably on our local cluster but so badly on Jureca... Do I just set DYNAMIC_CLOVER at compile-time (in make.inc) or do I need to also set some run-time parameter?

By the way, setting QUDA_RESOURCE_PATH to somewhere will save you some computer time by caching the tuning.

Thanks. I just disabled it for testing purposes in these two runs to understand what's going on since we saw in the past that there are some issues when the tuning is enabled (and it seems to retune at weird times). Here as well, when I set QUDA_TUNE_NO and do not set QUDA_RESOURCE_PATH, the inversions almost never fail, but they are more than a factor of 2 slower.

@mathiaswagner (Member)

Even for your success case the reliable updates seem to increase the residual by several orders of magnitude. (Maybe you can plot the residual vs iteration count for both cases and do the same for your runs on K20).
There is a parameter 'max_res_increase' which controls how often the residual is allowed to increase after a reliable update. Maybe try increasing that.
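
For concreteness, a minimal sketch of what I mean, assuming the parameter is set through the QudaInvertParam struct that the tmLQCD interface already fills (the value below is only illustrative):

    /* in the existing tmLQCD solver setup (sketch, illustrative value only) */
    QudaInvertParam inv_param = newQudaInvertParam();
    /* ... the usual twisted-clover solver parameters ... */

    /* allow a few more residual-norm increases after a reliable update
       before CG gives up */
    inv_param.max_res_increase = 3;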

@AlexVaq (Member) commented May 30, 2016

Do I just set DYNAMIC_CLOVER at compile-time (in make.inc) or do I need to also set some run-time parameter?

You just set DYNAMIC_CLOVER and it should work. You'll lose performance, but gain memory, so the best you can try is to use as few GPUs as possible to improve the strong scaling.

Here as well, when I set QUDA_TUNE_NO and do not set QUDA_RESOURCE_PATH, the inversions almost never fail, but they are more than a factor of 2 slower.

Mmm... this makes me guess there must be a problem with the tuning, so either the pretune result is not properly saved/restored, or the tuning framework is failing for some reason. The first is most likely; let me have a look at the twisted-clover files.

@maddyscientist (Member) commented Jun 1, 2016

@kostrzewa Can you try disabling the peer-to-peer communication? I've seen one machine (a quad K80 system at JLab) where peer-to-peer communication gave the wrong answer for reasons I've yet to fully determine (I suspect a BIOS or driver issue). To do so, export (or setenv) QUDA_ENABLE_P2P=0.
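
If it is easier than touching the job script, a sketch of doing the same from the host code, assuming the environment only needs to be set before QUDA is initialized:

    #include <stdlib.h>
    #include <quda.h>

    /* sketch: disable peer-to-peer copies between the GPUs of a K80 board
       before QUDA initializes (equivalent to exporting it in the job script) */
    void init_quda_without_p2p(int device)
    {
      setenv("QUDA_ENABLE_P2P", "0", 1);  /* must precede initQuda() */
      initQuda(device);
    }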

Also, can you tell me the NVIDIA driver version that is running on Jureca? This might help us work out why there appears to be a Jureca-specific issue.

@maddyscientist (Member)

More on this... it's possible that the autotuner is getting something wrong. This can happen for a couple of reasons:

  1. When you have two instances of the same kernel launched with slightly different parameters, but the autotuner isn't breaking the degeneracy, leading to reuse of launch parameters.
  2. If a kernel that is being autotuned destroys its input data when called but doesn't save/restore its input state, meaning that an incorrect answer is computed.

To distinguish between these two cases, you should set QUDA_RESOURCE_PATH, run the failing solver case, then rerun it again exactly. If it works correctly on the second invocation, then this is indicative of 2.

I imagine we should be able to track this issue down fairly easily. What @AlexVaq reports above about dynamical clover inversions behaving better than static clover inversions suggests to me that the problem is likely not peer-to-peer related, and more likely a hidden bug in the twisted-clover save/restore or tuning degeneracy.

@maddyscientist (Member)

Just echoing what @mathiaswagner reported: it would be good to try increasing the QudaInvertParam::max_res_increase parameter. The residual history isn't that different between the good run and the bad run until the bad run breaks down. This could simply have been bad luck: there is no exact reproducibility between tuned and untuned runs since the order of the reduction changes, so the difference seen here may just be due to the change in summation order. If you plot the residuals, the failing run and the successful run are extremely similar, which suggests this is the case: the successful run also sees a large residual increase when the reliable update is performed, but its increase falls just short of triggering the failure (hitting the maximum number of residual increases).

Also, can you post the tunecache.tsv, profile.tsv and profile_async.tsv files that are generated when QUDA_RESOURCE_PATH is set?

@maddyscientist (Member)

Also, a general comment on performance: it looks like you're using double-single solvers here with no reconstruction. You will gain more performance if you switch to double-half as well as using reconstruct for the sloppy operator (12 or 18). Of course let's get a handle on the solver stability first though.
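
For reference, a sketch of the parameter changes this corresponds to (QUDA interface field and enum names; the values are just the suggestion above, not a tested configuration):

    /* sketch: assumes inv_param / gauge_param are the structs already passed
       to QUDA by the interface */
    inv_param.cuda_prec_sloppy        = QUDA_HALF_PRECISION;
    inv_param.clover_cuda_prec_sloppy = QUDA_HALF_PRECISION;

    /* 12-real compression of the sloppy gauge field
       (QUDA_RECONSTRUCT_NO keeps all 18 reals) */
    gauge_param.reconstruct_sloppy    = QUDA_RECONSTRUCT_12;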

@kostrzewa (Member Author)

To begin answering your questions, here's the output of nvidia-smi -q for the first K80 device on one of the GPU compute nodes.
jureca_nvidia_smi_q.txt

@kostrzewa (Member Author) commented Jun 1, 2016

Also, a general comment on performance: it looks like you're using double-single solvers here with no reconstruction. You will gain more performance if you switch to double-half as well as using reconstruct for the sloppy operator (12 or 18).

These inversions are done with twisted boundary conditions (in all directions) which we enforce by pre-multiplying the gauge fields with the respective phases before passing them to QUDA. As far as I understand, reconstruction cannot be used in this case.

Also in our more usual case, where we use twisted boundary conditions in time only to produce anti-periodic quark field boundary conditions in time, it is not clear to me if having "-1" boundary conditions in the valence sector and twisted boundary conditions in the sea might have ill effects. For this reason we presently do all our calculations without reconstruction.
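
For concreteness, a sketch of the kind of host-side pre-multiplication I mean (the array layout and the convention of spreading the phase exp(i*pi*theta_mu/L_mu) uniformly over all links in direction mu are illustrative assumptions, not our actual tmLQCD code):

    #include <complex.h>
    #include <math.h>
    #include <stddef.h>

    /* multiply every 3x3 link matrix in direction mu by the twisted
       boundary phase before handing the gauge field to QUDA */
    void apply_boundary_phases(double complex *gauge[4], const double theta[4],
                               const int L[4], size_t volume)
    {
      for (int mu = 0; mu < 4; mu++) {
        const double complex phase = cexp(I * M_PI * theta[mu] / L[mu]);
        for (size_t i = 0; i < 9 * volume; i++)
          gauge[mu][i] *= phase;   /* phase multiplies all colour entries */
      }
    }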

@AlexVaq (Member) commented Jun 1, 2016 via email

@AlexVaq (Member) commented Jun 1, 2016

These inversions are done with twisted boundary conditions (in all directions) which we enforce by pre-multiplying the gauge fields with the respective phases before passing them to QUDA. As far as I understand, reconstruction cannot be used in this case.

This can be something to consider as a future feature. I suppose you can always reconstruct the standard link and add the phase at the end, without a big penalty in performance, but this has not been implemented yet.

@mathiaswagner (Member)

@kostrzewa: There is probably a workaround for that. We also do reconstruction for staggered, where we need to take into account the staggered phases and even an arbitrary phase due to the U(3) symmetry of the long links. Anyhow, I am most interested in the convergence behavior in your K20 runs (see my comment above). Do you have something you can share?

@kostrzewa (Member Author) commented Jun 1, 2016

To isolate between these two two cases, you should set QUDA_RESOURCE_PATH, run the failing solver case, then rerun it again exactly. If it works correctly on second invocation, then this is indicative of 2.

I just did what you suggested. In the first invocation, tuning was done for the first inversion (up quark, which Mario sets to TwistFlavour = -1) and the solver converges normally. For the second inversion (down quark -> TwistFlavour = 1), tuning is done again and I observe the residual increase as before. Then the inversion is aborted. (I have not yet tweaked the suggested parameter).

In the second invocation, using the tuning results of the first invocation, both inversions go through fine and the reliable update does not trigger an increase in the residual.

I'm using dynamic clover and it seems that it doesn't really make a difference.

I attach the two logfiles:
two_invocations.log.zip

And the tuning results from the first invocation (I'm afraid this also includes results for the L=32 lattice size, I forgot to reset...):
quda_resource_path.zip

@mathiaswagner

Anyhow, I am most interested in the convergence behavior in your K20 runs. (see my comment above). Do you have something you can share?

I need to set this up but will probably do so today. The jobs that I referred to as not being problematic on K20 use the same configurations and operator and the same number of devices (4), but different sources and twisted boundary conditions in time only. I should mention though that the "failing" jobs we are discussing here also have twisted boundary conditions in time only. I will only attempt the jobs for non-zero momenta once this is resolved.

There is probably a workaround for that. We also do reconstruction for staggered, where we need to take into account the staggered phases and even an arbitrary phase due to the U(3) symmetry of the long links.

That's excellent news and would be much appreciated.

@maddyscientist (Member)

Ok the fact that the solve goes through fine the second time after tuning has been done is indicative of something being wrong with the autotuning. One would think that the error occurs when the reliable update is done, but I see only blas kernels are tuned then on the second run and not dslash kernels. Looking into this...

@maddyscientist (Member)

Ok, I'm backtracking from my previous statement. The fact that it works after tuning but not during isn't necessarily indicative of the tuning getting something wrong, since between the tuning run and the post-tuning run we don't have exact reproducibility when running on multiple GPUs (#182, #199). What also confuses the issue is that with tuning switched off, the reliable update saw a large jump in the residual norm (although not large enough to trigger a failure).

Still thinking about this!

@maddyscientist (Member) commented Jun 2, 2016

@kostrzewa After some late night insight, I have made a first attempt at fixing the reproducibility issue (#199) when running with tuning enabled on multi GPUs. E.g.,

before

CG: 2916 iterations, <r,r> = 8.596637031253607e-08, |r|/|b| = 9.336447639050856e-08 (tuning run)
CG: 2916 iterations, <r,r> = 8.591627185004837e-08, |r|/|b| = 9.333726750427703e-08
CG: 2916 iterations, <r,r> = 8.591627185004837e-08, |r|/|b| = 9.333726750427703e-08

after

CG: 2916 iterations, <r,r> = 8.592158592432650e-08, |r|/|b| = 9.334015399767549e-08 (tuning run)
CG: 2916 iterations, <r,r> = 8.592158592432650e-08, |r|/|b| = 9.334015399767549e-08
CG: 2916 iterations, <r,r> = 8.592158592432650e-08, |r|/|b| = 9.334015399767549e-08

(At present this solution I've employed doesn't give reproducibility with domain-decomposition solvers, but that's not really a concern here.)

Can you run your code using the feature/multi-gpu-reproducible branch? This will help us diagnose the problem you are seeing, where the answer converges after tuning but not during tuning. With my latest changes, I can definitively state that this should not happen unless there is an underlying bug.

@AlexVaq (Member) commented Jun 2, 2016 via email

@mathiaswagner (Member)

Setting CUDA_LAUNCH_BLOCKING=1 might help to diagnose race conditions.

@maddyscientist (Member)

@AlexVaq There should, by design, be no cudaDeviceSynchronize calls in the dslash codes, it should all be event based. Since this code is shared by all dslash kernels, I severely doubt that you are to blame for this 😄

@kostrzewa I don't think you tested with peer-to-peer disabled yet. Can you try this as well to see how it affects things? (QUDA_ENABLE_P2P=0). It's possible that I did something funny with this (more likely than @AlexVaq doing something weird I think 😉 ).

I don't think it can cause anything to go wrong, but one thing I realised while implementing the reproducible multi-GPU tuning just now is that the policy tuning* could end up with a different result on each GPU. This is fixed in my new branch: after any given tuning takes place, the tune cache is broadcast from process 0 to ensure all processes use the same policy. Not that it should affect the computation.

*whether to do dslash halos as a single kernel for all dimensions at the end or as separate kernels for each dimension as communications finish

@kostrzewa (Member Author) commented Jun 2, 2016

@maddyscientist

I don't think you tested with peer-to-peer disabled yet. Can you try this as well to see how it affects things? (QUDA_ENABLE_P2P=0). It's possible that I did something funny with this (more likely than @AlexVaq doing something weird I think 😉 ).

I've tested this now (dynamic_clover + QUDA_ENABLE_P2P=0) and this seems to work fine in both runs. I attach the logs and the tuning results.

two_invocation_dynamic_clover_nop2p.zip

So unless something changed on Jureca, which I cannot guarantee, it seems that this was it...

I will proceed by disabling dynamic_clover and trying again, just to remove one possible variable.

@kostrzewa (Member Author)

Disabling dynamic clover does not seem to make a difference, but I did experience a lockup during a call of loadCloverQuda (nop2p.loadCloverQuda.lockup.log in the attachment). The third call (confusingly called second_invocation_nop2p.log) worked for both propagators without residual increases.

two_invocation_nop2p.zip

@maddyscientist (Member)

@kostrzewa thanks for this data. I'll take another look at the peer-to-peer code, though all my tests have shown it to be robust. I haven't tested it with twisted clover (though that shouldn't make a difference). Something to look at on my forthcoming 12-hour flight 😄 If you have time, I would still like you to run a test with my new branch, as this will help with further diagnosis.

@AlexVaq we really need to make clover and twisted-clover testing more robust in the QUDA unit tests. How about we simply compute the clover term(s) in QUDA and copy this back for the CPU dslash?

@AlexVaq (Member) commented Jun 2, 2016

@AlexVaq we really need to make clover and twisted-clover testing more robust in the QUDA unit tests. How about we simply compute the clover term(s) in QUDA and copy this back for the CPU dslash?

That's the way to go, but do we have a CPU dslash for clover?

@kostrzewa (Member Author)

@maddyscientist

I'll take another look at the peer to peer code, though all my tests have shown it to be robust. I haven't tested it with twisted clover though (shouldn't make a difference though).

Jureca runs Slurm and some custom thread/process pinning (AFAIK); could this have an effect that is not taken into account? As I mentioned, NUMA affinity does not work on Jureca either.

If you have time, I would still like you to run a test with my new branch as this will with further diagnosis.

will do so right now

@maddyscientist (Member)

I'll knock one up. On holiday now for the next 10 days but this might be something I do at some point.

@mathiaswagner (Member)

QUDA's basic NUMA affinity is known to be pretty much broken, see #223. It's a long-standing issue, but we probably need a better approach anyway (#473).

@kostrzewa (Member Author) commented Jun 2, 2016

I'm seriously suspecting hardware/driver/software issues on Jureca. I keep getting hard lockups in various places, independent of the various possible combinations of branch/compile/runtime options that we have discussed. (DYNAMIC_CLOVER = yes, "reproducible", QUDA_ENABLE_P2P=0)

@kostrzewa (Member Author) commented Jun 2, 2016

@mathiaswagner

Do you know which nodes your job is running on? Might be a faulty node that can be isolated. Slurm should allow you to exclude these nodes at submission.

Yes, that's the next step. I might ask the JSC people to run the code on all 68 nodes to see if there's a systematic problem... So far, I've experienced lockups on three different nodes.

@kostrzewa (Member Author)

@maddyscientist Despite my using the reproducible branch, it seems that the inversion histories are not, in fact, reproducible. Am I doing something wrong or missing some kind of run-time parameter?

@mathiaswagner (Member)

Not at all.

If you tune multiple times, the results may differ.

If you tune in run 1, the next runs (run 2 and later) should give the same result. Run 1 may differ. With the reproducible branch, run 1 should no longer differ, unless there is a bug in that branch.


@maddyscientist (Member)

I don't think you're doing anything wrong: the fact that it isn't reproducible is good data, and is likely an indicator that the tuning per se is not to blame for these issues and that there is either a bug or a machine issue.

@kostrzewa (Member Author) commented Jun 2, 2016

@mathiaswagner
If you look into the first archive, the files first_invocation.log and second_invocation.log are two executions of the same code with the same source (as you can infer from the source normalisation before and after rescaling). The first execution does the tuning and the second one re-uses the tunecache. Not even the total number of iterations is the same. The two jobs even executed on the same node, which was lucky, I guess.

@kostrzewa (Member Author)

Some more information: thanks to information from JSC, I've now been able to compile the entire software stack using gcc 4.9.3 only (thus not mixing icc and gcc). In this setup, I still experienced the residual increase and aborted solve. I think we can thus exclude the compiler as the culprit.

@kostrzewa (Member Author)

As far as the lockups are concerned, I may have identified that the most frequent issue is our call of

 loadCloverQuda(NULL, NULL, &inv_param);

to construct the clover field on the device. So far, 90% or so of all lockups happen here.
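
For reference, the call sits in a sequence roughly like the following (a sketch of our interface code, not a verbatim excerpt; loadGaugeQuda() and loadCloverQuda() are the actual QUDA interface functions):

    /* the gauge field is made resident on the device first */
    loadGaugeQuda((void *)gauge_field, &gauge_param);

    /* NULL, NULL: QUDA constructs the clover term and its inverse on the
       device from the resident gauge field -- this is where most of the
       lockups occur */
    loadCloverQuda(NULL, NULL, &inv_param);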

@maddyscientist (Member)

Do the lockups occur if tuning is disabled?

@kostrzewa (Member Author)

Yes, they also occur when tuning is disabled. (QUDA_TUNE_NO is set and QUDA_RESOURCE_PATH is not defined).

I was further able to set up the computation on the K20 cluster and I experience issues here too when tuning is enabled. No lockups, but NaNs. This was with v0.7.2, however, so it's not a fair test. I will have time to look back into this after the 23rd of June or so.

@maddyscientist (Member)

Can you give me the full K20 machine specifications: CUDA driver, toolkit version and compiler version? This hang is very disturbing since I've never seen it and cannot reproduce it.

@AlexVaq: I have almost finished the host clover kernel that will allow for real testing of the Wilson clover action. From this I imagine you can trivially extend it to twisted clover.

@maddyscientist (Member) commented Jun 15, 2016

@kostrzewa Can you confirm that QudaInvertParam::compute_clover_trlog=0 in the call to loadCloverQuda()?

@kostrzewa (Member Author)

I will only be able to check this directly after the 23rd or so. If newQudaInvertParam() sets it to 0 by default, then it should have been 0, yes. We don't touch that parameter. I guess we would if we did HMC or reweighting using QUDA.

@maddyscientist (Member)

@kostrzewa That's OK; your use of newQudaInvertParam() confirms that it will be set to 0. You are correct, this is used for HMC (and could be used for reweighting).
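
For completeness, a sketch of what the check amounts to on the tmLQCD side (field name as above; the printf is purely illustrative):

    #include <stdio.h>
    #include <quda.h>

    /* sketch: confirm the default value coming out of newQudaInvertParam() */
    void check_trlog_flag(void)
    {
      QudaInvertParam inv_param = newQudaInvertParam();
      /* not touched anywhere in the tmLQCD interface, so it keeps the default
         of 0; it would only be enabled for HMC or reweighting */
      printf("compute_clover_trlog = %d\n", inv_param.compute_clover_trlog);
    }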

@kostrzewa (Member Author)

Sorry, I seem to have spoken too soon, I just experienced two more lockups...

@kostrzewa (Member Author) commented Jun 28, 2016

Also, in contrast to what I saw before, the final true residual is now apparently wrong.

# QUDA: CG: 4230 iterations, <r,r> = 2.376115e-16, |r|/|b| = 2.489101e-11
# QUDA: CG: Reliable updates = 6
# QUDA: CG: Convergence at 4230 iterations, L2 relative residual: iterated = 2.489101e-11, true = 2.489101e-11
# QUDA: Solution = 2.20368e+06
# QUDA: Reconstructed: CUDA solution = 3.6399e+06, CPU copy = 3.6399e+06
# QUDA: Device memory used:  Spinor: 0.375000 GiB,  Gauge: 0.000000 GiB, Clover: 0.000000 GiB
# QUDA: Done: 4230 iter / 55.6053 secs = 793.995 Gflops
# QUDA: time spent in reorder_spinor_fromQuda: 0.072200 secs
# QUDA: time spent in reorder_spinor_fromQuda: 0.071547 secs
[...]
# Inversion done in 4230 iterations, squared residue = 1.475342e+05!

In the last line, the true <r,r> is computed using the tmLQCD operator. In this run, P2P was enabled.

I also experience lockups in different places now. With P2P disabled, the code locks up during operator creation and, in a different run, at the fifth CG iteration.

Would it be useful for you if you could try the exact same code on one of your machines with one of our gauge configurations?

Cheers,
Bartek

@AlexVaq (Member) commented Jun 28, 2016

So, it gives a wrong result? That's pretty serious. I might be able to try one of your confs; can you give me the details?


@kostrzewa (Member Author)

@AlexVaq I've sent you the paths by e-mail, let me know if there's a problem accessing them.

@AlexVaq (Member) commented Jun 28, 2016

Thanks, I can access the configurations. I’ll have a look at it as soon as I can. Let’s see if I can find some spot later today...

@maddyscientist (Member)

I'm also happy to reproduce the issue locally if you give me instructions on how to build the code.

Also, can you check whether the dslash_test and invert_test QUDA internal tests are working on Jureca (with --dslash-type twisted-clover to select the fermion type)?

@maddyscientist (Member)

Looks like the lack of clover convergence was a stupid bug I created whilst adding my reference clover dslash changes (I introduced a bug in the clover inversion). Fixed in #483; @AlexVaq, I've assigned this to you to merge and close (@mathiaswagner is on holiday).

@kostrzewa Hopefully your convergence issues should be taken care of. The only thing left that is worrisome is the lockups, which I've never reproduced. I think the best thing there is for me to match your workflow on my workstation here, so I can try to reproduce them.

@kostrzewa (Member Author) commented Jul 15, 2016

@maddyscientist @AlexVaq
Thanks to both of you. I haven't had much time lately to look into this unfortunately, or to forward the necessary info. This is just a quick update to say that indeed our residual check now works out again (with the latest develop branch). I still have the lockup issue unfortunately. I will try to prepare instructions on how to run the code as soon as I have a moment. Cheers!

@maddyscientist (Member)

@kostrzewa Looking at the comments above, it looks like the lockups have only been seen on Jureca, is that correct?

@kostrzewa (Member Author)

Yes, that's correct.

@maddyscientist (Member)

Do you know what version of linux is running on Jureca? We've seen some codes hang with multi-GPU running on RHEL / CentOS 6.6 owing to a bug in the kernel (https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64).

@maddyscientist (Member)

Another thing I would suggest is that if you can run interactively on Jureca, you could run the code until it hangs then attach gdb to the hung process and get a stack trace to see where it's hung.

@maddyscientist (Member)

@kostrzewa Can you test this workflow on the feature/memory-pool branch? I've just noticed that the initial autotuner state was not initialized. This could lead to some processes autotuning and others not tuning, with the state not being set until the first call to the linear solver. While I don't think this would cause a hang, since the tuning is a purely local process, it would be good to rule this out as the source of the problem.

I've also simplified the process of enabling the autotuner: tuning is now enabled by default, unless the environment variable QUDA_ENABLE_TUNING=0 is set. If this branch doesn't fix the hanging problem, it would be good to try your runs with the tuning disabled to rule out the autotuner as a source of the problem.

@kostrzewa (Member Author)

we resolved this a while ago
