Numerous run time issues on Betzy login3 #573
Comments
The NorESM job does not stop when an error occurs, because the failing tasks exit without reporting the error to the MPI library.
What you want to do is the following:
You can see the documentation for srun on Betzy just by running srun --help.
I think this can also be applied to the noresm2.1 and noresm2.3 versions.
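The exact srun recommendation is not preserved in this thread, but as a hedged illustration, Slurm's standard --kill-on-bad-exit option makes a job step abort as soon as any task exits with a nonzero code, which addresses tasks exiting without signalling MPI:

# Show the srun options available on Betzy:
srun --help | less

# Hypothetical invocation (cesm.exe and the flag value are illustrative, not the elided original):
srun --kill-on-bad-exit=1 ./cesm.exe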
I have started running NorESM2-LM (release-noresm2.0.8) tests using the 2023 compiler/library versions (which load hpcx/2.20 instead of the hpcx/2.14 loaded by the 2022 versions) to see if that helps avoid the UCX errors. First, I tried 2023a, and then I tried 2023b (env_mach_specific.xml.gz). I am currently running a couple of 25-node jobs that each run 5 instances of NorESM2-LM, and one 100-node job (with 25 instances). No UCX errors so far.
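A minimal sketch of verifying which hpcx version a given stack pulls in on Betzy (the netCDF-Fortran/4.6.1-iompi-2023b module name is taken from a later comment; everything else is an assumption):

module --force purge
module load netCDF-Fortran/4.6.1-iompi-2023b   # 2023b stack mentioned later in this thread
module list 2>&1 | grep -i hpcx                # per the comment above, should report hpcx/2.20 rather than 2.14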
@IngoBethke In your setup, this will override some dependencies of netCDF (e.g. bzip2, zlib, Szip), which is not a big deal but might cause some issues later.
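A hedged way to see which of those netCDF dependencies would be overridden, assuming Lmod's module show is available on Betzy:

# Inspect the dependency modules that the netCDF-Fortran module would (re)load:
module show netCDF-Fortran/4.6.1-iompi-2023b 2>&1 | grep -iE 'bzip2|zlib|szip'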
I have a maybe stupid question: could the ucx errors also be node-dependent? I am asking because I have now run a few (short) runs successfully with the very same settings as before, but they ran on different nodes (previously, they crashed reporting errors for nodes b4394 and b4396).
@jmaerz Sigma2 fixed the ucx error, so the old configs should work on all nodes as before.
UPDATE 2024.10.25: the ucx problem is still around even when using the 2023b library version. Experienced another crash of a 20-node job last night (see log below).
UPDATE 2024.10.22: ucx time-out error using the 2023b libraries with hpcx v2.20.
UPDATE 2024.10.21: I have run several thousand simulation years using the 2023b libraries without a single crash. I am unlikely to go back to using 2022a, but I would be interested to hear whether the ucx errors still occur with 2022a.

UPDATE: see the above post by Matvey. @mvdebolskiy, can you post some more info on what Sigma2 did to fix it?

I have now run over 1000 simulation years using netCDF-Fortran/4.6.1-iompi-2023b and have not encountered a single crash. If switching from hpcx/2.14 to hpcx/2.20 is indeed what made the difference, then it could be worth trying to load hpcx/2.20 together with the 2022a library versions. What do you think? In my case, the ucx errors occurred unmistakably with the hpcx/2.14 library.
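A purely hypothetical sketch of the combination suggested above; whether a standalone hpcx/2.20 module exists on Betzy and what the 2022a toolchain module is called are both assumptions, and this is not the command sequence from the original comment:

module --force purge
module load iompi/2022a    # assumed name of the 2022a Intel+OpenMPI toolchain module
module load hpcx/2.20      # assumed module name; may not exist as a separate module on Betzy
module list 2>&1 | grep -i hpcx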
I tried setting OMPI_MCA_coll_hcoll_enable=0 because the ucx error message mentions hcoll.

Good question. According to my and Alok's experience, the ucx error was not observed before mid-August, which is suspiciously close to the installation date of the current hpcx libraries. That doesn't necessarily mean that a badly configured node or a bad interconnect cannot trigger the error. To be on the safe side, my current runs exclude about 20 nodes (somewhat arbitrarily), but b4394 and b4396 are not among those: b2113,b2114,b2115,b2116,b5226,b1216,b1226,b3355,b3356,b3357,b3359,b3379,b3382,b3383,b3338,b3340,b3343,b3344,b3345,b2171,b2172,b2173. But I think I may try running without the node-exclude list in the future.
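For reference, that exclusion can be expressed with Slurm's standard --exclude option in the job script (the node list is copied from the comment above; excluding them at all is, as noted, a precaution rather than a confirmed fix, and the job name is just a placeholder):

#!/bin/bash
#SBATCH --job-name=noresm_test    # hypothetical job name
#SBATCH --exclude=b2113,b2114,b2115,b2116,b5226,b1216,b1226,b3355,b3356,b3357,b3359,b3379,b3382,b3383,b3338,b3340,b3343,b3344,b3345,b2171,b2172,b2173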
@IngoBethke I had problems with running on 2022a last Thursday, but the tests with just
@mvdebolskiy My latest simulations ran 5-10% slower on 11 October but otherwise at a steady pace during 5-14 October. In comparison, my simulations performed in August and September using the 2020 libraries were about 10-15% slower than my latest ones. So my experience is that my simulations are running a bit faster now. You can check my timing files in
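A hedged sketch for comparing throughput across runs, assuming the standard CESM/CIME timing files with their "Model Throughput" summary line (the path pattern is illustrative, not the elided location above):

# Compare simulated-years-per-day across cases:
grep -h "Model Throughput" /cluster/work/users/$USER/*/timing/cesm_timing.* | sort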
I have access. I will check.
My noresm_2_5_alpha06 simulation crashed yesterday evening, but I didn't realize it until this afternoon when I checked the run directory. The job appeared to be running correctly, as it was still listed in squeue without any issues. However, upon reviewing the cesm.log, I found the following error:
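A hedged aside on catching this kind of silent crash earlier from the command line (standard Slurm commands; the log location assumes the usual CIME run-directory layout):

squeue -u $USER                                    # the job can still look healthy here
tail -n 40 cesm.log.*                              # run from the case run directory
grep -iE 'error|abort|backtrace' cesm.log.* | tail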
Getting strange messages piped to the display from where I submitted the job:
This happens for various
@jmaerz that's from the login nodes too.
Thanks for the explanation - I was guessing so, but wasn't entirely sure.
Update 2024.10.21: queuing works normally again.

Betzy's queuing system seems to be broken or put on pause. My queued jobs all got "requeued", i.e. cancelled, and newly submitted jobs are not even processed properly by the queuing system. That other users are affected by the same issue is apparent from https://metadoc.sigma2.no/status_graph/?machine=betzy&period=week&type=cpu&size=large&dynamic=false&start=0&end=0&format=png
I had the same problem. My jobs just started disappearing without a trace...
This is an initial placeholder for the numerous issues that have occurred on Betzy as part of the OS upgrade. Since only login3 is currently available, these have all occurred there.
From @mvertens:
Currently this is all using the noresm2_5_alpha06 code base that was just created last week.
There are two separate errors I encountered, both of which I reported to Sigma2.

1) A UCX error with the following backtrace:

==== backtrace (tid: 40171) ====
 0 0x000000000005e810 uct_ud_ep_deferred_timeout_handler()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/ib/ud/base/ud_ep.c:278
 1 0x000000000004fd37 ucs_callbackq_slow_proxy()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/callbackq.c:404
 2 0x000000000004881a ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc
There seems to be an outstanding issue on this here: openucx/ucx#5159
Sigma2 suggests moving to the intel/2023a toolchain (currently @mvdebolskiy is working on this), but it might be that Sigma2 needs to upgrade their openucx.
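A hedged way to check which UCX is actually being picked up before and after a toolchain switch (ucx_info ships with UCX, ompi_info with Open MPI; whether the reported version changes is the thing to verify):

ucx_info -v               # prints the UCX version and build configuration in use
ompi_info | grep -i ucx   # shows whether Open MPI was built against UCX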
2) HCOLL/shared-memory errors of the form:

[LOG_CAT_MLB] Registration of 0 network context failed. Don't use HCOLL
[LOG_CAT_MLB] Failed to grow mlb dynamic manager
[LOG_CAT_MLB] Payload allocation failed
[LOG_CAT_BASESMUMA] Failed to shmget with IPC_PRIVATE, size 20971520, IPC_CREAT; errno 28:No space left on device
[LOG_CAT_MLB] Registration of 0 network context failed. Don't use HCOLL
[LOG_CAT_MLB] Failed to grow mlb dynamic manager
In this case the solution was to set the environment variable OMPI_MCA_coll_hcoll_enable to 0.
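A minimal sketch of applying that workaround in a job script; setting it via env_mach_specific.xml, mentioned elsewhere in this thread, would be an alternative:

# Disable HCOLL collectives so Open MPI falls back to its own collectives:
export OMPI_MCA_coll_hcoll_enable=0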
Sigma2 has a fix for (2), which requires the 2023a toolchain that Matvey is working on.
I am not sure that updating to the 2023a toolchain will fix (1).
I think we should try the new toolchain once @mvdebolskiy is ready with the update and see if (1) occurs again.