Final values with 0 -> no IO values #173
I think I have identified the bug here. The stats that are read from
P.S. In the case where the children exit but prmon does not do a polling cycle before the end of the job, we do report the correct numbers, because the last values read from the children are effectively remembered internally.
Just an update... things are weirder than I thought. It seems that in the simple case when a parent process forks a child, and that child does some i/o and exits, the parent inherits the i/o in its /proc counters. This means that
So really, diagnosing why the ATLAS jobs are missing i/o (maybe the pilot is now doing a daemon-type double fork of part of the job?) is harder and the fix is not so clear.
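For reference, here is a small standalone check of the fork/inherit behaviour described above. This is only a sketch, not prmon code; the scratch file name is made up for the example. It forks a child that does some write i/o, reaps it, and compares the parent's own /proc/self/io counters before and after.

```cpp
// Sketch only: compare the parent's /proc/self/io counters before forking a
// child and after reaping it, to see whether the child's i/o shows up there.
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

std::map<std::string, unsigned long long> read_self_io() {
  std::map<std::string, unsigned long long> counters;
  std::ifstream io_file("/proc/self/io");  // lines look like "wchar: 12345"
  std::string key;
  unsigned long long value;
  while (io_file >> key >> value) {
    if (!key.empty() && key.back() == ':') key.pop_back();
    counters[key] = value;
  }
  return counters;
}

int main() {
  auto before = read_self_io();
  if (fork() == 0) {
    // Child: write ~1 MB to a scratch file (the name is arbitrary), then exit.
    std::ofstream("prmon_io_test.tmp") << std::string(1 << 20, 'x');
    _exit(0);
  }
  wait(nullptr);  // reap the child
  auto after = read_self_io();
  std::cout << "parent wchar before fork: " << before["wchar"]
            << ", after reaping child: " << after["wchar"] << std::endl;
  return 0;
}
```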
BTW, it's perhaps worth noting that it's only the instance launched by the pilot that seems to have this issue.
Thanks for checking @amete - I was indeed going to look at that. I have asked @PalNilsson for insight into how the pilot is launching the payloads.
The pilot starts prmon just after the singularity container for the payload is launched and "listens" to this process. In https://bigpanda.cern.ch//media/filebrowser/35939f9e-bbac-4f18-9978-8efb01e7db46/valid3/tarball_PandaJob_4858116576_LRZ-LMU_CLOUD/pilotlog.txt search for the first occurrence of "prmon" - the lines before are where the pilot identifies the singularity process that was just launched above. Cheers, Johannes
Oh, I wonder if it's something to do with running in singularity then? Is that ~new? (I am also thinking we haven't really changed this logic in years.) If I look at the reproducer that PanDA gives me... http://pandaserver.cern.ch:25080//server/panda/getScriptOfflineRunning?pandaID=4858116576 there's no mention of singularity there. Is there an easy recipe to start the job in the same way as the pilot does?
We have been running singularity for every PanDA payload since summer 2019 - this is also written in the pilot logs (admittedly one has to search for it). The singularity launch is hidden in "...atlasLocalSetup.sh -c $thePlatform ..." in the final command (see below).

Create the following 2 files, adjusting "/srv" in the commands below and probably X509_USER_PROXY.

my_release_setup.sh:
source $AtlasSetup/scripts/asetup.sh Athena,22.0.18,notest --platform x86_64-centos7-gcc8-opt --makeflags='$MAKEFLAGS'

container_script.sh:
export TMPDIR=/srv;export GFORTRAN_TMPDIR=/srv;export ATHENA_CORE_NUMBER=10;export ATHENA_PROC_NUMBER=10;export PANDA_RESOURCE='LRZ-LMU_CLOUD';export FRONTIER_ID="[22770742_4858116576]";export CMSSW_VERSION=$FRONTIER_ID;export PandaID=4858116576;export PanDA_TaskID='22770742';export PanDA_AttemptNr='2';export INDS='data18_13TeV:data18_13TeV.00357750.physics_Main.daq.RAW';export RUCIO_APPID='reprocessing';export RUCIO_ACCOUNT='pilot'; Reco_tf.py --inputBSFile=data18_13TeV.00357750.physics_Main.daq.RAW._lb0106._SFO-8._0005.data --multithreaded=True --postExec 'all:from IOVDbSvc.CondDB import conddb; conddb.addOverride("/TRT/Cond/StatusHT", "TrtStrawStatusHT-RUN2-Reprocessing");' 'RAWtoESD:StreamESD.ExtraOutputs = [('"'"'DataHeader'"'"','"'"'StoreGateSvc+StreamESD'"'"')]' --preExec 'all:from AthenaCommon.AthenaCommonFlags import athenaCommonFlags; from InDetRecExample.InDetJobProperties import InDetFlags; from RecExConfig.RecFlags import rec; rec.doTrigger = False; InDetFlags.useDynamicAlignFolders.set_Value_and_Lock(True); InDetFlags.useMuForTRTErrorScaling.set_Value_and_Lock(True if not rec.doRDOTrigger else False); from MuonCombinedRecExample.MuonCombinedRecFlags import muonCombinedRecFlags; from AthenaMonitoring.DQMonFlags import DQMonFlags; DQMonFlags.doMonitoring=False' --autoConfiguration=everything --conditionsTag all:CONDBR2-BLKPA-2018-14 --geometryVersion=default:ATLAS-R2-2016-01-00-01 --runNumber=357750 --AMITag=r12148 --outputAODFile=AOD.22770742._002116.pool.root.1 --outputESDFile=ESD.22770742._002116.pool.root.1 --jobNumber=1177;

final command:
export X509_USER_PROXY=/var/lib/condor/execute/dir_28465/prod;if [ -z "$ATLAS_LOCAL_ROOT_BASE" ]; then export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase; fi;export thePlatform="x86_64-centos7-gcc8-opt";export PandaID=4858116576;source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh -c $thePlatform -s /srv/my_release_setup.sh -r /srv/container_script.sh -e "-C"

Cheers, Johannes
Thanks a lot @elmsheus. Can I ask one more favour (you or @amete)? Could you put the input file somewhere I can see it? I just generated a new grid certificate (it has been years since I needed one), but the ATLAS VOMS server is not letting me re-register. I'd like to avoid this rabbit hole, so if I can just get the input file it will speed things up a lot.
@graeme-a-stewart, no problem - I've copied the RAW file to: /afs/cern.ch/user/e/elmsheus/public/forGraeme/data18_13TeV/data18_13TeV.00357750.physics_Main.daq.RAW._lb0106._SFO-8._0005.data
Side note: X509_USER_PROXY should not be needed for the local job execution.
Thanks a lot for the input file @elmsheus. So, I have re-run the job, both plain and in singularity. However, I have not been able to reproduce the problem as seen in the real production job, even when pointing prmon right at the singularity launch command. As it's not 100% clear what the problem actually is, it's also not completely clear how to properly fix it. I can envisage a monkey patch, where prmon remembers the highest i/o values it has seen and never allows the reported values to be less than these (as they have to be monotonically increasing). If it's only a glitch happening at the end of the job, for reasons we still don't quite understand, then that will be a decent workaround, but it's not entirely satisfactory.
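For illustration, the clamping idea could look roughly like the sketch below. The class and member names are invented for the example; this is not the actual prmon patch. Each time a new set of counters is polled, the element-wise maximum ever seen is kept and reported instead of the raw reading.

```cpp
// Sketch of the "never report less than the maximum already seen" idea.
// Invented names; not prmon's implementation.
#include <algorithm>
#include <map>
#include <string>

class MonotonicIOStats {
  std::map<std::string, unsigned long long> max_seen_;

 public:
  // Merge a freshly polled set of counters (rchar, wchar, ...) and return the
  // clamped values: a spurious zero from a vanished child can never drag the
  // reported totals below what has already been observed.
  const std::map<std::string, unsigned long long>& update(
      const std::map<std::string, unsigned long long>& polled) {
    for (const auto& kv : polled) {
      auto& best = max_seen_[kv.first];
      best = std::max(best, kv.second);
    }
    return max_seen_;
  }
};
```

Since the kernel counters are cumulative, the largest value seen is also the last good one, which is what the summary json is expected to contain.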
@graeme-a-stewart, thanks again for looking into this. I guess it's most probably a timing/time-out problem at the end of the singularity process, with prmon or the pilot terminating the remaining processes (without having real evidence...). The suggested "monkey patch" makes sense conceptually, since the json file reports the total sum of the I/O numbers and they should be the "last good" values of the txt file - could we go ahead with this?
Hi @graeme-a-stewart, thanks a lot for this!
P.S. If prmon detects this problem you'll see messages like this:
Thinking about this a little more, I realise that there can be a race condition. The list of PIDs is gathered, then passed to each of the different monitors in turn. If a monitored process exits before it can be polled then it's impossible to read its I/O stats and we could end up with a spurious reading of zero. Usually this would be really rare, so why it would happen to almost 50% of the jobs in this task I have no idea. So I suspect there is still a different/additional cause.
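To make that race concrete, here is a rough sketch (invented names, not prmon's code) of polling a pre-gathered PID list: if a process exits between the PID scan and the read, /proc/<pid>/io is already gone, and without a fallback its contribution silently drops to zero. Remembering the last successful reading per PID avoids that.

```cpp
// Sketch of race-tolerant polling: keep the last successful reading per PID
// and fall back to it when /proc/<pid>/io can no longer be read.
#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <sys/types.h>

using io_counters = std::map<std::string, unsigned long long>;

io_counters poll_io(const std::vector<pid_t>& pids,
                    std::map<pid_t, io_counters>& last_good) {
  io_counters totals;
  for (pid_t pid : pids) {
    io_counters counters;
    std::ifstream io_file("/proc/" + std::to_string(pid) + "/io");
    std::string key;
    unsigned long long value;
    while (io_file >> key >> value) {
      if (!key.empty() && key.back() == ':') key.pop_back();
      counters[key] = value;
    }
    if (!counters.empty())
      last_good[pid] = counters;           // fresh, successful reading
    for (const auto& kv : last_good[pid])  // otherwise reuse the remembered one
      totals[kv.first] += kv.second;
  }
  return totals;
}
```

This only helps for processes that were polled successfully at least once; a child that starts and exits entirely between two polling cycles still contributes nothing, consistent with the earlier note that the correct numbers survive only when the last values from the children were actually read.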
Hi,
as discussed by email, here is the GitHub issue:
we discussed the following prmon problem a while ago and thought it
was fixed: final values set to 0 for the IO values in the summary
json file.
Here is a PanDA job of the recent 22.0.18 validation task:
https://bigpanda.cern.ch/job?pandaid=4858116576
The values of rchar, read_bytes, wchar and write_bytes are set to 0 in
memory_monitor_summary.json:
http://bigpanda.cern.ch//media/filebrowser/35939f9e-bbac-4f18-9978-8efb01e7db46/valid3/tarball_PandaJob_4858116576_LRZ-LMU_CLOUD/memory_monitor_summary.json
because the values in the last line of memory_monitor_output.txt are
mostly 0:
https://bigpanda.cern.ch//media/filebrowser/35939f9e-bbac-4f18-9978-8efb01e7db46/valid3/tarball_PandaJob_4858116576_LRZ-LMU_CLOUD/memory_monitor_output.txt
The jobs use the latest stable prmon version setup by the PanDA pilot
from ALRB through: setupATLAS; lsetup prmon
It seems not to be site-dependent for this task (it happens at CERN-TO,
CERN, CERN-PROD_EOS, TW-FTT, LRZ-LMU_CLOUD and UNI_FREIBURG) and it
affects 952 jobs out of 2054 (so it's pretty bad...).
For all these jobs, 0-values are reported back to PanDA and the
analytics ES cluster in Chicago, so we cannot do a full analysis
w.r.t. IO for many jobs - the individual job plotting is obviously
unaffected: https://bigpanda.cern.ch/prmonplots/4858116576/
Do you want me to open a GitHub issue? Perhaps the logic in prmon has
to be changed for PanDA jobs, since prmon is "killed" too late in the
update cycle?
Cheers, Johannes