Workflow 180.1 fails #44536

Closed
makortel opened this issue Mar 25, 2024 · 15 comments

Comments

@makortel
Contributor

Step 1 of workflow 180.1 fails in all(?) IBs with

   ______________________________________     
         Running STARlight                    
   ______________________________________     
%MSG-STARLIGHT number of events requested = 3
%MSG-STARLIGHT random seed used for the run = 234570
%MSG-STARLIGHT number of cputs for the run = 1
%MSG-STARLIGHT SCRAM_ARCH version = el8_amd64_gcc10
%MSG-STARLIGHT CMSSW version = CMSSW_12_5_5_patch1
Changed to: /data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p
Changed to: /data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p
Changed to: /data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
cp: cannot create directory '/data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p/CMSSW_12_5_5_patch1/config': File exists
Traceback (most recent call last):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 114, in <module>
    main()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 109, in main
    if not execcommand(args, opts):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 103, in execcommand
    return eval('scram_commands.cmd_%s' % cmds[0])(args, opts)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/CMD.py", line 58, in cmd_project
    return process(args)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/Commands/project.py", line 71, in process
    return project_bootfromrelease(project.upper(), version, releasePath, opts)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/Commands/project.py", line 116, in project_bootfromrelease
    area = relarea.satellite(installdir, installname, symlink, Core().localarea())
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Configuration/ConfigArea.py", line 182, in satellite
    utime(join(devconf, "Self.xml"))
FileNotFoundError: [Errno 2] No such file or directory
WARNING: Developer's area is created for architecture el8_amd64_gcc10 while your current OS is slc7_amd64.
Traceback (most recent call last):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 114, in <module>
    main()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 109, in main
    if not execcommand(args, opts):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 103, in execcommand
    return eval('scram_commands.cmd_%s' % cmds[0])(args, opts)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/CMD.py", line 53, in cmd_runtime
    return process(args)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/Commands/runtime.py", line 30, in process_runtime
    rt.save(RUNTIME_SHELLS[args[0]])
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/RuntimeEnv.py", line 145, in save
    env = self._runtime()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/RuntimeEnv.py", line 396, in _runtime
    tools = toolmanager.loadtools()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/BuildSystem/ToolManager.py", line 123, in loadtools
    with open(tool) as ref:
IsADirectoryError: [Errno 21] Is a directory: '/data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p/CMSSW_12_5_5_patch1/.SCRAM/el8_amd64_gcc10/tools/tools'
WARNING: Developer's area is created for architecture el8_amd64_gcc10 while your current OS is slc7_amd64.
*** STARTING STARLIGHT PRODUCTION ***
./starlight: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
/data/cmsbld/jenkins/workspace/ib-run-relvals/CMSSW_14_1_X_2024-03-22-2300/pyRelval/180.1_Starlight_DoubleDiffraction_5360_HI_2023/thread0/lheevent/macros/convert_SL2LHE: error while loading shared libraries: libEve.so: cannot open shared object file: No such file or directory
sed: can't read slight.lhe: No such file or directory
sed: can't read slight.lhe: No such file or directory
Traceback (most recent call last):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 114, in <module>
    main()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 109, in main
    if not execcommand(args, opts):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 103, in execcommand
    return eval('scram_commands.cmd_%s' % cmds[0])(args, opts)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/CMD.py", line 53, in cmd_runtime
    return process(args)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/Commands/runtime.py", line 30, in process_runtime
    rt.save(RUNTIME_SHELLS[args[0]])
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/RuntimeEnv.py", line 145, in save
    env = self._runtime()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/RuntimeEnv.py", line 396, in _runtime
    tools = toolmanager.loadtools()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/BuildSystem/ToolManager.py", line 123, in loadtools
    with open(tool) as ref:
IsADirectoryError: [Errno 21] Is a directory: '/data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p/CMSSW_12_5_5_patch1/.SCRAM/el8_amd64_gcc10/tools/tools'
sed: can't read slight.lhe: No such file or directory
mv: cannot stat 'slight.lhe': No such file or directory
***STARLIGHT COMPLETE***
xmllint integrity check failed on cmsgrid_final.lhe
   ______________________________________     
         Running Generic Tarball/Gridpack     
   ______________________________________     
gridpack tarball path = /cvmfs/cms.cern.ch/phys_generator/gridpacks/RunIII/5p36TeV/starlight/starlight_double_diffraction_el8_amd64_gcc10_CMSSW_12_5_5_patch1_tarball.tgz
%MSG-MG5 number of events requested = 3
%MSG-MG5 random seed used for the run = 234569
%MSG-MG5 thread count requested = 1
%MSG-MG5 residual/optional arguments = 
Traceback (most recent call last):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 114, in <module>
    main()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 109, in main
    if not execcommand(args, opts):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 103, in execcommand
    return eval('scram_commands.cmd_%s' % cmds[0])(args, opts)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/CMD.py", line 53, in cmd_runtime
    return process(args)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/Commands/runtime.py", line 30, in process_runtime
    rt.save(RUNTIME_SHELLS[args[0]])
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/RuntimeEnv.py", line 145, in save
    env = self._runtime()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/RuntimeEnv.py", line 396, in _runtime
    tools = toolmanager.loadtools()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/BuildSystem/ToolManager.py", line 123, in loadtools
    with open(tool) as ref:
IsADirectoryError: [Errno 21] Is a directory: '/data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p/CMSSW_12_5_5_patch1/.SCRAM/el8_amd64_gcc10/tools/tools'
*** STARTING STARLIGHT PRODUCTION ***
./starlight: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
/data/cmsbld/jenkins/workspace/ib-run-relvals/CMSSW_14_1_X_2024-03-22-2300/pyRelval/180.1_Starlight_DoubleDiffraction_5360_HI_2023/thread1/lheevent/macros/convert_SL2LHE: error while loading shared libraries: libEve.so: cannot open shared object file: No such file or directory
sed: can't read slight.lhe: No such file or directory
sed: can't read slight.lhe: No such file or directory
*** STARTING STARLIGHT PRODUCTION ***
sed: can't read slight.lhe: No such file or directory
./starlight: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
/data/cmsbld/jenkins/workspace/ib-run-relvals/CMSSW_14_1_X_2024-03-22-2300/pyRelval/180.1_Starlight_DoubleDiffraction_5360_HI_2023/thread3/lheevent/macros/convert_SL2LHE: error while loading shared libraries: libEve.so: cannot open shared object file: No such file or directory
mv: cannot stat 'slight.lhe': No such file or directory
***STARLIGHT COMPLETE***
sed: can't read slight.lhe: No such file or directory
sed: can't read slight.lhe: No such file or directory
xmllint integrity check failed on cmsgrid_final.lhe
sed: can't read slight.lhe: No such file or directory
mv: cannot stat 'slight.lhe': No such file or directory
***STARLIGHT COMPLETE***
xmllint integrity check failed on cmsgrid_final.lhe
   ______________________________________     
         Running STARlight                    
   ______________________________________     
%MSG-STARLIGHT number of events requested = 3
%MSG-STARLIGHT random seed used for the run = 234569
%MSG-STARLIGHT number of cputs for the run = 1
%MSG-STARLIGHT SCRAM_ARCH version = el8_amd64_gcc10
%MSG-STARLIGHT CMSSW version = CMSSW_12_5_5_patch1
Changed to: /data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
WARNING: There already exists /data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p/CMSSW_12_5_5_patch1 area for SCRAM_ARCH el8_amd64_gcc10.
Traceback (most recent call last):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 114, in <module>
    main()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 109, in main
    if not execcommand(args, opts):
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/bin/scram.py", line 103, in execcommand
    return eval('scram_commands.cmd_%s' % cmds[0])(args, opts)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/CMD.py", line 53, in cmd_runtime
    return process(args)
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/Commands/runtime.py", line 30, in process_runtime
    rt.save(RUNTIME_SHELLS[args[0]])
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/RuntimeEnv.py", line 145, in save
    env = self._runtime()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/Core/RuntimeEnv.py", line 396, in _runtime
    tools = toolmanager.loadtools()
  File "/cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V3_00_66/SCRAM/BuildSystem/ToolManager.py", line 123, in loadtools
    with open(tool) as ref:
IsADirectoryError: [Errno 21] Is a directory: '/data/cmsbld/jenkins/workspace/ib-run-relvals/lhe1t2m3p/CMSSW_12_5_5_patch1/.SCRAM/el8_amd64_gcc10/tools/tools'
*** STARTING STARLIGHT PRODUCTION ***
./starlight: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
/data/cmsbld/jenkins/workspace/ib-run-relvals/CMSSW_14_1_X_2024-03-22-2300/pyRelval/180.1_Starlight_DoubleDiffraction_5360_HI_2023/thread2/lheevent/macros/convert_SL2LHE: error while loading shared libraries: libEve.so: cannot open shared object file: No such file or directory
sed: can't read slight.lhe: No such file or directory
sed: can't read slight.lhe: No such file or directory
sed: can't read slight.lhe: No such file or directory
mv: cannot stat 'slight.lhe': No such file or directory
***STARLIGHT COMPLETE***
xmllint integrity check failed on cmsgrid_final.lhe
----- Begin Fatal Exception 23-Mar-2024 02:33:35 CET-----------------------
An exception of category 'ExternalLHEProducer' occurred while
   [0] Processing global begin Run run: 1
   [1] Calling method for module ExternalLHEProducer/'externalLHEProducer'
Exception Message:
Child failed with exit code 1.
----- End Fatal Exception -------------------------------------------------
Another exception was caught while trying to clean up runs after the primary fatal exception.

e.g. https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc12/CMSSW_14_1_X_2024-03-22-2300/pyRelValMatrixLogs/run/180.1_Starlight_DoubleDiffraction_5360_HI_2023/step1_Starlight_DoubleDiffraction_5360_HI_2023.log#/

@makortel
Contributor Author

assign generators

@cmsbuild
Contributor

New categories assigned: generators

@alberto-sanchez,@bbilin,@GurpreetSinghChahal,@mkirsano,@menglu21,@SiewYan you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @makortel.

@smuzaffar, @antoniovilela, @rappoccio, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor Author

The workflow was added in #44316, which was merged in CMSSW_14_1_X_2024-03-22-2300, the IB in which the failures appeared.

@makortel
Contributor Author

@stahlleiton disables the workflow in #44540

@smuzaffar
Contributor

Workflow 180.1 runs fine when it is not run in threaded mode. In threaded mode, multiple threads try to create the CMSSW_12_5_5_patch1 dev area at the same time under the same path, $CMSSW_BASE/../lhe1t2m3p [a]. The problem was that SCRAM V3 did not allow the creation of a CMSSW dev area within an existing CMSSW dev area if the two CMSSW versions used different build rules tags. That was a bug in SCRAM, which is now fixed. @stahlleiton, all you need to do is update lheevent/runcmsgrid.sh in the gridpack: drop the creation of $CMSSW_BASE/../lhe1t2m3p and just use the current working directory (which will be a dedicated directory per thread).

[a] lheevent/runcmsgrid.sh

    # Make a directory that doesn't overlap
    if [[ -d "${CMSSW_BASE}" ]] && [[ "${LHEWORKDIR}" = "${CMSSW_BASE}"/* ]]; then
        cd ${CMSSW_BASE}/..
        TPD=${PWD}/lhe1t2m3p
        [[ ! -d "${TPD}" ]] && mkdir ${TPD}
        cd ${TPD}
        echo "Changed to: "${TPD}
    fi

    eval `scramv1 unsetenv -sh`
    export SCRAM_ARCH=${scram_arch_version}
    scramv1 project CMSSW ${cmssw_version}
    cd ${cmssw_version}/src
    eval `scramv1 runtime -sh`
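
A minimal sketch of the suggested change, assuming each thread already starts runcmsgrid.sh in its own working directory. It is not the actual gridpack script: `make_area` is a hypothetical stand-in for `scramv1 project CMSSW ${cmssw_version}` so that the sketch runs without SCRAM, and `setup_area_in_cwd` is an illustrative name. The point is only that creating the area in the current directory, rather than in the shared lhe1t2m3p path, gives every thread a separate area.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `scramv1 project CMSSW <version>`: creates the
# area directory and fails if it already exists, as a colliding thread would.
make_area() {
    mkdir "$1" && mkdir "$1/src"
}

# Create the area inside the current (per-thread) directory instead of the
# shared ${CMSSW_BASE}/../lhe1t2m3p path, then print the resulting src path.
setup_area_in_cwd() {
    local version="$1"
    make_area "${version}" || return 1
    ( cd "${version}/src" && pwd )
}

# Demo: two "threads", each with its own working directory, so no collision.
base=$(mktemp -d)
mkdir "${base}/thread0" "${base}/thread1"
area0=$(cd "${base}/thread0" && setup_area_in_cwd CMSSW_12_5_5_patch1)
area1=$(cd "${base}/thread1" && setup_area_in_cwd CMSSW_12_5_5_patch1)
echo "${area0}"
echo "${area1}"
```

In the real script the `cd ${CMSSW_BASE}/..` and `mkdir ${TPD}` steps would simply be dropped, leaving the `scramv1 project` call to run wherever the thread already is.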

@stahlleiton
Contributor

stahlleiton commented Mar 27, 2024

My plan was to create a temporary directory with a different name per job (using lhe1t2m3p$RANDOM) outside the current CMSSW directory, to avoid clashes between the settings of the two CMSSW environments (i.e. the working directory and the gridpack). But if it is fine to create a CMSSW area inside another one, then I can simply do as you proposed.
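
For illustration only, the per-job directory idea could be sketched as follows. This swaps in `mktemp -d` for the `$RANDOM` suffix mentioned above, since `mktemp` creates the directory atomically and so cannot hand two jobs the same path; the `make_job_dir` helper and the `${TMPDIR:-/tmp}` location are assumptions, not part of the gridpack script.

```shell
#!/usr/bin/env bash
# Create a unique scratch directory per job. mktemp -d both picks the name
# and creates the directory, so concurrent jobs cannot share a path.
# The helper name and the location under ${TMPDIR:-/tmp} are hypothetical.
make_job_dir() {
    mktemp -d "${TMPDIR:-/tmp}/lhe1t2m3p.XXXXXX"
}

dir_a=$(make_job_dir)
dir_b=$(make_job_dir)
echo "job A: ${dir_a}"
echo "job B: ${dir_b}"
```

This alternative was not needed in the end, because the thread settled on using the per-thread working directory directly.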

@smuzaffar
Contributor

smuzaffar commented Mar 27, 2024

A few other workflows, e.g. 523.0, do the same, so it should be fine to create a CMSSW dev area inside another one.

@stahlleiton
Contributor

Great, then I will move to that approach and test it

@stahlleiton
Contributor

stahlleiton commented Mar 27, 2024

I ran multiple relval tests using the same gridpack before and after the fix. They passed once the use of the temporary directory was removed, whereas they failed with the temporary directory, as seen before.

I made a PR to the genproductions repository updating the bash script, and I will create a PR to CMSSW to update the gridpack.

@stahlleiton
Contributor

The issue in workflow 180.1 should now be addressed in #44671

@stahlleiton
Contributor

stahlleiton commented Apr 25, 2024

Were there any issues seen in the IBs after PR 44671 was integrated yesterday?

If not, then we can probably close this issue as resolved.

@makortel
Contributor Author

Thanks!
