v16.3 DA pre-implementation parallel #776
Comments
A meeting hosted by @aerorahul joined by @arunchawla-NOAA on May 10th outlined the following action regarding this issue:
Setup of the DA retro ecflow workflow parallel on WCOSS2 started on 5/12. This activity was delayed from 5/12 to 5/23 due to WCOSS2 RFCs, the production switch, system issues, etc.
The new GSI package as of June 1st is 047b5da (submodule 99f147c). This version has three files and one directory changed in the build process. As of June 10th we have the following on Dogwood for the GSI:
Hi @lgannoaa - There is one minor correction here. Additionally, given changes in the develop branch that have never made it into the release branch for GFS DA components previously, the following build updates are needed. Add export ncio_ver=1.0.0 and load(pathJoin("ncio", os.getenv("ncio_ver"))) so the stack-built ncio module is used, and make these changes:
sorc/build_enkf_chgres_recenter_nc.sh - removal of several lines
sorc/enkf_chgres_recenter_nc.fd/input_data.f90 - replace module_fv3gfs_ncio with module_ncio
sorc/enkf_chgres_recenter_nc.fd/output_data.f90 - replace module_fv3gfs_ncio with module_ncio
sorc/enkf_chgres_recenter_nc.fd/makefile - replace FV3GFS_NCIO_INC entries with NCIO_INC
Finally, for building the GSI, I'd recommend setting export GSI_MODE="GFS". This will build the GSI in global mode (the default is regional, which adds WRF to the build) and will limit the utilities that are built from all utilities (the default) to just those required within the GFS.
@lgannoaa I missed an update that is required for making sorc/enkf_chgres_recenter_nc.fd/makefile work with the stack's ncio module: Replacing FV3GFS_NCIO_LIB with NCIO_LIB. Many thanks to @RussTreadon-NOAA for bringing this to my attention. |
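As a rough illustration of the renames described above, the edits could be applied with something like the following. This is only a sketch (the sed commands are mine, not the actual release-branch commits), and the version/module additions are shown as comments because they belong in the versions and modulefiles rather than a shell script.

```bash
# Version file addition described above:
#   export ncio_ver=1.0.0
# Modulefile (Lua) addition described above:
#   load(pathJoin("ncio", os.getenv("ncio_ver")))

# Illustrative application of the source and makefile renames:
cd sorc/enkf_chgres_recenter_nc.fd
sed -i 's/module_fv3gfs_ncio/module_ncio/g' input_data.f90 output_data.f90
sed -i 's/FV3GFS_NCIO_INC/NCIO_INC/g; s/FV3GFS_NCIO_LIB/NCIO_LIB/g' makefile
```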
Hi @lgannoaa @KateFriedman-NOAA @emilyhcliu, I wanted to ask a question about how you would like to proceed with respect to the renaming of the gsi, enkf, and ncdiag_cat executables. Should I make changes to the GSI/scripts to use the new executables, or will sorc/link_fv3gfs.sh be updated to link these new executable names with the old naming convention? Please let me know your preference so that I can make the necessary changes for release/gfsda.v16.3.0. |
@MichaelLueken-NOAA Let's move this discussion back to the main issue for this upgrade: issue #744. This issue is just for documenting the parallel. @lgannoaa Please keep workflow changes and non-parallel setup discussions in issue #744. Thanks! @MichaelLueken-NOAA Please summarize the GSI executable name changes that are occurring in a new comment in #744 and then tag Emily, Rahul, Lin, and myself to discuss. Thanks! |
@arunchawla-NOAA @aerorahul @emilyhcliu @MichaelLueken-NOAA @RussTreadon-NOAA @KateFriedman-NOAA HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0 |
A meeting with Emily on June 10th outlined the following action:
@emilyhcliu Emily, could you please check with Jun Wang and Helin Wei to make sure the updated forecast model is used in this cycled experiment? Model updates include changes in the LSM to improve the snow forecast and in the UPP to fix the cloud ceiling calculation. (@junwang-noaa @HelinWei-NOAA @WenMeng-NOAA)
@lgannoaa I'm working with Helin to get a new GLDAS tag ready (it includes a small update needed for adding the "atmos" subfolder into the GDA; the PRCP CPC gauge file path needed to add it too). The current GLDAS tag does not yet include that update. I'm also working to wrap up Fit2Obs testing and to get a new tag for your use on WCOSS2 ASAP.
@HelinWei-NOAA I do not see the snow updates PR to the ufs-weather-model production/GFS.v16 branch. Would you please make one if the code updates are ready? Thanks |
@junwang-noaa I created one on fv3atm |
In HOMEgfs/parm/config, config.resources.nco.static and config.fv3.nco.static have been used to fix the eupd job card issue that caused it to fail. The Global Workflow emc.dyn versions of those configs are not yet updated to run high resolution on WCOSS2.
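For context, per-job resource settings carried in config.resources follow the global-workflow naming convention sketched below. The values here are placeholders, not the actual nco.static settings used to fix the eupd job card.

```bash
# Hypothetical eupd resource block in the style of config.resources
# (variable names follow global-workflow convention; values are placeholders).
export wtime_eupd="00:30:00"   # wall-clock limit
export npe_eupd=480            # total MPI tasks
export nth_eupd=4              # OpenMP threads per task
export npe_node_eupd=30        # tasks per node
```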
@emilyhcliu Please review the first three full cycles of the output from this parallel. |
A meeting with @emilyhcliu on June 14th indicated there will be a few more short cycled tests required for package adjustment before starting the official implementation parallel.
A meeting with @emilyhcliu @aerorahul @KateFriedman-NOAA on June 15th outlined the following: |
Information received from Daryl on June 16th outlined the following:
Ali Abdolali is now assigned as WAVE point of contact. |
As of noon on June 24th, here is the state of this parallel:
A meeting was held with CYCLONE_TRACKER code manager Jiayi on June 24th. The jobs and output in $COM/$RUN.$PDY/$cyc/atmos/epac and natl were checked; these jobs ran successfully.
Emily checked the new gdas/gfs prep jobs and the jobs using their output on June 24th; they are working.
As of EOB June 28th, there are three incoming changes:
A new cycled test run started on June 29th to test a few days with DELETE_COM_IN_ARCHIVE_JOB="YES".
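For reference, a minimal sketch of how a cleanup step gated by DELETE_COM_IN_ARCHIVE_JOB could look is shown below. It is an illustration only: ROTDIR, PDY, and cyc are assumed to come from the workflow environment, the 24-hour retention window is a placeholder, and this is not the actual global-workflow arch script.

```bash
# Sketch only: remove an old cycle's COM once its archive job has succeeded.
if [[ "${DELETE_COM_IN_ARCHIVE_JOB:-NO}" == "YES" ]]; then
  # Cycle to delete: 24 hours (placeholder retention) before the current one.
  GDATE=$(date -ud "${PDY} ${cyc}:00:00 24 hours ago" +%Y%m%d%H)
  rm -rf "${ROTDIR}/gdas.${GDATE:0:8}/${GDATE:8:2}"
  rm -rf "${ROTDIR}/gfs.${GDATE:0:8}/${GDATE:8:2}"
fi
```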
Switched verif-global.fd to the verif_global_v2.9.5 tag on July 5th.
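A minimal sketch of that tag switch, assuming verif-global.fd is a git checkout under HOMEgfs/sorc as listed in the description:

```bash
# Illustrative commands for moving the verif-global checkout to the new tag.
cd "${HOMEgfs}/sorc/verif-global.fd"
git fetch --tags origin
git checkout verif_global_v2.9.5
```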
A meeting with the code managers for verif-global and FIT2OBS on July 5th indicated no issues were found in a 10-day test run.
Status of DA package
On July 14th, Cactus had a system degradation issue and NCO halted the system. This brought the parallel to a halt and left transfer jobs incomplete. An example of a failed (zombie) eupd job caused by the system issue was captured. @arunchawla-NOAA @dtkleist @emilyhcliu @aerorahul Action: execute touch on COM so files are not scrubbed.
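A minimal sketch of the touch step, assuming the COM path listed later in this issue and a plain recursive mtime update (the actual command used is not recorded here):

```bash
# Refresh modification times under COM so the PTMP auto-scrubber does not
# remove files while the parallel is halted. Illustrative only.
COM=/lfs/h2/emc/ptmp/lin.gan/da-dev16-ecf
find "${COM}" -type f -exec touch {} +
```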
Set DELETE_COM_IN_ARCHIVE_JOB="YES" after 20211025 rerun completed. |
Because the extreme slowness of the HPSS transfer rate caused archive job failures, a redesign of the archive job is in progress. The prototype was tested on July 15th for a single cycle and provided much higher performance compared to the original archive job design. Cactus performance over the weekend of July 16th and 17th improved a little, but HPSS transfers remain slow. The plan to include the above two changes is still ongoing.

Starting at 2021103000, a two-cycle test of the newly designed archive jobs and the offline METplus cron task is in place. As of July 19th 10:00 AM, the test of the above two changes covered CDATE 2021103000 ~ 2021103106. The initial test results were good; however, the WCOSS2 HPSS transfer slowness in the afternoon caused many test jobs to fail. The parallel is on halt at CDATE 2021103106. The COM size is 91 TB, waiting for the archive jobs to succeed so that COM cleanup can resume and the parallel can proceed.

As of July 20th, the newly designed archive jobs have been running without issue, but the HPSS transfer speed became slow again in the afternoon into the evening. This time the parallel was running so far ahead that it had to be halted to wait for the archive jobs to finish; COM usage was too high, and the archive jobs need to finish before the cleanup jobs can clear COM. At 3:00 PM EST on July 21st, a check showed the archive jobs had all caught up to the parallel. Therefore, the next time the parallel resumes, at CDATE 2021110212, the COM cleanup job will be enabled. Tag: @arunchawla-NOAA @dtkleist @emilyhcliu @aerorahul for your awareness
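The details of the redesigned archive job are not recorded in this issue. As a hedged illustration only, one common way to raise HPSS throughput is to build each cycle's tar bundles independently and run the htar transfers concurrently, along the lines of the sketch below (the bundle names and list files are assumptions, not the actual ecflow task definitions).

```bash
# Illustrative only -- not the actual redesigned archive job. Assumes the
# per-bundle file lists (gdas.list, gfsa.list, ...) were generated earlier.
HPSSDIR=/NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/da-dev16-ecf
CDATE=2021103000

for bundle in gdas gfsa gfsb enkf; do
  # Each htar bundle is independent, so the transfers can overlap instead of
  # running serially inside one long archive job.
  htar -cvf "${HPSSDIR}/${CDATE}/${bundle}.tar" -L "${bundle}.list" \
    > "${bundle}_${CDATE}.htar.log" 2>&1 &
done
wait   # COM cleanup is only allowed after every bundle has transferred
```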
@emilyhcliu HPSS has had system errors in the past few hours; many archive jobs failed with error status=141. The parallel is on halt as of July 18th 10:00 PM until the HPSS system issue is resolved.
We do not have access to Cactus from July 19th 11-15Z, and again from 20-00Z, due to system upgrades and tests.
Transfer speed has improved. The parallel resumed at CDATE=2021103118. The newly designed archive jobs and cleanup jobs are now in place, and the METplus jobs are now running offline. Transfer speed became slow again on 7/20 in the afternoon into the evening; the parallel halted at 2021110206 waiting for archive jobs to finish transferring. The pending transfer jobs have all completed. The parallel will resume when Cactus is returned to developers.
Cactus will not be available on 7/21 (Thu), 7/22 (Fri), and 7/25 (Mon). There will be a production switch on 7/28 (Thu). These events will impact the parallel.
On the evening of July 21st, two jobs ran into a system issue and failed: the 2021110300 eupd and gfs fcst.
July 22nd: @emilyhcliu indicated a need to start a second parallel at CDATE=2022062000. Preparation is pending NCO approval to run a parallel on the production machine. Tag: @arunchawla-NOAA @dtkleist @emilyhcliu @aerorahul @junwang-noaa
On the evening of July 22nd, the transfer speed remained slow.
From the evening of July 23rd into July 24th, FIT2OBS failed on every cycle starting at CDATE=2021110518. The code manager was contacted for assistance. Some FIT2OBS job logs showed an HDF error reading the gfs forecast atmf netCDF output files; others showed segmentation faults. The fit2obs output for CDATE 2021110518 ~ 2021111000 is missing due to the job failures noted above. To test the fit2obs failure, the fit2obs package was switched to /lfs/h2/emc/global/noscrub/lin.gan/git/Fit2Obs/newm.1.5 starting at CDATE=2021110906. On July 25th, @jack-woollen suggested rerunning the job with increased memory or exclusive node use. I modified the ecflow suite to make this change; however, the job failure issue remains. The parallel resumed at CDATE=2021110906.
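As an illustration of the suggested rerun settings (exclusive node use and more memory), the PBS resource lines in the ecflow job header could be adjusted along these lines. The select counts and memory value are placeholders, not the directives actually used in the suite.

```bash
# Placeholder PBS directives for the fit2obs rerun experiment described above.
#PBS -l select=1:mpiprocs=24:ncpus=24:mem=200GB
#PBS -l place=vscatter:exclhost
```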
@emilyhcliu @MichaelLueken-NOAA |
@lgannoaa
I'm not sure what this error message means - the stdout files are empty and this is the only error message. Is this an issue with allocating resources through ecflow?
@MichaelLueken-NOAA @emilyhcliu @junwang-noaa The efcs group 36 on 2021110918 and the gfs forecast on CDATE=2021111100 failed with the same issue.
@lgannoaa
This is a system issue. It seems to occur whenever GDIT does some work on the system, like running production. It is happening with my jobs too, and reruns succeed. It appears that there are some bad nodes that cause the issue.
Moorthi
…On Tue, Jul 26, 2022 at 9:48 AM MichaelLueken-NOAA wrote:
@MichaelLueken-NOAA Please let me know what is the certified resource requirement for the failed job above? The gfs analysis on CDATE=2021111100 failed with the same issue. Both of these jobs claim the same resources:
#PBS -l select=55:mpiprocs=15:ompthreads=8:ncpus=120
#PBS -l place=vscatter:exclhost
mpiexec -l -n 825 -ppn 15 --cpu-bind depth --depth 8 ...gsi.x
gfs_atmos_analysis_00.o9078921 DATA: /lfs/h2/emc/stmp/lin.gan/RUNDIRS/da-dev16-ecf/2021111100/gfs_atmos_analysis_00.9078921.cbqs01
This job failed with: launch RPC: Couldn't allocate a port for PMI to use
gfs_atmos_analysis_00.o9078937 is a rerun of the failure above.
@lgannoaa I have no idea what the certified resource requirement for the failed job is. I don't run the global-workflow j-jobs and scripts. All I know is that the jobs failed with the "launch RPC: Couldn't allocate a port for PMI to use" error message. That is all I can assist with.
@lgannoaa If you haven't already, I suggest opening a WCOSS2 helpdesk ticket about this error. It looks like a machine problem, particularly because reruns are successful. GDIT can take a look at the nodes that the failed jobs ran on.
@SMoorthi-emc Thank you for providing a data point. @lgannoaa Please open an issue with the WCOSS2 helpdesk and cc Steven.Earle.
I had put in tickets for this kind of thing before and it always turned out to be some node issue.
Moorthi
Transfer speed slowed down again in the evening of July 26th. PTMP is at 90%. The parallel is paused at CDATE=2021111300 waiting for the transfer jobs to catch up.
The real-time parallel is in preparation. Configuration: the DATA directory limit is 12 TB, which requires a self-cleanup process.
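A minimal sketch of a self-cleanup check against the 12 TB DATA limit, assuming a DATAROOT under the run directories noted earlier and a one-day retention window (both assumptions; the actual mechanism is not recorded in this issue):

```bash
# Illustrative self-cleanup for the real-time parallel's DATA quota.
DATAROOT=/lfs/h2/emc/stmp/lin.gan/RUNDIRS/da-dev16-ecf   # assumed location
LIMIT_TB=12

used_tb=$(du -sBT "${DATAROOT}" | awk '{print $1}' | tr -d 'T')
if (( used_tb >= LIMIT_TB )); then
  # Remove run directories older than one day so new cycles can start.
  find "${DATAROOT}" -mindepth 1 -maxdepth 1 -type d -mtime +1 -exec rm -rf {} +
fi
```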
@lgannoaa For the real-time parallel, are you planning to source dump/obs files from production com or the global dump archive? |
Production switch on July 28th. This retro parallel is on halt at CDATE=2021111506, and all archive jobs have completed. NCO indicated the system auto-scrub will remain on for PTMP on all systems, so the COM /lfs/h2/emc/ptmp/lin.gan/da-dev16-ecf will be touched. The first touch will be executed at the end of July 28th.
NCO announced on Aug 1st that Dogwood will have two days of outage, on Aug 2nd and Aug 4th.
Status Update from DA - issues, diagnostics, solution, and moving forward
Issues, diagnostics, bug fixes, and tests
(1) The issue and diagnostics are documented in NOAA-EMC/GSI#438. A short gfs.v16.3.0 parallel test (v163t) was performed to verify the bug fix.
(2) Increasing NSST biases and RMS of O-F (no bias) are observed in the time series of AVHRR MetOp-B channel 3 and the window channels from the hyperspectral sensors (IASI, CrIS). The foundation temperature bias and RMS compared to the operational GFS and OSTIA increase with time. It was found that the NSST increment file from the GSI was not being passed into the global cycle properly. The issue and diagnostics are documented in detail in NOAA-EMC/GSI#449; the bug fix is documented in NOAA-EMC/GSI#448.
Test
We will keep this running for a few days. Here is the link to the verification page: https://www.emc.ncep.noaa.gov/gc_wmb/eliu/v163ctl/
We should stop the retrospective parallel on Cactus and re-run it with the bug fixes.
Closing this issue. Replaced with: |
Description
This issue is to document the v16.3 DA pre-implementation parallel.
Initial tasking email sent by @arunchawla-NOAA indicated:
Emily and Andrew summarized on May 9th, 2022:
First full cycle starting CDATE is retro 2021101600
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: da-dev16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/da-dev16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/da-dev16-ecf/para/com/output/prod/today
on-line archive: /lfs/h2/emc/global/noscrub/lin.gan/archive/da-dev16-ecf
METPlus stat files: /lfs/h2/emc/global/noscrub/lin.gan/archive/metplus_data
FIT2OBS: /lfs/h2/emc/global/noscrub/lin.gan/archive/da-dev16-ecf/fits
Verification Web site: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/da-dev16-ecf
(Updated daily at 14:00 UTC on PDY-1)
HPSS archive: /NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/da-dev16-ecf
FIT2OBS:
/lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5
df1827cb (HEAD, tag: newm.1.5, origin/newmaster, origin/HEAD)
obsproc:
/lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2
83992615 (HEAD, tag: OT.obsproc.v1.0.2_20220628, origin/develop, origin/HEAD)
prepobs:
/lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1
5d0b36fba (HEAD, tag: OT.prepobs.v1.0.1_20220628, origin/develop, origin/HEAD)
HOMEMET:
/apps/ops/para/libs/intel/19.1.3.304/met/9.1.3
METplus:
/apps/ops/para/libs/intel/19.1.3.304/metplus/3.1.1
verif_global:
/lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd
1aabae3aa (HEAD, tag: verif_global_v2.9.4)
Requirements
A meeting has been set up to discuss the action summary for package preparation.
Acceptance Criteria (Definition of Done)
Dependencies