[Bug]: e2c command for SImon process leads to xarray fault #255
Comments
All MPAS processing is generating these errors (Omon and SImon). A summary: …
@TonyB9000 the Minimal Complete Verifiable Example (MVCE) no longer works: when I check, I only find …
Alright, I can reproduce this with 12 months of data with v1.11.2: …
I suspect that the "Resource temporarily unavailable" error is from the use of `lock=False` (0f91fb4). When testing v1.11.1, things work okay. With …
@tomvothecoder I don't know if this is specific to acme1's file system, but should we consider reverting to 1.11.1 for the upcoming e3sm_unified deployment? I think let's decide after a more thorough investigation.
I think you might be right about `lock=False`.
I agree that more investigation is needed before deciding what to do next. v1.11.2 includes another fix for … My understanding is that Tony is still using acme1 and …
For future reference, you can use a Python script to invoke e3sm_to_cmip: …
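The snippet itself was collapsed in this page capture. As a stand-in, here is one minimal way to drive the `e3sm_to_cmip` CLI from a Python script via `subprocess`; all paths are placeholders, and the long flag names should be verified against `e3sm_to_cmip --help` for the installed version:

```python
import subprocess

# Placeholder paths -- substitute real input/output/tables locations.
cmd = [
    "e3sm_to_cmip",
    "--var-list", "so",                      # variable(s) to cmorize
    "--input-path", "/path/to/native_data",  # dir of native files + symlinks
    "--output-path", "/path/to/output",
    "--tables-path", "/path/to/cmip6-cmor-tables/Tables",
    "--realm", "Omon",
]

# Capture stdout/stderr so failures can be inspected programmatically.
result = subprocess.run(cmd, capture_output=True, text=True, check=False)
print(result.stdout)
print(result.stderr)
```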
Thanks for your insight. Based on Tony's recent operation, I think the zppy application, which only includes processing atmosphere and land variables, is not impacted. So perhaps we can spend time resolving this ocean- and sea-ice-related error on the side.
Sounds good. I will convert Tony's command into a Python script and try to do some debugging today.
@tomvothecoder @chengzhuzhang Just a side note - more than half of the v2_1 MPAS jobs (run with datasm back in February) completed successfully. So, I was going to examine the generation dates to see if the first appearance of failures corresponds to any changes I (or others) may have made to e2c. I am interested in the python-driver you (Tom) mentioned, and will investigate.
@chengzhuzhang Jill wrote "When create this example, you may want to isolate the test data so others can test." Sorry about that. Yes, I neglected to isolate that data for retesting.
@tomvothecoder @chengzhuzhang Just for the record, we are running into the same errors on Chrysalis - so this is not limited to acme1. KEY FACT: Circa February 10, both Omon and SImon jobs (v2_1 data) were completing successfully (via datasm postprocess). Soon thereafter, sporadic failures became the norm. I have maintained all of the runlogs for the 296 (v2_1) MPAS jobs, of which 26 are consistently failing. I will seek to characterize the failure dates and types, to see if I can isolate a date upon which to focus.
I would suggest reverting the change that sets `lock=False`.
Turns out, …
With "lock" parameter elided, the v2_1 1pctCO2 Omon.sosga completes successfully, as it had in February. But Omon.so fails. The stack-trace:
|
I ran the same tests on Chrysalis (v2_1 data, 1pctCO2, Omon.sosga and Omon.so), and both failed - but for different reasons.
The test with "so" failed as well, but appears identical to the acme1 failure ("failed to prevent overwriting existing key _FillValue in attrs on variable 'so'").
As I wrote in Slack, I will ensure that the installed CMOR tables are up-to-date on both acme1 and Chrysalis, then rerun both of the v2_1 Omon.sosga and Omon.so tests on each. I will also make sure that the tables-path is reiterated in the main log list of parameters applied.
Hi @TonyB9000, for further troubleshooting, do you have intermediate files saved (i.e., before cmorization) for the variables that are having this _FillValue issue, for example for so from v2_1.LR.1pctCO2_0101.mpaso? Do you have an e3sm_to_cmip command line for reproducing this issue?
Also, please submit a PR that elides `lock=False`; I believe this fixes the concurrency issue, but leaves the _FillValue issue.
Did the PR, and the merge (tests successful - whatever they are). I'll run a test of the failed _FillValue issue - but e2c "hides" the interim regridded stuff in volatile "temp" directories. I'll see if they can be retained.
@chengzhuzhang I just ran: …
which creates a script that issues: …
The content of the "sublog" (…): …
The most recent elements of my "TMPDIR" (…): …
and …
@chengzhuzhang Presently, the only variables that are exhibiting these errors are a subset of Omon (so, sos, thetao, thkcello, tos, volcello), and ONLY for 1pctCO2 and abrupt-4xCO2. The "job directory" created for this job (all 1pctCO2 vars), when listed chronologically, shows that only the last 3 were touched (see …).
Thank you @TonyB9000! Could you open permission for /p/user_pub/e3sm/bartoletti1/tmp?
Everything in /p/user_pub/e3sm/bartoletti1/tmp should be accessible now.
I'm trying to visualize native MPAS output through uxarray, but it doesn't seem to work out of the box with their docs. @xylar We ran into the _FillValue issue with only two experiments. I can't help thinking it may be relevant to the recent coupled group endeavor about the model crashing in a warmer climate. Do you have a script that I can use to do a quick view of MPAS data?
@xylar Never mind. I got the script working with uxarray. Checking native data now...
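The working script wasn't posted here; for reference, a minimal quick-view sketch of this kind, assuming uxarray's `open_dataset(grid_file, data_file)` signature, with placeholder paths:

```python
import uxarray as ux

# Placeholder paths: an MPAS mesh/grid file and one monthly history file.
grid_path = "/path/to/mpaso_mesh.nc"
data_path = "/path/to/mpaso.hist.am.timeSeriesStatsMonthly.0001-01-01.nc"

# uxarray pairs a grid definition with the data file(s) defined on it.
uxds = ux.open_dataset(grid_path, data_path)

# Quick check: which native variables carry a _FillValue attribute?
for name, var in uxds.data_vars.items():
    if "_FillValue" in var.attrs:
        print(name, var.attrs["_FillValue"])
```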
@TonyB9000 Definitely, we should consider breaking large 3D variables into year chunks... I see a 316 GB file with 6000 time steps.
@chengzhuzhang The last run (depicted above with "2024-05-15 19:56:05,615 [INFO]: so.py(handle:48) >> Starting so") was only given 1 year of data to run with. (That is the ONLY way I know of to limit the years and file sizes. e3sm_to_cmip has no command-line option to limit the years. You must create external code that calls e2c sequentially on segments of data. That is what the CWL workflows were doing...)
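As an illustration of that pattern (not the actual CWL workflow code), here is a sketch that stages one year of native files at a time via symlinks and invokes e2c per segment; the filename date pattern, paths, and flags are all assumptions:

```python
import re
import subprocess
from pathlib import Path

NATIVE = Path("/path/to/native_data")  # placeholder: full native-data dir
WORK = Path("/path/to/segments")       # placeholder: per-year staging area

# Assume MPAS history filenames embed the date as YYYY-MM-DD.
year_re = re.compile(r"\.(\d{4})-\d{2}-\d{2}\.nc$")

# Group native files by simulated year.
by_year: dict[str, list[Path]] = {}
for f in sorted(NATIVE.glob("*.nc")):
    m = year_re.search(f.name)
    if m:
        by_year.setdefault(m.group(1), []).append(f)

for year, files in sorted(by_year.items()):
    seg = WORK / year
    seg.mkdir(parents=True, exist_ok=True)
    for f in files:
        link = seg / f.name
        if not link.exists():
            link.symlink_to(f)
    # For MPAS realms, the region file and namefile symlinks would also
    # need to be staged into each segment directory (see the issue body).
    # One e2c invocation per one-year segment (flags are placeholders).
    subprocess.run(
        ["e3sm_to_cmip", "--var-list", "so",
         "--input-path", str(seg),
         "--output-path", "/path/to/output",
         "--tables-path", "/path/to/Tables",
         "--realm", "Omon"],
        check=True,
    )
```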
@chengzhuzhang I am doing another run, hopefully with more log results.
@TonyB9000 I think I found the problem...
I think these are the native variables used to derive the Omon variables (so, sos, thetao, thkcello, tos, volcello). So the question is: any clue why we suddenly have _FillValues in these two experiments? I think originally we just used the cell mask; with the presence of _FillValue, how should we modify e2c?
@chengzhuzhang @xylar Long Long Ago, in a Galaxy Far Far Away . . . we addressed a similar issue by editing the native data with ncatted (in /p/user_pub/e3sm/bartoletti1/Operations/8_Corrections/history/FIX_FillValue/remove_FillValue_attribute.sh).
But I don't know if this past solution applies (I think that was a "_FillValue=NaNf" issue).
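The referenced script wasn't captured here, but the general NCO recipe for deleting a _FillValue attribute is of this form; a generic sketch driven from Python, with a placeholder file path, not the contents of remove_FillValue_attribute.sh:

```python
import subprocess

# ncatted attribute syntax: -a att_name,var_name,mode,att_type,att_value
# Mode "d" deletes the attribute; an empty var_name applies to all variables.
path = "/path/to/native_file.nc"  # placeholder
subprocess.run(["ncatted", "-a", "_FillValue,,d,,", path], check=True)
```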
@chengzhuzhang and @TonyB9000, we added fill values to as many variables as we could to increase CF compliance about a year and a half ago: … Anyway, I'm happy to look into fixing this. I suspect the fix is simply to drop the attribute if it is present, as the error message above suggests. I would do this in the handlers rather than with ncatted.
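A minimal sketch of what that handler-side fix could look like, assuming the handler holds an xarray Dataset for the native data (the function name is illustrative, not the actual e2c handler code):

```python
import xarray as xr

def strip_fill_value(ds: xr.Dataset) -> xr.Dataset:
    # Drop any pre-existing _FillValue so xarray's writer can manage it
    # via encoding instead of raising "failed to prevent overwriting
    # existing key _FillValue in attrs".
    for var in ds.variables:
        ds[var].attrs.pop("_FillValue", None)
    return ds
```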
@xylar Thank you for clarifying and the quick fix.
I have no idea. I would expect all v2.1 runs with ocean and sea-ice output to have this issue. The feature was merged before the v2.1 tag was created.
What happened?
The e3sm_to_cmip command leads to a cascade of xarray errors, beginning with the ds.compute() call in cmor_handlers/mpas_vars/siv.py.
What did you expect to happen? Are there any possible answers you came across?
n/a
Minimal Complete Verifiable Example (MVCE)
CLI command
Python script
Relevant log output
Anything else we need to know?
The input directory (/p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/AltProcess/tmp/v2_1.LR.historical_0251/native_data) contains symlinks to the actual native data, as well as to the region file EC30to60E2r2_mocBasinsAndTransects20210623.nc (/p/user_pub/e3sm/staging/resource/maps/EC30to60E2r2_mocBasinsAndTransects20210623.nc) and to the namefile mpassi_in (/p/user_pub/e3sm/warehouse/E3SM/2_1/historical/LR/sea-ice/native/namefile/fixed/ens4/v0/mpassi_in), as is required for MPAS cmorizing.
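For anyone reproducing this setup, a small sketch of staging such an input directory with symlinks; the region-file and namefile targets mirror the paths above, while the native-data source directory is a placeholder:

```python
from pathlib import Path

input_dir = Path("/p/user_pub/e3sm/bartoletti1/Operations/5_DatasetGeneration/"
                 "AltProcess/tmp/v2_1.LR.historical_0251/native_data")
input_dir.mkdir(parents=True, exist_ok=True)

# Targets named in this report; the native-data glob source is a placeholder.
targets = [
    Path("/p/user_pub/e3sm/staging/resource/maps/"
         "EC30to60E2r2_mocBasinsAndTransects20210623.nc"),  # region file
    Path("/p/user_pub/e3sm/warehouse/E3SM/2_1/historical/LR/sea-ice/native/"
         "namefile/fixed/ens4/v0/mpassi_in"),               # namefile
]
targets += sorted(Path("/path/to/actual/native_data").glob("*.nc"))

for target in targets:
    link = input_dir / target.name
    if not link.exists():
        link.symlink_to(target)
```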
Environment
populated config files : /home/bartoletti1/mambaforge/.condarc
conda version : 24.1.2
conda-build version : not installed
python version : 3.10.6.final.0
solver : libmamba (default)
virtual packages : __archspec=1=broadwell
__conda=24.1.2=0
__glibc=2.17=0
__linux=3.10.0=0
__unix=0=0
base environment : /home/bartoletti1/mambaforge (writable)
conda av data dir : /home/bartoletti1/mambaforge/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
package cache : /home/bartoletti1/mambaforge/pkgs
/home/bartoletti1/.conda/pkgs
envs directories : /home/bartoletti1/mambaforge/envs
/home/bartoletti1/.conda/envs
platform : linux-64
user-agent : conda/24.1.2 requests/2.31.0 CPython/3.10.6 Linux/3.10.0-1160.108.1.el7.x86_64 rhel/7.9 glibc/2.17 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.7
UID:GID : 61843:4061
netrc file : None
offline mode : False