Allow grouping input cubes by date (instead of filename) for fix_metadata #2551

Draft: wants to merge 5 commits into main

schlunma (Contributor)

Description

This PR allows grouping the input cubes for our fix_metadata functions by date (i.e., all files with the same date range are passed to the fix simultaneously). This can be enabled by setting the class variable GROUP_CUBES_BY_DATE = True in the corresponding fix class. This makes it possible to implement fixes that need variables from multiple input files (for example, to derive rsut for ERA5).

This solution only works for projects where the input files are located in the same directory, and the input file pattern is flexible enough to find all files. This is fine for the native ERA5 data in netCDF format (that we need to manually download and put into the corresponding directories). For other projects where files are stored in different directories, further changes are necessary (potentially in local.py). However, this PR is a prerequisite to make these other cases work.

By default, input cubes are grouped by filename for fix_metadata (i.e., each fix_metadata call operates only on a single file):

by_file = defaultdict(list)
for cube in cubes:
    by_file[cube.attributes.get("source_file", "")].append(cube)
for cube_list in by_file.values():
    cube_list = CubeList(cube_list)
    for fix in fixes:
        cube_list = fix.fix_metadata(cube_list)

Note that this is fully backwards-compatible since the new functionality needs to be explicitly enabled.
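
To illustrate the opt-in behaviour described here, the following is a minimal sketch and not the code from this PR. It uses plain dictionaries in place of iris cubes, and the `date_range` attribute, the `group_cubes` helper, and the stand-in `Fix` and `Rsut` classes are all hypothetical names chosen for the example. Only `GROUP_CUBES_BY_DATE` and `source_file` are taken from the description.

```python
from collections import defaultdict


class Fix:
    """Hypothetical stand-in for a fix base class."""

    # Opt-in switch named in the PR description; False keeps per-file grouping.
    GROUP_CUBES_BY_DATE = False


class Rsut(Fix):
    """Hypothetical fix that needs every variable of one date range at once."""

    GROUP_CUBES_BY_DATE = True


def group_cubes(cubes, fix):
    """Group by date range if the fix opts in, otherwise by source file."""
    # "date_range" is an assumed attribute, used only for this illustration.
    key = "date_range" if fix.GROUP_CUBES_BY_DATE else "source_file"
    groups = defaultdict(list)
    for cube in cubes:
        groups[cube["attributes"].get(key, "")].append(cube)
    return list(groups.values())


def make_cube(var, year):
    """Build a dictionary that mimics the attributes of a loaded cube."""
    attributes = {"source_file": f"{var}_{year}.nc", "date_range": str(year)}
    return {"var": var, "attributes": attributes}


cubes = [
    make_cube("net_sw", 2000),
    make_cube("incident_sw", 2000),
    make_cube("net_sw", 2001),
    make_cube("incident_sw", 2001),
]

per_file = group_cubes(cubes, Fix())   # one group per input file
per_date = group_cubes(cubes, Rsut())  # one group per date range
```

With the default, each of the four files forms its own group. With the switch enabled, the two variables of each year end up in the same group, which is what a derivation across input files would need.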

Closes #1806

Link to documentation: TBA



schlunma added the "fix for dataset" label (Related to dataset-specific fix files) on Oct 11, 2024
schlunma added this to the v2.12.0 milestone on Oct 11, 2024
schlunma self-assigned this on Oct 11, 2024
Contributor Author

schlunma commented Oct 11, 2024

Here's a small recipe to test this for rsut (I also included rsdt to ensure that this still works fine):

# ESMValTool
---
documentation:
  title: test
  description: test
  authors:
    - schlund_manuel

datasets:
  - {project: native6, dataset: ERA5, type: reanaly, version: v1, tier: 3, timerange: 2000/2001}

diagnostics:

  test:
    variables:
      rsut:
        mip: Amon
      rsdt:
        mip: Amon
    scripts:
      null

Input files need to be arranged like this:

.
└── Tier3
    └── ERA5
        └── v1
            └── mon
                ├── rsdt
                │   ├── era5_toa_incident_solar_radiation_2000_monthly.nc
                │   └── era5_toa_incident_solar_radiation_2001_monthly.nc
                └── rsut
                    ├── era5_mean_top_net_short_wave_radiation_flux_2000_monthly.nc
                    ├── era5_mean_top_net_short_wave_radiation_flux_2001_monthly.nc
                    ├── era5_toa_incident_solar_radiation_2000_monthly.nc
                    └── era5_toa_incident_solar_radiation_2001_monthly.nc
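
As a rough illustration of why both variables must sit in the rsut directory, the sketch below groups the four file names listed there by the year embedded in each name. This is not how the PR determines dates; the regular expression and the `group_by_year` helper are assumptions made for the example.

```python
import re
from collections import defaultdict

# File names copied from the rsut directory in the tree above.
FILES = [
    "era5_mean_top_net_short_wave_radiation_flux_2000_monthly.nc",
    "era5_mean_top_net_short_wave_radiation_flux_2001_monthly.nc",
    "era5_toa_incident_solar_radiation_2000_monthly.nc",
    "era5_toa_incident_solar_radiation_2001_monthly.nc",
]

# Assumed pattern: a four-digit year enclosed by underscores.
YEAR_PATTERN = re.compile(r"_(\d{4})_")


def group_by_year(filenames):
    """Map each year found in the file names to the files covering it."""
    groups = defaultdict(list)
    for name in filenames:
        match = YEAR_PATTERN.search(name)
        year = match.group(1) if match else ""
        groups[year].append(name)
    return dict(groups)


groups = group_by_year(FILES)
for year, names in sorted(groups.items()):
    print(year, len(names))
```

Each year maps to two files, one per ERA5 variable, so a fix that receives one group at a time sees both inputs for that period.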

@bouweandela do you think this approach is a reasonable solution to this problem? As mentioned in the description, it doesn't solve the problem for all cases, but a different grouping will be necessary for all of them. And it is fully sufficient for the ERA5 netCDF case.

codecov bot commented Oct 11, 2024

Codecov Report

Attention: Patch coverage is 57.14286% with 12 lines in your changes missing coverage. Please review.

Project coverage is 94.83%. Comparing base (da3440e) to head (06ff0ac).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| esmvalcore/cmor/_fixes/native6/era5.py | 30.00% | 7 Missing ⚠️ |
| esmvalcore/cmor/fix.py | 68.75% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2551      +/-   ##
==========================================
- Coverage   94.91%   94.83%   -0.08%     
==========================================
  Files         251      251              
  Lines       14261    14282      +21     
==========================================
+ Hits        13536    13545       +9     
- Misses        725      737      +12     


Development

Successfully merging this pull request may close these issues.

Variable derivation for ERA5 on-the-fly CMORizer
1 participant