
Creation of Static and Forcing Fields #34

Closed · sadamov opened this issue May 17, 2024 · 10 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)

sadamov commented May 17, 2024

Summary

For international collaboration, a selection of state variables as well as static and forcing features was agreed upon (see this Sheet for reference). While the state variables are provided by the user in a zarr archive (usually NWP-model output), the additional forcing and static features are often not available. I suggest providing the user with the option to generate these fields using a new script, create_forcings.py. The script works for state and boundary domains alike; simple visualizations are provided to make this issue easier to understand. Please feel free to improve on any part of the script and help write PRs 🚀

New fields

  • datetime forcings,
  • land-sea masks,
  • boundary masks,
  • top-of-atmosphere (TOA) radiation forcings.

Questions / Issues:

  • Does such a script belong in the neural-lam repo or in a pre-processing step?
  • Should it be one script or multiple?
  • I have script versions for COSMO and MEPS data; these should become domain-agnostic.
  • See below for questions specific to each field.
  • Currently the fields are stored in one .zarr archive. It probably makes sense to let the user specify the zarr archive that each field should be appended to.

Function Details and Rationale

calculate_datetime_forcing(ds, args)

This function calculates the datetime forcing for the neural LAM model by generating sinusoidal representations of the hour of the day and the time of the year. These features help the model capture diurnal and seasonal variations. Previously these were computed inside the PyTorch Dataset, in __getitem__; it makes more sense to calculate them in advance, which should be more efficient and allows for cleaner Dataset code.
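As a sketch of what this pre-computation could look like: a minimal numpy/pandas version with a simplified signature (the actual function takes `(ds, args)` and reads the times from the zarr dataset, so the names below are illustrative only):

```python
import numpy as np
import pandas as pd

def calculate_datetime_forcing(times):
    """Sin/cos encodings of hour-of-day and day-of-year for a DatetimeIndex."""
    hour = times.hour + times.minute / 60  # fractional hour of day
    day = times.dayofyear                  # 1..366
    hour_angle = 2 * np.pi * hour / 24
    year_angle = 2 * np.pi * day / 365.25
    # four forcing channels per time step, all in [-1, 1]
    return np.stack(
        [np.sin(hour_angle), np.cos(hour_angle),
         np.sin(year_angle), np.cos(year_angle)],
        axis=-1,
    )

times = pd.date_range("2024-01-01", periods=8, freq="3h")
forcing = calculate_datetime_forcing(times)  # shape (8, 4)
```

Computed once over the full time axis, this array can then simply be appended to the zarr archive instead of being rebuilt in every `__getitem__` call.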


generate_land_sea_mask(xy_state, tempdir, projection, high_res_factor=10)

This function generates a land-sea mask for the neural LAM model. The shapefile for the land-sea mask comes from Natural Earth at 50m resolution. The percentage of each grid cell covered by land is calculated by rasterizing the high-resolution shapefile. This generates a static field; note that for some regions this mask actually varies in time as parts of the sea freeze.
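The rasterize-then-aggregate step can be illustrated with a synthetic high-resolution mask (in the real script the boolean mask would come from rasterizing the Natural Earth shapefile with rasterio at `high_res_factor` times the grid resolution; `land_fraction` is a hypothetical helper):

```python
import numpy as np

def land_fraction(high_res_mask, factor):
    """Aggregate a boolean high-res land mask of shape (H*factor, W*factor)
    to a per-cell land fraction of shape (H, W)."""
    h, w = high_res_mask.shape
    assert h % factor == 0 and w % factor == 0
    # group the fine pixels into factor x factor blocks and average each block
    blocks = high_res_mask.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# synthetic 40x40 high-res mask: land in the left half of the domain
hi_res = np.zeros((40, 40), dtype=bool)
hi_res[:, :20] = True
frac = land_fraction(hi_res, factor=10)  # 4x4 coarse grid of land fractions
```

The fractional values (rather than a hard 0/1 mask) preserve coastline information at the model resolution.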

create_boundary_mask(lonlat_state, lonlat_boundary, boundary_thickness, overlap=False)

This function creates a boundary mask for the neural LAM model. The boundary mask defines the edges of the model's domain and helps in handling boundary conditions, which are essential for limited area modeling. It allows for overlapping ("flatbread") or non-overlapping ("donut") boundaries. The borders of the interior state domain are identified using anemoi-datasets functionality for lat-lon matching of boundary and state. The thickness of the boundary is applied using binary_dilation, which is a poor approach; it should instead be calculated in meters in the projection of the state domain.
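A sketch of the meters-based alternative: with the boundary-point coordinates projected into the state CRS, the thickness can be expressed as a distance band around the state domain. All names below are hypothetical, and the state outline is simplified to an axis-aligned bounding box (a real domain outline would need a proper point-to-polygon distance):

```python
import numpy as np

def boundary_mask(xy_boundary, xy_state_extent, min_dist, max_dist):
    """Mark boundary points whose distance (in projection units, e.g. meters)
    from the state-domain bounding box lies within [min_dist, max_dist]."""
    x, y = xy_boundary[:, 0], xy_boundary[:, 1]
    xmin, xmax, ymin, ymax = xy_state_extent
    # distance from each point to the axis-aligned state rectangle (0 inside it)
    dx = np.maximum.reduce([xmin - x, np.zeros_like(x), x - xmax])
    dy = np.maximum.reduce([ymin - y, np.zeros_like(y), y - ymax])
    dist = np.hypot(dx, dy)
    return (dist >= min_dist) & (dist <= max_dist)

# toy example: 1 km x 1 km state square, boundary band from 1 m to 50 km out
pts = np.array([[0.5e3, 0.5e3], [-20e3, 0.0], [-80e3, 0.0]])
mask = boundary_mask(pts, (0.0, 1e3, 0.0, 1e3), 1.0, 50e3)
```

With `min_dist=0` the interior points are included too, giving the overlapping "flatbread" variant; a positive `min_dist` yields the "donut".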


generate_toa_radiation_forcing(ds, lonlat)

This function pre-computes the top-of-atmosphere (TOA) radiation forcing on the grid nodes. The functionality is provided by the GraphCast repository. It is, however, quite slow for large domains and again depends on lat-lon coordinates. Maybe there are better options.


Usage

The example script is based on MEPS (since we all have that data) and ERA5, read directly from the Google WeatherBench cloud storage. To run it, store the script at the top level of the neural-lam repo, with meps_example in the data folder:

pip install gcsfs 
pip install rasterio
pip install git+https://github.com/google-deepmind/graphcast.git
python create_forcings.py
sadamov added the labels enhancement, help wanted and question on May 17, 2024

sadamov commented May 17, 2024

@leifdenby As we discussed in the meeting: the boundary mask certainly has high importance, as it is not a simple static feature but is potentially used in loss calculation and graph design. As you can see from the plots, I only got as far as creating a very basic working example for MEPS+ERA5 as a basis for discussion and future PRs. It would be great to get your boundary-masking expertise here 🤝

joeloskarsson commented

  • Is a land-sea mask or a land-water mask (including lakes) more relevant? I have seen that for the MEPS area the large Swedish lakes impact some surface variables, so it would perhaps be nice to be able to inform the model about lakes as well. Although we could definitely provide a utility for the land-sea mask, and if one wants something more specific (like what I describe) one would have to get that from somewhere else and put it into the zarr separately.
  • I am generally for including these types of utility scripts in neural-lam, as they are highly useful (even crucial) for working with the codebase. I think it would be good to group all such scripts in some way though, perhaps as a utility sub-module or similar.


TomasLandelius commented May 22, 2024

For many static (climatological) surface parameters I guess ECOCLIMAP-II would work: https://opensource.umr-cnrm.fr/projects/ecoclimap/wiki (1 km resolution).
When it comes to more dynamic things like sea and lake ice, I guess one would get that from satellite-based products. Leaving this for later for now.


TomasLandelius commented May 23, 2024

I have something to do this:

create_boundary_mask(lonlat_state, lonlat_boundary, crs_state, boundary_min_distance, boundary_max_distance, overlap=False)

producing a result like this for the MEPS domain with ERA5 (latlon projection) on the boundary.


TomasLandelius commented May 23, 2024

Regarding TOA irradiance. What about something like this:

import numpy as np

def toa_solirr(lat, lon, day, hr_utc):
    # returns the cosine of the solar zenith angle (not yet scaled by the solar constant)
    dec = -23.45 * np.cos(np.pi/180*360/365*(day-81)) * np.pi/180
    hr_lst = hr_utc + lon/15
    hr_angle = 15 * (hr_lst - 12)
    return np.sin(lat*np.pi/180) * np.sin(dec) + np.cos(lat*np.pi/180) * np.cos(dec) * np.cos(hr_angle*np.pi/180)

Is that also too slow?

%timeit toa_solirr(lat_era, lon_era, 0, 0)
76.7 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

joeloskarsson commented

> I have something to do this:
>
> create_boundary_mask(lonlat_state, lonlat_boundary, crs_state, boundary_min_distance, boundary_max_distance, overlap=False)
>
> producing a result like this for the MEPS domain with ERA5 (latlon projection) on the boundary.

This looks great, I think! How general is this (e.g. how hard would it be to use for another domain or another projection)? Can you share the full code for this? And what is the unit for the distances now, meters?

> Regarding TOA irradiance. What about something like this:
>
> def toa_solirr(lat, lon, day, hr_utc):
>     dec = -23.45 * np.cos(np.pi/180*360/365*(day-81)) * np.pi/180
>     hr_lst = hr_utc + lon/15
>     hr_angle = 15 * (hr_lst - 12)
>     return np.sin(lat*np.pi/180) * np.sin(dec) + np.cos(lat*np.pi/180) * np.cos(dec) * np.cos(hr_angle*np.pi/180)
>
> Is that also too slow?
>
> %timeit toa_solirr(lat_era, lon_era, 0, 0)
> 76.7 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Since it is entirely parallelizable that should be quite fast. I tested a parallel version with broadcasting:

import numpy as np
import xarray as xr
import time

zarr_path = "data/era5_latlon/ll.zarr"

ll_xds = xr.open_zarr(zarr_path)

lon_1d = ll_xds.longitude.to_numpy() # (num_lon,)
lat_1d = ll_xds.latitude.to_numpy() # (num_lat,)

# dim order is (lat, lon, day, hr)
def toa_solirr(lat, lon, day, hr_utc):
    lat = lat[:, np.newaxis, np.newaxis, np.newaxis]
    lon = lon[np.newaxis, :, np.newaxis, np.newaxis]
    day = day[np.newaxis, np.newaxis, :, np.newaxis]
    hr_utc = hr_utc[np.newaxis, np.newaxis, np.newaxis, :]

    dec = -23.45 * np.cos(np.pi / 180 * 360 / 365 * (day - 81)) * np.pi / 180
    hr_lst = hr_utc + lon / 15
    hr_angle = 15 * (hr_lst - 12)
    return np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos(
        lat * np.pi / 180
    ) * np.cos(dec) * np.cos(hr_angle * np.pi / 180)

day = np.arange(7)
hour = np.arange(24)

start_time = time.time()
toa = toa_solirr(lat_1d, lon_1d, day, hour)
end_time = time.time()
print(f"Took: {end_time-start_time} s")

(quick test code, but it does the trick)

That runs in around 1 second for a week, so around a minute per year (on my laptop, with lat-lons from ERA5 at 0.25 deg), which seems acceptable.

I have not looked into the math of this, but what is the difference between this and the GraphCast TOA implementation (https://github.com/google-deepmind/graphcast/blob/main/graphcast/solar_radiation.py)? A benefit of that one is that it uses xarray; I quickly run out of memory doing this in numpy.

I don't think it's a big problem if this takes a bit of time, since you should not have to re-compute it very often. Another thought (given how parallelizable this is) is to try to do it on a GPU. One could almost just replace np in Tomas' example with torch, but then there would be a lot of data to shuffle back to the CPU and save to disk.
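The out-of-memory issue mentioned above could also be sidestepped without xarray by evaluating the broadcast formula a few days at a time and concatenating; a minimal numpy sketch reusing the formula from the comments above (`toa_solirr_chunked` and `chunk_days` are hypothetical names, not part of the thread's scripts):

```python
import numpy as np

def toa_solirr(lat, lon, day, hr_utc):
    # broadcasting version of the formula above, dim order (lat, lon, day, hr)
    lat = np.asarray(lat)[:, None, None, None]
    lon = np.asarray(lon)[None, :, None, None]
    day = np.atleast_1d(day)[None, None, :, None]
    hr = np.atleast_1d(hr_utc)[None, None, None, :]
    dec = -23.45 * np.cos(np.pi / 180 * 360 / 365 * (day - 81)) * np.pi / 180
    hr_angle = 15 * (hr + lon / 15 - 12)
    return (np.sin(lat * np.pi / 180) * np.sin(dec)
            + np.cos(lat * np.pi / 180) * np.cos(dec)
            * np.cos(hr_angle * np.pi / 180))

def toa_solirr_chunked(lat, lon, days, hours, chunk_days=7):
    # evaluate a few days at a time so peak memory stays small,
    # then stitch the chunks back together along the day axis
    parts = [toa_solirr(lat, lon, days[i:i + chunk_days], hours)
             for i in range(0, len(days), chunk_days)]
    return np.concatenate(parts, axis=2)

lat = np.linspace(-80, 80, 5)
lon = np.linspace(0, 350, 6)
days, hours = np.arange(1, 31), np.arange(24)
full = toa_solirr(lat, lon, days, hours)
chunked = toa_solirr_chunked(lat, lon, days, hours)
```

Each chunk can be written to the zarr archive as it is produced, so the full array never has to be held in memory at once.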

TomasLandelius commented

The difference from the GraphCast implementation is that they focus on having something very similar to what goes into ERA5, while mine opts for being as simple as possible. Maybe too simple? It does, for example, not account for the Sun-Earth distance, which causes about 7% variation during the year. Still, I think other things contribute more to the error than imperfections in this forcing. However, I have not checked the output from my version against more precise ones; that should be done.


TomasLandelius commented Jun 5, 2024

Here is an update with a correction to the declination. It now outputs irradiance [W m**-2] so that it can be checked more easily. Also, input is now given by DataArray coordinates.

import numpy as np

def toa_solirr(dt, lon, lat):
    # same dim order as in xda: (time, longitude, latitude)
    lon = lon.values[np.newaxis, :, np.newaxis]
    lat = lat.values[np.newaxis, np.newaxis, :]
    day = dt.dayofyear.values[:, np.newaxis, np.newaxis]
    hr_utc = dt.hour.values[:, np.newaxis, np.newaxis]
    dec = np.pi / 180 * 23.45 * np.sin(2 * np.pi * (284 + day) / 365)
    hr_lst = hr_utc + lon / 15
    hr_angle = 15 * (hr_lst - 12)
    cos_sza = np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos(
        lat * np.pi / 180
    ) * np.cos(dec) * np.cos(hr_angle * np.pi / 180)
    # clip at zero (night side) and scale by the solar constant
    return np.fmax(0, 1366 * cos_sza)

toa = toa_solirr(xda['time'].dt, xda['longitude'], xda['latitude'])

I tried to get it to work with xr.apply_ufunc like this, but to no avail. Maybe someone familiar with Dask can make it work?

toa = xr.apply_ufunc(
    toa_solirr,
    xda['time'].dt, 
    xda['longitude'],
    xda['latitude'],
    input_core_dims=[['time'], ['latitude'], ['longitude']],
    output_core_dims=[[]],
    dask='parallelized'
)


sadamov commented Jun 6, 2024

The datetime forcings are now implemented in #54 without the need for a seconds_in_year constant since that information is now directly retrieved from the xarray Dataset.

sadamov pushed a commit that referenced this issue Jul 18, 2024
I added alternative functions for the calculation of TOA irradiance and border masks, using similar inputs and outputs as the existing ones. Also removed the need for pyproj by reformulating the projections using cartopy.
sadamov linked a pull request Jul 19, 2024 that would close this issue
sadamov removed a link to a pull request Jul 19, 2024
sadamov self-assigned this Sep 6, 2024
joeloskarsson commented

Closing this as superseded by mllam/mllam-data-prep#29
