
Creation of Static and Forcing Fields #34

Closed · sadamov opened this issue May 17, 2024 · 10 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)

sadamov commented May 17, 2024

Summary

For international collaboration, a selection of state variables as well as static and forcing features was agreed upon (see this Sheet for reference). While the state variables are provided by the user in a zarr archive (usually NWP-model output), the additional forcing and static features are often not available. I suggest providing the user with the option to generate these fields using a new script, create_forcings.py. The script works for state and boundary domains alike; simple visualizations are provided to make this issue easier to understand. Please feel free to improve on any part of the script and help write PRs 🚀

New fields

  • datetime forcings,
  • land-sea masks,
  • boundary masks,
  • top-of-atmosphere (TOA) radiation forcings.

Questions / Issues:

  • Does such a script belong in the neural-lam repo or in a pre-processing step?
  • Should it be one script or multiple?
  • I have script versions for COSMO and MEPS data; these should become domain-agnostic.
  • See below for questions specific to each field.
  • Currently the fields are stored in one .zarr archive. It probably makes sense to let the user specify the zarr archive that each field should be appended to.

Function Details and Rationale

calculate_datetime_forcing(ds, args)

This function calculates the datetime forcing for the neural LAM model by generating sinusoidal representations of the hour of the day and the time of the year. These features help the model capture diurnal and seasonal variations. Previously these were computed inside the PyTorch Dataset, in __getitem__; it makes more sense to calculate them in advance, which should be more efficient and allows for cleaner Dataset code.
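As a sketch of what this pre-computation could look like: a minimal numpy/pandas version with a simplified signature (the actual function takes `(ds, args)` and reads the times from the zarr dataset, so the names below are illustrative only):

```python
import numpy as np
import pandas as pd

def calculate_datetime_forcing(times):
    """Sin/cos encodings of hour-of-day and day-of-year for a DatetimeIndex."""
    hour = times.hour + times.minute / 60  # fractional hour of day
    day = times.dayofyear                  # 1..366
    hour_angle = 2 * np.pi * hour / 24
    year_angle = 2 * np.pi * day / 365.25
    # four forcing channels per time step, all in [-1, 1]
    return np.stack(
        [np.sin(hour_angle), np.cos(hour_angle),
         np.sin(year_angle), np.cos(year_angle)],
        axis=-1,
    )

times = pd.date_range("2024-01-01", periods=8, freq="3h")
forcing = calculate_datetime_forcing(times)  # shape (8, 4)
```

Computed once over the full time axis, this array can then simply be appended to the zarr archive instead of being rebuilt in every `__getitem__` call.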


generate_land_sea_mask(xy_state, tempdir, projection, high_res_factor=10)

This function generates a land-sea mask for the neural LAM model. The shapefile for the land-sea mask comes from Natural Earth at 50m resolution. The percentage of each grid cell covered by land is calculated by rasterizing the high-resolution shapefile. This generates a static field; note that for some regions this mask actually varies in time as parts of the sea freeze.
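The rasterize-then-aggregate step can be illustrated with a synthetic high-resolution mask (in the real script the boolean mask would come from rasterizing the Natural Earth shapefile with rasterio at `high_res_factor` times the grid resolution; `land_fraction` is a hypothetical helper):

```python
import numpy as np

def land_fraction(high_res_mask, factor):
    """Aggregate a boolean high-res land mask of shape (H*factor, W*factor)
    to a per-cell land fraction of shape (H, W)."""
    h, w = high_res_mask.shape
    assert h % factor == 0 and w % factor == 0
    # group the fine pixels into factor x factor blocks and average each block
    blocks = high_res_mask.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# synthetic 40x40 high-res mask: land in the left half of the domain
hi_res = np.zeros((40, 40), dtype=bool)
hi_res[:, :20] = True
frac = land_fraction(hi_res, factor=10)  # 4x4 coarse grid of land fractions
```

The fractional values (rather than a hard 0/1 mask) preserve coastline information at the model resolution.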

create_boundary_mask(lonlat_state, lonlat_boundary, boundary_thickness, overlap=False)

This function creates a boundary mask for the neural LAM model. The boundary mask defines the edges of the model's domain and helps in handling boundary conditions, which are essential for limited area modeling. It allows for overlapping ("flatbread") or non-overlapping ("donut") boundaries. The borders of the interior state domain are identified using anemoi-datasets functionality for lat-lon matching of boundary and state. The thickness of the boundary is applied using binary_dilation, which is a poor approach; it should instead be calculated in meters in the projection of the state domain.
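A sketch of the meters-based alternative: with the boundary-point coordinates projected into the state CRS, the thickness can be expressed as a distance band around the state domain. All names below are hypothetical, and the state outline is simplified to an axis-aligned bounding box (a real domain outline would need a proper point-to-polygon distance):

```python
import numpy as np

def boundary_mask(xy_boundary, xy_state_extent, min_dist, max_dist):
    """Mark boundary points whose distance (in projection units, e.g. meters)
    from the state-domain bounding box lies within [min_dist, max_dist]."""
    x, y = xy_boundary[:, 0], xy_boundary[:, 1]
    xmin, xmax, ymin, ymax = xy_state_extent
    # distance from each point to the axis-aligned state rectangle (0 inside it)
    dx = np.maximum.reduce([xmin - x, np.zeros_like(x), x - xmax])
    dy = np.maximum.reduce([ymin - y, np.zeros_like(y), y - ymax])
    dist = np.hypot(dx, dy)
    return (dist >= min_dist) & (dist <= max_dist)

# toy example: 1 km x 1 km state square, boundary band from 1 m to 50 km out
pts = np.array([[0.5e3, 0.5e3], [-20e3, 0.0], [-80e3, 0.0]])
mask = boundary_mask(pts, (0.0, 1e3, 0.0, 1e3), 1.0, 50e3)
```

With `min_dist=0` the interior points are included too, giving the overlapping "flatbread" variant; a positive `min_dist` yields the "donut".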


generate_toa_radiation_forcing(ds, lonlat)

This function pre-computes the top-of-atmosphere (TOA) radiation forcing on the grid nodes. The functionality is provided by the GraphCast repository. It is, however, quite slow for large domains and again depends on lat-lon coordinates. Maybe there are better options.


Usage

The example script is based on MEPS (since we all have that data) and ERA5, read directly from the Google WeatherBench cloud storage. To run it, store the script at the top level of the neural-lam repo, with meps_example in the data folder:

pip install gcsfs 
pip install rasterio
pip install git+https://github.com/google-deepmind/graphcast.git
python create_forcings.py
sadamov added the labels enhancement, help wanted and question on May 17, 2024

sadamov commented May 17, 2024

@leifdenby As we discussed in the meeting: the boundary mask certainly has high importance, as it is not a simple static feature but is potentially used in loss calculation and graph design. As you can see from the plots, I only got as far as creating a very basic working example for MEPS+ERA5 as a basis for discussion and future PRs. It would be great to get your boundary-masking expertise here 🤝

joeloskarsson commented

  • Is a land-sea mask or a land-water mask (including lakes) more relevant? I have seen that for the MEPS area the large Swedish lakes impact some surface variables, so it would perhaps be nice to be able to inform the model about lakes as well. Although we could definitely provide a utility for the land-sea mask, and if one wants something more specific (like what I describe) one would have to get that from somewhere else and put it into the zarr separately.
  • I am generally for including these types of utility scripts in neural-lam, as they are highly useful (even crucial) for working with the codebase. I think it would be good to group all such scripts in some way though, perhaps as a utility sub-module or similar.


TomasLandelius commented May 22, 2024

For many static (climatological) surface parameters I guess ECOCLIMAP-II would work: https://opensource.umr-cnrm.fr/projects/ecoclimap/wiki (1 km resolution).
When it comes to more dynamic things like sea and lake ice, I guess one would get that from satellite-based products. Leaving this for later for now.


TomasLandelius commented May 23, 2024

I have something to do this:

create_boundary_mask(lonlat_state, lonlat_boundary, crs_state, boundary_min_distance, boundary_max_distance, overlap=False)

producing a result like this for the MEPS domain with ERA5 (latlon projection) on the boundary.


TomasLandelius commented May 23, 2024

Regarding TOA irradiance. What about something like this:

import numpy as np

def toa_solirr(lat, lon, day, hr_utc):
    # returns the cosine of the solar zenith angle (not yet scaled by the solar constant)
    dec = -23.45 * np.cos(np.pi/180*360/365*(day-81)) * np.pi/180
    hr_lst = hr_utc + lon/15
    hr_angle = 15 * (hr_lst - 12)
    return np.sin(lat*np.pi/180) * np.sin(dec) + np.cos(lat*np.pi/180) * np.cos(dec) * np.cos(hr_angle*np.pi/180)

Is that also too slow?

%timeit toa_solirr(lat_era, lon_era, 0, 0)
76.7 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

joeloskarsson commented

> I have something to do this:
>
> create_boundary_mask(lonlat_state, lonlat_boundary, crs_state, boundary_min_distance, boundary_max_distance, overlap=False)
>
> producing a result like this for the MEPS domain with ERA5 (latlon projection) on the boundary.

This looks great, I think! How general is this (e.g. how hard would it be to use for another domain or another projection)? Can you share the full code for this? And what is the unit for the distances now, meters?

> Regarding TOA irradiance. What about something like this:
>
> def toa_solirr(lat, lon, day, hr_utc):
>     dec = -23.45 * np.cos(np.pi/180*360/365*(day-81)) * np.pi/180
>     hr_lst = hr_utc + lon/15
>     hr_angle = 15 * (hr_lst - 12)
>     return np.sin(lat*np.pi/180) * np.sin(dec) + np.cos(lat*np.pi/180) * np.cos(dec) * np.cos(hr_angle*np.pi/180)
>
> Is that also too slow?
>
> %timeit toa_solirr(lat_era, lon_era, 0, 0)
> 76.7 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Since it is entirely parallelizable that should be quite fast. I tested a parallel version with broadcasting:

import numpy as np
import xarray as xr
import time

zarr_path = "data/era5_latlon/ll.zarr"

ll_xds = xr.open_zarr(zarr_path)

lon_1d = ll_xds.longitude.to_numpy() # (num_lon,)
lat_1d = ll_xds.latitude.to_numpy() # (num_lat,)

# dim order is (lat, lon, day, hr)
def toa_solirr(lat, lon, day, hr_utc):
    lat = lat[:, np.newaxis, np.newaxis, np.newaxis]
    lon = lon[np.newaxis, :, np.newaxis, np.newaxis]
    day = day[np.newaxis, np.newaxis, :, np.newaxis]
    hr_utc = hr_utc[np.newaxis, np.newaxis, np.newaxis, :]

    dec = -23.45 * np.cos(np.pi / 180 * 360 / 365 * (day - 81)) * np.pi / 180
    hr_lst = hr_utc + lon / 15
    hr_angle = 15 * (hr_lst - 12)
    return np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos(
        lat * np.pi / 180
    ) * np.cos(dec) * np.cos(hr_angle * np.pi / 180)

day = np.arange(7)
hour = np.arange(24)

start_time = time.time()
toa = toa_solirr(lat_1d, lon_1d, day, hour)
end_time = time.time()
print(f"Took: {end_time-start_time} s")

(quick test code, but it does the trick)

That runs in around 1 second for a week, so around a minute per year (on my laptop, with lat-lons from ERA5 at 0.25 deg), which seems acceptable.

I have not looked into the math of this, but what is the difference between this and the GraphCast TOA implementation (https://github.com/google-deepmind/graphcast/blob/main/graphcast/solar_radiation.py)? A benefit of that one is that it uses xarray; I quickly run out of memory doing this in numpy.

I don't think it's a big problem if this takes a bit of time, since you should not have to re-compute it very often. Another thought (given how parallelizable this is) is to try to do it on a GPU. One could almost just replace np in Tomas' example with torch, but then there would be a lot of data to shuffle back to the CPU and save to disk.
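The out-of-memory issue mentioned above could also be sidestepped without xarray by evaluating the broadcast formula a few days at a time and concatenating; a minimal numpy sketch reusing the formula from the comments above (`toa_solirr_chunked` and `chunk_days` are hypothetical names, not part of the thread's scripts):

```python
import numpy as np

def toa_solirr(lat, lon, day, hr_utc):
    # broadcasting version of the formula above, dim order (lat, lon, day, hr)
    lat = np.asarray(lat)[:, None, None, None]
    lon = np.asarray(lon)[None, :, None, None]
    day = np.atleast_1d(day)[None, None, :, None]
    hr = np.atleast_1d(hr_utc)[None, None, None, :]
    dec = -23.45 * np.cos(np.pi / 180 * 360 / 365 * (day - 81)) * np.pi / 180
    hr_angle = 15 * (hr + lon / 15 - 12)
    return (np.sin(lat * np.pi / 180) * np.sin(dec)
            + np.cos(lat * np.pi / 180) * np.cos(dec)
            * np.cos(hr_angle * np.pi / 180))

def toa_solirr_chunked(lat, lon, days, hours, chunk_days=7):
    # evaluate a few days at a time so peak memory stays small,
    # then stitch the chunks back together along the day axis
    parts = [toa_solirr(lat, lon, days[i:i + chunk_days], hours)
             for i in range(0, len(days), chunk_days)]
    return np.concatenate(parts, axis=2)

lat = np.linspace(-80, 80, 5)
lon = np.linspace(0, 350, 6)
days, hours = np.arange(1, 31), np.arange(24)
full = toa_solirr(lat, lon, days, hours)
chunked = toa_solirr_chunked(lat, lon, days, hours)
```

Each chunk can be written to the zarr archive as it is produced, so the full array never has to be held in memory at once.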

TomasLandelius commented

The difference from the GraphCast implementation is that they focus on having something very similar to what goes into ERA5, while mine opts for being as simple as possible. Maybe too simple? It does, for example, not account for the Sun-Earth distance, which causes about 7% variation during the year. Still, I think other things contribute more to the error than imperfections in this forcing. However, I have not checked the output from my version against more precise ones; that should be done.


TomasLandelius commented Jun 5, 2024

Here is an update with a correction to the declination. It now outputs irradiance [W m**-2] so that it can be checked more easily. Also, input is now given by DataArray coordinates.

import numpy as np

def toa_solirr(dt, lon, lat):
    # same dim order as in xda: (time, longitude, latitude)
    lon = lon.values[np.newaxis, :, np.newaxis]
    lat = lat.values[np.newaxis, np.newaxis, :]
    day = dt.dayofyear.values[:, np.newaxis, np.newaxis]
    hr_utc = dt.hour.values[:, np.newaxis, np.newaxis]
    dec = np.pi / 180 * 23.45 * np.sin(2 * np.pi * (284 + day) / 365)
    hr_lst = hr_utc + lon / 15
    hr_angle = 15 * (hr_lst - 12)
    cos_sza = np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos(
        lat * np.pi / 180
    ) * np.cos(dec) * np.cos(hr_angle * np.pi / 180)
    # clip at zero (night side) and scale by the solar constant
    return np.fmax(0, 1366 * cos_sza)

toa = toa_solirr(xda['time'].dt, xda['longitude'], xda['latitude'])

I tried to get it to work with xr.apply_ufunc like this, but to no avail. Maybe someone familiar with Dask can make it work?

toa = xr.apply_ufunc(
    toa_solirr,
    xda['time'].dt, 
    xda['longitude'],
    xda['latitude'],
    input_core_dims=[['time'], ['latitude'], ['longitude']],
    output_core_dims=[[]],
    dask='parallelized'
)


sadamov commented Jun 6, 2024

The datetime forcings are now implemented in #54 without the need for a seconds_in_year constant since that information is now directly retrieved from the xarray Dataset.

sadamov pushed a commit that referenced this issue Jul 18, 2024
I added alternative functions for the calculation of TOA irradiance and border masks, using similar inputs and outputs as the existing ones. Also removed the need for pyproj by reformulating the projections using cartopy.
sadamov linked a pull request Jul 19, 2024 that would close this issue
sadamov removed a link to a pull request Jul 19, 2024
sadamov self-assigned this Sep 6, 2024
joeloskarsson commented

Closing this as superseded by mllam/mllam-data-prep#29
