
Workflow amalgamate #66
Open · rburghol opened this issue Jul 15, 2024 · 25 comments

@rburghol (Contributor) commented Jul 15, 2024

This will produce a full domain coverage raster of amalgamated daily best-fit rasters and store it in the database for later use by coverage model time series input routines in the wdm workflow.

Data Model

  • Consider storing an amalgamated raster at the daily (or weekly) level as 8-bit integer type instead of 64-bit floating point type. https://postgis.net/docs/RT_ST_BandPixelType.html
  • The 8-bit integer could be a reference to a varkey, and would result in 1/4th the storage requirement of the daymet resolution.
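A minimal sketch of that storage idea, assuming a hypothetical tmp_best_varid table holding a float-typed best-fit varid raster and placeholder psql variables for the scenario property and variable ids:

--Sketch only: cast the varid raster to 8-bit unsigned integer ('8BUI')
--before storing; all names here are placeholders, not the final workflow's.
INSERT INTO dh_timeseries_weather (featureid, varid, tstime, tsendtime, rast)
SELECT :'scenario_pid', :'amalgamate_varid', :'tsstart', :'tsendin',
  ST_Reclass(best.rast, 1, '0-255:0-255', '8BUI', 0) as rast
FROM tmp_best_varid as best;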

Steps

  • Assemble daily total amalgamated rasters for days in simulation period, using best available data according to ranking file.

Info

  • Overview of functions needed:
    • Given:
      • a default hourly raster (has 0.0 values and resolution of final dataset)
      • a varkey indicating the name of coverage best fit records to amalgamate for a given model scenario
      • a coverage with ranking models completed for the potential data sources
      • a table of raster data sources with a ranking value for each potential data source for each day of the table
      • A base raster with 1 record for every hour
    • Create:
      • A raster with the best fit precip data for every cell based on coverage ratings
      • A raster with the varkey for the best fit precip data source for every cell based on coverage ratings
    • Amalgamate coverage specific rasters into a single continuous daily dataset for model meteorology input generation.
  • amalgamate: Create new amalgamated raster dataset from multiple data sources ranked in the assess step
    • download
    • process: Create model properties and transform map data to model spatial and temporal scale
      • 01_importModelPropsToDB Create a met model property and an accompanying scenario property for this amalgamation scenario
      • 02_bestVarIDRaster Create a raster that has the varid of the best fit precipitation data source in every cell based on the ratings imported in geo -> import -> 04. Store this raster as 8-bit integer in dh_timeseries_weather under the amalgamation scenario property
      • 03_amalgamate: create a raster in the database for the amalgamated scenario with each cell being precip data from the data source identified in 02_bestVarIDRaster
        • Uses a for loop to repeatedly call amalgamate.sh, which takes a given varid and sets all non-matching cells in the bestVarIDRaster to NULL before unioning in the resampled raw precip data to create an amalgamated raster
        • Store as raster in dh_timeseries_weather
    • import
      • Import amalgamated dataset into postgresql database (may take place in process)
    • analyze
      • 01_plotVarIDRaster Generate a png of the raster generated in process 02 and store in the appropriate directory defined by config. The plot is generated by returning the target tid from the postgres database and feeding it to plotRaster.R, called by plotRaster.sh
      • 02_plotVarIDRaster Generate a png of the raster generated in process 03 and store in the appropriate directory defined by config, using the process from analyze 01
      • QA
      • Summary metrics for inspection
      • Generate case-studies of model fit for Met/USGS

Run amalgamate for simple_lm should be in the form:

MODEL_ROOT=/backup/meteorology/
MODEL_BIN=$MODEL_ROOT
SCRIPT_DIR=/opt/model/model_meteorology/sh
export MODEL_ROOT MODEL_BIN SCRIPT_DIR
/opt/model/meta_model/run_model raster_met amalgamate_simple_lm 2022-02-18 auto amalgamate 

There is no need to insert a coverage. The coverage details will be included in the amalgamate config and will be for the model extent. Best fit ratings should be extracted and combined for all coverages.
Run amalgamate for storm_vol should be in the form:

MODEL_ROOT=/backup/meteorology/
MODEL_BIN=$MODEL_ROOT
SCRIPT_DIR=/opt/model/model_meteorology/sh
export MODEL_ROOT MODEL_BIN SCRIPT_DIR
/opt/model/meta_model/run_model raster_met amalgamate_storm_vol 2022-02-18 auto amalgamate 

Batch Run for entire model period (process only):

MODEL_ROOT=/backup/meteorology/
MODEL_BIN=$MODEL_ROOT
SCRIPT_DIR=/opt/model/model_meteorology/sh
export MODEL_ROOT MODEL_BIN SCRIPT_DIR
for j in {1984..2023}; do 
  yr=$j
  doy=`date -d "${yr}-12-31" +%j`
  i=0
  while [ $i -lt $doy ]; do
   thisdate=`date -d "${yr}-01-01 +$i days" +%Y-%m-%d`
   echo "Running: sbatch /opt/model/meta_model/run_model raster_met amalgamate_storm_vol \"$thisdate\" auto amalgamate process"
   sbatch /opt/model/meta_model/run_model raster_met amalgamate_storm_vol "$thisdate" auto amalgamate process
   i=$((i + 1))
  done
done
@COBrogan (Collaborator) commented Sep 16, 2024

First attempt at amalgamating a raster. Pulls in clipped, resampled rasters for daymet and PRISM. The "best" dataset below is picked randomly; this will need to be updated as values are added to REST. Unions NLDAS rasters prior to resampling and clipping. This is the longest part of the query, adding roughly 10 seconds. For one day, this query takes about 15 seconds. Scaling this to the full time period would take about 60 hours (15 sec/day * 365 day/yr * 39 yr / 3600 sec/hour). To-do:

  • Clip amalgamated raster to ~3 small watersheds and confirm results are identical to those from calc_raster_ts. Small watersheds will not have additional values overlaid on top of them and should thus remain identical
  • Check a larger watershed where we have full coverage, ideally one that is half driven by one dataset and half by another to see if the amalgamate is working as intended
  • Set "background" raster to NLDAS union
  • Temporally disaggregate based on the fraction or a 24-hour average rainfall per "Integrate temporal disaggregation step into wdm workflow" #86
  • Count number of times the 24-hour default option is used
  • Add real ratings via the properties or TS added via REST
--Given we have a file that indicates which dataset performed best, we should have
--three different datasets defined via WITH, grabbing and clipping coverages based on the file

--Arbitrarily pick a tsendtime to practice with, here February 18, 2020. 
--Note that with NLDAS this will pick an arbitrary hour and we need the full 24-hour set
\set tsendin '1582027200'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query 
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset

--Grab the USGS full drainage geometries/coverages and assign ratings to indicate best 
--performing precip dataset
WITH usgsCoverage as (
	SELECT f.*,
	--Join in ratings. Until ratings are put in db via REST, let's
	--use a random integer between 0 (NLDAS) and 2 (daymet)
	floor(random() * (2-0+1) + 0)::int as dataID,
	--Add area of watershed/coverage for reference and to order from downstream to upstream
	ST_AREA(fgeo.dh_geofield_geom) as covArea,
	fgeo.dh_geofield_geom as dh_geofield_geom
	FROM dh_feature as f
	LEFT JOIN field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
	WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
	ORDER BY covArea DESC
),
--Get the geometry and feature fields for the full coverage based on the covid variable gset above. 
--This will be used to create a resampled NLDAS for the day
fullCoverage as (
	SELECT f.*,fgeo.dh_geofield_geom
	FROM dh_feature as f
	LEFT JOIN field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
	WHERE f.hydroid = :'covid'
),
--Where PRISM is the best performing dataset, grab the appropriate
--daily raster from dh_weather_timeseries and resample to target resolution
--and then clip to watershed boundaries
prism as (
	SELECT cov.*,
	met.featureid,met.tsendtime,
	st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
	FROM usgsCoverage as cov
	JOIN(
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey='prism_mod_daily'
			and met.featureid = :covid
			and met.tsendtime = :'tsendin'
	) AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
	WHERE cov.dataID = 1
),
--Where daymet is the best performing dataset, grab the appropriate
--daily raster from dh_weather_timeseries and resample to target resolution
--and then clip to watershed boundaries
daymet as (
	SELECT cov.*,
	met.featureid,met.tsendtime,
	st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
	FROM usgsCoverage as cov
	JOIN(
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey='daymet_mod_daily'
			and met.featureid = :covid
			and met.tsendtime = :'tsendin'
	) AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
	WHERE cov.dataID = 2
),
--Union all NLDAS rasters for the day to get the sum of NLDAS for the day
nldasFullDay AS (
	SELECT st_union(met.rast,'sum') as rast
	FROM (
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey='nldas2_precip_hourly_tiled_16x16'
			and met.featureid = :covid
			and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
			and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
			and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
	) AS met
),
nldasFullDayResamp AS (
	select st_resample(met.rast,rt.rast) as rast
	FROM fullCoverage as f
	JOIN nldasFullDay as met
	ON ST_ConvexHull(met.rast) && f.dh_geofield_geom
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
),
--Union all NLDAS rasters for the day, intersecting by the usgsCoverage geometries
--to leverage the tiled NLDAS rasters. The end result is a raster for each coverage 
--where NLDAS is the most highly rated that is of the full day's dataset, 
--but clipped to only intersecting tiles
nldasDay as (
	SELECT cov.hydroid, cov.hydrocode,
	cov.ftype, cov.bundle, cov.name,
	:'tsendin' as tsendtime,
	st_union(met.rast,'sum') as rast
	FROM usgsCoverage as cov
	JOIN(
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey='nldas2_precip_hourly_tiled_16x16'
			and met.featureid = :covid
			and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
			and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
			and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
	) AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
	WHERE cov.dataID = 0
	GROUP BY cov.hydroid, cov.hydrocode, cov.ftype,
		cov.bundle, cov.name
),
--Now, using the union of NLDAS hourly data in nldasDay, resample to the template raster and clip to each 
--watershed where NLDAS is rated the best via an INNER JOIN and the WHERE in nldasDay
nldas as (
	SELECT cov.*,met.tsendtime,
	st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
	FROM usgsCoverage as cov
	INNER JOIN nldasDay as met
	on cov.hydroid = met.hydroid
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
),
--For each feature in usgsCoverage, find the non-NULL summary dataset (which will represent the best rated dataset)
amalgamate as (
	select cov.*, COALESCE(prismMet.rast,daymetMet.rast,nldasMet.rast) as rast
	FROM usgsCoverage as cov
	LEFT JOIN nldas as nldasMet
	on cov.hydroid = nldasMet.hydroid
	LEFT JOIN prism as prismMet
	on cov.hydroid = prismMet.hydroid
	LEFT JOIN daymet as daymetMet
	on cov.hydroid = daymetMet.hydroid
),
--Union the best rated datasets together. Since the data is sorted by drainage area, 
--upstream areas will populate after larger, downstream coverages
amalgamateUnion as (
	SELECT ST_union(amalgamate.rast) as rast
	FROM usgsCoverage as cov
	LEFT JOIN amalgamate as amalgamate
	on cov.hydroid = amalgamate.hydroid
)
--Use a full union to create a column with the amalgamateUnion raster and the nldasFullDayResamp raster
--Then, union the rasters to get a raster in which the "background" is the nldasFullDayResamp and everything else is 
--the best fit raster
SELECT ST_union(fullUnion.rast) as rast
FROM (
	SELECT rast FROM nldasFullDayResamp
    UNION ALL
    SELECT rast FROM amalgamateUnion 
) as fullUnion;
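To see where those ~15 seconds go, psql timing plus EXPLAIN ANALYZE on an individual piece is a quick sketch; here the NLDAS union, which is flagged above as the slowest part (filters follow the forms used in the full query):

\timing on
EXPLAIN (ANALYZE, BUFFERS)
SELECT st_union(met.rast,'sum') as rast
FROM dh_timeseries_weather as met
LEFT JOIN dh_variabledefinition as b
  ON (met.varid = b.hydroid)
WHERE b.varkey = 'nldas2_precip_hourly_tiled_16x16'
  AND met.featureid = :covid
  --year/month/day filters as in the full query above; day shown here
  AND extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'));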

@COBrogan (Collaborator) commented Sep 17, 2024

Testing the above raster. Using random numbers to assign best fit dataset (for now):

 hydroid | hydrocode        | ftype              | bundle    | name                           | dataid | covarea
---------+------------------+--------------------+-----------+--------------------------------+--------+--------
  289982 | usgs_ws_02029000 | usgs_full_drainage | watershed | JAMES RIVER AT SCOTTSVILLE, VA |      0 | 1.2134
  290060 | usgs_ws_03176500 | usgs_full_drainage | watershed | NEW RIVER AT GLEN LYN, VA      |      2 | 0.9902
  290065 | usgs_ws_02026000 | usgs_full_drainage | watershed | JAMES RIVER AT BENT CREEK, VA  |      1 | 0.9666

So, let's find the mean precipitation in this watershed using the resampled PRISM data from dh_timeseries_weather

Check `dh_timeseries_weather` PRISM data
\set hydrocode 'usgs_ws_02026000'
\set band '1'
\set tsendin '1582027200'
--\set varkey 'nldas2_precip_hourly_tiled_16x16'
\set varkey 'prism_mod_daily'
--\set varkey 'daymet_mod_daily'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query 
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset

WITH usgs_features AS (
  SELECT * 
  FROM  dh_feature
  WHERE hydrocode = :'hydrocode'
),
metUnion as (
	Select met.featureid,
		st_union(met.rast,'sum') as rast
	FROM usgs_features as f
	left outer join field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
	JOIN(
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey=:'varkey'
			and met.featureid = :covid
			and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
			and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
			and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
	) AS met
	ON fgeo.dh_geofield_geom && met.bbox
	
	group by met.featureid
),
met as (
	Select met.featureid, to_timestamp(:'tsendin') as obs_date,
		extract(year from to_timestamp(:'tsendin')) as yr,
		extract(month from to_timestamp(:'tsendin')) as mo,
		extract(day from to_timestamp(:'tsendin')) as da,
		extract(hour from to_timestamp(:'tsendin')) as hr,
		(ST_summarystats(st_clip(st_resample(met.rast,rt.rast), fgeo.dh_geofield_geom), :'band', TRUE)).mean as stats
	FROM usgs_features as f
	left outer join field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
	JOIN metUnion AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),fgeo.dh_geofield_geom)
	left join (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
)
select featureid, obs_date, yr, mo, da, hr, 
  0.0393701 * stats precip_in
from met
order by met.obs_date;

featureid | obs_date | yr | mo | da | hr | precip_in
-----------+------------------------+------+----+----+----+--------------------
617850 | 1981-09-16 08:00:00-04 | 1981 | 9 | 16 | 8 | 0.6558419584256406

Now let's find the mean precipitation in this watershed using the amalgamated data from tmp_amalgamate

Check `amalgamate` data
WITH usgs_features AS (
  SELECT * 
  FROM  dh_feature
  WHERE hydrocode = :'hydrocode'
),
met as (
	Select f.hydroid as featureid, to_timestamp(:'tsendin') as obs_date,
		extract(year from to_timestamp(:'tsendin')) as yr,
		extract(month from to_timestamp(:'tsendin')) as mo,
		extract(day from to_timestamp(:'tsendin')) as da,
		extract(hour from to_timestamp(:'tsendin')) as hr,
		(ST_summarystats(st_clip(met.rast, fgeo.dh_geofield_geom), :'band', TRUE)).mean as stats
	FROM usgs_features as f
	left outer join field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
	JOIN tmp_amalgamate AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),fgeo.dh_geofield_geom)
)
select featureid, obs_date, yr, mo, da, hr, 
  0.0393701 * stats precip_in
from met
order by met.obs_date;

featureid | obs_date | yr | mo | da | hr | precip_in
-----------+------------------------+------+----+----+----+--------------------
290065 | 1981-09-16 08:00:00-04 | 1981 | 9 | 16 | 8 | 0.5779727155453814
(1 row)

These different values actually make sense: this is a larger watershed and will thus change after amalgamation as smaller watersheds are unioned on top of it. We instead need to check a headwater watershed, which should remain unchanged. For instance, usgs_ws_01645784 may work, as it is our smallest watershed by area (it used daymet in this run), as may usgs_ws_01654500 (which used PRISM in this run). Using the above code to evaluate usgs_ws_01654500, we see an identical value between amalgamate and dh_timeseries_weather's PRISM data. Similarly, usgs_ws_01645784 produced data identical to dh_timeseries_weather's daymet data. usgs_ws_02055100 is larger, but used NLDAS data in this random allocation of ratings; using the code above, we again see agreement between dh_timeseries_weather and the amalgamate method above.
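A compact way to script that spot check (a sketch; tmp_met_raw and tmp_met_amalgamate are hypothetical temp tables holding the single-row outputs of the two queries above):

--Headwater coverages should difference to ~0; overlapped ones will not.
SELECT raw.precip_in - ama.precip_in as diff_in
FROM tmp_met_raw as raw, tmp_met_amalgamate as ama;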

@rburghol (Contributor, Author) commented:

This is great progress @COBrogan, thanks for pushing it along! I was today years old when I learned the SQL term COALESCE, so thanks!

I had a couple questions and speculations (and may reveal my misunderstanding):

  • could your headwater data not yield the same data, since the selection of best fit is random (for this demo)?
  • I wonder if we can improve performance by using the daily sum raster for nldas2 that @mwdunlap2004 is working on, so it's just another prepopulated varkey?
  • I think you have squeezed out at least some of the performance, since your WITH queries are only grabbing the data for days when they're the selected source.
  • are you actually pulling in all USGS coverages for the amalgamated layers? I got a little lost between usgsCoverage, which seems to grab all shapes, and fullCoverage which seems to limit to a larger watershed of interest.
  • if you are actually grabbing all the gage covs, do we need to use the single coverage in fullCoverage or is that just to get the base raster? I think I answered my own question but humor me please 😝
  • All in all, a 60 hour amalgamation process for the entire domain does not seem outrageous to me -- of course I will soon want to do it in 45 minutes but that is the way of things!

@COBrogan (Collaborator) commented:

  • could your headwater data not yield the same data, since the selection of best fit is random (for this demo)?

The headwaters are returning the same data as the randomly selected dataset. I created some temp tables to test this out but that code isn't here (but I have it on OneDrive if I need to refer to it again).

  • I wonder if we can improve performance by using the daily sum raster for nldas2 that @mwdunlap2004 is working on, so it's just another prepopulated varkey?

Absolutely! Those rasters will reduce run time by at least 33%, I think.

  • I think you have squeezed out at least some of the performance, since your WITH queries are only grabbing the data for days when they're the selected source.

That's what I'm hoping! I tried to put a comment block above each 'WITH' to help it be a little more legible because the WITHs are difficult to read.

  • are you actually pulling in all USGS coverages for the amalgamated layers? I got a little lost between usgsCoverage, which seems to grab all shapes, and fullCoverage which seems to limit to a larger watershed of interest.

'fullCoverage' is the union of NLDAS for the whole CB extent. It's unioned at the end to serve as the "background" data if no better data exists.

  • if you are actually grabbing all the gage covs, do we need to use the single coverage in fullCoverage or is that just to get the base raster? I think I answered my own question but humor me please 😝

To get the base raster. There may be a better way when we finish with temporal disaggregation.

  • All in all, a 60 hour amalgamation process for the entire domain does not seem outrageous to me -- of course I will soon want to do it in 45 minutes but that is the way of things!

Hopefully we speed it up!

I'll be working on confirming that the larger watersheds are being overwritten by smaller headwaters using ST_Point next week.

@rburghol (Contributor, Author) commented:

Thinking about steps process -> 01_acquire and process -> 02_rank.

  • 01_acquire could use the existing join_col.R script and just iterate through different scenario ranking files, joining them sequentially to the base met time series files and saving as the amalgamate ranking file
  • 02_rank could then simply look at the various ranking columns and indicate the method with the highest value for each. This can also be saved into the same text file.
    • can the best_available column be populated with the varid of the selected timeseries? This would make joining straightforward in the final database SQL
  • Then, the resulting file can be imported as a timeseries record with only the final ranking indicator as the tsvalue
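A sketch of that final import step (the scenario pid and met_rating varid psql variables, and the staging table name, are placeholders):

--Store the winning varid as tsvalue under the scenario property so the
--final database SQL can join directly on it.
INSERT INTO dh_timeseries (featureid, varid, tstime, tsendtime, tsvalue)
SELECT :'scen_pid', :'met_rating_varid', r.tstime, r.tsendtime, r.best_varid
FROM tmp_amalgamate_ranking as r;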

@rburghol (Contributor, Author) commented Sep 26, 2024

@COBrogan Putting a little thought into how ratings could be included as part of the WITH section. Here it is in SQL form:

WITH ratings as (
  select f.hydroid, scen.propname, 
    rts.tstime, rts.tsendtime, rts.tsvalue as best_varid
  from usgs_coverage as f
  left outer join dh_properties as scen 
  on (
    scen.featureid = f.hydroid
    and scen.propname = :'ama_scenario'
  )
  left outer join dh_timeseries as rts
  on (rts.featureid = scen.pid)
)

Then, we could collapse the WITH statements for prism and daymet into a single generic query where we simply selected varid joined to ratings.best_varid like so:

BESTmet as (
	SELECT cov.*,
	met.featureid,met.tsendtime,
	st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
	FROM usgsCoverage as cov
	JOIN(
		select met.*
		from dh_timeseries_weather as met
		inner join ratings as b
			on (
          met.varid = b.best_varid
          and met.tstime >= b.tstime
          and met.tsendtime <= b.tsendtime
        ) 
		where met.featureid = :covid
			and met.tsendtime = :'tsendin'
	) AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
),

For more information on how I think we might go about creating those pre-amalgamated ratings for each coverage see #96

@COBrogan (Collaborator) commented Sep 26, 2024

@mwdunlap2004 @rburghol Using st_value, I checked the values of the amalgamated raster above at a few locations. My goal was to ensure the amalgamation was working so I looked at points in a large watershed (e.g. usgs_ws_02035000 James River at Cartersville) at:

  • A point (37.792314, -78.157391) that does not overlap any other watershed. Here, the value of the amalgamated raster should be identical to the value of the raster for that varkey as stored in dh_timeseries_weather e.g. the value should be identical to the raw data for this usgs_ws_02035000
  • A point (37.923872, -78.587890) that is overlapped by an upstream watershed (e.g. usgs_ws_02030000 Hardware River BL Briery Run nr Scottsville, VA). Here, the value of the amalgamated raster should be identical to the value of the usgs_ws_02030000 raster for that varkey as stored in dh_timeseries_weather, e.g. the value should be identical to the raw data for the Hardware River
  • A point (40.299638, -76.909790) that is outside of our USGS full drainage areas. Here, the value of the amalgamated raster should be identical to the "background" value of the raster e.g. the value should be identical to the daily union of the nldas2 raw data for that day in dh_timeseries_weather.

In each case, I found that the amalgamated raster matched the expected source data, so I think the amalgamation is working as intended!


See here for a modified amalgamate code that creates temp tables for the ratings and amalgamate raster for later exploration. I used this to test amalgamate in three scenarios above, but please note that the ratings are still randomly assigned such that each time you run this you will need to check the datasources selected in scenarios 1 and 2:

Temp Table Amalgamate
--Given we have a file that indicates which dataset performed best, we should have
--three different datasets defined via WITH, grabbing and clipping coverages based on the file

--Arbitrarily pick a tsendtime to practice with, here September 16, 1981. 
--Note that with NLDAS this will pick an arbitrary hour and we need the full 24-hour set
\set tsendin '369489600'
\set resample_varkey 'daymet_mod_daily'
-- sets all integer feature and varid with query 
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset


create temp table tmp_usgsCoverage as (
	SELECT f.*,
	--Join in ratings. Until ratings are put in db via REST, let's
	--use a random integer between 0 (NLDAS) and 2 (daymet)
	floor(random() * (2-0+1) + 0)::int as dataID,
	--Add area of watershed/coverage for reference and to order from downstream to upstream
	ST_AREA(ST_MakeValid(fgeo.dh_geofield_geom)) as covArea,
	fgeo.dh_geofield_geom as dh_geofield_geom
	FROM dh_feature as f
	LEFT JOIN field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
	WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
	ORDER BY covArea DESC
);


create temp table tmp_amalgamate as (
--Grab the USGS full drainage geometries/coverages and assign ratings to indicate best 
--performing precip dataset
WITH usgsCoverage as (
	SELECT *
	FROM tmp_usgsCoverage
),
--Get the geometry and feature fields for the full coverage based on the covid variable gset above. 
--This will be used to create a resampled NLDAS for the day
fullCoverage as (
	SELECT f.*,fgeo.dh_geofield_geom
	FROM dh_feature as f
	LEFT JOIN field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
	WHERE f.hydroid = :'covid'
),
--Where PRISM is the best performing dataset, grab the appropriate
--daily raster from dh_weather_timeseries and resample to target resolution
--and then clip to watershed boundaries
prism as (
	SELECT cov.*,
	met.featureid,met.tsendtime,
	st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
	FROM usgsCoverage as cov
	JOIN(
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey='prism_mod_daily'
			and met.featureid = :covid
			and met.tsendtime = :'tsendin'
	) AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
	WHERE cov.dataID = 1
),
--Where daymet is the best performing dataset, grab the appropriate
--daily raster from dh_weather_timeseries and resample to target resolution
--and then clip to watershed boundaries
daymet as (
	SELECT cov.*,
	met.featureid,met.tsendtime,
	st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
	FROM usgsCoverage as cov
	JOIN(
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey='daymet_mod_daily'
			and met.featureid = :covid
			and met.tsendtime = :'tsendin'
	) AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
	WHERE cov.dataID = 2
),
--Union all NLDAS rasters for the day to get the sum of NLDAS for the day
nldasFullDay AS (
	SELECT st_union(met.rast,'sum') as rast
	FROM (
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey='nldas2_precip_hourly_tiled_16x16'
			and met.featureid = :covid
			and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
			and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
			and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
	) AS met
),
nldasFullDayResamp AS (
	select st_resample(met.rast,rt.rast) as rast
	FROM fullCoverage as f
	JOIN nldasFullDay as met
	ON ST_ConvexHull(met.rast) && f.dh_geofield_geom
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
),
--Union all NLDAS rasters for the day, intersecting by the usgsCoverage geometries
--to leverage the tiled NLDAS rasters. The end result is a raster for each coverage 
--where NLDAS is the most highly rated that is of the full day's dataset, 
--but clipped to only intersecting tiles
nldasDay as (
	SELECT cov.hydroid, cov.hydrocode,
	cov.ftype, cov.bundle, cov.name,
	:'tsendin' as tsendtime,
	st_union(met.rast,'sum') as rast
	FROM usgsCoverage as cov
	JOIN(
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey='nldas2_precip_hourly_tiled_16x16'
			and met.featureid = :covid
			and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
			and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
			and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
	) AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),cov.dh_geofield_geom)
	WHERE cov.dataID = 0
	GROUP BY cov.hydroid, cov.hydrocode, cov.ftype,
		cov.bundle, cov.name
),
--Now, using the union of NLDAS hourly data in nldasDay, resample to the template raster and clip to each 
--watershed where NLDAS is rated the best via an INNER JOIN and the WHERE in nldasDay
nldas as (
	SELECT cov.*,met.tsendtime,
	st_clip(st_resample(met.rast,rt.rast), cov.dh_geofield_geom) as rast
	FROM usgsCoverage as cov
	INNER JOIN nldasDay as met
	on cov.hydroid = met.hydroid
	LEFT JOIN (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
),
--For each feature in usgsCoverage, find the non-NULL summary dataset (which will represent the best rated dataset)
amalgamate as (
	select cov.*, 
	COALESCE(prismMet.rast,daymetMet.rast,nldasMet.rast) as rast
	FROM usgsCoverage as cov
	LEFT JOIN nldas as nldasMet
	on cov.hydroid = nldasMet.hydroid
	LEFT JOIN prism as prismMet
	on cov.hydroid = prismMet.hydroid
	LEFT JOIN daymet as daymetMet
	on cov.hydroid = daymetMet.hydroid
),
--Union the best rated datasets together. Since the data is sorted by drainage area, 
--upstream areas will populate after larger, downstream coverages
amalgamateUnion as (
	SELECT ST_union(amalgamate.rast) as rast
	FROM usgsCoverage as cov
	LEFT JOIN amalgamate as amalgamate
	on cov.hydroid = amalgamate.hydroid
)
--Use a full union to create a column with the amalgamateUnion raster and the nldasFullDayResamp raster
--Then, union the rasters to get a raster in which the "background" is the nldasFullDayResamp and everything else is 
--the best fit raster
SELECT ST_union(fullUnion.rast) as rast
FROM (
	SELECT rast FROM nldasFullDayResamp
	UNION ALL
	SELECT rast FROM amalgamateUnion 
) as fullUnion
);

In my case, I found that daymet was randomly selected to be the "rating" for usgs_ws_02035000:

select dataid from tmp_usgsCoverage where hydrocode = 'usgs_ws_02035000';
 dataid
--------
      2

PRISM was selected as the random "rating" for usgs_ws_02030000:

select dataid from tmp_usgsCoverage where hydrocode = 'usgs_ws_02030000';
 dataid
--------
      1

First, let's check what value the amalgamated raster has in scenario 1 above (at a point in usgs_ws_02035000 that does not overlap any other watershed). Here, we should see that the amalgamated raster is equal to the raw data in dh_timeseries_weather for daymet. The value in the amalgamated raster (set some variables first for ease of checking):

\set hydrocode 'cbp6_met_coverage'
\set band '1'
\set tsendin '369489600'
\set varkey 'daymet_mod_daily'
\set resample_varkey 'daymet_mod_daily'
--Set point to extract
\set latitude 37.792314
\set longitude -78.157391
-- sets all integer feature and varid with query 
select hydroid as covid from dh_feature where hydrocode = 'cbp6_met_coverage' \gset
WITH usgs_features AS (
  --SELECT * 
  --FROM  dh_feature as f
  --WHERE f.bundle = 'watershed' AND f.ftype = 'usgs_full_drainage'
  SELECT * 
  FROM  dh_feature as f
  	left outer join field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
  WHERE f.hydrocode = :'hydrocode'
),
testPoint as (
	SELECT ST_SetSRID(ST_Point( :'longitude', :'latitude'),4326) as testPoint
),
met as (
	Select f.hydrocode as hydrocode, to_timestamp(:'tsendin') as obs_date,
		extract(year from to_timestamp(:'tsendin')) as yr,
		extract(month from to_timestamp(:'tsendin')) as mo,
		extract(day from to_timestamp(:'tsendin')) as da,
		extract(hour from to_timestamp(:'tsendin')) as hr,
		st_value(met.rast, :'band', testPoint.testPoint) as stats
	FROM usgs_features as f
	JOIN tmp_amalgamate AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),ST_SetSRID(f.dh_geofield_geom,4326))
	JOIN testPoint as testPoint
	ON ST_INTERSECTS(testPoint.testPoint,ST_SetSRID(f.dh_geofield_geom,4326))
)
select hydrocode, obs_date, yr, mo, da, hr, 
  0.0393701 * stats precip_in
from met
order by met.obs_date;
hydrocode        | obs_date               | yr   | mo | da | hr | precip_in
-----------------+------------------------+------+----+----+----+-------------------
usgs_ws_02035000 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.8047248650259018

We compare this to the value for daymet from dh_timeseries_weather. Notice we supply a hydrocode here to speed up the query:

WITH usgs_features AS (
  SELECT * 
  FROM  dh_feature as f
  	left outer join field_data_dh_geofield as fgeo
	on (
		fgeo.entity_id = f.hydroid
		and fgeo.entity_type = 'dh_feature' 
	) 
  WHERE f.hydrocode = :'hydrocode'
),
testPoint as (
	SELECT ST_SetSRID(ST_Point( :'longitude', :'latitude'),4326) as testPoint
),
metUnion as (
	Select 
		st_union(met.rast,'sum') as rast
	FROM usgs_features as f
	JOIN testPoint as testPoint
	ON 1 = 1
	JOIN(
		select *
		from dh_timeseries_weather as met
		left outer join dh_variabledefinition as b
			on (met.varid = b.hydroid) 
		where b.varkey=:'varkey'
			and met.featureid = :covid
			and extract(year from to_timestamp(met.tsendtime)) = extract(year from to_timestamp(:'tsendin'))
			and extract(month from to_timestamp(met.tsendtime)) = extract(month from to_timestamp(:'tsendin'))
			and extract(day from to_timestamp(met.tsendtime)) = extract(day from to_timestamp(:'tsendin'))
	) AS met
	ON ST_INTERSECTS(testPoint.testPoint, met.bbox)
),
met as (
	Select f.hydrocode, to_timestamp(:'tsendin') as obs_date,
		extract(year from to_timestamp(:'tsendin')) as yr,
		extract(month from to_timestamp(:'tsendin')) as mo,
		extract(day from to_timestamp(:'tsendin')) as da,
		extract(hour from to_timestamp(:'tsendin')) as hr,
		st_value(st_resample(met.rast,rt.rast), :'band', testPoint.testPoint) as stats
	FROM usgs_features as f
	JOIN metUnion AS met
	ON ST_Intersects(ST_ConvexHull(met.rast),ST_SetSRID(f.dh_geofield_geom,4326))
	JOIN testPoint as testPoint
	ON ST_INTERSECTS(testPoint.testPoint,ST_SetSRID(f.dh_geofield_geom,4326))
	left join (select rast from raster_templates where varkey = :'resample_varkey') as rt
	ON 1 = 1
)
select hydrocode, obs_date, yr, mo, da, hr, 
  0.0393701 * stats precip_in
from met
order by met.obs_date;
featureid | obs_date               | yr   | mo | da | hr | precip_in
----------+------------------------+------+----+----+----+-------------------
   617850 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.8047248650259018

Success! We have matching values. Now, what about a point in hydrocode usgs_ws_02030000 that overlaps usgs_ws_02035000? Here, the amalgamated raster should be equal to the value in the raw PRISM data, as that was "selected" for this upstream watershed and should have been unioned on top of the daymet data in the amalgamated raster. Rerun the above amalgamate code now with:

\set latitude 37.923872
\set longitude -78.587890
\set varkey 'prism_mod_daily'
hydrocode        | obs_date               | yr   | mo | da | hr | precip_in
-----------------+------------------------+------+----+----+----+-------------------
usgs_ws_02030000 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.6486224125185013

We now rerun the query against dh_timeseries_weather using the PRISM data and the updated coordinates. We expect an identical value for precip:

featureid | obs_date               | yr   | mo | da | hr | precip_in
----------+------------------------+------+----+----+----+-------------------
   617850 | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.6486224125185013

Success! The smaller watershed was correctly amalgamated on top of the larger, downstream watershed. With our last case, we want to ensure that the "background" raster exists where there is no USGS coverage. In the amalgamated raster, this should be:

\set latitude 40.299638
\set longitude -76.909790
\set varkey 'nldas2_precip_hourly_tiled_16x16'
hydrocode         | obs_date               | yr   | mo | da | hr | precip_in
------------------+------------------------+------+----+----+----+--------------------
cbp6_met_coverage | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.28930330582999997

Now let's compare to the tiled nldas data in dh_timeseries_weather:

hydrocode         | obs_date               | yr   | mo | da | hr | precip_in
------------------+------------------------+------+----+----+----+--------------------
cbp6_met_coverage | 1981-09-16 08:00:00-04 | 1981 |  9 | 16 |  8 | 0.28930330582999997

Once again, we see the same value. So NLDAS2 is successfully being amalgamated as the "background" precip values.

@COBrogan (Collaborator) commented Sep 26, 2024

Putting a little thought into how ratings could be included as part of the WITH section. Here it is in SQL form:

@rburghol I like your idea of how to add in the ratings and the "best" varid. I think that will help clean up our query because that combined PRISM/daymet statement will work well. This is a great way to structure how to deal with the ratings files, and it will make the query more applicable to future datasets (currently, we'd have to keep adding to the WITH, and we can get around that by storing and using the varid as you indicated). I'll get started on this tomorrow if I can get a few VPDES tasks done; otherwise I'll start early next week.

@COBrogan (Collaborator) commented Oct 2, 2024

Moved to issue "Amalgamate Data Model" #96

@COBrogan (Collaborator) commented Oct 8, 2024

New variable definitions for the ratings files and amalgamated rasters, with varkeys met_rating and amalgamate_daily (variable hydroids are 1465 and 1466):

--Variable definition for rating files
insert into dh_variabledefinition(varname,vardesc, vocabulary, varunits, varkey,datatype,varcode,varabbrev,multiplicity)
VALUES
('Best fit met ratings','Best fit ratings timeseries derived from meteorology',
 'met_rating', '', 'met_rating','value','met_rating','met_rating','');

insert into dh_variabledefinition(varname,vardesc, vocabulary, varunits, varkey,datatype,varcode,varabbrev,multiplicity)
VALUES
('Best fit precip raster','Best fit precip raster data amalgamated from rating scenarios',
 'amalgamate_daily', '', 'amalgamate_daily','value','amalgamate_daily','amalgamate_daily','');
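The inserts can be verified, and the hydroids recovered, with:

SELECT hydroid, varkey, varname
FROM dh_variabledefinition
WHERE varkey IN ('met_rating', 'amalgamate_daily');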

@COBrogan (Collaborator) commented Oct 10, 2024

Added two steps to geo import:
02_importModelPropsToDB = Finds or creates a property for the met-1.0 model on the USGS coverage. Then creates a scenario based on the config. Ranking timeseries will be stored under this scenario in EST timezone.
03_importRatingstoDB = Imports the timeseries into dh_weather

MODEL_ROOT=/backup/meteorology/
MODEL_BIN=$MODEL_ROOT
SCRIPT_DIR=/opt/model/model_meteorology/sh
export MODEL_ROOT MODEL_BIN SCRIPT_DIR

/opt/model/meta_model/run_model raster_met simple_lm_PRISM usgs_ws_02038000 auto geo import

Found feature with hydroid = 290098
Model property with propname = usgs_ws_02038000 met-1.0  created/found with pid = 7685427
Ranking Scenario property with propname = simple_lm_PRISM  created/found with pid = 7692579
Amalgamate Scenario property with propname = amalgamate_simple_lm  created/found with pid = 7692580
SELECT 0
COPY 2302
UPDATE 0
INSERT 0 2302

/opt/model/meta_model/run_model raster_met simple_lm_daymet usgs_ws_02038000 auto geo import

Found feature with hydroid = 290098
Model property with propname = usgs_ws_02038000 met-1.0  created/found with pid = 7685427
Ranking Scenario property with propname = simple_lm_daymet  created/found with pid = 7692581
Amalgamate Scenario property with propname = amalgamate_simple_lm  created/found with pid = 7692580
SELECT 0
COPY 1151
UPDATE 0
INSERT 0 1151

/opt/model/meta_model/run_model raster_met simple_lm_nldas2_tiled usgs_ws_02038000 auto geo import

Found feature with hydroid = 290098
Model property with propname = usgs_ws_02038000 met-1.0  created/found with pid = 7685427
Ranking Scenario property with propname = simple_lm_nldas2_tiled  created/found with pid = 7692582
Amalgamate Scenario property with propname = amalgamate_simple_lm  created/found with pid = 7692580
SELECT 0
COPY 2334
DELETE 0
INSERT 0 2334

Test on hydrocode usgs_ws_02021500: /opt/model/meta_model/run_model raster_met simple_lm_PRISM usgs_ws_02021500 auto geo import
Found feature with hydroid = 289948

@COBrogan (Collaborator) commented Dec 5, 2024

Get all ratings for a model scenario:

SELECT modelProp.featureid as featureid, 
CASE 
	WHEN ts.tsvalue IS NOT NULL THEN scenProp.pid  
	ELSE NULL  
END as pid, 
ts.tstime, 
ts.tsendtime, 
ts.tsvalue as tsvalue 
FROM dh_properties as modelProp 
LEFT JOIN dh_properties as scenProp 
ON scenProp.featureid = modelProp.pid 
LEFT JOIN dh_timeseries as ts 
ON ts.featureid = scenProp.pid 
WHERE modelProp.propcode = 'met-1.0' 
AND scenProp.propname IN ('simple_lm_nldas2','simple_lm_daymet','simple_lm_PRISM')
AND modelProp.featureid = :coveragefeatureid 
ORDER BY ts.tstime;

Now, we need to:

  • Get each varkey in its own column with its rating. Each row should represent a unique tstime (e.g. pivot wider)
Details
-- Ratings in each column
SELECT tstime,tsendtime,
MAX(CASE WHEN varkey = 'rating_daymet' THEN tsvalue END) as rating_daymet,
MAX(CASE WHEN varkey = 'rating_prism' THEN tsvalue END) as rating_prism,
MAX(CASE WHEN varkey = 'rating_nldas2' THEN tsvalue END) as rating_nldas
FROM (
	SELECT to_timestamp(ts.tstime) as tstime,
	 to_timestamp(ts.tsendtime) as tsendtime,
	 ts.tsvalue,
	 v.varkey 
	FROM dh_properties as modelProp
	LEFT JOIN dh_properties as scenProp
	  ON scenProp.featureid = modelProp.pid
	  AND scenProp.entity_type = 'dh_properties'
	LEFT JOIN dh_timeseries as ts
	  ON ts.featureid = scenProp.pid
	LEFT JOIN dh_variabledefinition as v
	  ON v.hydroid = ts.varid
	WHERE modelProp.propcode = 'met-1.0'
	  AND scenProp.propname = 'lm_simple'
) AS ratings
GROUP BY tstime,tsendtime;
  • Select best rating for each time period
Details
WITH maxRating AS (
	SELECT modelProp.featureid as featureid,
		ts.tstime as tstime,
		ts.tsendtime as tsendtime,
		max(ts.tsvalue) as maxtsvalue
	FROM dh_properties as modelProp
	LEFT JOIN dh_properties as scenProp
		ON scenProp.featureid = modelProp.pid
	LEFT JOIN dh_timeseries as ts
		ON ts.featureid = scenProp.pid
	WHERE modelProp.propcode = 'met-1.0'
		AND scenProp.propname = 'lm_simple'
		GROUP BY modelProp.featureid,
		ts.tstime,
		ts.tsendtime
)
SELECT modelProp.featureid AS featureid,
	to_timestamp(ts.tstime) as tstime,
	to_timestamp(ts.tsendtime) as tsendtime,
	ts.tsvalue,
	v.varkey 
FROM dh_properties as modelProp
LEFT JOIN dh_properties as scenProp
  ON scenProp.featureid = modelProp.pid
LEFT JOIN dh_timeseries as ts
  ON ts.featureid = scenProp.pid
LEFT JOIN dh_variabledefinition as v
  ON v.hydroid = ts.varid
INNER JOIN maxRating
  ON ts.tstime = maxRating.tstime
  AND ts.tsendtime = maxRating.tsendtime
  AND ts.tsvalue = maxRating.maxtsvalue
  AND modelProp.featureid = maxRating.featureid
WHERE modelProp.propcode = 'met-1.0'
  AND scenProp.propname = 'lm_simple'
ORDER BY modelProp.featureid,ts.tstime;
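An equivalent sketch using DISTINCT ON picks the top-rated varkey per timestep in one pass, avoiding the maxRating self-join (worth benchmarking against the version above):

SELECT DISTINCT ON (modelProp.featureid, ts.tstime)
	modelProp.featureid,
	to_timestamp(ts.tstime) as tstime,
	to_timestamp(ts.tsendtime) as tsendtime,
	ts.tsvalue,
	v.varkey
FROM dh_properties as modelProp
LEFT JOIN dh_properties as scenProp
  ON scenProp.featureid = modelProp.pid
LEFT JOIN dh_timeseries as ts
  ON ts.featureid = scenProp.pid
LEFT JOIN dh_variabledefinition as v
  ON v.hydroid = ts.varid
WHERE modelProp.propcode = 'met-1.0'
  AND scenProp.propname = 'lm_simple'
ORDER BY modelProp.featureid, ts.tstime, ts.tsvalue DESC NULLS LAST;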

Note that we see some date inconsistencies in 1980 due to the weekly expansion. We can get around this by modifying our obs_date in the coverage precip file or by using monthly ratings only:

tstime                 | tsendtime              | tsvalue           | varkey
-----------------------+------------------------+-------------------+--------------
1980-01-01 00:00:00-05 | 1980-01-07 00:00:00-05 | -6.65557443442    | rating_nldas2
1980-01-02 00:00:00-05 | 1980-01-07 00:00:00-05 | -9.52706366158768 | rating_daymet

But they do not persist:

tstime                 | tsendtime              | tsvalue           | varkey
-----------------------+------------------------+-------------------+--------------
1988-01-01 00:00:00-05 | 1988-01-07 00:00:00-05 | -5.89993382771783 | rating_daymet
1988-01-08 00:00:00-05 | 1988-01-14 00:00:00-05 | -5.08896900326159 | rating_nldas2


@rburghol (Contributor, Author) commented Jan 1, 2025

@COBrogan I am eager to discuss this with you. It has been a while, but I think ideally if we have references to the best data set for each coverage stored as time series attached to those scenario properties as you suggest, we should be able to roll up all gage rasters into the single domain wide coverage.

But we've never been so close to actually getting it done… Let's get together ASAP and discuss our proposals. Maybe we can put a few queries together and then see where the best place would be to alter the workflow?

@COBrogan (Collaborator) commented Feb 20, 2025

/opt/model/meta_model/run_model raster_met amalgamate_simple_lm 2022-02-18 auto amalgamate process 02

@rburghol amalgamate process -> 02 will now generate rasters where each cell represents the varid of the precip data source that produces the maximum rating (e.g. "best fit") from the simple_lm workflow. These rasters will be inserted into dh_timeseries_weather where the featureid is a scenario property on the feature that represents the full model extent (e.g. hydrocode cbp6_met_coverage). Using gdal_translate, I brought one of these rasters into R:

gdal_translate -of GTiff PG:"host=192.168.0.21 port=5432 sslmode=disable user=postgres dbname=drupal.dh03 schema=public table=dh_timeseries_weather column=rast where='tid = 443073019'" test.tiff

Image

It'd be interesting to see @mwdunlap2004 visualize these with the USGS gage boundaries to see if he gets the same thing in mapserver. I had to try a few different combinations of inputs for ST_SetValue. The raster_templates do not have a NoData value set, so when setting values and clipping, it kept using the minimum value as the NoData value. This caused data to get overwritten where it was not supposed to, but everything looks like it is working now. During the amalgamation (see bestData within the SQL WITH), watersheds with a NULL best ranking are ignored. For these watersheds, no ranking file was available, often due to a lack of data. For this example, the following hydroids had NULL "best" fit rankings and were thus set to background (NLDAS2): 290018, 289938, 290009, 290050, 289960, 290078, 437559, 290098, 289964, 289931, 289972, 290090.
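One candidate fix for the template NoData issue is to give the template band an explicit NoData value up front, a sketch using ST_SetBandNoDataValue (the -9999 sentinel is an assumption):

--With an explicit NoData value, ST_SetValues/ST_Clip stop treating the
--band minimum as NoData.
UPDATE raster_templates
SET rast = ST_SetBandNoDataValue(rast, 1, -9999)
WHERE varkey = 'daymet_mod_daily';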

The tid for this raster is 443073019

R Code for image (slow):

library(raster)
library(sf)
#Get the watershed coverage from the server
watershedGeo <- read.csv("http://deq1.bse.vt.edu:81/met/usgsGageWatershedGeofield.csv")

#Get the gage numbers as their own field and store a copy of the data
gageWatershed <- watershedGeo
gageWatershed$gage <- gsub(".*_(\\d+)","\\1",gageWatershed$hydrocode)
gageSF <- st_as_sf(gageWatershed,wkt = 'wkt',crs = 4326)
gageSF$area <- st_area(gageSF$wkt)
gageSF <- gageSF[order(gageSF$area),]

#Get the amalgamate test raster
aTest <- raster("http://deq1.bse.vt.edu:81/met/amalgamateTest.tiff")
crs(aTest) <- "EPSG:4326"

pal <- c("steelblue2","darkgreen","grey50","grey50","grey50","grey50","maroon4")

plot(aTest, axes = TRUE,
     #plot with defined breaks
     col = pal,
     ylim = c(36,40.5),xlim = c(-84,-77)) 
plot(gageSF$wkt,col = NA,add = TRUE)

@rburghol (Contributor, Author) commented:

Super interesting @COBrogan !! Maybe we can get a primer on the edge phenomena you describe on Monday? I am not sure that I know what I'm seeing. Still awesome progress!

@rburghol (Contributor, Author) commented Feb 26, 2025

@COBrogan I just had an idea. It might be really slow, or it might be surprisingly fast I haven't the faintest notion.

Rather than calling the routine multiple times and saving the intermediate as we discussed perhaps it could be accomplished with RECURSIVE. Like we do for the nested model property arrays that we pull into R.

So as it is recursively called, it merges the product of the last merger with the next mask that you create?

Example from that issue where I was developing the approach: HARPgroup/hydro-tools#435

@COBrogan (Collaborator) commented:

@rburghol I'm still trying to figure out RECURSIVE. It seems super useful for multilevel analysis (parent -> child -> grandchild -> etc.), but I'm not sure of the syntax to get it to work here. Usually, the anchor member passes something down to the recursive portion of the statement, but in this case each record is kind of unrelated. I'd be happy to discuss this more though, because I think this could be a really clean means of doing this.

For now, under amalgamate -> import 03, I have the FOR loop set up to iterate through SCENARIOS_TO_RANK; it will call amalgamate.sh in each loop to slowly amalgamate the best-fit precip raster.

@COBrogan (Collaborator) commented Feb 27, 2025

@rburghol The amalgamation appears to be working for both storm_volume and simple_lm! All of the work is happening in process for now (see main issue) where process 01 creates variables, process 02 creates the varid raster, and process 03 amalgamates the best fit precip data based on the varid raster from step 2

MODEL_ROOT=/backup/meteorology/
MODEL_BIN=$MODEL_ROOT
SCRIPT_DIR=/opt/model/model_meteorology/sh
export MODEL_ROOT MODEL_BIN SCRIPT_DIR
/opt/model/meta_model/run_model raster_met amalgamate_storm_vol 2020-02-19 auto amalgamate process

Check it out! Here is the varid raster from step 2 (red = NLDAS2, green= daymet, blue= PRISM):

/opt/model/meta_model/run_model raster_met amalgamate_storm_vol 2020-02-19 auto amalgamate process 02

Image

And here is the amalgamated best fit precip data for that same day in mm!
Image

One item that became important was that the NoDataValue for NLDAS2 is 9999 and for PRISM/daymet is -9999. I had to use ST_Reclass on the final raster to set all negative values to 9999 to make all the data consistent. In this reclass, I also set the pixel type to 32-bit float.
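A sketch of that normalization using the expression form of ST_MapAlgebra (the comment above used ST_Reclass; either can map the negative sentinels to 9999 — the table name here is hypothetical):

--Map any negative placeholder to 9999 and emit a 32-bit float band.
SELECT ST_MapAlgebra(a.rast, 1, '32BF',
  'CASE WHEN [rast] < 0 THEN 9999 ELSE [rast] END', 9999) as rast
FROM tmp_amalgamated as a;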

The slowest part of the workflow is step 2: the varid key raster takes ~10-20 seconds to be created. The amalgamated raster forms quickly, typically within a few seconds. So, I would estimate about 25-30 seconds per day, or about 121 hours for the entire model time period. Not ideal, but with slurm that's likely closer to 121 / 6 ≈ 20 hours. So, it would likely need to be run in chunks or over the weekend.

I think it would be good to QC this method to make sure all the correct days were selected. Given how complicated the timestamps became (Fiji time vs EST), I think it would be good to trace data from start to finish: looking at a cell in the raw data, find out what rating that watershed ultimately gets, make sure it is higher than the other datasets (and is thus selected as the "best"), and make sure that that same data ultimately makes it into the amalgamated raster, e.g. the correct day is selected. I want to spend some time working on the main issue on this page to flesh out some plausible next steps and to maybe better document some of the raster tricks we've employed across these issues (or maybe I'll add to #54 and #55)

@rburghol (Contributor, Author) commented:

@COBrogan This is fantastic!! And I am eager to know how you generated that raster image :). Regardless of the "how", the "what" is interesting to me in terms of how organic it looks -- at least qualitatively there is no discernible artifacting at this zoom level. Can't wait to dive more deeply!

@durelles commented:

@COBrogan excited to see this too. I’ll have to learn what slurm is…

@rburghol (Contributor, Author) commented:

@COBrogan - to your earlier questions about how to RECURSIVE on this, I still have only a vague idea, and may need some fancy footwork as aggregate functions like ST_UNION are apparently tricky to use in recursive statements (see here), but it goes like this:

with recursive met_sets as (
  select row_number() over (order by rating) as rint,
    rast
  from met_resampled
),
base_set as (
  select rint, rast
  from met_sets
  where rint = 1
  UNION ALL
  --note: raster ST_Union only has aggregate forms, so this pairwise call
  --is still pseudocode; the merge itself would need ST_MapAlgebra or similar
  select n.rint, st_union(b.rast, n.rast)
  from met_sets as n
  inner join base_set as b
  on (b.rint = n.rint - 1)
)
select rast from base_set order by rint desc limit 1;

@rburghol (Contributor, Author) commented:

Actually @COBrogan this extension adds array math capabilities to postgresql and I suspect that it is intended for use on rasters, since the contributors to the project are Paul Ramsey and Regina Obe who are primary postGIS developers/maintainers. https://github.com/pramsey/pgsql-arraymath

In my mind, if the step where we create holes in the dataset is set to make the holes where the data is NOT the best, then we should be able to simply add the 3 (or more) rasters together. In fact, if we can set "not the best" cells to zero maybe that fixes the whole thing and allows us to use ST_UNION with sum mode? Apologies, it's early, and I know we've been over so much of this and I feel like maybe I'm asking questions/proposing alternatives that we've already ruled out as technically impossible…
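A sketch of that zero-fill idea (met_resampled and best_varid_raster are hypothetical tables): blank each source's non-best cells to zero with the two-raster form of ST_MapAlgebra, after which a single aggregate ST_Union in 'sum' mode recombines everything:

--Keep values only where the best-varid raster picked this source's varid,
--write 0 elsewhere, then sum-union the per-source results.
SELECT ST_Union(z.rast, 'sum') as rast
FROM (
  SELECT ST_MapAlgebra(src.rast, best.rast,
    'CASE WHEN [rast2] = ' || src.varid || ' THEN [rast1] ELSE 0 END',
    '32BF') as rast
  FROM met_resampled as src
  CROSS JOIN best_varid_raster as best
) as z;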

@COBrogan (Collaborator) commented Feb 28, 2025

@rburghol @mwdunlap2004 We can now generate plots automatically for both the variable ID (key) rasters and the amalgamated precip rasters using the following. These are stored under the amalgamate scenario with their tsendtime e.g. http://deq1.bse.vt.edu:81/met/amalgamate_storm_vol/plots/

#Variable ID rasters
/opt/model/meta_model/run_model raster_met amalgamate_storm_vol 2022-02-18 auto amalgamate analyze 01
#Amalgamated Precip raster
/opt/model/meta_model/run_model raster_met amalgamate_storm_vol 2022-02-18 auto amalgamate analyze 02

@COBrogan (Collaborator) commented Mar 3, 2025

The following dates failed to amalgamate, all with the same error that occurs in the first step of amalgamate process 03 when trying to resample NLDAS2 to daymet:

ERROR:  invalid memory alloc request size 1216428621

2021-01-01
2021-01-02
2021-01-03
2021-01-04
2021-01-06
2023-12-31

It appears the issue is caused by some very large NLDAS2 rasters. Looking at nldas2_precip_daily, ST_Summary shows it has a large domain, but ST_Count shows most of it is NULL for the raster of 2023-12-31:

st_summary
--------------------------------------------------------------------------------------------
 Raster of 1000x1000 pixels has 1 band and extent of BOX(-83.6255 -80.9995,41.3745 44.0005)+
 band 1 of pixtype 32BF is in-db with NODATA value of 9999

| st_count (with NULLs) | st_count (without NULLs) |
|-----------------------|--------------------------|
| 1000000               | 4632                     |

Compared to 2023-12-30

                                        st_summary                               
----------------------------------------------------------------------------------------
 Raster of 81x80 pixels has 1 band and extent of BOX(-83.6255 34.0005,-73.5005 44.0005)+
     band 1 of pixtype 32BF is in-db with NODATA value of 9999
 Raster of 81x80 pixels has 1 band and extent of BOX(-83.6255 34.0005,-73.5005 44.0005)+
     band 1 of pixtype 32BF is in-db with NODATA value of 9999

It should also be noted that two distinct rasters were selected for 2023-12-30 due to how the daily fractions were developed. Both fixes will involve ensuring a consistent bounding box across the daily fractions and consistent times:

      to_timestamp      |      to_timestamp      |   tstime   | tsendtime  |                                       st_summary                                       | st_count | st_count
------------------------+------------------------+------------+------------+----------------------------------------------------------------------------------------+----------+----------
 2023-12-29 07:00:00-05 | 2023-12-30 07:00:00-05 | 1703851200 | 1703937600 | Raster of 81x80 pixels has 1 band and extent of BOX(-83.6255 34.0005,-73.5005 44.0005)+|     6480 |     4632
                        |                        |            |            |     band 1 of pixtype 32BF is in-db with NODATA value of 9999                          |          |
 2023-12-30 07:00:00-05 | 2023-12-30 18:00:00-05 | 1703937600 | 1703977200 | Raster of 81x80 pixels has 1 band and extent of BOX(-83.6255 34.0005,-73.5005 44.0005)+|     6480 |     4632
                        |                        |            |            |     band 1 of pixtype 32BF is in-db with NODATA value of 9999                          |          |
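For reference, a sketch of the diagnostic used above (the nldas2_precip_daily varkey and the date literal follow this thread's conventions):

SELECT to_timestamp(met.tstime), to_timestamp(met.tsendtime),
  ST_Summary(met.rast),
  ST_Count(met.rast, 1, FALSE) as count_with_nulls,
  ST_Count(met.rast, 1, TRUE) as count_without_nulls
FROM dh_timeseries_weather as met
LEFT JOIN dh_variabledefinition as b
  ON (met.varid = b.hydroid)
WHERE b.varkey = 'nldas2_precip_daily'
  AND to_timestamp(met.tsendtime)::date = DATE '2023-12-31';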

It appears the overly large rasters were a result of bad daily and hourly NLDAS2 fractions. Rerunning these steps in met #64 seems to have fixed this issue and created only one daily fraction per day, but some additional QC may be warranted. The hourly and daily fractions should be rerun across the entire time period to fix any other entries that may be causing issues that were not caught in the slurm reports. e.g.

MODEL_ROOT=/backup/meteorology/
MODEL_BIN=$MODEL_ROOT
SCRIPT_DIR=/opt/model/model_meteorology/sh
export MODEL_ROOT MODEL_BIN SCRIPT_DIR
for j in {1984..2023}; do 
  yr=$j
  doy=`date -d "${yr}-12-31" +%j`
  i=0
  while [ $i -lt $doy ]; do
   thisdate=`date -d "${yr}-01-01 +$i days" +%Y-%m-%d`
   echo "Running: sbatch /opt/model/meta_model/run_model raster_met nldas2 \"$thisdate\" auto met import"
   sbatch /opt/model/meta_model/run_model raster_met nldas2 "$thisdate" auto met import
   i=$((i + 1))
  done
done
