REST / RomDataSource, metrics and analysis #1354

Open
11 of 24 tasks
rburghol opened this issue Sep 12, 2024 · 7 comments
Comments


rburghol commented Sep 12, 2024

Analysis Steps

  • Build script to create and save summary analytic values for a given part of our operation
    • Analysts must install hydrotools package https://github.com/HARPgroup/hydro-tools
    • Get rest account credentials set up in auth.private (harp specific rest account- @nathanielf22 can give other analysts details)
    • RWB Gives analysts vahydro accounts for their future use.
    • Base analysis workflow script on https://github.com/HARPgroup/meta_model/blob/main/scripts/met/met_store_info.R
    • Review how to retrieve or create the containers for storing data in the database via REST with the following (more details below under REST Find feature, Model and Scenario)
      • Uses fn RomFeature to retrieve the coverage of interest
      • Uses fn om_model_object() to retrieve or create a model property on the feature
      • Uses fn om_get_model_scenario() to retrieve or create a scenario container property on the model
      • Uses fn vahydro_post_metric_to_scenprop() to save the metrics to the scenario container prop
      • Review Using om_vahydro_metric_grid() to retrieve large amounts of data.
    • Analyst begins developing metrics, visualizations, etc. in advance of the meeting.
      • At the meeting we decide what coverage type, workflow module, branch, and step to insert the workflow into, or we modify the analysis to fit a need.
      • What workflow step does your script belong in?
      • Create issue for this workflow step
    • Arguments needed:
      • scenario = met_scenario (e.g., stormVol_nldas2)
      • location/coverage identifier (like hydrocode in most cases)
      • Data source file, ex: rating CSV path as input (do not guess the file name! discuss in group meetings)
      • Others? These will be script specific, detail in specific script issue for workflow step.
    • What metrics will the script create? Some sample metrics are included below
      • Note: For summarizing regression models, we can save individual monthly $R^2$ values as properties with propname jan, feb, mar, ... (perhaps organized below a container property named monthly_rsq)
      • Further metrics, create and illustrate in a specific script issue for the workflow step

Overview

Data Model & REST

Data Model Outline

REST Find feature, Model and Scenario

  • Feature
    • Function RomFeature to retrieve object
      • Arguments:
        • hydrocode: unique identifier, ex: usgs_ws_01613900, N51113
        • bundle: general feature class, ex: watershed, landunit
        • ftype: specific feature type (source of data), ex: usgs_full_drainage, cbp6_landseg
    • Use: feature <- RomFeature$new(ds, list( hydrocode=coverage_hydrocode, ftype=coverage_ftype, bundle=coverage_bundle), TRUE)
    • Returns: a feature object of class RomFeature
    • See also: Detail View of REST Feature Query below
  • Model
    • Function om_model_object()
      • Arguments:
        • feature: an object of class RomFeature
        • model_version: model overall version, ex: pf-1.0, cbp-6.1
      • Use: model <- om_model_object(ds, feature, model_version)
  • Scenario:
    • Function: om_get_model_scenario()
    • Arguments:
      • model
      • scenario_name
    • Use: scenario <- om_get_model_scenario(ds, model, scenario_name)
  • Metric:
    • Function: vahydro_post_metric_to_scenprop()
    • Use: vahydro_post_metric_to_scenprop(pid, varkey, propcode, propname, propvalue, ds)
    • Arguments:
      • pid: the scenario's unique integer identifier (pid)
      • varkey: use om_class_Constant for most decimal attributes
      • propcode: optional (NULL in the example below)
      • propname: required, descriptive variable name
      • propvalue: required
      • ds: RomDataSource
    • Ex: vahydro_post_metric_to_scenprop(scenario$pid, 'om_class_Constant', NULL, 'num_records', numrecs, ds)
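Putting the four functions above together, a metric-posting script might look like the sketch below. This is a sketch assuming the hydrotools API behaves as documented in this issue; `ds` is an authenticated RomDataSource (set up via auth.private, construction not shown here), `numrecs` stands in for a metric you computed, and the argument values are examples only.

```r
# Sketch: end-to-end flow from coverage to posted metric, per the
# function descriptions above. Assumes hydrotools is loaded and `ds`
# is an authenticated RomDataSource.
coverage_hydrocode <- 'usgs_ws_01613900'
coverage_bundle <- 'watershed'
coverage_ftype <- 'usgs_full_drainage'
model_version <- 'cbp-6.1'
scenario_name <- 'stormVol_prism'

# retrieve the coverage of interest
feature <- RomFeature$new(
  ds,
  list(hydrocode = coverage_hydrocode,
       ftype = coverage_ftype,
       bundle = coverage_bundle),
  TRUE
)
# retrieve or create the model property on the feature
model <- om_model_object(ds, feature, model_version)
# retrieve or create the scenario container property on the model
scenario <- om_get_model_scenario(ds, model, scenario_name)
# save a metric (numrecs, computed elsewhere) to the scenario container
vahydro_post_metric_to_scenprop(scenario$pid, 'om_class_Constant', NULL,
                                'num_records', numrecs, ds)
```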
Detail View of REST Feature Query
feature <- RomFeature$new(
  ds,
  list(
    hydrocode=coverage_hydrocode,
    ftype=coverage_ftype,
    bundle=coverage_bundle
  ),
  TRUE
)

Using om_vahydro_metric_grid to retrieve data


# GET VAHydro 1.0 RIVERSEG l90_Qout DATA
mdf <- data.frame(
  'model_version' = c('cbp-6.1'),
  'runid' = c('stormVol_prism'),
  'metric' = c('precip_annual_max_in'),
  'runlabel' = c('precip_annual_max_in')
)
met_data <- om_vahydro_metric_grid(
  metric = metric, runids = mdf, bundle="watershed", ftype="usgs_full_drainage",
  base_url = paste(site,'entity-model-prop-level-export',sep="/"),
  ds = ds
)

@nathanielf22

Here is the current function, which I updated a bit tonight. This works for any PRISM, daymet, or NLDAS2 data, such as:
read.csv("http://deq1.bse.vt.edu:81/met/stormVol_prism/precip/usgs_ws_01613900-PRISM-all.csv")

When calculating a metric, you can run something like this to return the lowest 90-day flow:
summary_analytics(prism)$l90_precip_in

library(tidyr)
library(sqldf)
library(zoo)

# For ANY precip data (hourly or daily)
summary_analytics <- function(df) {
  # split obs_date ("YYYY-MM-DD HH:MM:SS") into date and time,
  # then split the date into year, month, day
  df <- separate(data = df, col = obs_date,
                 into = c("obs_date", "obs_time"), sep = " ")
  df <- separate(data = df, col = obs_date,
                 into = c("obs_year", "obs_month", "obs_day"), sep = "-")

  # yearly summary with each year and its total precip
  yearly.summary <- sqldf(
    "SELECT obs_year, SUM(precip_in) AS total_precip
     FROM df
     GROUP BY obs_year"
  )

  # summary analytics
  precip_annual_max_in <- max(yearly.summary$total_precip)
  precip_annual_max_year <-
    yearly.summary$obs_year[which.max(yearly.summary$total_precip)]
  precip_annual_mean_in <- mean(yearly.summary$total_precip)

  # For min values and years, exclude the first and last rows since the
  # first and current years have incomplete data. Note the year lookup
  # subsets obs_year the same way so the which.min() index lines up.
  keep <- c(-nrow(yearly.summary), -1)
  precip_annual_min_in <- min(yearly.summary$total_precip[keep])
  precip_annual_min_year <-
    yearly.summary$obs_year[keep][which.min(yearly.summary$total_precip[keep])]

  # daily summary, used for all data; this makes hourly data daily sums
  daily.summary <- sqldf(
    "SELECT obs_year, obs_month, obs_day, SUM(precip_in) AS total_precip
     FROM df
     GROUP BY obs_year, obs_month, obs_day"
  )
  precip_daily_max_in <- max(daily.summary$total_precip)

  # If the data has 24 unique hours it is hourly, so take the hourly max;
  # otherwise (daily data) hourly max is NA. Alternatively, for daily data
  # we could look at the rainfall distribution table for a type II storm
  # and multiply by the max P(t)/P(24) value; this function uses NA.
  if (length(unique(df$hr)) == 24) {
    precip_hourly_max_in <- max(df$precip_in)
  } else {
    precip_hourly_max_in <- NA
  }

  # l90: minimum of the 90-day rolling mean, via the zoo package
  # (IHA did not work)
  l90_precip_in <- min(
    rollapply(daily.summary$total_precip, width = 90, FUN = mean,
              fill = NA, align = "right"),
    na.rm = TRUE
  )

  # data frame with all 8 metrics, returned as the function value
  metrics <- data.frame(
    precip_annual_max_in = precip_annual_max_in,
    precip_annual_max_year = precip_annual_max_year,
    precip_annual_mean_in = precip_annual_mean_in,
    precip_annual_min_in = precip_annual_min_in,
    precip_annual_min_year = precip_annual_min_year,
    precip_daily_max_in = precip_daily_max_in,
    precip_hourly_max_in = precip_hourly_max_in,
    l90_precip_in = l90_precip_in
  )
  metrics
}
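As a side note on the l90 line above: it takes the minimum of a right-aligned 90-day rolling mean of the daily totals. A toy base-R equivalent (width 3 instead of 90, and no zoo dependency, so the numbers are easy to check by hand) shows what each rolling window contributes:

```r
# Base-R stand-in for zoo::rollapply(x, width, mean, fill = NA, align = "right"):
# each position i gets the mean of the `width` values ending at i, NA before
# a full window exists.
roll_mean <- function(x, width) {
  sapply(seq_along(x), function(i) {
    if (i < width) NA else mean(x[(i - width + 1):i])
  })
}

x <- c(4, 2, 6, 1, 3)
roll_mean(x, 3)
# NA NA 4.000000 3.000000 3.333333
min(roll_mean(x, 3), na.rm = TRUE)
# 3  -- the "lowest 3-value mean", analogous to l90 with width = 90
```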

I can run through this in detail next week since I know this is a lot to look at. I'll be out of town tomorrow afternoon through Sunday and likely out of service, but I will do my best to respond before I leave or after I get back this weekend. In the meantime, I wanted to make sure this is stored here.

@rburghol
Contributor Author

@nathanielf22 hey Nate, thanks a bunch for pushing this up before you headed out. Very much look forward to working with this, and congratulations on the progress.

@nathanielf22
Collaborator

nathanielf22 commented Oct 25, 2024

10/25 meeting:

  • Review REST services to input metrics
  • Github push troubleshooting
  • Review VAHydro_metric_script.R
  • new: yearly_metrics.R

@rburghol
Contributor Author

rburghol commented Oct 25, 2024

@nathanielf22 if you can post up the input variables that created that strange output, I will check on my machine. But, from my recollection, I think that we used the wrong coverage_ftype, and in my testing with another coverage (details below), I got correct results. The ftype should be usgs_full_drainage for those USGS gage watershed features.


scenario_name <- 'daymet2'
coverage_hydrocode <- 'usgs_ws_01673550'
coverage_bundle <- 'watershed'
coverage_ftype <- 'usgs_full_drainage'
model_version <- 'cbp-6.1'
met_file <- 'http://deq1.bse.vt.edu:81/met/daymet/precip/usgs_ws_01673550-daymet-all.csv'

@rburghol
Contributor Author

@nathanielf22 -- I have one observation about the script as is. I tested it in full, and it works! The only thing you should update is the variable name summary: there is an existing R function called summary(), and assigning the output of summary_analytics() to summary shadows that function, which is fairly important and used a good deal. precip_summary might be a good option :).
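To illustrate the shadowing concern, here is a toy example (not part of the workflow). Note that function calls like summary(x) still resolve to base::summary even while shadowed, which makes the mistake easy to miss; it is lookups of the name as a value that break:

```r
# Shadow base::summary with a data frame in the global environment
summary <- data.frame(num_records = 42)

get("summary")                        # the data frame, not the function
exists("summary", mode = "function")  # TRUE: base::summary is still reachable for calls

rm(summary)                           # removes the shadow
is.function(summary)                  # TRUE: the name resolves to base::summary again
```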

@nathanielf22
Collaborator

@rburghol @COBrogan -- The code we put together last Friday to integrate into the workflow has an issue with the daymet and NLDAS2 data and should be updated. When including the n >= 365 days limitation, daymet excludes certain years where it has only 364 days, as shown below:
[two screenshots of the yearly summary output showing the excluded 364-day years]

In addition, NLDAS2 has an issue because the yearly summary was created before the daily summaries; I switched that around.
This data is for gage usgs_ws_01613900.

@COBrogan
Collaborator

@nathanielf22 The daymet dataset will always have 365 days in each year. It always excludes December 31st on leap years (but includes February 29th, which is an unusual set-up). But remember that daymet is based on the GMT time zone, which means that each day in EST runs from 7 AM to 7 AM the next day. So, when grouping data by year, it matters whether you are grouping by tstime or tsendtime.

In this case, we can trace through our routines and see that we are grouping data by tsendtime (a confusing by-product of how PRISM operates and an assumption we made about daymet based on a daymet-PRISM regression). That means that leap years will always have 366 data points and the year after will have 364. To better demonstrate this, we are essentially beginning our year at 7 AM January 2nd and ending it on 7AM Dec 31. So, since a daymet leap year does not have Dec 31st, it must have 364 days.

My suggestion for now is to include years with >= 364 days, but let's discuss at our next meeting!

For reference, note that daymet has 365 days each year when grouping by tstime:

select count(ts.rast),extract(year from to_timestamp(ts.tstime))
from dh_timeseries_weather as ts
left join dh_variabledefinition var on ts.varid = var.hydroid
where var.varkey = 'daymet_mod_daily'
group by extract(year from to_timestamp(ts.tstime));

 count | extract
-------+---------
   365 |    1980
   365 |    1981
   365 |    1982
   365 |    1983
   365 |    1984
   365 |    1985
   365 |    1986
   365 |    1987
   365 |    1988
   365 |    1989
   365 |    1990
   365 |    1991
