Week 09/16 - 09/23 #1355

Open · 3 of 29 tasks
COBrogan opened this issue Sep 13, 2024 · 11 comments
@COBrogan (Collaborator) commented Sep 13, 2024

  • Poster Competition Discussion: Any good questions or comments from the audience?
  • Temporal Disaggregation of NLDAS2: Integrate temporal disaggregation step into wdm workflow (model_meteorology#86)
  • Amalgamation: Workflow amalgamate (model_meteorology#66)
    • Continue workflow development
      • Identify best performing dataset per coverage
      • Extract rasters per coverage
      • Union rasters together in order of drainage area
    • Design storm identification and incorporation
    • Count of design storm vs. temporal disaggregation
      • Default design storm will be total / 24 to give average hourly precip, kept constant over the day
  • Increasing temporal resolution of ratings workflows
    • simple_lm: GEO_MET_MODEL option 1: Simple lm() analysis (model_meteorology#57)
      • Create a copy of simple_lm that develops monthly regressions and applies them on a weekly basis, i.e., regressions based on four weekly data points applied week by week. Issue created, but needs reworking: GEO_MET_MODEL option 2: Weekly min(e) lm() analysis (model_meteorology#59)
        • Identify the error (model - obs) for each week of that month based on the monthly regression and output error normalized to flow via fractional error $rating = 1 - \frac{|Q_{obs} - Q_{model}|}{Q_{obs}}$ (see the sketch after this list)
        • The goal is to be able to pick the "best" performing dataset on a weekly basis when doing amalgamate
    • stormVol: GEO_MET_MODEL option 3: Storm Hydrograph Analysis and Precip/Hydrograph Volume Analysis (model_meteorology#60)
      • Increasing temporal resolution (model_meteorology#95)
        • The goal is to be able to pick the "best" performing dataset on an event basis when doing amalgamate
        • Create a copy of stormVol that divides events into 8-year periods and runs regressions by month. The end result would be a timeseries in which each month's rating is constant within a given 8-year period
      • Create a copy of the stormVol configs that runs the stormVol approach comparing volume above baseflow to precipitation during the storm event, for a more deterministic analysis
      • Alter creation of the [coverage]-[met_scenario]-lm_simple-rating-ts.csv file to include tstime and tsendtime covering periods, rather than an entry for each day.
        • Verify that our current SQL queries to create the best rating timeseries can handle these arbitrary time periods, i.e., that they do NOT break if we give them something other than a value for every single day in the simulation.
  • Case Studies
    • What is the gage with the biggest difference between NLDAS and daymet, NLDAS and PRISM, and PRISM and daymet?
      • Any obvious trends in residual plots at these gages?
      • How do the largest and mean storms perform in stormVol?
    • GIS visualizations of coverages based on best performing dataset for each method and each month - See code in Week of 09/02/2024 - 09/06/2024 #1348
      • Use om_vahydro_metric_grid. Can get started by pulling L30 from models and beginning to test spatial visualization while we process our results and eventually store them via REST. Can pull scripts from last year's project
    • GIS visualizations of coverages based on best performing dataset for each month - See code in Week of 09/02/2024 - 09/06/2024 #1348
  • Precipitation Dataset Analysis
    • Workflow to characterize coverage precip datasets (was: Data, Ratings + REST)
    • Develop workflow for temporal verification of data: do PRISM, daymet, and NLDAS align reasonably when regressed against each other? A linear regression of any two datasets should yield roughly a 1:1 relationship, e.g. lm(PRISM ~ NLDAS2). We can leverage the baseline variables set up in the *.con files to easily integrate these into our existing workflows
      • If datasets do not align, are there any obvious errors in raster timestamps, e.g. do the regressions of any two datasets improve if one of the datasets is shifted +/- 1 or 2 days, e.g. lm(PRISM$precip[PRISM$time + 1] ~ NLDAS2)?
      • Also add visualizations of comparisons via REST to the workflow. Note: .con files include the variable BASELINE_MET_SCENARIO to provide this info
    • Spatial - temporal relationships across the datasets. Where are some datasets showing higher precip than others?
      • GIS and tabular annual comparisons of precip at each USGS gage coverage with all three datasets
      • Tabular comparisons of monthly precip at each USGS gage coverage with all three datasets
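
As a quick illustration of the fractional-error rating above (a minimal sketch only; the weeklyFlow data frame and its Qobs/Qmodel columns are hypothetical, not the workflow's actual names):

library(dplyr)
#Hypothetical weekly pairing of observed and modeled flow for one gage and month
weeklyFlow <- data.frame(
  week   = 1:4,
  Qobs   = c(120, 95, 200, 60),
  Qmodel = c(110, 130, 185, 58)
)
#rating = 1 - |Qobs - Qmodel| / Qobs: a perfect match scores 1, an error equal
#to the observed flow scores 0, and larger errors go negative
weeklyRating <- weeklyFlow %>%
  mutate(rating = 1 - abs(Qobs - Qmodel) / Qobs)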
@ilonah22 (Collaborator) commented:

Here is the path to the file I was working on today, which finds max and min r-squared ranges, on the data-analysis branch:

HARP-2024-2025/DrainageAreaAnalysis.R

@COBrogan (Collaborator, Author) commented Sep 16, 2024

Thanks @ilonah22! @nathanielf22, here is a useful GIS in R reference I've used in the past:
https://r-spatial.github.io/sf/articles/sf1.html

Leafing through them, these tutorials also seem helpful. The most relevant R code will be the examples that use the sf library, as many other R spatial libraries are slated for deprecation (like rgdal, which used to be very popular):
https://tmieno2.github.io/R-as-GIS-for-Economists/vector-basics.html
https://bookdown.org/michael_bcalles/gis-crash-course-in-r/

There are several chapters in there that go over using sf to do basic GIS work in R. Otherwise, here's some extra code to help get some spatial plots going. It's pretty long, and it assumes you have the ratings files downloaded (the easiest way is sftp, which Ilona has used before, or you can reach out to me and I can get you going). From there, you need the ratings in their own folders, all in a directory called ratings. Let me know if you have any questions!

R Code for SF Plotting
library(tidyverse)
#Assumes all ratings files are in folders like ratings/daymet_stormvol/ or ratings/prism_simplelm/
for(j in c("daymet_stormvol","prism_stormvol","nldas_stormvol","daymet_simplelm","prism_simplelm","nldas_simplelm")){
  print(j)
  #Set path to read in ratings based on the data source of the outer loop
  pathToread <- paste0("ratings/",j,"/")
  for( i in list.files(pathToread) ){
    #For each file in the directory, read in the file i as a csv
    filei <- read.csv(paste0(pathToread,i))
    #Store the gage number
    filei$gage <- gsub(".*_(\\d+)-.*","\\1",i)
    #Store the analysis type
    filei$workflow <- gsub(".*_","",j)
    #Keep only the necessary columns, depending on the workflow:
    if(filei$workflow[1] == "simplelm"){
      filei <- filei[,c("mo","rating","gage","workflow")]
    }else{
      filei <- filei[,c("mo","r_squared","gage","workflow")]
    }
    names(filei) <- c("mo","rating","gage","workflow")
    #Combine the file into a single data frame
    if(!exists("combineFile")){
      combineFile <- filei
    }else{
      combineFile <- rbind(combineFile,filei)
    }
  }
  #Assign to a specific variable and delete the generic combineFile
  assign(paste0('combineFile',j),combineFile)
  rm(combineFile)
}

#Join the daymet and prism data together
joinData <- combineFileprism_stormvol %>% 
  #Rename the rating for clarity
  select(prismRatingstorm = rating,gage,mo,workflow) %>% 
  #Join in the daymet data, but first rename the rating column for clarity.
  #Join on gage, month, and workflow
  full_join(combineFiledaymet_stormvol %>% 
              select(daymetRatingstorm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>% 
  full_join(combineFiledaymet_simplelm %>% 
              select(daymetRatinglm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>% 
  #Add remaining prism data:
  full_join(combineFileprism_simplelm %>% 
              select(prismRatinglm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>% 
  #Join in the nldas data, but first rename the rating column for clarity.
  #Join on gage, month, and workflow
  full_join(combineFilenldas_stormvol %>% 
              select(nldasRatingstorm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>% 
  full_join(combineFilenldas_simplelm %>%
              select(nldasRatinglm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>%
  #For easier viewing, combine lm and storm columns such that there is only one
  #column for prism, daymet, nldas classified by the workflow column
  mutate(prismRating = coalesce(prismRatingstorm,prismRatinglm),
         daymetRating = coalesce(daymetRatingstorm,daymetRatinglm),
         nldasRating = coalesce(nldasRatingstorm,nldasRatinglm)
  ) %>% 
  #Remove now superfluous columns:
  select(-prismRatingstorm,-prismRatinglm,
         -daymetRatingstorm,-daymetRatinglm,
         -nldasRatingstorm,-nldasRatinglm) %>% 
  #Pivot it longer to have a column with the data source and one for the
  #ratings, for plotting ease
  pivot_longer(c(prismRating,daymetRating,nldasRating),
               names_to = 'dataSource',
               values_to = 'rating') %>% 
  arrange(gage,mo,workflow)

#At each gage, does the best performing data source change between workflows?
#The below code is for ESTIMATES ONLY. The left_join assumes that the ratings
#are unique between datasources for each gage, workflow, month. This is brittle
#and could result in incorrect answers!
gageCompare <- joinData %>% dplyr::ungroup() %>% 
  #Group by gage, workflow, and month and find the max rating:
  dplyr::group_by(gage,workflow,mo) %>% 
  dplyr::summarise(maxRating = max(rating,na.rm = TRUE)) %>% 
  #Join the joinData df back in matching by gage, workflow, mo, and rating. This
  #could be problematic with duplicate ratings as in a case where all ratings
  #across the data sources are the same value
  left_join(joinData,by = c('gage','workflow','mo','maxRating' = 'rating')) %>% 
  #Now pivot wider such that we can see ratings and best data sources side-by-side
  pivot_wider(names_from = workflow,values_from = c(maxRating,dataSource)) %>% 
  #Now filter to only find instances where the best data sources are different:
  filter(dataSource_simplelm != dataSource_stormvol) %>% 
  #Create a column showing the difference in rating and arrange by the difference
  mutate(differenceInRating = maxRating_simplelm - maxRating_stormvol) %>% 
  arrange(differenceInRating)


#Isolate one month
gageCompareMonth <- gageCompare[gageCompare$mo == 10,]

library(sf)
#Get the watershed coverage from the server
watershedGeo <- read.csv("http://deq1.bse.vt.edu:81/met/usgsGageWatershedGeofield.csv")

#Get the gage numbers as their own field and store a copy of the data
gageWatershedSF <- watershedGeo
gageWatershedSF$gage <- gsub(".*_(\\d+)","\\1",gageWatershedSF$hydrocode)

#Let's begin by plotting October daymet ratings for stormVol
forPlot <- joinData[joinData$dataSource == 'daymetRating' & 
                      joinData$workflow == 'stormvol' &
                      joinData$mo == 10,]
#Join the geometry data onto our plot data
joinDataSF <- forPlot %>% 
  left_join(gageWatershedSF %>% select(-hydrocode),
            by = 'gage') %>% 
  filter(!is.na(wkt))
#Create an SF object. Specify the coordinate system and the field name in the
#data frame that contains the well-known text. In this case, WKT is the name of
#the field with the polygon geometries
joinDataSF <- st_as_sf(joinDataSF,wkt = 'wkt',crs = 4326)
#Repair broken geometries
joinDataSF <- st_make_valid(joinDataSF)
#Add shape area in coordinate system units (likely meaningless in crs 4326)
joinDataSF$area <- st_area(joinDataSF)
#Order the data by largest to smallest area to make plotting more effective
joinDataSF <- joinDataSF[order(joinDataSF$area,decreasing = TRUE),]
#Plot SF polygons
ggplot(joinDataSF) + 
  geom_sf(aes(fill = rating)) + 
  scale_fill_viridis_c(option = 'magma') + 
  theme_classic()

The above code should generate the image below, showing October adjusted R-squared values for daymet from the Storm Volume method:
[image: map of gage watershed polygons shaded by rating]
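
If it's useful, a small extension of the plot above (this assumes forPlot/joinDataSF are rebuilt without the mo == 10 filter so every month is present) facets the same map by month:

#One panel per month; the fill scale is shared across panels for comparability
ggplot(joinDataSF) +
  geom_sf(aes(fill = rating)) +
  scale_fill_viridis_c(option = 'magma') +
  facet_wrap(~mo) +
  theme_classic()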

@nathanielf22 (Collaborator) commented:

Thank you @COBrogan! I have been attempting to use the code, but I'm having some issues with the creation of gageCompare, the step with all the dplyr calls. The ratings appear to be repeating, which leads to the creation of list-cols that make the data frame unusable. Now I'm trying to sftp the ratings again to be sure I have the right data, but I'm struggling to find the correct folder on the server to sftp from. Could you help point me in the right direction to get the data that works for this R script?
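
One possible way to sidestep that duplicate-rating issue (a sketch only, not an adopted fix): keep a single best row per gage/workflow/month with dplyr::slice_max() instead of the summarise() + left_join() pattern, so pivot_wider() cannot produce list-cols:

library(dplyr)
library(tidyr)
#Keep exactly one best-rated row per gage/workflow/month (ties broken
#arbitrarily), then widen as in the original script
gageCompare <- joinData %>%
  group_by(gage, workflow, mo) %>%
  slice_max(rating, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  rename(maxRating = rating) %>%
  pivot_wider(names_from = workflow, values_from = c(maxRating, dataSource)) %>%
  filter(dataSource_simplelm != dataSource_stormvol) %>%
  mutate(differenceInRating = maxRating_simplelm - maxRating_stormvol) %>%
  arrange(differenceInRating)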

@rburghol (Contributor) commented:

@nathanielf22 -- no need for sftp, as all data should be available via web link with pattern: http://deq1.bse.vt.edu:81/met/[scenario]/

For example, you can find the precip data and flow analyses for things that were run for scenario stormVol_nldas2 (storm volume analysis based on nldas2 data) at http://deq1.bse.vt.edu:81/met/stormVol_nldas2/

@nathanielf22 (Collaborator) commented:

@rburghol, that makes sense. What about the simple lm data? The code includes those on the third line, but there isn't a folder labelled simplelm.

@COBrogan (Collaborator, Author) commented:

The simplelm data is in the folders labeled NLDAS, PRISM, and daymet. Those directories should be renamed eventually, but we haven't done so yet; 'PRISM/stats/' should hold the simplelm results. @rburghol I think sftp is easier, unless there's a way to read all those csv URLs at once into R? So far, I've been telling everyone to just use sftp if they are analyzing ALL 1,080 csv files…
If you agree, @nathanielf22, this data is in /media/model/met/, but you may need to use 'ls' to poke around the directories in there, which is why the URLs would be easier if there's a way to get all of them. So, '/media/model/met/PRISM/out' has the simple lm stats.
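
For what it's worth, read.csv() accepts URLs directly, so individual files can be pulled without sftp; a rough sketch of grabbing every csv for one scenario is below (it assumes the server returns a standard Apache-style directory index with relative href links, which hasn't been verified here):

#Scenario directory on the server (URL pattern noted above)
base <- "http://deq1.bse.vt.edu:81/met/stormVol_nldas2/"
#Scrape the csv file names out of the directory index page
indexHtml <- readLines(base, warn = FALSE)
csvLinks <- unlist(regmatches(indexHtml, gregexpr('href="[^"]*\\.csv"', indexHtml)))
csvNames <- unique(gsub('href="|"', "", csvLinks))
#Read each csv straight from its URL into a named list of data frames
ratingList <- lapply(csvNames, function(f) read.csv(paste0(base, f)))
names(ratingList) <- csvNames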

@rburghol (Contributor) commented Sep 18, 2024

@COBrogan I think that if we are at the point where we need to analyze thousands of files at the same time, then we need to prioritize putting the analysis inside the workflow and storing summary metrics in the database.

By building local desktop workflows to analyze thousands of files downloaded via sftp from the server, we risk wasting time and creating trouble through redundant analysis and downloads, and through data sets going out of date.

I think that we really should be focusing on small case study analysis right now rather than broad-brush global things. When we have analytical lenses that we think are worth repeating over 1,000 files, then we move forward.

As for the scenario names, to be clear, scenario nldas2 is the one that actually has the nldas2 + simple lm data -- I totally agree that we DO need to remedy this, and it will be a good thing to prioritize. It involves only creating a new .con file as a copy of nldas2.con, then rerunning the workflows.

Interestingly, on the server we have the .con files as well: http://deq1.bse.vt.edu:81/p6/vadeq/config/control/met/

@COBrogan (Collaborator, Author) commented:

Hmm. I agree that these metrics should be shifted into the db soon. I think we discussed making a data model sketch for that next week, but we also talked about Nate beginning to create workflows to analyze/QC precip data based on annual means, potential timestamp issues, and spatial visualization of annual precip data. This work will naturally become a workflow since it's based on the precip files we're generating, so no issues there.
I believe the spatial visualization of mean precip for a year is a valuable tool. To test a workflow locally, you'd want to grab all existing coverage precip files. So, I think this work is still valuable as long as we keep in mind that the precip file path or directory should be an input to the script. If we keep to one script = one purpose, then Nate will naturally develop scripts that compute mean precip, which will be a mandatory step in putting it in the db. Maybe we can chat a bit about this tomorrow so I can catch up on our intended structure, but I think Nate's on the right track here toward creating a dataset QC workflow that may become part of geo/process and geo/analyze.
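
A minimal sketch of that one script = one purpose pattern (the script name and the obs_date/precip column names are hypothetical assumptions about the coverage precip files):

#annual_mean_precip.R -- compute mean annual precip for ONE coverage precip file
#Usage: Rscript annual_mean_precip.R path/to/coverage-precip.csv
args <- commandArgs(trailingOnly = TRUE)
precipFile <- args[1]
precipData <- read.csv(precipFile)
#Assumed columns: obs_date (parseable as a date) and precip (depth per timestep)
precipData$year <- format(as.Date(precipData$obs_date), "%Y")
annual <- aggregate(precip ~ year, data = precipData, FUN = sum)
#One purpose, one output: the mean of the annual totals
cat("mean_annual_precip:", mean(annual$precip), "\n")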

@rburghol (Contributor) commented Sep 18, 2024

@COBrogan I 100% agree that we will eventually want to see global trends in all sorts of variables, but the QA workflow step is at the single coverage level, not over the global data sets. I think I may have been too vague in my comments on this previously. I will elaborate below. @nathanielf22

In our workflows, everything operates at a coverage level, so in developing the QA steps we should:

  1. Write code to load single coverage data sets (perhaps multiple components of the data, like precipitation, flow, model predictions…)
  2. Calculate metrics for the single coverage described by the data set (mean, min annual, max, 90 day max/min, single hourly max, daily max, ...) (a minimal sketch follows at the end of this comment)
  3. Create visualizations for that single coverage.
  4. Look for bad values, like NULLs, negatives, or perhaps values that exceed some threshold
  5. Do overall summaries that include record counts.
  6. Dive deep into the data set, trying to find places in the time series where our relationships are good or bad or just plain interesting. This is the kind of thing Nate was showing yesterday, and it's spot on, but we need to go further and also make sure we are relating the files from a single scenario together (unless we are purposely comparing scenarios).
  7. Once we have something that works well, i.e. we have metrics that we think reflect the integrity of the data set, then we can plug that directly into the workflow, which will then get executed for different cases.
  8. We can also add in comparison scripts between two scenarios for a single coverage, for example, the timestamp offset analysis that @COBrogan piloted before.

Otherwise, we spend time doing file management (sftp), and we write a batch workflow that iterates through files rather than a standalone workflow that analyzes a single file for a single coverage. Then we have to disentangle our batch code to operate standalone.

But we already have a robust set of code, om_vahydro_metric_grid, that allows us to retrieve all the metrics from all the scenarios at one time in a very efficient manner, and we have a workflow and file structure to handle iterating over 1,000s of combos; we just have to get ourselves prepared to use it. And preparing to use it is exactly what the single coverage/data set workflow components outlined above accomplish. Once we have them, we will insert the metrics via REST, and then map to our hearts' content.
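
A minimal sketch of items 2, 4, and 5 above for a single coverage (the obs_date/precip column names and the example file name are assumptions, not the actual file layout):

library(dplyr)
#qcOneCoverage: summary metrics and basic integrity checks for one coverage's
#precip time series. Assumes columns obs_date (date) and precip (daily depth).
qcOneCoverage <- function(df) {
  annual <- df %>%
    mutate(year = format(as.Date(obs_date), "%Y")) %>%
    group_by(year) %>%
    summarise(annual_precip = sum(precip, na.rm = TRUE), .groups = "drop")
  list(
    record_count = nrow(df),                          #overall record count
    n_missing    = sum(is.na(df$precip)),             #bad values: NULLs/NAs
    n_negative   = sum(df$precip < 0, na.rm = TRUE),  #bad values: negatives
    mean_daily   = mean(df$precip, na.rm = TRUE),
    max_daily    = max(df$precip, na.rm = TRUE),
    min_annual   = min(annual$annual_precip),
    max_annual   = max(annual$annual_precip)
  )
}
#Example (hypothetical file name): qcOneCoverage(read.csv("onegage-precip.csv"))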

@COBrogan (Collaborator, Author) commented Oct 1, 2024

Working on rerunning workflows simple_lm and stormVol to incorporate changes to JSON and calc_raster_ts. The following workflows need to be rerun:

  • PRISM (simple_lm)
  • daymet (simple_lm)
  • nldas2_tiled (simple_lm)
  • stormVol_daymet
  • stormVol_nldas2_tiled
  • stormVol_prism

ALL other methods are out of date!

@ilonah22 (Collaborator) commented:

Update for Thanksgiving break:

On Tuesday of this week, I met with Connor to integrate some of the scripts I have been working on recently into the workflow, which went relatively smoothly. A few issues came up with some steps of the workflow, which we tried to fix as we went.

My next step is going to be trying to split up the StormEventsLM_cmd.R script into multiple parts, because right now it's performing multiple functions in the same script, which we don't want for our final workflow.
