Weather Data QA #72

Open · 27 of 29 tasks
rburghol opened this issue Jun 25, 2021 · 20 comments

@rburghol
Contributor

rburghol commented Jun 25, 2021

Big questions:

  • Note: Some of this will be best done on the files that are produced after area-weighting, but some may be best done on the raw files themselves.
  • How do we know that we have all of the data we need?
    • Since the WDM data is organized by county, we can express our need as:
      • we need grid cells that overlap every county that overlaps every river segment that flows into or through Virginia
  • Do we have the time periods we need?
  • Are the data sane?
    • look for the missing-data code (-9999)
    • anomalously high values
    • gaps in data: temperature should have a value for every day; ET may have days that are 0.0; precip will definitely have days that are zero
  • Data model/workflow for storing data QA results in VAHydro
    • Can use same data model for each grid cell
    • Do we store data for each land segment? (definitely!)
    • How do we store these: as time series values with properties attached each time we update our datasets?
    • General Data Model
      • Use: Feature -> Model -> nldas_datasets -> XXXX-YYYY -> QA data name
      • All attributes are numerical constants at this time
      • Ex:
        • Feature: ows-watershed-dash-info/595100
        • Model: om-model-info/6863465/dh_properties
        • nldas_datasets: om-model-info/6863472/dh_properties
        • 1984010100-2020123123: om-model-info/6863473/dh_properties
          • propname = 1984010100-2020123123
          • PRC_anomaly_count = 2,
          • PRC_daily_error_count = 0
          • PRC_hourly_error_count = 0
          • record_count = 324360
  • Script:
  • Integrate into workflow: Running P5.3.2 (Southern Rivers) for new Meteorology #166
  • Prop helper: nldas_feature_dataset_prop()
  • Problem cells
    • x385y94, multiple, ex: DDPT, x385y94, 1986 , -40.3535233
    • x386y94, multiple, ex: DDPT, x386y94, 1986 , -40.3506203
    • x386y95, multiple, ex: DDPT, x386y95, 1986 , -40.3687019
    • x387y95, multiple, ex: DDPT, x387y95, 1986 , -40.2827759
    • x388y95, multiple, ex: DDPT, x388y95, 1986 , -40.0947647
    • x388y96, multiple, ex: DDPT, x388y96, 1986 , -40.0809441
    • x389y96, multiple, ex: DDPT, x389y96, 1986 , -39.9123192

QA Scripts/Code Samples

Find -9999 in any file in the downloaded and parsed grid cell data

cd /backup/meteorology
fgrep -R "-9999" ./out/grid_met_csv/*
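
A hedged R sketch of computing the per-cell QA summary values named in the data model above (record_count, error counts, anomaly counts). The year/mo/da/hr/value column layout follows the read.table examples later in this thread; the zPP filename and the 1.0 in/hr precip threshold are illustrative assumptions, not the production script:

# R sketch: QA summary for one grid cell file (layout assumed: year mo da hr value)
prc <- read.table("out/grid_met_csv/1986/x385y94zPP.txt")  # hypothetical precip file
names(prc) <- c('year','mo','da','hr','value')
record_count  <- nrow(prc)                # should equal the hours in the period
error_count   <- sum(prc$value == -9999)  # missing-data code
anomaly_count <- sum(prc$value > 1.0)     # e.g. hourly precip > 1.0 in/hr
c(record_count = record_count, errors = error_count, anomalies = anomaly_count)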
@alexwlowe
Contributor

alexwlowe commented Jun 28, 2021

Answers to the Big Questions

  • Do we have the time periods we need?
    • We have downloaded all of the raw .grb NLDAS data (1979-present, organized by year in the /backup/meteorology directory).
    • We are now extracting the time series for all the grids in VA's minor basins (using NLDAS_GRIB_to_ASCII). We are extracting 2015-present for every grid, and 2005-2014 for just the grids in the southern rivers.
      • These are the dates Rob told us we needed; can anyone else verify that they are correct?
  • Do we have all of the landsegs that we need?
    • We have a list of all of the land segments in VA's basins that are OUTSIDE of VA (in an Excel file in the drive). I believe Rob told us to only find these.
      • Do we have a list of the ones inside VA? Maybe somewhere in the drive or on github
  • Are the data sane?
    • There is a value for every TT (temp), VP (pressure), RH (humidity), and WD (wind speed), but not for ET (evapotranspiration), RN (radiation), and PP (precip)
    • In terms of missing data from grids, there are hundreds of grids so we will probably find out if there are missing grids when we batch run NLDAS_ASCII_to_LSEGS

@rburghol
Contributor Author

  • dates: I amend my statement: we need to do 1984-present for southern rivers, since we will be creating new data files, not appending to old ones
  • "We have a list of all of the land segments in VA's basins that are OUTSIDE of VA (in an excel in the drive)." We need verification of this. The list of VA land segs is in the GIS files. This should be a priority
  • QA - per the previous one, we need you all to figure out a way to verify that you have all the data you need. The "we'll probably find out if it blows up after we run NLDAS_ASCII_to_LSegs" approach isn't sufficient. We can't afford a probably; we need a plan.
  • Precip: please elaborate - if we don't have precip we are pretty much dead in the water. Is this everywhere or just somewhere?
  • Temp: great!
  • Figure out verification methods: how could you verify? Imagine you are working in a bank: we're pretty sure we have all the money, guess we'll find out when we try to spend it... some ideas:
    • visual verification
    • Queries using joins
    • Charts of your data for a single grid cell
    • Charts of your data for a land segment (for testing purposes as you begin to explore)
    • How many records in each file?
    • Averages, totals, mins and maxes -- what is an anomalous value? Any negative precip? (see the sketch below)
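
A minimal sketch of those checks in R -- not the team's actual script -- assuming the grid cell files use the whitespace-delimited year/mo/da/hr/value layout seen elsewhere in this issue (the path is hypothetical):

prc <- read.table("out/grid_met_csv/2017/x379y95zPP.txt")  # hypothetical cell/year
names(prc) <- c('year','mo','da','hr','value')
nrow(prc)              # how many records in the file?
summary(prc$value)     # averages, mins and maxes at a glance
sum(prc$value < 0)     # any negative precip?
aggregate(value ~ year + mo, data = prc, FUN = sum)  # monthly totals to chart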

@alexwlowe
Contributor

Hey Rob,

  • Sounds good on the dates; we can extract time series for those years for the southern rivers.
  • We talked to Joey and Dr. Scott about the land segments in VA. We are going to look around in the Google Drive/GitHub for a list, and if we cannot find anything then we will go ahead and generate a list ourselves.
  • In terms of making sure we have all of the data, all of your suggestions sound great. Yesterday during the meeting, Joey mentioned that it would be good to start making some visualizations so they can also act as verification. I can also check how many lines are in the .txt files to make sure every day has data.
  • Precip: We do have precipitation; I just meant that some of the days have a value of zero, which is good!

@rburghol
Contributor Author

Excellent - thanks for the update! Glad to see I misinterpreted the precip status!

@alexwlowe
Contributor

http://deq1.bse.vt.edu:81/met/ - web address for the /backup/meteorology directory

@alexwlowe
Contributor

alexwlowe commented Jul 8, 2021

Various docs/resources that we have used for QA

  • google doc that outlines the entire process of everything we have done so far with important notes, steps, etc.
  • excel doc that contains basins, sections of VA and their corresponding grids.
  • rnoaa guide: a cheat sheet guide to the rnoaa package in R that can be used to download NOAA data from various stations. We have been using this to check our data to make sure it is similar to the station data across VA.
    • Last page of google doc describes how null values are represented in NOAA dataset
  • google slides presentation with simple visualizations of the data for QA purposes. It also has some graphs comparing the data with NOAA data throughout the state.
  • nldas data overview that shows what values happen if there is missing data

@katealbi11
Contributor

katealbi11 commented Jul 14, 2021

Helpful links for null values: GHCN-Daily readme (https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt) and open-data-docs/docs/noaa/noaa-ghcn at main · awslabs/open-data-docs (github.com)

  • ID = station identification code
    • 0 = unspecified (network code in the station ID)
  • WT** = weather type, where ** is an identifying value
    • 19 = unknown source of precipitation
  • VALUE1 = value on the first day of the month (missing = -9999)
  • MFLAG1 = measurement flag for the first day of the month
    • Blank = no measurement information available
    • P = identified as “missing presumed zero” in DSI 3200 and 3206
  • QFLAG1 = quality flag for the first day of the month
    • Blank = did not fail any quality assurance check
    • D = failed duplicate check
    • G = failed gap check
    • I = failed internal consistency check
    • K = failed streak/frequent-value check
    • L = failed check on length of multiday period
    • M = failed megaconsistency check
    • N = failed naught check
    • O = failed climatological outlier check
    • R = failed lagged range check
    • S = failed spatial consistency check
    • T = failed temporal consistency check
    • W = temperature too warm for snow
    • X = failed bounds check
    • Z = flagged as a result of an official Datzilla investigation
  • SFLAG1 = source flag for the first day of the month
    • Blank = no source
  • ...and so on through the 31st day of the month. Note: if the month has fewer than 31 days, the remaining variables are set to missing (e.g., for April, VALUE31 = -9999, MFLAG31 = blank, QFLAG31 = blank, SFLAG31 = blank)
  • GSN FLAG = flag that indicates whether the station is part of the GCOS Surface Network
    • Blank = non-GSN station
    • GSN = GSN station
  • HCN/CRN FLAG: Blank = not a member
  • WMO ID: Blank = not assigned a number

@alexwlowe
Contributor

7/15/2021 Update

When batch running the land segments we discovered that we had missed some grids when using the NLDAS2_GRIB_to_ASCII function. The grids we were missing are not inside VA's minor basins; however, they are inside land segments that are partially in VA's minor basins.

All of the missing grids are currently being extracted for 1984-2020 (I used the handy nohup trick Rob showed us yesterday, so it should run all night even when I get signed out of deq4). I will probably log in for a couple of minutes tomorrow when the grids are finished downloading and batch run NLDAS2_ASCII_to_LSegs, so that on Monday we should be able to begin implementing our PET calculations with all of our data!

UPDATE ON MISSING VALUES IN DATA

  • The search returned no -9999 values in the /out/grid_met_csv folder
  • Got part of the way into checking for NA values until deq4 crashed (we all got logged out by a 'remote user' for a couple of minutes, not sure why). However, there were no instances of NA in the data it did get through.

@katealbi11
Contributor

NLDAS vs RNOAA Difference Graphs

Example for One Month: July 2017
[screenshot]

Example for Total Monthly Precipitation 2017
[screenshot]

@kylewlowe
Contributor

kylewlowe commented Jul 22, 2021

We ran into a couple of land segments with missing data:

  • Land segs A37001 and A37135 are missing temperature data between Oct. 27 and Dec. 31 of 2008
  • Values are stored as 0.01 in the terminal, and the 10-29 hour 19 record is repeated until 01-01-2009 hour 1
  • I left a nohup command running in the /backup/meteorology directory to look for any other 0.01 values, so we should be able to find any other missing data once it finishes (a sketch of flagging such repeated-value runs follows)
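
A sketch of flagging stuck-value runs like the one above, using R's rle() to find consecutive repeats; the file path is hypothetical and the 48-hour cutoff is an assumed tolerance:

tmp <- read.table("out/grid_met_csv/2008/x379y95zTT.txt")  # hypothetical cell
names(tmp) <- c('year','mo','da','hr','value')
runs <- rle(tmp$value)
suspect <- runs$lengths > 48  # same value for more than 48 straight hours
data.frame(value = runs$values[suspect], hours = runs$lengths[suspect])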

The problem seems to be that some of the grid data didn't finish downloading in 2008. We are currently redownloading and updating deq4 with complete timeseries data for both grids and the corresponding land segments. The fact that our function caught the missing data is a good sign, and we should be able to fix the issue and have all the ET csv files on deq4 by the meeting on Monday.

7/26/2021 Update:

  • All of the missing data has been completely redownloaded
  • The .HET and .HSET csv files for each land segment are up on deq4
    • They are all in the /backup/meteorology/out/lseg_csv/1984010100-2020123123 directory

@alexwlowe
Contributor

[image: PET method comparison]
Here is a graph comparing the different potential evapotranspiration methods. All of the land segments and years I have graphed show similar trends: Gopal's PET is larger during the summer, the Hamon method is smaller during the summer, and Hargreaves-Samani is kind of all over the place.

@kylewlowe
Contributor

kylewlowe commented Sep 20, 2021

9/20 update on issue that was previously being tracked in #122

The missing-data issue we were dealing with over the summer involved two land segments that are not near the land segments in the new problem we have been discussing. Therefore, the previous NLDAS2_ASCII_to_LSegs run having used bad grid data is not what caused the bad precip time series data.

Here is a side-by-side comparison of the same time period in the old and new data. There seems to be no pattern from what I can see.

Table:
[screenshot]

Line plots:
One day:
[image]
Whole timeseries:
[image]

Log plots:
One day:
[image]
Whole timeseries:
[image]

After reviewing the summary stats, the precip_annual and 90_day_max_precip columns contain some extremely high values. Searching and filtering each land segment on these will be a QA test to run.

@katealbi11
Contributor

katealbi11 commented Sep 27, 2021

Another way of finding anomalies in the data: compute the upper and lower quartiles of the data set and the IQR; if a value is more than 1.5*IQR beyond the quartiles, it is flagged as an outlier (see the sketch below).

  • A way of using the data itself to determine anomalies
  • An extreme outlier would be anything more than 3*IQR beyond the quartiles -> our values are greater than this threshold, but not by too much -> still plausible!
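
A minimal R sketch of that flagging rule, where x is any numeric series (e.g. daily precip for one land segment):

iqr_flags <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  data.frame(
    value   = x,
    outlier = (x < q[1] - 1.5 * iqr) | (x > q[2] + 1.5 * iqr),  # 1.5*IQR fences
    extreme = (x < q[1] - 3.0 * iqr) | (x > q[2] + 3.0 * iqr)   # 3*IQR fences
  )
}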

@kylewlowe
Contributor

Searching through all of the land segment data and flagging yearly precipitation values greater than 150 inches resulted in 30 flagged land-segment years. For whatever reason, 2008 seems to have been a problem year. However, this uses the data from before we reran the function, which fixed the 2 land segments we have been looking at. It will be interesting to see if the data is now fixed for every single one of these land segments too (this is also a reminder for me to do that tomorrow).

Here is the .txt with year and land segment:
FlaggedLsegs.txt
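
For reference, a sketch of the kind of query that produces such a list; the land segment file path, the .PRC extension, and the lack of a header row are assumptions, and 150 in/yr is the threshold described above:

prc <- read.csv('/backup/meteorology/out/lseg_csv/1984010100-2020123123/A37001.PRC',
                header = FALSE)  # hypothetical path; assumes no header row
names(prc) <- c('yr','mo','da','hr','value')
annual <- aggregate(value ~ yr, data = prc, FUN = sum)
annual[annual$value > 150, ]  # land-segment years exceeding 150 inches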

@rburghol
Contributor Author

@kylewlowe great outcome ^^. Eagerly anticipating your re-run to see if it fixes many of these.

@kylewlowe
Contributor

Update on rerunning the flagged-segments function:

All of the 2008 values seem to have fixed themselves after the re-run. However, the two 1985 land segments did not change. We checked the grid data for the corresponding grids to see if the raw meteorological data downloaded incorrectly for 1985, and found a grid that only has data up until hour 10 on June 10th. The grid is x382y101, which is in both land segments. This was probably a result of the grib_to_ascii function not finishing, or the actual raw data download from NLDAS not finishing while running over the summer. We will continue working to figure out which of these is the problem and redownload the necessary data tomorrow.

@kylewlowe
Contributor

Update on Timeseries QA after reimporting data

All data checked out, with nothing overly unusual. Flagged-segment txt files for each metric (DPT, PRC, etc.) are located in the /backup/meteorology directory for viewing of individual values.

The number of flagged data points is as follows:

  • Precipitation - 389
  • Dew Point - 96
  • Wind Speed - 45
  • Evapotranspiration - 460
  • Radiation - 136

The test values used were the same values used before the database reset. They are as follows:

  • Precip > 1.0 in/hr
  • PET > 0.035 in/hr
  • Dew point > 27 C (86 F)
  • Wind speed > 50 mph
  • Solar Radiation > 90 ly/hr
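
A sketch of applying one of those thresholds to a land segment's hourly series (precip > 1.0 in/hr; the other metrics differ only in file and cutoff). The path, the .PRC extension, and the lack of a header row are assumptions:

prc <- read.csv('/backup/meteorology/out/lseg_csv/1984010100-2020123123/A51800.PRC',
                header = FALSE)  # hypothetical path; assumes no header row
names(prc) <- c('yr','mo','da','hr','value')
flagged <- prc[prc$value > 1.0, ]  # hourly precip above the test value
nrow(flagged)                      # count of flagged data points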

@rburghol
Contributor Author

rburghol commented Jul 29, 2022

  • Try exporting a single grid cell
    • Make CSV for each cell of a land segment with bad cells: ./a2l_test 1984010100 2020123123 /backup/meteorology/out/grid_met_csv /backup/meteorology/out/lseg_csv A51800
    • Make WDM for a single cell and look for errors: wdm_pm_one x385y94 1984010100 2020123123 nldas2 harp2021 nldas1221 p20211221
    • DPT has errors in x385y94; multiple, ex: DDPT, x385y94, 1986 , -40.3535233
  • Reprocess the bad year: ./g2a_one.bash 1986010100 1986123123 /backup/meteorology /backup/meteorology/out/grid_met_csv x385y94
    • Examine equations (see R below)
    • The range of DPT seems reasonable in 1986, never falling into the range reported in the error
  • Now, try reprocessing all cells for the bad year with grid2land.sh
    • Reprocess: grid2land.sh 1986010100 1986123123 /backup/meteorology /backup/meteorology/out/grid_met_csv A51800
    • Lseg CSV: a2l_one 1984010100 2020123123 /backup/meteorology/out/grid_met_csv /backup/meteorology/out/lseg_csv A51800
    • Regen CLDC/SolarRad: LongTermAvgRNMax /backup/meteorology/out/lseg_csv/1984010100-2020123123 /backup/meteorology/out/lseg_csv/RNMax 1
    • Lseg WDM:
  • Proceeded perfectly!!

R code to examine the equations:

library("sqldf")
# from fgrep DPT /opt/model/model_meteorology/nldas2/NLDAS2_ASCII_to_LSegs.cpp
# we get:
# DPT = 237.7 * ( (17.271*TMP/(237.7+TMP)) + log(RHX) ) / (17.271 - ( (17.271*TMP/(237.7+TMP)) + log(RHX) ));

# read temp
tmp <- read.table("out/grid_met_csv/1986/x385y94zTT.txt");  
# read rh
rh <- read.table("out/grid_met_csv/1986/x385y94zRH.txt");
names(rh) <- c('year','mo','da','hr','value')
names(tmp) <- c('year','mo','da','hr','value')

dpt <- sqldf(
  "
    select a.year, a.mo, a.da, a.value as temp, b.value as rh
    from tmp as a 
    left outer join rh as b 
   on (
      a.year = b.year
      and a.mo = b.mo
      and a.da = b.da
      and a.hr = b.hr
    )
")

dpt$dpt <- 237.7 * ( (17.271*dpt$temp/(237.7+dpt$temp)) + log(dpt$rh) ) / (17.271 - ( (17.271*dpt$temp/(237.7+dpt$temp)) + log(dpt$rh) ))

quantile(dpt$dpt)

        0%        25%        50%        75%       100%
-19.296831   3.598815  11.937289  19.029749  25.111637

R code to examine the data:

rad <- read.csv('/opt/model/p53/p532c-sova/input/unformatted/nldas2/harp2021/1984010100-2020123123/A51800.RAD')
names(rad) <- c('yr', 'mo', 'da', 'hr', 'value')
quantile(rad$value)
     0%     25%     50%     75%    100%
 0.0000  0.0000  0.0925 29.4794 92.0369

@rburghol
Contributor Author

rburghol commented Apr 17, 2023

  • N51053 has an anomaly in precip
  • Looks like:
    [image]
  • Investigate; fyi, cells for a landseg are obtained with: ./nldas_land_cells N51053
  • make a csv of each cell in the land segment (a2l_test will create the CSV files for each cell):
sdate=1984010100
edate=2022123123
./a2l_test $sdate $edate /backup/meteorology/out/grid_met_csv /backup/meteorology/out/lseg_csv N51053
  • Now, go through the cell pairs (N51053: 18 x379y95 x377y96 x378y96 x379y96 x380y96 x381y96 x377y97 x378y97 x379y97 x380y97 x381y97 x377y98 x378y98 x379y98 x380y98 x381y98 x379y99 x380y99)
    • Generate WDM for cell to see output
    • If there is an error like PROBLEM ERROR Hourly data outside valid range, and a summary error for a year like HPET, x379y95, 2021 , 9657.27148 with message PROBLEM ERROR Annual data out of range, you need to try regenerating the grid from NLDAS2 data.
  • Regenerate the grid from the original NLDAS2 data:
    • Grid to land for a single cell: ./g2a_one.bash 2021010100 2021123123 /backup/meteorology /backup/meteorology/out/grid_met_csv x379y95
    • Re-do long term max so cloud cover is correct: LongTermAvgRNMax /backup/meteorology/out/lseg_csv/${sdate}-${edate} /backup/meteorology/out/lseg_csv/RNMax 1 x379y95
    • Generate WDM for cell to see output
    • Check the CSV file if there were errors in WDM for the given year: nano out/grid_met_csv/2021/x380y99zET.txt
    • If it gives no errors you are in business!
  • If the above gave no errors, you can re-process all cells for the landseg and create the landseg file with grid2land.sh
    • Reprocess all cells for bad year: grid2land.sh 2021010100 2021123123 /backup/meteorology /backup/meteorology/out/grid_met_csv N51053
    • Regenerate the CSVs for the landseg: a2l_one 1984010100 2022123123 /backup/meteorology/out/grid_met_csv /backup/meteorology/out/lseg_csv N51053
    • Go to the model dir: cd /opt/model/p6/vadeq/
    • Regen CLDC/SolarRad: LongTermAvgRNMax /backup/meteorology/out/lseg_csv/1984010100-2022123123 /backup/meteorology/out/lseg_csv/RNMax 1 N51053
    • Regen the WDM: wdm_pm_one N51053 1984010100 2022123123 nldas2 harp2021 nldas1221 p20211221
  • Or, you can just try regenerating the whole land seg and hope only one cell was bad (but it doesn't really take that long, so why bother?)

Details on error

  • Errors appear:
# shows
 PROBLEM ERROR Hourly data outside valid range
 data=    10.4600000
 PROBLEM ERROR Hourly data outside valid range
 data=    10.3100004
...

# then finally, a huge number for the summary annual HPET in 2021
 HPET, x379y95,        2021 ,   9657.27148
 PROBLEM ERROR Annual data out of range

@rburghol
Contributor Author

rburghol commented Apr 17, 2023

Still had a problem; some grid cells were fixed, others not.

  • found bad ET data starting 2/18/2021
    • 2022 segs: N51029, N51135, N51049, N51011
  • Try redownloading and regen for a single day to see if it improves:
    • ./get_nldas_to_date 2021 49 1
    • Maybe grid2land does NOT call g2a_one.bash for each cell?
    • Because this took more time: ./g2a_one.bash 2021010100 2021123123 /backup/meteorology /backup/meteorology/out/grid_met_csv x377y96

Batch process:

basin=JA5_7480_0001
segs=`cbp get_landsegs $basin`

badstart=2022010100 
badend=2022123123
dstart=1984010100
dend=2022123123 
i=N51049
./nldas_land_grids $i
# 10 x377y96 x375y97 x376y97 x377y97 x375y98 x376y98 x377y98 x378y98 x375y99 x376y99
# update all the grid cell CSVs in the land segment
grid2land.sh $badstart $badend /backup/meteorology /backup/meteorology/out/grid_met_csv $i
   # just update a single cell:
   # ./g2a_one.bash $badstart $badend /backup/meteorology /backup/meteorology/out/grid_met_csv x376y99
# weight all grid cells into the land segment
a2l_one 1984010100 2022123123 /backup/meteorology/out/grid_met_csv /backup/meteorology/out/lseg_csv  $i

LongTermAvgRNMax /backup/meteorology/out/lseg_csv/${dstart}-${dend} /backup/meteorology/out/lseg_csv/RNMax 1 $i
wdm_pm_one $i $dstart $dend nldas2 harp2021 nldas1221 p20211221
  • Found one cell, data ended 5/31/2022: nano out/grid_met_csv/2022/x376y97zET.txt
    • Regenerate:
    • Verify: nano out/grid_met_csv/2022/x376y97zET.txt
