Data request example (from Markus Erhard Schorn) and my solution strategy #45

Open
tz05 opened this issue Sep 24, 2019 · 8 comments


@tz05

tz05 commented Sep 24, 2019

The original request (from Markus Erhard Schorn):

a) a tree-level datafile

  • one row for each measurement (i.e. if a tree was remeasured once, the file contains two lines for this tree)
  • each row contains a unique tree ID, which is the same for all measurements of a single tree (currently not the case in raw FIA data)
  • each row contains a unique plot ID, which is the same for all the trees in a single plot and for all the measurements of those trees, matching the plot ID in the plot-level datafile (see below)
  • if possible, data from plot-level datafile (important columns as mentioned below) already merged in this file

b) a plot-level datafile

  • in the best case one file containing merged data from XX_PLOT.csv and XX_COND.csv
  • important columns for me: FORTYPCD, STDAGE, SITECLCD from XX_COND.csv; INVYR, MEASYEAR, MEASMON, MEASDAY from XX_PLOT.csv; but it could be all the columns from both XX_PLOT.csv and XX_COND.csv as well
  • one row for each census
  • each row contains a unique plot ID (see above)

I need this for data from WA and OR. For training purposes, the data could be filtered for FORTYPCD == 201 already, but I'd rather have the full dataset and do that myself later on. Also you could filter for remeasured plots only, but the same applies here.
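Both optional filters are easy to apply after the fact. A minimal sketch in base R, assuming a merged tree-level data frame with a hypothetical stable plot identifier column `PLOT_ID` (the column name and all toy values are invented for illustration, not taken from the FIA files):

```r
# Toy merged table; every value here is invented for illustration.
trees <- data.frame(
  ID       = c("a", "a", "b", "c", "d"),
  PLOT_ID  = c("p1", "p1", "p1", "p2", "p3"),  # hypothetical stable plot ID
  INVYR    = c(2005, 2015, 2015, 2005, 2005),
  FORTYPCD = c(201, 201, 201, 201, 301),
  stringsAsFactors = FALSE
)

# Filter 1: keep a single forest type (201 is the code named in the request).
type201 <- trees[trees$FORTYPCD == 201, ]

# Filter 2: keep only plots that appear in more than one census.
n_census   <- tapply(trees$INVYR, trees$PLOT_ID, function(y) length(unique(y)))
remeasured <- trees[trees$PLOT_ID %in% names(n_census)[n_census > 1], ]
```

Keeping the full dataset and filtering later, as requested, means either subset can be rebuilt without rerunning the merge.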


Description of my data product (with R):

  1. It is tree-level data with plot-level information merged in (as Markus preferred). All fields from the TREE data are included in the files.
  2. The file contains an “ID” field for trees. It is unique to each tree and stays the same across surveys. Its value is the tree’s CN in the EARLIEST survey in which the tree appears.
  3. The fields from the PLOT and COND data include those mentioned in Markus’ email.
  4. Because Markus wanted the PLOT and COND fields merged (and then merged with the TREE fields), with one row per census of a plot, I excluded plots containing more than one condition. This avoids cases where the merged PLOT and COND data contain multiple rows for a single plot.
  5. PLT_CN and COND_CN (the IDs for plot and condition records) are the same as those used in the PLOT and COND data.
@ethanwhite
Member

Thanks @tz05! You can also add the code here either by dragging and dropping the file or by putting it in a comment using ``` (3 backticks) on the line before and the line after the code.

For example:

data <- read.csv("mydatafile.csv")

@henrykironde
Contributor

Thanks @tz05, I will have a look at this and get back to you if I have any questions.

@henrykironde
Contributor

henrykironde commented Nov 3, 2019

@tz05 I want to confirm what I think the goal of the code is:
get tree data from XX_TREE.csv, merge it with XX_PLOT.csv using PLT_CN,
and merge XX_COND.csv using XX_TREE.csv's CN.
If so, we should end up with 969305 records.
Is this correct?

@tz05
Author

tz05 commented Nov 3, 2019

My suggestion is to merge PLOT.csv and COND.csv first and then merge the result with TREE.csv.

Merging PLOT.csv and COND.csv needs to use CN (in PLOT) and PLT_CN (in COND). The CN field in COND is irrelevant to this merge. In this step, I filtered out the plots that contain multiple (>1) conditions, because for these plots the tree records cannot easily be merged with the plot+condition records. A tree should be associated with only one condition, but for these plots it is hard (if not impossible) to tell which condition a tree is associated with.

Merging this plot+condition data with the TREE data needs to use PLT_CN (in TREE) and CN (in PLOT).
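The merge order described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the original script: the three data frames are toy stand-ins for XX_PLOT.csv, XX_COND.csv and XX_TREE.csv with invented values, `DIA` stands in for any tree measurement column, and the renaming of COND's `CN` to `COND_CN` is one possible way to avoid a name clash.

```r
# Toy stand-ins for the three FIA tables (all values invented).
plot_df <- data.frame(CN = c(10, 11, 12), INVYR = c(2005, 2015, 2015))
cond_df <- data.frame(
  CN       = c(100, 101, 102, 103),
  PLT_CN   = c(10, 11, 12, 12),        # plot 12 has two conditions
  FORTYPCD = c(201, 201, 201, 221)
)
tree_df <- data.frame(
  CN     = c(1, 2, 3, 4),
  PLT_CN = c(10, 11, 12, 12),
  DIA    = c(5.1, 6.0, 7.2, 8.4)
)

# Step 1: keep only plots with exactly one condition record.
n_cond      <- table(cond_df$PLT_CN)
single      <- names(n_cond)[as.vector(n_cond) == 1]
cond_single <- cond_df[as.character(cond_df$PLT_CN) %in% single, ]

# Step 2: merge PLOT (key: CN) with COND (key: PLT_CN).
# COND's own CN plays no part in the join; it is kept under another name.
names(cond_single)[names(cond_single) == "CN"] <- "COND_CN"
plot_cond <- merge(plot_df, cond_single, by.x = "CN", by.y = "PLT_CN")
names(plot_cond)[names(plot_cond) == "CN"] <- "PLT_CN"

# Step 3: merge TREE (key: PLT_CN) with the plot+condition table.
tree_full <- merge(tree_df, plot_cond, by = "PLT_CN")
```

With the toy data, only the trees in plots 10 and 11 survive, since plot 12 has two conditions. When reading the real CSV files, the CN columns are probably best read as character rather than numeric, because the identifiers are long.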


@henrykironde
Contributor

[Attached image: UNADJUSTEDNONRAW_thumb_22a]

@henrykironde
Contributor

I hope this is what the end goal is. I will have to add some code to perform specific attribute filtering at this level for individual files.

@tz05
Author

tz05 commented Nov 6, 2019

About your diagram, two things need attention:

  1. The tree with CN of 1 in the TREE table should not be in the final table, because its plot (CN of 45 in the PLOT table) has multiple conditions. However, the same tree's record from a previous survey might be in the final table if the plot had only a single condition at that time.
  2. The final table needs an ID field, and each tree must have exactly one unique ID. That means the tree records with CN of 2 and 5 in the TREE table have to share the same ID, and so do the tree records with CN of 3 and 4. In my R script, my strategy for the ID values was to use a tree's CN in its earliest survey, so the value won't change over time even when new surveys are included and the dataset is updated.
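One way to produce such IDs can be sketched in base R. This is a hedged illustration, not the original script: it assumes each tree record carries a `PREV_TRE_CN` column pointing at the same tree's record in the previous survey, and it assumes (the thread does not say) that CN 2 precedes CN 5 and CN 3 precedes CN 4. The helper `earliest_cn` is a hypothetical name.

```r
# Toy TREE records; the PREV_TRE_CN links are assumptions for illustration.
tree_df <- data.frame(
  CN          = c(2, 3, 4, 5),
  PREV_TRE_CN = c(NA, NA, 3, 2)   # NA: no earlier record of this tree
)

# Walk back through PREV_TRE_CN until a record with no known predecessor
# is reached; that record's CN becomes the tree's permanent ID.
earliest_cn <- function(cn, tree_df) {
  repeat {
    prev <- tree_df$PREV_TRE_CN[match(cn, tree_df$CN)]
    if (is.na(prev) || !(prev %in% tree_df$CN)) return(cn)
    cn <- prev
  }
}

tree_df$ID <- vapply(tree_df$CN, earliest_cn, numeric(1), tree_df = tree_df)
```

Here records 2 and 5 both get ID 2, and records 3 and 4 both get ID 3. Computing the IDs on the full TREE table, before dropping records from multi-condition plots, would keep an ID from depending on which records survive the filter.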

I hope these comments are helpful.

@tz05
Author

tz05 commented Feb 12, 2020

For the record, I have uploaded my demonstration PDF file here for reference.
illustration.pdf
