Technical Documentation for Processing Discharge and SSC-Data for Sedigraph-Reconstruction and Sediment Yield Calculation
13.10.2017, Till Francke
Ref.: Zimmermann, A., Francke, T., & Elsenbeer, H. (2012). Forests and erosion: Insights from a study of suspended-sediment dynamics in an overland flow-prone rainforest catchment. Journal of Hydrology, 428-429, 170–181. doi:10.1016/j.jhydrol.2012.01.039
Knowledge requirements: read and understand the methodology of the above-mentioned paper
Software requirements: R (r-project.org), packages: randomForest, quantregForest
General advice
--------------
Process the numbered directories step-by-step. Where adjustments need to be made in the scripts, the respective line is referred to with a ##tag that can be searched for in the source code.
Stick to the relative paths used in the scripts; for every step, change R's working directory at the very beginning to the directory where that step's script resides (in R, click "File->Change Directory").
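From the R console, the equivalent is (the path below is only a placeholder for your local copy):

    # set the working directory to the folder containing the current step's scripts
    setwd("C:/projects/sedigraph/2_predictor_generation")   # placeholder path - adjust
    getwd()                                                  # verify the change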
The directories and associated scripts are arranged in the correct order; keep all files in that structure, because the scripts can only find each other if it remains unchanged.
In general, modifications only need to be made in the scripts and passages mentioned below. Do not change anything else unless you have a good reason.
Steps
-----
0. data preparation
Provide continuous data for discharge and rainfall, and intermittent SSC data, in the directory 0_data/.
Use "H1_discharge.txt", "TESP_5minres_1989_2010.txt" and "H1_ssc.txt" as templates (check the column names).
1. General settings
Specify the temporal resolution (tres), the column names, and the format of the date column (e.g. Y:m:d:h:m:s) in settings.R.
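The authoritative variable names are those in settings.R itself; purely as an illustration (the names and values below are assumptions, not the script's actual interface):

    # illustrative only - consult the comments in settings.R for the real variable names
    tres        <- 5                        # e.g. 5-minute temporal resolution (assumption)
    date_format <- "%Y-%m-%d %H:%M:%S"      # format of the date column (assumption)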
2. Generation of Ancillary Predictors (2_predictor_generation\ )
in runme.R (in 2_predictor_generation), adjust ##settings for the ancillary predictors to be derived: provide the names of the primary predictors, the number of aggregation levels and the factor between aggregation steps. Make sure the input table has a format the code can read (e.g. sep="\t" or sep=" ").
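As a minimal sketch of what this aggregation produces (window lengths, names and the moving-sum approach are illustrative, not the actual implementation in runme.R):

    # moving sums of a primary predictor (stand-in rainfall series) over
    # increasing aggregation windows: 3 levels, factor 3 between levels
    rain      <- runif(1000)
    windows   <- c(3, 9, 27)
    ancillary <- sapply(windows, function(w)
                   as.numeric(stats::filter(rain, rep(1, w), sides=1)))   # NA for the first w-1 steps
    colnames(ancillary) <- paste0("rain_agg", windows)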
Adding more predictors and aggregation levels increases memory demands and may slow down or stall the process.
If ANY of the series (or the predictors derived from it) has NODATA values at a timestep, that timestep cannot be used in either model training or prediction. If you encounter a severe loss of training data (step 4: 4_model_building), successively exclude predictors that hold little explanatory power and/or have many gaps/NODATA values.
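Continuing the sketch above, the loss of usable timesteps caused by such gaps can be checked with complete.cases():

    usable <- complete.cases(ancillary)   # FALSE wherever any column has NODATA/NA
    sum(!usable)                          # number of timesteps lost to gaps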
Start runme.R (in 2_predictor_generation)
Check the generated files "ancillary_data_5_train_ssc.txt" and "ancillary_data_5.txt" (or similar names). If they look OK (i.e. correct column names, not too many NAs - some at the beginning are OK), you may delete them, as the subsequent scripts use the binary versions.
3. Record selection from very large training data sets (3_record_selection_discharge_model)
Excessively large training data sets need to be thinned out so that model generation does not become too resource-intensive. This script performs that thinning.
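The actual selection strategy is implemented by the step-3 script; as a minimal stand-in illustration, thinning could be as simple as random subsampling ('train_data' and the target size are placeholders):

    set.seed(1)
    n_keep  <- 5000                                              # illustrative target size
    thinned <- train_data[sample(nrow(train_data), n_keep), ]    # random subset of rows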
4. Model building and validation (4_model_building\ )
(builds the RF and QRF models from the training data and displays their performance)
in settings.R adjust parameters at ##4_model_building
If you want to fine-tune the model:
- enable "importance"
- run runme4.R and analyze the resulting variable-importance plots (see the sketch after this list)
- you may add the predictors of lowest importance to "dontuse_cols"
- re-iterate and watch the performance
- if desired, enable plotting at ##enable/disable plots
- adjust the number of desired periods for cross-validation at ##number of validation periods
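A minimal fitting sketch, assuming placeholder objects 'predictors' (data frame of ancillary predictors) and 'ssc' (vector of observed SSC); this is not the runme4.R code itself:

    library(randomForest)
    library(quantregForest)
    rf  <- randomForest(x=predictors, y=ssc, importance=TRUE)    # RF with variable importance
    varImpPlot(rf)                  # low-ranked predictors are candidates for "dontuse_cols"
    qrf <- quantregForest(x=predictors, y=ssc)                   # QRF for prediction quantiles
    # note: older quantregForest versions use 'quantiles=' instead of 'what='
    pred <- predict(qrf, newdata=predictors, what=c(0.05, 0.5, 0.95))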
Start runme4.R (in 4_model_building); only the training data or the working directory should need to be changed here.
5. 5_model_application\ (apply QRF model built in previous step)
edit flood_numbering.txt to specify the time periods (events, months, years) for which the yield calculations will be performed;
in settings.R, adjust the parameters at ##5_model_application
If parallelisation is desired (multiprocessor computer, cluster, multiple computers):
- in settings.R adjust ##selection of flood periods that will be treated
- run replicate_dirs.R (generates 20 directories based on the files in ./templ)
- run call_mc.R in each generated folder (alternatively, run start_jobs.sh to submit jobs to a batch system)
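A minimal sketch of driving all replicated directories from one R session on a Unix-like machine (the MC_01...MC_20 naming is assumed from the single-computer example below; on a cluster, start_jobs.sh is the intended route):

    library(parallel)
    source("replicate_dirs.R")                # creates the replicated directories from ./templ
    dirs <- sprintf("MC_%02d", 1:20)          # assumed naming pattern
    mclapply(dirs, function(d)
        system(paste0("cd ", d, " && Rscript call_mc.R")),       # one R process per directory
        mc.cores = 4)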
If no parallelisation is desired (single computer):
- copy /templ/call_mc.R to /MC_01
- edit /MC_01/conf.txt: replace "1:10" by the range of periods (in flood_numbering.txt) that should be computed
- run /MC_01/call_mc.R
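From the R console, this amounts to:

    setwd("MC_01")         # the directory prepared above (relative to the project root)
    source("call_mc.R")    # computes the periods specified in conf.txt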
6. 6_visualise_results\
Various visualisations that need to be customized according to the specific problem.
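As a purely illustrative starting point (the object 'res' and its column names are assumptions about the step-5 output, not its actual layout):

    # reconstructed sedigraph: median SSC with 5-95 % quantile band
    plot(res$time, res$q50, type="l", xlab="time", ylab="SSC [mg/l]")
    lines(res$time, res$q05, lty=2)
    lines(res$time, res$q95, lty=2)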