diff --git a/C02_background.qmd b/C02_background.qmd index 8d98a2e..7bc2899 100644 --- a/C02_background.qmd +++ b/C02_background.qmd @@ -48,7 +48,7 @@ greedy_input ``` You may encounter a warning message that says, "There are fewer available cells for raster...". This is useful information, there simply weren't a lot of non-NA cells to sample from. Let's plot this. -```{r plot_greedye_input} +```{r plot_greedy_input} plot(greedy_input['class'], axes = TRUE, pch = ".", @@ -67,26 +67,43 @@ Well, that's imbalanced with a different number presences than background points # The conservative approach - data thinning -The conservative approach says that the environmental covariates (that's the Brickman data), or more specifically the resolution of the envirnomental covariates, should dictate the sampling. The core thought here is that it doesn't produce more or better information to have replicate measurements of either presences or In this approach we eliminate (thin) presences so that we have no more than one per covariate array cell. +The conservative approach says that the environmental covariates (that's the Brickman data), or more specifically the resolution of the envirnomental covariates, should dictate the sampling. The core thought here is that it doesn't produce more or better information to have replicate measurements of either presences or + +## Thin by cell + +In this approach we eliminate (thin) presences so that we have no more than one per covariate array cell. ```{r thin_by_cell} dim_before = dim(obs) -cat("number of rows before thinning:", dim_before[1], "\n") -obs = thin_by_cell(obs, mask) -dim_after = dim(obs) -cat("number of rows after thinning:", dim_after[1], "\n") +cat("number of rows before cell thinning:", dim_before[1], "\n") +thinned_obs = thin_by_cell(obs, mask) +dim_after = dim(thinned_obs) +cat("number of rows after cell thinning:", dim_after[1], "\n") +``` + +So, that dropped quite a few! + +## Make a weighted sampling map + +There is a technique we can use to to make a weighted sampling map. Simply counting the number of original observations per cell will indicate where we are most likely to oberve `Mola mola`. + +```{r sample_weight} +samp_weight = rasterize_point_density(obs, mask) +plot(samp_weight, axes = TRUE, breaks = "equal", col = rev(hcl.colors(10)), reset = FALSE) +plot(coast, col = "orange", lwd = 2, add = TRUE) ``` -So, that dropped quite a few! Now let's take a look at the background, but this time we'll try to match the count of presences. +Now let's take a look at the background, but this time we'll try to match the count of presences. ```{r sample_background_conservative} -conservative_input = sample_background(obs, mask, +conservative_input = sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), class_label = "background", - method = c("dist_max", 30000), + method = "bias", return_pres = TRUE) count(conservative_input, class) ``` +Whoa - that's many fewer background points. ```{r plot_conservative_input} plot(conservative_input['class'], @@ -97,6 +114,7 @@ plot(conservative_input['class'], reset = FALSE) plot(coast, col = "orange", add = TRUE) ``` +It appears that background points are essentially shadowing the thinned presence points. # Greedy or Conservative? @@ -176,13 +194,16 @@ make_model_input_by_month = function(mon = "Jan", write_sf(greedy_input, file.path(path, filename)) # thin the obs - obs = thin_by_cell(obs, raster) + thinned_obs = thin_by_cell(obs, raster) + + # sampling weight + samp_weight = rasterize_point_density(obs, raster) # make the conservative model - conservative_input = sample_background(obs, raster, + conservative_input = sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), class_label = "background", - method = c("dist_max", 30000), + method = "bias", return_pres = TRUE) # save the conservative data @@ -199,7 +220,6 @@ make_model_input_by_month = function(mon = "Jan", } ``` - # Reusing the function in a loop More phew! But that is it! Now we use a for loop to run through the months, calling our function each time. Happily, the built-in variable `month.abb` has all of the month names in order. @@ -245,8 +265,8 @@ plot(coast, col = "orange", add = TRUE) We have prepared what we call "model inputs", in particular for *Mola mola*, by selecting background points using two different approaches: greedy and conservative. There are lots of other approaches, too, but for the sake of learning we'll settle on just these two. We developed a function that will produce our model inputs for a given month, and saved them to disk. Then we read at least one back and showed that we can restore these from disk. # Coding Assignment -Use the [iterations tutorial](https://bigelowlab.github.io/handytandy/iterations.html) to apply your `make_model_input_by_month()` for each month. You'll know you have done it correctly if your result is a list filled with lists of greedy-conservative tables, **and** your `model_inputs` directory holds at least 24 files (12 months x 2 sampling schemes). +Use the [iterations tutorial](https://bigelowlab.github.io/handytandy/iterations.html) to apply your `make_model_input_by_month()` for each month. You'll know you have done it correctly if your result is a list filled with lists of greedy-conservative tables, **and** your `model_inputs` directory holds at least 24 files (12 months x 2 sampling schemes). # Challenge diff --git a/docs/C00_coding.html b/docs/C00_coding.html index e95c1ae..d716d9c 100644 --- a/docs/C00_coding.html +++ b/docs/C00_coding.html @@ -345,7 +345,7 @@

+<bytecode: 0x7fcb072131b8>

If that still doesn’t work, we highly recommend trying Rseek.org which is an R-language specific search engine.

diff --git a/docs/C02_background.html b/docs/C02_background.html index 0042361..1e8450a 100644 --- a/docs/C02_background.html +++ b/docs/C02_background.html @@ -226,7 +226,11 @@

On this page

-
  • 3 The conservative approach - data thinning
  • +
  • 3 The conservative approach - data thinning +
  • 4 Greedy or Conservative?
  • 5 Model input per month
      @@ -333,7 +337,7 @@

      -

      +

      @@ -359,49 +363,73 @@

      3 The conservative approach - data thinning

      -

      The conservative approach says that the environmental covariates (that’s the Brickman data), or more specifically the resolution of the envirnomental covariates, should dictate the sampling. The core thought here is that it doesn’t produce more or better information to have replicate measurements of either presences or In this approach we eliminate (thin) presences so that we have no more than one per covariate array cell.

      +

      The conservative approach says that the environmental covariates (that’s the Brickman data), or more specifically the resolution of the envirnomental covariates, should dictate the sampling. The core thought here is that it doesn’t produce more or better information to have replicate measurements of either presences or

      +
      +

      3.1 Thin by cell

      +

      In this approach we eliminate (thin) presences so that we have no more than one per covariate array cell.

      dim_before = dim(obs)
      -cat("number of rows before thinning:", dim_before[1], "\n")
      +cat("number of rows before cell thinning:", dim_before[1], "\n")
      -
      number of rows before thinning: 2459 
      +
      number of rows before cell thinning: 2459 
      -
      obs = thin_by_cell(obs, mask)
      -dim_after = dim(obs)
      -cat("number of rows after thinning:", dim_after[1], "\n")
      +
      thinned_obs = thin_by_cell(obs, mask)
      +dim_after = dim(thinned_obs)
      +cat("number of rows after cell thinning:", dim_after[1], "\n")
      -
      number of rows after thinning: 1204 
      +
      number of rows after cell thinning: 1204 
      -

      So, that dropped quite a few! Now let’s take a look at the background, but this time we’ll try to match the count of presences.

      +

      So, that dropped quite a few!

      +
      +
      +

      3.2 Make a weighted sampling map

      +

      There is a technique we can use to to make a weighted sampling map. Simply counting the number of original observations per cell will indicate where we are most likely to oberve Mola mola.

      -
      conservative_input = sample_background(obs, mask, 
      -                              n = 2 * nrow(obs),
      -                              class_label = "background",
      -                              method = c("dist_max", 30000),
      -                              return_pres = TRUE)
      -count(conservative_input, class)
      +
      samp_weight = rasterize_point_density(obs, mask)
      +plot(samp_weight, axes = TRUE, breaks = "equal", col = rev(hcl.colors(10)), reset = FALSE)
      +plot(coast, col = "orange", lwd = 2, add = TRUE)
      +
      +
      +
      +

      +
      +
      +
      +
      +

      Now let’s take a look at the background, but this time we’ll try to match the count of presences.

      +
      +
      conservative_input = sample_background(thinned_obs, samp_weight, 
      +                              n = 2 * nrow(obs),
      +                              class_label = "background",
      +                              method = "bias",
      +                              return_pres = TRUE)
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (1204 presences) than the requested 4918 background points. Only 1204 will be returned.
      +
      +
      count(conservative_input, class)
      Simple feature collection with 2 features and 2 fields
       Geometry type: MULTIPOINT
       Dimension:     XY
      -Bounding box:  xmin: -74.89169 ymin: 38.8679 xmax: -65.02004 ymax: 45.21401
      +Bounding box:  xmin: -74.5867 ymin: 38.868 xmax: -65.02004 ymax: 45.1333
       Geodetic CRS:  WGS 84
       # A tibble: 2 × 3
         class          n                                                      geometry
       * <fct>      <int>                                              <MULTIPOINT [°]>
      -1 presence    1204 ((-65.07 42.68), (-65.05 42.6), (-65.067 42.617), (-65.16 42…
      -2 background  2408 ((-65.02004 42.25251), (-65.02004 42.74609), (-65.1023 42.66…
      +1 presence 1204 ((-65.067 42.65), (-65.05 42.6), (-65.067 42.617), (-65.16 4… +2 background 1204 ((-65.1023 42.66383), (-65.02004 42.58157), (-65.1023 42.581…
      +

      Whoa - that’s many fewer background points.

      -
      plot(conservative_input['class'], 
      -     axes = TRUE,  
      -     pch = ".", 
      -     extent = mask, 
      -     main = "August conservative class distribution",
      -     reset = FALSE)
      -plot(coast, col = "orange", add = TRUE)
      +
      plot(conservative_input['class'], 
      +     axes = TRUE,  
      +     pch = ".", 
      +     extent = mask, 
      +     main = "August conservative class distribution",
      +     reset = FALSE)
      +plot(coast, col = "orange", add = TRUE)
      @@ -410,6 +438,8 @@

      3 The conservativ

      +

      It appears that background points are essentially shadowing the thinned presence points.

      +

      4 Greedy or Conservative?

      @@ -445,73 +475,76 @@

      Phew! That’s a lot of steps. To manually run those steps 12 times would be tedious, so we roll that into a function that we can reuse 12 times instead.

      This function will have a name, make_model_input_by_month. It’s a long name, but it makes it obvious what it does. First we start with the documentation.

      -
      #' Builds greedy and conservative model input data sets for a given month
      -#' 
      -#' @param mon chr the month abbreviation for the month of interest ("Jan" by default)
      -#' @param obs table, the complete observation data set
      -#' @param raster stars, the object that defines the sampling space, usually a mask
      -#' @param species chr, the name of the species prepended to the name of the output files.
      -#'   (By default "Mola mola" which gets converted to "Mola_mola")
      -#' @param path the output data path to store this data (be default "model_input")
      -#' @param min_obs num this sets a threshold below which we wont try to make a model. (Default is 3)
      -#' @return a named two element list of greedy and conservative model inputs - they are tables
      -make_model_input_by_month  = function(mon = "Jan",
      -                                      obs = read_observations("Mola mola"),
      -                                      raster = NULL,
      -                                      species = "Mola mola",
      -                                      path = data_path("model_input"),
      -                                      min_obs = 3){
      -  # the user *must* provide a raster
      -  if (is.null(raster)) stop("please provide a raster")
      -  # filter the obs
      -  obs = obs |>
      -    filter(month == mon[1])
      -  
      -  # check that we have at least some records, if not enough then alert the user
      -  # and return NULL
      -  if (nrow(obs) < min_obs){
      -    warning("sorry, this month has too few records: ", mon)
      -    return(NULL)
      -  }
      -  
      -  # make sure the output path exists, if not, make it
      -  make_path(path)
      -  
      -  
      -  # make the greedy model input by sampling the background
      -  greedy_input = sample_background(obs, raster,
      -                                   n = 2 * nrow(obs),
      -                                   class_label = "background",
      -                                   method = c("dist_max", 30000),
      -                                   return_pres = TRUE)
      -  # save the greedy data
      -  filename = sprintf("%s-%s-greedy_input.gpkg", 
      -                     gsub(" ", "_", species),
      -                     mon)
      -  write_sf(greedy_input, file.path(path, filename))
      -  
      -  # thin the obs
      -  obs = thin_by_cell(obs, raster)
      -  
      -  # make the conservative model
      -  conservative_input = sample_background(obs, raster,
      -                                   n = 2 * nrow(obs),
      -                                   class_label = "background",
      -                                   method = c("dist_max", 30000),
      -                                   return_pres = TRUE)
      -  
      -  # save the conservative data
      -  filename = sprintf("%s-%s-conservative_input.gpkg", 
      -                     gsub(" ", "_", species),
      -                     mon)
      -  write_sf(conservative_input, file.path(path,filename))
      -  
      -  # make a list
      -  r = list(greedy = greedy_input, conservative = conservative_input)
      -  
      -  # return, but disable automatic printing
      -  invisible(r)
      -}
      +
      #' Builds greedy and conservative model input data sets for a given month
      +#' 
      +#' @param mon chr the month abbreviation for the month of interest ("Jan" by default)
      +#' @param obs table, the complete observation data set
      +#' @param raster stars, the object that defines the sampling space, usually a mask
      +#' @param species chr, the name of the species prepended to the name of the output files.
      +#'   (By default "Mola mola" which gets converted to "Mola_mola")
      +#' @param path the output data path to store this data (be default "model_input")
      +#' @param min_obs num this sets a threshold below which we wont try to make a model. (Default is 3)
      +#' @return a named two element list of greedy and conservative model inputs - they are tables
      +make_model_input_by_month  = function(mon = "Jan",
      +                                      obs = read_observations("Mola mola"),
      +                                      raster = NULL,
      +                                      species = "Mola mola",
      +                                      path = data_path("model_input"),
      +                                      min_obs = 3){
      +  # the user *must* provide a raster
      +  if (is.null(raster)) stop("please provide a raster")
      +  # filter the obs
      +  obs = obs |>
      +    filter(month == mon[1])
      +  
      +  # check that we have at least some records, if not enough then alert the user
      +  # and return NULL
      +  if (nrow(obs) < min_obs){
      +    warning("sorry, this month has too few records: ", mon)
      +    return(NULL)
      +  }
      +  
      +  # make sure the output path exists, if not, make it
      +  make_path(path)
      +  
      +  
      +  # make the greedy model input by sampling the background
      +  greedy_input = sample_background(obs, raster,
      +                                   n = 2 * nrow(obs),
      +                                   class_label = "background",
      +                                   method = c("dist_max", 30000),
      +                                   return_pres = TRUE)
      +  # save the greedy data
      +  filename = sprintf("%s-%s-greedy_input.gpkg", 
      +                     gsub(" ", "_", species),
      +                     mon)
      +  write_sf(greedy_input, file.path(path, filename))
      +  
      +  # thin the obs
      +  thinned_obs = thin_by_cell(obs, raster)
      +  
      +  # sampling weight
      +  samp_weight = rasterize_point_density(obs, raster)
      +  
      +  # make the conservative model
      +  conservative_input = sample_background(thinned_obs, samp_weight,
      +                                   n = 2 * nrow(obs),
      +                                   class_label = "background",
      +                                   method = "bias",
      +                                   return_pres = TRUE)
      +  
      +  # save the conservative data
      +  filename = sprintf("%s-%s-conservative_input.gpkg", 
      +                     gsub(" ", "_", species),
      +                     mon)
      +  write_sf(conservative_input, file.path(path,filename))
      +  
      +  # make a list
      +  r = list(greedy = greedy_input, conservative = conservative_input)
      +  
      +  # return, but disable automatic printing
      +  invisible(r)
      +}

      @@ -519,29 +552,65 @@

      6 Reusing the function in a loop

      More phew! But that is it! Now we use a for loop to run through the months, calling our function each time. Happily, the built-in variable month.abb has all of the month names in order.

      -
      for (this_month in month.abb){
      -  result = make_model_input_by_month(this_month,
      -                                     obs = read_observations(scientificname = "Mola mola"),
      -                                     raster = mask,
      -                                     species = "Mola mola",
      -                                     path = data_path("model_input"),
      -                                     min_obs = 3)
      -}
      +
      for (this_month in month.abb){
      +  result = make_model_input_by_month(this_month,
      +                                     obs = read_observations(scientificname = "Mola mola"),
      +                                     raster = mask,
      +                                     species = "Mola mola",
      +                                     path = data_path("model_input"),
      +                                     min_obs = 3)
      +}
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (4 presences) than the requested 8 background points. Only 4 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (16 presences) than the requested 34 background points. Only 16 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (13 presences) than the requested 28 background points. Only 13 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (148 presences) than the requested 480 background points. Only 148 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (364 presences) than the requested 1286 background points. Only 364 will be returned.
      +
      Warning in sample_background(obs, raster, n = 2 * nrow(obs), class_label = "background", : There are fewer available cells for raster 'NA' (2325 presences) than the requested 4650 background points. Only 3882 will be returned.
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (996 presences) than the requested 4650 background points. Only 996 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (968 presences) than the requested 4168 background points. Only 968 will be returned.
      +
      +
      Warning in sample_background(obs, raster, n = 2 * nrow(obs), class_label = "background", : There are fewer available cells for raster 'NA' (2459 presences) than the requested 4918 background points. Only 4818 will be returned.
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (1204 presences) than the requested 4918 background points. Only 1204 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (541 presences) than the requested 1844 background points. Only 542 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (311 presences) than the requested 832 background points. Only 312 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (210 presences) than the requested 676 background points. Only 210 will be returned.
      +
      +
      +
      Warning in sample_background(thinned_obs, samp_weight, n = 2 * nrow(obs), : There are fewer available cells for raster 'NA' (20 presences) than the requested 112 background points. Only 20 will be returned.
      +

      7 Listing the output files

      You can always look into you output directory to see if the files we made, but even better might be to use the computer to list them for you. If your species is found in sufficient numbers year round, you’ll have 24 files: 12 months x 2 approaches (greedy vs conservative)

      -
      path = data_path("model_input")
      -files = list.files(path, full.names = TRUE)
      -files
      +
      path = data_path("model_input")
      +files = list.files(path, full.names = TRUE)
      +files
       [1] "/Users/ben/Dropbox/code/projects/ColbyForecasting_data/2025/ben/model_input/Mola_mola-Apr-conservative_input.gpkg"
        [2] "/Users/ben/Dropbox/code/projects/ColbyForecasting_data/2025/ben/model_input/Mola_mola-Apr-greedy_input.gpkg"      
      @@ -574,15 +643,15 @@ 

      7 Listing the out

      8 Reading the files

      We know that each file should have a table with spatial information included. Let’s read one back and plot it.

      -
      x = read_sf(files[1])
      -filename = basename(files[1])
      -plot(x['class'], 
      -     axes = TRUE,  
      -     pch = "+", 
      -     extent = mask, 
      -     main = filename,
      -     reset = FALSE)
      -plot(coast, col = "orange", add = TRUE)
      +
      x = read_sf(files[1])
      +filename = basename(files[1])
      +plot(x['class'], 
      +     axes = TRUE,  
      +     pch = "+", 
      +     extent = mask, 
      +     main = filename,
      +     reset = FALSE)
      +plot(coast, col = "orange", add = TRUE)
      @@ -606,18 +675,18 @@

      11 Challenge

      Create a function to read the correct model input when given the species, month and approach.

      Use the menu option File > New File > R Script to create a blank file. Save the file (even though it is empty) in the “functions” directory as “model_input.R”. Use this file to build a function (or set of functions) that uses this set of arguments. Below is a template to help you get started.

      -
      #' Reads a model input file given species, month, approach and path
      -#' 
      -#' @param scientificname chr, the species name
      -#' @param mon chr month abbreviation
      -#' @param approach chr, one of "greedy" or "conservative"
      -#' @param path chr the path to the data directory
      -read_model_input = function(scientificname = "Mola mola",
      -                            mon = "Jan",
      -                            approach = "greedy",
      -                            path = data_path("model_input")){
      -    # your part goes in here
      -}
      +
      #' Reads a model input file given species, month, approach and path
      +#' 
      +#' @param scientificname chr, the species name
      +#' @param mon chr month abbreviation
      +#' @param approach chr, one of "greedy" or "conservative"
      +#' @param path chr the path to the data directory
      +read_model_input = function(scientificname = "Mola mola",
      +                            mon = "Jan",
      +                            approach = "greedy",
      +                            path = data_path("model_input")){
      +    # your part goes in here
      +}
      diff --git a/docs/C02_background_files/figure-html/plot_conservative_input-1.png b/docs/C02_background_files/figure-html/plot_conservative_input-1.png index 0d6a5fd..1973b24 100644 Binary files a/docs/C02_background_files/figure-html/plot_conservative_input-1.png and b/docs/C02_background_files/figure-html/plot_conservative_input-1.png differ diff --git a/docs/C02_background_files/figure-html/plot_greedye_input-1.png b/docs/C02_background_files/figure-html/plot_greedy_input-1.png similarity index 100% rename from docs/C02_background_files/figure-html/plot_greedye_input-1.png rename to docs/C02_background_files/figure-html/plot_greedy_input-1.png diff --git a/docs/C02_background_files/figure-html/read_file-1.png b/docs/C02_background_files/figure-html/read_file-1.png index 3a2c04c..df45f04 100644 Binary files a/docs/C02_background_files/figure-html/read_file-1.png and b/docs/C02_background_files/figure-html/read_file-1.png differ diff --git a/docs/C02_background_files/figure-html/sample_weight-1.png b/docs/C02_background_files/figure-html/sample_weight-1.png new file mode 100644 index 0000000..ee31307 Binary files /dev/null and b/docs/C02_background_files/figure-html/sample_weight-1.png differ diff --git a/docs/C03_covariates.html b/docs/C03_covariates.html index e986dbe..e4d8a8b 100644 --- a/docs/C03_covariates.html +++ b/docs/C03_covariates.html @@ -7,7 +7,7 @@ -Prediction – Colby Forecasting 2025 +Covariates – Colby Forecasting 2025