- Juan Zuloaga and Andrew Gonzalez (McGill University)
- Nikol Dimitrov and Richard Schuster (The Nature Conservancy of Canada)
Develop HSMs for a selected group of species in Canada as a complementary tool to help The Nature Conservancy of Canada in conservation planning.
- This is an exploratory analysis using coarse resolution datasets (~1km2).
- Canada
- Observations (points): GBIF
- Extent of the analysis: we used geogrpahic range maps when available; otherwise we created a minimum convex polygon (MCP) using all available observations and buffered it using a distance of 10Km (R function/package: mcp {adehabitatHR}).
- Predictors (rasters):
- Topographic heterogeneity developed by Amatulli et al 2018. Access date (September 17, 20) on this website.
Vector Ruggedness Measure (VRM)
,Roughness
,Eastness
,Northness
,Slope
. - The Dynamic Habitat Index (DHI) is an integrated metric of vegetation production from satellite imagery that measures the fraction of photosynthetically active radiation (or fPAR) intercepted by vegetation (Coops et al 2008).
Cummulative annual productivity
,the minimum level of vegetation cover
, andthe degree of seasonality
. Access date: November, 2019. Spatial resolution: 0.0083 decimal degrees (~ 1 Km resolution). Datasets downloaded from: (DHI, from the Silvis Lab). - BIOCLIM (1970-2000). Nineteen (19) bioclimatic variables from Worldclim (Hijmans et al 2005); using R function/package (tile_name, tile_get and tile_merge {WorldClimTiles}).
- Percentage of lakes. We calculated the number of lake-pixels (100m) within a 1000 pixel size, using lakes datasset from HydroLakes v1.
- Landcover. We aggregated 32 landcover types of the year 2015 from ESA-CCI into 8. We then reclassified these landscover types into binary presence absence rasters and used a 9x9 moving window analysis to convert them into continuous variables. This moving window analysis was conducted by calculating the proportion of cells that had a given landcover type within the 9X9 window.
- Topographic heterogeneity developed by Amatulli et al 2018. Access date (September 17, 20) on this website.
-
We used Maxent algorithm (for Maxent performance see also: Hernandez et al. 2006; Aguirre-Gutierrez et al. 2013; Philips et al. 2017; Radomsky et al. 2017) implemented in the
ENMeval
v.2.0.0 package (Kass et al 2021) in R. -
Spatial filtering (thining): To reduce spatial sampling bias and spatial autocorrelation in species observations we thinned the occurrence data set of each species using a thinning algorithm (implemented in the spThin package by (Aiello-Lammens et al 2015), using a distance of 1 Km.
-
Multicolinearity: we used a correlation coefficient threshold of 0.7 (Dormann et al 2012) to select a set of uncorrelated variables, using removeCollinearity fucntion from the virtualspecies package in R.
-
Background points (bg): we selected 'bg' based on to number of pixels in study area:
- 40% if study area contains < 100,000 pixels
- 20% if study area contains > 100,000 and < 1,000,000 pixels
- 5% if study area contains > 1,000,000 pixels and < 2,000,000 pixels
- 1% if study area contains > 2,000,000 pixels
-
Bias layer: we created a bias raster layer using all observations, using R function/package (MASS::kde2d). We used this bias layer to select the background points.
-
Features (fc): “The idea of Maxent is to estimate a target probability distribution by finding the probability distribution of maximum entropy (i.e., that is most spread out, or closest to uniform), subject to a set of constraints that represent our incomplete information about the target distribution. The information available about the target distribution often presents itself as a set of real-valued variables, called “features”, and the constraints are that the expected value of each feature should match its empirical average (average value for a set of sample points taken from the target distribution). When Maxent is applied to presence-only species distribution modeling, the pixels of the study area make up the space on which the Maxent probability distribution is defined, pixels with known species occurrence records constitute the sample points, and the features are climatic variables, elevation, soil category, vegetation type or other environmental variables, and functions thereof." (Phillips et al., 2006). Available features are: Linear (L), Quadratic (Q), Product (P), Threshold (T), and Hinge (H). We selected the type of features according to the number of observations (Merow et al., 2013):
- Less than 10 observations (L)
- 11 - 15 observations (L, Q, LQ)
- 16 - 80 observations (L, Q, LQ, H, QH)
- More than 80 observations (Q, H, LQ, LQP, QPT)
-
Regularization multiplier (rm): The 'rm' prevents model over-fittig and smooths the distribution, making it more regular. “A smaller RM value will result in a more localized output distribution that fits the given occurrence records better, but is more prone to overfitting. Overfitted models are poorly transferable to novel environments and, thus, not appropriate to project distribution changes under environmental change. A larger RM value will produce a prediction that can be applied more widely" (Li et al., 2020). “More complex feature classes will require more regularization to yield accurate predictions.” ( Phillips and Dudík (2008)). We used regularization values (RMs) proposed by ( Phillips and Dudík (2008)), based on number of observations:
- 0.05
- 0.50 and,
- 1.00 (Default in MaxEnt).
-
Data partitioning method
- Crossvalidate (More than 25 observations; kfolds=10).
- K-1 Jackknife (less than 25 observations).
We implemented a sequential method to select the best model using two performance metrics. This approach uses cross-validation results by selecting models with the lowest average test omission rate, and to break ties, with the highest average validation AUC (Kass et al 2020 and Radosavljevic & Anderson 2014). Further analysis were based on this ‘best model’.
-
Omission rate (OR) indicates the “fraction of the test localities that fall into pixels not predicted as suitable for the species. A low omission rate is a necessary (but not sufficient) condition for a good model.” (Phillips et al., 2006). There is no thresholding rule developed yet to determine the optimal threshold for the omission rate, so we suggest this provisional relative scale (is the model potentially useful?): (0 - 0.25, the model can be informative; 0.25 - 0.50, there is some concern in model predictions; > 0.50 there is high concern in model predictions). Omission rate values(or.10p.avg).
-
AUC values are “the probability that a random positive instance and a random negative instance are correctly ordered by the classifier.” (Phillips et al., 2006). AUC values above 0.75 are considered potentially useful (Elith. 2002) and Phillips and Dudík (2008)). AUC values(auc.val.avg).
We used the best model's parameters and ran 10 times the model, letting the selection of background points to vary in space. Uncertainty map was calcualted as the coefficient of variation among 10 HSA maps generated in model fitting. we expect low spatial variation in this uncertainty map.
We used the 10th perecentile training presence logistic threshold to convert our continuous maps to binary maps.
The following outputs were projected to the NCC national grid (projection: albers equal area conic, resolution: 1km2)
- Habitat suitability map (continuous and binarized)
- Uncertainty map (coeficient of variation 10 runs best model)
Using 16 virtual machiches from Google Cloud Compute to scale up the work to produce 1000+ species outputs and biodiversity predictions