# Practical tips for organizing Predictive Soil Mapping {#practical-tips}
*Edited by: T. Hengl, R. A. MacMillan and I. Wheeler*
## Critical aspects of Predictive Soil Mapping
Previous chapters in this book have reviewed many of the technical aspects of PSM.
For a statistician, following the right procedures and applying the right statistical frameworks are
the key elements of success for a PSM project. In practice, however,
it is the combination of all elements and aspects that determines the success of a PSM project.
In this chapter we provide some practical tips on how to organize work and what to be especially
careful about. We refer to these as the *critical aspects of PSM projects*.
At the end of the chapter we also try to present additional practical tips in the form of check-lists and
simple economic analysis, to help readers avoid making unrealistic plans or producing maps
that may not find effective use.
### PSM main steps
Based on previously presented theory, we can summarize the usual PSM processes as:
1. Preparation of point data (training data).
2. Preparation of covariate data (the explanatory variables).
3. Model fitting and validation (building rules by overlay, model fitting and cross-validation).
4. Prediction and generation of (currently best-possible) final maps
(applying the rules).
5. Archiving and distribution of maps (usually via soil geographical
databases and/or web services).
6. Updates and improvements (support).
```{r psm-decision-tree, echo=FALSE, fig.cap="General decision tree in a Predictive Soil Mapping project.", out.width="90%"}
knitr::include_graphics("figures/Fig_decisiontree_PSM_project.png")
```
An even more comprehensive list of steps in PSM projects is given in Fig. \@ref(fig:psm-decision-tree),
which also includes market/user-domain research and might also be used for soil monitoring projects.
In principle, we recognize three main types of PSM projects:
A. PSM projects in new, previously unmapped, areas — no point observations or samples
currently exist.
B. PSM projects using legacy points — sufficient point data to support
PSM exist and are available, but no previous PSM modelling has
been implemented for this area.
C. PSM projects aimed at optimizing predictions and usability —
previous PSM models have already been completed, but the
results can still be improved / optimized.
If point data are not available, then collecting new point data via
field work and laboratory analysis will usually consume the majority of
the PSM project budget (PSM projects of type A). Otherwise, if point data
are already available (and only need to be imported and harmonized), the
most time consuming part of PSM will likely be preparation of covariate
data layers (PSM projects type B). Predictions can also take a long time
and computing costs per update can be significant (see further sections).
Personnel costs can be more significant than server costs as programming
can require weeks of staff time. However, if programming is done at a
high level (e.g. through generic functions and objects), subsequent
updates should require less personnel time as predictions can be
increasingly automated.
Another aspect of PSM is the time dimension, i.e. whether maps will be
continuously updated or produced only once and then expected to remain
relevant and useful for years (often decades). Based on this, PSM
projects can also be classified into:
I. PSM projects for the purpose of mapping static (stable) spatial
patterns only.
II. PSM projects for the purpose of one-time change detection (e.g. two
time intervals).
III. PSM projects for the purpose of monitoring soil conditions / status
(continuous updates at regular intervals).
To date, almost all conventional soil mapping ignores time and change
and instead assumes that soil properties are static and persist
through time virtually unaltered. Increasingly, however, new generation
PSM projects aim to monitor changes in soil resources, with special
focus given to changes in soil organic carbon, soil nutrients, soil moisture and
similar (Fig. \@ref(fig:psm-types-monitoring)). For PSM projects of
type III, spatio-temporal prediction models can
be used (as in meteorology, for example), but this requires that
sufficient training data are available in both the space and time domains, e.g. at least
five measurement intervals / repetitions.
```{r psm-types-monitoring, echo=FALSE, fig.cap="Types of PSM projects depending on whether maps are generated for single usage, or for detecting change or soil monitoring.", out.width="60%"}
knitr::include_graphics("figures/Fig_types_of_PSM_projects.png")
```
### PSM input and output spatial data layers
In PSM, there are, in principle, three (3) main types of spatial input
layers [@Hengl2017SoilGrids250m]:
1. Soil samples (usually points or transects) are spatially incomplete.
They are used as evidence in generating spatial predictions. In
vertical and horizontal dimensions, soil points might refer to
volumes i.e. have a block support. Often only the horizontal (2D)
support is mentioned, and the 3D support has to be inferred from
the description of the depth slice(s) sampled.
2. Soil mask, i.e. a raster map delineating the spatial domain of
interest for PSM. Commonly derived from a land cover map with
water bodies, permanent ice and similar removed from predictions.
3. Covariates i.e. grid maps that depict environmental conditions.
Ideally all covariates are "stacked" to exactly match the same
grid and all missing values and inconsistencies are resolved prior
to PSM.
And three (3) main types of spatial output layers:
4. Spatial predictions of (primary) soil variables that are spatially
complete i.e. are produced and available for the entire extent of
the soil mask.
5. Maps of (secondary) soil variables which are derived using
calculations applied to combinations of the primary predicted
soil variables. These are usually less expensive to produce, by an
order of magnitude, than spatial predictions of primary
soil variables.
6. Maps quantifying uncertainty in terms of prediction error,
prediction interval, confusion index or similar metrics. These may
be derived at the same time as predictions are made or can be made
completely independently of predictions.
Each element of the map types listed above needs to have a consistent
spatio-temporal reference, which typically includes:
- Geographic location in local or geographic coordinates (for global
modelling we usually prefer initial georeferencing that uses
longitude and latitude in the WGS84 coordinate system);
- Depth interval expressed in cm from the land surface (upper and
lower depth) for layers and point depth for point predictions;
- Support size or reference soil volume (or voxel) i.e. the horizontal
sampling area multiplied by the thickness of the sampling
block e.g. 30 $\times$ 30 $\times$ 0.3 m.
- Temporal reference i.e. a begin and an end date/time of the period
of measurements/estimations. Specifying exact spatial and temporal
references in the metadata is vital for optimal production and
use of maps.
Spatial predictions of primary soil properties can be used to:
- Derive spatial aggregates (upscaling to coarser resolution).
- Derive vertical aggregates e.g. mean pH in 0–100 cm of soil (for
this we usually recommend using the trapezoidal rule, as explained in
@Hengl2017SoilGrids250m; see the sketch after this list).
- Derive secondary soil properties e.g. available water capacity,
organic carbon stock etc.
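
A minimal sketch of this trapezoidal aggregation, assuming predictions are available at six standard depths (the depth and pH values below are purely illustrative):

```{r, eval=FALSE}
## Vertical aggregation of depth-wise predictions using the trapezoidal rule
## (depths and pH values below are illustrative only):
depths <- c(0, 5, 15, 30, 60, 100)     # standard depths in cm
ph <- c(5.2, 5.4, 5.6, 5.9, 6.1, 6.3)  # predicted pH at those depths
thick <- diff(depths)                  # layer thicknesses
## trapezoidal rule: average of consecutive values, weighted by thickness:
ph.0.100 <- sum((ph[-1] + ph[-length(ph)]) / 2 * thick) / sum(thick)
ph.0.100
```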
Spatial predictions of primary soil variables and derived soil variables
are meant to be used for decision making and further modeling i.e. they are used to
construct a **Soil Information System** once all values of all variables
are known for all pixels within the soil mask. A SIS should ideally
provide information that can directly support input to modeling,
planning and decision-making.
## Technical specifications affecting the majority of production costs
The majority of the costs of a PSM project are controlled by the following:
1. **Spatial resolution** (commonly 30 m, 100 m or 250 m): Spatial
resolution is crucial in determining the total costs of PSM,
especially in terms of computing, storage, network traffic and
hardware requirements. Changing the spatial resolution from 100 to
30 m means that about 10 times more pixels will need to be
produced, stored and shared via the network (see the worked example
after this list). This does not always
imply that the costs of PSM will also be 10 times greater than for
a 100 m resolution project, but the increase in costs is often
going to follow a quadratic function. Also note that, for even
finer resolutions e.g. 5 m, very limited free public covariate
data are available and additional purchases of commercial RS
products will typically be required. For example the latest 12 m
resolution WorldDEM (https://worlddem-database.terrasar.com/) can cost
up to 10 USD per square km, which can increase PSM costs significantly.
2. **List of target variables and their complexity**: Some PSM projects
focus on mapping 1–2 soil variables only, and as such can be
rather straightforward to implement. Any PSM project that requires
creation of a complete Soil Information System (tens of
quantitative soil variables and soil types), will definitely
demand more effort and hence potentially significantly
increase costs. Typically, evaluation and quality control of maps
in a SIS requires an analyst to open and visually compare patterns
from different maps and to make use of considerable empirical
knowledge of soils. Costs of production can also be significantly
increased depending on whether lower and upper prediction
intervals are required. As with increasing spatial resolution,
requesting lower and upper prediction intervals means that two
times more pixels will need to be produced.
3. **Targeted accuracy/quality levels**: Often the agencies that order
spatial soil information expect that predictions will achieve some
desired accuracy targets. Accuracy of predictions can, indeed,
often be improved (but only up to a certain level), by simply
improving the modelling framework (PSM projects type C). In
practice, if a contractor requires significant improvements in
accuracy, then this often means that both additional point records
and improved covariate data (for example at finer
spatial resolution) will need to be collected and/or purchased.
This can often mean that the original budget will have to be
increased until the required accuracy level can be reached.
4. **List of targeted services / user domain**: Is the goal of the PSM
project to produce data only, or to serve this data for a number
of applications (use-cases)? How broad is the user domain? Is the
data being made for a few targeted clients or for the widest
possible user base? Is high traffic expected and, if so, how will
the costs of hosting and serving the data and processes be met?
Producing a robust, scalable web-system that can serve thousands
of users at the same time requires considerable investments in
programming and maintenance.
5. **Commercialization options**: Commercialization of data and
services can also significantly increase costs, since the
development team also needs to prepare workflows where invoices
and bills are generated on demand, and where efficient support and
security become critically important. Even though many companies
exist that offer outsourcing of this functionality, many
organizations and companies prefer to retain full control of the
commercialization steps, hence such functionality then needs to be
developed internally within the project or organization.
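
As a worked example for point 1 above, the number of pixels grows with the inverse square of the pixel size, which can be verified with a quick calculation (the study area size below is hypothetical):

```{r, eval=FALSE}
## Pixel counts for a hypothetical study area at several resolutions:
area.km2 <- 8e5                    # hypothetical study area in square-km
res.m <- c(250, 100, 30)           # candidate pixel sizes in m
data.frame(res.m, Mpix = area.km2 * (1000 / res.m)^2 / 1e6)
## going from 100 m to 30 m increases the pixel count ca. 11 times:
(100 / 30)^2
```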
### Field observations and measurements
**Observations and measurements** (O&M) are at the heart of all advances in
scientific endeavor. One cannot describe, or attempt to understand,
what one cannot see, or measure. Great leaps in scientific understanding
have always followed from major improvements in the ability to see, and
measure, phenomena or objects. Think of the telescope and astronomy,
the microscope and microbiology, the X-ray and atomic structure or
crystallography and so on.
In the area of resource inventories, observations and measurements
carried out in the field (field data) provide the evidence critical to
developing the understanding of spatial patterns and spatial processes
that underpins all models that predict the spatial distribution of
properties or classes. This applies equally to subjective, empirical
mental, or conceptual, models and objective, quantitative statistical
models. The more and better the observations and measurements we obtain,
the better will be our ability to understand and predict spatial
patterns of soils and other natural phenomena. Consider here some
general observations on how to maximize efficiency of O&M:
- For maximum utility, field data should be objective
and reproducible.
- They should be collected using some kind of unbiased sampling design
that supports reproducibility and return sampling [@BRUS2019464; @Malone2019PeerJ].
- They should be located as accurately as possible in both
space (geolocation) and time (temporal location).
- They should describe and measure actual conditions in their present
state (and current land use) and not in some assumed natural,
climax or equilibrium condition.
- They should capture and permit description of spatial and temporal
variation across multiple spatial scales and time frames.
It is widely assumed that collecting new field data to produce new and
improved inventory products is prohibitively expensive and will never be
possible or affordable in the foreseeable future. Consequently, most
current projects or programs that aim to produce new maps of soils or
other terrestrial entities have explicitly embraced the assumption that
the only feasible way to produce new soil maps is to locate, and make
use of, existing legacy data consisting of previously reported field
observations or existing laboratory analysed field samples. However,
recent activities in Africa (www.Africasoils.net), for example, have
demonstrated conclusively that it is feasible, affordable and beneficial
to collect new field observations and samples and to analyse new soil
samples affordably and to a high standard [@ShepherdWalsh2007JNIS].
### Preparation of point data
Import of basic O&M field data (e.g. soil point data) can be time
consuming and require intensive, often manual, checking and harmonization.
Communicating with the original data producers is highly recommended to
reduce errors during import. Getting original data producers involved
can be best achieved by inviting them to become full participants (e.g.
join in joint publications) or by at least providing adequate and
visible acknowledgement (e.g. listing names and affiliations in metadata
or on project websites).
Documenting all import, filtering and translation steps applied to
source data is highly recommended, as these steps can then be
communicated to the original field data producers to help filter out
further bugs. We typically generate a single tabular object with the
following properties as the final output of point data preparation:
- Consistent column names are used; metadata explaining column names
is provided,
- All columns contain standardized data (same variable type, same
measurement units) with harmonized values (no significant bias in
values from sub-methods),
- All artifacts, outliers and typos have been identified and corrected
to the best of our ability,
- Missing values have been imputed (replaced with estimated values) as
much as possible,
- Spatial coordinates, including depths, (x,y,z) are available for all
rows (point locations).
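
A minimal sketch of such an import / harmonization step is given below; the file, column and variable names are hypothetical and need to be adjusted to the actual source data:

```{r, eval=FALSE}
## Import and harmonize legacy soil point data (all names hypothetical):
pnts <- read.csv("legacy_soil_points.csv", stringsAsFactors = FALSE)
## standardize column names and document them in the metadata:
names(pnts)[names(pnts) == "PH_H2O"] <- "ph.h2o"
## flag obvious typos / artifacts, e.g. pH outside the feasible range:
pnts$ph.h2o[pnts$ph.h2o < 2 | pnts$ph.h2o > 11] <- NA
## keep only rows with complete spatial coordinates, including depth (x,y,z):
pnts <- pnts[complete.cases(pnts[, c("longitude", "latitude", "depth")]), ]
```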
### Preparation of covariates
As mentioned previously, preparation of covariate layers can require
significant effort, even if RS data is publicly available and well
documented. For example, MODIS land products are among the most used RS
layers for global to regional PSM. Raw reflectance data, such as
the mid-infrared MODIS bands from a single day, can, however, be of
limited use for soil mapping in areas with dynamic vegetation, i.e. with
strong seasonal changes in vegetation cover. To account for seasonal
fluctuation and for inter-annual variations in surface reflectance, we
instead advise using long-term temporal signatures of the soil surface
derived as monthly averages from long-term MODIS imagery (18+ years of
data). We assume here that, for each location in the world, long-term
average seasonal signatures of surface reflectance or vegetation index
provide a better indication of site environmental characteristics than
just a single day snapshot of surface reflectance. Computing temporal
signatures of the land surface requires a large investment of time
(comparable to the generation of climatic images vs temporary weather
maps), but it is possibly the only way to effectively represent the cumulative
influence of living organisms on soil formation [@Hengl2017SoilGrids250m].
Typical operations to prepare soil covariates for PSM thus include:
- Downloading the original source RS data,
- Filtering missing pixels using neighborhood filters and/or simple
rules,
- Running aggregation functions (usually via some tiling system),
- Running hydrological and morphological analysis on source DEM data,
- Calculating a Gaussian pyramid, for some relevant covariates, at
multiple coarser resolutions, in order to capture multi-scale
variation at appropriate (longer range) process scales,
- Preparing final mosaics to be used for PSM (e.g. converting to GeoTIFF
and compressing using internal compression `"COMPRESS=DEFLATE"` or
similar).
For processing the covariates we currently use a combination of Open
Source GIS software, primarily SAGA GIS, GRASS GIS, Whitebox tools, the R
packages raster, sp and GSIF, and GDAL for reprojecting, mosaicking and
merging tiles. SAGA GIS and GDAL were found to be highly suitable for
processing massive data sets, as parallelization of computing was
relatively easy to implement.
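
Stacking covariates to a common grid can, for example, be scripted by calling GDAL from R; a minimal sketch (file names, target projection, resolution and bounding box are hypothetical):

```{r, eval=FALSE}
## Resample covariate GeoTIFFs to a common grid by calling gdalwarp
## from R (file names, projection and bounding box are hypothetical):
in.tifs <- c("dem.tif", "ndvi_m.tif", "precip.tif")
te <- c(356000, 4445000, 420000, 4525000)  # Xmin Ymin Xmax Ymax
for(i in in.tifs){
  system(paste0('gdalwarp ', i, ' stack100m_', i,
    ' -t_srs "EPSG:32633" -tr 100 100 -r bilinear -te ',
    paste(te, collapse = " "), ' -co "COMPRESS=DEFLATE"'))
}
```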
Preparation of covariate layers is completed once:
- all layers have been resampled to exactly the same grid resolution
and spatial reference frame (downscaling or aggregation applied
where necessary),
- all layers are complete (present for >95% of the soil mask at
least; the remaining 5% of missing pixels can usually be filled in
using some algorithm; see the sketch after this list),
- there are no visibly obvious artifacts or blunders in the input
covariate layers.
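
A minimal sketch of such gap-filling, using a neighborhood (focal) filter from the raster package (the file name is hypothetical):

```{r, eval=FALSE}
## Fill in sporadic missing pixels with the mean of the 3x3 neighborhood;
## NAonly = TRUE ensures only missing cells are replaced (file name hypothetical):
library(raster)
r <- raster("covariate.tif")
r.f <- focal(r, w = matrix(1, 3, 3), fun = mean,
             na.rm = TRUE, pad = TRUE, NAonly = TRUE)
```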
### Soil mask and the grid system
We recommend using a raster mask file to define the spatial domain of
interest (i.e. total number of pixels to be mapped), and the spatial
reference framework for PSM. The mask file defines the maximum extent,
or bounds, of the area for which predictions will be made. It also
identifies any grid cells, within the maximum bounds, which are to be
excluded from prediction for various reasons (e.g. water, ice or human
disturbance). Finally, the mask file establishes the resolution (pixel
size) and spatial coordinate system that all other layers included in
the analysis must conform to, to ensure consistent overlay of all grids.
In most of our PSM projects we typically restrict ourselves to making
predictions only for pixels that exhibit some evidence of having
photosynthetically active vegetative cover at some point in time. We tend to exclude from
prediction any grid cells that have no evidence of vegetative cover at
any time, such as permanent bodies of water or ice, bare rock and man
made features such as roads, bridges and buildings. A generic definition
of a soil mask can differ somewhat from the one we use, but this has
been our practice.
```{r soil-mask-scheme, echo=FALSE, fig.cap="Example of a soil (land) mask scheme.", out.width="85%"}
knitr::include_graphics("figures/Fig_soil_mask_scheme.png")
```
From the perspective of global soil mapping, any terrestrial location on
Earth can be considered to belong to one and only one of six categories
(excluding oceans):
A. Fresh water (lakes, rivers)
B. Permanent ice or snow
C. Bare rocks
D. Urban areas
E. Bare soil areas
F. Soils with vegetation cover
This gives the following formulas:
F = Land mask - ( A + B + C + D + E )
Soil mask = D + E + F
Hence the values in the soil mask can be typically coded as:
- 0 = NA or non-soil
- 1 = soils with vegetation cover
- 2 = urban areas
- 3 = bare soil areas
If no other layers are available, global maps of land cover can be used
to produce a soil mask file (following the simple formula from above).
Some known global land cover layers are:
- 300 m resolution: ESA CCI Land cover — 300 m annual global land
cover time series from 1992 to 2015
(https://www.esa-landcover-cci.org/),
- 100 m resolution: ESA PROBA-V 100 m land cover map
(http://land.copernicus.eu/global),
- 30 m resolution: Chinese GLC data product (GlobeLand30) with 10
classes for the years 2000 and 2010
(http://www.globallandcover.com).
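
A minimal sketch of deriving a soil mask from such a land cover map by reclassification; the land cover class codes below are hypothetical and must be matched to the legend of the actual product:

```{r, eval=FALSE}
## Reclassify a land cover map into soil mask values 0/1/2/3
## (class codes are hypothetical; check the product legend):
library(raster)
lc <- raster("land_cover.tif")
rcl <- matrix(c(
   10, 1,   # cropland            -> soils with vegetation cover
   60, 1,   # forest              -> soils with vegetation cover
  190, 2,   # urban               -> urban areas
  200, 3,   # bare areas          -> bare soil areas
  210, 0,   # water bodies        -> non-soil
  220, 0),  # permanent snow/ice  -> non-soil
  ncol = 2, byrow = TRUE)
soil.mask <- reclassify(lc, rcl)
writeRaster(soil.mask, "soil_mask.tif", datatype = "INT1U",
            options = "COMPRESS=DEFLATE")
```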
Using widely accepted, published, global land cover maps to define a
soil mask is highly recommended. This allows users to validate the maps
and also ensures future consistency in case there is a need in the
future to merge multiple maps covering larger areas.
Another important technical consideration for a PSM project is the
**grid system**. The grid system is defined by the bounding box, pixel
size and number of rows and columns:
* Xmin, Xmax, Ymin, Ymax,
* Spatial resolution in m (projected),
* Spatial resolution in DD,
* Number of columns (X) and rows (Y).
Maps referenced by geographical coordinates
([*EPSG:4326*](http://spatialreference.org/ref/epsg/wgs-84/); used by
the GPS satellite navigation system and for NATO military geodetic
surveying) have spatial resolution given in abstract decimal degrees
(which do not relate 1:1 with metric resolution). Some standard spatial
resolutions (in decimal degrees) can be derived using the following
simple rules of thumb (d.d. = decimal degrees):
- 30 m ≈ 1/4000 d.d. = 0.00025
- 100 m ≈ 1/1200 d.d. = 0.0008333333
- 250 m ≈ 1/480 d.d. = 0.002083333
- 500 m ≈ 1/240 d.d. = 0.004166667
- 1 km ≈ 1/120 d.d. = 0.008333333
Again, these are only approximate conversions. Differences in resolution
in x/y coordinates in projected 2D space and geographical coordinates
can be large, especially around poles and near the equator.
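
A grid system in geographic coordinates can be defined in R as follows (the bounding box below is illustrative):

```{r, eval=FALSE}
## Define an (empty) prediction grid at 250 m (ca. 1/480 d.d.) resolution
## in geographic coordinates; the bounding box is illustrative:
library(raster)
grd <- raster(xmn = 14, xmx = 16, ymn = 45, ymx = 47,
              resolution = 1/480,
              crs = "+proj=longlat +datum=WGS84")
ncol(grd); nrow(grd)   # number of columns (X) and rows (Y)
```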
Another highly recommended convention is to use some widely accepted
equal-area projection system for all intermediate and final output maps. This ensures the best
possible precision in determining area measures, which is often
important e.g. for derivation of total stocks, volumes of soil and soil
components and similar. Every country tends to use a specific equal-area
projection system for its mapping, which is usually available from the
national mapping agency. For continental scale maps we recommend using
e.g. the [*Equi7 grid system*](https://github.com/TUW-GEO/Equi7Grid).
Some recognized advantages of the Equi7 Grid system are:
- The projections of the Equi7 Grid are equidistant and hence suitable
for various geographic analyses, especially for derivation of
buffer distances and for hydrological DEM modeling, i.e. to derive
all DEM-based soil covariates,
- Areal and shape distortions stemming from the Equi7 Grid projection
are relatively small, yielding a small grid oversampling factor,
- The Equi7 Grid system ensures an efficient raster data storage while
suppressing inaccuracies during spatial transformation.
### Uncertainty of PSM maps
For soil maps to be considered trustworthy and used appropriately,
producers are often required to report mapping accuracy (usually per
soil variable) and identify limitations of the produced maps.
There are many measures of mapping accuracy, but usually these can be
grouped around the following two approaches:
1. Prediction intervals at each prediction point, i.e. lower and upper
limits for 90% probability range.
2. Global (whole-map) measures of the mapping accuracy (RMSE, ME, CCC,
z-scores, variogram of CV residuals).
The mean width of prediction intervals and global measures of mapping
accuracy should, in principle, match, although it is possible that the
mean width of prediction intervals can often be somewhat wider (a
consequence of extrapolation). In some cases, measures of uncertainty can
be over-optimistic or biased (which will eventually be exposed by new
observations), which can decrease confidence in the product; hence
providing realistic estimates of uncertainty is often
equally as important as optimizing predictions.
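
The global (whole-map) measures listed above are straightforward to derive from cross-validation results; a minimal sketch (observed and predicted values are simulated here for illustration):

```{r, eval=FALSE}
## Whole-map accuracy measures from cross-validation results
## (observed / predicted values simulated for illustration):
obs <- rnorm(500, mean = 6, sd = 1)
pred <- obs + rnorm(500, mean = 0.05, sd = 0.4)
ME <- mean(pred - obs)               # mean error (bias)
RMSE <- sqrt(mean((pred - obs)^2))   # root mean square error
## Lin's concordance correlation coefficient (CCC):
CCC <- 2 * cov(obs, pred) /
  (var(obs) + var(pred) + (mean(obs) - mean(pred))^2)
c(ME = ME, RMSE = RMSE, CCC = CCC)
```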
Common approaches to improving the accuracy of predicted maps i.e.
narrowing down the prediction intervals are to (a) collect new
additional data at point locations where models perform the poorest
(e.g. exhibit the widest prediction intervals), and (b) invest in
preparing more meaningful covariates, especially finer resolution
covariates. Technical specifications, however, influence the production
costs and have to be considered carefully as production costs can
significantly increase with e.g. finer pixel size. Aiming at 30% lower
RMSE might seem trivial but the costs of such improvement could exceed
the original budget by several times [@Hengl2013JAG].
### Computing costs
To achieve efficient computing, experienced data scientists understand
the importance of utilizing the available hardware
to its full capacity (100%). This usually implies that:
- the most up-to-date software is used for all computing tasks,
- the software is installed in such a way that it can achieve
maximum computing capacity,
- any function, or process, that can be parallelized in theory is
also parallelized in practice (see the sketch after this list),
- running functions on the system will not result in system
shutdowns, failures or artifacts.
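
A sketch of how such parallelization typically looks in practice, using the parallel package to run predictions tile-by-tile (the tile list and prediction function are hypothetical placeholders):

```{r, eval=FALSE}
## Run predictions tile-by-tile in parallel on all available cores;
## 'tile.lst' and 'predict_tile' are hypothetical placeholders:
library(parallel)
tile.lst <- paste0("T", 1:1000)   # tile IDs covering the soil mask
predict_tile <- function(i){
  ## ...load covariates for tile i, predict, write GeoTIFF...
  invisible(i)
}
## mclapply forks processes (on Windows use parLapply instead):
out <- mclapply(tile.lst, predict_tile, mc.cores = detectCores())
```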
As mentioned previously, applying PSM for large areas at finer
resolutions (billions of pixels) benefits from use of a high performance
computing (HPC) server to run overlay, model fitting and predictions and
to then generate mosaics. The code presented in this PSM with R
book is largely (ca. 90%) optimized so that the most important
functions can easily be scaled up. The total time required to run one
global update on a single dedicated HPC server (e.g. via Amazon AWS) for a
soil mask that contains >100 million pixels can
require weeks of computing time. Copying and uploading files can also be
a lengthy process.
A configuration we adopt, and recommend, for processing large stacks of
grids with a large number of evidence points is e.g. the [OVH server](https://www.ovh.nl/dedicated_servers/HG/):
- [*EG-512-H*](https://www.ovh.nl/dedicated_servers/infrastructure/1801eg08.xml)
(512 GB RAM; one global update takes ca. 3 weeks of computing; costs ca.
€950 per month)
An alternative to using OVH is the Amazon AWS (Fig. \@ref(fig:aws-htop-server)).
An Amazon AWS server with a similar configuration might appear to cost much more than an OVH server
(especially if used continuously over a one-month period), but Amazon permits
computing costs to be paid by the hour, which provides more flexibility
for less intensive users. As a rule of thumb, a dedicated server at
Amazon AWS, if used continuously 100% for the whole month, could cost
up to 2.5 times more than an OVH server.
The recommended server for running PSM on Amazon AWS to produce predictions for billions of pixels is:
- [*AWS m4.16xlarge*](https://aws.amazon.com/ec2/pricing/on-demand/)
(\$3.84 per Hour);
```{r aws-htop-server, echo=FALSE, fig.cap="Example of an AWS dedicated server running spatial predictions on 96 threads and using almost 500GB of RAM. Renting out this server can cost up to 8 USD per hour.", out.width="100%"}
knitr::include_graphics("figures/Fig_htop_96treads.png")
```
An HPC server should also have at least 2–3 TB of hard disk space to host
all input and output data. In addition to computing costs, one also
needs to carefully consider web hosting and web traffic costs. For large
data sets these can almost equal actual computing production costs.
## Final delivery of maps
### Delivery data formats
A highly suitable and flexible data format for delivering raster images
of soil variables is GeoTIFF. We prefer using this format for sharing
raster data for the following reasons [@mitchell2014geospatial]:
- It is GDAL's default data format and much functionality for
subsetting, reprojecting, reading and writing GeoTIFFs already
exists (see [*GDAL
utils*](http://www.gdal.org/gdal_utilities.html)).
- It supports internal compression via creation options (e.g.
`COMPRESS=DEFLATE`).
- Extensive overlay, subset, index, translate functionality is
available via GDAL and other open source software. Basically, the
GeoTIFF format functions as a raster DB.
By exporting spatial data to GeoTIFFs, one can create a soil spatial DB
or a soil information system. This does not necessarily mean that its
targeted users will be able to find all information without problems
and/or questions. The usability and popularity of a data set reflect
many considerations in addition to data quality.
Another useful aspect of final delivery of maps is compression of the
GeoTIFFs. To avoid large file sizes, we recommend always using integers
inside GeoTIFF formats because floating point formats can result in
increases in file sizes of up to four times (with no increase in accuracy).
This might require multiplication of original values of the soil
property of interest by 10 or 100, in order to maintain precision and
accuracy (e.g. multiply pH values by 10 before exporting your raster
into integer GeoTIFF format).
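
For example, a predicted pH map could be exported as a compressed integer GeoTIFF as follows (object and file names hypothetical):

```{r, eval=FALSE}
## Export predicted pH as integer GeoTIFF: multiply by 10 and round,
## so e.g. pH 6.4 is stored as 64 (object / file names hypothetical):
library(raster)
ph <- raster("ph_predicted.tif")
writeRaster(round(ph * 10), "ph_10.tif", datatype = "INT2S",
            options = "COMPRESS=DEFLATE", overwrite = TRUE)
```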
### General recommendations
Even maps of perfect quality might still not attract users, if they are
not properly designed. Some things to consider to increase both use and
usability of map data are:
1. Make a landing page for your map data that includes: (a) simple
access/download instructions, (b) screenshots of your data in
action (people prefer visual explanations with examples), (c)
links to key documents explaining how the data were produced,
and (d) workflows explaining how to request support (who to
contact and how).
2. Make data accessible from multiple independent systems e.g. via WCS,
FTP and through a mirror site (in case one of the access sites
goes offline). This might be inefficient considering there will be
multiple copies of the same data, but it can quadruple data usage.
3. Explain the data formats used to share data, and provide tutorials,
for both beginners and advanced users, that instruct how to access
and use the data.
4. Consider installing and using a version control system (or simply
use github or a similar repository) so that the users can track
earlier versions of map data.
5. Consider closely following principles of [reproducible research](https://ropensci.org/blog/2014/06/09/reproducibility/)
(all processing steps, inputs and outputs are accessible). For
example, making the R code available via github so that anyone is
theoretically able to reproduce all examples shown in the text.
Transparency increases trust.
### Technical specifications of a PSM project
A way to improve planning of a PSM project is to spend more time on preparing
the technical aspects of data production. This includes listing the general
specifications of the study area, listing target variables and their
collection methods, listing covariate layers of interest to be used to improve
mapping accuracy and listing targeted spatial prediction algorithms to be
compared.
General specifications of the study area include:
* G.1 Project title:
* G.2 PSM project type:
- PSM project in a new area
- PSM project using legacy points
- PSM project aiming at optimizing predictions and usability
* G.3 Target spatial resolution:
- 10 m
- 30 m
- 100 m
- 250 m
- 1000 m
* G.4 Target temporal span (period of interest):
- Begin date,
- End date,
* G.5 Soil mask:
- *raster image or land cover classes in the reference land cover map covering the study area*
* G.6 Grid definition:
- Xmin,
- Ymin,
- Xmax,
- Ymax,
* G.7 Target projection system:
- [*proj4*](https://proj4.org/) code,
* G.8 Total area:
- in square-km,
* G.9 Inspection density (observations per square-km):
- Detailed soil profiles,
- Soil semi-profiles,
- Top-soil / sub-soil samples (with laboratory analysis),
- Quick observations (no lab data),
* G.10 Total budget (planned):
* G.11 Total pixels in millions:
- number of pixels for all predictions
* G.12 Total planned production costs per 1M pixels (divide G.10 by G.11):
* G.13 Target data license:
* G.14 Target user groups and workflows (targeted services):
* G.15 Further updates of maps:
- Continuous updates in regular intervals,
- Two prediction time intervals (start, end period),
- No updates required except fixes and corrections,
* G.16 Commercialization of the PSM outputs:
- No commercial data nor services,
- Commercial data products,
- Commercial services,
* G.17 Support options:
- Dedicated staff / live contact,
- Mailing list,
- Github / code repository issues,
### Standard soil data production costs
Standard production costs can be roughly split into three main
categories:
- Fixed costs (e.g. project initiation, equipment, materials,
workshops etc),
- Variable data production costs expressed per:
- M (million) of pixels of data produced,
- Number of point samples,
- Number of variables modeled,
- Data maintenance and web-serving costs, usually expressed as
monthly/yearly costs,
Although in the introduction chapter we mentioned that the production costs are mainly a function of
grid resolution i.e. cartographic scale, in practice several other factors determine the total costs.
Standard soil data production costs (approximate estimates) depend primarily
on the targeted quality level of the output maps.
Consider that there are four main quality levels:
* L0 = initial product with only a few soil properties, no quality/accuracy requirements,
* L1 = final complete product with no quality/accuracy requirements,
* L2 = final complete product matching standard accuracy requirements,
* L3 = final complete certified product according to the ISO or similar standards.
```{r psm-table-costs, echo=FALSE}
knitr::kable(
head(read.csv("extdata/psm_table_costs.csv", header = TRUE, stringsAsFactors = FALSE), 10), booktabs = TRUE,
caption = 'Example of a cost table for PSM products at various quality levels. Prices expressed in USD / Mpix (million pixels produced).'
)
```
To convert average costs per M pixels to total costs we can use the following
calculation:
- Pixel resolution = 100 m
- USA48 area = 8,080,464.3 square-km
- Total pixels for 6 depths and 3 soil properties = 14,544 Mpix
- Average production costs (L1) = 0.8 US\$ / Mpix
- Total production costs for a PSM project using legacy points (single state,
L1) = 11,635 US\$
- Average production costs (L2) = 2 US\$ / Mpix
- Total production costs for a PSM project using legacy points (single state,
L2) = 29,088 US\$
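
This calculation is easy to reproduce in R:

```{r, eval=FALSE}
## Reproduce the production cost estimate from the list above:
area.km2 <- 8080464.3                            # USA48 area in square-km
Mpix <- area.km2 * (1000 / 100)^2 * 6 * 3 / 1e6  # 6 depths x 3 properties
round(Mpix)        # ca. 14,545 Mpix
round(Mpix * 0.8)  # total costs at L1 (0.8 USD / Mpix)
round(Mpix * 2)    # total costs at L2 (2 USD / Mpix)
```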
Note: this is a very generic estimate of the production costs and actual
numbers might be significantly different. Additional fixed costs +
monthly/yearly costs need to be added to these numbers to account also
for any web hosting, support or update costs.
Compare these costs with the following standard estimated costs to
deliver completed conventional manual soil survey products (see also section \@ref(mapping-accuracy-soil-survey-costs)):
- USDA estimate of the total soil survey costs: 4 US\$ per ha for
1:20,000 scale maps [@Durana2008] to map USA48 area = 3.2 billion
US\$,
- New Zealand: 4 US\$ per ha for 1:20,000 scale maps [@Carrick2010WCSS],
- Canada: 3–10 CA\$ per ha for 1:20,000 scale maps [@MacMillan2010DSM],
## Summary notes
Predictive soil mapping applies statistical and/or machine learning
techniques to fit models for the purpose of producing spatial and/or
spatiotemporal predictions of soil variables i.e. to produce maps of
soil properties or soil classes at various resolutions. This chapter
identifies and discusses some of the key technical specifications users need to
consider to prepare for data production and to obtain realistic
estimates of workloads and production costs.
The key technical specifications of a PSM project are considered to consist of defining
the following: a soil mask, a spatial resolution, a list of target
variables and standard depth intervals (for 3D soil variables),
prediction intervals (if required), any secondary soil variables (and
how they will be derived) and required accuracy levels. Technical
specifications determine the production costs and need to be considered
carefully as production costs are sensitive to specifications (e.g. 3
times finer pixel size can increase production costs up to 10 times, and
setting targets such as 30% lower RMSE can increase costs as either more
points or more covariates, or both, need to be included). General forms
at the end of the chapter provide an example of a detailed list of technical
specifications in relation to target variables and covariate layers
typically used in PSM projects to date.