Area-based approach 2: prediction model calibration

Source: vignettes/area-based.2.model.calibration.Rmd

The code below presents a workflow to calibrate prediction models for the estimation of forest parameters from ALS-derived metrics, using the area-based approach (ABA). The workflow is based on functions from R packages lidaRtRee (tested with version 4.0.7) and lidR (tested with version 4.0.4).

Load data

The “Quatre Montagnes” dataset from France, prepared as described in the data preparation tutorial is loaded from the R archive files located in the folder “vignettes/data/aba.model/output”.

Field data

The file “plots.rda” contains the field data, organized as a data.frame named plots. For subsequent use in the workflow, the data.frame should contain at least two fields: plot_id (unique plot identifier) and a forest stand parameter. Each line in the data.frame corresponds to a field plot. A factor variable is required to calibrate stratified models. Plot coordinates are required for subsequent inference computations.

The provided data set includes one categorical variable: stratum, which corresponds to forest ownership, XY coordinates and three forest stand parameters :

basal area in m²/ha (G_m2_ha),
stem density in /ha (N_ha),
mean diameter at breast height in cm (D_mean_cm).

Scatterplots of stand parameters are presented below, colored by ownership (green for public forest, blue otherwise).

# load plot-level data
load(file = "./data/aba.model/output/plots.rda")
summary(plots)

##    plot_id                X                Y            cluster_id       
##  Length:96          Min.   :895945   Min.   :6439570   Length:96         
##  Class :character   1st Qu.:896989   1st Qu.:6443484   Class :character  
##  Mode  :character   Median :899661   Median :6451211   Mode  :character  
##                     Mean   :899394   Mean   :6450228                     
##                     3rd Qu.:900696   3rd Qu.:6455042                     
##                     Max.   :903303   Max.   :6459820                     
##     G_m2_ha           N_ha           D_mean_cm        stratum  
##  Min.   :15.67   Min.   :  56.59   Min.   :14.70   private:32  
##  1st Qu.:31.22   1st Qu.: 488.08   1st Qu.:19.55   public :64  
##  Median :37.35   Median : 714.43   Median :22.65               
##  Mean   :40.20   Mean   : 810.22   Mean   :24.78               
##  3rd Qu.:44.78   3rd Qu.:1110.55   3rd Qu.:26.31               
##  Max.   :99.18   Max.   :1994.74   Max.   :75.53

# display forest variables
plot(plots[, c("G_m2_ha", "N_ha", "D_mean_cm")],
  col = ifelse(plots$stratum == "public", "green", "blue")
)

ALS data

Normalized ALS point clouds extracted over each plot, as well as terrain statistics previously computed from the ALS ground points can also be prepared according to the data preparation tutorial. Point clouds corresponding to each field plot are organized in a list of LAS objects. Meta data of one LAS object are displayed below.

# list of LAS objects: normalized point clouds inside plot extent
load("./data/aba.model/output/llas_height.rda")
# display one point cloud # lidR::plot(llasn[[1]])
llas_height[[1]]

## class        : LAS (v1.2 format 1)
## memory       : 980.1 Kb 
## extent       : 898185.8, 898225.7, 6455688, 6455728 (xmin, xmax, ymin, ymax)
## coord. ref.  : RGF93 v1 / Lambert-93 
## area         : 1307 m²
## points       : 15.6 thousand points
## density      : 11.95 points/m²
## density      : 9.54 pulses/m²

The first lines of the terrain statistics are displayed hereafter.

##           altitude azimut_gr slope_gr
## Verc-01-1   1096.9      99.1     20.1
## Verc-01-2   1083.4      94.0     16.1
## Verc-01-3   1120.7     104.0     22.9

The following lines ensure that the plots are ordered in the same way in the three data objects.

llas_height <- llas_height[plots$plot_id]
metrics_terrain <- metrics_terrain[plots$plot_id, ]

ALS metrics computation

Two types of vegetation metrics can be computed.

Point cloud metrics are directly computed from the point cloud or from the derived surface model on the whole plot extent. These are the metrics generally used in the area-based approach.
Tree metrics are computed from the characteristics of trees detected in the point cloud (or in the derived surface model). They are more CPU-intensive to compute and require ALS data with higher density, but in some cases they allow a slight improvement in models prediction accuracy.

Removal of unwanted points

The point cloud contains some points with classification code 7 (Low point) and 12 (Reserved). In this case the provider indicated that the 12 code also contains vegetation points: they are left in the point cloud whereas code 7 are removed. Points higher than a height threshold are also removed. Points with negative heights are set to zero.

# define classes to retain
class_points <- c(0, 1, 2, 3, 4, 5, 12)
# height threshold (60 m)
h_points <- 60
# filter points
llas_height <- lapply(llas_height,
                      function(x)
                        lidR::filter_poi(x,
                                         is.element(Classification,
                                                    class_points) &
                                           Z <= h_points))
llas_height <- lapply(llas_height, function(x)
{
  x$Z[x$Z<0] <- 0
  x
}
)

Point cloud metrics

Point cloud metrics are computed with the function lidaRtRee::clouds_metrics(), which applies the function lidR::cloud_metrics() to all point clouds in a list. Default computed metrics are those proposed by the function lidR::stdmetrics() see also the lidR wiki. Additional metrics are available with the function lidaRtRee::aba_metrics(). The buffer points, which are located outside of the plot extent inventoried on the field, should be removed before computing those metrics.

# define function for later use
aba_point_metrics_fun <- ~ lidaRtRee::aba_metrics(Z, Intensity, ReturnNumber,
                                                  Classification, 2)
# create list of point clouds without buffer
llas_height_plot_extent <-
  lapply(llas_height, function(x) {
    lidR::filter_poi(x, buffer == FALSE)
  })
# apply function on each point cloud in list
metrics_points <- lidaRtRee::clouds_metrics(llas_height_plot_extent,
                                            aba_point_metrics_fun)
round(head(metrics_points[, 1:8], n = 3), 2)

##            zmax zmean  zsd zskew zkurt zentropy pzabovezmean pzabove2
## Verc-01-1 33.87 17.45 8.15 -0.41  1.88     0.93        58.62    99.97
## Verc-01-2 29.78 17.49 5.29 -0.57  3.04     0.89        53.24    99.99
## Verc-01-3 29.89 18.66 5.35 -0.54  2.97     0.90        53.38    99.99

Tree metrics

Tree metrics rely on a preliminary detection of trees, which is performed with the lidaRtRee::tree_segmentation() function. For more details, please refer to the tree detection tutorial (vignette("tree.detection".Rmd")). Tree segmentation requires point clouds or canopy height models with an additional buffer in order to avoid border effects when computing tree characteristics. Once trees are detected, metrics are derived with the function lidaRtRee::std_tree_metrics(). A user-specific function can be specified to compute other metrics from the features of detected trees. Plot radius has to be specified as it is required to exclude trees detected outside of the plot, and to compute the plot surface. Tree segmentation is not relevant when the point cloud density is too low, typically below five points per m². The function first computes a canopy height model which default resolution is 0.5 m, but this should be set to 1 m with low point densities.

# resolution of canopy height model (m)
aba_res_chm <- 0.5
# specify plot radius to exclude trees located outside plots
plot_radius <- 15
# compute tree metrics
metrics_trees <- lidaRtRee::clouds_tree_metrics(llas_height,
                                                plots[, c("X", "Y")],
  plot_radius,
  res = aba_res_chm,
  func = function(x) {
    lidaRtRee::std_tree_metrics(x,
      area_ha = pi * plot_radius^2 / 10000
    )
  }
)
round(head(metrics_trees[, 1:5], n = 3), 2)

##           Tree_meanH Tree_sdH Tree_giniH Tree_density TreeInf10_density
## Verc-01-1      27.11     6.39       0.11       226.35             14.15
## Verc-01-2      25.80     3.27       0.06       268.80              0.00
## Verc-01-3      25.74     3.88       0.08       268.80              0.00

Other metrics

In case terrain metrics have been computed from the cloud of ground points only, they can also be added as variables, and so do other environmental variables which might be relevant in modeling.

metrics <- cbind(
  metrics_points[plots$plot_id, ],
  metrics_trees[plots$plot_id, ],
  metrics_terrain[plots$plot_id, 1:3]
)

Model calibration

Calibration for a single variable

Once a dependent variable (forest parameter of interest) has been chosen, the function lidaRtRee::aba_build_model() is used to select the linear regression model that yields the highest adjusted-R² with a defined number of independent variables (metrics), while checking linear model assumptions. A Box-Cox transformation of the dependent variable can be applied to normalize its distribution, or a log transformation of all variables (parameter transform). Model details and cross-validation statistics are available from the returned object.

variable <- "G_m2_ha"
# no subsample in this case
subsample <- 1:nrow(plots)
# model calibration
model_aba <- lidaRtRee::aba_build_model(plots[subsample, variable],
                                        metrics[subsample, ],
                                        transform = "boxcox", nmax = 4,
                                        xy = plots[subsample, c("X", "Y")])

## Reordering variables and trying again:

# renames outputs with variable name
row.names(model_aba$stats) <- variable
# display selected linear regression model
model_aba$model

## 
## Call:
## stats::lm(formula = dep_var ~ zskew + ipcumzq90 + p_hmin + TreeSup30_density, 
##     data = df_transform)
## 
## Coefficients:
##       (Intercept)              zskew          ipcumzq90             p_hmin  
##         6.2316659         -0.1179932         -0.0484946          1.0462538  
## TreeSup30_density  
##         0.0005935

# display calibration and validation statistics
model_aba$stats

##          n                                        formula     adjR2 transform
## G_m2_ha 96 zskew + ipcumzq90 + p_hmin + TreeSup30_density 0.7367451    boxcox
##             lambda     rmse    cvrmse      pwil    pttest      paov       cor
## G_m2_ha -0.1502047 7.948461 0.1977213 0.6438875 0.9923217 0.9968171 0.8375104
##             looR2    var_res
## G_m2_ha 0.7007292 0.01042615

The function computes values predicted in leave-one-out cross-validation, by using the same combination of dependent variables and fitting the regression coefficients with all observations except one. Predicted values can be plotted against field values with the function lidaRtRee::aba_plot(). It is also informative to check the correlation of prediction errors with other forest or environmental variables.

The prediction errors seem positively correlated with basal area.

# check correlation between errors and other variables
round(cor(cbind(model_aba$values$residual,
                plots[subsample,
                      c("G_m2_ha", "N_ha", "D_mean_cm")],
                metrics_terrain[subsample, 1:3])), 2)[1, ]

## model_aba$values$residual                   G_m2_ha                      N_ha 
##                      1.00                      0.51                      0.17 
##                 D_mean_cm                  altitude                 azimut_gr 
##                      0.12                      0.04                     -0.15 
##                  slope_gr 
##                     -0.07

# significance of correlation value
cor.test(model_aba$values$residual, plots[subsample, variable])

## 
##  Pearson's product-moment correlation
## 
## data:  model_aba$values$residual and plots[subsample, variable]
## t = 5.6794, df = 94, p-value = 1.501e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3393545 0.6409830
## sample estimates:
##       cor 
## 0.5054516

Coloring points by ownership shows that plots located in private forests have the largest basal area values.

par(mfrow = c(1, 2))
# plot predicted  VS field values
lidaRtRee::aba_plot(model_aba,
  main = variable,
  col = ifelse(plots$stratum == "public", "green", "blue")
)
legend("topleft", c("public", "private"), col = c("green", "blue"), pch = 1)
plot(model_aba$values$predicted,# plots[subsample, c("G_m2_ha")],
  model_aba$values$residual,
  ylab = "Prediction error", xlab = "Predicted in LOOCV",
  main = variable,
  col = ifelse(plots$stratum == "public", "green", "blue")
)
abline(h = 0, lty = 2)

Calibration for several variables

The following code calibrates models for several forest parameters. In case different transformations have to be performed on the parameters, models have to be calibrated one by one.

models_aba <- list()
for (i in c("G_m2_ha", "D_mean_cm", "N_ha"))
{
  models_aba[[i]] <- lidaRtRee::aba_build_model(plots[, i],
                                                metrics, transform = "boxcox",
                                                nmax = 4,
                                                xy = plots[, c("X", "Y")])
}

## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:

# bind model stats in a data.frame
model_stats <- do.call(rbind, lapply(models_aba, function(x) {
  x[["stats"]]
}))

The obtained models are presented below. The table columns correspond to:

n number of plots,
metrics selected in the model,
adj-R2.% adjusted R-squared of fitted model (%),
CV-R2.% coefficient of determination of values predicted in cross-validation (CV) VS field values (%),
CV-RMSE.% coefficient of variation of the Root Mean Square Errors of prediction in CV (%),
CV-RMSE Root Mean Square Error of prediction in CV.

The two largest (outlier) values of mean diameter are underestimated by the model, which greatly decreases the accuracy statistics. This might be explained by the fact that when trees reach maturity, diameter growth continues while height growth almost stops. As the ALS point cloud mostly contains height information, there is some signal saturation for high mean diameter values. It might also be the case for high biomass values.

	n	metrics	adj-R2.%	CV-R2.%	CV-RMSE.%	CV-RMSE
G_m2_ha	96	zskew + ipcumzq90 + p_hmin + TreeSup30_density	73.7	70.1	19.8	7.9
D_mean_cm	96	zq70 + ipcumzq70 + Tree_density + altitude	82.5	84.7	15.2	3.8
N_ha	96	zentropy + zq95 + p_1st_hmin + TreeSup10_density	90.7	88.3	18.9	153.0

Stratified models

Motivation

When calibrating a statistical relationship between forest stand parameters, which are usually derived from diameter measurements, and ALS metrics, one relies on the hypothesis that the interaction of laser pulses with the leaves and branches structure is constant on the whole area. However, differences can be expected either due to variations in acquisition settings (flight parameters, scanner model), in forests (stand structure and composition) or in topography (slope). Better models might be obtained when calibrating stratum-specific relationships, provided each stratum is more homogeneous regarding the laser interaction with the vegetation. A trade-off has to be achieved between the within-strata homogeneity and the number of available plots for calibration in each stratum. A minimum number of plots is approximately 50, while 100 would be recommended. In this example we hypothesize that ownership reflects both structure and composition differences in forest stands.

Calibration of stratum-specific models

Stratum-specific models are computed and stored in a list during a for loop. The function lidaRtRee::aba_combine_strata() then combines the list of models corresponding to each stratum to compute aggregated statistics for all plots, making it easier to compare stratified with non-stratified models.

# stratification variable
strat <- "stratum"
# create list of models
model_aba_stratified <- list()
# calibrate each stratum model
for (i in levels(plots[, strat]))
{
  subsample <- which(plots[, strat] == i)
  if (length(subsample) > 0) {
    model_aba_stratified[[i]] <- lidaRtRee::aba_build_model(
      plots[subsample, variable],metrics[subsample, ], transform = "boxcox",
      nmax = 4, xy = plots[subsample, c("X", "Y")])
  }
}

## Reordering variables and trying again:

# backup list of models for later use
model_aba_stratified_boxcox <- model_aba_stratified
# combine list of models into single object
model_aba_stratified <- lidaRtRee::aba_combine_strata(model_aba_stratified,
                                                      plots$plot_id)
# model_aba_stratified$stats

	n	metrics	adj-R2.%	CV-R2.%	CV-RMSE.%	CV-RMSE
NOT.STRATIFIED	96	zskew + ipcumzq90 + p_hmin + TreeSup30_density	73.7	70.1	19.8	7.9
private	32	zpcum8 + ikurt + TreeInf10_density	59.9	51.6	19.4	10.4
public	64	p_1st_hmin + Tree_meanH + Tree_density + altitude	60.8	54.4	16.6	5.5
COMBINED	96	NA	NA	73.1	18.7	7.5

Stratified models with stratum-specific variable tranformations

In case one wants to apply different variable transformations, or use different subsets of ALS metrics depending on the strata, the following example can be used. First models using only the point cloud metrics are calibrated without transformation of the data. The statistics for all plots are then calculated by combining the following stratum-specific models :

public ownership, all metrics, Box-Cox transformation of basal area values (calibrated in the previous paragraph),
private ownership, only point cloud metrics, no data transformation.

# create list of models for no transformation
model_aba_stratified.none <- list()
# calibrate each stratum model
for (i in levels(plots[, strat]))
{
  subsample <- which(plots[, strat] == i)
  if (length(subsample) > 0) {
    model_aba_stratified.none[[i]] <-
      lidaRtRee::aba_build_model(plots[subsample, variable],
                                 metrics_points[subsample, ],
                                 transform = "none",
                                 xy = plots[subsample, c("X", "Y")])
  }
}
# combine list of models into single object
model_aba_stratified_mixed <-
  lidaRtRee::aba_combine_strata(
    list(private = model_aba_stratified.none[["private"]],
         public = model_aba_stratified_boxcox[["public"]]), plots$plot_id)
# bind model stats in a data.frame for comparison
model_stats <- rbind(model_aba$stats, model_aba_stratified_mixed$stats)
row.names(model_stats)[1] <- "NOT.STRATIFIED"

	n	metrics	transform	adj-R2.%	CV-R2.%	CV-RMSE.%	CV-RMSE
NOT.STRATIFIED	96	zskew + ipcumzq90 + p_hmin + TreeSup30_density	boxcox	73.7	70.1	19.8	7.9
private	32	zentropy + zpcum9	none	55.0	51.2	19.5	10.5
public	64	p_1st_hmin + Tree_meanH + Tree_density + altitude	boxcox	60.8	54.4	16.6	5.5
COMBINED	96	NA	NA	NA	73.0	18.8	7.6

Save data before next tutorial

The following lines save the data required for the area-based mapping step.

save(model_aba_stratified_mixed, model_aba, aba_point_metrics_fun, aba_res_chm,
     class_points, h_points,
     file = "./data/aba.model/output/models.rda"
)

Jean-Matthieu Monnet