Area-based approach 2: prediction model calibration
Jean-Matthieu Monnet
area-based.2.model.calibration.Rmd
Source: vignettes/area-based.2.model.calibration.Rmd
The code below presents a workflow to calibrate prediction models for
the estimation of forest parameters from ALS-derived metrics, using the
area-based approach (ABA). The workflow is based on functions from
R
packages lidaRtRee
(tested with version 4.0.7) and lidR (tested with
version 4.0.4).
Load data
The “Quatre Montagnes” dataset from France, prepared as described in
the data preparation
tutorial is loaded from the R
archive files located in
the folder “vignettes/data/aba.model/output”.
Field data
The file “plots.rda” contains the field data, organized as a
data.frame named plots
. For subsequent use in the workflow,
the data.frame should contain at least two fields: plot_id
(unique plot identifier) and a forest stand parameter. Each line in the
data.frame corresponds to a field plot. A factor variable is required to
calibrate stratified models. Plot coordinates are required for
subsequent inference computations.
The provided data set includes one categorical variable:
stratum
, which corresponds to forest ownership, XY
coordinates and three forest stand parameters :
- basal area in m2/ha (
G_m2_ha
), - stem density in /ha (
N_ha
), - mean diameter at breast height in cm (
D_mean_cm
).
Scatterplots of stand parameters are presented below, colored by ownership (green for public forest, blue otherwise).
## plot_id X Y cluster_id
## Length:96 Min. :895945 Min. :6439570 Length:96
## Class :character 1st Qu.:896989 1st Qu.:6443484 Class :character
## Mode :character Median :899661 Median :6451211 Mode :character
## Mean :899394 Mean :6450228
## 3rd Qu.:900696 3rd Qu.:6455042
## Max. :903303 Max. :6459820
## G_m2_ha N_ha D_mean_cm stratum
## Min. :15.67 Min. : 56.59 Min. :14.70 private:32
## 1st Qu.:31.22 1st Qu.: 488.08 1st Qu.:19.55 public :64
## Median :37.35 Median : 714.43 Median :22.65
## Mean :40.20 Mean : 810.22 Mean :24.78
## 3rd Qu.:44.78 3rd Qu.:1110.55 3rd Qu.:26.31
## Max. :99.18 Max. :1994.74 Max. :75.53
ALS data
Normalized ALS point clouds extracted over each plot, as well as terrain statistics previously computed from the ALS ground points can also be prepared according to the data preparation tutorial. Point clouds corresponding to each field plot are organized in a list of LAS objects. Meta data of one LAS object are displayed below.
# list of LAS objects: normalized point clouds inside plot extent
load("./data/aba.model/output/llas_height.rda")
# display one point cloud # lidR::plot(llasn[[1]])
llas_height[[1]]
## class : LAS (v1.2 format 1)
## memory : 980.1 Kb
## extent : 898185.8, 898225.7, 6455688, 6455728 (xmin, xmax, ymin, ymax)
## coord. ref. : RGF93 v1 / Lambert-93
## area : 1307 m²
## points : 15.6 thousand points
## density : 11.95 points/m²
## density : 9.54 pulses/m²
The first lines of the terrain statistics are displayed hereafter.
## altitude azimut_gr slope_gr
## Verc-01-1 1096.9 99.1 20.1
## Verc-01-2 1083.4 94.0 16.1
## Verc-01-3 1120.7 104.0 22.9
The following lines ensure that the plots are ordered in the same way in the three data objects.
llas_height <- llas_height[plots$plot_id]
metrics_terrain <- metrics_terrain[plots$plot_id, ]
ALS metrics computation
Two types of vegetation metrics can be computed.
- Point cloud metrics are directly computed from the point cloud or from the derived surface model on the whole plot extent. These are the metrics generally used in the area-based approach.
- Tree metrics are computed from the characteristics of trees detected in the point cloud (or in the derived surface model). They are more CPU-intensive to compute and require ALS data with higher density, but in some cases they allow a slight improvement in models prediction accuracy.
Removal of unwanted points
The point cloud contains some points with classification code 7 (Low point) and 12 (Reserved). In this case the provider indicated that the 12 code also contains vegetation points: they are left in the point cloud whereas code 7 are removed. Points higher than a height threshold are also removed. Points with negative heights are set to zero.
# define classes to retain
class_points <- c(0, 1, 2, 3, 4, 5, 12)
# height threshold (60 m)
h_points <- 60
# filter points
llas_height <- lapply(llas_height,
function(x)
lidR::filter_poi(x,
is.element(Classification,
class_points) &
Z <= h_points))
llas_height <- lapply(llas_height, function(x)
{
x$Z[x$Z<0] <- 0
x
}
)
Point cloud metrics
Point cloud metrics are computed with the function
lidaRtRee::clouds_metrics()
, which applies the function
lidR::cloud_metrics()
to all point clouds in a list.
Default computed metrics are those proposed by the function
lidR::stdmetrics()
see also the lidR
wiki. Additional metrics are available with the function
lidaRtRee::aba_metrics()
. The buffer points, which are
located outside of the plot extent inventoried on the field, should be
removed before computing those metrics.
# define function for later use
aba_point_metrics_fun <- ~ lidaRtRee::aba_metrics(Z, Intensity, ReturnNumber,
Classification, 2)
# create list of point clouds without buffer
llas_height_plot_extent <-
lapply(llas_height, function(x) {
lidR::filter_poi(x, buffer == FALSE)
})
# apply function on each point cloud in list
metrics_points <- lidaRtRee::clouds_metrics(llas_height_plot_extent,
aba_point_metrics_fun)
round(head(metrics_points[, 1:8], n = 3), 2)
## zmax zmean zsd zskew zkurt zentropy pzabovezmean pzabove2
## Verc-01-1 33.87 17.45 8.15 -0.41 1.88 0.93 58.62 99.97
## Verc-01-2 29.78 17.49 5.29 -0.57 3.04 0.89 53.24 99.99
## Verc-01-3 29.89 18.66 5.35 -0.54 2.97 0.90 53.38 99.99
Tree metrics
Tree metrics rely on a preliminary detection of trees, which is
performed with the lidaRtRee::tree_segmentation()
function.
For more details, please refer to the tree detection tutorial
(vignette("tree.detection".Rmd")
). Tree segmentation
requires point clouds or canopy height models with an additional buffer
in order to avoid border effects when computing tree characteristics.
Once trees are detected, metrics are derived with the function
lidaRtRee::std_tree_metrics()
. A user-specific function can
be specified to compute other metrics from the features of detected
trees. Plot radius has to be specified as it is required to exclude
trees detected outside of the plot, and to compute the plot surface.
Tree segmentation is not relevant when the point cloud density is too
low, typically below five points per m2. The function first
computes a canopy height model which default resolution is 0.5 m, but
this should be set to 1 m with low point densities.
# resolution of canopy height model (m)
aba_res_chm <- 0.5
# specify plot radius to exclude trees located outside plots
plot_radius <- 15
# compute tree metrics
metrics_trees <- lidaRtRee::clouds_tree_metrics(llas_height,
plots[, c("X", "Y")],
plot_radius,
res = aba_res_chm,
func = function(x) {
lidaRtRee::std_tree_metrics(x,
area_ha = pi * plot_radius^2 / 10000
)
}
)
round(head(metrics_trees[, 1:5], n = 3), 2)
## Tree_meanH Tree_sdH Tree_giniH Tree_density TreeInf10_density
## Verc-01-1 27.11 6.39 0.11 226.35 14.15
## Verc-01-2 25.80 3.27 0.06 268.80 0.00
## Verc-01-3 25.74 3.88 0.08 268.80 0.00
Other metrics
In case terrain metrics have been computed from the cloud of ground points only, they can also be added as variables, and so do other environmental variables which might be relevant in modeling.
metrics <- cbind(
metrics_points[plots$plot_id, ],
metrics_trees[plots$plot_id, ],
metrics_terrain[plots$plot_id, 1:3]
)
Model calibration
Calibration for a single variable
Once a dependent variable (forest parameter of interest) has been
chosen, the function lidaRtRee::aba_build_model()
is used
to select the linear regression model that yields the highest
adjusted-R2 with a defined number of independent variables
(metrics), while checking linear model assumptions. A Box-Cox
transformation of the dependent variable can be applied to normalize its
distribution, or a log transformation of all variables (parameter
transform
). Model details and cross-validation statistics
are available from the returned object.
variable <- "G_m2_ha"
# no subsample in this case
subsample <- 1:nrow(plots)
# model calibration
model_aba <- lidaRtRee::aba_build_model(plots[subsample, variable],
metrics[subsample, ],
transform = "boxcox", nmax = 4,
xy = plots[subsample, c("X", "Y")])
## Reordering variables and trying again:
# renames outputs with variable name
row.names(model_aba$stats) <- variable
# display selected linear regression model
model_aba$model
##
## Call:
## stats::lm(formula = dep_var ~ zskew + ipcumzq90 + p_hmin + TreeSup30_density,
## data = df_transform)
##
## Coefficients:
## (Intercept) zskew ipcumzq90 p_hmin
## 6.2316659 -0.1179932 -0.0484946 1.0462538
## TreeSup30_density
## 0.0005935
# display calibration and validation statistics
model_aba$stats
## n formula adjR2 transform
## G_m2_ha 96 zskew + ipcumzq90 + p_hmin + TreeSup30_density 0.7367451 boxcox
## lambda rmse cvrmse pwil pttest paov cor
## G_m2_ha -0.1502047 7.948461 0.1977213 0.6438875 0.9923217 0.9968171 0.8375104
## looR2 var_res
## G_m2_ha 0.7007292 0.01042615
The function computes values predicted in leave-one-out
cross-validation, by using the same combination of dependent variables
and fitting the regression coefficients with all observations except
one. Predicted values can be plotted against field values with the
function lidaRtRee::aba_plot()
. It is also informative to
check the correlation of prediction errors with other forest or
environmental variables.
The prediction errors seem positively correlated with basal area.
# check correlation between errors and other variables
round(cor(cbind(model_aba$values$residual,
plots[subsample,
c("G_m2_ha", "N_ha", "D_mean_cm")],
metrics_terrain[subsample, 1:3])), 2)[1, ]
## model_aba$values$residual G_m2_ha N_ha
## 1.00 0.51 0.17
## D_mean_cm altitude azimut_gr
## 0.12 0.04 -0.15
## slope_gr
## -0.07
# significance of correlation value
cor.test(model_aba$values$residual, plots[subsample, variable])
##
## Pearson's product-moment correlation
##
## data: model_aba$values$residual and plots[subsample, variable]
## t = 5.6794, df = 94, p-value = 1.501e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3393545 0.6409830
## sample estimates:
## cor
## 0.5054516
Coloring points by ownership shows that plots located in private forests have the largest basal area values.
par(mfrow = c(1, 2))
# plot predicted VS field values
lidaRtRee::aba_plot(model_aba,
main = variable,
col = ifelse(plots$stratum == "public", "green", "blue")
)
legend("topleft", c("public", "private"), col = c("green", "blue"), pch = 1)
plot(model_aba$values$predicted,# plots[subsample, c("G_m2_ha")],
model_aba$values$residual,
ylab = "Prediction error", xlab = "Predicted in LOOCV",
main = variable,
col = ifelse(plots$stratum == "public", "green", "blue")
)
abline(h = 0, lty = 2)
Calibration for several variables
The following code calibrates models for several forest parameters. In case different transformations have to be performed on the parameters, models have to be calibrated one by one.
models_aba <- list()
for (i in c("G_m2_ha", "D_mean_cm", "N_ha"))
{
models_aba[[i]] <- lidaRtRee::aba_build_model(plots[, i],
metrics, transform = "boxcox",
nmax = 4,
xy = plots[, c("X", "Y")])
}
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
# bind model stats in a data.frame
model_stats <- do.call(rbind, lapply(models_aba, function(x) {
x[["stats"]]
}))
The obtained models are presented below. The table columns correspond to:
-
n
number of plots, -
metrics
selected in the model, -
adj-R2.%
adjusted R-squared of fitted model (%), -
CV-R2.%
coefficient of determination of values predicted in cross-validation (CV) VS field values (%), -
CV-RMSE.%
coefficient of variation of the Root Mean Square Errors of prediction in CV (%), -
CV-RMSE
Root Mean Square Error of prediction in CV.
The two largest (outlier) values of mean diameter are underestimated by the model, which greatly decreases the accuracy statistics. This might be explained by the fact that when trees reach maturity, diameter growth continues while height growth almost stops. As the ALS point cloud mostly contains height information, there is some signal saturation for high mean diameter values. It might also be the case for high biomass values.
n | metrics | adj-R2.% | CV-R2.% | CV-RMSE.% | CV-RMSE | |
---|---|---|---|---|---|---|
G_m2_ha | 96 | zskew + ipcumzq90 + p_hmin + TreeSup30_density | 73.7 | 70.1 | 19.8 | 7.9 |
D_mean_cm | 96 | zq70 + ipcumzq70 + Tree_density + altitude | 82.5 | 84.7 | 15.2 | 3.8 |
N_ha | 96 | zentropy + zq95 + p_1st_hmin + TreeSup10_density | 90.7 | 88.3 | 18.9 | 153.0 |
Stratified models
Motivation
When calibrating a statistical relationship between forest stand parameters, which are usually derived from diameter measurements, and ALS metrics, one relies on the hypothesis that the interaction of laser pulses with the leaves and branches structure is constant on the whole area. However, differences can be expected either due to variations in acquisition settings (flight parameters, scanner model), in forests (stand structure and composition) or in topography (slope). Better models might be obtained when calibrating stratum-specific relationships, provided each stratum is more homogeneous regarding the laser interaction with the vegetation. A trade-off has to be achieved between the within-strata homogeneity and the number of available plots for calibration in each stratum. A minimum number of plots is approximately 50, while 100 would be recommended. In this example we hypothesize that ownership reflects both structure and composition differences in forest stands.
Calibration of stratum-specific models
Stratum-specific models are computed and stored in a list during a
for
loop. The function
lidaRtRee::aba_combine_strata()
then combines the list of
models corresponding to each stratum to compute aggregated statistics
for all plots, making it easier to compare stratified with
non-stratified models.
# stratification variable
strat <- "stratum"
# create list of models
model_aba_stratified <- list()
# calibrate each stratum model
for (i in levels(plots[, strat]))
{
subsample <- which(plots[, strat] == i)
if (length(subsample) > 0) {
model_aba_stratified[[i]] <- lidaRtRee::aba_build_model(
plots[subsample, variable],metrics[subsample, ], transform = "boxcox",
nmax = 4, xy = plots[subsample, c("X", "Y")])
}
}
## Reordering variables and trying again:
# backup list of models for later use
model_aba_stratified_boxcox <- model_aba_stratified
# combine list of models into single object
model_aba_stratified <- lidaRtRee::aba_combine_strata(model_aba_stratified,
plots$plot_id)
# model_aba_stratified$stats
n | metrics | adj-R2.% | CV-R2.% | CV-RMSE.% | CV-RMSE | |
---|---|---|---|---|---|---|
NOT.STRATIFIED | 96 | zskew + ipcumzq90 + p_hmin + TreeSup30_density | 73.7 | 70.1 | 19.8 | 7.9 |
private | 32 | zpcum8 + ikurt + TreeInf10_density | 59.9 | 51.6 | 19.4 | 10.4 |
public | 64 | p_1st_hmin + Tree_meanH + Tree_density + altitude | 60.8 | 54.4 | 16.6 | 5.5 |
COMBINED | 96 | NA | NA | 73.1 | 18.7 | 7.5 |
Stratified models with stratum-specific variable tranformations
In case one wants to apply different variable transformations, or use different subsets of ALS metrics depending on the strata, the following example can be used. First models using only the point cloud metrics are calibrated without transformation of the data. The statistics for all plots are then calculated by combining the following stratum-specific models :
- public ownership, all metrics, Box-Cox transformation of basal area values (calibrated in the previous paragraph),
- private ownership, only point cloud metrics, no data transformation.
# create list of models for no transformation
model_aba_stratified.none <- list()
# calibrate each stratum model
for (i in levels(plots[, strat]))
{
subsample <- which(plots[, strat] == i)
if (length(subsample) > 0) {
model_aba_stratified.none[[i]] <-
lidaRtRee::aba_build_model(plots[subsample, variable],
metrics_points[subsample, ],
transform = "none",
xy = plots[subsample, c("X", "Y")])
}
}
# combine list of models into single object
model_aba_stratified_mixed <-
lidaRtRee::aba_combine_strata(
list(private = model_aba_stratified.none[["private"]],
public = model_aba_stratified_boxcox[["public"]]), plots$plot_id)
# bind model stats in a data.frame for comparison
model_stats <- rbind(model_aba$stats, model_aba_stratified_mixed$stats)
row.names(model_stats)[1] <- "NOT.STRATIFIED"
n | metrics | transform | adj-R2.% | CV-R2.% | CV-RMSE.% | CV-RMSE | |
---|---|---|---|---|---|---|---|
NOT.STRATIFIED | 96 | zskew + ipcumzq90 + p_hmin + TreeSup30_density | boxcox | 73.7 | 70.1 | 19.8 | 7.9 |
private | 32 | zentropy + zpcum9 | none | 55.0 | 51.2 | 19.5 | 10.5 |
public | 64 | p_1st_hmin + Tree_meanH + Tree_density + altitude | boxcox | 60.8 | 54.4 | 16.6 | 5.5 |
COMBINED | 96 | NA | NA | NA | 73.0 | 18.8 | 7.6 |
Save data before next tutorial
The following lines save the data required for the area-based mapping step.
save(model_aba_stratified_mixed, model_aba, aba_point_metrics_fun, aba_res_chm,
class_points, h_points,
file = "./data/aba.model/output/models.rda"
)