Model evaluation with presence points and raster predictions

The Boyce index (Boyce et al. 2002) is often described as a presence-only metric for evaluating the predictions of species distribution (or ecological niche, or habitat suitability) models (e.g. Hirzel et al. 2006, Cianfrani et al. 2010, Bellard et al. 2013, Valavi et al. 2022). It measures the predicted-to-expected ratio of presences in each class of predicted values, i.e., whether classes / regions with higher predicted values have indeed higher proportions of presences than expected by chance. But, if there’s a proportion of presences in each class, then the rest of the class (i.e. the complementary proportion) must be… non-presences, which are equally necessary for the computation of this index (as for virtually any other model evaluation metric). We need both pixels with and pixels without presence records, i.e. both presences and (pseudo)absences, or the Boyce index cannot be computed: try using a raster map with values only in pixels with presence (e.g. my_raster <- mask(my_raster, my_presences)) and see what happens. The same applies to presence-background modelling methods like Maxent and ENFA, which use both the presence and the remaining (i.e. “non-presence”) pixels, and can’t produce proper predictions without non-presence pixels — you can also try this for yourself. So, they need the same data that presence-absence methods need. Whether or not the computation requires an explicit identification of the absences, or just presence points and a map of the entire study area, simply depends on how each method or metric is implemented in each particular software package.

The modEvA‘ R package computes the Boyce index and a range of other metrics that evaluate model predictions against presence/absence or presence/background data, including the area under the ROC curve (AUC), the area under the precision-recall curve, sensitivity, specificity, TSS, Cohen’s kappa, Hosmer-Lemeshow goodness-of-fit, Miller calibration statistics, and several others. For all these metrics, the input can be either 1) a model object of a range of implemented classes, or 2) a pair of numeric vectors with observed presence/absence and the corresponding predicted values, or 3) a set of presence point coordinates and a raster map of the predictions across the model evaluation region. Here some usage examples:

install.packages("modEvA", repos = "http://R-Forge.R-project.org")
library(modEvA)
library(terra)

# import a raster map with a predictor variable:
elev <- rast(system.file("ex/elev.tif", package = "terra"))

# import a species' occurrence data for this area:
gbif <- geodata::sp_occurrence(genus = "Dendrocopos", species = "major", ext = elev)
head(gbif)
unique(gbif$occurrenceStatus)  # remove absences if any!
pres_coords <- gbif[ , c("lon", "lat")]
plot(elev)
points(pres_coords, pch = 20, cex = 0.2)

# get a data frame of the pixels with and without presences:
dat <- fuzzySim::gridRecords(elev, pres_coords)
head(dat)
points(dat[dat$presence == 0, c("x", "y")], col = "red", cex = 0.5)
points(dat[dat$presence == 1, c("x", "y")], col = "blue", cex = 0.5)

# compute a species distribution model (e.g. a GLM) from these data:
mod <- glm(presence ~ elevation, family = binomial, data = dat)
summary(mod)

# get a raster map with the predictions from this model:
pred <- predict(elev, mod, type = "response")
plot(pred, main = "Woodpecker presences and predictions")
points(pres_coords, pch = 20, cex = 0.2)
# compute some model evaluation metrics
# using just these presence coordinates and raster predictions:

par(mfrow = c(2, 2), mar = c(5, 5, 2, 1))

Boyce(obs = pres_coords, pred = pred, main = "Boyce index")

AUC(obs = pres_coords, pred = pred, main = "ROC curve")

threshMeasures(obs = pres_coords, pred = pred, thresh = "maxTSS", measures = c("Sensitivity", "Specificity", "Precision", "TSS", "kappa"), standardize = FALSE, main = "Threshold-based")

AUC(obs = pres_coords, pred = pred, curve = "PR", main = "PR curve")

So, any evaluation metric can be computed using “only presence points”, as long as the raster map of predictions includes both pixels that do and pixels that don’t overlap these points, the latter of which are necessarily taken as available unoccupied pixels by all of these metrics (Boyce index included).

Note you may get slightly different results with modEvA::Boyce() and with ecospat::ecospat.boyce(), because ecospat.boyce removes duplicate classes by default, though both functions allow controlling this with argument ‘rm.dup.classes‘ or ‘rm.duplicate‘, respectively (see help file of either function for details). Bug reports and other feedback welcome on this version of package ‘modEvA‘!

REFERENCES

Boyce, M.S., P.R. Vernier, S.E. Nielsen & F.K.A. Schmiegelow (2002) Evaluating resource selection functions. Ecological Modelling 157: 281-300

Bellard, C., Thuiller, W., Leroy, B., Genovesi, P., Bakkenes, M., & Courchamp, F. (2013) Will climate change promote future invasions? Global Change Biology, 19(12), 3740. https://doi.org/10.1111/GCB.12344

Cianfrani, C., Le Lay, G., Hirzel, A. H., & Loy, A. (2010) Do habitat suitability models reliably predict the recovery areas of threatened species? Journal of Applied Ecology, 47(2), 421–430. https://doi.org/10.1111/J.1365-2664.2010.01781.X

Hirzel, A. H., Le Lay, G., Helfer, V., Randin, C., & Guisan, A. (2006) Evaluating the ability of habitat suitability models to predict species presences. Ecological Modelling, 199(2), 142-152. http://www.sciencedirect.com/science/article/pii/S0304380006002468

Valavi, R., Guillera-Arroita, G., Lahoz-Monfort, J. J., & Elith, J. (2022) Predictive performance of presence-only species distribution models: a benchmark study with reproducible code. Ecological Monographs, 92(1), e01486. https://doi.org/10.1002/ECM.1486

Select among correlated variables based on a given criterion

Correlations among variables are problematic in multivariate models, as they inflate the variance of coefficients and thus may bias the interpretation of the effects of those variables on the response. One of the strategies to circumvent this problem is to eliminate a priori one from each pair of correlated variables, but it is not always straightforward to pick the right variable among these. The corSelect function selects among correlated variables based either on their (stepwise) variance inflation factor (VIF); or on their relationship with the response variable, by building a bivariate model of each individual variable against the response and choosing, among each of two correlated variables, the one with the best p-value, AIC or BIC.

This function is now included in fuzzySim and depends on other functions in the package, such as FDR and multicol. It is also included as a new option within function multGLM in this package, so that multiple models can be built based on uncorrelated variables chosen appropriately for each species.

corSelect <- function(data, sp.cols = NULL, var.cols, cor.thresh = 0.8, select = "p.value", ...) {

# version 1.6 (6 Jul 2017)

if (length(sp.cols) > 1) stop ("Sorry, 'corSelect' is currently implemented for only one 'sp.col' at a time.")

univar.criteria <- c("VIF")
 bivar.criteria <- c("p.value", "AIC", "BIC")

if (!(select %in% c(univar.criteria, bivar.criteria))) stop ("Invalid 'select' criterion.")

if (!is.null(sp.cols) & select %in% bivar.criteria) {
 n.in <- nrow(data)
 data <- data[is.finite(data[ , sp.cols]), ]
 n.out <- nrow(data)
 if (n.out < n.in) warning (n.in - n.out, " observations removed due to missing data in 'sp.cols'; ", n.out, " observations actually evaluated.")
 }

cor.mat <- cor(data[ , var.cols], ...)
 cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA
 high.cor.mat <- bivar.mat <- numeric(0)

if (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
 high.cor.rowcol <- as.matrix(which(abs(cor.mat) >= cor.thresh, arr.ind = TRUE))
 high.cor.inds <- sort(unique(as.vector(high.cor.rowcol)))
 high.cor.mat <- data.frame(var1 = rownames(cor.mat)[high.cor.rowcol[ , "row"]], var2 = colnames(cor.mat)[high.cor.rowcol[ , "col"]])
 high.cor.mat$corr <- NULL
 for (r in 1:nrow(high.cor.mat)) high.cor.mat$corr[r] <- cor.mat[high.cor.rowcol[ ,"row"][r], high.cor.rowcol[ ,"col"][r]]
 high.cor.mat <- high.cor.mat[order(abs(high.cor.mat$corr), decreasing = TRUE), ]

if (is.null(sp.cols) & select %in% bivar.criteria) {
 message("Bivariate relationships not assessable without a response variable ('sp.cols'). Returning high pairwise correlations among 'var.cols'.")
 return (high.cor.mat)
 }

high.cor.vars <- unique(rownames(cor.mat[high.cor.inds, high.cor.inds]))

if (select %in% bivar.criteria) {
 bivar.mat <- FDR(data = data, sp.cols = sp.cols, var.cols = match(high.cor.vars, colnames(data)), simplif = TRUE)[ , c("p.value", "AIC", "BIC")]
 if (all.equal(order(bivar.mat[ , c("p.value")]), order(bivar.mat[ , c("AIC")]), order(bivar.mat[ , c("BIC")]))) message("Results identical using whether p-value, AIC or BIC to select variables.\n")
 else message("Results NOT identical using whether p-value, AIC or BIC to select variables.\n")
 } # end if select in bivar

data.remaining <- data[ , var.cols]
 while (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
 max.cor.ind <- which(abs(cor.mat) == max(abs(cor.mat), na.rm = TRUE), arr.ind = TRUE)
 v1 <- rownames(cor.mat)[max.cor.ind[1]]
 v2 <- colnames(cor.mat)[max.cor.ind[2]]
 if (select %in% univar.criteria) {
 vif <- multicol(data.remaining)
 targets <- vif[c(v1, v2), "VIF", drop = FALSE]
 } # end if univar
 else if (select %in% bivar.criteria) {
 targets <- bivar.mat[c(v1, v2), select, drop = FALSE]
 } # end if bivar

exclude <- which.max(as.matrix(targets))
 excl.name <- rownames(targets)[exclude]
 excl.ind <- match(excl.name, colnames(cor.mat))
 data.remaining <- data.remaining[ , -excl.ind, drop = FALSE]
 cor.mat <- cor.mat[-excl.ind, -excl.ind, drop = FALSE]
 if (all(is.na(cor.mat))) cor.mat <- numeric(0)
 if (length(cor.mat) == 0) break
 } # end while max

} # end if max > thresh

selected.vars <- colnames(cor.mat)
 selected.var.cols <- match(colnames(cor.mat), colnames(data))
 excluded.vars <- colnames(data)[var.cols][!(colnames(data)[var.cols] %in% colnames(cor.mat))]

vif2 <- multicol(data[ , selected.var.cols])

list(high.correlations = high.cor.mat,
 bivariate.significance = bivar.mat,
 excluded.vars = excluded.vars,
 selected.vars = selected.vars,
 selected.var.cols = selected.var.cols,
 strongest.remaining.corr = cor.mat[which.max(abs(cor.mat))],
 remaining.multicollinearity = vif2
 )
} # end corSelect function

 

You can install and load fuzzySim (>=version 1.5) and then check help(corSelect) for reproducible usage examples.

Several modEvA functions moved to fuzzySim

Just a heads-up to let you know that the modelling functions formerly included in the modEvA package have been moved to package fuzzySim for operative and editorial reasons. These functions are: multGLM, modelTrim, Fav, getPreds, FDR, percentTestData, integerCols, multConvert and multicol. The rotif.env sample dataset has also moved to fuzzySim. Package modEvA now has dataset rotif.mods and keeps only functions strictly related to model evaluation and analysis, and not to model building.

The new modEvA version (currently 0.9) also includes some bug fixes, including one that made the packaged varPart function calculate the ABCoverlap incorrectly – thanks to Jurica Levatic for spotting this.

Search or replace a text string in all files within a folder

Imagine you have your package skeleton, with its functions in separate files in the ‘R‘ folder and their help files in the ‘man‘ folder. Now imagine you change your mind about the name of a function or argument which may appear in several of these files. It can be painfully dull and error-prone to replace the name manually or file-by-file… The searchStringInFiles function below searches, and optionally replaces, a character string globally within the texts of all files in a given directory (it can also be used to search and replace text in files other than R skeletons).

searchStringInFiles <- function(path, pattern, replacement = NULL, recursive = TRUE, ...) {
  # version 2.0 (17 Apr 2015)
 
  odir <- getwd()
  setwd(path)
  files <- list.files(".", recursive = recursive)
 
  if (is.null(replacement)) {
    for (f in files) {
      if (length(grep(pattern, readLines(f))) > 0) {
        cat("\n", f, "\n", sep = "")
        cat(">>> contains '", pattern, "'!\n", sep = "")
      }  # end if length
    }  # end for f
 
  } else {
 
    for (f in files) {
      in.text <- readLines(f)
      out.text <- gsub(pattern = pattern, replacement = replacement, x = in.text, ...)
      cat(out.text, sep = "\n", file = f)
    }  # end for f
  }  # end else
 
  setwd(odir)
}  # end searchStringInFiles function

[presented with Pretty R]

Load the function and give it your (package) folder path, the character string you want to search for, and (optionally) the replacement for this string:

searchStringInFiles(path = “/home/user/Documents/myPackages/supeR”, pattern = “MYfunction”)

searchStringInFiles(path = “/home/user/Documents/myPackages/supeR”, pattern = “MYfunction”, replacement = “myFunction”)

Mind that global text replacements can get a bit out of control if not used and double-checked carefully. Beware! And backup your package before trying to pull off something like this. (Anyway, this can be undone with the same command after switching the pattern and replacement arguments, I think).

modEvA and fuzzySim packages on R-Forge

Just a quick post to let you know that package modEvA, containing most of the functions posted till today in this blog, and package fuzzySim, containing functions spCodes and splist2presabs and a fuzzy version of binarySimilarity and simMatrix, are finally available on R-Forge and should be accessed there from now on, as that’s where they’ll be updated. You can either install the packages directly with an R command, or download the source and/or binary files from R-Forge. Please refer to the packages pages for more information, and let me know if you bump into problems.

Check your package’s help files for non-ASCII characters

If you’re building an R package (e.g. following this simple tutorial) and have your package skeleton with the ‘man’ folder containing all your help files, it may be useful to check all those files at once for non-ASCII characters (e.g. tildes, dashes) which may not be correctly exported to other computers. The following function aplies the showNonASCIIfile R function to all files in a given folder:

multNonASCIIcheck <- function(path) {
# version 1.0 (4 Apr 2014)
# path: complete folder path to the "man" folder of your package skeleton
  odir <- getwd()
  setwd(path)
  files <- list.files()
  for (f in files) {
    cat("\n", f, "\n", sep = "")
    if(length(tools::showNonASCIIfile(f)) == 0) cat("OK\n")
  }
  setwd(odir)
}

[presented with Pretty R]

Load the function and give it your ‘man’ folder path to get the names of all files and the non-ASCII characters found, e.g.:

multNonASCIIcheck(“/home/user/Documents/myPackages/supeR/man”)

Then you can just open the problematic help files and correct the non-ASCII characters detected.