modTools: simple tools for simpler modelling

Featured

This website provides a series of free software tools, mostly written in R, that can be used for various tasks related to the analysis of species’ distribution and ecology (although many of them can be applied to other purposes). Most of them are now being included in R packages — please cite the appropriate package (mentioned in the blog post featuring each function) if you use these tools, which are released under a Creative Commons Attribution license. Please note that these tools are experimental (like R, they come with absolutely no warranty) and occasionally edited with corrections or improvements, so preferably check back for the latest version before you use a function.

Leave a comment here if you use these functions in publications — this way I can cite you in the blog, track how people are using the tools, and justify the time and effort dedicated to them. Thanks are due to Arpat Ozgul and François Guilhaumon, who helped me through the steep part of the R learning curve (to me it was actually more like an overhang). Comments and suggestions are very welcome, although a quick response cannot be guaranteed!

Plot outliers and their values

The ‘plot_outliers‘ function below draws a boxplot and a scatterplot of a numeric variable x and plots the values of the outliers (currently not offset, even if they overlap). For relatively small datasets, it can be a quick way to identify which outliers look reasonable and which are likely a result of transcription or measurement error, and thus should be either corrected or discarded.

plot_outliers <- function(x, val_col = "blue", ...) { 
  par_in <- par(no.readonly = TRUE) 
  par(mfrow = c(1, 2)) 
  bp <- boxplot(x, ...) 
  out <- bp$out 
  message(length(out), " outliers detected") 
  if (length(out) > 0) text(x = 0.5, y = bp$out, labels = round(out, 2), adj = 0, col = val_col)
  plot(x, pch = 20)
  if (length(out) > 0) text(x = 0.5, y = bp$out, labels = round(out, 2), adj = 0, col = val_col)
  par(par_in)
}

Usage examples:

plot_outliers(iris$Sepal.Width)

Additional arguments for the ‘boxplot‘ function can be provided, e.g.

plot_outliers(airquality$Ozone, notch = TRUE)
plot_outliers(airquality$Wind, col = "darkgreen", main = "wind")

This function is used in an article which we hope to submit soon.

fuzzySim updated to 3.0 on CRAN!

The newest version of R package fuzzySim (3.0) is now on CRAN! It includes new functions such as ‘sharedFav‘, favClass‘, ‘bioThreat‘ and ‘gridRecords; improvements to some functions, help files and examples; updated e-mail and citation information [ see citation(“fuzzySim”) ]; clarifications and typo corrections along the reference manual; and some bug fixes (after changes to base R and/or to function dependencies), e.g. to ‘getPreds‘ when applied to raster objects. You should now uninstall the old version of the package and install the new one:

remove.packages("fuzzySim")
install.packages("fuzzySim")

Among other new functionalities, fuzzySim now makes it easier to use variable selection and presence-absence modelling techniques on occurrence points + raster variables, as these are becoming the more common data formats in species distribution modelling (SDM) and ecological niche modelling (ENM). Here’s a worked example:

# download and plot predictor variables:
# (make sure their spatial resolution is not finer than the occurrence data)
library(raster)
worldclim <- getData("worldclim", var = "bio", res = 10)
plot(worldclim)
plot(worldclim[[1]])

# download and plot species occurrence data:
library(rgbif)
gbif <- occ_data(scientificName = "Galemys pyrenaicus", hasCoordinate = TRUE, limit = 5000)
# GBIF data often have errors, data cleaning (e.g. with package 'scrubr') should be performed!
# here we'll just remove records of absence or zero-abundance:
absence_rows <- which(gbif$data$occurrenceStatus == "absent" | gbif$data$organismQuantity == 0)
if (length(absence_rows) > 0) gbif$data <- gbif$data[-absence_rows, ]
presences <- gbif$data[ , c("decimalLongitude", "decimalLatitude")]
# plot the occurrence records on the map:
points(presences)

# crop variables to the extent of the occurrence data:
worldclim_crop <- crop(worldclim, extent(range(presences$decimalLongitude), range(presences$decimalLatitude)))
plot(worldclim_crop[[1]])
points(presences)

# model occurrence data as presence-absence in the cropped grid of pixels:
library(fuzzySim)

# first, get the centroid coordinates and variable values at pixels with and without presence records
# (remember, raster resolution should not be finer than occurrence data!)
gridded_presences <- gridRecords(rst = worldclim_crop, pres.coords = presences)
head(gridded_presences)
plot(worldclim_crop[[1]])
points(gridded_presences[gridded_presences$presence == 0, c("x", "y")], col = "red")
points(gridded_presences[gridded_presences$presence == 1, c("x", "y")], col = "blue", pch = 20)
# then, build a GLM with variable selection on these presence-absence data:
names(gridded_presences)
model_GLM <- multGLM(data = gridded_presences, sp.cols = "presence", var.cols = 5:23, id.col = "cells", FDR = TRUE, corSelect = TRUE, step = TRUE, trim = TRUE)
summary(model_GLM$models$presence)
head(model_GLM$predictions)

# finally, get and plot the model predictions (probability and favourability):
pred_GLM_raster <- getPreds(data = stack(worldclim_crop), models = model_GLM$models)
plot(pred_GLM_raster)

If you’re worried about using a presence-absence modelling algorithm on ‘presence-only’ records, you can compare these predictions with those of a widely used presence-background algorithm (Maxent) on the same data, to check that they are not far off:

library(maxnet)

# build a maxent model with all variables:
model_maxent_allvars <- maxnet(p = gridded_presences[ , "presence"], data = gridded_presences[ , 5:23], f = maxnet.formula(p = gridded_presences[ , "presence"], data = gridded_presences[ , 5:23], classes = "lq"))  # linear + quadratic features

# build a maxent model with only the variables selected by multGLM (folowing FDR, correlation, AIC, significance):
selected_vars <- model_GLM$variables$presence
model_maxent_selvars <- maxnet(p = gridded_presences[ , "presence"], data = gridded_presences[ , selected_vars], f = maxnet.formula(p = gridded_presences[ , "presence"], data = gridded_presences[ , selected_vars], classes = "lq"))

# compute and map maxent predictions:
pred_maxent_allvars_raster <- raster::predict(worldclim_crop, model_maxent_allvars, type = "cloglog", clamp = TRUE)
pred_maxent_selvars_raster <- raster::predict(worldclim_crop, model_maxent_selvars, type = "cloglog", clamp = TRUE)
par(mfrow = c(1, 2))
plot(pred_maxent_allvars_raster, main = "Maxent all vars")
plot(pred_maxent_selvars_raster, main = "Maxent selected vars")

This and other modelling methods will be taught next week at the CIBIO advanced course on ecological niche modelling. Check out the Training page for other upcoming courses!

Grid point occurrence records onto a raster

The ‘gridRecords‘ function, which has just been added to the ‘fuzzySim‘ package (from version 2.6 on), takes a raster stack and a set of spatial coordinates of a species’ presence (and optionally absence) records, and returns a data frame with the presences and absences, as well as the corresponding values of the rasters in the grid of pixels (cells). If absence coordinates are not supplied, all pixels without any presence point will be returned as absences. A precursor of this function was used in Báez et al. (2020) for getting unique presences and absences from point occurrence data at the spatial resolution of marine raster variables. The function can be especially useful for using point + raster data to compute presence-absence models with packages or functions that normally require data frames as input, such as glm, multGLM, gam or randomForest.

gridRecords <- function(rst,
                        pres.coords,
                        abs.coords = NULL,
                        na.rm = TRUE) {
  
  # version 2.0 (3 Feb 2020)
  
  if (!requireNamespace("raster")) stop("This function requires installing the 'raster' package first.")
  
  if (is.null(abs.coords)) {
    abs.coords <- raster::coordinates(rst)
  }

  p_extract <- raster::extract(rst, pres.coords, cellnumbers = TRUE, df = TRUE)[ , -1]
  a_extract <- raster::extract(rst, abs.coords, cellnumbers = TRUE, df = TRUE)[ , -1]
  
  p_extract <- unique(p_extract)
  a_extract <- unique(a_extract)
  
  a_extract <- a_extract[!(a_extract$cells %in% p_extract$cells), ]
  
  p_centroids <- raster::xyFromCell(rst, p_extract$cells)
  a_centroids <- raster::xyFromCell(rst, a_extract$cells)
  
  p_extract <- data.frame(presence = 1, p_centroids, p_extract)
  if (nrow(a_extract) > 0) {
    a_extract <- data.frame(presence = 0, a_centroids, a_extract)
  }
  
  result <- rbind(p_extract, a_extract)
  
  if (na.rm) {
    result_NA <- which(apply(result[ , 5:ncol(result)], MARGIN = 1, FUN = function(x) all(is.na(x))))
    if (length(result_NA) > 0) {
      result <- result[-result_NA, ]
    }
  }
  
  return(result)
}

Usage example:

library(raster)
library(fuzzySim)  # >= 2.6

# import a system raster with 3 layers and crop it to a smaller extent:
rst <- stack(system.file("external/rlogo.grd", package = "raster"))
ext <- extent(c(0, 15, 25, 40))
rst <- crop(rst, ext)
plot(rst)
plot(rst[[1]])

# generate some random presence and absence points:
set.seed(123)
presences <- sp::spsample(as(ext, "SpatialPolygons"), 50, type = "random")
absences <- sp::spsample(as(ext, "SpatialPolygons"), 50, type = "random")
points(presences, pch = 20, cex = 0.2, col = "black")
points(absences, pch = 20, cex = 0.2, col = "white")

# use 'gridRecords' on these random points:
gridded_pts <- gridRecords(rst, coordinates(presences), coordinates(absences))
head(gridded_pts)  # 'red', 'green' and 'blue' are the names of the layers in 'rst'

# plot them to check the result:
pres_coords <- gridded_pts[gridded_pts$presence == 1, c("x", "y")]
abs_coords <- gridded_pts[gridded_pts$presence == 0, c("x", "y")]
points(gridded_pts[ , c("x", "y")], pch = 4, cex = 0.6, col = gridded_pts$presence)
# you can also do it with only presence (no absence) records:
gridded_pres <- gridRecords(rst, coordinates(presences))
head(gridded_pres)
plot(rst[[1]])
points(presences, pch = 20, cex = 0.2, col = "black")
pres_coords <- gridded_pres[gridded_pres$presence == 1, c("x", "y")]
abs_coords <- gridded_pres[gridded_pres$presence == 0, c("x", "y")]
points(gridded_pres[ , c("x", "y")], pch = 4, cex = 0.6, col = gridded_pres$presence)

References
Baez J.C., Barbosa A.M., Pascual P., Ramos M.L. & Abascal F. (2020) Ensemble modelling of the potential distribution of the whale shark in the Atlantic Ocean. Ecology and Evolution, 10: 175-184

modEvA 2.0 now on CRAN!

The new version of modEvA (2.0) is now on CRAN! It can produce some new plots, such as the precision-recall curve and the histograms/densities of predicted values for presence and absence observations. It calculates some additional evaluation measures, such as the area under the precision-recall curve (function ‘AUC‘), mean precision (also function ‘AUC‘) and the F1 score (function ‘threshMeasures‘). These measures are now also among the outputs of functions ‘multModEv‘, ‘optiThresh‘ and ‘optiPair‘. I’ve also fixed some bugs that were present in the previous version on CRAN, such as an annoying error when using function ‘varPart‘ with only two factors. You should now uninstall the old version of the package and install the new one:

remove.packages("modEvA")
install.packages("modEvA")

Area under the precision-recall curve

The AUC function, in the modEvA package, initially computed only the area under the receiver operating characteristic (ROC) curve. Now, since modEvA version 1.7 (currently available on R-Forge), it also offers the option to compute the precision-recall curve, which may be better for comparing models based on imbalanced data (e.g. for rare species) — see e.g. Sofaer et al. (2019).

Usage example:

library(modEvA)
mod <- rotif.mods$models[["Ktropi"]]
par(mfrow = c(1, 2))
AUC(mod, main = "ROC curve")
AUC(mod, curve = "PR", main = "Precision-recall curve")

References

Sofaer, H.R., Hoeting, J.A. & Jarnevich, C.S. (2019). The area under the precision-recall curve as a performance metric for rare binary events. Methods in Ecology and Evolution, 10: 565-577

Plot the density of predicted values for presences and absences

The ‘predDensity’ function, included in the modEvA package since version 1.5 (currently available on R-Forge), produces a histogram and/or a kernel density plot of predicted values for observed presences and absences in a binomial GLM:

predDensity <- function (model = NULL, obs = NULL, pred = NULL, separate = TRUE, 
  type = c("both"), legend.pos = "topright") 
{
  if (!is.null(model)) {
    if (!("glm" %in% class(model)) || family(model)$family != 
      "binomial") 
      stop("'model' must be of class 'glm' and family 'binomial'.")
    if (!is.null(obs)) 
      message("Argument 'obs' ignored in favour of 'model'.")
    if (!is.null(pred)) 
      message("Argument 'pred' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  }
  if (is.null(obs)) {
    if (is.null(pred)) 
      stop("You must provide either 'model' or 'pred'.")
    separate <- FALSE
    obs <- sample(c(0, 1), length(pred), replace = TRUE)
  }
  else {
    if (length(obs) != length(pred)) 
      stop("'obs' and 'pred' must have the same length.")
  }
  pred0 <- pred[obs == 0]
  pred1 <- pred[obs == 1]
  type <- match.arg(type, c("histogram", "density", "both"))
  rslt <- vector("list")
  if (type %in% c("density", "both")) {
    if (!separate) {
      dens <- density(pred)
      xrange <- range(dens$x, finite = TRUE)
      yrange <- range(dens$y, finite = TRUE)
      rslt[["density"]] <- dens
    }
    else {
      dens0 <- density(pred0)
      dens1 <- density(pred1)
      xrange <- range(dens0$x, dens1$x, finite = TRUE)
      yrange <- range(dens0$y, dens1$y, finite = TRUE)
      rslt[["density_obs1"]] <- dens1
      rslt[["density_obs0"]] <- dens0
    }
    plot(x = xrange, y = yrange, xlab = "Predicted value", 
      ylab = "Density", type = "n")
  }
  if (type %in% c("histogram", "both")) {
    hist0 <- hist(pred0, plot = FALSE)
    hist1 <- hist(pred1, plot = FALSE)
    if (type == "histogram") {
      yrange <- range(hist0$density, hist1$density, finite = TRUE)
      plot(x = c(0, 1), y = yrange, type = "n", xlab = "Predicted value", 
        ylab = "Density")
    }
    if (!separate) {
      histogram <- hist(c(pred0, pred1), freq = FALSE, 
        col = "grey20", add = TRUE)
      rslt[["histogram"]] <- histogram
    }
    else {
      hist(pred1, freq = FALSE, col = "grey20", add = TRUE)
      hist(pred0, freq = FALSE, col = "darkgrey", density = 40, 
        angle = 45, add = TRUE)
      rslt[["histogram_obs1"]] <- hist1
      rslt[["histogram_obs0"]] <- hist0
      if (legend.pos != "n" && type == "histogram") 
        legend(legend.pos, legend = c("absences", "presences"), 
          fill = c("darkgrey", "grey20"), border = NA, 
          density = c(40, NA), bty = "n")
    }
  }
  if (type %in% c("density", "both")) {
    if (!separate) {
      lines(dens, col = "black", lwd = 2)
    }
    else {
      lines(dens1, col = "black", lwd = 2)
      lines(dens0, col = "darkgrey", lty = 5, lwd = 2)
      if (legend.pos != "n" && type == "density") 
        legend(legend.pos, legend = c("absences", "presences"), 
          col = c("darkgrey", "black"), lty = c(5, 1), 
          bty = "n")
      if (legend.pos != "n" && type == "both") 
        legend(legend.pos, legend = c("absences", "presences"), 
          fill = c("darkgrey", "grey20"), border = NA, 
          lty = c(5, 1), col = c("darkgrey", "grey15"), 
          density = c(40, NA), bty = "n")
    }
  }
  return(rslt)
}

Usage example:

install.packages("modEvA", repos="http://R-Forge.R-project.org")
library(modEvA)
data(rotif.mods)
predDensity(model = rotif.mods$models[[3]])