modTools: simple tools for simpler modelling

Featured

This website provides a series of free software tools, mostly written in R, that can be used for various tasks related to the analysis of species’ distribution and ecology (although many of them can be applied to other purposes). Most of them are now being included in R packages — please cite the appropriate package (mentioned in the blog post featuring each function) if you use these tools, which are released under a Creative Commons Attribution license. Please note that these tools are experimental (like R, they come with absolutely no warranty) and occasionally edited with corrections or improvements, so preferably check back for the latest version before you use a function.

Leave a comment here if you use these functions in publications — this way I can cite you in the blog, track how people are using the tools, and justify the time and effort dedicated to them. Thanks are due to Arpat Ozgul and François Guilhaumon, who helped me through the steep part of the R learning curve (to me it was actually more like an overhang). Comments and suggestions are very welcome, although a quick response cannot be guaranteed!

Advertisements

New features in fuzzySim functions

Just two quick notes on recent improvements to functions in the fuzzySim package:

– Thanks to a bug report by José Carlos Guerrero, the multTSA function now also works when there are not multiple but only a single species to analyse;

– Thanks to a feature request and a bit of assistance by Alba Estrada, the getPreds function now also works if the data argument is a RasterStack rather than a data frame.

Check out the new versions of these functions in the latest version of fuzzySim!

Find pairs of highly correlated variables

Correlations among variables are problematic in multivariate models, and good practice commonly recommends eliminating a priori one from each pair of highly correlated variables. The corPairs function below takes a matrix of correlations among variables, plots a cluster dendrogram based on them and returns the pairs of variables with absolute correlations above a given threshold (by default, 0.8):

corPairs <- function(cor.mat, cor.thresh = 0.8, ...) {
  # A.M. Barbosa, version 1.0 (22 Feb 2018)
  dist_matrix <- as.dist(1 - abs(cor.mat))
  plot(hclust(dist_matrix, ...), ylab = "1 - absolute correlation", xlab = "", sub = "")
  abline(h = 1 - cor.thresh, col = "red")
  cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA
  high.cor.mat <- numeric(0)
  if (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
    high.cor.rowcol <- as.matrix(which(abs(cor.mat) >= cor.thresh, arr.ind = TRUE))
    high.cor.inds <- sort(unique(as.vector(high.cor.rowcol)))
    high.cor.mat <- data.frame(var1 = rownames(cor.mat)[high.cor.rowcol[ , "row"]], var2 = colnames(cor.mat)[high.cor.rowcol[ , "col"]])
    high.cor.mat$corr <- NULL
    for (r in 1:nrow(high.cor.mat)) 
      high.cor.mat$corr[r] <- cor.mat[high.cor.rowcol[ ,"row"][r], high.cor.rowcol[ ,"col"][r]]
    high.cor.mat <- high.cor.mat[order(abs(high.cor.mat$corr), decreasing = TRUE), ]
    return (high.cor.mat)
  } # end if max
  else return("No correlations exceed the threshold.")
} # end corPairs function

This function is closely related to (and will soon be included in) the corSelect function in the fuzzySim package (Barbosa, 2015), but operates on correlation matrices rather than the raw variables, and adds a cluster dendrogram.

 

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853–858

Potential biodiversity and relative favourability

The two functions below allow calculating potential biodiversity and relative favourability, respectively, according to the procedures described by Real et al. (2017) — which should be cited if you use any of them — and also used by Estrada et al. (2018):

potentialBiodiv <- function(models = NULL, # a list of binomial glm model objects
data = NULL, # alternatively, data frame with observations and predictions
sp.cols = NULL, # index numbers of the columns with presence-absence data
prob.cols = NULL, # index numbers of the columns with probabilities
fav.cols = NULL, # alternatively, index numbers of columns with favourabilities
id.col = NULL, # optionally, index number of a column with row identifiers
zero.action = "increase" # what to do with zero favourability values when calculating the geometric mean; alternatives are "exclude" or "maintain"
) {
if (!(zero.action %in% c("increase", "exclude", "maintain"))) stop ("Invalid 'zero.action'.")
if (!is.null(models)) {
if (!all("glm" %in% sapply(models, class))) stop("All 'models' must be of class 'glm'.")
if (!all("binomial" %in% sapply(models, family))) stop("All 'models' must be of family 'binomial'.")
obs.data <- as.data.frame(lapply(models, `[[`, "y"))
prob.data <- as.data.frame(lapply(models, `[[`, "fitted.values"))
} else {
if (is.null(data) || is.null(sp.cols) || (is.null(prob.cols) && is.null(fav.cols))) stop ("You need to provide either 'models', or 'data' + 'sp.cols' + either 'prob.cols' or 'fav.cols'.")
obs.data <- data[ , sp.cols, drop = FALSE]
if (!is.null(prob.cols)) prob.data <- data[ , prob.cols, drop = FALSE]
} # end if !null models else
Nobs <- nrow(obs.data)
Nsp <- ncol(obs.data)
if (is.null(fav.cols)) {
favs <- matrix(NA, ncol = Nsp, nrow = Nobs)
for (s in 1:Nsp) {
#favs[ , s] <- fuzzySim::Fav(obs = obs.data[ , s], pred = prob.data[ , s])
n1 <- sum(obs.data[ , s] == 1, na.rm = TRUE)
n0 <- sum(obs.data[ , s] == 0, na.rm = TRUE)
prob <- prob.data[ , s]
favs[ , s] <- (prob/(1 - prob))/((n1/n0) + (prob/(1 - prob)))
}; rm(s)
} else { # if !null fav.cols
favs <- data[ , fav.cols, drop = FALSE]
} # end if null fav else
if (zero.action == "increase") favs[favs == 0] <- 2.2e-16
#.Machine$double.xmin # smallest computable positive number
else if (zero.action == "exclude") favs[favs == 0] <- NA
F_lns <- log(favs, base = exp(1))
F_ln_sums <- rowSums(F_lns) # eliminated 'na.rm = na.rm'
result <- data.frame(SpRich = rowSums(obs.data),
Fmin = apply(favs, 1, min),
Fmax = apply(favs, 1, max),
Fsum = rowSums(favs),
Fmean = rowMeans(favs),
Fgmean = exp((1 / Nsp) * F_ln_sums))
if (!is.null(data) && !is.null(id.col))
result <- data.frame(data[ , id.col, drop = FALSE], result)
result
}

 

relativeFav <- function(target, # matrix or data frame of favourablities in target period
ref) { # idem for reference period, with columns in the same order
stopifnot(ncol(target) == ncol(ref))
N <- ncol(target)
fav_ratios <- target / ref
logs <- log(fav_ratios, base = exp(1))
log_sums <- rowSums(logs)
exp((1 / N) * log_sums)
}

 

REFERENCES

Estrada A., Barbosa A.M., & Real R. (2018) Changes in potential mammal diversity in National Parks and their implications for conservation. Current Zoology, https://doi.org/10.1093/cz/zoy001

Real R., Barbosa A.M., & Bull J.W. (2017) Species Distributions, Quantum Theory, and the Enhancement of Biodiversity Measures. Systematic Biology, 66: 453-462

Trend surface analysis for multiple species

Trend surface analysis is a way to model the spatial structure in species distributions by regressing occurrence data on the spatial coordinates x and y, for a linear trend, and/or on polynomial terms of these coordinates (including the sums of powers and cross-products of the coordinates) for curvilinear trends (Legendre & Legendre, 1998; Borcard et al., 2011). Second- and third-degree polynomials are often used. The multTSA function below, which is included in the fuzzySim package, allows specifying the degree of the spatial polynomial to use. By default, it uses a 3rd-degree polynomial, performs backward stepwise AIC selection of the polynomial terms to include, and returns the presence probability value resulting from the spatial trend, although these options can be changed — for example, if you want the result to be a new spatial variable in the same scale of the input coordinates, you can set P = FALSE. This function can be used for one or multiple species at a time, in a similar way as the multGLM function in the same package.

multTSA <- function (data, sp.cols, coord.cols, id.col = NULL, degree = 3, step = TRUE, direction = "backward", P = TRUE, Favourability = FALSE, suffix = "_TS", save.models = FALSE) {
  # updated May 5 2018

  start.time <- Sys.time()

  coords.poly <- as.data.frame(poly(as.matrix(data[, coord.cols]), 
                                    degree = degree, raw = TRUE))
  n.poly.terms <- ncol(coords.poly)
  colnames(coords.poly) <- paste0(rep("poly", n.poly.terms), 
                                  c(1:n.poly.terms))
  sp.data <- as.matrix(data[, sp.cols])
  colnames(sp.data) <- colnames(data[, sp.cols, drop = FALSE])
  n.subjects <- length(sp.cols)
  if (save.models) 
    TSA.models <- vector("list", n.subjects)
  subj.count <- 0
  data.input <- data
  data <- cbind(data, coords.poly)
  for (s in 1:n.subjects) {
    subj.count <- subj.count + 1
    subj.name <- colnames(sp.data)[s]
    message("Computing TSA ", subj.count, " of ", n.subjects, 
            " (", subj.name, ") ...")
    model.formula <- as.formula(paste(subj.name, "~", paste(colnames(coords.poly), collapse = "+")))
    model.expr <- expression(with(data, glm(model.formula, family = binomial, data = data)))
    if (step) 
      model <- step(eval(model.expr), direction = direction, trace = 0)
    else model <- eval(model.expr)
    if (P || Favourability)  pred <- predict(model, coords.poly, type = "response")
    else  pred <- predict(model, coords.poly)
    if (Favourability) {
      pred <- Fav(obs = sp.data[, s], pred = pred)
    }
    data[, ncol(data) + 1] <- pred
    colnames(data)[ncol(data)] <- paste0(subj.name, suffix)
    if (save.models) {
      TSA.models[[subj.count]] <- model
      names(TSA.models)[[subj.count]] <- subj.name
    }
  }
  predictions <- data.frame(data[, id.col], data[, ((ncol(data.input) + 
                                                       1 + n.poly.terms):ncol(data)), drop = FALSE])
  if (!is.null(id.col)) {
    if (is.character(id.col)) 
      colnames(predictions)[1] <- id.col
    else colnames(predictions)[1] <- colnames(data)[id.col]
  }
  message("Finished!")
  timer(start.time)
  if (save.models) 
    return(list(predictions = data.frame(predictions), TSA.models = TSA.models))
  else return(predictions)
}

 

References

Borcard D., Gillet F. & Legendre P. (2011) Numerical Ecology with R. Springer, New York.

Legendre P. & Legendre L. (1998) Numerical Ecology. Elsevier, Amsterdam.

Recent updates to previous functions

Just a quick note to let you know of changes introduced to some functions in recent versions of their packages:

  • multGLM, FDR and corSelect (package fuzzySim) now also include BIC (Bayesian Information Criterion, also known as Schwarz criterion, SBC or SBIC) for variable selection, making it more parsimonious especially for large sample sizes
  • multGLM (package fuzzySim) now allows choosing which FDR correction to apply (if FDR is set to TRUE). It uses “fdr(=”BH”) by default, but check p.adjust for other available methods, namely “BY” which allows for non-independent data
  • varPart (package modEvA) now does not ask for model.type, includes in the output diagram the proportion of unexplained variance, and allows defining a title (main) for the diagram
  • several modEvA functions now automatically omit NAs, and emit a warning instead of an error if there are pred values outside the [0,1] interval (which some models include in the output)

The help files of these functions have also been updated with further details. It is advisable to unsinstall fuzzySim and modEvA and install their latest versions.

Range change based on continuous (fuzzy) values

This function, included in the fuzzySim package (version 1.6 just released), uses the fuzzyOverlay function in the previous post and then quantifies overall range change (including gain, loss, maintenance/stability and total change) based on the continuous predictions of two species distribution models (see Gutiérrez-Rodríguez et al., in press).

fuzzyRangeChange <- function(pred1, pred2, number = TRUE, prop = TRUE, na.rm = TRUE, round.digits = 2, measures = c("Gain", "Loss", "Stable_presence", "Stable_absence", "Balance"), plot = TRUE, col = colorRampPalette(c("white", "black"))(length(measures)), ...) {
 
  # version 1.4 (23 Mar 2016)
 
  stopifnot(ncol(pred1) == ncol(pred2),
            all(pred1[is.finite(pred1)] >= 0 && pred1[is.finite(pred1)] <= 1),
            all(pred2[is.finite(pred2)] >= 0 && pred2[is.finite(pred2)] <= 1)
  )
 
  if (!number & !prop) stop ("Nothing to calculate if both 'number' and 'prop' are FALSE.")
 
  values <- vector("numeric", length(measures))
  names(values) <- measures
  if ("Gain" %in% measures)  values["Gain"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "expansion", na.rm = na.rm), na.rm = na.rm)
  if ("Loss" %in% measures)  values["Loss"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "contraction", na.rm = na.rm), na.rm = na.rm)
  if ("Stable_presence" %in% measures)  values["Stable_presence"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "maintenance", na.rm = na.rm), na.rm = na.rm)
  if ("Stable_absence" %in% measures)  values["Stable_absence"] <- sum(fuzzyOverlay(1 - data.frame(pred1, pred2), op = "maintenance", na.rm = na.rm), na.rm = na.rm)
  if ("Balance" %in% measures)  values["Balance"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "change", na.rm = na.rm), na.rm = na.rm)
 
  result <- data.frame(Measure = measures, Number = values)
 
  if (prop) {
    if (na.rm) n <- length(na.omit(pred1))
    else n <- length(pred1)
    range.size <- sum(pred1, na.rm = na.rm)
    stable.abs <- result[result$Measure == "Stable_absence", "Number"]
    result$Proportion <- result[ , "Number"] / range.size
    result[result$Measure == "Stable_absence", "Proportion"] <- stable.abs / (n - range.size)
  }
 
  if (!number) {
    result <- result[ , - (ncol(result) - 1)]
  }
 
  if (plot) {
    barplot(result[ , ncol(result)], legend.text = rownames(result), col = col, ...)
    abline(h = 0)
  }
 
  result[ , "Measure"] <- NULL
  if (is.finite(round.digits))  result <- round(result, round.digits)
 
  result
}

[presented with Pretty R]

The function will produce numeric and (since FuzzySim 1.7.2) also graphical results quantifying the fuzzy overall changes:

FuzzyRangeChange

You can install and load fuzzySim (>= 1.6) and then check help(fuzzyRangeChange) for further  info and reproducible usage examples.

 

REFERENCES

Gutiérrez-Rodríguez J., Barbosa A.M. & Martínez-Solano I. (in press) Present and past climatic effects on the current distribution and genetic diversity of the Iberian spadefoot toad (Pelobates cultripes): an integrative approach. Journal of Biogeography.