Recent updates to previous functions

Just a quick note to let you know of changes introduced to some functions in recent versions of their packages:

  • multGLM, FDR and corSelect (package fuzzySim) now also include the BIC (Bayesian Information Criterion, also known as the Schwarz criterion, SBC or SBIC) for variable selection, which makes the selection more parsimonious, especially for large sample sizes
  • multGLM now allows choosing the FDR correction to apply (if FDR is set to TRUE). It uses "fdr" (equivalent to "BH") by default, but check p.adjust for other available methods, namely "BY", which allows for non-independent data (see the short p.adjust comparison after this list)
  • varPart (package modEvA) no longer asks for model.type, now includes the proportion of unexplained variance in the output diagram, and allows defining a title (main) for the diagram
  • several modEvA functions now automatically omit NAs, and emit a warning instead of an error if there are pred values outside the [0,1] interval
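
For reference, here is a minimal comparison of those correction methods using base R's p.adjust (the p-values below are made up purely for illustration):

p <- c(0.001, 0.008, 0.02, 0.04, 0.30)   # hypothetical raw p-values
p.adjust(p, method = "fdr")   # Benjamini-Hochberg, same as method = "BH"
p.adjust(p, method = "BY")    # Benjamini-Yekutieli, more conservative but valid under dependence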

The help files of these functions have also been updated with further details. It is advisable to uninstall fuzzySim and modEvA and install their latest versions.

Range change based on continuous (fuzzy) values

This function, included in the fuzzySim package (version 1.6 just released), uses the fuzzyOverlay function presented in the previous post (reproduced further below) and quantifies overall range change (including gain, loss, maintenance/stability and total change) based on the continuous predictions of two species distribution models (see Gutiérrez-Rodríguez et al., in press).

fuzzyRangeChange <- function(pred1, pred2, number = TRUE, prop = TRUE,
                             na.rm = TRUE, round.digits = 2,
                             measures = c("Gain", "Loss", "Stable_presence", "Stable_absence", "Balance"),
                             plot = TRUE,
                             col = colorRampPalette(c("white", "black"))(length(measures)),
                             ...) {
 
  # version 1.4 (23 Mar 2016)
 
  stopifnot(ncol(pred1) == ncol(pred2),
            all(pred1[is.finite(pred1)] >= 0 & pred1[is.finite(pred1)] <= 1),  # vectorised '&' so all values are checked
            all(pred2[is.finite(pred2)] >= 0 & pred2[is.finite(pred2)] <= 1)
  )
 
  if (!number & !prop) stop ("Nothing to calculate if both 'number' and 'prop' are FALSE.")
 
  values <- vector("numeric", length(measures))
  names(values) <- measures
  if ("Gain" %in% measures)  values["Gain"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "expansion", na.rm = na.rm), na.rm = na.rm)
  if ("Loss" %in% measures)  values["Loss"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "contraction", na.rm = na.rm), na.rm = na.rm)
  if ("Stable_presence" %in% measures)  values["Stable_presence"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "maintenance", na.rm = na.rm), na.rm = na.rm)
  if ("Stable_absence" %in% measures)  values["Stable_absence"] <- sum(fuzzyOverlay(1 - data.frame(pred1, pred2), op = "maintenance", na.rm = na.rm), na.rm = na.rm)
  if ("Balance" %in% measures)  values["Balance"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "change", na.rm = na.rm), na.rm = na.rm)
 
  result <- data.frame(Measure = measures, Number = values)
 
  if (prop) {
    if (na.rm) n <- length(na.omit(pred1))
    else n <- length(pred1)
    range.size <- sum(pred1, na.rm = na.rm)
    stable.abs <- result[result$Measure == "Stable_absence", "Number"]
    result$Proportion <- result[ , "Number"] / range.size
    result[result$Measure == "Stable_absence", "Proportion"] <- stable.abs / (n - range.size)
  }
 
  if (!number) {
    result <- result[ , - (ncol(result) - 1)]
  }
 
  if (plot) {
    barplot(result[ , ncol(result)], legend.text = rownames(result), col = col, ...)
    abline(h = 0)
  }
 
  result[ , "Measure"] <- NULL
  if (is.finite(round.digits))  result <- round(result, round.digits)
 
  result
}


The function produces numeric and (since fuzzySim 1.7.2) also graphical results quantifying the overall fuzzy range changes:

[Figure: example barplot of the fuzzyRangeChange measures]

You can install and load fuzzySim (>= 1.6) and then check help(fuzzyRangeChange) for further info and reproducible usage examples.
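
If you prefer to try the function as pasted above, here is a minimal sketch with simulated favourability values (not real model output; it assumes the fuzzyOverlay function from the previous post, reproduced further below, is also loaded):

set.seed(1)
pred.t1 <- runif(100)   # simulated predictions for period 1
pred.t2 <- plogis(qlogis(pred.t1) + rnorm(100, mean = 0.5))   # shifted predictions for period 2
fuzzyRangeChange(pred.t1, pred.t2)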

 

REFERENCES

Gutiérrez-Rodríguez J., Barbosa A.M. & Martínez-Solano I. (in press) Present and past climatic effects on the current distribution and genetic diversity of the Iberian spadefoot toad (Pelobates cultripes): an integrative approach. Journal of Biogeography.

Model overlay operations based on fuzzy logic

Logical and set operations are useful for comparative distribution modelling, to assess consensus or mismatch between the predictions of different models, and to quantify differences between models obtained for different time periods. Fuzzy set theory (Zadeh 1965, Barbosa & Real 2012) allows performing such operations without first converting the predictions from continuous to binary, thus avoiding the arbitrary thresholds and the over-simplification that this conversion entails. The result is a continuous numerical value quantifying the intersection, union, sum or other operation among the model predictions, whether these are binary or continuous.
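
As a conceptual illustration (a toy sketch, not part of the package), fuzzy intersection and union of two sets of continuous predictions are simply their element-wise minimum and maximum:

A <- c(0.2, 0.7, 0.9)   # predictions from model/period A
B <- c(0.4, 0.6, 0.1)   # predictions from model/period B
pmin(A, B)   # fuzzy intersection (AND): 0.2 0.6 0.1
pmax(A, B)   # fuzzy union (OR): 0.4 0.7 0.9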

The fuzzyOverlay function, used e.g. by Gutiérrez-Rodríguez et al. (in press) and by Reino et al. (in press) and included in the fuzzySim package (>= 1.6), requires a data frame (or a matrix) containing the model prediction columns to compare, an indication of which columns are to be compared (overlay.cols; by default, all columns are included), and an op indicating the operation to perform between them. The op can be:

  • 'consensus' for the arithmetic mean of the predictions (or the fuzzy equivalent of the proportion of models that agree that the species occurs at each site);
  • 'fuzzy_and' or 'intersection' for fuzzy intersection;
  • 'fuzzy_or' or 'union' for fuzzy union;
  • 'prob_and' or 'prob_or' for probabilistic and/or, respectively (see Details);
  • 'maintenance' for the values where all predictions in the same row (rounded to the number of digits specified in the next argument) are the same.

If data has only two columns to compare, you can also calculate:

  • 'xor' for exclusive or;
  • 'AnotB' for the occurrence of the species in column 1 in detriment of that in column 2;
  • 'expansion' for the prediction increase in rows where column 2 has higher values than column 1;
  • 'contraction' for the prediction decrease in rows where column 2 has lower values than column 1;
  • 'change' for a combination of the latter two, with positive values where there was an increase and negative values where there was a decrease in favourability from column 1 to column 2.

fuzzyOverlay <- function(data, 
                         overlay.cols = 1:ncol(data),
                         op = "intersection",
                         na.rm = FALSE,
                         round.digits = 2
                         ) {
 
  # version 1.2 (13 Nov 2015)
 
  data <- data[ , overlay.cols]
  stopifnot(all(data[!is.na(data)] >= 0 & data[!is.na(data)] <= 1))  # vectorised check that all non-NA values are within [0, 1]
 
  if (op == "consensus") rowSums(data, na.rm = na.rm) / ncol(data)
  else if (op %in% c("fuzzy_and", "intersection")) apply(data, MARGIN = 1, FUN = min, na.rm = na.rm)
  else if (op == "prob_and") apply(data, MARGIN = 1, FUN = prod, na.rm = na.rm)
  else if (op %in% c("fuzzy_or", "union")) apply(data, MARGIN = 1, FUN = max, na.rm = na.rm)
  else if (op == "prob_or") apply(data, MARGIN = 1, FUN = sum, na.rm = na.rm) - apply(data, MARGIN = 1, FUN = prod, na.rm = na.rm)
  else if (op == "maintenance") ifelse(round(data[ , 2], digits = round.digits) == round(data[ , 1], digits = round.digits), round(data[ , 1], digits = round.digits), 0)
 
  else if (op %in% c("xor", "AnotB", "expansion", "contraction", "change")) {
    if (ncol(data) != 2) stop ("This 'op' works only for 'data' with 2 columns.")
    if (op == "xor") pmax(pmin(data[,1], 1 - data[,2], na.rm = na.rm), pmin(1 - data[,1], data[,2], na.rm = na.rm), na.rm = na.rm)  # http://www.dmitry-kazakov.de/ada/fuzzy.htm#Fuzzy
    else if (op == "AnotB") pmin(data[,1], 1 - data[,2], na.rm = na.rm)
    else if (op == "expansion") ifelse(data[ , 2] > data[ , 1], 
                                       data[ , 2] - data[ , 1], 
                                       0)
    else if (op == "contraction") ifelse(data[ , 2] < data[ , 1], 
                                         data[ , 2] - data[ , 1], 
                                         0)
    else if (op == "change") data[ , 2] - data[ , 1]
  }
  else stop ("Invalid 'op' name.")
}


You can install and load fuzzySim (>= 1.6) and then check help(fuzzyOverlay) for further info and reproducible usage examples.
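
As a quick sketch (with made-up values rather than real model predictions), the function can be called like this:

preds <- data.frame(mod1 = c(0.10, 0.80, 0.55), mod2 = c(0.30, 0.60, 0.55))
fuzzyOverlay(preds, op = "intersection")   # row-wise fuzzy AND
fuzzyOverlay(preds, op = "consensus")      # mean prediction per row
fuzzyOverlay(preds, op = "change")         # column 2 minus column 1 (needs exactly 2 columns)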

REFERENCES

Barbosa A.M. & Real R. (2012) Applying fuzzy logic to comparative distribution modelling: a case study with two sympatric amphibians. The Scientific World Journal, 2012, Article ID 428206

Gutiérrez-Rodríguez J., Barbosa A.M. & Martínez-Solano I. (in press) Present and past climatic effects on the current distribution and genetic diversity of the Iberian spadefoot toad (Pelobates cultripes): an integrative approach. Journal of Biogeography

Reino L., Ferreira M., Martínez-Solano I., Segurado P., Xu C. & Barbosa A.M. (in press) Favourable areas for co-occurrence of parapatric species: niche conservatism and niche divergence in Iberian tree frogs and midwife toads. Journal of Biogeography

Zadeh L.A. (1965) Fuzzy sets. Information and Control, 8: 338-353

Assess model overlap with Schoener’s D, Hellinger distance and Warren’s I

Metrics for quantifying the similarity among ecological niche models are important for testing patterns of niche evolution. The modOverlap function, used e.g. by Reino et al. (in press) and now included in the fuzzySim package (>= 1.5), calculates three such metrics: Schoener's D statistic for niche overlap, the Hellinger distance between probability distributions, and the I similarity statistic of Warren et al. (2008).

These formulas are also implemented e.g. within the niche.overlap function of the R package phyloclim, but there they require input data in complex, software-specific formats. The function below calculates these indices simply from two vectors or columns containing the predictions of the two models to compare.

modOverlap <- function(pred1, pred2, na.rm = TRUE) {
  # rescale each prediction vector so it sums to 1 (treating it as a probability distribution):
  p1 <- pred1 / sum(pred1, na.rm = na.rm)
  p2 <- pred2 / sum(pred2, na.rm = na.rm)
  SchoenerD <- 1 - 0.5 * sum(abs(p1 - p2), na.rm = na.rm)  # Schoener's D niche overlap
  HellingerDist <- sqrt(sum((sqrt(p1) - sqrt(p2)) ^ 2, na.rm = na.rm))  # Hellinger distance
  WarrenI <- 1 - ((HellingerDist ^ 2) / 2)  # I similarity statistic of Warren et al. (2008)
  list(SchoenerD = SchoenerD, WarrenI = WarrenI, HellingerDist = HellingerDist)
}
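
A quick sketch of a call, with simulated predictions standing in for the outputs of two models:

set.seed(42)
pred.a <- runif(200)
pred.b <- runif(200)
modOverlap(pred.a, pred.b)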


See also the fuzSim function in the fuzzySim package, which calculates fuzzy versions of classic binary similarity indices such as Jaccard, Baroni-Urbani & Buser, Sorensen and Simpson (Barbosa 2015).

 

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Reino L., Ferreira M., Martínez-Solano I., Segurado P., Xu C. & Barbosa A.M. (in press) Favourable areas for co-occurrence of parapatric species: niche conservatism and niche divergence in Iberian tree frogs and midwife toads. Journal of Biogeography

Warren D.L., Glor R.E. & Turelli M. (2008) Environmental niche equivalency versus conservatism: quantitative approaches to niche evolution. Evolution, 62: 2868-2883 [plus ERRATUM]

Select among correlated variables based on a given criterion

Correlations among variables are problematic in multivariate models, as they inflate the variance of the coefficients and may thus bias the interpretation of the effects of those variables on the response. One strategy to circumvent this problem is to eliminate a priori one variable from each pair of correlated variables, but it is not always straightforward to pick the right one. The corSelect function selects among correlated variables based either on their (stepwise) variance inflation factor (VIF), or on their relationship with the response variable: it builds a bivariate model of each individual variable against the response and, from each pair of correlated variables, keeps the one with the best p-value, AIC or BIC.

This function is now included in fuzzySim and depends on other functions in the package, such as FDR and multicol. It is also included as a new option within function multGLM in this package, so that multiple models can be built based on uncorrelated variables chosen appropriately for each species.

corSelect <- function(data, sp.cols = NULL, var.cols, cor.thresh = 0.8, select = "p.value", ...) {

  # version 1.6 (6 Jul 2017)

  if (length(sp.cols) > 1) stop("Sorry, 'corSelect' is currently implemented for only one 'sp.col' at a time.")

  univar.criteria <- c("VIF")
  bivar.criteria <- c("p.value", "AIC", "BIC")

  if (!(select %in% c(univar.criteria, bivar.criteria))) stop("Invalid 'select' criterion.")

  if (!is.null(sp.cols) && select %in% bivar.criteria) {
    n.in <- nrow(data)
    data <- data[is.finite(data[ , sp.cols]), ]
    n.out <- nrow(data)
    if (n.out < n.in) warning(n.in - n.out, " observations removed due to missing data in 'sp.cols'; ", n.out, " observations actually evaluated.")
  }

  cor.mat <- cor(data[ , var.cols], ...)
  cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA
  high.cor.mat <- bivar.mat <- numeric(0)

  if (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
    high.cor.rowcol <- as.matrix(which(abs(cor.mat) >= cor.thresh, arr.ind = TRUE))
    high.cor.inds <- sort(unique(as.vector(high.cor.rowcol)))
    high.cor.mat <- data.frame(var1 = rownames(cor.mat)[high.cor.rowcol[ , "row"]], var2 = colnames(cor.mat)[high.cor.rowcol[ , "col"]])
    high.cor.mat$corr <- NULL
    for (r in 1:nrow(high.cor.mat)) high.cor.mat$corr[r] <- cor.mat[high.cor.rowcol[ , "row"][r], high.cor.rowcol[ , "col"][r]]
    high.cor.mat <- high.cor.mat[order(abs(high.cor.mat$corr), decreasing = TRUE), ]

    if (is.null(sp.cols) && select %in% bivar.criteria) {
      message("Bivariate relationships not assessable without a response variable ('sp.cols'). Returning high pairwise correlations among 'var.cols'.")
      return(high.cor.mat)
    }

    high.cor.vars <- unique(rownames(cor.mat[high.cor.inds, high.cor.inds]))

    if (select %in% bivar.criteria) {
      bivar.mat <- FDR(data = data, sp.cols = sp.cols, var.cols = match(high.cor.vars, colnames(data)), simplif = TRUE)[ , c("p.value", "AIC", "BIC")]
      same.order <- identical(order(bivar.mat[ , "p.value"]), order(bivar.mat[ , "AIC"])) && identical(order(bivar.mat[ , "p.value"]), order(bivar.mat[ , "BIC"]))
      if (same.order) message("Results identical whether p-value, AIC or BIC is used to select variables.\n")
      else message("Results NOT identical whether p-value, AIC or BIC is used to select variables.\n")
    }  # end if select in bivar

    data.remaining <- data[ , var.cols]
    while (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
      max.cor.ind <- which(abs(cor.mat) == max(abs(cor.mat), na.rm = TRUE), arr.ind = TRUE)
      v1 <- rownames(cor.mat)[max.cor.ind[1]]
      v2 <- colnames(cor.mat)[max.cor.ind[2]]
      if (select %in% univar.criteria) {
        vif <- multicol(data.remaining)
        targets <- vif[c(v1, v2), "VIF", drop = FALSE]
      }  # end if univar
      else if (select %in% bivar.criteria) {
        targets <- bivar.mat[c(v1, v2), select, drop = FALSE]
      }  # end if bivar

      exclude <- which.max(as.matrix(targets))
      excl.name <- rownames(targets)[exclude]
      excl.ind <- match(excl.name, colnames(cor.mat))
      data.remaining <- data.remaining[ , -excl.ind, drop = FALSE]
      cor.mat <- cor.mat[-excl.ind, -excl.ind, drop = FALSE]
      if (all(is.na(cor.mat))) cor.mat <- numeric(0)
      if (length(cor.mat) == 0) break
    }  # end while max

  }  # end if max > thresh

  selected.vars <- colnames(cor.mat)
  selected.var.cols <- match(colnames(cor.mat), colnames(data))
  excluded.vars <- colnames(data)[var.cols][!(colnames(data)[var.cols] %in% colnames(cor.mat))]

  vif2 <- multicol(data[ , selected.var.cols])

  list(high.correlations = high.cor.mat,
       bivariate.significance = bivar.mat,
       excluded.vars = excluded.vars,
       selected.vars = selected.vars,
       selected.var.cols = selected.var.cols,
       strongest.remaining.corr = cor.mat[which.max(abs(cor.mat))],
       remaining.multicollinearity = vif2
  )
}  # end corSelect function

 

You can install and load fuzzySim (>= 1.5) and then check help(corSelect) for reproducible usage examples.
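
As a sketch (assuming the rotif.env sample dataset shipped with fuzzySim, which is also used in the stepByStep example further below), the function can be called like this:

library(fuzzySim)
data(rotif.env)
corSelect(rotif.env, sp.cols = 18, var.cols = 5:17, select = "AIC")   # choose by bivariate AIC
corSelect(rotif.env, sp.cols = 18, var.cols = 5:17, select = "VIF")   # choose by variance inflation factor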

Plot column(s) of a polygon vector map

NOTE: I recently had a tutorial on the cartography R package, which makes mapping columns of a data frame much less troublesome. You may want to look at that instead.

If you have a polygon vector map in R and want to quickly map the values in one or more columns of its attribute table, you can use the plotMapColumns function. There's an option to rasterize the map before plotting, which may be considerably faster and is TRUE by default, but you'll need to use an appropriate raster.upscale value. This is the number by which the range of coordinates is divided to get the number of pixels in the plotted maps. It is advisable to first check range(coordinates(map)) and see which raster.upscale divisor will yield a reasonably computable raster: for geographical lat-lon coordinates, raster.upscale = 1 will usually work (you'll have at most 360 x 180 pixels; you may even lower it to 0.5 or 0.3 if you need more detailed maps), but for a UTM projection (whose coordinates can be much larger values) you may need raster.upscale = 10000 to get a manageable number of pixels.
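
For example, a rough check of the coordinate range (a sketch; 'map' stands for your own SpatialPolygonsDataFrame):

library(sp)
range(coordinates(map)[ , 1])   # x range: roughly -180..180 for lat-lon, but possibly hundreds of thousands for UTM
range(coordinates(map)[ , 2])   # y range
# if these spans are on the order of a few hundred units, raster.upscale = 1 is fine;
# for metric projections, try raster.upscale = 1000 or 10000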

plotMapColumns <- function(map, # SpatialPolygons* object
                           columns, # index(es) of the column(s) of map@data containing the values to plot (there will be one output map per column)
                           rasterize = TRUE, # rasterize the map before plotting? (usually considerably faster)
                           raster.upscale = 1, # number by which the range of coordinates is divided to get the number of pixels (if rasterize = TRUE); it's advised to first check the range of your coordinates (e.g., for geographical lat-lon a value of 1 may work, but for a UTM projection you may need 10,000!)
                           ...) # additional arguments for the plot or spplot function
  {

  stopifnot(raster.upscale > 0 | is.null(raster.upscale),
            require(raster) | rasterize == FALSE
            )
 
  if (!all(columns %in% 1:length(names(map)))) stop ("index out of bounds; 'columns' must exist in map@data.")
 
  if (rasterize) {
    xmin <- min(coordinates(map)[,1])
    xmax <- max(coordinates(map)[,1])
    ymin <- min(coordinates(map)[,2])
    ymax <- max(coordinates(map)[,2])
    wdth <- round(xmax - xmin)
    hght <- round(ymax - ymin)
 
    #if (raster.upscale == "auto") {
      #max.length <- max(wdth, hght)
      #if (max.length > 500) raster.upscale <-
    #}
 
    #if (!is.null(max.rast.dim)) {
    #  rast.dim <- wdth * hght
    #}
 
    wdth <- wdth / raster.upscale
    hght <- hght / raster.upscale
    message("plotting map(s) with ", wdth, "x", hght, " pixels; consider rising 'raster.upscale' if this is taking too long, or lowering it if the resulting maps are too coarse.")
    require(raster)
    rast <- raster(nrows = hght, ncols = wdth, xmn = xmin, xmx = xmax, ymn = ymin, ymx = ymax)
  }  # end if rasterize I
 
  #if (centroids) {
  #  attr.table <- map@data
  #  map <- SpatialPointsDataFrame(coordinates(map))
  #  map@data <- attr.table
  #  rast <- raster(map)
  #} else {
    require(sp)
  #}
 
  n.cols <- length(columns)
  col.count <- 0
  for (i in columns) {
    col.count <- col.count + 1
    message("Plotting column ", col.count, " of ", n.cols, "...")
    if (rasterize) {
      map.rast <- rasterize(x = map, y = rast, field = names(map)[i], fun = 'last')
      plot(map.rast, main = names(map)[i], ...)
    }  # end if rasterize II
    else {
      print(spplot(map, zcol = names(map)[i], main = names(map)[i], ...))
    }  # end else
  }  # end for i
  message("Finished!")
}


Usage example:

# download, unzip and import a map of countries:
download.file("http://biogeo.ucdavis.edu/data/world/countries_shp.zip", destfile = "countries_shp.zip")
unzip("countries_shp.zip")
countries <- rgdal::readOGR(dsn = ".", layer= "countries")
 
# see the data in the attributes table:
head(countries)
names(countries)
 
# use plotMapColumns with and without rasterizing:
plotMapColumns(countries, columns = 17:18, rasterize = TRUE, raster.upscale = 1)
plotMapColumns(countries, columns = 18, rasterize = FALSE)  # slower

You can pass additional arguments to the plot or spplot function, to get e.g. different colour schemes, as sketched below. The plotMapColumns function is not (yet) included in a package.
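
For instance (a sketch, reusing the 'countries' map imported above), a different colour ramp for the rasterized version:

plotMapColumns(countries, columns = 18, rasterize = TRUE, raster.upscale = 1, col = terrain.colors(12))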

Analyse and compare stepwise model predictions

This function builds a generalized linear model with forward stepwise inclusion of variables, using AIC as the selection criterion, and gives the values predicted at each step, as well as their correlation with the final model predictions. This can be useful to see if a model with just the first few variables could already provide predictions very close to the final ones (see e.g. Fig. 3 in Muñoz et al., 2005). This function is now included in the fuzzySim package (Barbosa, 2015; version 1.3 just released). The correlation coefficient (cor.method) is “pearson” by default, but can be set to “spearman” or “kendall” as well.

stepByStep <- function(data, # name of dataset
                       sp.col, # index of species (target variable) column
                       var.cols, # indices of variables columns
                       family = binomial(link = "logit"),
                       Favourability = FALSE, # used only if family=binomial(logit)
                       trace = 0, # argument for 'step'
                       cor.method = "pearson") {
 
  # version 1.3 (22 Jul 2015)

  n.init <- nrow(data)
  data <- na.omit(data)
  na.loss <- n.init - nrow(data)
  if (na.loss > 0) message(na.loss, " cases excluded due to missing values.")

  response <- colnames(data)[sp.col]
  null.model.formula <- as.formula(paste(response, "~", 1))
  scope.formula <- as.formula(paste("~", paste(colnames(data)[var.cols], collapse = "+")))
  mod <- step(glm(null.model.formula, family = family, data = data), scope = scope.formula, direction = "forward", trace = trace)
  pred.final <- mod$fitted.values

  if (!all(c("binomial", "logit") %in% mod$family)) Favourability <- FALSE
  if (Favourability) fav.final <- Fav(model = mod)

  model.vars.split <- sapply(mod$anova[ , 1], strsplit, split = " ")
  model.vars <- lapply(model.vars.split, `[`, 2)
  model.vars <- as.character(model.vars)[-1]
  n.steps <- length(model.vars)

  preds <- favs <- as.data.frame(matrix(nrow = nrow(data), ncol = n.steps))
  cor.P <- cor.F <- vector("numeric", n.steps)
  names(model.vars) <- names(preds) <- names(favs) <- names(cor.P) <- names(cor.F) <- paste0("step", 1:n.steps)

  for (s in 1:n.steps) {
    step.vars <- model.vars[1:s]
    mod.step <- glm(as.formula(paste(response, "~", paste(step.vars, collapse = "+"))), family = family, data = data)
    if (Favourability) {
      favs[ , s] <- Fav(model = mod.step)
      cor.F[s] <- cor(favs[ , s], fav.final, method = cor.method)
    }
    else {
      preds[ , s] <- mod.step$fitted.values
      cor.P[s] <- cor(preds[ , s], pred.final, method = cor.method)
    }
  }; rm(s)

  if (Favourability) result <- list(predictions = favs, correlations = cor.F, variables = model.vars, model = mod)
  else result <- list(predictions = preds, correlations = cor.P, variables = model.vars, model = mod)

  return(result)
}

Usage example:

library(fuzzySim)
data(rotif.env)
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, cor.method = "spearman")
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, Favourability = TRUE, cor.method = "pearson")
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, family = poisson)


Only forward selection is implemented for now. I may or may not eventually find the time to implement other selection methods.

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Muñoz A.R., Real R., Barbosa A.M. & Vargas J.M. (2005) Modelling the distribution of Bonelli's Eagle in Spain: implications for conservation planning. Diversity and Distributions, 11: 477-486