Crop or mask rasters from a zip file

Imagine you have a zipped folder with global raster maps representing different variables (e.g. the ones you can download from WorldClim, CliMond, ENVIREM, CHELSA or ecoClimate), and you need to extract just the particular region or country that you'll be working with. The cropZippedRasters function, provided below, automates this process without the need to unzip the entire folder first (which can require substantial disk space). The function extracts one raster at a time from the given .zip file, imports the raster to R, crops or masks it with a user-provided raster or polygon vector map, and (by default) deletes the unzipped raster file before unzipping the next one, thus keeping disk usage to a minimum.

cropZippedRasters <- function(zip.file, cut.map, msk = FALSE, rast.file.ext = ".tif", aux.file.ext = NULL, delete.unzipped = TRUE)
  # zip.file: path to the zip containing the raster maps to crop or mask
  # cut.map: raster or SpatialPolygons map (in your R workspace) defining the region to crop or mask
  # msk: logical, whether to actually mask with the cut.map borders (slower) or to just crop to the extent of cut.map (the default, which can be considerably faster)
  # rast.file.ext: file extension for the raster maps on disk (e.g. ".tif", ".bil", ".rst")
  # aux.file.ext: file extension for the auxiliary files when they exist (e.g. ".hdr" for .bil raster files, or ".rdc" for Idrisi .rst files)
  # delete.unzipped: logical, whether to delete each unzipped file after cropping/masking it (the default), to save disk space
{

  require(raster)
  rast.files <- unzip(zip.file, list = TRUE) $ Name
  var.names <- unique(tools::file_path_sans_ext(rast.files))
  n.var <- length(var.names)
  cut.rasts <- vector("list", length(var.names))
  names(cut.rasts) <- var.names

  for (i in 1:n.var) {
    message("Importing map ", i, " of ", n.var)
    message("  - unzipping file...")
    unzip(zip.file, files = paste0(var.names[i], rast.file.ext))
    if (!is.null(aux.file.ext)) unzip(zip.file, files = paste0(var.names[i], aux.file.ext))
    var.rast <- raster(paste0(var.names[i], rast.file.ext))

    message("  - cropping/masking map...")
    var.cut <- crop(var.rast, cut.map, snap = "out")
    if (msk)  var.cut <- mask(var.cut, cut.map)
    cut.rasts[[i]] <- var.cut

    if (delete.unzipped) {
      message("  - deleting unzipped file...")
      unlink(paste0(var.names[i], rast.file.ext))
      if (!is.null(aux.file.ext)) unlink(paste0(var.names[i], aux.file.ext))
    }  # end if delete

  }  # end for i
  message("Finished!")
  return(stack(cut.rasts))
}

Mind that you need to have the raster package installed for this, as well as a vector or raster map of the region with which you want to crop or mask the raster variables. This function is closely related to the zonalFromZip function posted a while ago; I may eventually get round to combining them into a single function.

Usage example:

wclim_we <- cropZippedRasters(zip.file = "wc2.0_30s_bio.zip", cut.map = westeurope, rast.file.ext = ".tif", aux.file.ext = NULL)
plot(wclim_we)
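
Note that westeurope here is an example object that must already exist in your R workspace: a raster or polygon map of your study region. If you don't have one, a possible way to obtain a country boundary for cut.map is the getData function of the raster package (the ISO code "PRT", for Portugal, is just an illustration):

westeurope <- getData("GADM", country = "PRT", level = 0)  # downloads the country boundary polygon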

New features in fuzzySim functions

Just two quick notes on recent improvements to functions in the fuzzySim package:

– Thanks to a bug report by José Carlos Guerrero, the multTSA function now also works when there is only a single species to analyse, rather than multiple ones;

– Thanks to a feature request and a bit of assistance by Alba Estrada, the getPreds function now also works if the data argument is a RasterStack rather than a data frame.

Check out the new versions of these functions in the latest version of fuzzySim!

Find pairs of highly correlated variables

Correlations among variables are problematic in multivariate models, and good practice commonly recommends eliminating a priori one variable from each pair of highly correlated ones. The corPairs function below takes a matrix of correlations among variables, plots a cluster dendrogram based on them, and returns the pairs of variables with absolute correlations above a given threshold (0.8 by default):

corPairs <- function(cor.mat, cor.thresh = 0.8, ...) {
  # A.M. Barbosa, version 1.0 (22 Feb 2018)
  dist_matrix <- as.dist(1 - abs(cor.mat))
  plot(hclust(dist_matrix, ...), ylab = "1 - absolute correlation", xlab = "", sub = "")
  abline(h = 1 - cor.thresh, col = "red")
  cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA
  high.cor.mat <- numeric(0)
  if (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
    high.cor.rowcol <- as.matrix(which(abs(cor.mat) >= cor.thresh, arr.ind = TRUE))
    high.cor.inds <- sort(unique(as.vector(high.cor.rowcol)))
    high.cor.mat <- data.frame(var1 = rownames(cor.mat)[high.cor.rowcol[ , "row"]], var2 = colnames(cor.mat)[high.cor.rowcol[ , "col"]])
    high.cor.mat$corr <- cor.mat[high.cor.rowcol]  # matrix indexing extracts each pair's correlation
    high.cor.mat <- high.cor.mat[order(abs(high.cor.mat$corr), decreasing = TRUE), ]
    return (high.cor.mat)
  } # end if max
  else return("No correlations exceed the threshold.")
} # end corPairs function
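
A quick usage example, applying corPairs to the correlation matrix of some numeric columns of R's built-in mtcars data:

cm <- cor(mtcars[ , c("mpg", "disp", "hp", "drat", "wt", "qsec")])
corPairs(cm, cor.thresh = 0.8)  # plots the dendrogram and returns the highly correlated pairs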

This function is closely related to (and will soon be included in) the corSelect function in the fuzzySim package (Barbosa, 2015), but operates on correlation matrices rather than the raw variables, and adds a cluster dendrogram.

 

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853–858

Potential biodiversity and relative favourability

The two functions below allow calculating potential biodiversity and relative favourability, respectively, according to the procedures described by Real et al. (2017), which should be cited if you use either of them, and also applied by Estrada et al. (2018):

potentialBiodiv <- function(models = NULL,    # a list of binomial glm model objects
                            data = NULL,      # alternatively, data frame with observations and predictions
                            sp.cols = NULL,   # index numbers of the columns with presence-absence data
                            prob.cols = NULL, # index numbers of the columns with probabilities
                            fav.cols = NULL,  # alternatively, index numbers of columns with favourabilities
                            id.col = NULL,    # optionally, index number of a column with row identifiers
                            zero.action = "increase") {  # what to do with zero favourability values when calculating the geometric mean; alternatives are "exclude" or "maintain"

  if (!(zero.action %in% c("increase", "exclude", "maintain"))) stop("Invalid 'zero.action'.")

  if (!is.null(models)) {
    if (!all(sapply(models, inherits, "glm"))) stop("All 'models' must be of class 'glm'.")
    if (!all(sapply(models, function(m) family(m)$family == "binomial"))) stop("All 'models' must be of family 'binomial'.")
    obs.data <- as.data.frame(lapply(models, `[[`, "y"))
    prob.data <- as.data.frame(lapply(models, `[[`, "fitted.values"))
  } else {
    if (is.null(data) || is.null(sp.cols) || (is.null(prob.cols) && is.null(fav.cols))) stop("You need to provide either 'models', or 'data' + 'sp.cols' + either 'prob.cols' or 'fav.cols'.")
    obs.data <- data[ , sp.cols, drop = FALSE]
    if (!is.null(prob.cols)) prob.data <- data[ , prob.cols, drop = FALSE]
  }  # end if !null models else

  Nobs <- nrow(obs.data)
  Nsp <- ncol(obs.data)

  if (is.null(fav.cols)) {  # compute favourability from probability (as in fuzzySim::Fav)
    favs <- matrix(NA, ncol = Nsp, nrow = Nobs)
    for (s in 1:Nsp) {
      n1 <- sum(obs.data[ , s] == 1, na.rm = TRUE)
      n0 <- sum(obs.data[ , s] == 0, na.rm = TRUE)
      prob <- prob.data[ , s]
      favs[ , s] <- (prob / (1 - prob)) / ((n1 / n0) + (prob / (1 - prob)))
    }; rm(s)
  } else {
    favs <- data[ , fav.cols, drop = FALSE]
  }  # end if null fav.cols else

  if (zero.action == "increase") {
    favs[favs == 0] <- 2.2e-16  # a tiny positive value (roughly .Machine$double.eps)
  } else if (zero.action == "exclude") {
    favs[favs == 0] <- NA
  }

  F_lns <- log(favs)
  F_ln_sums <- rowSums(F_lns, na.rm = TRUE)
  N_incl <- rowSums(!is.na(F_lns))  # number of species actually included per row (matters when zero.action = "exclude")

  result <- data.frame(SpRich = rowSums(obs.data),
                       Fmin = apply(favs, 1, min, na.rm = TRUE),
                       Fmax = apply(favs, 1, max, na.rm = TRUE),
                       Fsum = rowSums(favs, na.rm = TRUE),
                       Fmean = rowMeans(favs, na.rm = TRUE),
                       Fgmean = exp(F_ln_sums / N_incl))  # geometric mean favourability

  if (!is.null(data) && !is.null(id.col))
    result <- data.frame(data[ , id.col, drop = FALSE], result)

  result
}

 

relativeFav <- function(target,  # matrix or data frame of favourabilities in the target period
                        ref) {   # idem for the reference period, with columns in the same order
  stopifnot(ncol(target) == ncol(ref))
  N <- ncol(target)
  fav_ratios <- target / ref  # species-wise ratio of target to reference favourability
  logs <- log(fav_ratios)
  log_sums <- rowSums(logs)
  exp(log_sums / N)  # geometric mean of the ratios across species
}
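
A minimal usage sketch with made-up data, purely for illustration (the object names are not part of the functions):

set.seed(1)
obs <- as.data.frame(matrix(rbinom(300, 1, 0.3), ncol = 3))  # presence-absence for 3 species
prob <- as.data.frame(matrix(runif(300), ncol = 3))          # modelled presence probabilities
names(obs) <- paste0("sp", 1:3); names(prob) <- paste0("P_sp", 1:3)
dat <- data.frame(obs, prob)

pb <- potentialBiodiv(data = dat, sp.cols = 1:3, prob.cols = 4:6)
head(pb)

# relative favourability, given favourabilities for a reference and a target period:
fav_ref <- matrix(runif(30, 0.1, 0.9), ncol = 3)
fav_tgt <- matrix(runif(30, 0.1, 0.9), ncol = 3)
relativeFav(target = fav_tgt, ref = fav_ref)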

 

REFERENCES

Estrada A., Barbosa A.M., & Real R. (2018) Changes in potential mammal diversity in National Parks and their implications for conservation. Current Zoology, https://doi.org/10.1093/cz/zoy001

Real R., Barbosa A.M., & Bull J.W. (2017) Species Distributions, Quantum Theory, and the Enhancement of Biodiversity Measures. Systematic Biology, 66: 453-462

Trend surface analysis for multiple species

Trend surface analysis is a way to model the spatial structure in species distributions by regressing occurrence data on the spatial coordinates x and y, for a linear trend, and/or on polynomial terms of these coordinates (including powers and cross-products of the coordinates) for curvilinear trends (Legendre & Legendre, 1998; Borcard et al., 2011). Second- and third-degree polynomials are often used. The multTSA function below, which is included in the fuzzySim package, allows specifying the degree of the spatial polynomial to use. By default, it uses a 3rd-degree polynomial, performs backward stepwise AIC selection of the polynomial terms to include, and returns the presence probability resulting from the spatial trend, although these options can be changed; for example, if you want the result to be a new spatial variable on the same scale as the input coordinates, you can set P = FALSE. This function can be used for one or multiple species at a time, in a similar way to the multGLM function in the same package.

multTSA <- function (data, sp.cols, coord.cols, id.col = NULL, degree = 3, step = TRUE, direction = "backward", P = TRUE, Favourability = FALSE, suffix = "_TS", save.models = FALSE) {
  # updated May 5 2018

  start.time <- Sys.time()

  coords.poly <- as.data.frame(poly(as.matrix(data[, coord.cols]), 
                                    degree = degree, raw = TRUE))
  n.poly.terms <- ncol(coords.poly)
  colnames(coords.poly) <- paste0(rep("poly", n.poly.terms), 
                                  c(1:n.poly.terms))
  sp.data <- as.matrix(data[, sp.cols])
  colnames(sp.data) <- colnames(data[, sp.cols, drop = FALSE])
  n.subjects <- length(sp.cols)
  if (save.models) 
    TSA.models <- vector("list", n.subjects)
  subj.count <- 0
  data.input <- data
  data <- cbind(data, coords.poly)
  for (s in 1:n.subjects) {
    subj.count <- subj.count + 1
    subj.name <- colnames(sp.data)[s]
    message("Computing TSA ", subj.count, " of ", n.subjects, 
            " (", subj.name, ") ...")
    model.formula <- as.formula(paste(subj.name, "~", paste(colnames(coords.poly), collapse = "+")))
    model.expr <- expression(with(data, glm(model.formula, family = binomial, data = data)))
    if (step) 
      model <- step(eval(model.expr), direction = direction, trace = 0)
    else model <- eval(model.expr)
    if (P || Favourability)  pred <- predict(model, coords.poly, type = "response")
    else  pred <- predict(model, coords.poly)
    if (Favourability) {
      pred <- Fav(obs = sp.data[, s], pred = pred)
    }
    data[, ncol(data) + 1] <- pred
    colnames(data)[ncol(data)] <- paste0(subj.name, suffix)
    if (save.models) {
      TSA.models[[subj.count]] <- model
      names(TSA.models)[[subj.count]] <- subj.name
    }
  }
  predictions <- data.frame(data[, id.col], data[, ((ncol(data.input) + 
                                                       1 + n.poly.terms):ncol(data)), drop = FALSE])
  if (!is.null(id.col)) {
    if (is.character(id.col)) 
      colnames(predictions)[1] <- id.col
    else colnames(predictions)[1] <- colnames(data)[id.col]
  }
  message("Finished!")
  timer(start.time)
  if (save.models) 
    return(list(predictions = data.frame(predictions), TSA.models = TSA.models))
  else return(predictions)
}
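
A minimal usage sketch with simulated data (the object and species names are just illustrative; loading fuzzySim provides multTSA as well as the Fav and timer functions it uses):

library(fuzzySim)

set.seed(2)
dat <- data.frame(x = runif(200), y = runif(200))
dat$spA <- rbinom(200, 1, plogis(3 * dat$x - 2))  # presence more likely towards high x
dat$spB <- rbinom(200, 1, plogis(2 * dat$y - 1))  # presence more likely towards high y

tsa <- multTSA(dat, sp.cols = c("spA", "spB"), coord.cols = c("x", "y"), degree = 3)
head(tsa)  # one column of spatial-trend probabilities per species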

 

REFERENCES

Borcard D., Gillet F. & Legendre P. (2011) Numerical Ecology with R. Springer, New York.

Legendre P. & Legendre L. (1998) Numerical Ecology. Elsevier, Amsterdam.

Recent updates to previous functions

Just a quick note to let you know of changes introduced to some functions in recent versions of their packages:

  • multGLM, FDR and corSelect (package fuzzySim) now also offer BIC (the Bayesian Information Criterion, also known as Schwarz criterion, SBC or SBIC) for variable selection, which makes selection more parsimonious, especially for large sample sizes
  • multGLM (package fuzzySim) now allows choosing which FDR correction to apply (if FDR is set to TRUE). It uses "fdr" (equivalent to "BH") by default, but check p.adjust for other available methods, namely "BY", which allows for non-independent data (see the short example after this list)
  • varPart (package modEvA) now does not ask for model.type, includes in the output diagram the proportion of unexplained variance, and allows defining a title (main) for the diagram
  • several modEvA functions now automatically omit NAs, and emit a warning instead of an error when there are pred values outside the [0, 1] interval (which some models can produce)
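
To illustrate the difference between correction methods, here is base R's p.adjust applied to some made-up p-values:

p <- c(0.001, 0.004, 0.019, 0.03, 0.2)
p.adjust(p, method = "fdr")  # the default in multGLM; same as "BH"
p.adjust(p, method = "BY")   # more conservative, valid under dependence among tests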

The help files of these functions have also been updated with further details. It is advisable to uninstall fuzzySim and modEvA and then install their latest versions.