Calculate zonal statistics from rasters in multiple zip files

This is a wrapper for the zonalFromZip function (published in the previous post, and included further below) for when you have multiple zip files, each containing multiple raster files (as in the WorldClim paleo-climate database), and you want to extract zonal statistics from them all automatically. To use it, you need the zonalFromZip function loaded, as well as the raster R package.

zonalsFromZips <- function(zip.files, zones.map, rast.file.ext = ".tif", aux.file.ext = NULL, verbosity = 1, ...) {
  # v2.0 (2018/07/11)
  results <- vector("list", length(zip.files))
  names(results) <- basename(tools::file_path_sans_ext(zip.files))
  for (f in 1:length(zip.files)) {
    message("\nUNZIPPING FILE ", f, " OF ", length(zip.files), " (", basename(zip.files[f]), ")...")
    results[[f]] <- zonalFromZip(zip.file = zip.files[f], zones.map = zones.map, rast.file.ext = rast.file.ext, aux.file.ext = aux.file.ext, verbosity = verbosity, ...)
  }; rm(f)
  message("\nFINISHED ALL!")
  return(results)
}  # end zonalsFromZips function

The result is a list of dataframes, each containing the zonal stats for one of the .zip files of rasters. Usage example:

LGM.zonals <- zonalsFromZips(zip.files = list.files("/home/joe/LGM", full.names = TRUE), zones.map = provinces)
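Since the result is a named list with one data frame per .zip file, each element can be inspected or exported separately. A minimal sketch (just an illustration, assuming the LGM.zonals object created above):

names(LGM.zonals)        # one element per .zip file
head(LGM.zonals[[1]])    # zonal stats table for the first .zip
# optionally, export each table to a .csv file:
for (n in names(LGM.zonals))  write.csv(LGM.zonals[[n]], paste0(n, "_zonalstats.csv"), row.names = FALSE)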


Calculate zonal statistics from rasters in a zip file

Imagine you have a zipped folder with a bunch of raster maps containing variables (e.g. the ones you can download from WorldClim or from CliMond), and you need to calculate zonal statistics from each of these rasters. The zonalFromZip function, provided below, automates this process without the need to unzip the whole folder. It extracts one raster at a time from the .zip, imports it to R, calculates zonal statistics for your zones map (the mean is used by default, but you can provide any other function accepted by the zonal function of the R raster package), and then deletes the unzipped file before unzipping the next one, thus requiring minimal disk space.

zonalFromZip <- function(zip.file, zones.map, fun = mean, rast.file.ext = ".tif", aux.file.ext = NULL, delete.unzipped = TRUE, verbosity = 2, ...) {
  # version 2.1 (updated 11/12/2018)
  # zip.file: path to the zip containing the raster maps to calculate zonal stats from
  # zones.map: map (in your R workspace) containing the spatial units to calculate zonal stats for; must be of class 'RasterLayer', 'SpatialPolygons' or 'SpatialPolygonsDataFrame' and have the same CRS (and, if a raster, the same resolution) as the maps in 'zip.file'
  # fun: function to calculate (mean by default)
  # rast.file.ext: file extension of the raster maps in 'zip.file'
  # aux.file.ext: file extension of the auxiliary files (e.g. ".hdr" for .bil raster files, or ".rdc" for Idrisi .rst files)
  # ...: additional arguments to pass to 'raster::zonal' (if 'zones.map' is a raster) or 'raster::extract' (if 'zones.map' is a SpatialPolygons* map), such as na.rm = TRUE

  require(raster)
  rast.files <- unzip(zip.file, list = TRUE)$Name
  var.names <- unique(tools::file_path_sans_ext(rast.files))
  n.var <- length(var.names)
  zonal.stats <- vector("list", length(var.names))
  names(zonal.stats) <- var.names

  for (i in 1:n.var) {
    if (verbosity >= 1) message("Getting variable ", i, " of ", n.var)
    if (verbosity >= 2) message(" - unzipping file...")
    unzip(zip.file, files = paste0(var.names[i], rast.file.ext))
    if (!is.null(aux.file.ext)) unzip(zip.file, files = paste0(var.names[i], aux.file.ext))
    var.rast <- raster(paste0(var.names[i], rast.file.ext))

    if (!compareRaster(var.rast, zones.map, stopiffalse = FALSE)) {
      if (verbosity >= 2) message(" - cropping to zones raster...")
      var.rast <- crop(var.rast, zones.map)
    }

    if (verbosity >= 2) message(" - calculating zonal stats...")
    if (class(zones.map) %in% c("raster", "RasterLayer"))
      zonal.stats[[i]] <- raster::zonal(var.rast, zones.map, fun = fun, ...)
    else if (class(zones.map) %in% c("SpatialPolygons", "SpatialPolygonsDataFrame"))
      zonal.stats[[i]] <- raster::extract(var.rast, zones.map, fun = fun, df = TRUE, ...)
    else stop("'zones.map' must be of class 'raster', 'RasterLayer', 'SpatialPolygons' or 'SpatialPolygonsDataFrame'")

    if (delete.unzipped) {
      if (verbosity >= 2) message(" - deleting unzipped file...")
      unlink(list.files()[grep(pattern = paste0(var.names[i], rast.file.ext), x = list.files())])
      if (!is.null(aux.file.ext)) unlink(list.files()[grep(pattern = paste0(var.names[i], aux.file.ext), x = list.files())])
    }  # end if delete
  }  # end for i

  if (verbosity >= 1) message("Converting results to data frame...")
  zonal.stats <- as.data.frame(zonal.stats)
  zonal.stats <- subset(zonal.stats, select = c(1, seq(2, ncol(zonal.stats), 2)))
  colnames(zonal.stats)[1] <- "zone"
  colnames(zonal.stats)[-1] <- var.names
  if (verbosity >= 1) message("Finished!")
  return(zonal.stats)
}  # end zonalFromZip function

Note that you need the raster R package installed for this, as well as a map of the spatial units (zones) to which you want to aggregate the raster variables: either a zones raster or a SpatialPolygons* object matching the CRS of the rasters in the zip. Usage examples:

LGM.CCSM4.utm10 <- zonalFromZip(zip.file = "LGM/CCSM4.zip", zones.map = utm10, rast.file.ext = ".tif", aux.file.ext = NULL)
head(LGM.CCSM4.utm10)
 
WClim.utm10 <- zonalFromZip(zip.file = "bio_2-5m_bil.zip", zones.map = utm10, rast.file.ext = ".bil", aux.file.ext = ".hdr")
head(WClim.utm10)
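If your zones exist only as polygons and you prefer the (generally faster) raster-based zonal computation, you can rasterize them onto the same grid as the climate layers first. A minimal sketch, where utm10.polys (a SpatialPolygonsDataFrame with a numeric "ID" column) and the template file path are hypothetical examples:

library(raster)
template <- raster("LGM/CCSM4_bio1.tif")  # hypothetical raster on the target grid
utm10 <- rasterize(utm10.polys, template, field = "ID")  # one zone value per pixel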

Example for several .zip files within a folder at once (DEPRECATED – see the zonalsFromZips function above instead):

for (f in list.files("LGM")) {  
# "LGM" is the folder containing the zip files to extract zonal stats from
  name <- paste("LGM", tools::file_path_sans_ext(f), "utm10", sep = ".")
  zonstat <- zonalFromZip(zip.file = paste("LGM", f, sep = "/"), zones.map = utm10, rast.file.ext = ".tif", fun = "mean")
  assign(name, zonstat)
}

Classify integer columns

The integerCols function below detects which numeric columns in a data frame contain only whole numbers, and converts those columns to the integer class, so that they take up less memory. It uses the multConvert function (provided further below), which must also be loaded. Both functions are included in the latest version of the fuzzySim package (Barbosa, 2015).

integerCols <- function(data) {
  is.wholenumber <- function(x, tol = .Machine$double.eps ^ 0.5) {
    abs(x - round(x)) < tol
  }  # from ?is.integer examples
  all.cols <- 1:ncol(data)
  integer.cols <- rep(0, ncol(data))
  for (i in all.cols) {
    x <- na.omit(data[ , i])
    if(!is.numeric(x)) next
    if(!all(is.finite(x))) next
    if(min(is.wholenumber(x) == 1)) integer.cols[i] <- 1
  }
  multConvert(data, conversion = as.integer, cols = all.cols[integer.cols == 1])
}

Usage example:

dat <- data.frame(
  a = 1:10,
  b = as.numeric(1:10),
  c = seq(0.1, 1, 0.1),
  d = letters[1:10]
)
 
str(dat)  # b is classified as 'numeric' although it contains only whole numbers
 
dat2 <- integerCols(dat)
 
str(dat2)  # b now classified as 'integer'
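The memory saving is easy to check with base R's object.size (an illustrative aside, not from the original post): R stores doubles in 8 bytes per element and integers in 4, so the gain grows with the number of rows.

object.size(as.numeric(1:1e6))  # ~8 MB as 'numeric' (double)
object.size(as.integer(1:1e6))  # ~4 MB as 'integer'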

References:

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Extract a “jumping window” from a vector

Sometimes we need to extract and analyse parts of a dataset (e.g. a vector, or the columns or rows of a data frame) along a moving (a.k.a. sliding, rolling, running) window. The jumping.window function, included in the newly released DeadCanMove package (Barbosa et al. 2014), extracts such windows with (currently) no overlap between them, and with the possibility of a user-defined gap between consecutive windows:

jumping.window <- function(x, window.size, gap.size) {
  window.indices <- integer(0)
  for (i in 1 : window.size) {
    window.indices <- c(window.indices, seq(from = i, to = length(x), by = window.size + gap.size))
  }
  window.indices <- sort(window.indices)
  sampl.windows <- x[window.indices]
  return(sampl.windows)
}  # end jumping.window function

Usage examples:

Paste the whole text of the function above in R, press enter, and then paste the following commands to check out how the function works:

jumping.window(x = 1:20, window.size = 1, gap.size = 0)
 
jumping.window(x = 1:20, window.size = 1, gap.size = 1)
 
jumping.window(x = 6:52, window.size = 5, gap.size = 10)
 
jumping.window(x = c(1, 3, 4, 6, 9, 10, 11), window.size = 3, gap.size = 2)
 
jumping.window(x = c("air", "bird", "cloud", "deep", "elephant", "goat"), window.size = 2, gap.size = 2)
 
 
# to extract jumping-window columns from a table:
 
head(CO2)
CO2jump <- CO2[ , jumping.window(x = 1:ncol(CO2), window.size = 2, gap.size = 1)]
head(CO2jump)
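The same indexing works for rows; for example (an added illustration, not in the original post), keeping blocks of 10 rows of the built-in CO2 data frame separated by gaps of 5:

CO2rows <- CO2[jumping.window(x = 1:nrow(CO2), window.size = 10, gap.size = 5), ]
nrow(CO2)      # 84 rows originally
nrow(CO2rows)  # fewer rows, as the gap rows are dropped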

References:

Barbosa A.M., Marques J.T., Santos S.M., Lourenço A., Medinas D., Beja P. & Mira A. (2014) DeadCanMove: Assess how spatial roadkill patterns change with temporal sampling scheme. R package version 0.1 (available at http://deadcanmove.r-forge.r-project.org)

Convert geographical coordinates from degree-minute-second to decimal degrees

If you have latitude and/or longitude data in sexagesimal degrees, in the form degrees° minutes′ seconds″ hemisphere (e.g. 41° 34′ 10.956″ N or 8° 37′ 47.1036″ W, with or without spaces in between), the dms2dec function can convert them to decimal degrees, which are usually required for mapping. This function is not included in a package, but it is provided in the “Supporting Information” of Zanolla et al. (2018), so please cite this paper if you use the function.

dms2dec <- function(dms, separators = c("º", "°", "\'", "\"")) {
  # version 1.0 (25 Sep 2013)
  # dms: a vector (or column) of latitude or longitude in degrees-minutes-seconds-hemisphere, e.g. 41° 34' 10.956" N (with or without spaces)
  # separators: the characters that are separating degrees, minutes and seconds in dms

  dms <- as.character(dms)
  dms <- gsub(pattern = " ", replacement = "", x = dms)
  for (s in separators) dms <- gsub(pattern = s, replacement = "_splitHere_", x = dms)

  splits <- strsplit(dms, split = "_splitHere_")
  n <- length(dms)
  deg <- min <- sec <- hem <- vector("character", n)

  for (i in 1:n) {
    deg[i] <- splits[[i]][1]
    min[i] <- splits[[i]][2]
    sec[i] <- splits[[i]][3]
    hem[i] <- splits[[i]][4]
  }

  dec <- as.numeric(deg) + (as.numeric(min) / 60) + (as.numeric(sec) / 3600)
  sign <- ifelse (hem %in% c("N", "E"), 1, -1)
  dec <- sign * dec
  return(dec)
}  # end dms2dec function 

Usage example, from a table called mydata:

[LatLonTable: example table with 'Latitude' and 'Longitude' columns in degree-minute-second format]

mydata$latitude.decimal <- dms2dec(mydata$Latitude)

mydata$longitude.decimal <- dms2dec(mydata$Longitude)
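For a quick self-contained check (an added example, not in the original post), the coordinates mentioned above can be passed as literal strings; just make sure the degree/minute/second marks in your strings match the characters in the separators argument (the defaults include °, ' and "):

dms2dec(c("41° 34' 10.956\" N", "8° 37' 47.1036\" W"))
# should return approximately  41.56971  -8.62975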

REFERENCES:

Zanolla M., Altamirano M., Carmona R., De La Rosa J., Souza-Egipsy V., Sherwood A., Tsiamis K., Barbosa A.M., Muñoz A.R. & Andreakis N. (2018) Assessing global range expansion in a cryptic species complex: insights from the red seaweed genus Asparagopsis (Florideophyceae). Journal of Phycology, 54: 12-24

Turn a list of species by site into a species presence-absence table

Imagine you have a data frame with the species that are present in each locality, such as this:

locality      species
Cuba          Pancake bush
Sumatra       Pancake bush
Greenland     Red dwarf
South America Pancake bush
South America Tree whale
Antarctica    Red dwarf

…but what you need is a data frame with one species per column and their presence (1) or absence (0) in each of the sites (rows), such as this:

    locality Pancake.bush Red.dwarf Tree.whale
     Africa            1         0          0
 Antarctica            1         1          0
       Asia            1         0          0
  Australia            1         1          0
     Baffin            1         1          1
      Banks            1         1          0

You can use R’s table function for this, but it can take a bit of fiddling to get the result in an easily manageable format. The splist2presabs function, now included in the fuzzySim package (Barbosa 2014), does this in one step:

splist2presabs <- function(data, sites.col, sp.col, keep.n = FALSE) {
  # version 1.1 (7 May 2013)
  # data: a matrix or data frame with your localities and species (each in a different column)
  # sites.col: the name or index number of the column containing the localities
  # sp.col: the name or index number of the column containing the species names or codes
  # keep.n: logical, whether to get in the resulting table the number of times each species appears in each locality; if false (the default), only the presence (1) or absence (0) are recorded

  stopifnot(
    length(sites.col) == 1,
    length(sp.col) == 1,
    sites.col != sp.col,
    sites.col %in% 1 : ncol(data) | sites.col %in% names(data),
    sp.col %in% 1 : ncol(data) | sp.col %in% names(data),
    is.logical(keep.n)
  )

  presabs <- table(data[ , c(sites.col, sp.col)])
  presabs <- as.data.frame(unclass(presabs))
  if (!keep.n)  presabs[presabs > 1] <- 1
  presabs <- data.frame(row.names(presabs), presabs)
  names(presabs)[1] <- names(subset(data, select = sites.col))
  rownames(presabs) <- NULL
  return(presabs)
}  # end splist2presabs function

To try an example, load the splist2presabs function (above) and then do the following:

# get a set of localities and some fake species:
loc <- names(islands)
spp <- c("Pancake bush", "Tree whale", "Red dwarf")
# combine them in a data frame:
data <- data.frame(locality = sample(loc, 100, replace = TRUE), species = sample(spp, 100, replace = TRUE))
# take a look at the data:
head(data)
# turn these into a presence-absence data frame and check it out:
atlas <- splist2presabs(data, sites.col = 1, sp.col = 2)
head(atlas)
# if you'd rather have columns with shorter names, load the spCodes function and do:
data$spcode <- spCodes(data$species)
head(data)
atlas <- splist2presabs(data, sites.col = 1, sp.col = 3)
head(atlas)

REFERENCES:

Barbosa A.M. (2014) fuzzySim: Fuzzy similarity in species’ distributions. R package, version 0.1.

Create species name codes

Often we need to work with data sets that have species as column names. Species names are frequently longer than the maximum field-name length allowed, for example, in .dbf tables to associate with maps in a GIS. We thus need to create codes for the species names, which should be short but also unambiguous (you should be able to recognize the species from its code) and unique (there must not be two species with the same code). The spCodes function, now included in the fuzzySim package (Barbosa, 2014), combines a specified number of characters of the genus and species (and optionally also the subspecies) names, and checks that the resulting codes are unique:

spCodes <- function(species, nchar.gen = 3, nchar.sp = 3, nchar.ssp = 0, sep.species = " ", sep.spcode = "") {
  # version 1.8 (7 May 2013)

  species <- as.character(species)
  splits <- strsplit(species, split = sep.species)
  gen <- sp <- vector("character", length(splits))
  for (i in 1:length(splits)) {
    gen[i] <- splits[[i]][1]
    sp[i] <- splits[[i]][2]
    if (is.na(sp[i]))  sp[i] <- ""
  }  # end for i

  abbrev <- function (string, nchar) {
    abv <- substr(string, start = 1, stop = nchar)
    return(abv)
  }  # end abbrev function
  gen.code <- abbrev(gen, nchar.gen)
  sp.code <- abbrev(sp, nchar.sp)
  spcode <- paste(gen.code, sp.code, sep = sep.spcode)

  if (nchar.ssp > 0) {
    ssp <- vector("character", length(splits))
    for (i in 1:length(splits)) {
      ssp[i] <- splits[[i]][3]
      if (is.na(ssp[i]))  ssp[i] <- ""
    }
    ssp.code <- abbrev(ssp, nchar.ssp)
    spcode <- paste(spcode, ssp.code, sep = sep.spcode)
  }

  if (length(unique(species)) != length(unique(spcode))) {
    data <- cbind(species, spcode)
    unique.data <- data[!duplicated(data[ , c("species")]), ]
    ordered.data <- unique.data[order(unique.data[, "spcode"]), ]
    duplicates <- duplicated(ordered.data[ , "spcode"])
    for (i in 1:length(duplicates)) {
      if (isTRUE(duplicates[i])) duplicates[i-1] <- TRUE
    }  # to include the original as well as the duplicate
    ordered.data.duplicates <- ordered.data[duplicates, ]
    print(ordered.data.duplicates)
    stop("Resulting species codes are not unique (see above); try a different combination of nchar.gen, nchar.sp (and nchar.ssp), or check that you have specified the correct sep.species.")
  }
  message("OK - no duplicated spcodes found.")
  return(spcode)
}  # end spCodes function

If you have a vector named “species” with your species names, with the genus and the specific name separated by a white space, load the function above and then type:

spcode <- spCodes(species, nchar.gen = 3, nchar.sp = 3, nchar.ssp = 0, sep.species = " ", sep.spcode = "")

This will create species codes with the first 3 characters from the genus and the first 3 characters from the specific name (although you can set any other numbers of characters, such as 1 + 5). If the resulting spcodes are not unique (i.e., if the number of species names is not the same as the number of resulting species codes), you will get an error message showing the duplicates and asking you to use another combination of numbers of characters for the genus and specific name.
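A small illustration of that behaviour, with made-up species names (not from the original post):

spp <- c("Pancake bush", "Pancake bushier", "Tree whale")
spCodes(spp, nchar.gen = 3, nchar.sp = 3)  # stops with an error: "Panbus" would be duplicated
spCodes(spp, nchar.gen = 3, nchar.sp = 5)  # OK: "Panbush" "Panbushi" "Trewhale"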

If your input species binomials are separated by a character other than a space (e.g. a dot or an underscore), you can specify this with the sep.species argument; but note that all species names must be separated by the same character (e.g. one single white space) for the function to work correctly. By default, there is no character separating the two parts of the spcode, but you can use the sep.spcode argument to define one (e.g. sep.spcode = "_").
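For example (another made-up sketch), for binomials separated by underscores:

spCodes(c("Pancake_bush", "Tree_whale"), sep.species = "_", sep.spcode = "_")
# "Pan_bus" "Tre_wha"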

If you have a table called mydata with species names in columns 2 to 30, for example, you can turn them into species codes like this:

spcodes <- spCodes(names(mydata)[2:30], nchar.gen = 3, nchar.sp = 3, sep.species = " ", sep.spcode = "")

names(mydata)[2:30] <- spcodes


REFERENCES:

Barbosa A.M. (2014) fuzzySim: Fuzzy similarity in species’ distributions. R package, version 1.0.

Determine the percentage of data to reserve for model testing

Based on the work of Schaafsma & van Vark (1979), Huberty (1994) provided a heuristic (‘rule of thumb’) for determining an adequate proportion of data to set aside for testing species presence/absence models, based on the number of predictor variables that are used (Fielding & Bell 1997). The percentTestData function, now included in the fuzzySim package (Barbosa 2015), calculates this proportion as a percentage:

percentTestData <- function(nvar) {
  # v1.1 (28 Mar 2013)
  # heuristic ('rule of thumb') of Huberty (1994; in Fielding & Bell 1997) to determine the test/training data ratio for presence-absence models
  # nvar: number of predictor variables
  huberty <- 1 / (1 + sqrt(nvar - 1))
  return(round(100 * huberty, 1))
}

For example, if you’re building a model based on 15 variables, load the percentTestData function and then just type:

percentTestData(15)
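With 15 variables this returns 21.1, i.e. Huberty's rule of thumb suggests reserving about 21% of the records for testing and keeping the remaining ~79% for model training (1 / (1 + sqrt(15 - 1)) ≈ 0.211).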

References

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Fielding A.H. & Bell J.F. (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation, 24: 38-49

Huberty C.J. (1994) Applied Discriminant Analysis. Wiley, New York, 466 pp.

Schaafsma W. & van Vark G.N. (1979) Classification and discrimination problems with applications. Part IIa. Statistica Neerlandica, 33: 91-126

Multiple conversions

Sometimes we need to change the data type (class, mode) of a variable in R. There are various possible conversions, performed by functions like as.integer, as.factor or as.character. If we need to perform the same conversion on a number of variables (columns) in a data frame, we can convert them all simultaneously using the multConvert function, now included in the fuzzySim package (Barbosa, 2015).

multConvert <- function(data, conversion, cols = 1:ncol(data)) {
  for(i in cols) {
    data[ , i] <- conversion(data[ , i])
  }
  return(data)
}

By default it converts all the columns in the data frame, but you can specify just a few columns; for example:

mydata <- multConvert(data = mydata, conversion = as.integer, cols = c(2:4, 6, 8:11))

multConvert can also be used to apply other kinds of transformations. For example, if you need to divide some of your columns by 100, just write a function to do this and then use multConvert to apply this function to any group of columns:

div100 <- function(x) { x / 100 }
mydata <- multConvert(mydata, conversion = div100, cols = c(2, 4, 5))

REFERENCES:

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858