modTools: simple tools for simpler modelling

Featured

This website provides a series of free software tools, mostly written in R, that can be used for various tasks related to the analysis of species’ distribution and ecology (although many of them can be applied to other purposes). Most of them are now being included in R packages — please cite the appropriate package (mentioned in the blog post featuring each function) if you use these tools, which are released under a Creative Commons Attribution license. Please note that these tools are experimental (like R, they come with absolutely no warranty) and occasionally edited with corrections or improvements, so preferably check back for the latest version before you use a function.

Leave a comment here if you use these functions in publications — this way I can cite you in the blog, track how people are using the tools, and justify the time and effort dedicated to them. Thanks are due to Arpat Ozgul and François Guilhaumon, who helped me through the steep part of the R learning curve (to me it was actually more like an overhang). Comments and suggestions are very welcome, although a quick response cannot be guaranteed!

Model evaluation with presence points and raster predictions

The Boyce index (Boyce et al. 2002) is often described as a presence-only metric for evaluating the predictions of species distribution (or ecological niche, or habitat suitability) models (e.g. Hirzel et al. 2006, Cianfrani et al. 2010, Bellard et al. 2013, Valavi et al. 2022). It measures the predicted-to-expected ratio of presences in each class of predicted values, i.e., whether classes / regions with higher predicted values have indeed higher proportions of presences than expected by chance. But, if there’s a proportion of presences in each class, then the rest of the class (i.e. the complementary proportion) must be… non-presences, which are equally necessary for the computation of this index (as for any other model evaluation metric). We need both pixels with and pixels without presence records, i.e. both presences and (pseudo)absences, or the Boyce index cannot be computed: try using a raster map with values only in pixels with presence and see what happens. The same applies to presence-background modelling methods like Maxent and ENFA, which use both the presence and the remaining (i.e. “non-presence”) pixels, and can’t produce proper predictions without non-presence pixels — you can also try for yourself. So, they need the same type of data that presence-absence methods need. Whether or not the computation requires an explicit identification of the absences, or just presence points and a map of the entire study area, simply depends on how each method or metric is implemented in each particular software package.

The development version of the ‘modEvA‘ R package, currently available on R-Forge, computes the Boyce index and a range of other metrics that evaluate model predictions against presence/absence or presence/background data, including the area under the ROC or under the precision-recall curve (AUC), sensitivity, specificity, TSS, Cohen’s kappa, Hosmer-Lemeshow goodness-of-fit, Miller calibration statistics, and several others. For all these metrics, the input can be either 1) a model object, or 2) a couple of numeric vectors with observed presence/absence and the corresponding predicted values, or 3) a set of presence point coordinates and a raster map of the predictions across the model evaluation region. Here some examples:

install.packages("modEvA", repos = "http://R-Forge.R-project.org")
library(modEvA)
library(terra)

# import a raster map with a predictor variable:
elev <- rast(system.file("ex/elev.tif", package = "terra"))

# import a species' occurrence data for this area:
gbif <- geodata::sp_occurrence(genus = "Dendrocopos", species = "major", ext = elev)
head(gbif)
unique(gbif$occurrenceStatus)  # remove absences if any!
pres_coords <- gbif[ , c("lon", "lat")]
plot(elev)
points(pres_coords, pch = 20, cex = 0.2)

# get a data frame of the pixels with and without presences:
dat <- fuzzySim::gridRecords(elev, pres_coords)
head(dat)
points(dat[dat$presence == 0, c("x", "y")], col = "red", cex = 0.5)
points(dat[dat$presence == 1, c("x", "y")], col = "blue", cex = 0.5)

# compute a species distribution model (e.g. a GLM) from these data:
mod <- glm(presence ~ elevation, family = binomial, data = dat)
summary(mod)

# get a raster map with the predictions from this model:
pred <- predict(elev, mod, type = "response")
plot(pred, main = "Woodpecker presences and predictions")
points(pres_coords, pch = 20, cex = 0.2)
# compute some model evaluation metrics
# using just these presence coordinates and raster predictions:

par(mfrow = c(2, 2), mar = c(5, 5, 2, 1))

Boyce(obs = pres_coords, pred = pred, main = "Boyce index")

AUC(obs = pres_coords, pred = pred, main = "ROC curve")

threshMeasures(obs = pres_coords, pred = pred, thresh = "maxTSS", measures = c("Sensitivity", "Specificity", "Precision", "TSS", "kappa"), standardize = FALSE, main = "Threshold-based")

AUC(obs = pres_coords, pred = pred, curve = "PR", main = "PR curve")

So, any evaluation metric can be computed using “only presence points”, as long as the raster map of predictions includes both pixels that do and pixels that don’t overlap these points, the latter of which are necessarily taken as available unoccupied pixels.

Note you may get slightly different results with ‘modEvA::Boyce‘ and with ‘ecospat::ecospat.boyce‘ because ‘ecospat.boyce‘ removes duplicate classes by default, though both functions allow controlling this with argument ‘rm.dup.classes‘ or ‘rm.duplicate‘, respectively (see help file of either function for details). Bug reports and other feedback welcome on this version of package ‘modEvA‘, before I submit it to CRAN!

REFERENCES

Boyce, M.S., P.R. Vernier, S.E. Nielsen & F.K.A. Schmiegelow (2002) Evaluating resource selection functions. Ecological Modelling 157: 281-300

Bellard, C., Thuiller, W., Leroy, B., Genovesi, P., Bakkenes, M., & Courchamp, F. (2013) Will climate change promote future invasions? Global Change Biology, 19(12), 3740. https://doi.org/10.1111/GCB.12344

Cianfrani, C., Le Lay, G., Hirzel, A. H., & Loy, A. (2010) Do habitat suitability models reliably predict the recovery areas of threatened species? Journal of Applied Ecology, 47(2), 421–430. https://doi.org/10.1111/J.1365-2664.2010.01781.X

Hirzel, A. H., Le Lay, G., Helfer, V., Randin, C., & Guisan, A. (2006) Evaluating the ability of habitat suitability models to predict species presences. Ecological Modelling, 199(2), 142-152. http://www.sciencedirect.com/science/article/pii/S0304380006002468

Valavi, R., Guillera-Arroita, G., Lahoz-Monfort, J. J., & Elith, J. (2022) Predictive performance of presence-only species distribution models: a benchmark study with reproducible code. Ecological Monographs, 92(1), e01486. https://doi.org/10.1002/ECM.1486

Degree-minute-second to decimal coordinates

The ‘dms2dec’ function, posted here a while ago to convert longitude-latitude coordinates from degree-minute-second to decimal format, has recently been updated to accomodate more cases that I and my course participants run into. The function is pasted below and is also available on GitHub, from where it can be sourced directly from R with source("https://raw.githubusercontent.com/AMBarbosa/unpackaged/master/dms2dec", encoding = "UTF-8"). Feedback welcome on whether it is working properly!

dms2dec <- function(dms, separators = c("º", "°", "\'", "’", "’’", "\"", "\'\'", "\\?")) {
  
  # version 1.4 (2 Feb 2022)
  # dms: a vector of latitude or longitude in degrees-minutes-seconds-hemisfere, e.g. 41° 34' 10.956" N (with or without spaces)
  # separators: the characters that are separating degrees, minutes and seconds in 'dms'; mind these are taken in the order in which they appear and not interpreted individually, i.e. 7'3º will be taken as 7 degrees, 3 minutes! input data are assumed to be properly formatted
  
  dms <- as.character(dms)
  dms <- gsub(pattern = " ", replacement = "", x = dms)
  for (s in separators) dms <- gsub(pattern = s, replacement = "_splitHere_", x = dms)
  
  splits <- strsplit(dms, split = "_splitHere_")
  n <- length(dms)
  deg <- min <- sec <- hem <- vector("character", n)
  
  for (i in 1:n) {
    deg[i] <- splits[[i]][1]
    min[i] <- splits[[i]][2]
    
    if (length(splits[[i]]) < 4) {
      hem[i] <- splits[[i]][3]
    } else {
      sec[i] <- splits[[i]][3]
      hem[i] <- splits[[i]][4]
    }
  }
  
  dec <- colSums(rbind(as.numeric(deg), (as.numeric(min) / 60), (as.numeric(sec) / 3600)), na.rm = TRUE)
  sign <- ifelse (hem %in% c("N", "E"), 1, -1)
  hem_miss <- which(is.na(hem))
  if (length(hem_miss) > 0) {
    warning("Hemisphere not specified at position(s) ", hem_miss, ", so the sign of the resulting coordinates may be wrong.")
  }
  dec <- sign * dec
  return(dec)
}  # end dms2dec function

Usage example:

    mydata$lat_dd <- dms2dec(mydata$lat_dms)

    mydata$lon_dd <- dms2dec(mydata$lon_dms)

Package modEvA 3.0 is now on CRAN!

This version of R package “model Evaluation and Analysis” includes some bug fixes (thanks to Huijie Qiao, Ying-Ju Tessa Chen, Oswald van Ginkel and Alba Estrada), some new functions (predPlot, confusionLabel, and mod2obspred, which is now used internally by several others), and it implements more classes (‘gam’, ‘gbm’, ‘randomForest’, ‘bart’) besides ‘glm’ for the ‘model‘ argument in most functions. There are also a few argument additions and improvements — e.g., varPart now has an option to plot the circles in colour (thanks to Oswald van Ginkel). You can read the package NEWS file for details.

You can now install the newest version of modEvA from CRAN and try out some of these new features:

install.packages("modEvA")
library(modEvA)

# load some other packages to make different models:
library(gam)
library(gbm)


# take a sample dataset and create a numeric binary response variable:
data(kyphosis)
head(kyphosis)
kyphosis$Kyphosis <- ifelse(kyphosis$Kyphosis == "present", 1, 0)

# make different models with this response variable:
mod_glm <- glm(Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)
mod_gam <- gam(Kyphosis ~ s(Age) + s(Number) + s(Start), family = binomial, data = kyphosis)
mod_gbm <- gbm(Kyphosis ~ Age + Number + Start, distribution = "bernoulli", data = kyphosis)

# get different evaluation metrics/plots directly from the model objects:
# e.g., density of predictions for presences and absences:
predPlot(model = mod_glm, main = "GLM")
predPlot(model = mod_gam, main = "GAM")
predPlot(model = mod_gbm, main = "GBM")
predDensity(model = mod_glm, main = "GLM")
predDensity(model = mod_gam, main = "GAM")
predDensity(model = mod_gbm, main = "GBM")

# (area under the) ROC and Precision-Recall curves:
AUC(model = mod_glm, main = "GLM")
AUC(model = mod_gam, main = "GAM")
AUC(model = mod_gbm, main = "GBM")
AUC(model = mod_glm, curve = "PR", main = "GLM")
AUC(model = mod_gam, curve = "PR", main = "GAM")
AUC(model = mod_gbm, curve = "PR", main = "GBM")

You can try also other functions such as ‘threshMeasures‘, ‘MillerCalib‘ or ‘HLfit‘. And check out the colour version of ‘varPart‘:

varPart(A = 0.456, B = 0.315, AB = 0.852, A.name = "Spatial", B.name = "Climatic", col = TRUE)
varPart(A = 0.456, B = 0.315, C = 0.281, AB = 0.051, BC = 0.444,
AC = 0.569, ABC = 0.624, A.name = "Spatial", B.name = "Human",
C.name = "Climatic", col = TRUE)

Feel free to send me any bug reports! Feature requests/suggestions are also welcome, though I can’t promise a timely response… And remember to look for the latest package updates in the development page on R-Forge!

Mapping the confusion matrix

The ‘confusionLabel‘ function below labels the predictions of a binary response model according to their confusion matrix categories, i.e., it classifies each prediction into a false positive, false negative, true positive or true negative, given a user-defined threshold value:

confusionLabel <- function(model, # placeholder, not yet implemented
                           obs,  # vector of observed 0 and 1 values
                           pred, # vector of predicted probabilities
                           thresh) # threshold to separate predictions
{
  if (length(obs) != length(pred)) stop("'obs' and 'pred' must be of the same length (and in the same order)")
  res <- rep("", length(obs))
  res[pred >= thresh & obs == 1] <- "TruePos"
  res[pred < thresh & obs == 0] <- "TrueNeg"
  res[pred >= thresh & obs == 0] <- "FalsePos"
  res[pred < thresh & obs == 1] <- "FalseNeg"
  res
}

Usage example:

confusionLabel(obs = myspecies_presabs, pred = myspecies_P, thresh = 0.23)

The result is a vector with the same length as ‘obs’ and ‘pred’, containing the label for each value. This allows you to then map the confusion matrix, like this:

This function will be included in the next version of the ‘modEvA‘ package, which is soon to be released.

Downloading and cleaning GBIF data with R

Many students and workshop participants ask me for a (semi-)automated way to 1) download species occurrence data from GBIF into R, and 2) clean such data from common errors. The following script does that, while calling the user’s attention to the need for properly citing the data sources (not just GBIF, which is a repository for many sources), and for carefully mapping and inspecting them for additional problems which are not picked up by current tools.

#######################################
## DOWNLOAD AND CLEAN DATA FROM GBIF ##
#######################################

library(rgbif)
library(scrubr)
library(maps)


# IF YOU HAVE ONLY ONE SPECIES OR TAXONOMIC GROUP ----

myspecies <- "Galemys pyrenaicus"  # instead of a species it can be another taxonomic group, e.g. "Mammalia" or "Talpidae"

# download GBIF occurrence data for this taxon; this takes time if there are many data points!
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000)


# take a look at the downloaded data:
gbif_data
# if "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations


# if your taxon is widespread but you want to work on a particular region, you can download records within a specified window of coordinates:
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000, decimalLongitude = "-10, 10", decimalLatitude = "35, 55")  # note that coordinate ranges must be specified this way: "smaller, larger" (e.g. "-5, -2")

gbif_data

# get the DOIs for citing these data properly:
gbif_citation(gbif_data)
# note: if you need or prefer only one DOI for the entire dataset, download the dataset directly from www.gbif.org and then import the .csv to R. It is very important to properly cite the data sources! GBIF is not a source, just a repository for many people who put in very hard work to collect these data and make them available. See also the new "derived datasets" tool (https://www.gbif.org/citation-guidelines#derivedDatasets) for obtaining a DOI based on the datasetKey column.

# check how the data are organized:
names(gbif_data)
names(gbif_data$meta)
names(gbif_data$data)

# get the columns that matter for mapping and cleaning the occurrence data:
myspecies_coords <- gbif_data$data[ , c("decimalLongitude", "decimalLatitude", "individualCount", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
head(myspecies_coords)

# map the occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = ".")

# you may notice (especially if you zoom in, e.g. by specifying a smaller range of coordinates under 'xlim' and 'ylim' above) that many points are too regularly spaced to be exact locations of species sightings; rather, such points are likely to be centroids of (relatively large) grid cells on which particular surveys was based, so remember to adjust the spatial resolution of your analysis accordingly!

# also, these data are likely to contain species absences and location errors, so jump to "CLEAN THE DATASET" section below - this is VERY IMPORTANT!!!


# IF YOU HAVE MORE THAN ONE TAXONOMIC GROUP ----

myspecies <- c("Galemys pyrenaicus", "Chioglossa lusitanica")

# download GBIF occurrence data for these species; this may take a long time if there are many data points!
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 500)  # decrease the 'limit' if you just want to see how many records there are without waiting all the time that it will take to download the whole dataset

# take a look at the downloaded data:
gbif_data
# if, for any species, "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations

# get the DOI for citing these data properly:
gbif_citation(gbif_data)  # unfortunately it is more complicated to obtain with R a proper citation for a dataset with multiple species. To get a DOI for these data, download the dataset directly from www.gbif.org and then import the .csv to R. It is very important to properly cite the data sources! GBIF is not a source, just a repository for many people who put in very hard work to collect these data and make them available. See also the new "derived datasets" tool (https://www.gbif.org/citation-guidelines#derivedDatasets) for obtaining a DOI based on the datasetKey column.

# if your species are widespread but you want to work on a particular region, you can download records within a specified window of coordinates:
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000, decimalLongitude = "-10, 10", decimalLatitude = "35, 55")  # note that coordinate ranges must be specified this way: "smaller, larger" (e.g. "-5, -2")

gbif_data

# check how the data are organized:
names(gbif_data)
names(gbif_data[[myspecies[1]]])
names(gbif_data[[myspecies[1]]]$meta)
names(gbif_data[[myspecies[1]]]$data)


# create and fill a list with only the 'data' section for each species:
myspecies_coords_list <- vector("list", length(myspecies))
names(myspecies_coords_list) <- myspecies
for (s in myspecies) {
  coords <- gbif_data[[s]]$data[ , c("decimalLongitude", "decimalLatitude", "individualCount", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
  myspecies_coords_list[[s]] <- data.frame(species = s, coords)
}
lapply(myspecies_coords_list, head)

# collapse the list into a data frame:
myspecies_coords <- as.data.frame(do.call(rbind, myspecies_coords_list), row.names = 1:sum(sapply(myspecies_coords_list, nrow)))
head(myspecies_coords)
tail(myspecies_coords)

# map the occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], col = myspecies_coords$species, pch = ".")

# you may notice (especially if you zoom in, e.g. by specifying a smaller range of coordinates under 'xlim' and 'ylim' above) that many points are too regularly spaced to be exact locations of species sightings; rather, such points are likely to be centroids of (relatively large) grid cells on which particular surveys were based, so remember to adjust the spatial resolution of your analysis accordingly!


# CLEAN THE DATASET! ----

# mind that data often contain errors, so careful inspection and cleaning are necessary! 
# here we'll first remove records of absence or zero-abundance (if any):
names(myspecies_coords)
sort(unique(myspecies_coords$individualCount))  # notice if some points correspond to zero abundance
sort(unique(myspecies_coords$occurrenceStatus))  # check for different indications of "absent", which could be in different languages! and remember that R is case-sensitive
absence_rows <- which(myspecies_coords$individualCount == 0 | myspecies_coords$occurrenceStatus %in% c("absent", "Absent", "ABSENT", "ausente", "Ausente", "AUSENTE"))
length(absence_rows)
if (length(absence_rows) > 0) {
  myspecies_coords <- myspecies_coords[-absence_rows, ]
}

# let's do some further data cleaning with functions of the 'scrubr' package (but note this cleaning is not exhaustive!)
nrow(myspecies_coords)
myspecies_coords <- coord_incomplete(coord_imprecise(coord_impossible(coord_unlikely(myspecies_coords))))
nrow(myspecies_coords)

# map the cleaned occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], col = myspecies_coords$species, pch = ".")
# possible erroneous points e.g. on the Equator (lat and lon = 0) should have disappeared now

# also eliminate presences with reported coordinate uncertainty (location error, spatial resolution) larger than 5 km (5000 m):
myspecies_coords <- coord_uncertain(myspecies_coords, coorduncertainityLimit = 5000)
nrow(myspecies_coords)
# but note that this will only get rid of records where coordinate uncertainty is adequately reported, which may not always be the case! Careful mapping and visual inspection is necessary

# map the cleaned occurrence records with a different colour on top of the raw ones:
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = 20, cex = 0.5, col = "turquoise")

Plot predicted values for presences vs. absences

Diagnostic plots are always a nice, visually explicit way to assess our data and predictions, and I’ve just added a new one to the modEvA R package. ThepredPlotfunction plots predicted values separated into observed presences and absences, and coloured according to whether they’re above or below a given prediction threshold (the default threshold is species prevalence, i.e. proportion of presences, in the observed data). It shows whether there’s a visible split in the predicted values for presences and absences (see also functions predDensity and plotGLM in the same package, for alternative/complementary ways of visualizing this). The plot imitates (with permission from the author) one of the graphical outputs of the ‘summary‘ of models built with the ‘embarcadero‘ package (Carlson, 2020), but it can be applied to any ‘glm‘ object or any set of observed and predicted values, and it allows specifying a user-defined prediction threshold. The ‘predPlot‘ function is now included in package modEvA (version >= 2.1, currently available on R-Forge).

predPlot <- function(model = NULL, obs = NULL, pred = NULL, thresh = "preval", main = "Classified predicted values", legend.pos = "bottomright") {
  # version 1.0 (20 Jan 2021)
  
  if (!is.null(model)) {
    if (!("glm" %in% class(model)) || family(model)$family != "binomial") stop("'model' must be of class 'glm' and family 'binomial'.")
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  } else {
    if (is.null(obs) || is.null(pred)) stop ("You must provide either 'model' or a combination of 'obs' and 'pred'.")
    if (!is.numeric(obs) || !is.numeric(pred)) stop ("'obs' and 'pred' must be numeric.")
    if (length(obs) != length(pred)) stop("'obs' and 'pred' must have the same length.")
  }
  
  if (!(thresh == "preval" || (is.numeric(thresh) && thresh >= 0 && thresh <= 1))) stop ("'thresh' must be either 'preval' or a numeric value between 0 and 1.")
  if (thresh == "preval")  thresh <- prevalence(obs)

  pred0 <- pred[obs == 0]
  pred1 <- pred[obs == 1]
  
  opar <- par(no.readonly = TRUE)
  on.exit(par(opar))
  par(mar = c(5, 5.2, 3, 1))

  plot(x = c(0, 1), y = c(-0.5, 1.5), xlab = "Predicted value", type = "n", ylab = "", yaxt = "n", main = main)
  axis(side = 2, at = c(0, 1), tick = FALSE, labels = c("Observed\nabsences", "Observed\npresences"), las = 1)
  abline(v = thresh, lty = 2)
  points(x = pred0, y = sapply(rep(0, length(pred0)), jitter, 10), col = ifelse(pred0 < thresh, "grey", "black"))
  points(x = pred1, y = sapply(rep(1, length(pred1)), jitter, 10), col = ifelse(pred1 < thresh, "grey", "black"))
  
  if (!is.na(legend.pos) && legend.pos != "n")  legend(legend.pos, legend = c("Predicted presence", "Predicted absence"), pch = 1, col = c("black", "grey"))
}

Usage examples:

library(modEvA)

# load sample models:
data(rotif.mods)

# choose a particular model to play with:
mod <- rotif.mods$models[[1]]

# make some predPlots:
predPlot(model = mod)
predPlot(model = mod, thresh = 0.5)

# you can also use 'predPlot' with vectors of observed and predicted values instead of a model object:
myobs <- mod$y
mypred <- mod$fitted.values
predPlot(obs = myobs, pred = mypred)

References

Carlson C.J. (2020) embarcadero: Species distribution modelling with Bayesian additive regression trees in R. Methods in Ecology and Evolution, 11: 850-858.

Functions for time tracking and management

Especially since I had to start working as a freelancer, it became essential to keep an accurate record of how much time I spend on each task and project. I checked out some software tools available online (namely TimeCamp, which I’ve used for a while), but I decided to see if I could write my own R functions for tracking time use, especially when working offline (some of us still like or need to work offline sometimes!), and for producing some related plots and analyses afterwards. Here’s what I came up with.

I first need to have a file named “PROJECTS.csv“, with a column named “project” and a column named “task“, and the names of the projects and tasks I’ll be working on, like this:

typeprojecttask
consultingelephantsanalysis
consultingelephantswriting
consultingelephantsrevising
trainingR-GISpreparing
trainingR-GISteaching
experimentingtimemanprogramming
experimentingtimemandebugging

I can manually add projects and tasks to this file at any time. Then I use the ‘start_task‘ and ‘stop_task‘ functions below to add entries to another file, which is named “DONT-TOUCH_timeman_data.csv“, has five columns (proj, task, start, stop, time) and should not be moved or modified manually:

projects <- read.csv("PROJECTS.csv")
tail(projects)

timeman_data <- read.csv("DONT-TOUCH_timeman_data.csv", as.is = TRUE)
tail(timeman_data)

start_task <- function(proj, task, time = NULL) {
  if (nrow(timeman_data) > 0 && is.na(timeman_data[nrow(timeman_data), ncol(timeman_data)]))
    stop ("previous task must be stopped before starting another")
  if (!(proj %in% unique(projects$project)))
    stop ("proj must exist in 'PROJECTS.csv'")
  if (!(task %in% unique(projects$task)))
    stop ("task must exist in 'PROJECTS.csv'")
  #if (!(proj %in% unique(timeman_data$projecto)))
    #timeman_data[nrow(timeman_data) + 1, "projecto"] <- proj
  if (is.null(time))
    time <- Sys.time()
  timeman_data[nrow(timeman_data) + 1, 1:3] <- c(proj, task, time)
  timeman_data <<- timeman_data
}

stop_task <- function(time = NULL) {
  if (nrow(timeman_data) > 0 && !is.na(timeman_data[nrow(timeman_data), ncol(timeman_data)]))
    stop ("task must be started before being stopped")
  if (is.null(time))
    time <- Sys.time()
  #this_row <- which.max(timeman_data$projecto == proj)
  this_row <- nrow(timeman_data)
  timeman_data[this_row, "stop"] <- as.character(time)
  timeman_data[this_row, "time"] <- difftime(timeman_data[this_row, "stop"], timeman_data[this_row, "start"])
  timeman_data <<- timeman_data
  write.csv(timeman_data, "DONT-TOUCH_timeman_data.csv", row.names = FALSE)
  cat("Showing last rows in tasks table:\n\n")
  tail(timeman_data)
}

current_task <- function() {
  timeman_data[nrow(timeman_data), ]
}

I need to use ‘start_task‘ every time I start working on a task, and then ‘stop_task‘ every time I stop. This adds an entry with the starting, stopping, and total time spent in each instance:

unique(projects$proj)
unique(projects$task)
unique(timeman_data$proj)

# ADD START:
start_task(proj = "timeman", task = "debugging")
tail(timeman_data)

# ADD STOP:
stop_task()
tail(timeman_data)

I can then apply all sorts of R analyses and plots to the time-tracking data, which is recorded in the ‘timeman_data‘ data frame and CSV file. However, these functions are certainly far from ideal, as they use “unsafe” operators (like ‘<<-‘) and they rely on a file remaining untouched in its folder. It’s plain to see how things could easily go wrong at any moment…. Suggestions welcome on how this could be improved and made safer! While this was certainly an interesting exercise, I think I’ll just stick to TimeCamp for the time being 🙂

Plot outliers and their values

The ‘plot_outliers‘ function below draws a boxplot and a scatterplot of a numeric variable x and plots the values of the outliers (currently not offset, even if they overlap). For relatively small datasets, it can be a quick way to identify which outliers look reasonable and which are likely a result of transcription or measurement error, and thus should be either corrected or discarded.

plot_outliers <- function(x, val_col = "blue", ...) { 
  par_in <- par(no.readonly = TRUE) 
  par(mfrow = c(1, 2)) 
  bp <- boxplot(x, ...) 
  out <- bp$out 
  message(length(out), " outliers detected") 
  if (length(out) > 0) text(x = 0.5, y = bp$out, labels = round(out, 2), adj = 0, col = val_col)
  plot(x, pch = 20)
  if (length(out) > 0) text(x = 0.5, y = bp$out, labels = round(out, 2), adj = 0, col = val_col)
  par(par_in)
}

Usage examples:

plot_outliers(iris$Sepal.Width)

Additional arguments for the ‘boxplot‘ function can be provided, e.g.

plot_outliers(airquality$Ozone, notch = TRUE)
plot_outliers(airquality$Wind, col = "darkgreen", main = "wind")

This function is used in an article which we hope to submit soon.

fuzzySim updated to 3.0 on CRAN!

The newest version of R package fuzzySim (3.0) is now on CRAN! It includes new functions such as ‘sharedFav‘, favClass‘, ‘bioThreat‘ and ‘gridRecords; improvements to some functions, help files and examples; updated e-mail and citation information [ see citation(“fuzzySim”) ]; clarifications and typo corrections along the reference manual; and some bug fixes (after changes to base R and/or to function dependencies), e.g. to ‘getPreds‘ when applied to raster objects.

Among other new functionalities, fuzzySim now makes it easier to use variable selection and presence-absence modelling techniques on occurrence points + raster variables, as these are becoming the more common data formats in species distribution modelling (SDM) and ecological niche modelling (ENM). Here’s a worked example:

# download and plot predictor variables:
# (make sure their spatial resolution is not finer than the occurrence data)
library(raster)
worldclim <- getData("worldclim", var = "bio", res = 10)
plot(worldclim)
plot(worldclim[[1]])

# download and plot species occurrence data:
library(rgbif)
gbif <- occ_data(scientificName = "Galemys pyrenaicus", hasCoordinate = TRUE, limit = 5000)
# GBIF data often have errors, data cleaning (e.g. with package 'scrubr') should be performed!
# here we'll just remove records of absence or zero-abundance:
absence_rows <- which(gbif$data$occurrenceStatus == "absent" | gbif$data$organismQuantity == 0)
if (length(absence_rows) > 0) gbif$data <- gbif$data[-absence_rows, ]
presences <- gbif$data[ , c("decimalLongitude", "decimalLatitude")]
# plot the occurrence records on the map:
points(presences)

# crop variables to the extent of the occurrence data (mind that this is just a quick way, certainly not the best one, of delimiting the adequate region for analysis):
worldclim_crop <- crop(worldclim, extent(range(presences$decimalLongitude), range(presences$decimalLatitude)))
plot(worldclim_crop[[1]])
points(presences)

# model occurrence data as presence-absence in the cropped grid of pixels:
library(fuzzySim)
library(terra)

# first, get the centroid coordinates and variable values at pixels with and without presence records
# (remember, raster resolution should not be finer than occurrence data!)
gridded_presences <- gridRecords(rst = rast(worldclim_crop), pres.coords = presences)
head(gridded_presences)
plot(worldclim_crop[[1]])
points(gridded_presences[gridded_presences$presence == 0, c("x", "y")], col = "red")
points(gridded_presences[gridded_presences$presence == 1, c("x", "y")], col = "blue", pch = 20)
# then, build a GLM with variable selection on these presence-absence data:
names(gridded_presences)
model_GLM <- multGLM(data = gridded_presences, sp.cols = "presence", var.cols = 5:23, id.col = "cell", FDR = TRUE, corSelect = TRUE, step = TRUE, trim = TRUE)
summary(model_GLM$models$presence)
head(model_GLM$predictions)

# finally, get and plot the model predictions (probability and favourability):
pred_GLM_raster <- getPreds(data = stack(worldclim_crop), models = model_GLM$models)
plot(pred_GLM_raster)

If you’re worried about using a presence-absence modelling algorithm on ‘presence-only’ records, you can compare these predictions with those of a widely used presence-background algorithm (Maxent) on the same data, to check that they are not far off:

library(maxnet)
?maxnet

# build a maxent model with all variables:
model_maxent_allvars <- maxnet(p = gridded_presences[ , "presence"], data = gridded_presences[ , 5:23], addsamplestobackground = TRUE, f = maxnet.formula(p = gridded_presences[ , "presence"], data = gridded_presences[ , 5:23], classes = "lq"))  # "lq" = linear + quadratic features

# build a maxent model with only the variables selected by multGLM (folowing FDR, correlation, AIC, significance):
selected_vars <- model_GLM$variables$presence
model_maxent_selvars <- maxnet(p = gridded_presences[ , "presence"], data = gridded_presences[ , selected_vars], addsamplestobackground = TRUE, f = maxnet.formula(p = gridded_presences[ , "presence"], data = gridded_presences[ , selected_vars], classes = "lq"))

# compute and map maxent predictions:
pred_maxent_allvars_raster <- raster::predict(worldclim_crop, model_maxent_allvars, type = "cloglog", clamp = TRUE)
pred_maxent_selvars_raster <- raster::predict(worldclim_crop, model_maxent_selvars, type = "cloglog", clamp = TRUE)
par(mfrow = c(1, 2))
plot(pred_maxent_allvars_raster, main = "Maxent all vars")
plot(pred_maxent_selvars_raster, main = "Maxent selected vars")

Compare both Maxent prediction maps with the “presence_F” map above (favourability, which removes the effect of species prevalence from presence probability, so it’s the same as presence probability if prevalence were 50% — the default prevalence assumed by Maxent). This and other modelling methods will be taught next week at the CIBIO advanced course on ecological niche modelling. Check out the Courses page for other upcoming courses!