modTools: simple tools for simpler modelling

This website provides a series of free software tools, mostly written in R, that can be used for various tasks related to the analysis of species’ distributions and ecology (although many of them can be applied to other purposes). Most of them are now included in R packages — please cite the appropriate package (mentioned in the blog post featuring each function) if you use these tools, which are released under a Creative Commons Attribution license. Please note that these tools are experimental (like R, they come with absolutely no warranty) and are occasionally edited with corrections or improvements, so it is best to check back for the latest version before using a function.

Leave a comment here if you use these functions in publications — this way I can cite you in the blog, track how people are using the tools, and justify the time and effort dedicated to them. Thanks are due to Arpat Ozgul and François Guilhaumon, who helped me through the steep part of the R learning curve (to me it was actually more like an overhang). Comments and suggestions are very welcome, although a quick response cannot be guaranteed!

Downloading and cleaning GBIF data with R

Many students and workshop participants ask me for a (semi-)automated way to 1) download species occurrence data from GBIF into R, and 2) clean such data of common errors. The following script does that, while calling the user’s attention to the need for properly citing the data sources (not just GBIF, which is a repository for many sources), and for carefully mapping and inspecting the data for additional problems that are not picked up by current tools.

#######################################
## DOWNLOAD AND CLEAN DATA FROM GBIF ##
#######################################

library(rgbif)
library(scrubr)
library(maps)


# IF YOU HAVE ONLY ONE SPECIES OR TAXONOMIC GROUP ----

myspecies <- c("Galemys pyrenaicus")  # instead of a species it can be another taxonomic group, e.g. "Mammalia" or "Talpidae"

# download GBIF occurrence data for this taxon; this takes time if there are many data points!
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000)


# take a look at the downloaded data:
gbif_data
# if "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations
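# you can also check this programmatically (a sketch; 'meta$count' holds the total number of records found):
if (gbif_data$meta$count > nrow(gbif_data$data)) message("not all records were downloaded -- increase 'limit'")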


# if your taxon is widespread but you want to work on a particular region, you can download records within a specified window of coordinates:
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000, decimalLongitude = "-10, 10", decimalLatitude = "35, 55")  # note that coordinate ranges must be specified this way: "smaller, larger" (e.g. "-5, -2")

gbif_data

# get the DOIs for citing these data properly:
gbif_citation(gbif_data)
# note: if you need or prefer a single DOI for the entire dataset, download it directly from www.gbif.org and then import the .csv into R. Either way, it is very important to cite the data sources properly! GBIF is not a source, but a repository for data that many people worked very hard to collect and make available

# check how the data are organized:
names(gbif_data)
names(gbif_data$meta)
names(gbif_data$data)

# get the columns that matter for mapping and cleaning the occurrence data:
myspecies_coords <- gbif_data$data[ , c("decimalLongitude", "decimalLatitude", "individualCount", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
head(myspecies_coords)

# map the occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = ".")

# you may notice (especially if you zoom in, e.g. by specifying a smaller range of coordinates under 'xlim' and 'ylim' above) that many points are too regularly spaced to be exact locations of species sightings; rather, such points are likely to be centroids of (relatively large) grid cells on which particular surveys were based, so remember to adjust the spatial resolution of your analysis accordingly!
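# a rough heuristic check (just a sketch, not a definitive test) of whether coordinates may be rounded to a coarse grid:
mean(myspecies_coords$decimalLongitude == round(myspecies_coords$decimalLongitude, 2), na.rm = TRUE)  # proportion of longitudes with no more than 2 decimal places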

# also, these data are likely to contain species absences and location errors, so jump to "CLEAN THE DATASET" section below - this is VERY IMPORTANT!!!


# IF YOU HAVE MORE THAN ONE TAXONOMIC GROUP ----

myspecies <- c("Galemys pyrenaicus", "Chioglossa lusitanica")

# download GBIF occurrence data for these species; this may take a long time if there are many data points!
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 500)  # decrease the 'limit' if you just want to see how many records there are, without waiting for the whole dataset to download

# take a look at the downloaded data:
gbif_data
# if, for any species, "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations

# get the DOI for citing these data properly:
gbif_citation(gbif_data)  # unfortunately, obtaining a proper citation for a multi-species dataset is more complicated in R; to get a single DOI for these data, download the dataset directly from www.gbif.org and then import the .csv into R. Either way, it is very important to cite the data sources properly! GBIF is not a source, but a repository for data that many people worked very hard to collect and make available

# if your species are widespread but you want to work on a particular region, you can download records within a specified window of coordinates:
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000, decimalLongitude = "-10, 10", decimalLatitude = "35, 55")  # note that coordinate ranges must be specified this way: "smaller, larger" (e.g. "-5, -2")

gbif_data

# check how the data are organized:
names(gbif_data)
names(gbif_data[[myspecies[1]]])
names(gbif_data[[myspecies[1]]]$meta)
names(gbif_data[[myspecies[1]]]$data)


# create and fill a list with only the 'data' section for each species:
myspecies_coords_list <- vector("list", length(myspecies))
names(myspecies_coords_list) <- myspecies
for (s in myspecies) {
  coords <- gbif_data[[s]]$data[ , c("decimalLongitude", "decimalLatitude", "individualCount", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
  myspecies_coords_list[[s]] <- data.frame(species = s, coords)
}
lapply(myspecies_coords_list, head)

# collapse the list into a data frame:
myspecies_coords <- as.data.frame(do.call(rbind, myspecies_coords_list), row.names = 1:sum(sapply(myspecies_coords_list, nrow)))
head(myspecies_coords)
tail(myspecies_coords)

# map the occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], col = myspecies_coords$species, pch = ".")

# you may notice (especially if you zoom in, e.g. by specifying a smaller range of coordinates under 'xlim' and 'ylim' above) that many points are too regularly spaced to be exact locations of species sightings; rather, such points are likely to be centroids of (relatively large) grid cells on which particular surveys were based, so remember to adjust the spatial resolution of your analysis accordingly!


# CLEAN THE DATASET! ----

# mind that data often contain errors, so careful inspection and cleaning are necessary! 
# here we'll first remove records of absence or zero-abundance (if any):
names(myspecies_coords)
sort(unique(myspecies_coords$individualCount))  # notice if some points correspond to zero abundance
sort(unique(myspecies_coords$occurrenceStatus))  # check for different indications of "absent", which could be in different languages! and remember that R is case-sensitive
absence_rows <- which(myspecies_coords$individualCount == 0 | myspecies_coords$occurrenceStatus %in% c("absent", "Absent", "ABSENT", "ausente", "Ausente", "AUSENTE"))
length(absence_rows)
if (length(absence_rows) > 0) {
  myspecies_coords <- myspecies_coords[-absence_rows, ]
}

# let's do some further data cleaning with functions of the 'scrubr' package (but note this cleaning is not exhaustive!)
nrow(myspecies_coords)
myspecies_coords <- coord_incomplete(coord_imprecise(coord_impossible(coord_unlikely(myspecies_coords))))
nrow(myspecies_coords)

# map the cleaned occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], col = myspecies_coords$species, pch = ".")
# possible erroneous points e.g. on the Equator (lat and lon = 0) should have disappeared now
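# a quick check (sketch) that no records remain at exactly lat = 0 and lon = 0:
sum(myspecies_coords$decimalLongitude == 0 & myspecies_coords$decimalLatitude == 0, na.rm = TRUE)  # should now be 0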

# also eliminate presences with reported coordinate uncertainty (location error, spatial resolution) larger than 5 km (5000 m):
myspecies_coords <- coord_uncertain(myspecies_coords, coorduncertainityLimit = 5000)
nrow(myspecies_coords)
# but note that this will only get rid of records where coordinate uncertainty is adequately reported, which may not always be the case! careful mapping and visual inspection are necessary
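# a quick look (sketch) at how many of the remaining records report no coordinate uncertainty at all:
sum(is.na(myspecies_coords$coordinateUncertaintyInMeters))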

# map the cleaned occurrence records with a different colour on top of the raw ones:
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = 20, cex = 0.5, col = "turquoise")

Plot predicted values for presences vs. absences

Diagnostic plots are always a nice, visually explicit way to assess our data and predictions, and I’ve just added a new one to the modEvA R package. The ‘predPlot‘ function plots predicted values separated into observed presences and absences, and coloured according to whether they’re above or below a given prediction threshold (the default threshold is species prevalence, i.e. the proportion of presences in the observed data). It shows whether there’s a visible split in the predicted values for presences and absences (see also functions ‘predDensity‘ and ‘plotGLM‘ in the same package, for alternative/complementary ways of visualizing this). The plot imitates (with permission from the author) one of the graphical outputs of the ‘summary‘ of models built with the ‘embarcadero‘ package (Carlson, 2020), but it can be applied to any ‘glm‘ object or any set of observed and predicted values, and it allows specifying a user-defined prediction threshold. The ‘predPlot‘ function is now included in package modEvA (version >= 2.1, currently available on R-Forge).

predPlot <- function(model = NULL, obs = NULL, pred = NULL, thresh = "preval", main = "Classified predicted values", legend.pos = "bottomright") {
  # version 1.0 (20 Jan 2021)
  
  if (!is.null(model)) {
    if (!("glm" %in% class(model)) || family(model)$family != "binomial") stop("'model' must be of class 'glm' and family 'binomial'.")
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  } else {
    if (is.null(obs) || is.null(pred)) stop ("You must provide either 'model' or a combination of 'obs' and 'pred'.")
    if (!is.numeric(obs) || !is.numeric(pred)) stop ("'obs' and 'pred' must be numeric.")
    if (length(obs) != length(pred)) stop("'obs' and 'pred' must have the same length.")
  }
  
  if (!(thresh == "preval" || (is.numeric(thresh) && thresh >= 0 && thresh <= 1))) stop ("'thresh' must be either 'preval' or a numeric value between 0 and 1.")
  if (thresh == "preval")  thresh <- prevalence(obs)

  pred0 <- pred[obs == 0]
  pred1 <- pred[obs == 1]
  
  opar <- par(no.readonly = TRUE)
  on.exit(par(opar))
  par(mar = c(5, 5.2, 3, 1))

  plot(x = c(0, 1), y = c(-0.5, 1.5), xlab = "Predicted value", type = "n", ylab = "", yaxt = "n", main = main)
  axis(side = 2, at = c(0, 1), tick = FALSE, labels = c("Observed\nabsences", "Observed\npresences"), las = 1)
  abline(v = thresh, lty = 2)
  points(x = pred0, y = sapply(rep(0, length(pred0)), jitter, 10), col = ifelse(pred0 < thresh, "grey", "black"))
  points(x = pred1, y = sapply(rep(1, length(pred1)), jitter, 10), col = ifelse(pred1 < thresh, "grey", "black"))
  
  if (!is.na(legend.pos) && legend.pos != "n")  legend(legend.pos, legend = c("Predicted presence", "Predicted absence"), pch = 1, col = c("black", "grey"))
}

Usage examples:

library(modEvA)

# load sample models:
data(rotif.mods)

# choose a particular model to play with:
mod <- rotif.mods$models[[1]]

# make some predPlots:
predPlot(model = mod)
predPlot(model = mod, thresh = 0.5)

# you can also use 'predPlot' with vectors of observed and predicted values instead of a model object:
myobs <- mod$y
mypred <- mod$fitted.values
predPlot(obs = myobs, pred = mypred)

References

Carlson C.J. (2020) embarcadero: Species distribution modelling with Bayesian additive regression trees in R. Methods in Ecology and Evolution, 11: 850-858.

Functions for time tracking and management

Especially since I had to start working as a freelancer, it became essential to keep an accurate record of how much time I spend on each task and project. I checked out some software tools available online (namely TimeCamp, which I’ve used for a while), but I decided to see if I could write my own R functions for tracking time use, especially when working offline (some of us still like or need to work offline sometimes!), and for producing some related plots and analyses afterwards. Here’s what I came up with.

I first need to have a file named “PROJECTS.csv“, with a column named “project” and a column named “task“, and the names of the projects and tasks I’ll be working on, like this:

type           project     task
consulting     elephants   analysis
consulting     elephants   writing
consulting     elephants   revising
training       R-GIS       preparing
training       R-GIS       teaching
experimenting  timeman     programming
experimenting  timeman     debugging

I can manually add projects and tasks to this file at any time. Then I use the ‘start_task‘ and ‘stop_task‘ functions below to add entries to another file, which is named “DONT-TOUCH_timeman_data.csv“, has five columns (proj, task, start, stop, time) and should not be moved or modified manually:

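If these files don’t exist yet, here’s a minimal sketch for creating them (file and column names as assumed above):

if (!file.exists("PROJECTS.csv"))
  write.csv(data.frame(type = "experimenting", project = "timeman", task = "programming"), "PROJECTS.csv", row.names = FALSE)  # seed row; add your own projects and tasks
if (!file.exists("DONT-TOUCH_timeman_data.csv"))
  write.csv(data.frame(proj = character(0), task = character(0), start = character(0), stop = character(0), time = numeric(0)), "DONT-TOUCH_timeman_data.csv", row.names = FALSE)  # empty table with the five expected columns
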
projects <- read.csv("PROJECTS.csv")
tail(projects)

timeman_data <- read.csv("DONT-TOUCH_timeman_data.csv", as.is = TRUE)
tail(timeman_data)

start_task <- function(proj, task, time = NULL) {
  if (nrow(timeman_data) > 0 && is.na(timeman_data[nrow(timeman_data), ncol(timeman_data)]))
    stop ("previous task must be stopped before starting another")
  if (!(proj %in% unique(projects$project)))
    stop ("proj must exist in 'PROJECTS.csv'")
  if (!(task %in% unique(projects$task)))
    stop ("task must exist in 'PROJECTS.csv'")
  if (is.null(time))
    time <- Sys.time()
  timeman_data[nrow(timeman_data) + 1, 1:3] <- c(proj, task, as.character(time))  # 'as.character' keeps the date-time in a readable, parseable format when combined with the character columns
  timeman_data <<- timeman_data
}

stop_task <- function(time = NULL) {
  if (nrow(timeman_data) > 0 && !is.na(timeman_data[nrow(timeman_data), ncol(timeman_data)]))
    stop ("task must be started before being stopped")
  if (is.null(time))
    time <- Sys.time()
  this_row <- nrow(timeman_data)
  timeman_data[this_row, "stop"] <- as.character(time)
  timeman_data[this_row, "time"] <- difftime(timeman_data[this_row, "stop"], timeman_data[this_row, "start"], units = "mins")  # fixed units ("mins"), so times can be compared and summed consistently
  timeman_data <<- timeman_data
  write.csv(timeman_data, "DONT-TOUCH_timeman_data.csv", row.names = FALSE)
  cat("Showing last rows in tasks table:\n\n")
  tail(timeman_data)
}

current_task <- function() {
  timeman_data[nrow(timeman_data), ]
}

I need to use ‘start_task‘ every time I start working on a task, and then ‘stop_task‘ every time I stop. This adds an entry with the starting, stopping, and total time spent in each instance:

unique(projects$project)
unique(projects$task)
unique(timeman_data$proj)

# ADD START:
start_task(proj = "timeman", task = "debugging")
tail(timeman_data)

# ADD STOP:
stop_task()
tail(timeman_data)
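# a simple summary example (just a sketch): total time (in minutes, as computed by 'stop_task') per project and task:
aggregate(time ~ proj + task, data = timeman_data, FUN = sum)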

I can then apply all sorts of R analyses and plots to the time-tracking data, which is recorded in the ‘timeman_data‘ data frame and CSV file. However, these functions are certainly far from ideal, as they use “unsafe” operators (like ‘<<-‘) and they rely on a file remaining untouched in its folder. It’s plain to see how things could easily go wrong at any moment…. Suggestions welcome on how this could be improved and made safer! While this was certainly an interesting exercise, I think I’ll just stick to TimeCamp for the time being 🙂

Plot outliers and their values

The ‘plot_outliers‘ function below draws a boxplot and a scatterplot of a numeric variable x and plots the values of the outliers (currently not offset, even if they overlap). For relatively small datasets, it can be a quick way to identify which outliers look reasonable and which are likely a result of transcription or measurement error, and thus should be either corrected or discarded.

plot_outliers <- function(x, val_col = "blue", ...) { 
  par_in <- par(no.readonly = TRUE) 
  par(mfrow = c(1, 2)) 
  bp <- boxplot(x, ...) 
  out <- bp$out 
  message(length(out), " outliers detected") 
  if (length(out) > 0) text(x = 0.5, y = bp$out, labels = round(out, 2), adj = 0, col = val_col)
  plot(x, pch = 20)
  if (length(out) > 0) text(x = 0.5, y = bp$out, labels = round(out, 2), adj = 0, col = val_col)
  par(par_in)
}

Usage examples:

plot_outliers(iris$Sepal.Width)

Additional arguments for the ‘boxplot‘ function can be provided, e.g.

plot_outliers(airquality$Ozone, notch = TRUE)
plot_outliers(airquality$Wind, col = "darkgreen", main = "wind")

This function is used in an article which we hope to submit soon.

fuzzySim updated to 3.0 on CRAN!

The newest version of R package fuzzySim (3.0) is now on CRAN! It includes new functions such as ‘sharedFav‘, ‘favClass‘, ‘bioThreat‘ and ‘gridRecords‘; improvements to some functions, help files and examples; updated e-mail and citation information [ see citation(“fuzzySim”) ]; clarifications and typo corrections throughout the reference manual; and some bug fixes (after changes to base R and/or to function dependencies), e.g. to ‘getPreds‘ when applied to raster objects. You should now uninstall the old version of the package and install the new one:

remove.packages("fuzzySim")
install.packages("fuzzySim")

Among other new functionalities, fuzzySim now makes it easier to use variable selection and presence-absence modelling techniques on occurrence points + raster variables, as these are becoming the most common data formats in species distribution modelling (SDM) and ecological niche modelling (ENM). Here’s a worked example:

# download and plot predictor variables:
# (make sure their spatial resolution is not finer than the occurrence data)
library(raster)
worldclim <- getData("worldclim", var = "bio", res = 10)
plot(worldclim)
plot(worldclim[[1]])

# download and plot species occurrence data:
library(rgbif)
gbif <- occ_data(scientificName = "Galemys pyrenaicus", hasCoordinate = TRUE, limit = 5000)
# GBIF data often have errors, data cleaning (e.g. with package 'scrubr') should be performed!
# here we'll just remove records of absence or zero-abundance:
absence_rows <- which(tolower(gbif$data$occurrenceStatus) == "absent" | gbif$data$organismQuantity == 0)  # 'tolower' catches different capitalizations of "absent"
if (length(absence_rows) > 0) gbif$data <- gbif$data[-absence_rows, ]
presences <- gbif$data[ , c("decimalLongitude", "decimalLatitude")]
# plot the occurrence records on the map:
points(presences)

# crop variables to the extent of the occurrence data (mind that this is just one possible way of cropping/masking to the adequate region of analysis):
worldclim_crop <- crop(worldclim, extent(range(presences$decimalLongitude), range(presences$decimalLatitude)))
plot(worldclim_crop[[1]])
points(presences)

# model occurrence data as presence-absence in the cropped grid of pixels:
library(fuzzySim)

# first, get the centroid coordinates and variable values at pixels with and without presence records
# (remember, raster resolution should not be finer than occurrence data!)
gridded_presences <- gridRecords(rst = worldclim_crop, pres.coords = presences)
head(gridded_presences)
plot(worldclim_crop[[1]])
points(gridded_presences[gridded_presences$presence == 0, c("x", "y")], col = "red")
points(gridded_presences[gridded_presences$presence == 1, c("x", "y")], col = "blue", pch = 20)
# then, build a GLM with variable selection on these presence-absence data:
names(gridded_presences)
model_GLM <- multGLM(data = gridded_presences, sp.cols = "presence", var.cols = 5:23, id.col = "cells", FDR = TRUE, corSelect = TRUE, step = TRUE, trim = TRUE)
summary(model_GLM$models$presence)
head(model_GLM$predictions)

# finally, get and plot the model predictions (probability and favourability):
pred_GLM_raster <- getPreds(data = stack(worldclim_crop), models = model_GLM$models)
plot(pred_GLM_raster)

If you’re worried about using a presence-absence modelling algorithm on ‘presence-only’ records, you can compare these predictions with those of a widely used presence-background algorithm (Maxent) on the same data, to check that they are not far off:

library(maxnet)

?maxnet
# argument "p" should be "a vector of 1 (for presence) or 0 (for background)", not 0 for (pseudo)absence as above
# so let's add to the table a replicate of the presence rows and turn them to zeros, to include them in the background for this species:
nrow(gridded_presences)
n_pres <- sum(gridded_presences[ , "presence"])
n_pres
pres0 <- subset(gridded_presences, presence == 1)
# convert the replicated presences to "0" for background:
pres0[ , "presence"] <- 0
# join the original table to the newly created background for this species:
pb <- rbind(gridded_presences, pres0)
# check that everything went as expected:
nrow(pb) == nrow(gridded_presences) + n_pres  # should be TRUE
sum(gridded_presences[ , "presence"]) == sum(pb[ , "presence"])  # should also be TRUE

# build a maxent model with all variables:
model_maxent_allvars <- maxnet(p = pb[ , "presence"], data = pb[ , 5:23], f = maxnet.formula(p = pb[ , "presence"], data = pb[ , 5:23], classes = "lq"))  # linear + quadratic features; note we use 'pb', the table with the presences added to the background

# build a maxent model with only the variables selected by multGLM (following FDR, correlation, AIC, significance):
selected_vars <- model_GLM$variables$presence
model_maxent_selvars <- maxnet(p = pb[ , "presence"], data = pb[ , selected_vars], f = maxnet.formula(p = pb[ , "presence"], data = pb[ , selected_vars], classes = "lq"))

# compute and map maxent predictions:
pred_maxent_allvars_raster <- raster::predict(worldclim_crop, model_maxent_allvars, type = "cloglog", clamp = TRUE)
pred_maxent_selvars_raster <- raster::predict(worldclim_crop, model_maxent_selvars, type = "cloglog", clamp = TRUE)
par(mfrow = c(1, 2))
plot(pred_maxent_allvars_raster, main = "Maxent all vars")
plot(pred_maxent_selvars_raster, main = "Maxent selected vars")
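# a quick check (just a sketch) of how similar the Maxent and GLM predictions are across pixels
# (assuming the GLM probability layer is the first in 'pred_GLM_raster'; adjust the index if needed):
par(mfrow = c(1, 1))
plot(values(pred_GLM_raster[[1]]), values(pred_maxent_selvars_raster), pch = 20, cex = 0.2, xlab = "GLM prediction", ylab = "Maxent prediction")
cor(values(pred_GLM_raster[[1]]), values(pred_maxent_selvars_raster), use = "complete.obs")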

This and other modelling methods will be taught next week at the CIBIO advanced course on ecological niche modelling. Check out the Training page for other upcoming courses!

Grid point occurrence records onto a raster

The ‘gridRecords‘ function, which has just been added to the ‘fuzzySim‘ package (from version 2.6 on), takes a raster stack and a set of spatial coordinates of a species’ presence (and optionally absence) records, and returns a data frame with the presences and absences, as well as the corresponding values of the rasters in the grid of pixels (cells). If absence coordinates are not supplied, all pixels without any presence point will be returned as absences. A precursor of this function was used in Báez et al. (2020) for getting unique presences and absences from point occurrence data at the spatial resolution of marine raster variables. The function can be especially useful for using point + raster data to compute presence-absence models with packages or functions that normally require data frames as input, such as glm, multGLM, gam or randomForest.

gridRecords <- function(rst,
                        pres.coords,
                        abs.coords = NULL,
                        na.rm = TRUE) {
  
  # version 2.0 (3 Feb 2020)
  
  if (!requireNamespace("raster")) stop("This function requires installing the 'raster' package first.")
  
  if (is.null(abs.coords)) {
    abs.coords <- raster::coordinates(rst)
  }

  p_extract <- raster::extract(rst, pres.coords, cellnumbers = TRUE, df = TRUE)[ , -1]
  a_extract <- raster::extract(rst, abs.coords, cellnumbers = TRUE, df = TRUE)[ , -1]
  
  p_extract <- unique(p_extract)
  a_extract <- unique(a_extract)
  
  a_extract <- a_extract[!(a_extract$cells %in% p_extract$cells), ]
  
  p_centroids <- raster::xyFromCell(rst, p_extract$cells)
  a_centroids <- raster::xyFromCell(rst, a_extract$cells)
  
  p_extract <- data.frame(presence = 1, p_centroids, p_extract)
  if (nrow(a_extract) > 0) {
    a_extract <- data.frame(presence = 0, a_centroids, a_extract)
  }
  
  result <- rbind(p_extract, a_extract)
  
  if (na.rm) {
    result_NA <- which(apply(result[ , 5:ncol(result)], MARGIN = 1, FUN = function(x) all(is.na(x))))
    if (length(result_NA) > 0) {
      result <- result[-result_NA, ]
    }
  }
  
  return(result)
}

Usage example:

library(raster)
library(fuzzySim)  # >= 2.6

# import a system raster with 3 layers and crop it to a smaller extent:
rst <- stack(system.file("external/rlogo.grd", package = "raster"))
ext <- extent(c(0, 15, 25, 40))
rst <- crop(rst, ext)
plot(rst)
plot(rst[[1]])

# generate some random presence and absence points:
set.seed(123)
presences <- sp::spsample(as(ext, "SpatialPolygons"), 50, type = "random")
absences <- sp::spsample(as(ext, "SpatialPolygons"), 50, type = "random")
points(presences, pch = 20, cex = 0.2, col = "black")
points(absences, pch = 20, cex = 0.2, col = "white")

# use 'gridRecords' on these random points:
gridded_pts <- gridRecords(rst, coordinates(presences), coordinates(absences))
head(gridded_pts)  # 'red', 'green' and 'blue' are the names of the layers in 'rst'

# plot them to check the result:
pres_coords <- gridded_pts[gridded_pts$presence == 1, c("x", "y")]
abs_coords <- gridded_pts[gridded_pts$presence == 0, c("x", "y")]
points(gridded_pts[ , c("x", "y")], pch = 4, cex = 0.6, col = gridded_pts$presence)
# you can also do it with only presence (no absence) records:
gridded_pres <- gridRecords(rst, coordinates(presences))
head(gridded_pres)
plot(rst[[1]])
points(presences, pch = 20, cex = 0.2, col = "black")
pres_coords <- gridded_pres[gridded_pres$presence == 1, c("x", "y")]
abs_coords <- gridded_pres[gridded_pres$presence == 0, c("x", "y")]
points(gridded_pres[ , c("x", "y")], pch = 4, cex = 0.6, col = gridded_pres$presence)

References

Báez J.C., Barbosa A.M., Pascual P., Ramos M.L. & Abascal F. (2020) Ensemble modelling of the potential distribution of the whale shark in the Atlantic Ocean. Ecology and Evolution, 10: 175-184.

modEvA 2.0 now on CRAN!

The new version of modEvA (2.0) is now on CRAN! It can produce some new plots, such as the precision-recall curve and the histograms/densities of predicted values for presence and absence observations. It calculates some additional evaluation measures, such as the area under the precision-recall curve (function ‘AUC‘), mean precision (also function ‘AUC‘) and the F1 score (function ‘threshMeasures‘). These measures are now also among the outputs of functions ‘multModEv‘, ‘optiThresh‘ and ‘optiPair‘. I’ve also fixed some bugs that were present in the previous version on CRAN, such as an annoying error when using function ‘varPart‘ with only two factors. You should now uninstall the old version of the package and install the new one:

remove.packages("modEvA")
install.packages("modEvA")
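Here’s a minimal sketch of one of the new measures, using the package’s sample data (the measure name “F1score” follows modEvA’s conventions; see modEvAmethods("threshMeasures") for the available names):

library(modEvA)
data(rotif.mods)
mod <- rotif.mods$models[[1]]  # a sample binomial GLM
threshMeasures(model = mod, thresh = "preval", measures = "F1score")  # F1 score at the prevalence threshold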

Area under the precision-recall curve

The AUC function in the modEvA package initially computed only the area under the receiver operating characteristic (ROC) curve. Since modEvA version 1.7 (currently available on R-Forge), it also offers the option to compute the precision-recall curve, which may be better for comparing models based on imbalanced data (e.g. for rare species) — see e.g. Sofaer et al. (2019).

Usage example:

library(modEvA)
data(rotif.mods)
mod <- rotif.mods$models[["Ktropi"]]
par(mfrow = c(1, 2))
AUC(mod, main = "ROC curve")
AUC(mod, curve = "PR", main = "Precision-recall curve")

References

Sofaer H.R., Hoeting J.A. & Jarnevich C.S. (2019) The area under the precision-recall curve as a performance metric for rare binary events. Methods in Ecology and Evolution, 10: 565-577.

Plot the density of predicted values for presences and absences

The ‘predDensity’ function, included in the modEvA package since version 1.5 (currently available on R-Forge), produces a histogram and/or a kernel density plot of predicted values for observed presences and absences in a binomial GLM:

predDensity <- function(model = NULL, obs = NULL, pred = NULL, separate = TRUE, type = "both", legend.pos = "topright") {
  
  if (!is.null(model)) {
    if (!("glm" %in% class(model)) || family(model)$family != "binomial") stop("'model' must be of class 'glm' and family 'binomial'.")
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  }
  
  if (is.null(obs)) {
    if (is.null(pred)) stop("You must provide either 'model' or 'pred'.")
    separate <- FALSE
    obs <- sample(c(0, 1), length(pred), replace = TRUE)
  } else {
    if (length(obs) != length(pred)) stop("'obs' and 'pred' must have the same length.")
  }
  
  pred0 <- pred[obs == 0]
  pred1 <- pred[obs == 1]
  type <- match.arg(type, c("histogram", "density", "both"))
  rslt <- vector("list")
  
  if (type %in% c("density", "both")) {
    if (!separate) {
      dens <- density(pred)
      xrange <- range(dens$x, finite = TRUE)
      yrange <- range(dens$y, finite = TRUE)
      rslt[["density"]] <- dens
    } else {
      dens0 <- density(pred0)
      dens1 <- density(pred1)
      xrange <- range(dens0$x, dens1$x, finite = TRUE)
      yrange <- range(dens0$y, dens1$y, finite = TRUE)
      rslt[["density_obs1"]] <- dens1
      rslt[["density_obs0"]] <- dens0
    }
    plot(x = xrange, y = yrange, xlab = "Predicted value", ylab = "Density", type = "n")
  }
  
  if (type %in% c("histogram", "both")) {
    hist0 <- hist(pred0, plot = FALSE)
    hist1 <- hist(pred1, plot = FALSE)
    if (type == "histogram") {
      yrange <- range(hist0$density, hist1$density, finite = TRUE)
      plot(x = c(0, 1), y = yrange, type = "n", xlab = "Predicted value", ylab = "Density")
    }
    if (!separate) {
      histogram <- hist(c(pred0, pred1), freq = FALSE, col = "grey20", add = TRUE)
      rslt[["histogram"]] <- histogram
    } else {
      hist(pred1, freq = FALSE, col = "grey20", add = TRUE)
      hist(pred0, freq = FALSE, col = "darkgrey", density = 40, angle = 45, add = TRUE)
      rslt[["histogram_obs1"]] <- hist1
      rslt[["histogram_obs0"]] <- hist0
      if (legend.pos != "n" && type == "histogram") legend(legend.pos, legend = c("absences", "presences"), fill = c("darkgrey", "grey20"), border = NA, density = c(40, NA), bty = "n")
    }
  }
  
  if (type %in% c("density", "both")) {
    if (!separate) {
      lines(dens, col = "black", lwd = 2)
    } else {
      lines(dens1, col = "black", lwd = 2)
      lines(dens0, col = "darkgrey", lty = 5, lwd = 2)
      if (legend.pos != "n" && type == "density") legend(legend.pos, legend = c("absences", "presences"), col = c("darkgrey", "black"), lty = c(5, 1), bty = "n")
      if (legend.pos != "n" && type == "both") legend(legend.pos, legend = c("absences", "presences"), fill = c("darkgrey", "grey20"), border = NA, lty = c(5, 1), col = c("darkgrey", "grey15"), density = c(40, NA), bty = "n")
    }
  }
  
  return(rslt)
}

Usage example:

install.packages("modEvA", repos="http://R-Forge.R-project.org")
library(modEvA)
data(rotif.mods)
predDensity(model = rotif.mods$models[[3]])