Removing absences from GBIF datasets

I often come across GBIF users who are unaware that the records available for a given taxon are not necessarily all presences: there’s a column named “occurrenceStatus” whose value can be “PRESENT” or “ABSENT”! The absence records can, of course, be removed with simple operations in R or even omitted from the download, but many users overlook or accidentally skip this crucial step, and then end up analysing species distributions with some absence records being used as presences.
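
For example, if your records are already in an R data frame (here assumed to be called 'occ', with the standard Darwin Core column names), a one-liner like this keeps only the non-absence records:

occ <- occ[!(toupper(occ$occurrenceStatus) %in% "ABSENT"), ]  # drop explicit absences, keeping records with missing status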

To reduce the chances of this happening, I’ve just added new arguments to the fuzzySim function cleanCoords() (which was explained in a previous post) that also allow removing absence records, for when the user wants to keep only the presences. Here’s a usage example with aardvark occurrence records downloaded via the geodata package. Note that this requires installing the latest version of fuzzySim (4.9.9):

occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer",
       fixnames = FALSE)
# NOTE: as per the function help file, if you use GBIF data, remember to check the data use agreement and follow guidance on how to properly cite the actual data sources!

names(occ)

occ_clean <- fuzzySim::cleanCoords(occ,
             coord.cols = c("decimalLongitude", "decimalLatitude"),
             uncert.col = "coordinateUncertaintyInMeters",
             uncert.limit = 10000,
             abs.col = "occurrenceStatus")

# 764 rows in input data
# 576 rows after 'rm.dup'
# 575 rows after 'rm.equal'
# 575 rows after 'rm.imposs'
# 575 rows after 'rm.missing.any'
# 575 rows after 'rm.zero.any'
# 570 rows after 'rm.imprec.any'
# 465 rows after 'rm.uncert' (with uncert.limit=10000 and uncert.na.pass=TRUE)
# 461 rows after 'rm.abs'

As you can see, besides some common biodiversity data issues such as duplicated or erroneous coordinates, additional records were removed because they represented absences. Hopefully this can help prevent their incorrect use as species presences. Feedback welcome!

Getting continent, mainland and island maps in R

Maps of continents, mainlands and islands can be useful, for example, for selecting areas — and then cropping or masking variables — for modelling a species’ distribution. Here’s a way to obtain such maps using the ‘geodata’ and ‘terra’ R packages:

# load required packages:
library(terra)
library(geodata)

# import a world countries map:
countries <- world(resolution = 5, path = "maps")  # you may choose a smaller (more detailed) resolution for the polygon borders, and a different folder path to save the imported map
head(countries)

# import a table with country codes and continents:
cntry_codes <- country_codes()
head(cntry_codes)

# add this table to the countries map attributes:
head(countries)
head(cntry_codes[ , 1:4])
countries <- merge(countries, cntry_codes, by.x = "GID_0", by.y = "ISO3", all.x = TRUE)
head(countries)

# plot the countries map coloured according to "continent":
plot(countries, "continent", lwd = 0.2, main = "Countries by continent")

# dissolve (aggregate) countries into a continents map:
continents <- aggregate(countries, by = "continent")
values(continents)
plot(continents, "continent", lwd = 0.2)
# note that each continent (colour) is a multi-part polygon including mainland and islands - see also:
plot(continents[1, ])

# disaggregate continent polygons, to then separate islands and mainlands:
continents <- disagg(continents)

# get a map of just the continent mainlands (largest polygons):
unique(continents$continent)
largest <- (order(expanse(continents), decreasing = TRUE))[1:length(unique(continents$continent))]
mainlands <- continents[largest, ]
plot(mainlands, "continent", lwd = 0.2, main = "Continent mainlands")

# get a map of just the islands (i.e. polygons except mainlands):
islands <- erase(continents, mainlands)
plot(islands, "continent", lwd = 0.2, main = "World islands")
# you can then crop and mask a raster map to given islands or continents, e.g.:
elevation <- elevation_global(res = 10, path = "maps")  # you may choose a smaller (more detailed) resolution or pixel size
africa_mainland <- subset(mainlands, mainlands$continent == "Africa")
elev_afr_mainland <- crop(elevation, africa_mainland, mask = TRUE)
plot(elev_afr_mainland, main = "Elevation in mainland Africa")

You can also use the geodata::gadm() function to download polygons for particular countries (instead of the whole world), and then apply similar procedures to separate islands from mainlands.
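
Here’s a quick sketch using Portugal as an illustration (the country name and folder path are just examples):

prt <- geodata::gadm(country = "Portugal", level = 0, path = "maps")  # country boundary polygons
prt <- disagg(prt)  # one feature per polygon
prt_mainland <- prt[which.max(expanse(prt)), ]  # largest polygon = mainland
prt_islands <- prt[-which.max(expanse(prt)), ]  # all the remaining polygons
plot(prt_islands, col = "tan", main = "Portugal islands")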

Safe-and-simple cleaning of species occurrences

In my species distribution modelling courses, for a quick and safe removal of the most common biodiversity database errors, I’ve so far used functions from the scrubr R package, namely ‘coord_incomplete’, ‘coord_impossible’, ‘coord_unlikely’, ‘coord_imprecise’ and ‘coord_uncertain’. There are other R packages for cleaning species occurrence data (e.g. CoordinateCleaner, biogeo, or the GUI-based bdclean), but they either require (somewhat laborious) data formatting, or they can eliminate many valid records (e.g. records near region centroids, towns or institutions, or records wrongly flagged as outliers) if their argument values are not mindfully tailored and their results not carefully inspected.

As I need students to have at least most of the worst data errors cleaned before modelling, but can’t spend much of their time and attention on something that’s not the main focus of the course, scrubr was my go-to option… until it was archived, both on CRAN and on GitHub! So, while I always advise students to do a more thorough data inspection for real work after the course, I’ve implemented those no-fuss, safe-and-simple essential cleaning procedures in a new cleanCoords function in package fuzzySim. This function removes (any or all of) duplicate, missing, impossible, unlikely, imprecise or overly uncertain spatial coordinates.

Here’s a usage example, which requires having a recent version of fuzzySim and also the geodata package for downloading some GBIF records to clean:

library(fuzzySim)
library(geodata)

# download some species occurrences from GBIF:
occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer", fixnames = FALSE)

# clean the occurrences:
names(occ)
occ_clean <- fuzzySim::cleanCoords(occ, coord.cols = c("decimalLongitude", "decimalLatitude"), uncert.col = "coordinateUncertaintyInMeters", uncert.limit = 10000)  # 10 km tolerance

# 761 rows in input data
# 574 rows after 'rm.dup'
# 573 rows after 'rm.equal'
# 573 rows after 'rm.imposs'
# 573 rows after 'rm.missing.any'
# 573 rows after 'rm.zero.any'
# 568 rows after 'rm.imprec.any'
# 462 rows after 'rm.uncert' (with uncert.limit=10000 and uncert.na.pass=TRUE)

Run help(cleanCoords) to see the arguments that you can turn on or off. As I always like my students to look at the results of every step of the analysis before moving on, I added a 'plot' argument which is TRUE by default, so you’ll also get a figure with the selected presences in blue and the excluded ones in red.
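
For example, to keep duplicate records and skip the plot, a call like the following should work (the argument names match the step names printed in the output above; see help(cleanCoords) for the full list and defaults):

occ_clean <- fuzzySim::cleanCoords(occ, coord.cols = c("decimalLongitude", "decimalLatitude"), rm.dup = FALSE, plot = FALSE)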

Feedback welcome!

Lollipop chart

According to modern recommendations in data viz, lollipop charts are generally a better alternative to bar charts, as they reduce the visual distortion caused by the length of the bars, making it easier to compare the values. So, in the next versions of the ‘modEvA‘ and ‘fuzzySim‘ packages, functions that produce bar plots will instead (by default) produce lollipop charts, using the new ‘lollipop’ function which will be included in ‘modEvA‘. I know ‘ggplot2‘ produces great lollipop charts already, but I like to keep my package dependencies to a minimum, or else they become much harder to maintain… So here’s the new function:

#' Lollipop chart
#'
#' @param x a numeric vector.
#' @param names a vector of the same length as 'x' with the names to be plotted below the lollipops. If this argument is left NULL and 'x' has names, then these will be used.
#' @param ymin numeric value for the lower limit of the y axis. The default is zero. If set to NA, the minimum of 'x' will be used.
#' @param sticks logical value indicating whether the sticks of the lollipops should be drawn. The default is TRUE.
#' @param col colour for the lollipops.
#' @param grid logical, whether or not to add a grid to the plot. The default is TRUE.
#' @param cex numeric value indicating the size of the lollipops. Will be passed as 'cex' to 'points' and as 'lwd' to 'arrows' (the lines or lollipop sticks).
#' @param cex.axis numeric value indicating the size of the x and y axis labels.
#' @param las argument to pass to 'plot' indicating the orientation of the axis labels.
#' @param ... additional arguments that can be used for the plot, e.g. 'main'.
#'
#' @return This function produces a lollipop chart of the values in 'x'.
#' @export
#'
#' @examples lollipop(mtcars[,1], names = rownames(mtcars), las = 2, ylab = names(mtcars)[1], cex.axis = 0.6, main = "Lollipop chart")

lollipop <- function(x, names = NULL, ymin = 0, sticks = TRUE, col = "royalblue", grid = TRUE, cex = 1, cex.axis = 1, las = 1, ...) {
  # version 1.0 (5 Jan 2023)

  if (is.na(ymin))  ymin <- min(x, na.rm = TRUE)
  plot(c(ymin, x), axes = FALSE, type = "n", xlab = "", ...)
  if (grid) grid()
  if (is.null(names)) names <- names(x)
  axis(1, at = 1:length(x), labels = names, las = las, cex.axis = cex.axis)
  axis(2, ylim = c(ymin, max(x, na.rm = TRUE)), cex.axis = cex.axis)
  points(x, pch = 20, col = col, cex = cex)
  if (sticks)  arrows(x0 = 1:length(x), x1 = 1:length(x), y0 = 0, y1 = x, length = 0, col = col, lwd = cex)
}


And here’s a use case with the ‘mtcars’ dataset of R, plus a bar plot for comparison:

barplot(mtcars[,1], names = rownames(mtcars), las = 2, ylab = names(mtcars)[1], cex.names = 0.6, main = "Bar chart")
lollipop(mtcars[,1], names = rownames(mtcars), las = 2, ylab = names(mtcars)[1], cex.axis = 0.6, main = "Lollipop chart")


‘ymin’ is zero by default, but you can set it to NA to start the axis at the minimum value of ‘x’:


lollipop(mtcars[,1], names = rownames(mtcars), las = 2, ylab = names(mtcars)[1], main = "Lollipop chart not starting at zero", ymin = NA)

Feedback welcome before I update the packages!

Model evaluation with presence points and raster predictions

The Boyce index (Boyce et al. 2002) is often described as a presence-only metric for evaluating the predictions of species distribution (or ecological niche, or habitat suitability) models (e.g. Hirzel et al. 2006, Cianfrani et al. 2010, Bellard et al. 2013, Valavi et al. 2022). It measures the predicted-to-expected ratio of presences in each class of predicted values, i.e., whether classes / regions with higher predicted values have indeed higher proportions of presences than expected by chance. But, if there’s a proportion of presences in each class, then the rest of the class (i.e. the complementary proportion) must be… non-presences, which are equally necessary for the computation of this index (as for virtually any other model evaluation metric). We need both pixels with and pixels without presence records, i.e. both presences and (pseudo)absences, or the Boyce index cannot be computed: try using a raster map with values only in pixels with presence (e.g. my_raster <- mask(my_raster, my_presences)) and see what happens. The same applies to presence-background modelling methods like Maxent and ENFA, which use both the presence and the remaining (i.e. “non-presence”) pixels, and can’t produce proper predictions without non-presence pixels — you can also try this for yourself. So, they need the same data that presence-absence methods need. Whether or not the computation requires an explicit identification of the absences, or just presence points and a map of the entire study area, simply depends on how each method or metric is implemented in each particular software package.
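
To make this concrete, here’s a rough sketch (not the actual implementation in ‘modEvA‘ or ‘ecospat‘) of the predicted-to-expected ratio underlying the Boyce index; note how the “expected” part cannot be obtained without the predicted values of the non-presence (background) pixels:

pe_ratio <- function(pred_pres, pred_all, n_classes = 10) {
  # pred_pres: predicted values at the presence pixels
  # pred_all: predicted values for all pixels in the study area (presence and non-presence)
  breaks <- seq(min(pred_all), max(pred_all), length.out = n_classes + 1)
  p <- table(cut(pred_pres, breaks, include.lowest = TRUE)) / length(pred_pres)  # predicted proportion of presences per class
  e <- table(cut(pred_all, breaks, include.lowest = TRUE)) / length(pred_all)  # expected proportion per class
  as.numeric(p) / as.numeric(e)
}
# the (continuous) Boyce index is then the Spearman correlation between this ratio and the class values (Hirzel et al. 2006)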

The ‘modEvA‘ R package computes the Boyce index and a range of other metrics that evaluate model predictions against presence/absence or presence/background data, including the area under the ROC curve (AUC), the area under the precision-recall curve, sensitivity, specificity, TSS, Cohen’s kappa, Hosmer-Lemeshow goodness-of-fit, Miller calibration statistics, and several others. For all these metrics, the input can be either 1) a model object of a range of implemented classes, or 2) a pair of numeric vectors with observed presence/absence and the corresponding predicted values, or 3) a set of presence point coordinates and a raster map of the predictions across the model evaluation region. Here are some usage examples:

install.packages("modEvA", repos = "http://R-Forge.R-project.org")
library(modEvA)
library(terra)

# import a raster map with a predictor variable:
elev <- rast(system.file("ex/elev.tif", package = "terra"))

# import a species' occurrence data for this area:
gbif <- geodata::sp_occurrence(genus = "Dendrocopos", species = "major", ext = elev)
head(gbif)
unique(gbif$occurrenceStatus)  # remove absences if any!
pres_coords <- gbif[ , c("lon", "lat")]
plot(elev)
points(pres_coords, pch = 20, cex = 0.2)

# get a data frame of the pixels with and without presences:
dat <- fuzzySim::gridRecords(elev, pres_coords)
head(dat)
points(dat[dat$presence == 0, c("x", "y")], col = "red", cex = 0.5)
points(dat[dat$presence == 1, c("x", "y")], col = "blue", cex = 0.5)

# compute a species distribution model (e.g. a GLM) from these data:
mod <- glm(presence ~ elevation, family = binomial, data = dat)
summary(mod)

# get a raster map with the predictions from this model:
pred <- predict(elev, mod, type = "response")
plot(pred, main = "Woodpecker presences and predictions")
points(pres_coords, pch = 20, cex = 0.2)
# compute some model evaluation metrics
# using just these presence coordinates and raster predictions:

par(mfrow = c(2, 2), mar = c(5, 5, 2, 1))

Boyce(obs = pres_coords, pred = pred, main = "Boyce index")

AUC(obs = pres_coords, pred = pred, main = "ROC curve")

threshMeasures(obs = pres_coords, pred = pred, thresh = "maxTSS", measures = c("Sensitivity", "Specificity", "Precision", "TSS", "kappa"), standardize = FALSE, main = "Threshold-based")

AUC(obs = pres_coords, pred = pred, curve = "PR", main = "PR curve")

So, any evaluation metric can be computed using “only presence points”, as long as the raster map of predictions includes both pixels that do and pixels that don’t overlap these points, the latter of which are necessarily taken as available unoccupied pixels by all of these metrics (Boyce index included).

Note you may get slightly different results with modEvA::Boyce() and with ecospat::ecospat.boyce(), because ecospat.boyce removes duplicate classes by default, though both functions allow controlling this with argument ‘rm.dup.classes‘ or ‘rm.duplicate‘, respectively (see help file of either function for details). Bug reports and other feedback welcome on this version of package ‘modEvA‘!

REFERENCES

Bellard, C., Thuiller, W., Leroy, B., Genovesi, P., Bakkenes, M. & Courchamp, F. (2013) Will climate change promote future invasions? Global Change Biology, 19(12): 3740. https://doi.org/10.1111/GCB.12344

Boyce, M.S., Vernier, P.R., Nielsen, S.E. & Schmiegelow, F.K.A. (2002) Evaluating resource selection functions. Ecological Modelling, 157: 281-300.

Cianfrani, C., Le Lay, G., Hirzel, A.H. & Loy, A. (2010) Do habitat suitability models reliably predict the recovery areas of threatened species? Journal of Applied Ecology, 47(2): 421-430. https://doi.org/10.1111/J.1365-2664.2010.01781.X

Hirzel, A.H., Le Lay, G., Helfer, V., Randin, C. & Guisan, A. (2006) Evaluating the ability of habitat suitability models to predict species presences. Ecological Modelling, 199(2): 142-152. http://www.sciencedirect.com/science/article/pii/S0304380006002468

Valavi, R., Guillera-Arroita, G., Lahoz-Monfort, J.J. & Elith, J. (2022) Predictive performance of presence-only species distribution models: a benchmark study with reproducible code. Ecological Monographs, 92(1): e01486. https://doi.org/10.1002/ECM.1486

Degree-minute-second to decimal coordinates

The ‘dms2dec’ function, posted here a while ago to convert longitude-latitude coordinates from degree-minute-second to decimal format, has recently been updated to accommodate more cases that my course participants and I have run into. The function is pasted below and is also available on GitHub, from where it can be sourced directly into R with source("https://raw.githubusercontent.com/AMBarbosa/unpackaged/master/dms2dec", encoding = "UTF-8"). Feedback welcome on whether it is working properly!

dms2dec <- function(dms, separators = c("º", "°", "\'", "’", "’’", "\"", "\'\'", "\\?")) {
  
  # version 1.4 (2 Feb 2022)
  # dms: a vector of latitude or longitude values in degrees-minutes-seconds-hemisphere format, e.g. 41° 34' 10.956" N (with or without spaces)
  # separators: the characters separating degrees, minutes and seconds in 'dms'; mind that these are taken in the order in which they appear, not interpreted individually, i.e. 7'3º will be taken as 7 degrees, 3 minutes! Input data are assumed to be properly formatted
  
  dms <- as.character(dms)
  dms <- gsub(pattern = " ", replacement = "", x = dms)
  for (s in separators) dms <- gsub(pattern = s, replacement = "_splitHere_", x = dms)
  
  splits <- strsplit(dms, split = "_splitHere_")
  n <- length(dms)
  deg <- min <- sec <- hem <- vector("character", n)
  
  for (i in 1:n) {
    deg[i] <- splits[[i]][1]
    min[i] <- splits[[i]][2]
    
    if (length(splits[[i]]) < 4) {
      hem[i] <- splits[[i]][3]
    } else {
      sec[i] <- splits[[i]][3]
      hem[i] <- splits[[i]][4]
    }
  }
  
  dec <- colSums(rbind(as.numeric(deg), (as.numeric(min) / 60), (as.numeric(sec) / 3600)), na.rm = TRUE)
  sign <- ifelse (hem %in% c("N", "E"), 1, -1)
  hem_miss <- which(is.na(hem))
  if (length(hem_miss) > 0) {
    warning("Hemisphere not specified at position(s) ", paste(hem_miss, collapse = ", "), ", so the sign of the resulting coordinates may be wrong.")
  }
  dec <- sign * dec
  return(dec)
}  # end dms2dec function

Usage example:

    mydata$lat_dd <- dms2dec(mydata$lat_dms)

    mydata$lon_dd <- dms2dec(mydata$lon_dms)
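
And a quick self-contained check with a couple of made-up coordinates (the expected output is computed by hand):

    dms2dec(c("41°34'10.956\"N", "8°37'W"))
    # should give approximately 41.56971 and -8.61667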

Package modEvA 3.0 is now on CRAN!

This version of R package “model Evaluation and Analysis” includes some bug fixes (thanks to Huijie Qiao, Ying-Ju Tessa Chen, Oswald van Ginkel and Alba Estrada), some new functions (predPlot, confusionLabel, and mod2obspred, which is now used internally by several others), and it implements more classes (‘gam’, ‘gbm’, ‘randomForest’, ‘bart’) besides ‘glm’ for the ‘model‘ argument in most functions. There are also a few argument additions and improvements — e.g., varPart now has an option to plot the circles in colour (thanks to Oswald van Ginkel). You can read the package NEWS file for details.

You can now install the newest version of modEvA from CRAN and try out some of these new features:

install.packages("modEvA")
library(modEvA)

# load some other packages to make different models:
library(gam)
library(gbm)


# take a sample dataset and create a numeric binary response variable:
data(kyphosis)
head(kyphosis)
kyphosis$Kyphosis <- ifelse(kyphosis$Kyphosis == "present", 1, 0)

# make different models with this response variable:
mod_glm <- glm(Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)
mod_gam <- gam(Kyphosis ~ s(Age) + s(Number) + s(Start), family = binomial, data = kyphosis)
mod_gbm <- gbm(Kyphosis ~ Age + Number + Start, distribution = "bernoulli", data = kyphosis)

# get different evaluation metrics/plots directly from the model objects:
# e.g., density of predictions for presences and absences:
predPlot(model = mod_glm, main = "GLM")
predPlot(model = mod_gam, main = "GAM")
predPlot(model = mod_gbm, main = "GBM")
predDensity(model = mod_glm, main = "GLM")
predDensity(model = mod_gam, main = "GAM")
predDensity(model = mod_gbm, main = "GBM")

# (area under the) ROC and Precision-Recall curves:
AUC(model = mod_glm, main = "GLM")
AUC(model = mod_gam, main = "GAM")
AUC(model = mod_gbm, main = "GBM")
AUC(model = mod_glm, curve = "PR", main = "GLM")
AUC(model = mod_gam, curve = "PR", main = "GAM")
AUC(model = mod_gbm, curve = "PR", main = "GBM")

You can also try other functions such as ‘threshMeasures‘, ‘MillerCalib‘ or ‘HLfit‘ (see the example calls below). And check out the colour version of ‘varPart‘:

varPart(A = 0.456, B = 0.315, AB = 0.852, A.name = "Spatial", B.name = "Climatic", col = TRUE)

varPart(A = 0.456, B = 0.315, C = 0.281, AB = 0.051, BC = 0.444, AC = 0.569, ABC = 0.624,
        A.name = "Spatial", B.name = "Human", C.name = "Climatic", col = TRUE)
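
For the other functions mentioned above (‘threshMeasures‘, ‘MillerCalib‘ and ‘HLfit‘), here are a few illustrative calls on one of the model objects created earlier; the argument values are just examples, so check each function's help file for the available options:

threshMeasures(model = mod_glm, thresh = 0.5)  # threshold-based measures at an example threshold
MillerCalib(model = mod_glm)  # Miller calibration statistics
HLfit(model = mod_glm, bin.method = "round.prob")  # Hosmer-Lemeshow goodness of fit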

Feel free to send me any bug reports! Feature requests/suggestions are also welcome, though I can’t promise a timely response… And remember to look for the latest package updates in the development page on R-Forge!

Mapping the confusion matrix

The ‘confusionLabel‘ function below labels the predictions of a binary response model according to their confusion matrix categories, i.e., it classifies each prediction into a false positive, false negative, true positive or true negative, given a user-defined threshold value:

confusionLabel <- function(model, # placeholder, not yet implemented
                           obs,  # vector of observed 0 and 1 values
                           pred, # vector of predicted probabilities
                           thresh) # threshold to separate predictions
{
  if (length(obs) != length(pred)) stop("'obs' and 'pred' must be of the same length (and in the same order)")
  res <- rep("", length(obs))
  res[pred >= thresh & obs == 1] <- "TruePos"
  res[pred < thresh & obs == 0] <- "TrueNeg"
  res[pred >= thresh & obs == 0] <- "FalsePos"
  res[pred < thresh & obs == 1] <- "FalseNeg"
  res
}

Usage example:

confusionLabel(obs = myspecies_presabs, pred = myspecies_P, thresh = 0.23)

The result is a vector of the same length as ‘obs’ and ‘pred’, containing the label for each value. This then allows you to map the confusion matrix.
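
Here’s a minimal sketch of such a map, assuming the record coordinates are available in a hypothetical two-column object 'myspecies_coords' (in the same row order as 'obs' and 'pred'):

conf_labels <- as.factor(confusionLabel(obs = myspecies_presabs, pred = myspecies_P, thresh = 0.23))
plot(myspecies_coords, col = conf_labels, pch = 20, main = "Confusion map")  # points coloured by confusion category
legend("bottomright", legend = levels(conf_labels), col = seq_along(levels(conf_labels)), pch = 20)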

This function will be included in the next version of the ‘modEvA‘ package, which is soon to be released.

Downloading and cleaning GBIF data with R

Many students and workshop participants ask me for a (semi-)automated way to 1) download species occurrence data from GBIF into R, and 2) clean such data from common errors. The following script does that, while calling the user’s attention to the need for properly citing the data sources (not just GBIF, which is a repository for many sources), and for carefully mapping and inspecting them for additional problems which are not picked up by current tools.

NOTE [April 2023]: This script is a bit outdated given recent developments in R — see e.g. newer posts on GBIF data cleaning and removal of absences.

#######################################
## DOWNLOAD AND CLEAN DATA FROM GBIF ##
#######################################

library(rgbif)
library(scrubr)
library(maps)


# IF YOU HAVE ONLY ONE SPECIES OR TAXONOMIC GROUP ----

myspecies <- "Galemys pyrenaicus"  # instead of a species it can be another taxonomic group, e.g. "Mammalia" or "Talpidae"

# download GBIF occurrence data for this taxon; this takes time if there are many data points!
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000)


# take a look at the downloaded data:
gbif_data
# if "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations


# if your taxon is widespread but you want to work on a particular region, you can download records within a specified window of coordinates:
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000, decimalLongitude = "-10, 10", decimalLatitude = "35, 55")  # note that coordinate ranges must be specified this way: "smaller, larger" (e.g. "-5, -2")

gbif_data

# get the DOIs for citing these data properly:
gbif_citation(gbif_data)
# note: if you need or prefer only one DOI for the entire dataset, download the dataset directly from www.gbif.org and then import the .csv to R. It is very important to properly cite the data sources! GBIF is not a source, just a repository for many people who put in very hard work to collect these data and make them available. See also the new "derived datasets" tool (https://www.gbif.org/citation-guidelines#derivedDatasets) for obtaining a DOI based on the datasetKey column.

# check how the data are organized:
names(gbif_data)
names(gbif_data$meta)
names(gbif_data$data)

# get the columns that matter for mapping and cleaning the occurrence data:
myspecies_coords <- gbif_data$data[ , c("decimalLongitude", "decimalLatitude", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
head(myspecies_coords)

# map the occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = ".")

# you may notice (especially if you zoom in, e.g. by specifying a smaller range of coordinates under 'xlim' and 'ylim' above) that many points are too regularly spaced to be exact locations of species sightings; rather, such points are likely to be centroids of (relatively large) grid cells on which particular surveys were based, so remember to adjust the spatial resolution of your analysis accordingly!

# also, these data are likely to contain species absences and location errors, so jump to "CLEAN THE DATASET" section below - this is VERY IMPORTANT!!!


# IF YOU HAVE MORE THAN ONE TAXONOMIC GROUP ----

myspecies <- c("Galemys pyrenaicus", "Chioglossa lusitanica")

# download GBIF occurrence data for these species; this may take a long time if there are many data points!
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 500)  # decrease the 'limit' if you just want to see how many records there are without waiting all the time that it will take to download the whole dataset

# take a look at the downloaded data:
gbif_data
# if, for any species, "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations

# get the DOI for citing these data properly:
gbif_citation(gbif_data)  # unfortunately it is more complicated to obtain with R a proper citation for a dataset with multiple species. To get a DOI for these data, download the dataset directly from www.gbif.org and then import the .csv to R. It is very important to properly cite the data sources! GBIF is not a source, just a repository for many people who put in very hard work to collect these data and make them available. See also the new "derived datasets" tool (https://www.gbif.org/citation-guidelines#derivedDatasets) for obtaining a DOI based on the datasetKey column.

# if your species are widespread but you want to work on a particular region, you can download records within a specified window of coordinates:
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000, decimalLongitude = "-10, 10", decimalLatitude = "35, 55")  # note that coordinate ranges must be specified this way: "smaller, larger" (e.g. "-5, -2")

gbif_data

# check how the data are organized:
names(gbif_data)
names(gbif_data[[myspecies[1]]])
names(gbif_data[[myspecies[1]]]$meta)
names(gbif_data[[myspecies[1]]]$data)


# create and fill a list with only the 'data' section for each species:
myspecies_coords_list <- vector("list", length(myspecies))
names(myspecies_coords_list) <- myspecies
for (s in myspecies) {
  # get the columns that matter for mapping and cleaning the occurrence data:
  coords <- gbif_data[[s]]$data[ , c("decimalLongitude", "decimalLatitude", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
  if (!is.null(coords)) myspecies_coords_list[[s]] <- data.frame(species = s, coords)
}
lapply(myspecies_coords_list, head)

# collapse the list into a data frame:
myspecies_coords <- as.data.frame(do.call(rbind, myspecies_coords_list), row.names = 1:sum(sapply(myspecies_coords_list, nrow)))
head(myspecies_coords)
tail(myspecies_coords)

# map the occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], col = as.factor(myspecies_coords$species), pch = ".")

# you may notice (especially if you zoom in, e.g. by specifying a smaller range of coordinates under 'xlim' and 'ylim' above) that many points are too regularly spaced to be exact locations of species sightings; rather, such points are likely to be centroids of (relatively large) grid cells on which particular surveys were based, so remember to adjust the spatial resolution of your analysis accordingly!


# CLEAN THE DATASET! ----

# mind that data often contain errors, so careful inspection and cleaning are necessary! 
# here we'll first remove records of absence (if any):
names(myspecies_coords)
sort(unique(myspecies_coords$occurrenceStatus))  # check for different indications of "absent", which could be in different languages! and remember that R is case-sensitive
absence_rows <- which(myspecies_coords$occurrenceStatus %in% c("absent", "Absent", "ABSENT", "ausente", "Ausente", "AUSENTE"))
length(absence_rows)
if (length(absence_rows) > 0) {
  myspecies_coords <- myspecies_coords[-absence_rows, ]
}

# let's do some further data cleaning with functions of the 'scrubr' package (but note this cleaning is not exhaustive!)
nrow(myspecies_coords)
myspecies_coords <- coord_incomplete(coord_imprecise(coord_impossible(coord_unlikely(myspecies_coords))))
nrow(myspecies_coords)

# map the cleaned occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude))  # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], col = as.factor(myspecies_coords$species), pch = ".")
# possible erroneous points e.g. on the Equator (lat and lon = 0) should have disappeared now

# also eliminate presences with reported coordinate uncertainty (location error, spatial resolution) larger than 5 km (5000 m):
myspecies_coords <- coord_uncertain(myspecies_coords, coorduncertainityLimit = 5000)
nrow(myspecies_coords)
# but note that this will only get rid of records where coordinate uncertainty is adequately reported, which may not always be the case! Careful mapping and visual inspection are necessary

# map the cleaned occurrence records with a different colour on top of the raw ones:
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = 20, cex = 0.5, col = "turquoise")

Plot predicted values for presences vs. absences

Diagnostic plots are always a nice, visually explicit way to assess our data and predictions, and I’ve just added a new one to the modEvA R package. The predPlot function plots predicted values separated into observed presences and absences, coloured according to whether they’re above or below a given prediction threshold (the default threshold is species prevalence, i.e. the proportion of presences in the observed data). It shows whether there’s a visible split in the predicted values for presences and absences (see also functions predDensity and plotGLM in the same package, for alternative/complementary ways of visualizing this). The plot imitates (with permission from the author) one of the graphical outputs of the ‘summary‘ of models built with the ‘embarcadero‘ package (Carlson, 2020), but it can be applied to any ‘glm‘ object or any set of observed and predicted values, and it allows specifying a user-defined prediction threshold. The ‘predPlot‘ function is now included in package modEvA (version >= 2.1, currently available on R-Forge).

predPlot <- function(model = NULL, obs = NULL, pred = NULL, thresh = "preval", main = "Classified predicted values", legend.pos = "bottomright") {
  # version 1.0 (20 Jan 2021)
  
  if (!is.null(model)) {
    if (!("glm" %in% class(model)) || family(model)$family != "binomial") stop("'model' must be of class 'glm' and family 'binomial'.")
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  } else {
    if (is.null(obs) || is.null(pred)) stop ("You must provide either 'model' or a combination of 'obs' and 'pred'.")
    if (!is.numeric(obs) || !is.numeric(pred)) stop ("'obs' and 'pred' must be numeric.")
    if (length(obs) != length(pred)) stop("'obs' and 'pred' must have the same length.")
  }
  
  if (!(thresh == "preval" || (is.numeric(thresh) && thresh >= 0 && thresh <= 1))) stop ("'thresh' must be either 'preval' or a numeric value between 0 and 1.")
  if (thresh == "preval")  thresh <- prevalence(obs)

  pred0 <- pred[obs == 0]
  pred1 <- pred[obs == 1]
  
  opar <- par(no.readonly = TRUE)
  on.exit(par(opar))
  par(mar = c(5, 5.2, 3, 1))

  plot(x = c(0, 1), y = c(-0.5, 1.5), xlab = "Predicted value", type = "n", ylab = "", yaxt = "n", main = main)
  axis(side = 2, at = c(0, 1), tick = FALSE, labels = c("Observed\nabsences", "Observed\npresences"), las = 1)
  abline(v = thresh, lty = 2)
  points(x = pred0, y = sapply(rep(0, length(pred0)), jitter, 10), col = ifelse(pred0 < thresh, "grey", "black"))
  points(x = pred1, y = sapply(rep(1, length(pred1)), jitter, 10), col = ifelse(pred1 < thresh, "grey", "black"))
  
  if (!is.na(legend.pos) && legend.pos != "n")  legend(legend.pos, legend = c("Predicted presence", "Predicted absence"), pch = 1, col = c("black", "grey"))
}

Usage examples:

library(modEvA)

# load sample models:
data(rotif.mods)

# choose a particular model to play with:
mod <- rotif.mods$models[[1]]

# make some predPlots:
predPlot(model = mod)
predPlot(model = mod, thresh = 0.5)

# you can also use 'predPlot' with vectors of observed and predicted values instead of a model object:
myobs <- mod$y
mypred <- mod$fitted.values
predPlot(obs = myobs, pred = mypred)

References

Carlson C.J. (2020) embarcadero: Species distribution modelling with Bayesian additive regression trees in R. Methods in Ecology and Evolution, 11: 850-858.