Safe-and-simple cleaning of species occurrences

In my species distribution modelling courses, for a quick and safe removal of the most common biodiversity database errors, I’ve so far used functions from the scrubr R package, namely ‘coord_incomplete’, ‘coord_impossible’, ‘coord_unlikely’, ‘coord_imprecise’ and ‘coord_uncertain’. There are other R packages for species occurrence data cleaning (e.g. CoordinateCleaner, biogeo, or the GUI-based bdclean), but they either require (somewhat laborious) data formatting, and/or they can eliminate many valid records (e.g. near region centroids or towns or institutions, or wrongly considered outliers) if their argument values are not mindfully tailored and their results not carefully inspected. As I need students to have at least most of the worst data errors cleaned before modelling, but can’t spend much of their time and attention on something that’s not the main focus of the course, scrubr was my go-to option… until it was archived, both on CRAN and on GitHub! While I always advise students to do a more thorough data inspection for real work after the course, I implemented those no-fuss, safe-and-simple essential cleaning procedures in a new cleanCoords function in package fuzzySim. This function removes (any or all of) duplicate, missing, impossible, unlikely, imprecise or overly uncertain spatial coordinates.

Here’s a usage example, which requires having a recent version of fuzzySim and also the geodata package for downloading some GBIF records to clean:

library(fuzzySim)
library(geodata)

# download some species occurrences from GBIF:
occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer", fixnames = FALSE)

# clean the occurrences:
names(occ)
occ_clean <- fuzzySim::cleanCoords(occ, coord.cols = c("decimalLongitude", "decimalLatitude"), uncert.col = "coordinateUncertaintyInMeters", uncert.limit = 10000)  # 10 km tolerance

# 761 rows in input data
# 574 rows after 'rm.dup'
# 573 rows after 'rm.equal'
# 573 rows after 'rm.imposs'
# 573 rows after 'rm.missing.any'
# 573 rows after 'rm.zero.any'
# 568 rows after 'rm.imprec.any'
# 462 rows after 'rm.uncert' (with uncert.limit=10000 and uncert.na.pass=TRUE)

Run help(cleanCoords) to see the arguments that you can turn on or off. As I always like my students to look at the results of every step of the analysis before moving on, I added a plot argument which is TRUE by default. So you’ll also get a figure like this, with the selected presences in blue and the excluded ones in red:

Feedback welcome!

2 thoughts on “Safe-and-simple cleaning of species occurrences

  1. Pingback: Downloading and cleaning GBIF data with R | modTools

  2. Pingback: Removing absences from GBIF datasets | modTools

Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s