I often come across GBIF users who are unaware that the records available for a given taxon are not necessarily all presences: there’s a column named “occurrenceStatus” whose value can be “PRESENT” or “ABSENT”! The absence records can, of course, be removed with simple operations in R or even omitted from the download, but many users overlook or accidentally skip this crucial step, and then end up analysing species distributions with some absence records being used as presences.
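As a minimal sketch of that manual filtering step, here is how it could look in base R. The data frame `occ_toy` and its values are made up for illustration (they stand in for a real GBIF download), and dropping records with a missing status is a choice of this sketch, not a GBIF or fuzzySim rule:

```r
# Toy stand-in for a GBIF download (hypothetical values, for illustration only)
occ_toy <- data.frame(
  decimalLongitude = c(18.4, 31.0, 36.8, 25.7),
  decimalLatitude = c(-33.9, -17.8, -1.3, -28.7),
  occurrenceStatus = c("PRESENT", "ABSENT", "PRESENT", NA)
)

# Tabulate the status values first, including any missing ones
table(occ_toy$occurrenceStatus, useNA = "ifany")

# Keep only rows explicitly recorded as presences;
# %in% returns FALSE for NA, so rows with a missing status are dropped too
occ_pres <- occ_toy[occ_toy$occurrenceStatus %in% "PRESENT", ]
nrow(occ_pres)
```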
To reduce the chances of this happening, I just added new arguments to the fuzzySim function cleanCoords() (which was explained in a previous post) to allow removing absence records too, when the user wants to filter only the presences. Here’s a usage example using aardvark occurrence records downloaded with the geodata package. Note that this requires installing the latest version of fuzzySim (4.9.9):
occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer",
                              fixnames = FALSE)
# NOTE: as per the function help file, if you use GBIF data, remember to check the data use agreement and follow guidance on how to properly cite the actual data sources!
names(occ)
occ_clean <- fuzzySim::cleanCoords(occ,
                                   coord.cols = c("decimalLongitude", "decimalLatitude"),
                                   uncert.col = "coordinateUncertaintyInMeters",
                                   uncert.limit = 10000,
                                   abs.col = "occurrenceStatus")
# 764 rows in input data
# 576 rows after 'rm.dup'
# 575 rows after 'rm.equal'
# 575 rows after 'rm.imposs'
# 575 rows after 'rm.missing.any'
# 575 rows after 'rm.zero.any'
# 570 rows after 'rm.imprec.any'
# 465 rows after 'rm.uncert' (with uncert.limit=10000 and uncert.na.pass=TRUE)
# 461 rows after 'rm.abs'

As you can see, besides some common biodiversity data issues such as duplicated or erroneous coordinates, additional records were removed because they represented absences. Hopefully this can help prevent their incorrect use as species presences. Feedback welcome!
Amazing! Since working with SDM so long time, I did not know about this information.
I usually use the spocc to download occurrences. And I confer, the package filters the “ABSENT” =]
library(geodata)
library(spocc)
occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer", fixnames = FALSE)
occ
occ_lon_lat <- occ[occ$occurrenceStatus == "ABSENT", c("decimalLongitude", "decimalLatitude")]
occ_lon_lat
occ_spocc_ad <- spocc::occ(query = "Orycteropus afer",
from = "gbif", limit = 1e4, has_coords = TRUE, gbifopts = list(occurrenceStatus = "ABSENT"))
occ_spocc_ad_data <- spocc::occ2df(occ_spocc_ad)[, c(2, 3)]
occ_spocc_ad_data
occ_lon_lat == occ_spocc_ad_data
Thanks! With geodata::sp_occurrence() you can also download only the presences, if you add args = c("occurrenceStatus=PRESENT"). But many users aren’t aware of this and don’t read the documentation, and the default is to download all available data. Actually the absences can be valuable too, though currently (at least for the vast majority of species) there aren’t enough of them available for a meaningful analysis.
No problem. Sorry for the English mistakes before, I posted without proofreading… Yes, this is very valuable information! I did not know that. Thank you for the post.
An amazing addition to the package, and indeed this is normally overlooked by people grabbing data from GBIF. I also like the last sentence, “can help prevent incorrect use”, since sometimes absences can be correctly used.
Seems a typo, wondering how the 10000 m uncertainty limit set in the code turned out to be 50000 m in the output. Noticed this when trying to reproduce the code. Otherwise, great job!
Output paste corrected, thanks!