Removing absences from GBIF datasets

I often come across GBIF users who are unaware that the records available for a given taxon are not necessarily all presences: there’s a column named “occurrenceStatus” whose value can be “PRESENT” or “ABSENT”! The absence records can, of course, be removed with simple operations in R or even omitted from the download, but many users overlook or accidentally skip this crucial step, and then end up analysing species distributions with some absence records being used as presences.
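As a minimal sketch of those "simple operations", here is how the absences could be dropped in base R. The small data frame below is made up for illustration (the coordinates and statuses are not real GBIF records), and it only mimics the relevant GBIF column names; the commented-out lines at the end show the download-time alternative and need an internet connection.

```r
# Toy data frame mimicking a few GBIF columns (values invented for illustration)
occ_toy <- data.frame(
  species = "Orycteropus afer",
  decimalLongitude = c(21.5, 30.2, 18.9, 25.0),
  decimalLatitude = c(-10.3, -2.1, -33.0, -15.7),
  occurrenceStatus = c("PRESENT", "ABSENT", "PRESENT", NA)
)

# Count the records of each status, including missing values
table(occ_toy$occurrenceStatus, useNA = "ifany")

# Keep only the rows explicitly recorded as presences;
# %in% returns FALSE for NA, so records with missing status are dropped too
occ_pres <- occ_toy[occ_toy$occurrenceStatus %in% "PRESENT", ]
nrow(occ_pres)

# Alternative: request only presences at download time
# (requires the geodata package and an internet connection):
# occ_pres_gbif <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer",
#                                         args = c("occurrenceStatus=PRESENT"),
#                                         fixnames = FALSE)
```

Whether records with a missing status should be dropped or kept is a judgment call; this sketch drops them, which is the more conservative choice if only confirmed presences are wanted.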

To reduce the chances of this happening, I just added new arguments to the fuzzySim function cleanCoords() (which was explained in a previous post) to allow removing absence records too, when the user wants to filter only the presences. Here’s a usage example using aardvark occurrence records downloaded with the geodata package. Note that this requires installing the latest version of fuzzySim (4.9.9):

occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer",
       fixnames = FALSE)
# NOTE: as per the function help file, if you use GBIF data, remember to check the data use agreement and follow guidance on how to properly cite the actual data sources!


occ_clean <- fuzzySim::cleanCoords(occ,
             coord.cols = c("decimalLongitude", "decimalLatitude"),
             uncert.col = "coordinateUncertaintyInMeters",
             uncert.limit = 10000,
             abs.col = "occurrenceStatus")

# 764 rows in input data
# 576 rows after 'rm.dup'
# 575 rows after 'rm.equal'
# 575 rows after 'rm.imposs'
# 575 rows after 'rm.missing.any'
# 575 rows after ''
# 570 rows after 'rm.imprec.any'
# 465 rows after 'rm.uncert' (with uncert.limit=10000 and
# 461 rows after 'rm.abs'

As you can see, besides some common biodiversity data issues such as duplicated or erroneous coordinates, additional records were removed because they represented absences. Hopefully this can help prevent their incorrect use as species presences. Feedback welcome!

7 thoughts on “Removing absences from GBIF datasets”

  1. Pingback: Downloading and cleaning GBIF data with R | modTools

  2. Amazing! Even after working with SDMs for such a long time, I did not know about this.

    I usually use spocc to download occurrences, and I checked: the package filters the “ABSENT” records =]


    occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer", fixnames = FALSE)

    occ_lon_lat <- occ[occ$occurrenceStatus == "ABSENT", c("decimalLongitude", "decimalLatitude")]

    occ_spocc_ad <- spocc::occ(query = "Orycteropus afer",
    from = "gbif", limit = 1e4, has_coords = TRUE, gbifopts = list(occurrenceStatus = "ABSENT"))
    occ_spocc_ad_data <- spocc::occ2df(occ_spocc_ad)[, c(2, 3)]

    occ_lon_lat == occ_spocc_ad_data

    • Thanks! With geodata::sp_occurrence() you can also download only the presences, if you add args = c("occurrenceStatus=PRESENT"). But many users aren’t aware of this and don’t read the documentation, and the default is to download all available data. Actually, the absences can be valuable too, though currently (at least for the vast majority of species) there aren’t enough of them for a meaningful analysis.

  3. Pingback: Removing absences from GBIF datasets – Data Science Austria

  4. An amazing addition to the package, and indeed this is normally overlooked by people grabbing data from GBIF. I also like the last sentence, “can help prevent incorrect use”, since sometimes absences can be used correctly.
    There seems to be a typo: I wonder how the 10000 m uncertainty limit set in the code turned out to be 50000 m in the output. I noticed this when trying to reproduce the code. Otherwise, great job!

