Calculate the area under the ROC curve of a model

Like the model evaluation measures calculated by the threshMeasures function, the area under the ROC curve (AUC) assesses the discrimination performance of a model; but unlike them, it does not require choosing a particular threshold above which the model is taken to predict species presence. Instead, it averages discrimination performance over all possible thresholds at a given interval. Note that the AUC has received criticism (Lobo et al. 2008, Jiménez-Valverde et al. 2013), although it remains one of the most widely used metrics in model evaluation. It is highly correlated with species prevalence, so this value is also provided by the AUC function (if simplif = FALSE, the default) for reference.
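
For intuition, here is a minimal sketch (with simulated data, not taken from any example in this post) showing that the rank-based formula used inside the AUC function further down is equivalent to the proportion of presence-absence pairs in which the presence gets the higher predicted value, with ties counting half:

set.seed(1)
obs <- rbinom(200, 1, 0.3)  # simulated presences (1) and absences (0)
pred <- runif(200) ^ (2 - obs)  # predictions loosely related to the observations
n1 <- sum(obs == 1); n0 <- sum(obs == 0)

# rank-based (Mann-Whitney) formula, as used inside the AUC function below:
rnk <- rank(c(pred[obs == 0], pred[obs == 1]))
auc.rank <- ((n0 * n1) + (n0 * (n0 + 1)) / 2 - sum(rnk[1:n0])) / (n0 * n1)

# brute-force pairwise comparison of presences against absences:
auc.pairs <- mean(outer(pred[obs == 1], pred[obs == 0], ">") +
                  0.5 * outer(pred[obs == 1], pred[obs == 0], "=="))

auc.rank; auc.pairs  # the two values coincide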

Although there are functions to calculate the AUC in other R packages (such as ROCR, PresenceAbsence and verification), the AUC function below is more directly compatible with the remaining functions in this blog and in the modEvA package where it is included (Barbosa et al. 2014). It can be applied not only to a set of observed presence-absence data and the corresponding predicted values, but also directly to a model object of class glm.

AUC <- function(obs = NULL, pred = NULL, model = NULL, interval = 0.01, 
       simplif = FALSE, plot = TRUE, plot.values = TRUE, plot.preds = FALSE, 
       grid = FALSE, xlab = c("False positive rate", "(1-specificity)"), 
       ylab = c("True positive rate", "(sensitivity)"), main = "ROC curve", ...) {
  # version 1.4 (16 Oct 2014)
 
  if (!is.null(model)) {
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  }  # end if model
 
  stopifnot(
    length(obs) == length(pred),
    obs %in% c(0,1),
    pred >= 0,
    pred <= 1,
    interval > 0,
    interval < 1
  )
 
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0) 
 
  # next 3 lines from Wintle et al 2005 supp mat "roc" function
  xy <- c(pred[obs == 0], pred[obs == 1])
  rnk <- rank(xy)
  AUC <- ((n0 * n1) + ((n0 * (n0 + 1))/2) - sum(rnk[1 : n0])) / (n0 * n1)
 
  if (simplif)  return(AUC)
 
  N <- length(obs)
  preval <- prevalence(obs)
  thresholds <- seq(0, 1, by = interval)
  Nthresh <- length(thresholds)
  true.positives <- true.negatives <- sensitivity <- specificity <-
    false.pos.rate <- n.preds <- prop.preds <- numeric(Nthresh)
 
  for (t in 1 : Nthresh) {
    true.positives[t] <- sum(obs == 1 & pred >= thresholds[t])
    true.negatives[t] <- sum(obs == 0 & pred < thresholds[t])
    sensitivity[t] <- true.positives[t] / n1
    specificity[t] <- true.negatives[t] / n0
    false.pos.rate[t] <- 1 - specificity[t]
    n.preds[t] <- sum(round(pred, nchar(Nthresh) - 1) == thresholds[t])  # inspired by Peterson Papes & Soberon 2008 rethinking ROC analysis
    prop.preds[t] <- n.preds[t] / length(pred)
  }; rm(t)
 
  if (plot) {
    plot(x = c(0, 1), y = c(0, 1), type = "l", xlab = xlab, ylab = ylab, main = main, col = "grey", ...)  # plots the 0.5 diagonal
 
    if (grid) abline(h = thresholds, v = thresholds, col = "lightgrey")
 
    lines(x = false.pos.rate, y = sensitivity, lwd = 2)  # ROC curve
 
    if (plot.values) {
      text(1.0, 0.1, adj = 1, substitute(paste(AUC == a), list(a = round(AUC, 3))))
    }  # end if plot.values
 
    if (plot.preds) {
      points(x = thresholds, y = rep(0, Nthresh), cex = sqrt(100 * prop.preds), col = "red")
    }  # end if plot.preds
  }  # end if plot
 
  return(list(thresholds = data.frame(thresholds, true.positives, true.negatives,
                                      sensitivity, specificity, false.pos.rate,
                                      n.preds, prop.preds),
              N = N, prevalence = preval, AUC = AUC, AUCratio = AUC / 0.5))
}


Usage examples:

AUC(obs = mydata$myspecies, pred = mydata$myspecies_P, simplif = TRUE)

AUC(obs = mydata$myspecies, pred = mydata$myspecies_P)

AUC(model = mymodel, main = "my species' ROC plot")

AUC(model = mymodel, plot.values = FALSE)
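
If you do not have data at hand, here is a self-contained sketch with simulated data; 'sampledata', 'fakespecies' and 'samplemodel' are made-up names for illustration, not objects from this blog. Note that the full output (simplif = FALSE) also needs the prevalence function from this blog / the modEvA package:

set.seed(42)
sampledata <- data.frame(var1 = rnorm(300), var2 = rnorm(300))
sampledata$fakespecies <- rbinom(300, 1, plogis(1.5 * sampledata$var1 - 0.5))
samplemodel <- glm(fakespecies ~ var1 + var2, family = binomial, data = sampledata)

AUC(model = samplemodel)  # full output, taken directly from the model object

AUC(obs = sampledata$fakespecies, pred = samplemodel$fitted.values, simplif = TRUE)  # AUC value only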

If you want the AUC values for a list of models produced e.g. by the multGLM function, you can do:

mymodels.auc <- vector("numeric", length(mymodels$models))
names(mymodels.auc) <- names(mymodels$models)

for (i in 1:length(mymodels$models)) {
  mymodels.auc[i] <- AUC(model = mymodels$models[[i]], simplif = TRUE)
}

mymodels.auc
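
The same can be done in one line with sapply (just an alternative suggestion, again assuming 'mymodels' is the output of multGLM), which also keeps the model names:

mymodels.auc <- sapply(mymodels$models, function(m) AUC(model = m, simplif = TRUE))
mymodels.auc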

References:

Barbosa, A.M., Brown, J.A. & Real, R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Jiménez-Valverde, A., Acevedo, P., Barbosa, A.M., Lobo, J.M. & Real, R. (2013) Discrimination capacity in species distribution models depends on the representativeness of the environmental domain. Global Ecology and Biogeography 22: 508-516.

Lobo, J.M., Jiménez-Valverde, A. & Real, R. (2008) AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography 17: 145-151.
