False Discovery Rate (FDR)

It is common in ecology to search for statistical relationships between species’ occurrence and a set of predictor variables. However, when a large number of variables is analysed (compared to the number of observations), false findings may arise due to repeated testing. García (2003) recommended controlling the false discovery rate (FDR; Benjamini and Hochberg 1995) in ecological studies. Benjamini & Yekutieli (2001) adapted it to allow for dependency among tests statstics. The p.adjust function in R performs this and other corrections to the significance (p) values of variables under repeated testing. The FDR function, now included in the fuzzySim package (Barbosa, 2015), performs bivariate regressions (either linear or binary logistic) of a set of predictor variables on one response variable, or uses already-obtained p values for a set of predictor variables. It then calculates the FDR with p.adjust, and shows which variables should be retained for or excluded from further multivariate analysis according to their corrected p values (see, for example, Barbosa, Real & Vargas 2009). Besides the p-values, this function also reports the individual AIC and BIC of each variable.

The FDR function uses the Benjamini & Hochberg (“BH”) correction by default, but check the documentation of the p.adjust function for other available methods, namely “BY”, which allows for non-independent data (Benjamini & Yekutieli, 2001) and which is the default e.g. in function multGLM. Input data may be a matrix or data frame containing the response variable (for example, the presence-absence or abundance of a species) and the predictor variables; alternatively, you may already have performed the bivariate regressions and have a set of variables and corresponding p values which you want to correct with FDR — in this case, get a data frame with your variables’ names in the first column and their p values in the second column, and supply it as the pvalues argument to the FDR function (no need to provide data, sp.cols or var.cols in this case). NOTE that the input data format for this function has changed in mid May 2014, with former parameters response and predictors now replaced with data, sp.cols and var.cols (for coherence and compatibility with multGLM). Note also that, as of late October 2015, argument model.type is deprecated, as this information is included in argument family – e.g., if you want linear models (LM), you can set family = ‘gaussian’. You can let the default family = ‘auto’ figure out the family automatically.

function (data = NULL, sp.cols = NULL, var.cols = NULL, pvalues = NULL, 
    model.type = NULL, family = "auto", correction = "fdr", 
    q = 0.05, verbose = TRUE, simplif = FALSE) {
    # version 3.6 (26 Apr 2016)
    if (length(sp.cols) > 1) 
        stop("Sorry, FDR is currently implemented for only one response variable at a time, so 'sp.cols' must indicate only one column")
    if (!is.null(model.type)) 
        warning("Argument 'model.type' is deprecated and now ignored, as this info is included in 'family' (e.g. 'gaussian' for LM, 'binomial' or 'poisson' for GLM).")
    model.type <- "GLM"
    n.init <- nrow(data)
    data <- data[is.finite(data[, sp.cols]), ]
    na.loss <- n.init - nrow(data)
    if (na.loss > 0) 
        message(na.loss, " cases excluded due to missing or non-finite values.")
    if (family == "auto") {
        if (all(data[, sp.cols] %in% c(0, 1))) 
            family <- "binomial"
        else if (all(data[, sp.cols] >= 0 && data[, sp.cols]%%1 == 
            0)) 
            family <- "poisson"
        else family <- "gaussian"
        if (verbose) 
            message("\nUsing generalized linear models of family '", 
                family, "'.\n")
    }
    if (!(correction %in% p.adjust.methods)) 
        stop("Invalid correction method.\nType 'p.adjust.methods' for available options.")
    response <- data[, sp.cols]
    predictors <- data[, var.cols]
    if (!is.null(pvalues)) {
        coeffs <- aic <- bic <- FALSE
        p.bivar <- pvalues[, 2]
        names(p.bivar) <- pvalues[, 1]
    }
    else {
        coeffs <- aic <- bic <- TRUE
        if (is.null(ncol(predictors))) 
            stop("You need more than one predictor to calculate the FDR.")
        p.bivar <- coef.bivar <- aic.bivar <- bic.bivar <- vector("numeric", 
            length = ncol(predictors))
        for (i in 1:length(p.bivar)) {
            model <- glm(response ~ predictors[, i], family = family)
            p.bivar[i] <- anova(model, test = "Chi")[, "Pr(>Chi)"][2]
            coef.bivar[i] <- model[["coefficients"]][2]
            aic.bivar[i] <- extractAIC(model, k = 2)[2]
            bic.bivar[i] <- extractAIC(model, k = log(nobs(model)))[2]
            if (is.na(p.bivar[i])) 
                message("A p-value could not be calculated for var.col number", 
                  i)
            if (is.na(aic.bivar[i])) 
                message("AIC could not be calculated for var.col number", 
                  i)
            if (is.na(aic.bivar[i])) 
                message("BIC could not be calculated for var.col number", 
                  i)
        }
        rm(i)
    }
    if (coeffs) {
        results <- data.frame(cbind(coef.bivar, aic.bivar, bic.bivar, 
            p.bivar), row.names = names(predictors))
        names(results) <- c("bivariate.coeff", "AIC", "BIC", 
            "p.value")
        results <- results[order(results[, "p.value"]), ]
        results[, "p.adjusted"] <- p.adjust(results[, "p.value"], 
            method = correction)
    }
    else {
        results <- data.frame(AIC = aic.bivar, BIC = bic.bivar, 
            p.value = p.bivar, row.names = pvalues[, 1])
        results <- results[order(results[, "p.value"]), ]
        results[, "p.adjusted"] <- p.adjust(results[, "p.value"], 
            method = correction)
    }
    p.adjusted <- NULL
    if (simplif) 
        return(results)
    exclude <- subset(results, p.adjusted > q)
    select <- subset(results, p.adjusted <= q)
    if (verbose) {
        message("Bivariate p-values adjusted with '", correction, 
            "' correction;\n", nrow(exclude), " variable(s) excluded, ", 
            nrow(select), " selected (with q = ", q, ")\n")
    }
    list(exclude = exclude, select = select)
}

[presented with Pretty R]

Usage example:

FDR(data = mydata, sp.cols= 2, var.cols = 5:17)

FDR(data = mydata, sp.cols= 2, var.cols = 5:17, correction = “BY”)

(install fuzzySim and get help file ?FDR for reproducible examples with sample data)

References:

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Barbosa A.M., Real R. & Vargas J.M (2009) Transferability of environmental favourability models in geographic space: The case of the Iberian desman (Galemys pyrenaicus) in Portugal and Spain. Ecological Modelling 220: 747-754

Benjamini Y. & Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289-300

Benjamini Y. & Yekutieli D. (2001) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29: 1165-1188.

García L.V. (2003) Controlling the false discovery rate in ecological research. Trends in Ecology and Evolution 18: 553-554

Advertisements

Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s