Select among correlated variables based on a given criterion

Correlations among variables are problematic in multivariate models, as they inflate the variance of coefficients and thus may bias the interpretation of the effects of those variables on the response. One of the strategies to circumvent this problem is to eliminate a priori one from each pair of correlated variables, but it is not always straightforward to pick the right variable among these. The corSelect function selects among correlated variables based either on their (stepwise) variance inflation factor (VIF); or on their relationship with the response variable, by building a bivariate model of each individual variable against the response and choosing, among each of two correlated variables, the one with the best p-value, AIC or BIC.

This function is now included in fuzzySim and depends on other functions in the package, such as FDR and multicol. It is also included as a new option within function multGLM in this package, so that multiple models can be built based on uncorrelated variables chosen appropriately for each species.

corSelect <- function(data, sp.cols = NULL, var.cols, cor.thresh = 0.8, select = "p.value", ...) {

# version 1.6 (6 Jul 2017)

if (length(sp.cols) > 1) stop ("Sorry, 'corSelect' is currently implemented for only one 'sp.col' at a time.")

univar.criteria <- c("VIF")
 bivar.criteria <- c("p.value", "AIC", "BIC")

if (!(select %in% c(univar.criteria, bivar.criteria))) stop ("Invalid 'select' criterion.")

if (!is.null(sp.cols) & select %in% bivar.criteria) {
 n.in <- nrow(data)
 data <- data[is.finite(data[ , sp.cols]), ]
 n.out <- nrow(data)
 if (n.out < n.in) warning (n.in - n.out, " observations removed due to missing data in 'sp.cols'; ", n.out, " observations actually evaluated.")
 }

cor.mat <- cor(data[ , var.cols], ...)
 cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA
 high.cor.mat <- bivar.mat <- numeric(0)

if (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
 high.cor.rowcol <- as.matrix(which(abs(cor.mat) >= cor.thresh, arr.ind = TRUE))
 high.cor.inds <- sort(unique(as.vector(high.cor.rowcol)))
 high.cor.mat <- data.frame(var1 = rownames(cor.mat)[high.cor.rowcol[ , "row"]], var2 = colnames(cor.mat)[high.cor.rowcol[ , "col"]])
 high.cor.mat$corr <- NULL
 for (r in 1:nrow(high.cor.mat)) high.cor.mat$corr[r] <- cor.mat[high.cor.rowcol[ ,"row"][r], high.cor.rowcol[ ,"col"][r]]
 high.cor.mat <- high.cor.mat[order(abs(high.cor.mat$corr), decreasing = TRUE), ]

if (is.null(sp.cols) & select %in% bivar.criteria) {
 message("Bivariate relationships not assessable without a response variable ('sp.cols'). Returning high pairwise correlations among 'var.cols'.")
 return (high.cor.mat)
 }

high.cor.vars <- unique(rownames(cor.mat[high.cor.inds, high.cor.inds]))

if (select %in% bivar.criteria) {
 bivar.mat <- FDR(data = data, sp.cols = sp.cols, var.cols = match(high.cor.vars, colnames(data)), simplif = TRUE)[ , c("p.value", "AIC", "BIC")]
 if (all.equal(order(bivar.mat[ , c("p.value")]), order(bivar.mat[ , c("AIC")]), order(bivar.mat[ , c("BIC")]))) message("Results identical using whether p-value, AIC or BIC to select variables.\n")
 else message("Results NOT identical using whether p-value, AIC or BIC to select variables.\n")
 } # end if select in bivar

data.remaining <- data[ , var.cols]
 while (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
 max.cor.ind <- which(abs(cor.mat) == max(abs(cor.mat), na.rm = TRUE), arr.ind = TRUE)
 v1 <- rownames(cor.mat)[max.cor.ind[1]]
 v2 <- colnames(cor.mat)[max.cor.ind[2]]
 if (select %in% univar.criteria) {
 vif <- multicol(data.remaining)
 targets <- vif[c(v1, v2), "VIF", drop = FALSE]
 } # end if univar
 else if (select %in% bivar.criteria) {
 targets <- bivar.mat[c(v1, v2), select, drop = FALSE]
 } # end if bivar

exclude <- which.max(as.matrix(targets))
 excl.name <- rownames(targets)[exclude]
 excl.ind <- match(excl.name, colnames(cor.mat))
 data.remaining <- data.remaining[ , -excl.ind, drop = FALSE]
 cor.mat <- cor.mat[-excl.ind, -excl.ind, drop = FALSE]
 if (all(is.na(cor.mat))) cor.mat <- numeric(0)
 if (length(cor.mat) == 0) break
 } # end while max

} # end if max > thresh

selected.vars <- colnames(cor.mat)
 selected.var.cols <- match(colnames(cor.mat), colnames(data))
 excluded.vars <- colnames(data)[var.cols][!(colnames(data)[var.cols] %in% colnames(cor.mat))]

vif2 <- multicol(data[ , selected.var.cols])

list(high.correlations = high.cor.mat,
 bivariate.significance = bivar.mat,
 excluded.vars = excluded.vars,
 selected.vars = selected.vars,
 selected.var.cols = selected.var.cols,
 strongest.remaining.corr = cor.mat[which.max(abs(cor.mat))],
 remaining.multicollinearity = vif2
 )
} # end corSelect function

 

You can install and load fuzzySim (>=version 1.5) and then check help(corSelect) for reproducible usage examples.

Advertisements

Analyse multicollinearity in a dataset, including the variance inflation factor (VIF)

The multicol function, now included in the fuzzySim package (Barbosa, 2015), calculates the degree of multicollinearity in a set of numeric variables, using three closely related measures: R-squared, tolerance and the variance inflation factor (VIF). The latter is being increasingly used in ecology and a range of other research areas, and several packages already calculate the VIF in R, but their results can be strikingly different (Estrada & Barbosa, submitted). The multicol function below calculates multicollinearity statistics directly on the independent variables, producing the expected results and without (so far) clashing with other packages. Only ‘independent’ (predictor, right hand side side) variables should be entered, as the result obtained for each variable depends on all the other variables present in the analysed data set.

multicol <- function(vars) {
  # version 0.3 (9 Oct 2014)
result <- matrix(NA, nrow = ncol(vars), ncol = 3)
  rownames(result) <- colnames(vars)
  colnames(result) <- c("Rsquared", "Tolerance", "VIF")
  for (v in 1:ncol(vars)) {
    v.name <- colnames(vars)[v]
    other.v.names <- colnames(vars)[-v]
    mod.formula <- as.formula(paste(v.name, "~", paste(other.v.names, collapse = "+")))
    mod <- lm(mod.formula, data = vars)
    R2 <- summary(mod) $ r.squared
    result[v, "Rsquared"] <- R2
    result[v, "Tolerance"] <- 1 - R2
    result[v, "VIF"] <- 1 / (1 - R2)
  }
  return(result)
}

[presented with Pretty R]

Let’s try some examples with a couple of datasets available in R (trees and OrchardSprays):

multicol(trees)
 
# you'll get a warning and some NA results if any of the variables is not numeric:
multicol(OrchardSprays)
 
# but you can define a subset of columns of 'data' to calculate 'multicol' for:
multicol(OrchardSprays[ , 1:3])

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Estrada A. & Barbosa A.M. (submitted) Calculating the Variance Inflation Factor in R: cautions and recommendations.

Prevalence and evenness in binary data

For building and evaluating species distribution models, the porportion of presences of the species and the balance between the number of presences and absences may be issues to take into account (e.g. Jiménez-Valverde & Lobo 2006, Barbosa et al. 2013). The prevalence and the evenness functions, included in the modEvA package (Barbosa et al. 2014), can calculate these measures:

prevalence <- function (obs, event = 1) {
  # version 1.0
  # calculates the prevalence (proportion of occurrences) of a value (event) in a vector
  # obs: a vector of binary observations (e.g. 1 vs. 0, male vs. female, disease vs. no disease, etc.)
  # event: the value whose prevalence we want to calculate (e.g. 1, "present", etc.)
  sum(obs == event) / length(obs)
}  # end prevalence function
evenness <- function (obs) {
  # version 1.3 (18 June 2013)
  # calculates the evenness (equilibrium) of cases in a binary vector; result ranges from 0 when all values are the same, to 1 when there are the same number of cases with each value
  # obs: a vector of binary observations (e.g. 1 vs. 0, male vs. female, disease vs. no disease, etc.)
  values <- unique(obs)
  nvalues <- length(values)
  if (!(nvalues %in% c(1, 2))) stop("Input vector includes ", nvalues, " different values; 'evenness' is only implemented for binary observations (with 1 or 2 different values).")
  proportion <- (sum(obs == values[1])) / length(obs)
  if (proportion > 0.5)  balance <- 1 - proportion  else  balance <- proportion
  return(2 * balance)  # so evenness ranges between 0 and 1
}  # end evenness function

Let’s exemplify with 3 sample binary vectors x, y and z:

(x <- rep(c(0, 1), each = 5))
(y <- c(rep(0, 3), rep(1, 7)))
(z <- c(rep(0, 7), rep(1, 3)))

prevalence(x)
evenness(x)

prevalence(y)
evenness(y)

prevalence(z)
evenness(z)

[presented with Pretty R]

References

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Barbosa A.M., Real R., Muñoz A.R. & Brown J.A. (2013) New measures for assessing model equilibrium and prediction mismatch in species distribution models. Diversity and Distributions, 19: 1333-1338

Jiménez-Valverde A. & Lobo J.M. (2006) The ghost of unbalanced species distribution data in geographical model predictions. Diversity and Distributions, 12: 521–524.

Similarity between binary variables

Similarity between binary or dicothomous variables (for example, between species’ presence-absence patterns or between regions’ biotic composition) is an important aspect of biogeography. Jaccard’s (1901) index is one of the most widely used similarity indices in ecology; Baroni-Urbani & Buser’s (1976) index has also been extensively used in chorology and biotic regionalization studies. Other measures of association for binary variables are the phi coefficient and the tetrachoric correlation coefficient (Ekström 2008).

We can use the following function in R to calculate the similarity between two binary (1-0) variables x and y:

binarySimilarity <- function(x, y, method) {
  # version 1.3 (8 May 2014)
  # x and y are 2 binary (0-1) vectors
  # 'method' can be 'CCR', 'Jaccard', 'Baroni', 'BaroniO', 'kappa, 'Mathews', 'Phi' or 'Yule'
 
  x0 <- x == 0
  x1 <- x == 1
  y0 <- y == 0
  y1 <- y == 1
 
  a <- sum(x1 & y1)
  b <- sum(x0 & y1)
  c <- sum(x1 & y0)
  d <- sum(x0 & y0)
  N <- sum(a, b, c, d)
 
  if (method == "CCR") {
    return((a + d) / N)
  }
 
  if (method == "Jaccard") {
    shared <- a
    total  <- sum(x1 | y1)
    return(shared / total)
  }
 
  else if (method == "Baroni") {
    A <- sum(x1)
    B <- sum(y1)
    C <- a
    D <- d
    return((sqrt(C * D) + C) / (sqrt(C * D) + A + B - C))
  }
 
  else if (method == "BaroniO") {
    A <- a
    B <- sum(x1 & y0)
    C <- sum(y1 & x0)
    D <- d
    return((sqrt(A * D) + A) / (sqrt(A * D) + A + B + C))
  }
 
  else if (method == "kappa") {
    return(((a+d)-(((a+c)*(a+b)+(b+d)*(c+d))/N))/(N-(((a+c)*(a+b)+(b+d)*(c+d))/N)))
    }
 
  else if (method == "Mathews") {
    S <- (a + b) / N
    P <- (a + c) / N
    MCC <- (a / N - S * P) / sqrt(prod(P, S, (1 - S), (1 - P)))
    return(MCC)
    #return(((a * d) - (b * c)) / sqrt((a + c) * (a + b) * (c + d) * (b + d)))  # equivalent
  }
 
  else if (method == "Phi") {
    A <- a/N
    AB <- (a + b) / N
    AC <- (a + c) / N
    CD <- (c + d) / N
    BD <- (b + d) / N
    return((A -(AB) * (AC)) / sqrt(prod(AB, CD, AC, BD)))
  }
 
  else if (method == "Yule") {
    return((a * d - b * c)/(a * d + b * c))
  }
 
  else stop("'method' must be either 'CCR', 'Jaccard', 'Baroni', 'BaroniO', 
'kappa', 'Mathews', 'Phi' or 'Yule'")
 
}  # end binarySimilarity function

[presented with Pretty R]

We can also build a square matrix of similarity values (also with either method) between a set of binary (1-0) variables (for example, species or regions) presented as columns in a table. For that we need to load the binarySimilarity function (above) and then the simMatrix one (below):

simMatrix <- function(variables, method = "Jaccard") {
  nvar <- ncol(variables)
    simMat <- matrix(nrow = nvar, ncol = nvar)
    for(i in 1:nvar)  for (j in 1:nvar) {
    simMat[i,j] <- binarySimilarity(variables[,i], variables[,j], method = method)
    }  
  dimnames(simMat) <- list(names(variables), names(variables))
  return(similarityMatrix = simMat)
}

Note that the original formula of Baroni-Urbani & Buser (1976) has been replaced with an equivalent, slightly different but more efficient one (taken from Olivero et al. 1998); you can try and compare their results by using both method = “Baroni” and method = “BaroniO”. For example, if you have a data frame with binary (1-0) columns to compare, like the rotifers01 sample dataset of the fuzzySim package:

sim1 <- simMatrix(rotifers01, method = "Baroni")
 
sim2 <- simMatrix(rotifers01, method = "BaroniO")
 
all.equal(sim1, sim2)  # result is TRUE

These functions were employed, for example, by Barbosa et al. (2012) to study the relationships among mammal distributions in Europe. A binary.similarity function is included in the DeadCanMove package (Barbosa et al. 2014). An upgraded version of binary similarity indices, able to deal with fuzzy in addition to binary values, is now implemented in the fuzzySim package (Barbosa, 2014), although only Jaccard and Baroni-Urbani & Buser’s indices are included there so far.

References:

Barbosa A.M. (2014) fuzzySim: Fuzzy similarity in species’ distributions. R package, version 0.1

Barbosa A.M., Estrada A., Márquez A.L., Purvis A. & Orme C.D.L. (2012) Atlas versus range maps: robustness of chorological relationships to distribution data types in European mammals. Journal of Biogeography 39: 1391-1400

Barbosa A.M., Marques J.T., Santos S.M., Lourenço A., Medinas D., Beja P. & Mira A. (2014) DeadCanMove: Assess how spatial roadkill patterns change with temporal sampling scheme. R package, version 0.1

Baroni-Urbani C. & Buser M.W. (1976) Similarity of Binary Data. Systematic Biology 25: 251-259.

Ekstrom, J. (2008) The phi-coefficient, the tetrachoric correlation coefficient, and the Pearson-Yule debate.

Jaccard P. (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin del la Société Vaudoise des Sciences Naturelles 37: 547-579.

Olivero J., Real R. & Vargas J.M. (1998) Distribution of breeding, wintering and resident waterbirds in Europe: biotic regions and the macroclimate. Ornis Fennica 75: 153-175