Select among correlated variables based on a given criterion

Correlations among variables are problematic in multivariate models, as they inflate the variance of coefficients and thus may bias the interpretation of the effects of those variables on the response. One of the strategies to circumvent this problem is to eliminate a priori one from each pair of correlated variables, but it is not always straightforward to pick the right variable among these. The corSelect function selects among correlated variables based either on their (stepwise) variance inflation factor (VIF); or on their relationship with the response variable, by building a bivariate model of each individual variable against the response and choosing, among each of two correlated variables, the one with the best p-value, AIC or BIC.

This function is now included in fuzzySim and depends on other functions in the package, such as FDR and multicol. It is also included as a new option within function multGLM in this package, so that multiple models can be built based on uncorrelated variables chosen appropriately for each species.

corSelect <- function(data, sp.cols = NULL, var.cols, cor.thresh = 0.8, select = "p.value", ...) {

# version 1.6 (6 Jul 2017)

if (length(sp.cols) > 1) stop ("Sorry, 'corSelect' is currently implemented for only one 'sp.col' at a time.")

univar.criteria <- c("VIF")
 bivar.criteria <- c("p.value", "AIC", "BIC")

if (!(select %in% c(univar.criteria, bivar.criteria))) stop ("Invalid 'select' criterion.")

if (!is.null(sp.cols) & select %in% bivar.criteria) {
 n.in <- nrow(data)
 data <- data[is.finite(data[ , sp.cols]), ]
 n.out <- nrow(data)
 if (n.out < n.in) warning (n.in - n.out, " observations removed due to missing data in 'sp.cols'; ", n.out, " observations actually evaluated.")
 }

cor.mat <- cor(data[ , var.cols], ...)
 cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA
 high.cor.mat <- bivar.mat <- numeric(0)

if (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
 high.cor.rowcol <- as.matrix(which(abs(cor.mat) >= cor.thresh, arr.ind = TRUE))
 high.cor.inds <- sort(unique(as.vector(high.cor.rowcol)))
 high.cor.mat <- data.frame(var1 = rownames(cor.mat)[high.cor.rowcol[ , "row"]], var2 = colnames(cor.mat)[high.cor.rowcol[ , "col"]])
 high.cor.mat$corr <- NULL
 for (r in 1:nrow(high.cor.mat)) high.cor.mat$corr[r] <- cor.mat[high.cor.rowcol[ ,"row"][r], high.cor.rowcol[ ,"col"][r]]
 high.cor.mat <- high.cor.mat[order(abs(high.cor.mat$corr), decreasing = TRUE), ]

if (is.null(sp.cols) & select %in% bivar.criteria) {
 message("Bivariate relationships not assessable without a response variable ('sp.cols'). Returning high pairwise correlations among 'var.cols'.")
 return (high.cor.mat)
 }

high.cor.vars <- unique(rownames(cor.mat[high.cor.inds, high.cor.inds]))

if (select %in% bivar.criteria) {
 bivar.mat <- FDR(data = data, sp.cols = sp.cols, var.cols = match(high.cor.vars, colnames(data)), simplif = TRUE)[ , c("p.value", "AIC", "BIC")]
 if (all.equal(order(bivar.mat[ , c("p.value")]), order(bivar.mat[ , c("AIC")]), order(bivar.mat[ , c("BIC")]))) message("Results identical using whether p-value, AIC or BIC to select variables.\n")
 else message("Results NOT identical using whether p-value, AIC or BIC to select variables.\n")
 } # end if select in bivar

data.remaining <- data[ , var.cols]
 while (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
 max.cor.ind <- which(abs(cor.mat) == max(abs(cor.mat), na.rm = TRUE), arr.ind = TRUE)
 v1 <- rownames(cor.mat)[max.cor.ind[1]]
 v2 <- colnames(cor.mat)[max.cor.ind[2]]
 if (select %in% univar.criteria) {
 vif <- multicol(data.remaining)
 targets <- vif[c(v1, v2), "VIF", drop = FALSE]
 } # end if univar
 else if (select %in% bivar.criteria) {
 targets <- bivar.mat[c(v1, v2), select, drop = FALSE]
 } # end if bivar

exclude <- which.max(as.matrix(targets))
 excl.name <- rownames(targets)[exclude]
 excl.ind <- match(excl.name, colnames(cor.mat))
 data.remaining <- data.remaining[ , -excl.ind, drop = FALSE]
 cor.mat <- cor.mat[-excl.ind, -excl.ind, drop = FALSE]
 if (all(is.na(cor.mat))) cor.mat <- numeric(0)
 if (length(cor.mat) == 0) break
 } # end while max

} # end if max > thresh

selected.vars <- colnames(cor.mat)
 selected.var.cols <- match(colnames(cor.mat), colnames(data))
 excluded.vars <- colnames(data)[var.cols][!(colnames(data)[var.cols] %in% colnames(cor.mat))]

vif2 <- multicol(data[ , selected.var.cols])

list(high.correlations = high.cor.mat,
 bivariate.significance = bivar.mat,
 excluded.vars = excluded.vars,
 selected.vars = selected.vars,
 selected.var.cols = selected.var.cols,
 strongest.remaining.corr = cor.mat[which.max(abs(cor.mat))],
 remaining.multicollinearity = vif2
 )
} # end corSelect function

 

You can install and load fuzzySim (>=version 1.5) and then check help(corSelect) for reproducible usage examples.

Advertisements

Several modEvA functions moved to fuzzySim

Just a heads-up to let you know that the modelling functions formerly included in the modEvA package have been moved to package fuzzySim for operative and editorial reasons. These functions are: multGLM, modelTrim, Fav, getPreds, FDR, percentTestData, integerCols, multConvert and multicol. The rotif.env sample dataset has also moved to fuzzySim. Package modEvA now has dataset rotif.mods and keeps only functions strictly related to model evaluation and analysis, and not to model building.

The new modEvA version (currently 0.9) also includes some bug fixes, including one that made the packaged varPart function calculate the ABCoverlap incorrectly – thanks to Jurica Levatic for spotting this.

Replace a text string in all files within a folder

Imagine you have your package skeleton, with its functions in separate files in the ‘R‘ folder and their help files in the ‘man‘ folder. Now imagine you change your mind about the name of a function or argument which may appear in several of these files. It can be painfully dull and error-prone to replace the name manually or file-by-file… The gsubMultFiles function below changes a character string globally by applying the gsub R function to the texts of all files in a given directory (it can also be used to replace text in files other than R skeletons). Alternatively, you may only want to search for a character string and see in which files it appears.

searchStringInFiles <- function(path, pattern, replacement = NULL, recursive = TRUE, ...) {
  # version 2.0 (17 Apr 2015)
 
  odir <- getwd()
  setwd(path)
  files <- list.files(".", recursive = recursive)
 
  if (is.null(replacement)) {
    for (f in files) {
      if (length(grep(pattern, readLines(f))) > 0) {
        cat("\n", f, "\n", sep = "")
        cat(">>> contains '", pattern, "'!\n", sep = "")
      }  # end if length
    }  # end for f
 
  } else {
 
    for (f in files) {
      in.text <- readLines(f)
      out.text <- gsub(pattern = pattern, replacement = replacement, x = in.text, ...)
      cat(out.text, sep = "\n", file = f)
    }  # end for f
  }  # end else
 
  setwd(odir)
}  # end searchStringInFiles function

[presented with Pretty R]

Load the function and give it your (package) folder path, the character string you want to search for, and (optionally) the replacement for this string:

gsubMultFiles(path = “/home/user/Documents/myPackages/supeR”, pattern = “MYfunction”)

gsubMultFiles(path = “/home/user/Documents/myPackages/supeR”, pattern = “MYfunction”, replacement = “myFunction”)

Mind that global text replacements can get a bit out of control if not used and double-checked carefully. Beware! And backup your package before trying to pull off something like this. (Anyway, this can be undone with the same command after switching the pattern and replacement arguments, I think).

modEvA and fuzzySim packages on R-Forge

Just a quick post to let you know that package modEvA, containing most of the functions posted till today in this blog, and package fuzzySim, containing functions spCodes and splist2presabs and a fuzzy version of binarySimilarity and simMatrix, are finally available on R-Forge and should be accessed there from now on, as that’s where they’ll be updated. You can either install the packages directly with an R command, or download the source and/or binary files from R-Forge. Please refer to the packages pages for more information, and let me know if you bump into problems.

Check your package’s help files for non-ASCII characters

If you’re building an R package (e.g. following this simple tutorial) and have your package skeleton with the ‘man’ folder containing all your help files, it may be useful to check all those files at once for non-ASCII characters (e.g. tildes, dashes) which may not be correctly exported to other computers. The following function aplies the showNonASCIIfile R function to all files in a given folder:

multNonASCIIcheck <- function(path) {
# version 1.0 (4 Apr 2014)
# path: complete folder path to the "man" folder of your package skeleton
  odir <- getwd()
  setwd(path)
  files <- list.files()
  for (f in files) {
    cat("\n", f, "\n", sep = "")
    if(length(tools::showNonASCIIfile(f)) == 0) cat("OK\n")
  }
  setwd(odir)
}

[presented with Pretty R]

Load the function and give it your ‘man’ folder path to get the names of all files and the non-ASCII characters found, e.g.:

multNonASCIIcheck(“/home/user/Documents/myPackages/supeR/man”)

Then you can just open the problematic help files and correct the non-ASCII characters detected.