Similarity between binary variables

Similarity between binary or dichotomous variables (for example, between species’ presence-absence patterns or between regions’ biotic composition) is an important aspect of biogeography. Jaccard’s (1901) index is one of the most widely used similarity indices in ecology; Baroni-Urbani & Buser’s (1976) index has also been extensively used in chorology and biotic regionalization studies. Other measures of association for binary variables are the phi coefficient and the tetrachoric correlation coefficient (Ekström 2008).
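
For reference, three of these indices can be written in terms of the counts of a 2 × 2 contingency table. The notation here is an assumption, chosen to match the R function given further down: a is the number of cases where both variables are 1, b and c are the numbers of cases where only one of them is 1, and d is the number of cases where both are 0. J is Jaccard's index, B is Baroni-Urbani & Buser's index and φ is the phi coefficient:

```latex
\[
J = \frac{a}{a + b + c}, \qquad
B = \frac{\sqrt{a\,d} + a}{\sqrt{a\,d} + a + b + c}, \qquad
\phi = \frac{a\,d - b\,c}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}
\]
```

Jaccard's index ignores shared absences (d), whereas Baroni-Urbani & Buser's index and the phi coefficient take them into account.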

We can use the following function in R to calculate the similarity between two binary (1-0) variables x and y:

binarySimilarity <- function(x, y, method) {
  # version 1.3 (8 May 2014)
  # x and y are 2 binary (0-1) vectors
  # 'method' can be 'CCR', 'Jaccard', 'Baroni', 'BaroniO', 'kappa', 'Mathews', 'Phi' or 'Yule'
 
  x0 <- x == 0
  x1 <- x == 1
  y0 <- y == 0
  y1 <- y == 1
 
  a <- sum(x1 & y1)
  b <- sum(x0 & y1)
  c <- sum(x1 & y0)
  d <- sum(x0 & y0)
  N <- sum(a, b, c, d)
 
  if (method == "CCR") {
    return((a + d) / N)
  }
 
  else if (method == "Jaccard") {
    shared <- a
    total  <- sum(x1 | y1)
    return(shared / total)
  }
 
  else if (method == "Baroni") {
    A <- sum(x1)
    B <- sum(y1)
    C <- a
    D <- d
    return((sqrt(C * D) + C) / (sqrt(C * D) + A + B - C))
  }
 
  else if (method == "BaroniO") {
    A <- a
    B <- sum(x1 & y0)
    C <- sum(y1 & x0)
    D <- d
    return((sqrt(A * D) + A) / (sqrt(A * D) + A + B + C))
  }
 
  else if (method == "kappa") {
    observed <- a + d  # number of agreements
    expected <- ((a + c) * (a + b) + (b + d) * (c + d)) / N  # agreements expected by chance
    return((observed - expected) / (N - expected))
  }
 
  else if (method == "Mathews") {
    S <- (a + b) / N
    P <- (a + c) / N
    MCC <- (a / N - S * P) / sqrt(prod(P, S, (1 - S), (1 - P)))
    return(MCC)
    #return(((a * d) - (b * c)) / sqrt((a + c) * (a + b) * (c + d) * (b + d)))  # equivalent
  }
 
  else if (method == "Phi") {
    A <- a/N
    AB <- (a + b) / N
    AC <- (a + c) / N
    CD <- (c + d) / N
    BD <- (b + d) / N
    return((A -(AB) * (AC)) / sqrt(prod(AB, CD, AC, BD)))
  }
 
  else if (method == "Yule") {
    return((a * d - b * c)/(a * d + b * c))
  }
 
  else stop("'method' must be one of 'CCR', 'Jaccard', 'Baroni', 'BaroniO', ",
            "'kappa', 'Mathews', 'Phi' or 'Yule'")
 
}  # end binarySimilarity function
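
As a quick check of what the function computes, the sketch below works through a small example by hand. The two vectors are hypothetical (invented for illustration), and the index formulas are written directly from the contingency counts; if binarySimilarity has been loaded, the last lines compare its output with these hand-computed values.

```r
# Two hypothetical presence-absence vectors over 10 sites
x <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0)
y <- c(1, 0, 1, 1, 0, 1, 0, 0, 0, 0)

# Contingency counts, named as in binarySimilarity
# (assigning to 'c' does not break calls to the c() function in R)
a <- sum(x == 1 & y == 1)  # 3 shared presences
b <- sum(x == 0 & y == 1)  # 1 presence in y only
c <- sum(x == 1 & y == 0)  # 2 presences in x only
d <- sum(x == 0 & y == 0)  # 4 shared absences
N <- a + b + c + d

# Number of agreements expected by chance (used by kappa)
chance <- ((a + c) * (a + b) + (b + d) * (c + d)) / N

expected <- c(
  CCR     = (a + d) / N,
  Jaccard = a / (a + b + c),
  Baroni  = (sqrt(a * d) + a) / (sqrt(a * d) + a + b + c),
  kappa   = ((a + d) - chance) / (N - chance),
  Phi     = (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d)),
  Yule    = (a * d - b * c) / (a * d + b * c)
)
print(round(expected, 3))

# If binarySimilarity is available, its results should match the values above
if (exists("binarySimilarity")) {
  for (m in names(expected)) {
    stopifnot(isTRUE(all.equal(binarySimilarity(x, y, method = m), expected[[m]])))
  }
}
```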


We can also build a square matrix of similarity values (using any of these methods) between a set of binary (1-0) variables (for example, species or regions) presented as columns in a table. For that we need to load the binarySimilarity function (above) and then the simMatrix function (below):

simMatrix <- function(variables, method = "Jaccard") {
  # 'variables' is a data frame or matrix with binary (0-1) columns
  nvar <- ncol(variables)
  simMat <- matrix(nrow = nvar, ncol = nvar)
  for (i in seq_len(nvar)) for (j in seq_len(nvar)) {
    simMat[i, j] <- binarySimilarity(variables[, i], variables[, j], method = method)
  }
  dimnames(simMat) <- list(colnames(variables), colnames(variables))
  return(simMat)
}
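
For the Jaccard method, the whole matrix can also be obtained with matrix products instead of a double loop. The sketch below is an alternative formulation, not the approach used in simMatrix, and the small data frame is hypothetical; it can serve as an independent check of the simMatrix output when both functions are loaded.

```r
# Hypothetical presence-absence table: rows are sites, columns are species
dat <- data.frame(
  sp1 = c(1, 1, 0, 0, 1),
  sp2 = c(1, 0, 0, 1, 1),
  sp3 = c(0, 0, 1, 1, 0)
)

m <- as.matrix(dat)
shared <- crossprod(m)                          # co-occurrences for each pair of columns
totals <- colSums(m)                            # number of presences per column
either <- outer(totals, totals, "+") - shared   # sites where at least one of the pair is present
jaccard <- shared / either                      # NaN for a pair of columns with no presences at all
print(round(jaccard, 3))

# If the functions from this post are loaded, the two matrices should agree
if (exists("simMatrix") && exists("binarySimilarity")) {
  stopifnot(isTRUE(all.equal(jaccard, simMatrix(dat, method = "Jaccard"))))
}
```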

Note that method = "Baroni" does not use the original formula of Baroni-Urbani & Buser (1976) but an equivalent, more efficient formulation (taken from Olivero et al. 1998); the original formula is available as method = "BaroniO", so you can compare the results of the two. For example, if you have a data frame with binary (1-0) columns to compare, like the rotifers01 sample dataset of the fuzzySim package:

sim1 <- simMatrix(rotifers01, method = "Baroni")
 
sim2 <- simMatrix(rotifers01, method = "BaroniO")
 
all.equal(sim1, sim2)  # result is TRUE
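
The equivalence follows from simple counting. Writing A and B for the number of presences in x and y, and a, b, c, d for the contingency counts as in the binarySimilarity function, we have A = a + c and B = a + b, so the two denominators coincide:

```latex
\[
A + B - a = (a + c) + (a + b) - a = a + b + c
\quad\Longrightarrow\quad
\frac{\sqrt{a\,d} + a}{\sqrt{a\,d} + A + B - a}
= \frac{\sqrt{a\,d} + a}{\sqrt{a\,d} + a + b + c}
\]
```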

These functions were employed, for example, by Barbosa et al. (2012) to study the relationships among mammal distributions in Europe. A binary.similarity function is included in the DeadCanMove package (Barbosa et al. 2014). An upgraded version of the binary similarity indices, able to deal with fuzzy as well as binary values, is now implemented in the fuzzySim package (Barbosa 2014), although only Jaccard's and Baroni-Urbani & Buser’s indices are included there so far.

References:

Barbosa A.M. (2014) fuzzySim: Fuzzy similarity in species’ distributions. R package, version 0.1.

Barbosa A.M., Estrada A., Márquez A.L., Purvis A. & Orme C.D.L. (2012) Atlas versus range maps: robustness of chorological relationships to distribution data types in European mammals. Journal of Biogeography 39: 1391-1400.

Barbosa A.M., Marques J.T., Santos S.M., Lourenço A., Medinas D., Beja P. & Mira A. (2014) DeadCanMove: Assess how spatial roadkill patterns change with temporal sampling scheme. R package, version 0.1.

Baroni-Urbani C. & Buser M.W. (1976) Similarity of Binary Data. Systematic Biology 25: 251-259.

Ekström J. (2008) The phi-coefficient, the tetrachoric correlation coefficient, and the Pearson-Yule debate.

Jaccard P. (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37: 547-579.

Olivero J., Real R. & Vargas J.M. (1998) Distribution of breeding, wintering and resident waterbirds in Europe: biotic regions and the macroclimate. Ornis Fennica 75: 153-175.
