Similarity between binary or dichotomous variables (for example, between species’ presence-absence patterns or between regions’ biotic composition) is an important aspect of biogeography. Jaccard’s (1901) index is one of the most widely used similarity indices in ecology; Baroni-Urbani & Buser’s (1976) index has also been extensively used in chorology and biotic regionalization studies. Other measures of association for binary variables are the *phi* coefficient and the tetrachoric correlation coefficient (Ekström 2008).

We can use the following function in R to calculate the similarity between two binary (1-0) variables *x* and *y*:

    binarySimilarity <- function(x, y, method) {
      # version 1.3 (8 May 2014)
      # x and y are 2 binary (0-1) vectors
      # 'method' can be 'CCR', 'Jaccard', 'Baroni', 'BaroniO', 'kappa', 'Mathews', 'Phi' or 'Yule'
      x0 <- x == 0
      x1 <- x == 1
      y0 <- y == 0
      y1 <- y == 1
      a <- sum(x1 & y1)
      b <- sum(x0 & y1)
      c <- sum(x1 & y0)
      d <- sum(x0 & y0)
      N <- sum(a, b, c, d)
      if (method == "CCR") {
        return((a + d) / N)
      }
      if (method == "Jaccard") {
        shared <- a
        total <- sum(x1 | y1)
        return(shared / total)
      } else if (method == "Baroni") {
        A <- sum(x1)
        B <- sum(y1)
        C <- a
        D <- d
        return((sqrt(C * D) + C) / (sqrt(C * D) + A + B - C))
      } else if (method == "BaroniO") {
        A <- a
        B <- sum(x1 & y0)
        C <- sum(y1 & x0)
        D <- d
        return((sqrt(A * D) + A) / (sqrt(A * D) + A + B + C))
      } else if (method == "kappa") {
        return(((a+d)-(((a+c)*(a+b)+(b+d)*(c+d))/N))/(N-(((a+c)*(a+b)+(b+d)*(c+d))/N)))
      } else if (method == "Mathews") {
        S <- (a + b) / N
        P <- (a + c) / N
        MCC <- (a / N - S * P) / sqrt(prod(P, S, (1 - S), (1 - P)))
        return(MCC)
        #return(((a * d) - (b * c)) / sqrt((a + c) * (a + b) * (c + d) * (b + d)))  # equivalent
      } else if (method == "Phi") {
        A <- a/N
        AB <- (a + b) / N
        AC <- (a + c) / N
        CD <- (c + d) / N
        BD <- (b + d) / N
        return((A -(AB) * (AC)) / sqrt(prod(AB, CD, AC, BD)))
      } else if (method == "Yule") {
        return((a * d - b * c)/(a * d + b * c))
      } else stop("'method' must be either 'CCR', 'Jaccard', 'Baroni', 'BaroniO', 'kappa', 'Mathews', 'Phi' or 'Yule'")
    }  # end binarySimilarity function
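
To make the four counts used inside the function more concrete, here is a minimal sketch that builds them by hand for two made-up vectors and derives a few of the indices from them. The vectors `x` and `y` are toy data, not taken from any dataset, and the final cross-check relies on the fact that the *phi* coefficient of two 0-1 vectors equals their Pearson correlation.

```r
# toy presence-absence vectors (hypothetical data, for illustration only)
x <- c(1, 1, 1, 0, 0, 1, 0, 0)
y <- c(1, 1, 0, 0, 1, 1, 0, 0)

# the four cells of the 2x2 contingency table, named as in binarySimilarity
a <- sum(x == 1 & y == 1)  # present in both: 3
b <- sum(x == 0 & y == 1)  # present only in y: 1
c <- sum(x == 1 & y == 0)  # present only in x: 1
d <- sum(x == 0 & y == 0)  # absent from both: 3
N <- a + b + c + d         # 8

ccr     <- (a + d) / N                          # 0.75
jaccard <- a / (a + b + c)                      # 0.6
yule    <- (a * d - b * c) / (a * d + b * c)    # 0.8
phi     <- (a * d - b * c) /
  sqrt((a + b) * (c + d) * (a + c) * (b + d))   # 0.5

# phi should match the Pearson correlation of the two 0-1 vectors
all.equal(phi, cor(x, y))
```

Note that Jaccard ignores the shared absences (`d`), whereas CCR, Yule and *phi* all use them, which is why the values differ so much for the same pair of vectors.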

We can also build a square matrix of similarity values (using any of the methods above) between a set of binary (1-0) variables (for example, species or regions) presented as columns in a table. For that we need to load the *binarySimilarity* function (above) and then the *simMatrix* function (below):

    simMatrix <- function(variables, method = "Jaccard") {
      nvar <- ncol(variables)
      simMat <- matrix(nrow = nvar, ncol = nvar)
      for (i in 1:nvar) {
        for (j in 1:nvar) {
          simMat[i, j] <- binarySimilarity(variables[, i], variables[, j], method = method)
        }
      }
      dimnames(simMat) <- list(names(variables), names(variables))
      return(simMat)
    }
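
The double loop computes one pair of columns at a time, which is fine for a few dozen variables. For the Jaccard index specifically, the whole matrix can also be obtained from matrix products. The following is only a sketch: the helper name `jaccardMatrix` and the `toy` data frame are made up for this example, and it assumes every column is strictly 0-1 with no missing values.

```r
# hypothetical helper, not part of the original functions
jaccardMatrix <- function(variables) {
  m <- as.matrix(variables)
  shared <- crossprod(m)   # shared[i, j] = number of rows where both columns are 1
  counts <- colSums(m)     # number of presences per variable
  total  <- outer(counts, counts, "+") - shared  # rows where at least one is 1
  shared / total
}

# toy data: 3 species (columns) in 5 sites (rows)
toy <- data.frame(sp1 = c(1, 1, 0, 0, 1),
                  sp2 = c(1, 0, 0, 1, 1),
                  sp3 = c(0, 0, 1, 1, 0))
jaccardMatrix(toy)
# sp1-sp2: 0.5, sp1-sp3: 0, sp2-sp3: 0.25, diagonal: 1
```

As with the loop version, a pair of columns containing no presences at all gives 0/0, so the corresponding entries are `NaN`.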

Note that the original formula of Baroni-Urbani & Buser (1976) has been replaced with an algebraically equivalent but more efficient form (taken from Olivero et al. 1998); you can compare the results of the two by using both *method = "Baroni"* and *method = "BaroniO"*. For example, if you have a data frame with binary (1-0) columns to compare, like the *rotifers01* sample dataset of the *fuzzySim* package:

    library(fuzzySim)  # provides the rotifers01 sample dataset
    data(rotifers01)
    sim1 <- simMatrix(rotifers01, method = "Baroni")
    sim2 <- simMatrix(rotifers01, method = "BaroniO")
    all.equal(sim1, sim2)  # result is TRUE
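
If *fuzzySim* is not installed, the same equivalence can be checked on simulated data. The sketch below restates both formulas as stand-alone helpers (the names `baroniA` and `baroniO` are made up here) and compares them on random 0-1 vectors. They agree because the denominators are the same quantity written two ways: A + B - C = (a + c) + (a + b) - a = a + b + c.

```r
# stand-alone copies of the two formulas (hypothetical helper names)
baroniA <- function(x, y) {
  A <- sum(x == 1)            # presences in x
  B <- sum(y == 1)            # presences in y
  C <- sum(x == 1 & y == 1)   # shared presences
  D <- sum(x == 0 & y == 0)   # shared absences
  (sqrt(C * D) + C) / (sqrt(C * D) + A + B - C)
}
baroniO <- function(x, y) {
  a <- sum(x == 1 & y == 1)
  b <- sum(x == 0 & y == 1)
  c <- sum(x == 1 & y == 0)
  d <- sum(x == 0 & y == 0)
  (sqrt(a * d) + a) / (sqrt(a * d) + a + b + c)
}

set.seed(1)  # arbitrary seed, only for reproducibility
agree <- replicate(100, {
  x <- rbinom(50, 1, 0.4)  # simulated presence-absence data
  y <- rbinom(50, 1, 0.6)
  isTRUE(all.equal(baroniA(x, y), baroniO(x, y)))
})
all(agree)
```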

These functions were employed, for example, by Barbosa et al. (2012) to study the relationships among mammal distributions in Europe. A *binary.similarity* function is included in the *DeadCanMove* package (Barbosa et al. 2014). An upgraded version of these binary similarity indices, able to deal with fuzzy in addition to binary values, is now implemented in the *fuzzySim* package (Barbosa 2014), although only Jaccard’s and Baroni-Urbani & Buser’s indices are included there so far.
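
To give an idea of what extending an index to fuzzy values can look like, here is a sketch of a fuzzy Jaccard index. It assumes the common fuzzy-set definitions (intersection as the element-wise minimum, union as the element-wise maximum). The function name `fuzzyJaccard` and the data are made up for this example, and the code is not taken from the *fuzzySim* source, so check the package documentation for what it actually implements.

```r
# hypothetical helper; assumes x and y hold values between 0 and 1
fuzzyJaccard <- function(x, y) {
  sum(pmin(x, y)) / sum(pmax(x, y))  # fuzzy intersection over fuzzy union
}

# with strictly binary input this reduces to the ordinary Jaccard index
xb <- c(1, 1, 1, 0, 0, 1, 0, 0)
yb <- c(1, 1, 0, 0, 1, 1, 0, 0)
fuzzyJaccard(xb, yb)  # 0.6

# with fuzzy values (made-up membership scores)
xf <- c(0.9, 0.8, 0.6, 0.1, 0.2, 0.7, 0.0, 0.1)
yf <- c(0.8, 0.9, 0.2, 0.0, 0.7, 0.9, 0.1, 0.0)
fuzzyJaccard(xf, yf)
```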

**References:**

Barbosa A.M. (2014) fuzzySim: Fuzzy similarity in species’ distributions. R package, version 0.1.

Barbosa A.M., Estrada A., Márquez A.L., Purvis A. & Orme C.D.L. (2012) Atlas versus range maps: robustness of chorological relationships to distribution data types in European mammals. *Journal of Biogeography* 39: 1391-1400.

Barbosa A.M., Marques J.T., Santos S.M., Lourenço A., Medinas D., Beja P. & Mira A. (2014) DeadCanMove: Assess how spatial roadkill patterns change with temporal sampling scheme. R package, version 0.1.

Baroni-Urbani C. & Buser M.W. (1976) Similarity of Binary Data. *Systematic Biology* 25: 251-259.

Ekström J. (2008) The phi-coefficient, the tetrachoric correlation coefficient, and the Pearson-Yule debate.

Jaccard P. (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. *Bulletin de la Société Vaudoise des Sciences Naturelles* 37: 547-579.

Olivero J., Real R. & Vargas J.M. (1998) Distribution of breeding, wintering and resident waterbirds in Europe: biotic regions and the macroclimate. *Ornis Fennica* 75: 153-175.