Determine the percentage of data to reserve for model testing

Drawing on the work of Schaafsma & van Vark (1979), Huberty (1994) proposed a heuristic (‘rule of thumb’) for the proportion of data to set aside for testing species presence/absence models. The proportion depends only on the number of predictor variables used, and shrinks as that number grows (Fielding & Bell 1997). The percentTestData function, now included in the fuzzySim package (Barbosa 2015), calculates this proportion as a percentage:

percentTestData <- function(nvar) {
  # v1.1 (28 Mar 2013)
  # heuristic ('rule of thumb') of Huberty (1994; in Fielding & Bell 1997) to determine the test/training data ratio for presence-absence models
  # nvar: number of predictor variables
  huberty <- 1 / (1 + sqrt(nvar - 1))
  return(round(100 * huberty, 1))
}
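
Written as a formula, the function above computes the following before converting to a percentage and rounding to one decimal place. The symbols p_test (proportion of data reserved for testing) and n (number of predictor variables, nvar in the code) are notation introduced here for illustration, not taken from the cited sources:

```latex
p_{\text{test}} = \frac{1}{1 + \sqrt{n - 1}},
\qquad
n = 15:\quad p_{\text{test}} = \frac{1}{1 + \sqrt{14}} \approx 0.211
```

With a single predictor (n = 1) the square root is zero and the formula gives 1, i.e. 100%, so the heuristic is only informative for models with two or more variables.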


For example, if you’re building a model based on 15 variables, load the percentTestData function and then just type:

percentTestData(15)
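
For 15 variables this should return 21.1, meaning roughly a fifth of the data would be held out for testing. As a rough sketch of how the result might be used, the code below compares a few model sizes and then splits a data set at random. The 200-row data set size, the seed, and the names nvars, pct, n_rows, n_test, test_rows and train_rows are assumptions made for illustration; they are not part of the original function or of fuzzySim.

```r
# Same definition as in the post, repeated so this sketch runs on its own
percentTestData <- function(nvar) {
  huberty <- 1 / (1 + sqrt(nvar - 1))
  round(100 * huberty, 1)
}

# The arithmetic is vectorised, so several model sizes can be compared at once
nvars <- c(2, 5, 10, 15)
pct <- percentTestData(nvars)
print(data.frame(nvar = nvars, percent_test = pct))

# Hypothetical data set of 200 rows, split at random using the heuristic
set.seed(1)                                   # arbitrary seed, for repeatability
n_rows <- 200                                 # assumed number of records
n_test <- round(n_rows * percentTestData(15) / 100)
test_rows <- sample(n_rows, n_test)           # row indices reserved for testing
train_rows <- setdiff(seq_len(n_rows), test_rows)  # remaining rows for training
```

The split here is a simple random draw; for presence/absence data you may prefer to stratify it so that presences and absences are represented in both sets.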

References

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution 6: 853-858

Fielding A.H. & Bell J.F. (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24: 38-49

Huberty C.J. (1994) Applied Discriminant Analysis. Wiley, New York, 466 pp.

Schaafsma W. & van Vark G.N. (1979) Classification and discrimination problems with applications. Part IIa. Statistica Neerlandica 33: 91-126