Some thoughts on modelling in R
Alex Hayes
2018-04-10
Part 1 Motivation
Each model in R essentially lives in its own package and has a unique interface. This places a large cognitive load on data analysts. For example, suppose we want to use k-nearest neighbors (KNN). We might do something like this:
library(tidyverse)
library(rsample)
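# split iris into training and test sets, dropping the outcome from the test set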
data <- initial_split(iris)
train <- training(data)
test <- select(testing(data), -Species)
knn_preds <- class::knn(test = test,
                        train = select(train, -Species),
                        cl = train$Species,
                        k = 5)
But if we want to use naive Bayes, we might end up writing code that looks like this:
nb_model <- e1071::naiveBayes(Species ~ ., data = train)
nb_preds <- predict(nb_model, newdata = test)
This has some problems:
- `knn` generates predictions immediately on a test set, while `naiveBayes` creates a model object
- For `knn` we have to pass arguments `cl` and `k`, even though it would be reasonable to select `k` by cross-validation, and `cl` could be more succinctly expressed as an outcome in a formula
- `knn` and `naiveBayes` have different interfaces for specifying design matrices and outcomes
- `knn` and `naiveBayes` both return factor predictions by default, but this might not be the case for other packages. If we want class probabilities, we have to pass `prob = TRUE` to `knn` and `type = "raw"` to `predict.naiveBayes`, and the outputs are in entirely different formats (see the sketch below)
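To make the last point concrete, here is a minimal sketch of the two probability interfaces, reusing the objects defined above. `class::knn()` attaches only the vote share of the winning class to its factor of predictions, while `predict.naiveBayes()` returns a full posterior matrix.

knn_preds <- class::knn(test = test,
                        train = select(train, -Species),
                        cl = train$Species,
                        k = 5,
                        prob = TRUE)

# vote share of the predicted class only, hidden in an attribute
attr(knn_preds, "prob")

# a matrix of posterior probabilities, one column per class
nb_probs <- predict(nb_model, newdata = test, type = "raw")
head(nb_probs)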
That is, there isn’t a consistent interface across the packages. Additionally, the packages don’t share a conceptual framework that makes it easy to think about modelling.
The goal of this document is to provide a grammar of modelling that resolves both of these problems.