Part 1 Motivation

Each model in R essentially lives in its own package and has a unique interface. This imposes a large cognitive load on data analysts. For example, suppose we want to use k-nearest neighbours (KNN). We might do something like this:

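library(tidyverse)  # provides select() via dplyr
library(rsample)    # provides initial_split(), training(), testing()

# Randomly split iris into training and test sets (default: 3/4 training)
data <- initial_split(iris)

train <- training(data)
test  <- select(testing(data), -Species)  # predictors only; outcome dropped

# knn() takes predictors and class labels separately and returns predictions directly
knn_preds <- class::knn(test = test,
                        train = select(train, -Species),
                        cl = train$Species,
                        k = 5)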

But if we want to use naive Bayes, we might end up writing code that looks like this:

nb_model <- e1071::naiveBayes(Species ~ ., data = train)
nb_preds <- predict(nb_model, newdata = test)

This has some problems:

  • knn generates predictions immediately on a test set, while naiveBayes creates a model object
  • For knn we have to pass the arguments cl and k explicitly, even though k could reasonably be selected by cross-validation and cl could be expressed more succinctly as the outcome in a formula
  • knn and naiveBayes have different interfaces for specifying design matrices and outcomes
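  • knn and naiveBayes both return factor predictions by default, but this might not be the case for other packages. If we want class probabilities, we have to pass prob = TRUE to knn and type = "raw" to predict.naiveBayes, and the outputs are in entirely different formats: knn attaches the vote share of the winning class as a "prob" attribute, while predict.naiveBayes returns a matrix with one column per class.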
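
The last point is easiest to see side by side. The following is a minimal sketch rather than a recommended workflow: it rebuilds the split from above so that it runs on its own, and the seed and the names split, predictors, knn_out, knn_prob and nb_prob are assumptions introduced only for illustration.

```r
library(rsample)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
split <- initial_split(iris)
train <- training(split)
test  <- testing(split)
predictors <- setdiff(names(iris), "Species")

# knn: prob = TRUE attaches the winning class's vote share as an attribute
knn_out  <- class::knn(train = train[predictors],
                       test  = test[predictors],
                       cl    = train$Species,
                       k     = 5,
                       prob  = TRUE)
knn_prob <- attr(knn_out, "prob")  # numeric vector, one value per test row

# naiveBayes: type = "raw" returns a matrix with one column per class
nb_model <- e1071::naiveBayes(Species ~ ., data = train)
nb_prob  <- predict(nb_model, newdata = test[predictors], type = "raw")
```

Here knn_prob holds a single number per test row (the vote share of the predicted class only), whereas nb_prob holds a full row of per-class probabilities, so code written against one format cannot be reused for the other without conversion.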

That is, there isn’t a consistent interface to the packages themselves. Additionally, the packages don’t make use of a conceptual framework that makes it easy to think about modelling.

The goal of this document is to provide a grammar of modelling that resolves both of these problems.