Some thoughts on modelling in R
Alex Hayes
2018-04-10
Part 1 Motivation
Each model in R essentially lives in its own package and has a unique interface. This places a large cognitive load on data analysts. For example, suppose we want to use k-nearest neighbors (KNN). We might do something like this:
library(tidyverse)
library(rsample)
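# split iris into training and test sets, dropping the outcome from the test set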
data <- initial_split(iris)
train <- training(data)
test <- select(testing(data), -Species)
knn_preds <- class::knn(test = test,
                        train = select(train, -Species),
                        cl = train$Species,
                        k = 5)
But if we want to use naive Bayes, we might end up writing code that looks like this:
nb_model <- e1071::naiveBayes(Species ~ ., data = train)
nb_preds <- predict(nb_model, newdata = test)
This has some problems:
- `knn` generates predictions immediately on a test set, while `naiveBayes` creates a model object
- For `knn` we have to pass arguments `cl` and `k`, even though it would be reasonable to select `k` by cross-validation, and `cl` could be more succinctly expressed as an outcome in a formula
- `knn` and `naiveBayes` have different interfaces for specifying design matrices and outcomes
- `knn` and `naiveBayes` both return factor predictions by default, but this might not be the case for other packages. If we want class probabilities, we have to pass `prob = TRUE` to `knn` and `type = "raw"` to `predict.naiveBayes`, and the outputs are in entirely different formats (see the sketch below)
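To make the last point concrete, here is a minimal sketch of the two probability interfaces, reusing the objects defined above. `class::knn()` attaches only the vote share of the winning class to its factor of predictions, while `predict.naiveBayes()` returns a full posterior matrix.

knn_preds <- class::knn(test = test,
                        train = select(train, -Species),
                        cl = train$Species,
                        k = 5,
                        prob = TRUE)

# vote share of the predicted class only, hidden in an attribute
attr(knn_preds, "prob")

# a matrix of posterior probabilities, one column per class
nb_probs <- predict(nb_model, newdata = test, type = "raw")
head(nb_probs)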
That is, there isn’t a consistent interface across the packages. Additionally, the packages don’t share a conceptual framework that makes it easy to think about modelling.
The goal of this document is to provide a grammar of modelling that resolves both of these problems.