Some thoughts on modelling in R
Alex Hayes
2018-04-10
Part 1 Motivation
Each model in R essentially lives in its own package and has a unique interface, which places a large cognitive load on data analysts. For example, suppose we want to use k-nearest neighbors (KNN) for classification. We might do something like this:
```r
library(tidyverse)
library(rsample)

data <- initial_split(iris)
train <- training(data)
test <- select(testing(data), -Species)

knn_preds <- class::knn(test = test,
                        train = select(train, -Species),
                        cl = train$Species,
                        k = 5)
```

But if we want to use naive Bayes, we might end up writing code that looks like this:
```r
nb_model <- e1071::naiveBayes(Species ~ ., data = train)
nb_preds <- predict(nb_model, newdata = test)
```

This has some problems:
- `knn` generates predictions immediately on a test set, while `naiveBayes` creates a model object
- For `knn` we have to pass arguments `cl` and `k`, even though it would be reasonable to select `k` by cross-validation, and `cl` could be more succinctly expressed as an outcome in a formula
- `knn` and `naiveBayes` have different interfaces for specifying design matrices and outcomes
- `knn` and `naiveBayes` both return factor predictions by default, but this might not be the case for other packages. If we want class probabilities, we have to pass `prob = TRUE` to `knn` and `type = "raw"` to `predict.naiveBayes`, and the outputs are in entirely different formats (see the sketch after this list)
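To make that last inconsistency concrete, here is a short sketch of the two probability interfaces, reusing the `train`, `test`, and `nb_model` objects from above:

```r
# knn() returns a factor of predictions; with prob = TRUE, the proportion
# of votes for the *winning* class is attached as an attribute:
knn_preds <- class::knn(test = test,
                        train = select(train, -Species),
                        cl = train$Species,
                        k = 5,
                        prob = TRUE)
head(attr(knn_preds, "prob"))  # numeric vector, one entry per test row

# predict.naiveBayes() instead returns a full matrix of class probabilities:
nb_probs <- predict(nb_model, newdata = test, type = "raw")
head(nb_probs)  # matrix with one column per species
```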
That is, there isn’t a consistent interface to the packages themselves. Additionally, the packages don’t make use of a conceptual framework that makes it easy to think about modelling.
The goal of this document is to provide a grammar of modelling that resolves both of these problems.
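As a taste of what such a grammar could buy us, here is a minimal sketch of a consistent interface, again reusing `train` and `test` from above. The `knn_spec()` and `nb_spec()` constructors and the `predict_classes()` generic are hypothetical, invented purely for illustration; they are not part of any existing package:

```r
library(dplyr)

# Each constructor captures the model specification the same way:
# a formula plus a data frame.
knn_spec <- function(formula, data, k = 5) {
  structure(list(formula = formula, data = data, k = k), class = "knn_spec")
}

nb_spec <- function(formula, data) {
  structure(list(fit = e1071::naiveBayes(formula, data = data)),
            class = "nb_spec")
}

# One generic for prediction, regardless of the underlying package.
predict_classes <- function(model, new_data) UseMethod("predict_classes")

predict_classes.knn_spec <- function(model, new_data) {
  outcome <- all.vars(model$formula)[1]
  class::knn(train = select(model$data, -all_of(outcome)),
             test  = new_data,
             cl    = model$data[[outcome]],
             k     = model$k)
}

predict_classes.nb_spec <- function(model, new_data) {
  predict(model$fit, newdata = new_data)
}

# Both models now share one interface and return predictions the same way:
predict_classes(knn_spec(Species ~ ., data = train, k = 5), test)
predict_classes(nb_spec(Species ~ ., data = train), test)
```

The point is not these particular functions, but that downstream code no longer needs to know which package implements the model.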