Part 4 Interface desirables

4.1 Desirables necessary for conceptual clarity

Each model and model family corresponds to a single distinct class

This makes it easy to provide an interface compliant template that researchers can will fill in with new models they implement. It also allows them to write custom S3 methods, which they likely expect, and to provide smart cross-validation schemes when appropriate.

Type safe in/out definitions for each method on each class

Users think in terms of model familys

Since it’s most intuitive to think about model families, we want the interface to reflect this. For operations that require an actual model, such as prediction, users should be able to pass in model family objects, and then have the software automatically extract and use the best model in the model family. The interface should provide a clear idea of what it means to fit a model family, so that interested users can customize hyperparameter spaces and selection techniques.

4.2 Desirables necessary for practicality

Uses as much language from Scikit-Learn as possible

Since Scikit-Learn is likely the most uniform and widely known interface for predictive modelling, this reduces cognitive load. Additionally, the Python ML world has a strong tradition of implementing new models with the Scikit-Learn interface that we hope will carry over to new researchers implementing new methods in R.

Researchers can easily implement new models that conform to the interface

This is because we often want to use multiple models at the same time, and it’s a nightmare when these models don’t share the same API.

Ensembling is easy

Most users interested in prediction will create an ensemble at some point. To appeal to the Kaggle crowd as much as possible, it should be easy to bag, boost and stack models to arbitrary depths.

Additionally, a number of R packages use the SuperLearner package as the basis for Super Learner and semi-parametric inference. If new machine learning techniques implement a standardized interface, then packages like SuperLearner (and also caret and mlr for that matter) don’t need to implement wrappers for each new model.

Facilitates automated machine learning

Recently industrial data science has been tending toward automating as many redundant parts of the modelling process as possible. Take for example:

H2O’s Automated Machine Learning, which has already landed in R with h2o.automl()
H2O’s Driverless AI
Airbnb’s modelling workflow
TPOT
mlrMBO
Spearmint

In terms of implementation, this means it should be easy to programmatically generate and manipulate model and model familys. This also means that hyperparameter space specifications should be flexible and informative enough to handle new hyperparameter search algorithms, and that it should be easy to specify smart defaults.

I think it would also be nice to “metalearn” good hyperparameters to try initially, as in auto-sklearn.

Tidy and pipeable data structures

The interface should conform to tidyverse principles, given the consistent standards and wide user adoption of the tidyverse.