Part 4: Interface desirables
4.1 Desirables necessary for conceptual clarity
Each model and model family corresponds to a single distinct class
This makes it easy to provide an interface-compliant template that researchers can fill in with new models they implement. It also allows them to write custom S3 methods, which they likely expect, and to provide smart cross-validation schemes when appropriate.
Type-safe in/out definitions for each method on each class
Users think in terms of model families
Since it’s most intuitive to think about model families, we want the interface to reflect this. For operations that require an actual model, such as prediction, users should be able to pass in model family objects, and then have the software automatically extract and use the best model in the model family. The interface should provide a clear idea of what it means to fit a model family, so that interested users can customize hyperparameter spaces and selection techniques.
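To make this concrete, here’s a minimal sketch of what that delegation could look like. All class and function names here are hypothetical, not an existing API:

```r
# Hypothetical sketch: a model_family bundles fitted candidate models with
# their cross-validated scores; predict() delegates to the best one.
new_model_family <- function(models, cv_scores) {
  structure(list(models = models, cv_scores = cv_scores),
            class = "model_family")
}

# The best model is the one with the lowest cross-validated error
# (assuming lower scores are better).
best_model <- function(family) {
  family$models[[which.min(family$cv_scores)]]
}

# S3 method: predicting with a family transparently uses its best model.
predict.model_family <- function(object, newdata, ...) {
  predict(best_model(object), newdata = newdata, ...)
}

# Usage: the user never has to pull out the winning model by hand.
fam <- new_model_family(
  models    = list(lm(mpg ~ wt, mtcars), lm(mpg ~ wt + hp, mtcars)),
  cv_scores = c(3.1, 2.6)  # e.g. cross-validated RMSE for each candidate
)
predict(fam, newdata = head(mtcars))  # delegates to the second model
```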
4.2 Desirables necessary for practicality
Uses as much language from Scikit-Learn as possible
Since Scikit-Learn is likely the most uniform and widely known interface for predictive modelling, adopting its vocabulary reduces cognitive load. Additionally, the Python ML world has a strong tradition of implementing new models against the Scikit-Learn interface, a tradition we hope will carry over to researchers implementing new methods in R.
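As an illustration (the spec classes below are hypothetical), the interface could expose Scikit-Learn’s core verbs as S3 generics, so that fit() and predict() mean the same thing for every model:

```r
# Hypothetical sketch: Scikit-Learn's fit/predict contract as S3 generics.
fit <- function(spec, data, ...) UseMethod("fit")

# A model conforms by providing fit() for its spec class and predict() for
# the fitted object it returns -- mirroring estimator.fit(X, y) / .predict(X).
fit.lm_spec <- function(spec, data, ...) {
  structure(list(object = lm(spec$formula, data = data)), class = "lm_fit")
}

predict.lm_fit <- function(object, newdata, ...) {
  predict(object$object, newdata = newdata, ...)
}

spec <- structure(list(formula = mpg ~ wt + hp), class = "lm_spec")
predict(fit(spec, mtcars), newdata = head(mtcars))
```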
Researchers can easily implement new models that conform to the interface
We often want to use multiple models at the same time, and it’s a nightmare when those models don’t share the same API.
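Even base R hints at the payoff: lm() and glm() already share enough of a predict() contract that comparing models is a one-liner, which is exactly the property a standardized interface would guarantee for every new model:

```r
# When models share an API, comparing them needs no per-model special cases.
fits <- list(
  linear = lm(mpg ~ wt + hp, data = mtcars),
  gamma  = glm(mpg ~ wt + hp, data = mtcars, family = Gamma(link = "log"))
)
preds <- lapply(fits, predict, newdata = head(mtcars), type = "response")
```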
Ensembling is easy
Most users interested in prediction will create an ensemble at some point. To appeal to the Kaggle crowd as much as possible, it should be easy to bag, boost and stack models to arbitrary depths.
Additionally, a number of R packages use the SuperLearner package as the basis for the Super Learner algorithm and for semi-parametric inference. If new machine learning techniques implement a standardized interface, then packages like SuperLearner (and also caret and mlr, for that matter) don’t need to implement wrappers for each new model.
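As a sketch of why a shared interface makes stacking straightforward, here is a minimal example in base R. For clarity it refits on in-sample predictions; a real implementation would stack out-of-fold predictions to avoid leakage:

```r
# Minimal stacking sketch: base models' predictions become features
# for a meta-model.
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

base_fits <- list(
  m1 = lm(mpg ~ wt + hp, data = train),
  m2 = lm(mpg ~ poly(wt, 2), data = train)
)

# Level-one features: each base model's predictions, via the shared API.
level1 <- data.frame(lapply(base_fits, predict, newdata = train),
                     mpg = train$mpg)
meta <- lm(mpg ~ ., data = level1)

# To predict, push new data through the base models, then the meta-model.
level1_test <- data.frame(lapply(base_fits, predict, newdata = test))
predict(meta, newdata = level1_test)
```

Because the stack itself could conform to the same fit()/predict() contract, stacks can nest to arbitrary depth.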
Facilitates automated machine learning
Recently, industrial data science has been trending toward automating as many redundant parts of the modelling process as possible. Examples include:
- H2O’s Automated Machine Learning, which has already landed in R with h2o.automl()
- H2O’s Driverless AI
- Airbnb’s modelling workflow
- TPOT
- mlrMBO
- Spearmint
In terms of implementation, this means it should be easy to programmatically generate and manipulate models and model families. This also means that hyperparameter space specifications should be flexible and informative enough to handle new hyperparameter search algorithms, and that it should be easy to specify smart defaults.
I think it would also be nice to “metalearn” good hyperparameters to try initially, as in auto-sklearn.
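One possible shape for such a specification (purely illustrative, not an existing spec): a hyperparameter space as an ordinary data structure that any search algorithm, from grid search to Bayesian optimization, can consume:

```r
# Hypothetical sketch: a declarative hyperparameter space that can be
# generated and manipulated programmatically, with smart defaults attached.
param_space <- list(
  mtry        = list(type = "integer", lower = 1L,  upper = 20L, default = 4L),
  sample_frac = list(type = "double",  lower = 0.1, upper = 1.0, default = 1.0)
)

# Any search algorithm can consume the same spec; random search shown here.
sample_config <- function(space) {
  lapply(space, function(p) {
    if (p$type == "integer") sample(p$lower:p$upper, 1)
    else runif(1, p$lower, p$upper)
  })
}

set.seed(42)
sample_config(param_space)  # one candidate configuration
```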
Tidy and pipeable data structures
The interface should conform to tidyverse principles, given the consistent standards and wide user adoption of the tidyverse.
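For example (fit_model() below is a hypothetical illustration), verbs that take a data frame as their first argument compose naturally with the pipe:

```r
library(magrittr)  # for %>%

# Hypothetical, data-first verb: pipeable because data is the first argument.
fit_model <- function(data, formula) lm(formula, data = data)

mtcars %>%
  fit_model(mpg ~ wt + hp) %>%
  predict(newdata = head(mtcars))
```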