Part 5 Proposed interface: model fitting
Let’s imagine an interface that makes operations on model and model_family objects feel natural in R. We begin by discussing the process of fitting models and model families, primarily from the perspective of predictive models.
In some sense, you can work with a model by fitting a model family on a hyperparameter grid containing a single point. This is the approach caret takes, and I believe the one present in the current interface proposal.
I think this is mildly problematic in terms of conceptual clarity, but majorly problematic in terms of implementing new models. If you’re implementing a new modelling technique, it makes much more sense to first write a fit method for models (i.e. glmnet::glmnet) and then to write a fit method for model families (i.e. a hyperparameter selection method) that may make heavy use of fit.model.
I want this separation because I think it’ll be key to selling the interface to people writing new methods.
5.1 Fitting a model object
To fit a model, we need:
- Data
- Specific hyperparameter values
- A way to train the model given data and specific hyperparameters
On the other hand, to fit a model family, we need:
- Data
- The hyperparameter space to consider
- A way to train the model given data and specific hyperparameters
- A way to search through hyperparameter space
- A way to determine which trained model is best
(Aside: “getting to data” involves data preprocessing, some of which is going to be model based: data augmentation, filtering and variable selection, up/downsampling, unsupervised transformations, etc. We may also use data to estimate certain values, maybe hyperparameter values, to input into a model; for example, using the scale of the data to define a hyperparameter search range.)
To find the best model in the model family, we need a performance metric, such as root mean squared error, and an estimate of this metric on out-of-sample data. This likely means getting multiple estimates of model performance by fitting the same model on resampled datasets.
(Aside: model calibration specifies both what you need to compare models and how to compare them. We should think about which types of resampling are compatible, i.e. which different types of CV you could safely use together.)
We first need to specify all of these in order to fit a model.
5.2 Objects in play
Recall that to fit a model, we need:
- Data
- Specific hyperparameter values
- A way to train the model given data and specific hyperparameters
On the other hand, to fit a model family, we need:
- Data
- The hyperparameter space to consider
- A way to train the model given data and specific hyperparameters
- A way to search through hyperparameter space
- A way to determine which trained model is best
Our task is now to design intuitive ways to specify all of these. Thankfully, Max Kuhn has already solved several of these problems for us:
- The recipes package creates maps from messy input data to design matrices, generalizing the formula. The learned map can then be applied to new data.
- The yardstick package provides tidy calculations of various performance metrics given predictions and the baseline truth.
- The tidyposterior package provides methods to compare models within a model family by comparing resampled performance metrics.
- The rsample package provides infrastructure for a variety of resampling strategies (although it does not provide a way to specify a resampling scheme beyond raw rset objects).
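For example, estimating out-of-sample RMSE by hand with rsample and yardstick might look roughly like this (a sketch only; the rmse_on_split helper is made up for illustration):

library(rsample)
library(yardstick)

boots <- bootstraps(mtcars, times = 20)

# hypothetical helper: fit on the analysis set, assess on the held-out set
rmse_on_split <- function(split) {
  fit <- lm(mpg ~ ., data = analysis(split))
  held_out <- assessment(split)
  rmse_vec(held_out$mpg, predict(fit, held_out))
}

# multiple out-of-sample estimates of performance, then their average
mean(vapply(boots$splits, rmse_on_split, numeric(1)))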
CRAN provides packages to fit most models of interest. This leaves us with a couple remaining problems that we will assume have nice solutions for the moment.
5.2.1 Model calibration
We need to be able to find the best model in a given model family. For the sake of this document, we’ll assume there’s an imaginary calibration object that consists of: (1) a resampling specification, (2) a performance metric, and (3) an appropriate strategy for comparing performance metrics.
As a concrete example, a calibration object might specify that each model in the model family should be fit on 20 bootstrap samples, and that the best model is the one with the lowest average RMSE across the resampled datasets.
TODO: min RMSE, or min RMSE within 1-SE following Breiman
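As a sketch, constructing such a calibration object might look like the following, reusing the calibration() arguments that appear later in this document (the compare argument is my own hypothetical addition):

boot_rmse <- calibration(score = "rmse",         # (2) performance metric
                         sampling = "bootstrap", # (1) resampling specification
                         reps = 20,
                         compare = "min_mean")   # (3) comparison strategy

A compare = "one_se" option could then implement the 1-SE rule from the TODO above.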
5.2.2 Hyperparameter space definition
To my knowledge, the various hyperparameter search methods use hyperparameter spaces defined as:
- Probability distributions over HP space (random search algorithms)
- Fixed sets of points in HP space (grid search, possibly Gaussian process or tree Parzen estimators with an initial grid)
GP/TPE could also use a probability distribution initially, with some smart initial sampling scheme to pick initial points. This is what mlrMBO does. auto-sklearn provides an initial grid for GP/TPE based on hyperparameter values that work well on a library of previous datasets and calls the approach “metalearning.”
So presumably we want hp_dist and hp_grid objects that both subclass hp_space objects. We could even provide semi-sane translation between the two:
- hp_grid_to_dist would guess the domain of the hyperparameters
- hp_dist_to_grid could sample at quantiles, or on a Latin hypercube design, or whatever is smartest
To specify hp_dist objects we should look at Hyperopt specifications. Doing things on log scale will probably be important, and we should think about important transformations for hyperparameters and how to handle them.
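For illustration, specifying the two kinds of hp_space might look like this (hp_grid, hp_dist, log_uniform and uniform are all hypothetical; the log-scale specification style is borrowed from Hyperopt):

# a fixed set of points in HP space
knn_grid <- hp_grid(k = c(3, 5, 7, 9),
                    metric = c("euclidean", "manhattan"))

# a probability distribution over HP space
glmnet_dist <- hp_dist(lambda = log_uniform(1e-4, 1e1),  # penalty on the log scale
                       alpha = uniform(0, 1))

# semi-sane translation between the two subclasses
glmnet_grid <- hp_dist_to_grid(glmnet_dist, n = 20)  # e.g. a Latin hypercube sample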
More broadly, the model/model family framework can extend beyond supervised learning. For k-means, you might want a fit.k_means_family to select k according to some reasonable strategy. Just something to keep in mind.
For now, I’m going to assume that the problem of hyperparameter space definition has been solved, and that there are nice hp_space objects that contain this information.
5.2.3 Hyperparameter search
Similarly, let’s assume that there are standard functions, of class hp_search, for searching through hyperparameter space.
5.2.4 Model and model family objects
5.2.4.1 model
- trained: a logical indicating if the model has been fit
- design: a recipe specifying a transformation into a design matrix
- hyperparameters: a named list of hyperparameters to fit the model
5.2.4.2 model_family
- trained: a logical indicating if the model has been fit
- design: a recipe specifying a transformation into a design matrix
- hp_space: an hp_space object
- hp_search: an hp_search object
- calibration: a calibration object
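As a sketch, a model_family could be an ordinary named list with a class attribute (the new_model_family constructor below is hypothetical):

# hypothetical low-level constructor for model_family objects
new_model_family <- function(design = NULL, hp_space = NULL, hp_search = NULL,
                             calibration = NULL, subclass = character()) {
  structure(
    list(trained = FALSE,          # nothing has been fit yet
         design = design,          # a recipe
         hp_space = hp_space,
         hp_search = hp_search,
         calibration = calibration),
    class = c(subclass, "model_family")
  )
}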
5.2.4.3 Pipeable helpers
Each of the following would accept a model or model_family object and update the appropriate field:
- add_design would return a model with an updated design field. It would be nice for this to be a generic that also had matrix and formula methods that promoted data up to recipes and data frames.
- add_hp_space, add_hp_search and add_calibration would work the same way.
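A minimal sketch of how add_design could dispatch on its design argument (everything here is hypothetical, except recipes::recipe):

add_design <- function(x, design, data, ...) {
  UseMethod("add_design", design)   # dispatch on the design, not the model
}

# recipes are stored as-is
add_design.recipe <- function(x, design, data, ...) {
  x$design <- design
  x
}

# formulas are promoted to recipes
add_design.formula <- function(x, design, data, ...) {
  x$design <- recipes::recipe(design, data = data)
  x
}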
5.3 Model Instantiation
Let’s assume we’d like to use the KNN model family.
In terms of implementation, I think things will be easiest if each model has a dedicated object initialization function. This function should return an object of class c("knn", "model_family"), with reasonable defaults in the hp_space, hp_search and calibration fields:
knn_family <- new_knn()
But since the current paradigm in R doesn’t involve instantiating model objects before fitting them, I think it would also be good to provide a wrapper called knn that first creates a knn object and then fits it. That is, the following should all be equivalent:
knn_fam_untrained <- new_knn()
knn_family <- fit(knn_fam_untrained, design, data)
knn_family <- knn(design, data)
knn_family <- new_knn() %>%
fit(design, data)
knn_family <- new_knn() %>%
add_design(design, data) %>%
fit()
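Building on the new_model_family sketch above, new_knn and the knn wrapper might look something like this (default_knn_hp_space, gaussian_process_opt and default_calibration are the hypothetical defaults used later in this section):

new_knn <- function() {
  new_model_family(
    hp_space = default_knn_hp_space,   # e.g. a reasonable grid over k and metric
    hp_search = gaussian_process_opt,
    calibration = default_calibration,
    subclass = "knn"
  )
}

# wrapper: instantiate, then fit
knn <- function(design, data, ...) {
  fit(new_knn(), design, data, ...)
}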
5.4 Model fitting
To fit a model object, we could then do any of the following, returning an object of class c("knn", "model").
knn_model <- knn(design, data, hp_space(k = 13, metric = "euclidean"))
knn_model <- fit(new_knn(), design, data,
hp_space = hp_space(k = 13, metric = "euclidean"))
knn_model <- new_knn() %>%
add_design(recipe, data) %>%
add_hp_space(k = 13, metric = "euclidean") %>%
fit()
Since we are fitting a model rather than a model_family here, we don’t need to specify a hyperparameter search algorithm or a performance assessment specification.
That is, you get a model back when there is a single set of hyperparameters in the hp_space, and a model_family anytime the hp_space specifies multiple (or infinitely many) hyperparameter combinations.
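One way fit could implement this rule (a hypothetical sketch; n_points, fit_single and search_and_calibrate are made-up helpers):

fit.model_family <- function(object, design, data, ...) {
  object <- add_design(object, design, data)
  if (n_points(object$hp_space) == 1) {
    fit_single(object, data)            # returns a "model"
  } else {
    search_and_calibrate(object, data)  # returns a trained "model_family"
  }
}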
To fit model_family objects, the following would be equivalent:
knn_family <- knn(design, data)
# and showing default arguments
knn_family <- fit(model_family = new_knn(), # not a default argument!
design = design,
data = data,
hp_space = default_knn_hp_space,
hp_search = gaussian_process_opt,
calibration = default_calibration)
For users departing from the defaults, this might look like
hyperparams <- hp_space(k = 3:4, metric = c("euclidean", "manhattan"))
resamp_spec <- calibration(score = "mae", sampling = "bootstrap", reps = 10)
knn_family <- new_knn() %>%
add_design(recipe, data) %>%
add_hp_space(hyperparams) %>%
add_hp_search(hyperband) %>%
add_calibration(resamp_spec) %>%
fit()
If you wanted to do inference on the best model in knn_family, you could get it with
best_knn_model <- extract_model(knn_family)
5.5 Prediction
Default predict methods should always return predictions of the same type as the input data. That is, if you specify a numeric outcome, you get a numeric prediction; if you specify a factor outcome, you get a factor prediction. This makes it easy for users to assess model performance, which is probably the first thing you want to do after predicting.
This would look like
predictions <- predict(knn_family, newdata)
predictions <- predict(best_knn_model, newdata)
For sanity and consistency with Scikit-Learn, I think it would be good to add a new generic predict_proba to get class probabilities for classification problems:
class_probs <- predict_proba(knn_family, newdata)
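A minimal sketch of the new generic (hypothetical; a model_family method could simply delegate to the best model in the family):

predict_proba <- function(object, newdata, ...) {
  UseMethod("predict_proba")
}

predict_proba.model_family <- function(object, newdata, ...) {
  # delegate to the best model in the family
  predict_proba(extract_model(object), newdata, ...)
}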
5.6 Shortcut methods
TODO: things like fit_predict, fit_transform, fit_score, etc.