Part 5 Proposed interface: model fitting

Let’s imagine an interface that makes operations on models and model families feel natural in R. We begin by discussing the process of fitting models and model families, primarily from the perspective of predictive models.

In some sense, you can work with a model by fitting a model family whose hyperparameter grid contains a single point. This is the approach caret takes, and I believe the one present in the current interface proposal.

I think this is mildly problematic in terms of conceptual clarity, but majorly problematic in terms of implementing new models. If you’re implementing a new modelling technique, it makes a lot more sense to first write a fit method for models (e.g. glmnet::glmnet) and then to write a fit method for model families (i.e. a hyperparameter selection method) that may make heavy use of fit.model.

I want this separation because I think it’ll be key to selling an interface to people writing new methods.

5.1 Fitting a model object

  • Getting to the data: some data preprocessing is going to be model-based

  • Data augmentation, filtering and variable selection, up/down-sampling, unsupervised transformations, etc.

To fit a model, we need:

  • Data
  • Specific hyperparameter values
  • A way to train the model given data and specific hyperparameters

On the other hand, to fit a model family, we need:

  • Data
  • The hyperparameter space to consider
  • A way to train the model given data and specific hyperparameters
  • A way to search through hyperparameter space
  • A way to determine which trained model is best

Fitting a model family may also use the data itself to estimate certain values (maybe hyperparameter values) that feed into the models; for example, the scale of the data might be used to define a hyperparameter search range.

To find the best model in the model family, we need a performance metric, such as root mean squared error, and an estimate of that metric on out-of-sample data. This likely means getting multiple estimates of model performance by fitting the same model on resampled datasets.

Model calibration specifies both what you need in order to compare models and how to compare them. Something to think about: which types of resampling are compatible, i.e. which kinds of cross-validation you could safely use together.

We first need to specify all of these in order to fit a model.

5.2 Objects in play

Recall that to fit a model, we need:

  • Data
  • Specific hyperparameter values
  • A way to train the model given data and specific hyperparameters

On the other hand, to fit a model family, we need:

  • Data
  • The hyperparameter space to consider
  • A way to train the model given data and specific hyperparameters
  • A way to search through hyperparameter space
  • A way to determine which trained model is best

Our task is now to design intuitive ways to specify all of these. Thankfully, Max Kuhn has already solved several of these problems for us:

  • The recipes package creates maps from messy input data to design matrices, generalizing the formula interface. The learned map can then be applied to new data.
  • The yardstick package provides tidy calculations of various performance metrics given predictions and the observed truth.
  • The tidyposterior package provides methods to compare models within a model family by comparing resampled performance metrics.
  • The rsample package provides infrastructure for a variety of resampling strategies (although it does not provide a way to specify a resampling scheme beyond raw rsets). Together, rsample and yardstick already cover resampled performance estimation, as sketched below.
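For instance, a resampled out-of-sample error estimate can already be assembled from rsample and yardstick alone. This is a minimal sketch, not part of the proposed interface, using a plain lm fit on mtcars:

library(rsample)
library(yardstick)

# 20 bootstrap resamples of the data
boots <- bootstraps(mtcars, times = 20)

# fit on the analysis set, compute RMSE on the held-out assessment set
boot_rmse <- function(split) {
  fit <- lm(mpg ~ ., data = analysis(split))
  holdout <- assessment(split)
  rmse_vec(truth = holdout$mpg, estimate = predict(fit, holdout))
}

# average the resampled estimates
mean(vapply(boots$splits, boot_rmse, numeric(1)))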

CRAN provides packages to fit most models of interest. This leaves us with a couple remaining problems that we will assume have nice solutions for the moment.

5.2.1 Model calibration

We need to be able to find the best model in a given model family. For the sake of this document, we’ll assume there’s an imaginary calibration object that consists of: (1) a resampling specification, (2) a performance metric and (3) an appropriate strategy for comparing performance metrics.

As a concrete example, a calibration object might specify that each model in the model family should be fit on 20 bootstrap samples, and that the best model is the one with the lowest average RMSE across the resampled datasets.

TODO: min RMSE, or min RMSE within 1-SE following Breiman
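A minimal sketch of what such an object could look like, assuming the calibration() constructor signature used later in this document (score, sampling, reps) plus a hypothetical compare field; the field names are illustrative only:

calibration <- function(score = "rmse",
                        sampling = "bootstrap",
                        reps = 20,
                        compare = c("min", "one_se")) {
  compare <- match.arg(compare)
  structure(
    list(score = score, sampling = sampling, reps = reps, compare = compare),
    class = "calibration"
  )
}

# fit each model in the family on 20 bootstrap resamples and pick the model
# with the lowest average RMSE (or the 1-SE rule, per the TODO above)
default_calibration <- calibration(score = "rmse", sampling = "bootstrap", reps = 20)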

5.2.2 Hyperparameter space definition

To my knowledge, the various hyperparameter search methods use hyperparameter spaces defined as:

  • Probability distributions over hyperparameter space (random search algorithms)
  • Fixed sets of points in hyperparameter space (grid search; possibly also Gaussian processes or tree-structured Parzen estimators, which can use a fixed set as an initial grid)

GP/TPE could also start from a probability distribution, with some smart sampling scheme to pick the initial points. This is what mlrMBO does. auto-sklearn provides an initial grid for GP/TPE based on hyperparameter values that have worked well on a library of previous datasets, and calls the approach “metalearning.”

So presumably we want hp_dist and hp_grid objects that both subclass hp_space. We could even provide semi-sane translations between the two:

  • hp_grid_to_dist would guess the domain of the hyperparameters
  • hp_dist_to_grid could sample at quantiles, or on a Latin hypercube design, or whatever is smartest

To specify hp_dist objects we should look at Hyperopt’s specifications. Doing things on the log scale will probably be important, and we should think about common transformations for hyperparameters and how to handle them.
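As a rough illustration (all constructor names and layouts here are hypothetical), an hp_grid could store a fixed set of points while an hp_dist stores sampling functions, which makes log-scale parameters easy to express:

hp_grid <- function(...) {
  # a fixed set of points: all combinations of the supplied values
  points <- expand.grid(..., stringsAsFactors = FALSE)
  structure(list(points = points), class = c("hp_grid", "hp_space"))
}

hp_dist <- function(...) {
  # a distribution over hyperparameter space: one sampling function per
  # hyperparameter, e.g. a log-uniform sampler for a penalty parameter
  structure(list(samplers = list(...)), class = c("hp_dist", "hp_space"))
}

hp_dist_to_grid <- function(dist, n = 25) {
  # crude translation: draw n points from each sampler
  structure(
    list(points = as.data.frame(lapply(dist$samplers, function(f) f(n)))),
    class = c("hp_grid", "hp_space")
  )
}

knn_space  <- hp_grid(k = 3:10, metric = c("euclidean", "manhattan"))
enet_space <- hp_dist(lambda = function(n) 10^runif(n, -4, 1),
                      alpha  = function(n) runif(n))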

More broadly, the model/model family framework can extend beyond supervised learning. For k-means, you might want a fit.k_means_family to select k according to some reasonable strategy. Just something to keep in mind.
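For instance, a fit method for a k-means family might look roughly like the following sketch; fit.k_means_family, the k_grid argument, and the silhouette-based selection of k are all assumptions rather than part of any proposal here:

library(cluster)  # for silhouette()

fit.k_means_family <- function(object, data, k_grid = 2:10, ...) {
  d <- dist(data)
  # score each candidate k by average silhouette width
  scores <- vapply(k_grid, function(k) {
    km <- stats::kmeans(data, centers = k, nstart = 10)
    mean(cluster::silhouette(km$cluster, d)[, "sil_width"])
  }, numeric(1))
  best_k <- k_grid[which.max(scores)]
  stats::kmeans(data, centers = best_k, nstart = 10)
}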

For now, I’m going to assume that the problem of hyperparameter space definition has been solved, and that there are nice hp_space objects that contain this information.

5.2.4 Model and model family objects

5.2.4.1 model

  • trained: a logical indicating if the model has been fit
  • design: a recipe specifying a transformation into a design matrix
  • hyperparameters: a named list of hyperparameter values used to fit the model

5.2.4.2 model_family

  • trained: a logical indicating if the model family has been fit
  • design: a recipe specifying a transformation into a design matrix
  • hp_space: an hp_space object
  • hp_search: an hp_search object
  • calibration: a calibration object
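Low-level constructors for these two objects might look something like this sketch, under the field layout above (new_model and new_model_family are assumed names):

new_model <- function(design = NULL, hyperparameters = list()) {
  structure(
    list(trained = FALSE, design = design, hyperparameters = hyperparameters),
    class = "model"
  )
}

new_model_family <- function(design = NULL, hp_space = NULL,
                             hp_search = NULL, calibration = NULL) {
  structure(
    list(trained = FALSE, design = design, hp_space = hp_space,
         hp_search = hp_search, calibration = calibration),
    class = "model_family"
  )
}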

5.2.4.3 Pipeable helpers

Each of the following would accept a model or model_family object and update the appropriate field:

  • add_design would return the model (or model_family) with an updated design field. It would be nice for this to be a generic that also had matrix and formula methods that promoted the data up to recipes and data frames (a sketch follows this list)
  • add_hp_space, add_hp_search and add_calibration would work the same way
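A minimal sketch of add_design under those assumptions; the formula-to-recipe promotion uses recipes::recipe, and everything else here is made up:

add_design <- function(object, design, data = NULL) {
  stopifnot(inherits(object, "model") || inherits(object, "model_family"))
  if (inherits(design, "formula")) {
    # promote a formula (plus data) to a recipe
    design <- recipes::recipe(design, data = data)
  }
  object$design <- design
  object
}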

5.3 Model instantiation

Let’s assume we’d like to use the KNN model family.

In terms of implementation, I think things will be easiest if each modelling technique has a dedicated object initialization function. This function should return an object of class c("knn", "model_family"), with reasonable defaults in the hp_space, hp_search and calibration fields:

knn_family <- new_knn()
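A sketch of what new_knn() might look like, reusing the hypothetical constructors above; the defaults shown are illustrative guesses, not real defaults:

new_knn <- function() {
  structure(
    list(
      trained     = FALSE,
      design      = NULL,
      hp_space    = hp_grid(k = 1:20, metric = c("euclidean", "manhattan")),
      hp_search   = "gaussian_process_opt",  # placeholder for an hp_search object
      calibration = calibration(score = "rmse", sampling = "bootstrap", reps = 20)
    ),
    class = c("knn", "model_family")
  )
}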

But since the current paradigm in R doesn’t involve instantiating model objects before fitting them, I think it would also be good to provide a wrapper called knn that first creates a knn object and then fits it. That is, the following should all be equivalent:

knn_fam_untrained <- new_knn()
knn_family <- fit(knn_fam_untrained, design, data)

knn_family <- knn(design, data)

knn_family <- new_knn() %>% 
  fit(design, data)

knn_family <- new_knn() %>% 
  add_design(design, data) %>% 
  fit()

5.4 Model fitting

To fit a model object, we could then do any of the following, returning an object of class c("knn", "model").

knn_model <- knn(design, data, hp_space(k = 13, metric = "euclidean"))

knn_model <- fit(new_knn(), design, data,
                 hp_space = hp_space(k = 13, metric = "euclidean"))

knn_model <- new_knn() %>% 
  add_design(recipe, data) %>% 
  add_hp_space(k = 13, metric = "euclidean") %>% 
  fit()

Since we are fitting a model rather than a model_family here, we don’t need to specify a hyperparameter search algorithm or a performance assessment specification.

That is, you get a model back when there is a single set of hyperparameters in the hp_space, and a model_family any time the hp_space specifies multiple (or infinitely many) hyperparameter combinations.
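In code, the dispatch might look roughly like this sketch; n_points, fit_single, and fit_family are hypothetical helpers, and a fit() generic is assumed:

fit.model_family <- function(object, design, data, ...) {
  if (n_points(object$hp_space) == 1) {
    # exactly one hyperparameter combination: return a fitted model
    fit_single(object, design, data, ...)
  } else {
    # a genuine search space: tune, calibrate, and return a fitted model_family
    fit_family(object, design, data, ...)
  }
}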

To fit model_family objects, the following would be equivalent:

knn_family <- knn(design, data)

# and showing default arguments

knn_family <- fit(model_family = new_knn(), # not a default argument!
                  design = design,
                  data = data,
                  hp_space = default_knn_hp_space,
                  hp_search = gaussian_process_opt,
                  calibration = default_calibration)

For users departing from the defaults, this might look like

hyperparams <- hp_space(k = 3:4, metric = c("euclidean", "manhattan"))
resamp_spec <- calibration(score = "mae", sampling = "bootstrap", reps = 10)

knn_family <- new_knn() %>% 
  add_design(recipe, data) %>% 
  add_hp_space(hyperparams) %>% 
  add_hp_search(hyperband) %>% 
  add_calibration(resamp_spec) %>% 
  fit()

If you wanted to do inference on the best model in knn_family, you could get it with

best_knn_model <- extract_model(knn_family)
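A sketch of what extract_model might do, assuming the fitted family stores its calibrated winner in a best_model field (that field name is an assumption):

extract_model <- function(family) {
  stopifnot(inherits(family, "model_family"), isTRUE(family$trained))
  family$best_model
}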

5.5 Prediction

Default predict methods should always return predictions of the same type as the input data. That is, if you specify a numeric outcome, you get a numeric prediction; if you specify a factor outcome, you get a factor prediction. This makes it easy for users to assess model performance, which is probably the first thing you want to do after predicting.

This would look like

predictions <- predict(knn_family, newdata)
predictions <- predict(best_knn_model, newdata)

For sanity and consistency with Scikit-Learn, I think it would be good to add a new generic predict_proba to get class probabilities for classification problems:

class_probs <- predict_proba(knn_family, newdata)
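Defining the generic is straightforward; the per-model methods would do the real work. A sketch, where the tibble-of-class-probabilities return convention is an assumption:

predict_proba <- function(object, newdata, ...) {
  UseMethod("predict_proba")
}

# each model type would provide a method returning one probability column per
# class, e.g. predict_proba.knn would ask the underlying engine for class
# probabilities and return them as a tibble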

5.6 Shortcut methods

TODO: things like

fit_predict, fit_transform, fit_score, etc
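One of these could be as simple as the following sketch (fit_predict is hypothetical, composed from the fit and predict generics above):

fit_predict <- function(object, design, data, newdata = data, ...) {
  fitted <- fit(object, design, data, ...)
  predict(fitted, newdata)
}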