Part 3: The modelling process

People working with data create statistical models for three major reasons:

  • To describe data and perform exploratory data analysis
  • To estimate the (causal) effect of a treatment on a response
  • To develop a predictive model for some response of interest

A model is often the best descriptive statistic. - Frank Harrell on Twitter

https://twitter.com/f2harrell/status/983675033200549888

Exploratory data analysis might consist of:

  • using unsupervised learning techniques to find patterns
  • using supervised learning to look for patterns
  • asking and answering many rapidly formed causal or associational questions in quick succession (a rough sketch follows this list)
  • sharing these findings with others in technical or non-technical reports
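
To make this concrete, here is a minimal sketch of what that rapid, informal exploration might look like in R, using the built-in mtcars data and the broom package purely for illustration:

```r
library(broom)

# unsupervised: look for groups of similar cars
clusters <- kmeans(scale(mtcars[, c("mpg", "wt", "hp")]), centers = 3)
table(clusters$cluster)

# a rapidly formed associational question: how does weight relate to fuel economy?
quick_fit <- lm(mpg ~ wt, data = mtcars)
tidy(quick_fit)    # coefficient estimates and standard errors
glance(quick_fit)  # one-row model summary (R^2, AIC, ...)
```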

(Causal) inference might consist of:

  • following a preregistered analysis plan to fit several well-specified models to the data
  • running diagnostics to make sure modelling assumptions are satisfied

In theory this is straightforward if you’ve already done the conceptual work to figure out what should go in your model, what type of model you should fit, what the coefficients will mean, and so on. In a more exploratory analysis, however, you might brainstorm and try several different models to see which are appropriate for the data, and then need to select among several potentially appropriate candidates.
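
As a minimal sketch of the confirmatory version of this workflow in R, assume the model below was specified in advance; the data and formula are illustrative placeholders, not a recommendation:

```r
# prespecified model: the formula is fixed before looking at the results
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)          # coefficient estimates, standard errors, R^2

# diagnostics to check modelling assumptions
plot(fit, which = 1)  # residuals vs fitted values: look for non-linearity
plot(fit, which = 2)  # normal Q-Q plot of the residuals
plot(fit, which = 5)  # residuals vs leverage: influential observations
```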

Developing a predictive model might consist of:

  • Looking at the data and seeing what kinds of features you can engineer
  • Fitting a metric ton of models, some particularly smart ones suggested by looking at the data
  • Assessing the performance of each of the models
  • Comparing the models to see which has the best performance (as in the sketch after this list)
  • Making sure that complex models have converged (MCMC or gradient descent)
  • Probing each model with stuff like LIME to understand it and see avenues for improvement
  • Combining models into a well-tested ensemble for use in production, where the inputs can become all kinds of nonsense
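
A minimal sketch of that fit-assess-compare loop in R, using a simple train/test split and RMSE as the performance metric (the candidate models and data are illustrative placeholders):

```r
set.seed(1)

# hold out some rows for an honest assessment of predictive performance
test_rows <- sample(nrow(mtcars), size = 10)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# a few candidate model specifications
candidates <- list(
  small  = mpg ~ wt,
  medium = mpg ~ wt + hp,
  large  = mpg ~ wt + hp + disp + drat
)

# fit each candidate, predict on held-out data, and compute RMSE
rmse <- sapply(candidates, function(form) {
  fit <- lm(form, data = train)
  preds <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - preds)^2))
})

sort(rmse)  # smallest held-out error first
```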

Once our data scientist friend has data, the big picture is that they will:

  1. Spend most of their time cleaning and tidying the data to get it into a bare minimum usable format.

  2. Brainstorm appropriate models and model families to provide solutions for their inferential, predictive, and exploratory tasks.

  3. Fit a variety, perhaps many thousands, of these models and model families to the data, wanting to compare and iterate through the various candidates.

  4. Diagnose model fit, check model convergence, compare model performance, visualize model coefficients, and predict on new data. Inspect model parameters, compare fit statistics across many models, and assess a single model across many values of a hyperparameter (sketched after this list).

  5. Somehow turn this into a meaningful piece of information, business product, or a report to share with other people.
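
To illustrate steps (3) and (4), here is a minimal R sketch that fits a handful of model specifications, collects a one-row fit summary for each with the broom package, and assesses a single specification across several values of a tuning parameter; the formulas and data are placeholders rather than a recommendation:

```r
library(broom)

# step 3: fit several candidate specifications to the same data
formulas <- list(
  mpg ~ wt,
  mpg ~ wt + hp,
  mpg ~ poly(wt, 2) + hp
)
fits <- lapply(formulas, lm, data = mtcars)

# step 4: compare one-row fit summaries (R^2, AIC, ...) across models
do.call(rbind, lapply(fits, glance))

# step 4: inspect coefficients and diagnose fit for a single model
tidy(fits[[2]])     # coefficient table
augment(fits[[2]])  # fitted values and residuals, row by row

# step 4: assess one model across many values of a "hyperparameter"
# (the polynomial degree stands in for a tuning parameter here)
sapply(1:4, function(d) {
  glance(lm(mpg ~ poly(wt, d) + hp, data = mtcars))$AIC
})
```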

The goal is to focus on steps (3) and (4) in the hope that a coherent conceptual framework will emerge. Whether the key aspect of this framework is “tidyness” or something else remains to be seen.

Next we discuss some general properties that will make a modelling library desirable, both in terms of conceptual organization and practical considerations.