set up some example data
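The original data setup is not shown here, so the following is only a stand-in: a minimal sketch with a hypothetical `color` predictor and response `y`, where the test set contains a level (`"purple"`) that never appears in training.

```r
# Hypothetical training data: one categorical predictor and a numeric response
train <- data.frame(
  color = factor(c("red", "green", "blue", "red", "green", "blue")),
  y     = c(1.2, 0.7, 3.1, 1.0, 0.9, 2.8)
)

# Hypothetical test data with a novel level ("purple") not seen in training
test <- data.frame(
  color = factor(c("red", "purple"))
)

# Compare the level sets to see which test levels are new
setdiff(levels(test$color), levels(train$color))
```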

warning: dummy coding variables results in information loss

random forest example where things go wrong because the factors are now treated as integers
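This is not the random forest example itself, just a small base R sketch of the underlying problem: converting a factor to integers imposes an ordering (alphabetical by default) that the model can then split on as if it were meaningful. The `color` vector is made up for illustration.

```r
# Hypothetical categorical variable with no natural ordering
color <- factor(c("red", "green", "blue", "green"))

# Integer codes follow the (alphabetical) level order: blue = 1, green = 2, red = 3
codes <- as.integer(color)

# A tree-based model given `codes` can split on e.g. codes < 2.5,
# which groups blue with green purely because of alphabetical order
data.frame(color, codes)
```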

if the model you want to use can handle new factor levels on its own, let it!

if you want to test if method x will explode, do something like this:
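A minimal sketch of such a check, using `lm()` as a stand-in for "method x" and made-up data. The call to `predict()` is wrapped in `tryCatch()` so the failure is captured as a message instead of stopping the script.

```r
# Hypothetical training and test data; "purple" only appears in the test set
train <- data.frame(
  color = factor(c("red", "green", "blue", "red", "green", "blue")),
  y     = c(1.2, 0.7, 3.1, 1.0, 0.9, 2.8)
)
test <- data.frame(color = factor(c("red", "purple")))

# Fit on training data only
fit <- lm(y ~ color, data = train)

# Try to predict on data with a novel level and capture any error message
result <- tryCatch(
  predict(fit, newdata = test),
  error = function(e) conditionMessage(e)
)

result
```

If `result` comes back as a character string rather than a numeric vector, the method could not cope with the novel level.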

…here we see that things have broken. what are our options?

options

  • convert novel levels to the mode (the most common level in the training data)
  • likelihood encoding – especially for factors with a huge number of levels
  • one hot encodings – the columns are linearly dependent, so lots of GLM / unpenalized likelihood methods will break, and you’ll get identifiability issues
  • treat novel levels as missing data, if the prediction method can still make predictions with missing values
  • the dummy variables vignette also says you can use integer encoding or step_other. integer encoding is not recommended. example of how step_other might work.
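Three of these options (mode replacement, missing values, and likelihood encoding) can be sketched in base R. The data and variable names are hypothetical, and the likelihood encoding here is the simplest possible version (the per-level mean of the response, with the overall mean as a fallback for novel levels); real implementations usually add some form of shrinkage.

```r
# Hypothetical training data and test values with a novel level
train <- data.frame(
  color = factor(c("red", "red", "green", "blue")),
  y     = c(1, 3, 5, 7)
)
test_color <- c("red", "purple")

known <- test_color %in% levels(train$color)

# Option: replace novel levels with the most common training level
mode_level <- names(which.max(table(train$color)))
as_mode <- factor(ifelse(known, test_color, mode_level),
                  levels = levels(train$color))

# Option: treat novel levels as missing (unknown levels become NA)
as_missing <- factor(test_color, levels = levels(train$color))

# Option: naive likelihood encoding, per-level mean of y from training data
level_means <- vapply(split(train$y, train$color), mean, numeric(1))
encoded <- unname(level_means[match(test_color, names(level_means))])
# Novel levels fall back to the overall training mean
encoded[is.na(encoded)] <- mean(train$y)

as_mode
as_missing
encoded
```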

one hot encodings with recipes
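A minimal sketch of a one hot encoding with recipes, assuming a recent version of the package (one that provides `all_nominal_predictors()`) and made-up data. Setting `one_hot = TRUE` in `step_dummy()` keeps an indicator column for every level rather than dropping a reference level.

```r
library(recipes)

# Hypothetical training data
train <- data.frame(
  color = factor(c("red", "green", "blue", "red")),
  y     = c(1, 2, 3, 4)
)

# Define the recipe: one indicator column per level of each nominal predictor
rec <- recipe(y ~ color, data = train)
rec <- step_dummy(rec, all_nominal_predictors(), one_hot = TRUE)

# Estimate the encoding from the training data, then apply it to that data
prepped <- prep(rec, training = train)
baked <- bake(prepped, new_data = NULL)

baked
```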

what happens to novel factor levels
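One way to handle novel levels explicitly is to put `step_novel()` before `step_dummy()`, which reserves an extra level (named `"new"` by default) for anything unseen at training time. The sketch below uses made-up data and assumes a recent version of recipes; without `step_novel()`, the exact warning text and output for an unseen level may differ between versions.

```r
library(recipes)

# Hypothetical data; "purple" only appears in the test set
train <- data.frame(
  color = factor(c("red", "green", "blue", "red")),
  y     = c(1, 2, 3, 4)
)
test <- data.frame(
  color = factor(c("red", "purple")),
  y     = c(1.5, 2.5)
)

rec <- recipe(y ~ color, data = train)
# Reserve a "new" level for values not seen during training
rec <- step_novel(rec, all_nominal_predictors())
rec <- step_dummy(rec, all_nominal_predictors(), one_hot = TRUE)

prepped <- prep(rec, training = train)

# The "purple" row should be flagged in the color_new column
baked_test <- bake(prepped, new_data = test)
baked_test
```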

For additional details, please read the full vignette on dummy variables in the recipes package.