Model Selection (Draft)#
Introduction#
Model selection is the process of estimating the performance of candidate models in order to score them or choose the best one. The central problem is to trade off model fit against model complexity, which is commonly understood as the effective number of parameters in the model. Since increasing the number of parameters generally yields a better fit, models are compared by balancing these two quantities, as shown in Fig. 46: model complexity and model fit. The more complex a model is, the more closely it will try to fit every sample, including abnormal sample points (outliers), which causes the model to over-fit. Likewise, adapting to random fluctuations (noise) that corrupt normal sample points also induces over-fitting, as the model begins to fit the noise rather than the signal in the data. A simple model not only reduces the risk of over-fitting but is also more convincing because it has fewer parameters. This principle is known as Occam's razor (also spelled Ockham's razor): "plurality should not be posited without necessity", which can be paraphrased as: the simplest (good) explanation is the best.
Model complexity: can be evaluated by the number of free parameters in the model.
Model fit: how well the model fits the recorded data; can be evaluated by the likelihood.
Probabilistic models can be compared using several information criteria, depending on the user's considerations. These criteria avoid the over-fitting of more complex models by adding a penalty term: the penalty discourages over-fitting because increasing the number of parameters in a model almost always improves the goodness of fit. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are widely used model selection criteria that combine the maximum likelihood with the number of model parameters.
Akaike Information Criterion (AIC)#
AIC is a criterion developed by [Akaike, 1998, Akaike, 1974] as a measure of the goodness of fit of an estimated statistical model for a given set of data. It is derived from frequentist probability and inference, where probabilities are fundamentally related to the frequencies of events. The AIC is formally defined as

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}),$$

where \(\hat{L}\) is the maximum value of the likelihood function for the model and \(k\) is the number of estimated parameters in the model. The criterion selects the model with the minimum AIC value.
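As a minimal sketch of how the AIC could be evaluated in practice (assuming independent Gaussian residuals, so that the maximized log-likelihood follows from the residual sum of squares; the function names are illustrative, not from any particular library):

```python
import numpy as np

def gaussian_max_log_likelihood(residuals):
    """Maximized log-likelihood of i.i.d. Gaussian residuals, with the
    noise variance set to its maximum-likelihood estimate."""
    n = residuals.size
    sigma2 = np.mean(residuals**2)                 # ML estimate of the noise variance
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)

def aic(max_log_likelihood, k):
    """AIC = 2k - 2 ln(L_hat); smaller values indicate a better model."""
    return 2 * k - 2 * max_log_likelihood
```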
Bayesian Information Criterion (BIC)#
BIC is a criterion developed by [Schwarz, 1978] for model selection among a class of parametric models with different numbers of parameters. It is derived from Bayesian probability and inference, where probabilities are fundamentally related to prior knowledge about an event.
The BIC is formally defined as

$$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}),$$

where \(\hat{L}\) is the maximum value of the likelihood function for the model, \(k\) is the number of estimated parameters in the model, and \(n\) is the number of data points. The criterion selects the model with the minimum BIC value.
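A corresponding sketch for the BIC, under the same assumptions as the AIC example above (hypothetical helper code, not a library function):

```python
import numpy as np

def bic(max_log_likelihood, k, n):
    """BIC = k ln(n) - 2 ln(L_hat); smaller values indicate a better model."""
    return k * np.log(n) - 2 * max_log_likelihood
```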
For moderate and large samples (whenever \(\ln(n) > 2\), i.e. \(n \geq 8\)), the penalty term in the BIC, \(k\ln(n)\), is larger than the corresponding term in the AIC, \(2k\), so the BIC penalizes model complexity more heavily than the AIC. More complex models therefore receive a larger (worse) BIC value and are less likely to be selected; conversely, the AIC tends to pick more complex models than the BIC.
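To illustrate the different penalties, the following sketch fits polynomials of increasing degree to synthetic noisy data and compares the resulting AIC and BIC values. It assumes Gaussian noise and counts only the polynomial coefficients as free parameters (one could also count the estimated noise variance); the data and setup are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a quadratic signal corrupted by Gaussian noise.
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)
n = x.size

for degree in range(1, 7):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    k = degree + 1                                  # number of polynomial coefficients
    sigma2 = np.mean(residuals**2)                  # ML estimate of the noise variance
    log_l = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    print(f"degree={degree}  AIC={2*k - 2*log_l:7.2f}  BIC={k*np.log(n) - 2*log_l:7.2f}")
```

Here \(n = 50\), so \(\ln(n) \approx 3.9 > 2\): the BIC penalty per parameter is roughly twice the AIC penalty, and the BIC will typically settle on the lower-degree model when several degrees fit comparably well.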
References#
- Aka98
Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.
- Aka74
Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
- Sch78
Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, pages 461–464, 1978.
Contributors#
Maqsood Rajput, Frederik Tilmann, Jan Freund