Two levels of inference are involved in the task of data modelling. At the first level of inference, we assume that one of the models that we invented is true, and we fit that model to the data. ... The results of this inference are often summarised by the most probable parameter values and error bars on those parameters. ... The second level of inference is the task of model comparison. Here, we wish to compare the models in the light of the data, and assign some sort of preference or ranking to the alternatives.

$$ P(w|D, H_i) = \frac{P(D|w, H_i)P(w|H_i)}{P(D|H_i)} $$

Posterior = Likelihood * Prior / Evidence.

"Evidence" is commonly ignored at the first level of inference, but becomes important in model comparison.
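As a concrete sketch of the relation above, the posterior over a single parameter can be tabulated on a grid; the model, datum, and grid below are hypothetical choices for illustration, not from the original text. The evidence appears only as the normalising constant of the numerator:

```python
import numpy as np

# Hypothetical illustration (not from the text): one datum D = 1.2
# assumed drawn from N(w, 1), with a prior w ~ N(0, 1).
w = np.linspace(-5.0, 5.0, 2001)
dw = w[1] - w[0]
D = 1.2

likelihood = np.exp(-0.5 * (D - w) ** 2) / np.sqrt(2 * np.pi)  # P(D|w,H)
prior = np.exp(-0.5 * w ** 2) / np.sqrt(2 * np.pi)             # P(w|H)

# At the first level of inference the evidence is just the
# normalising constant of likelihood * prior; it does not move w_MP.
evidence = np.sum(likelihood * prior) * dw                     # P(D|H)
posterior = likelihood * prior / evidence                      # P(w|D,H)

print(f"evidence P(D|H) = {evidence:.4f}")  # ≈ 0.1968, i.e. N(D; 0, 2)
```

Dividing by the evidence does not change where the posterior peaks, which is why it can be ignored at level one; at level two it is the quantity of interest.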

It is common to summarise the posterior distribution by the value of \( w_{MP} \) (the most probable parameters) and error bars on these best-fit parameters. The error bars are obtained from the curvature of the posterior; writing the Hessian ... and Taylor-expanding the log posterior ...
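A minimal numerical sketch of such error bars, using a hypothetical coin-flip model (the model and data are illustrative assumptions, not from the text): the curvature \( A = -\nabla^2 \log P(w|D, H_i) \) at \( w_{MP} \) gives a Gaussian error bar \( \sigma = A^{-1/2} \).

```python
import numpy as np

# Hypothetical example (not from the text): a coin's bias w after
# observing k = 7 heads in n = 10 tosses, with a uniform prior on (0, 1).
k, n = 7, 10

def log_posterior(w):
    # log P(w|D,H) up to an additive constant (flat prior)
    return k * np.log(w) + (n - k) * np.log(1 - w)

w_mp = k / n  # most probable value (analytic for this model)

# Curvature A = -d^2/dw^2 log P(w|D,H) at w_MP, by central differences
h = 1e-4
A = -(log_posterior(w_mp + h) - 2 * log_posterior(w_mp)
      + log_posterior(w_mp - h)) / h ** 2

sigma = 1 / np.sqrt(A)  # Gaussian error bar from the Hessian
print(f"w_MP = {w_mp:.2f} +/- {sigma:.3f}")  # w_MP = 0.70 +/- 0.145
```

In one dimension the Hessian is a single second derivative; in general the error bars come from the inverse Hessian evaluated at \( w_{MP} \).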

$$ P(H_i |D) \propto P(D|H_i)P(H_i) $$

... this subjective part of the inference will typically be overwhelmed by the objective term, the evidence.

... Equation (3) has not been normalized because in the data modelling process we may develop new models after the data have arrived (figure 1), when an inadequacy of the first models is detected, for example. So we do not start with a completely defined hypothesis space. Inference is open-ended: we continually seek more probable models to account for the data we gather.

Of course, the evidence is not the whole story if we have good reason to assign unequal priors to the alternative models \(H\). ... The classic example is the 'Sure Thing' hypothesis, due to Edwin Thompson Jaynes, which is the hypothesis that the data set will be \(D\), the precise data set that actually occurred; the evidence for the Sure Thing hypothesis is huge. But Sure Thing belongs to an immense class of similar hypotheses which should all be assigned correspondingly tiny prior probabilities; so the posterior probability for Sure Thing is negligible alongside any sensible model.

$$ P(D|H_i) = \int P(D|w, H_i)P(w|H_i) dw $$

Since the posterior often has a strong peak,

$$ P(D|H_i) \simeq P(D|w_{MP}, H_i) P(w_{MP}|H_i) \Delta w $$

Evidence \( \simeq \) Best fit likelihood \( \times \) Occam factor

If the prior \(P(w|H_i)\) is uniform on some large interval \(\Delta^0 w\), then \(P(w_{MP}|H_i) = \frac{1}{\Delta^0 w}\), and

Occam factor = \( \frac{\Delta w}{\Delta^0 w} \)

i.e. the ratio of the posterior accessible volume of \(H_i\)'s parameter space to the prior accessible volume.
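The accuracy of this approximation can be checked numerically for a hypothetical one-parameter model (a coin with unknown bias and a uniform prior, so \( \Delta^0 w = 1 \) and \( P(w_{MP}|H_i) = 1 \)); the example is an illustrative assumption, not from the original text. With a Gaussian fit to the posterior peak, \( \Delta w = \sqrt{2\pi/A} \):

```python
from math import lgamma, exp, sqrt, pi

# Hypothetical one-parameter model (not from the text): a coin with
# bias w, uniform prior on (0, 1); data: k = 7 heads in n = 10 tosses.
k, n = 7, 10

def likelihood(w):
    return w ** k * (1 - w) ** (n - k)  # P(D|w,H) for one sequence

# Exact evidence: integral of likelihood * prior = Beta(k+1, n-k+1)
exact = exp(lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2))

# Laplace approximation: best-fit likelihood * Occam factor
w_mp = k / n
A = k / w_mp ** 2 + (n - k) / (1 - w_mp) ** 2  # -d^2/dw^2 log posterior
delta_w = sqrt(2 * pi / A)                     # posterior width
occam_factor = 1.0 * delta_w                   # P(w_MP|H) * delta_w = delta_w / 1
approx = likelihood(w_mp) * occam_factor

print(f"exact = {exact:.3e}, Laplace = {approx:.3e}")
```

Even for ten data points the two values agree to within about seven per cent; the Occam factor (about 0.36 here) is the fraction of the prior volume that survives the data.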

Bayesian model selection is a simple extension of maximum likelihood model selection: the evidence is obtained by multiplying the best fit likelihood by the Occam factor. ... Minimum description length (MDL) methods are closely related to this Bayesian framework (Rissanen, 1978; Wallace and Boulton, 1968; Wallace and Freeman, 1987). The negative log evidence \( -\log_2 P(D|H_i) \) is the number of bits in the ideal shortest message that encodes the data \(D\) using model \(H_i\). Akaike's criterion can be viewed as an approximation to MDL (Schwarz, 1978; Zellner, 1984). Any implementation of MDL necessitates approximations in evaluating the length of the ideal shortest message. ... I can see no advantage in MDL.
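To see the bit-counting interpretation at work, one can compare two hypothetical models of a coin (this example is an illustrative assumption, not from the original text): a fair coin with no free parameters, and a coin whose unknown bias carries a uniform prior.

```python
from math import lgamma, exp, log2

# Hypothetical comparison (not from the text), k = 7 heads in n = 10:
# H0: a fair coin, no free parameters.
# H1: a coin of unknown bias w with a uniform prior on (0, 1).
k, n = 7, 10

evidence_h0 = 0.5 ** n  # no parameters, so no Occam factor
evidence_h1 = exp(lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2))

# Ideal message lengths in bits: -log2 P(D|H_i)
bits_h0 = -log2(evidence_h0)
bits_h1 = -log2(evidence_h1)
print(f"H0: {bits_h0:.2f} bits, H1: {bits_h1:.2f} bits")
# H0: 10.00 bits, H1: 10.37 bits

# H0 encodes the data more compactly even though H1's best-fit
# likelihood (0.7^7 * 0.3^3) exceeds H0's 0.5^10: the Occam factor
# penalises H1's free parameter.
```

With this data the simpler model wins by about a third of a bit; more lopsided data would quickly reverse the ranking in favour of the biased-coin model.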