Bayes Estimation
Classical estimation theory concerns the estimation of unknown but deterministic parameters.
However, it is often possible to model the parameter to be estimated as a random variable \(\boldsymbol{\alpha}\) with some postulated distribution \(p(\alpha)\), referred to as the a priori pdf.
For example, suppose many values of the sample mean have been collected over some period of time.
Bayesian methods allow this data to be incorporated through the assumed distribution of the sample mean.
Weighting Function
There may also be a weighting function that incorporates the cost \(C(\boldsymbol{\alpha}, \hat{\boldsymbol{\alpha}}(\vec{\mathbf{y}}))\), also referred to by some authors as a loss, introduced by the error \(\boldsymbol{\alpha}_e\) between the estimate and its true value, defined by
\[ \boldsymbol{\alpha}_e = \boldsymbol{\alpha} - \hat{\boldsymbol{\alpha}}(\vec{\mathbf{y}}) \]
An example of such a weighting function is the mean-squared error between the estimate \(\hat{\boldsymbol{\alpha}}(\vec{y})\) and the parameter \(\boldsymbol{\alpha}\).
This cost is the counterpart of the cost used in detection theory.
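Average Risk
Similarly, an average risk \(\mathcal{R}\) is obtained by averaging the cost over \(p(\alpha, \vec{y})\), the joint pdf of the parameter \(\boldsymbol{\alpha}\) and the observable \(\vec{\mathbf{y}}\), according to
\[ \mathcal{R} = \int \int_{-\infty}^{\infty} C(\alpha, \hat{\alpha}(\vec{y})) \, p(\alpha, \vec{y}) \, d\alpha \, d\vec{y} \tag{C3.79} \]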
Objective. Minimizing this average risk yields a Bayes estimate.
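Conditional Cost
The conditional cost \(C_c(\hat{\boldsymbol{\alpha}}(\vec{\mathbf{y}})|\vec{y})\) is defined as
\[ C_c(\hat{\alpha}(\vec{y})|\vec{y}) = \int_{-\infty}^{\infty} C(\alpha, \hat{\alpha}(\vec{y})) \, p(\alpha|\vec{y}) \, d\alpha \tag{C3.80} \]
Note that this is a correction of [B2, Eq. (10.80)].
With this definition, the average risk can alternatively be expressed as
\[ \mathcal{R} = \int p(\vec{y}) \, C_c(\hat{\alpha}(\vec{y})|\vec{y}) \, d\vec{y} \tag{C3.81} \]
Since the pdf of the observable is positive over the range of its outcomes, the average risk is minimized by minimizing the conditional cost for each observed \(\vec{y}\).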
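Average Risk as a Function of Conditional Cost
To derive equation (C3.81) from equations (C3.79) and (C3.80), we start by examining the expression for the average risk \(\mathcal{R}\) given in equation (C3.79):
\[ \mathcal{R} = \int \int_{-\infty}^{\infty} C(\alpha, \hat{\alpha}(\vec{y})) \, p(\alpha, \vec{y}) \, d\alpha \, d\vec{y} \]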
Here, \(C(\alpha, \hat{\alpha}(\vec{y}))\) is the cost function, and \(p(\alpha, \vec{y})\) is the joint probability density function (pdf) of the parameter \(\alpha\) and the observable \(\vec{y}\).
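Next, we recognize that the joint pdf \(p(\alpha, \vec{y})\) can be factored using the product rule for conditional densities, the same identity that underlies Bayes’ theorem:
\[ p(\alpha, \vec{y}) = p(\alpha|\vec{y}) \, p(\vec{y}) \]
Substituting this into the expression for \(\mathcal{R}\):
\[ \mathcal{R} = \int \int_{-\infty}^{\infty} C(\alpha, \hat{\alpha}(\vec{y})) \, p(\alpha|\vec{y}) \, p(\vec{y}) \, d\alpha \, d\vec{y} \]
Since \(p(\vec{y})\) does not depend on \(\alpha\), we can factor it out of the inner integral:
\[ \mathcal{R} = \int p(\vec{y}) \left[ \int_{-\infty}^{\infty} C(\alpha, \hat{\alpha}(\vec{y})) \, p(\alpha|\vec{y}) \, d\alpha \right] d\vec{y} \]
The inner integral is the definition of the conditional cost \(C_c(\hat{\alpha}(\vec{y})|\vec{y})\) as given in equation (C3.80):
\[ C_c(\hat{\alpha}(\vec{y})|\vec{y}) = \int_{-\infty}^{\infty} C(\alpha, \hat{\alpha}(\vec{y})) \, p(\alpha|\vec{y}) \, d\alpha \]
Substituting this back into our expression for \(\mathcal{R}\):
\[ \mathcal{R} = \int p(\vec{y}) \, C_c(\hat{\alpha}(\vec{y})|\vec{y}) \, d\vec{y} \]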
This is equation (C3.81), which expresses the average risk \(\mathcal{R}\) in terms of the observable’s pdf \(p(\vec{y})\) and the conditional cost \(C_c(\hat{\alpha}(\vec{y})|\vec{y})\).
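As a numerical sanity check of this identity, the following sketch evaluates the average risk both ways on a grid. The model is an assumption made only for illustration and does not come from the text: a scalar parameter with a standard Gaussian prior, an observation \(y = \alpha + n\) with unit-variance Gaussian noise, a squared-error cost, and an illustrative linear estimator \(\hat{\alpha}(y) = 0.5\,y\). The names `risk_joint` and `risk_cond` are likewise hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Scalar Gaussian density, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Integration grids (assumed ranges, wide enough that the tails are negligible).
alpha = np.linspace(-8.0, 8.0, 801)
y = np.linspace(-12.0, 12.0, 1201)
d_alpha = alpha[1] - alpha[0]
d_y = y[1] - y[0]

# A[i, j] = alpha[i], Y[i, j] = y[j]
A, Y = np.meshgrid(alpha, y, indexing="ij")

# Assumed model: p(alpha) = N(0, 1), p(y | alpha) = N(alpha, 1).
joint = gaussian_pdf(Y, A, 1.0) * gaussian_pdf(A, 0.0, 1.0)  # p(alpha, y)

# Squared-error cost for the illustrative estimator alpha_hat(y) = 0.5 * y.
cost = (A - 0.5 * Y) ** 2

# Average risk as a double integral over the joint pdf.
risk_joint = np.sum(cost * joint) * d_alpha * d_y

# Same risk via the conditional cost.
p_y = np.sum(joint, axis=0) * d_alpha                  # marginal p(y)
posterior = joint / p_y                                # p(alpha | y), column-wise
cond_cost = np.sum(cost * posterior, axis=0) * d_alpha  # C_c(alpha_hat(y) | y)
risk_cond = np.sum(p_y * cond_cost) * d_y

print(risk_joint, risk_cond)
```

For this assumed model the error is \(0.5\,\alpha - 0.5\,n\), so the exact risk is \(0.25 + 0.25 = 0.5\), and both grid sums should land close to that value.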
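Mathematical Formulation of the Bayes Estimator
From a mathematical standpoint, the Bayes estimator follows from Equation (C3.80): it is the value that minimizes the conditional cost, and can be expressed as
\[ \hat{\alpha}(\vec{y}) = \arg\min_{\hat{\alpha}} \int_{-\infty}^{\infty} C(\alpha, \hat{\alpha}) \, p(\alpha|\vec{y}) \, d\alpha \]
The a posteriori distribution \(p(\alpha|\vec{y})\) is obtained from Bayes’ theorem and can be written as
\[ p(\alpha|\vec{y}) = \frac{p(\vec{y}|\alpha) \, p(\alpha)}{p(\vec{y})} = \frac{p(\vec{y}|\alpha) \, p(\alpha)}{\int_{-\infty}^{\infty} p(\vec{y}|\alpha) \, p(\alpha) \, d\alpha} \]
where \( p(\vec{y}|\alpha) \) is referred to as the likelihood, since it describes how probable the current observations are for a given value of the parameter \(\alpha\).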
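To make the roles of prior, likelihood, and normalization concrete, here is a minimal grid-based sketch of Bayes’ theorem. It assumes, purely for illustration, a standard Gaussian prior on a scalar \(\alpha\) and three made-up observations \(y_i = \alpha + n_i\) with unit-variance Gaussian noise; the data values and variable names are hypothetical.

```python
import numpy as np

# Grid over the scalar parameter alpha (assumed range).
alpha = np.linspace(-6.0, 6.0, 2401)
d_alpha = alpha[1] - alpha[0]

# Made-up observations, assumed to follow y_i = alpha + unit-variance noise.
y = np.array([0.8, 1.2, 1.0])

# A priori pdf p(alpha): standard Gaussian (assumption).
prior = np.exp(-0.5 * alpha**2) / np.sqrt(2.0 * np.pi)

# Likelihood p(y | alpha): product of per-sample Gaussian densities.
per_sample = np.exp(-0.5 * (y[:, None] - alpha[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
likelihood = np.prod(per_sample, axis=0)

# Denominator of Bayes' theorem: integral of likelihood times prior.
evidence = np.sum(likelihood * prior) * d_alpha

# A posteriori pdf p(alpha | y).
posterior = likelihood * prior / evidence

post_mean = np.sum(alpha * posterior) * d_alpha
post_var = np.sum((alpha - post_mean) ** 2 * posterior) * d_alpha
print(post_mean, post_var)
```

Under these Gaussian assumptions the posterior is also Gaussian, with mean \(\sum_i y_i / (N + 1) = 0.75\) and variance \(1/(N + 1) = 0.25\) for \(N = 3\), which gives a closed-form value to compare the grid result against.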
Some Specific Cost Functions
To proceed further, a cost function must be selected.
Examples of cost functions include:
Squared-error cost function, defined by
\[ C_S(\alpha, \hat{\alpha}(\vec{y})) = (\alpha - \hat{\alpha}(\vec{y}))^2 = \alpha_e^2 \]
Uniform cost function, defined by
\[\begin{split} C_U(\alpha, \hat{\alpha}(\vec{y})) = \begin{cases} 0, & |\alpha_e| < \frac{\Delta}{2} \\ 1, & |\alpha_e| > \frac{\Delta}{2} \end{cases} \end{split}\]
where \(\Delta\) is a small number.
Absolute-error cost function, defined by
\[ C_{AE}(\alpha, \hat{\alpha}(\vec{y})) = |\alpha_e| \]
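The effect of the cost function on the resulting estimate can be explored numerically. The sketch below minimizes the conditional cost over a grid of candidate estimates for each of the three costs above. The skewed a posteriori pdf (proportional to \(\alpha^2 e^{-\alpha}\) for \(\alpha \ge 0\)) and the value \(\Delta = 0.22\) are assumptions chosen only so that the three answers differ visibly; they are not taken from the text.

```python
import numpy as np

# Grid over alpha, also reused as the set of candidate estimates (step 0.02).
alpha = np.linspace(0.0, 30.0, 1501)
d_alpha = alpha[1] - alpha[0]

# Assumed skewed posterior: proportional to alpha^2 * exp(-alpha), normalized on the grid.
posterior = alpha**2 * np.exp(-alpha)
posterior /= np.sum(posterior) * d_alpha

delta = 0.22  # assumed width for the uniform cost; delta/2 falls between grid steps

def squared_error(err):
    return err**2

def absolute_error(err):
    return np.abs(err)

def uniform(err):
    # 0 inside the window |err| < delta/2, 1 outside it.
    return (np.abs(err) > delta / 2).astype(float)

def conditional_cost(cost_fn):
    """Conditional cost for every candidate estimate on the grid."""
    err = alpha[:, None] - alpha[None, :]  # err[i, j] = alpha_i - candidate_j
    return np.sum(cost_fn(err) * posterior[:, None], axis=0) * d_alpha

est_squared = alpha[np.argmin(conditional_cost(squared_error))]
est_absolute = alpha[np.argmin(conditional_cost(absolute_error))]
est_uniform = alpha[np.argmin(conditional_cost(uniform))]
print(est_squared, est_absolute, est_uniform)
```

For this assumed posterior the three minimizers land near its mean (3), its median (about 2.67), and its mode (2), respectively. This agrees with the standard results that the squared-error cost leads to the posterior mean, the absolute-error cost to the posterior median, and the uniform cost with small \(\Delta\) to the posterior mode.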
Discussion: Sequential Learning in Bayesian Estimation
From the previous discussion, it becomes evident that a Bayes estimate relies on the a posteriori distribution, which requires knowledge of both the a priori distribution and the likelihood function.
The a priori distribution encapsulates our understanding of the parameter before any data is observed.
Once data is collected, this knowledge is updated, leading to a new a posteriori distribution.
This process highlights Bayesian estimation as a form of sequential learning.
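If an additional independent data set \(\vec{x}\) becomes available after measuring \(\vec{y}\), we can update our estimates using a sequential version of Bayes’ theorem:
\[ p(\alpha|\vec{x}, \vec{y}) = \frac{p(\vec{x}, \vec{y}|\alpha) \, p(\alpha)}{\int_{-\infty}^{\infty} p(\vec{x}, \vec{y}|\alpha) \, p(\alpha) \, d\alpha} = \frac{p(\vec{x}|\alpha) \, p(\vec{y}|\alpha) \, p(\alpha)}{\int_{-\infty}^{\infty} p(\vec{x}|\alpha) \, p(\vec{y}|\alpha) \, p(\alpha) \, d\alpha} = \frac{p(\alpha|\vec{x}) \, p(\vec{y}|\alpha)}{\int_{-\infty}^{\infty} p(\alpha|\vec{x}) \, p(\vec{y}|\alpha) \, d\alpha} \]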
This equation demonstrates how to update the posterior probability \(p(\alpha|\vec{x}, \vec{y})\) of the parameter \(\alpha\) given both data sets \(\vec{x}\) and \(\vec{y}\).
Joint Likelihood: The numerator \(p(\vec{x}, \vec{y}|\alpha)p(\alpha)\) represents the joint likelihood of observing both data sets given \(\alpha\), multiplied by the prior probability of \(\alpha\).
Independence Assumption: Since \(\vec{x}\) and \(\vec{y}\) are independent given \(\alpha\), we can factor the joint likelihood as \(p(\vec{x}|\alpha)p(\vec{y}|\alpha)\).
Sequential Updating: The expression \(\frac{p(\alpha|\vec{x})p(\vec{y}|\alpha)}{\int_{-\infty}^{\infty} p(\alpha|\vec{x})p(\vec{y}|\alpha) \, d\alpha}\) shows that the posterior after observing \(\vec{x}\) serves as the new prior when updating with \(\vec{y}\).
Normalization: The denominator ensures that the posterior probability distribution \(p(\alpha|\vec{x}, \vec{y})\) integrates to one.
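The equivalence between a single batch update and two sequential updates can be checked numerically. The following sketch assumes, for illustration only, a standard Gaussian prior, unit-variance Gaussian noise, and two small made-up data sets; the helper `update` and all data values are hypothetical.

```python
import numpy as np

# Grid over the scalar parameter alpha (assumed range).
alpha = np.linspace(-6.0, 6.0, 2401)
d_alpha = alpha[1] - alpha[0]

def update(prior, data):
    """One Bayes update on the grid: posterior is likelihood * prior, normalized."""
    per_sample = np.exp(-0.5 * (data[:, None] - alpha[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    likelihood = np.prod(per_sample, axis=0)  # samples assumed independent given alpha
    unnormalized = likelihood * prior
    return unnormalized / (np.sum(unnormalized) * d_alpha)

# Assumed prior p(alpha): standard Gaussian.
prior = np.exp(-0.5 * alpha**2) / np.sqrt(2.0 * np.pi)

# Two made-up data sets, assumed conditionally independent given alpha.
x = np.array([0.5, 1.5])
y = np.array([1.0, 0.2, 1.3])

# Batch: use both data sets at once.
posterior_batch = update(prior, np.concatenate([x, y]))

# Sequential: the posterior after x serves as the prior for y.
posterior_seq = update(update(prior, x), y)

max_diff = np.max(np.abs(posterior_batch - posterior_seq))
post_mean = np.sum(alpha * posterior_seq) * d_alpha
print(max_diff, post_mean)
```

The two posteriors should agree up to floating-point rounding. Under these Gaussian assumptions the posterior mean is the sum of all five samples divided by six, that is \(4.5 / 6 = 0.75\).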