Maximum a Posteriori (MAP) Estimation#
Motivation.
When the cost function in Bayesian estimation is unspecified, or is assumed to penalize all estimation errors equally, the estimation problem reduces to Maximum a Posteriori (MAP) estimation.
In this special case, the cost function is uniform, and the optimal estimator minimizes the expected cost by selecting the parameter value that maximizes the posterior distribution.
To apply the MAP estimator, we assume prior knowledge of the parameter’s probability distribution, represented by the prior probability density \( p(\alpha) \).
MAP Estimator Definition#
The MAP estimate is computed by maximizing the posterior probability density function (PDF) \( p(\alpha | \vec{y}) \) with respect to the parameter \( \alpha \):
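$$
\hat{\alpha}_{\text{MAP}} = \arg\max_{\alpha} \, p(\alpha | \vec{y})
$$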
According to Bayes’ theorem, the posterior PDF is given by:
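$$
p(\alpha | \vec{y}) = \frac{p(\vec{y} | \alpha) \, p(\alpha)}{p(\vec{y})}
$$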
Since the denominator \( p(\vec{y}) \) does not depend on \( \alpha \), it can be omitted in the maximization process.
Therefore, the MAP estimate simplifies to finding the value of \( \alpha \) that maximizes the product \( p(\vec{y} | \alpha) p(\alpha) \):
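$$
\hat{\alpha}_{\text{MAP}} = \arg\max_{\alpha} \, p(\vec{y} | \alpha) \, p(\alpha)
$$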
This simplification makes MAP estimation computationally efficient, as it avoids calculating the often complex denominator \( p(\vec{y}) \).
MAP Estimator Formulation#
Conditional Cost
Recall that the conditional cost \( C_c(\hat{\alpha}|\vec{y}) \) represents the expected cost of estimating the parameter \( \alpha \) given the observation \( \vec{y} \):
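$$
C_c(\hat{\alpha}|\vec{y}) = \int_{-\infty}^{\infty} C(\alpha, \hat{\alpha}(\vec{y})) \, p(\alpha|\vec{y}) \, d\alpha
$$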
where
\( C(\alpha, \hat{\alpha}(\vec{y})) \) is the cost of estimating \( \alpha \) as \( \hat{\alpha}(\vec{y}) \).
\( p(\alpha|\vec{y}) \) is the posterior probability density function of \( \alpha \) given \( \vec{y} \).
Uniform Cost Function
Recall that the uniform cost function penalizes any estimation error beyond a small threshold \( \frac{\Delta}{2} \):
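$$
C_U(\alpha, \hat{\alpha}(\vec{y})) =
\begin{cases}
0, & |\alpha - \hat{\alpha}(\vec{y})| < \dfrac{\Delta}{2} \\[4pt]
1, & |\alpha - \hat{\alpha}(\vec{y})| \geq \dfrac{\Delta}{2}
\end{cases}
$$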
where \( \Delta \) is a small positive number representing the acceptable estimation error margin.
Expressing the Average Risk with the Uniform Cost Function
Recall that the average risk \( \mathcal{R} \) is the expected cost over all possible observations \( \vec{y} \):
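$$
\mathcal{R} = \int_{-\infty}^{\infty} C_c(\hat{\alpha}|\vec{y}) \, p(\vec{y}) \, d\vec{y}
$$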
where \( p(\vec{y}) \) is the probability density function of the observation \( \vec{y} \).
Using the uniform cost function in the conditional cost equation, we have
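$$
C_c(\hat{\alpha}|\vec{y}) = \int_{-\infty}^{\infty} C_U(\alpha, \hat{\alpha}(\vec{y})) \, p(\alpha|\vec{y}) \, d\alpha
$$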
Since \( C_U(\alpha, \hat{\alpha}(\vec{y})) = 0 \) when \( |\alpha - \hat{\alpha}(\vec{y})| < \frac{\Delta}{2} \), and \( C_U(\alpha, \hat{\alpha}(\vec{y})) = 1 \) otherwise, the integral simplifies to:
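$$
C_c(\hat{\alpha}|\vec{y}) = \int_{|\alpha - \hat{\alpha}(\vec{y})| \geq \frac{\Delta}{2}} p(\alpha|\vec{y}) \, d\alpha
$$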
This expression calculates the probability that the estimation error exceeds \( \frac{\Delta}{2} \).
We can split the integral into two regions where the estimation error is beyond the acceptable threshold:
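$$
C_c(\hat{\alpha}|\vec{y}) = \int_{-\infty}^{\hat{\alpha}(\vec{y}) - \frac{\Delta}{2}} p(\alpha|\vec{y}) \, d\alpha + \int_{\hat{\alpha}(\vec{y}) + \frac{\Delta}{2}}^{\infty} p(\alpha|\vec{y}) \, d\alpha
$$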
Substituting the expression for \( C_c(\hat{\alpha}|\vec{y}) \) back into the average risk:
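$$
\mathcal{R} = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\hat{\alpha}(\vec{y}) - \frac{\Delta}{2}} p(\alpha|\vec{y}) \, d\alpha + \int_{\hat{\alpha}(\vec{y}) + \frac{\Delta}{2}}^{\infty} p(\alpha|\vec{y}) \, d\alpha \right] p(\vec{y}) \, d\vec{y}
$$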
Recognizing that the total probability integrates to 1:
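$$
\int_{-\infty}^{\infty} p(\alpha|\vec{y}) \, d\alpha = 1
$$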
We can rewrite the sum of the two integrals as:
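$$
\int_{-\infty}^{\hat{\alpha}(\vec{y}) - \frac{\Delta}{2}} p(\alpha|\vec{y}) \, d\alpha + \int_{\hat{\alpha}(\vec{y}) + \frac{\Delta}{2}}^{\infty} p(\alpha|\vec{y}) \, d\alpha = 1 - \int_{\hat{\alpha}(\vec{y}) - \frac{\Delta}{2}}^{\hat{\alpha}(\vec{y}) + \frac{\Delta}{2}} p(\alpha|\vec{y}) \, d\alpha
$$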
Substituting back into the average risk expression, we have:
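$$
\mathcal{R} = \int_{-\infty}^{\infty} \left[ 1 - \int_{\hat{\alpha}(\vec{y}) - \frac{\Delta}{2}}^{\hat{\alpha}(\vec{y}) + \frac{\Delta}{2}} p(\alpha|\vec{y}) \, d\alpha \right] p(\vec{y}) \, d\vec{y}
$$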
Note that
The term \( \int_{\hat{\alpha}(\vec{y}) - \frac{\Delta}{2}}^{\hat{\alpha}(\vec{y}) + \frac{\Delta}{2}} p(\alpha|\vec{y})\, d\alpha \) represents the probability that the true parameter \( \alpha \) lies within the acceptable error margin of the estimate \( \hat{\alpha}(\vec{y}) \).
By subtracting this probability from 1, we obtain the probability that the estimation error exceeds the acceptable threshold, which is exactly what the conditional cost measures under the uniform cost function.
The Optimal Estimate \( \hat{\alpha}_{\text{MAP}}\): Minimizing the Average Risk
To minimize \( \mathcal{R} \), we need to maximize the integral \( \int_{\hat{\alpha}(\vec{y}) - \frac{\Delta}{2}}^{\hat{\alpha}(\vec{y}) + \frac{\Delta}{2}} p(\alpha|\vec{y})\, d\alpha \) for each \( \vec{y} \).
Since \( p(\vec{y}) \geq 0 \), the only way to minimize \( \mathcal{R} \) is by maximizing the integral inside the brackets.
This integral measures how concentrated the posterior probability \( p(\alpha|\vec{y}) \) is around the estimate \( \hat{\alpha}(\vec{y}) \).
As \( \Delta \) approaches zero (i.e., for very small acceptable errors), the integral becomes proportional to the value of the posterior probability density at \( \hat{\alpha}(\vec{y}) \):
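$$
\int_{\hat{\alpha}(\vec{y}) - \frac{\Delta}{2}}^{\hat{\alpha}(\vec{y}) + \frac{\Delta}{2}} p(\alpha|\vec{y}) \, d\alpha \approx \Delta \, p(\hat{\alpha}(\vec{y})|\vec{y})
$$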
Thus, minimizing \( \mathcal{R} \) is equivalent to maximizing \( p(\hat{\alpha}(\vec{y})|\vec{y}) \).
Therefore, the optimal estimate \( \hat{\alpha}_{\text{MAP}} \) is the value of \( \alpha \) that maximizes the posterior probability density function given \( \vec{y} \):
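$$
\hat{\alpha}_{\text{MAP}} = \arg\max_{\alpha} \, p(\alpha | \vec{y})
$$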
Finding The MAP Estimate#
For small \( \Delta \), the integral
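$$
\int_{\hat{\alpha}_{\text{MAP}} - \frac{\Delta}{2}}^{\hat{\alpha}_{\text{MAP}} + \frac{\Delta}{2}} p(\alpha | \vec{y}) \, d\alpha
$$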
is maximized when the estimate \( \hat{\alpha}_{\text{MAP}} \) corresponds to the point where the posterior density \( p(\alpha | \vec{y}) \) reaches its maximum.
Essentially, we want to choose \( \hat{\alpha}_{\text{MAP}} \) such that the probability of the true parameter \( \alpha \) lying within the interval \( \left[ \hat{\alpha}_{\text{MAP}} - \frac{\Delta}{2},\, \hat{\alpha}_{\text{MAP}} + \frac{\Delta}{2} \right] \) is as high as possible.
In the case of a unimodal posterior density—that is, a distribution with a single peak—the estimate \( \hat{\alpha}_{\text{MAP}} \) is the mode of \( p(\alpha | \vec{y}) \).
Note that the mode of a probability distribution is the value at which the PDF reaches its maximum—the point where the distribution has its peak.
Setting the Derivative to Zero
Recall from calculus that the maxima (and minima) of a differentiable function occur at critical points, where the first derivative is zero.
This is because the slope of the tangent to the function at these points is horizontal, indicating a potential maximum or minimum.
Therefore, in our MAP estimation, to find the maximum of the posterior density \( p(\alpha | \vec{y}) \), we look for the point where it attains its highest value with respect to \( \alpha \).
Setting the derivative to zero helps us locate critical points of \( p(\alpha | \vec{y}) \), i.e.:
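$$
\left. \frac{\partial p(\alpha | \vec{y})}{\partial \alpha} \right|_{\alpha = \hat{\alpha}_{\text{MAP}}} = 0
$$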
For a unimodal distribution, this critical point corresponds to the global maximum.
By solving \( \frac{\partial}{\partial \alpha} p(\alpha | \vec{y}) = 0 \), we find the value of \( \alpha \) where \( p(\alpha | \vec{y}) \) is at its peak, which is the most probable estimate given the observed data \( \vec{y} \).
Using \(\ln(\cdot)\)
Working with the natural logarithm of the posterior is often convenient:
The logarithm simplifies the optimization problem, especially when the posterior density involves exponential functions or products of multiple terms.
The logarithm turns products into sums and exponents into multipliers, making differentiation more straightforward.
In our case, since the natural logarithm is a strictly increasing function, the location of the maximum of \( p(\alpha | \vec{y}) \) remains the same when considering \( \ln p(\alpha | \vec{y}) \), i.e., maximizing \( p(\alpha | \vec{y}) \) is equivalent to maximizing \( \ln p(\alpha | \vec{y}) \).
Therefore, we often obtain a simpler equation to solve for \( \hat{\alpha}_{\text{MAP}} \), by setting:
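$$
\left. \frac{\partial \ln p(\alpha | \vec{y})}{\partial \alpha} \right|_{\alpha = \hat{\alpha}_{\text{MAP}}} = 0
$$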
This approach is particularly useful when \( p(\alpha | \vec{y}) \) is composed of exponential terms common in probability distributions like the Gaussian distribution.
Note that these equations assume that the derivatives exist and that \( p(\alpha | \vec{y}) \) is differentiable with respect to \( \alpha \).
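To make the procedure concrete, the following short Python sketch finds a MAP estimate numerically by maximizing the log-posterior \( \ln p(\vec{y} | \alpha) + \ln p(\alpha) \). It assumes an illustrative conjugate Gaussian model (independent observations of an unknown mean with known noise variance \( \sigma^2 \), and a Gaussian prior with mean \( \mu_0 \) and variance \( \beta^2 \)); the model choices, variable names, and the use of `scipy.optimize.minimize_scalar` are assumptions for this sketch, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative (assumed) model: y_k = alpha + noise, noise ~ N(0, sigma2),
# with a Gaussian prior alpha ~ N(mu0, beta2).
rng = np.random.default_rng(0)
sigma2, beta2, mu0 = 1.0, 4.0, 0.0   # noise variance, prior variance, prior mean
alpha_true, m = 1.5, 10
y = alpha_true + rng.normal(0.0, np.sqrt(sigma2), size=m)

def neg_log_posterior(alpha):
    """Negative of ln p(y | alpha) + ln p(alpha), dropping alpha-independent constants."""
    log_likelihood = -np.sum((y - alpha) ** 2) / (2.0 * sigma2)
    log_prior = -((alpha - mu0) ** 2) / (2.0 * beta2)
    return -(log_likelihood + log_prior)

# Numerical MAP estimate: maximize the log-posterior (minimize its negative).
alpha_map_numeric = minimize_scalar(neg_log_posterior).x

# Closed-form MAP for this conjugate Gaussian model: the posterior mean.
y_bar = y.mean()
alpha_map_closed = (m * beta2 * y_bar + sigma2 * mu0) / (m * beta2 + sigma2)

print(f"numerical MAP  : {alpha_map_numeric:.6f}")
print(f"closed-form MAP: {alpha_map_closed:.6f}")
```

For this conjugate Gaussian model the two values agree up to optimizer tolerance, which previews the result of Example C3.9 below: when the posterior is Gaussian, the MAP estimate coincides with the posterior mean.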
Example C3.9: MAP Estimation of The True Mean#
Problem Statement
Using Example C3.7, show that the a posteriori PDF of Eq. (C3.98) has a maximum when
Also show that, since the a posteriori PDF \( p(\mu | \vec{y}) \) is Gaussian, its mode and mean are identical, so that the MAP estimate is the same as the MMSE estimate.
Solution
To find the MAP estimate \( \hat{\boldsymbol{\mu}}_{\text{MAP}} \) of the unknown random mean \( \boldsymbol{\mu} \) using the given observations \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \), we’ll use the posterior PDF provided in equation (C3.98):
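$$
p(\mu | \vec{y}) = \frac{1}{\sqrt{2\pi \gamma^2}} \exp\left[ -\frac{(\mu - \gamma^2 \omega)^2}{2 \gamma^2} \right]
$$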
This posterior PDF is a normal distribution with mean \( \gamma^2 \omega \) and variance \( \gamma^2 \).
Since the normal distribution is unimodal and symmetric, the maximum of \( p(\mu | \vec{y}) \) occurs at its mean.
Note that, in this example, we do not need to set the derivative of the posterior density to zero to find the MAP estimate. Specifically:
Because we know the maximum occurs at the mean for a normal distribution, we can directly identify the MAP estimate without further calculations.
There’s no need to set the derivative \( \frac{\partial}{\partial \mu} p(\mu | \vec{y}) = 0 \) or \( \frac{\partial}{\partial \mu} \ln p(\mu | \vec{y}) = 0 \) because we already know where the maximum lies.
Therefore, the MAP estimate \( \hat{\mu}_{\text{MAP}} \) is:
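$$
\hat{\mu}_{\text{MAP}} = \gamma^2 \omega
$$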
Here we briefly revisit the derivation of \( \gamma^2 \) and \( \omega \).
Calculating \( \gamma^2 \) and \( \omega \):
Recall that the parameters \( \gamma^2 \) and \( \omega \) are given by:
To simplify, combine the terms in the denominator:
And
And \( \bar{y} \) is the sample mean:
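$$
\bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i
$$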
Multiplying \( \gamma^2 \) and \( \omega \), we have:
Next, multiply the numerators and denominators:
Simplify the terms inside the parentheses:
Substitute back:
The \( \sigma^2 \beta^2 \) terms cancel out:
Thus, the MAP estimate is the same as the MMSE estimate obtained in the previous section.