Maximum Likelihood (ML) Estimation#
Bayesian estimation relies on the posterior probability distribution \( p(\alpha | \vec{y}) \). Through Bayes’ theorem, it therefore requires knowledge of both:
the prior distribution \( p(\alpha) \), and
the likelihood function \( p(\vec{y} | \alpha) \) .
In contrast, ML estimation requires only knowledge of
the likelihood function \( p(\vec{y} | \alpha) \)
without the need for any prior distribution or cost function.
Formally, the ML estimation focuses solely on maximizing the likelihood function \( p(\vec{y} | \alpha) \) with respect to the parameter \( \alpha \).
Mathematically, the ML estimator is given by:
\[ \hat{\boldsymbol{\alpha}}_{\text{ML}} = \arg\max_{\alpha} \, p(\vec{y} | \alpha) \]
Determine ML Estimate#
Suppose we have \( m \) independent observations \( \mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_m \) of a random variable \( \mathbf{y} \), each dependent on an unknown parameter \( \alpha \). Each observation has a probability density function (pdf) \( p(y_i | \alpha) \).
Since the observations are independent, the joint pdf of all observations (i.e., the likelihood function) is the product of the individual pdfs:
\[ p(\vec{y} | \alpha) = \prod_{i=1}^{m} p(y_i | \alpha) \]
The maximum-likelihood estimate \( \hat{\boldsymbol{\alpha}}_{\text{ML}} \) of the parameter \(\alpha\) is determined by finding the maximum of this likelihood function based on the observed data \( \vec{\mathbf{y}} \).
Derivation Directly From the Likelihood Function#
The likelihood function, i.e., the joint probability of observing the data \( \vec{y} \) given the parameter \( \alpha \), is given by
Compute the derivative of the likelihood function with respect to \( \alpha \), and then set the derivative equal to zero:
Solve the resulting equation for \( \alpha \) to find the ML estimate \( \hat{\boldsymbol{\alpha}}_{\text{ML}} \).
Derivation From the Log-Likelihood Function#
Since the logarithm is a monotonic function, the derivation is often simpler if we use the log-likelihood function.
Specifically, take the natural logarithm of the likelihood function to obtain the log-likelihood function
\[ \ln p(\vec{y} | \alpha) = \sum_{i=1}^{m} \ln p(y_i | \alpha) \]
Note that the logarithm turns the product into a sum, which is easier to differentiate.
Compute the derivative of the log-likelihood with respect to \( \alpha \), and set the derivative equal to zero:
Solve the resulting equation for \( \alpha \) to find the maximum likelihood estimate \( \hat{\boldsymbol{\alpha}}_{\text{ML}} \).
Note that this method only works if the (log-)likelihood function is differentiable with respect to \( \alpha \).
Example with Normally Distributed Observations#
Assume each observation \( \mathbf{y}_i \) is normally distributed with unknown mean \( \alpha \) and known variance \( \sigma^2 \):
Each individual pdf is given by
\[ p(y_i | \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \alpha)^2}{2\sigma^2} \right) \]
Method 1: Compute directly from the likelihood function
Thus, the likelihood function is obtained as
Notice that each term in the product is an exponential function. Thus, the entire product can be rewritten as:
\[ L(\alpha) = (2\pi\sigma^2)^{-m/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \alpha)^2 \right) \]
Differentiate \( L(\alpha) \) with respect to \( \alpha \):
Simplifying the derivative inside:
Therefore:
To find the maximum likelihood estimate (MLE), set \( \frac{dL(\alpha)}{d\alpha} = 0 \) and let \(\hat{\alpha}\) denote a root of this equation:
Since \( L(\hat{\alpha}) \) is strictly positive, we can divide it out, leading to:
Method 2: Compute using the log-likelihood function
The log-likelihood function is
\[ \ln L(\alpha) = \sum_{i=1}^{m} \ln \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \alpha)^2}{2\sigma^2} \right) \right] \]
Simplifying the sum, we have
\[ \ln L(\alpha) = -\frac{m}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \alpha)^2 \]
Next, we differentiate the log-likelihood function to find \(\alpha\) that returns its maximum value
Specifically, we compute the derivative with respect to \( \alpha \):
Simplify:
Set the derivative equal to zero, assuming \( \hat{\alpha} \) is the root:
Multiply both sides by \( \sigma^2 \):
Simplify the sum:
Solve for \( \hat{\alpha} \):
Using either method, the maximum likelihood estimate of the unknown mean \( \alpha \) is the sample mean:
\[ \hat{\alpha}_{\text{ML}} = \frac{1}{m} \sum_{i=1}^{m} y_i \]
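The result above can be sanity-checked numerically. The sketch below (illustrative only; the data values are made up) evaluates the Gaussian log-likelihood at the sample mean and at a few other candidates, confirming that the sample mean attains the maximum:

```python
import math

def gaussian_loglik(alpha, ys, sigma=1.0):
    """Log-likelihood of i.i.d. N(alpha, sigma^2) observations ys."""
    m = len(ys)
    return -0.5 * m * math.log(2 * math.pi * sigma**2) \
        - sum((y - alpha) ** 2 for y in ys) / (2 * sigma**2)

ys = [2.1, 1.7, 2.5, 1.9, 2.3]          # made-up observations
alpha_ml = sum(ys) / len(ys)            # closed-form ML estimate: the sample mean

# No other candidate value of alpha achieves a higher log-likelihood
for candidate in [alpha_ml - 0.5, alpha_ml + 0.5, 0.0, 3.0]:
    assert gaussian_loglik(alpha_ml, ys) > gaussian_loglik(candidate, ys)
```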
Discussion: The Use of Logarithms
The natural logarithm is a monotonic function, meaning it preserves the order of the likelihood values.
Maximizing the log-likelihood is equivalent to maximizing the likelihood itself.
Using the log-likelihood simplifies calculations, especially when dealing with products of probabilities, as it converts them into sums.
For example, in the general case, when we differentiate a product of multiple functions with respect to \( \alpha \), we must apply the product rule from calculus.
Specifically, for a product of two functions, \( f(\alpha) \) and \( g(\alpha) \), the derivative is:
\[ \frac{d}{d\alpha} [f(\alpha) \cdot g(\alpha)] = f'(\alpha) \cdot g(\alpha) + f(\alpha) \cdot g'(\alpha) \]
Extending this to \( m \) terms becomes increasingly complex, as the product rule must be applied iteratively.
The derivative of a product of \( m \) factors is a sum of \( m \) terms, each itself a product of \( m \) factors; writing these out directly quickly becomes cumbersome and error-prone for large \( m \).
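Beyond easing differentiation, the log transform has a practical side benefit worth mentioning: a product of many small probabilities underflows in floating-point arithmetic, while the sum of logarithms stays finite. A minimal sketch (standard-normal densities; the setup is illustrative, not from the text above):

```python
import math

def phi(y):
    """Density of the standard normal N(0, 1) evaluated at y."""
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

ys = [0.0] * 1000   # 1000 observations, each with density about 0.3989

likelihood = 1.0
for y in ys:
    likelihood *= phi(y)        # the running product underflows to 0.0

loglik = sum(math.log(phi(y)) for y in ys)   # the log-likelihood stays finite
```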
Discussion: Applicability
This method is applicable when the derivatives of the likelihood or log-likelihood functions exist.
In cases where the likelihood function is complex or does not have a closed-form solution, numerical optimization techniques may be employed to find the maximum likelihood estimate.
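When no closed form is available, even a crude one-dimensional search over the negative log-likelihood recovers the estimate. A minimal sketch under the Gaussian model discussed above (the grid range and data are made up):

```python
import math

def neg_loglik(alpha, ys, sigma=1.0):
    """Negative Gaussian log-likelihood: the function to minimize."""
    return 0.5 * len(ys) * math.log(2 * math.pi * sigma**2) \
        + sum((y - alpha) ** 2 for y in ys) / (2 * sigma**2)

ys = [4.2, 3.8, 4.5, 4.1]               # made-up observations

# Crude 1-D grid search over a plausible range for alpha
grid = [i / 1000 for i in range(3000, 5001)]    # 3.000 .. 5.000
alpha_hat = min(grid, key=lambda a: neg_loglik(a, ys))

# The minimizer lands on the grid point nearest the sample mean (4.15 here)
```

In practice one would use a proper optimizer (e.g., Newton’s method or a library routine) rather than a grid, but the principle is the same.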
The ML estimation provides a practical and efficient method for parameter estimation
based solely on observed data
without requiring prior distributions or cost functions.
Bayes vs. ML Estimation#
Bayesian estimation can yield more accurate estimates when the prior distribution and cost functions are correctly specified.
However, it may lead to incorrect inferences if these are inaccurately modeled.
On the other hand, ML estimation is often favored for its simplicity and because it depends solely on the observed data, making it easier to compute.
However, it is sometimes criticized for ignoring prior information about the parameter when such information is available.
MAP vs. ML Estimation#
To intuitively understand the difference between ML estimation and MAP estimation, let’s start with Bayes’ theorem, which relates the posterior probability distribution \( p(\alpha | \vec{y}) \) to the likelihood \( p(\vec{y} | \alpha) \) and the prior \( p(\alpha) \):
\[ p(\alpha | \vec{y}) = \frac{p(\vec{y} | \alpha) \, p(\alpha)}{p(\vec{y})} \]
Recall that:
\( p(\alpha | \vec{y}) \) is the posterior probability of the parameter \( \alpha \) given the observed data \( \vec{y} \).
\( p(\vec{y} | \alpha) \) is the likelihood of observing the data \( \vec{y} \) given the parameter \( \alpha \).
\( p(\alpha) \) is the prior probability of the parameter \( \alpha \).
\( p(\vec{y}) \) is the marginal likelihood or evidence, calculated as \( p(\vec{y}) = \int p(\vec{y} | \alpha) p(\alpha) \, d\alpha \).
MAP Estimation#
In MAP estimation, we aim to find the value of \( \alpha \) that maximizes the posterior distribution \( p(\alpha | \vec{y}) \).
This involves solving:
\[ \hat{\boldsymbol{\alpha}}_{\text{MAP}} = \arg\max_{\alpha} \, p(\alpha | \vec{y}) \]
To find this maximum, we can take the derivative of the logarithm of the posterior distribution with respect to \( \alpha \) and set it to zero:
Using Bayes’ theorem, the logarithm of the posterior becomes:
\[ \ln p(\alpha | \vec{y}) = \ln p(\vec{y} | \alpha) + \ln p(\alpha) - \ln p(\vec{y}) \]
Taking the derivative with respect to \( \alpha \):
We can observe that the term \( \ln p(\vec{y}) \) is a constant with respect to \( \alpha \) because it does not depend on \( \alpha \).
Therefore, its derivative is zero:
Simplifying, we get:
\[ \frac{\partial}{\partial \alpha} \ln p(\vec{y} | \alpha) + \frac{\partial}{\partial \alpha} \ln p(\alpha) = 0 \]
ML Estimation as a Special Case of MAP Estimation#
In ML estimation, we seek the value of \( \alpha \) that maximizes the likelihood function \( p(\vec{y} | \alpha) \), disregarding the prior \( p(\alpha) \):
This involves solving:
\[ \hat{\boldsymbol{\alpha}}_{\text{ML}} = \arg\max_{\alpha} \, p(\vec{y} | \alpha) \]
Connection Between MAP and ML#
If the prior distribution \( p(\alpha) \) is broad and relatively flat, i.e., wide dispersion, (e.g., a uniform distribution over a large interval), then \( \ln p(\alpha) \) is approximately constant with respect to \( \alpha \).
Consequently, its derivative is nearly zero:
Therefore, under a non-informative prior, the MAP estimation equation simplifies to:
\[ \frac{\partial}{\partial \alpha} \ln p(\vec{y} | \alpha) = 0 \]
This shows that the MAP estimate \( \hat{\boldsymbol{\alpha}}_{\text{MAP}} \) approaches the ML estimate \( \hat{\boldsymbol{\alpha}}_{\text{ML}} \) when the prior is non-informative.
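This convergence can be illustrated numerically. The sketch below assumes a Gaussian likelihood with known \( \sigma \) and a Gaussian prior \( \mathcal{N}(\mu_0, \tau^2) \) on \( \alpha \); the closed-form MAP expression in the code is the standard conjugate-prior result, and the data and prior values are made up:

```python
def map_estimate(ys, sigma, mu0, tau):
    """MAP estimate of the mean under a Gaussian N(mu0, tau^2) prior and a
    Gaussian likelihood with known sigma (standard conjugate-prior formula)."""
    m = len(ys)
    return (tau**2 * sum(ys) + sigma**2 * mu0) / (m * tau**2 + sigma**2)

ys = [1.0, 2.0, 3.0]                    # made-up observations
alpha_ml = sum(ys) / len(ys)            # ML estimate: the sample mean, 2.0

# As the prior widens (tau grows), the MAP estimate approaches the ML estimate
gaps = [abs(map_estimate(ys, 1.0, 0.0, tau) - alpha_ml)
        for tau in (1.0, 10.0, 1000.0)]
assert gaps[0] > gaps[1] > gaps[2]      # monotone shrinkage toward ML
```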
We can say that:
ML estimation relies solely on the likelihood function \( p(\vec{y} | \alpha) \) and does not require prior knowledge about \( \alpha \).
MAP estimation incorporates prior information through \( p(\alpha) \), potentially improving the estimate if the prior is accurate.
When Prior is Non-Informative: The influence of the prior diminishes, and ML and MAP estimates converge.
Therefore:
Advantage of ML Estimation:
ML estimation is valuable when no prior information is available
or when we prefer an estimate based purely on observed data.
Limitation of ML Estimation: Without incorporating prior knowledge (when it is available), the ML estimate may be less accurate than the MAP estimate if relevant prior information exists.
Properties of ML Estimation#
ML estimators have several important properties under regularity conditions:
Regularity Conditions#
For the following properties to hold, certain regularity conditions must be satisfied:
The likelihood function \( p(\vec{y} | \alpha) \) must be sufficiently smooth (differentiable) with respect to \( \alpha \).
The parameter space should be open, and the true parameter \( \alpha \) should be an interior point.
The Fisher information \( F(\alpha) \) must exist and be finite.
The model should not have singularities or discontinuities in the parameter space.
Efficiency#
Recall that an estimator is efficient if it achieves the lowest possible variance, i.e., minimum-variance, among all unbiased estimators for a parameter \( \alpha \).
Cramér-Rao Lower Bound (CRLB): The variance of any unbiased estimator is bounded below by the CRLB:
\[ \operatorname{Var}(\hat{\boldsymbol{\alpha}}) \geq \frac{1}{F(\alpha)} \]
where \( F(\alpha) \) is the Fisher information defined as:
\[ F(\alpha) = - E\left[ \frac{\partial^2}{\partial \alpha^2} \ln p(\vec{y} | \alpha) \right] \]
If an efficient estimator exists, it is the ML estimator.
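As a concrete illustration, for the Gaussian example above with known variance \( \sigma^2 \), the Fisher information can be computed directly:
\[ \ln p(\vec{y} | \alpha) = -\frac{m}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \alpha)^2, \qquad \frac{\partial^2}{\partial \alpha^2} \ln p(\vec{y} | \alpha) = -\frac{m}{\sigma^2}, \]
so
\[ F(\alpha) = \frac{m}{\sigma^2} \quad \Longrightarrow \quad \operatorname{Var}(\hat{\boldsymbol{\alpha}}) \geq \frac{\sigma^2}{m}, \]
which is exactly the variance attained by the sample mean.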
Invariance under Transformations#
Recall that an estimator is invariant under a transformation \( g(\alpha) \) if the estimator of \( g(\alpha) \) is obtained by applying \( g \) to the estimator of \( \alpha \). We use the following notation:
Parameter \(\alpha\)
transformation \(g(\alpha)\)
estimate of the parameter \(\hat{\boldsymbol{\alpha}}_{\text{ML}}\)
estimate of the transformation \(\hat{\mathbf{g}}_{\text{ML}}\)
We have that, if \( g(\alpha) \) is an invertible function, then
\[ \hat{\mathbf{g}}_{\text{ML}} = g(\hat{\boldsymbol{\alpha}}_{\text{ML}}) \]
Proof
Suppose \( \hat{\boldsymbol{\alpha}}_{\text{ML}} \) is the ML estimate of \( \alpha \), and \( g(\alpha) \) is an invertible function.
We want to prove that the ML estimate of \( g(\alpha) \) is:
\[ \hat{\mathbf{g}}_{\text{ML}} = g(\hat{\boldsymbol{\alpha}}_{\text{ML}}) \]
Let \( \theta = g(\alpha) \).
The likelihood function for \( \theta \) is:
\[ p(\vec{y} | \theta) = p(\vec{y} | \alpha) \]
where \( \alpha = g^{-1}(\theta) \).
Maximizing \( p(\vec{y} | \theta) \) with respect to \( \theta \) is equivalent to maximizing \( p(\vec{y} | \alpha) \) with respect to \( \alpha \).
Therefore, \( \hat{\theta}_{\text{ML}} = g(\hat{\boldsymbol{\alpha}}_{\text{ML}}) \).
Application. This property allows us to directly obtain the ML estimate of any invertible transformation of \( \alpha \) by applying the transformation to \( \hat{\boldsymbol{\alpha}}_{\text{ML}} \).
Generalized Likelihood Ratio Test (GLRT): The invariance property of ML estimators facilitates hypothesis testing using the GLRT, where unknown parameters are replaced by their ML estimates.
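The invariance property can also be checked numerically. The sketch below (made-up data; \( g(\alpha) = e^{\alpha} \) chosen as an arbitrary invertible map) maximizes the re-parameterized likelihood over \( \theta \) by brute force and compares the result with \( g(\hat{\alpha}_{\text{ML}}) \):

```python
import math

ys = [1.2, 0.8, 1.0, 1.4]               # made-up observations, sigma = 1
sigma = 1.0

def loglik_theta(theta, ys):
    """Log-likelihood re-parameterized via the invertible map theta = exp(alpha),
    so alpha = ln(theta); constant terms are dropped."""
    alpha = math.log(theta)
    return -sum((y - alpha) ** 2 for y in ys) / (2 * sigma**2)

alpha_ml = sum(ys) / len(ys)               # ML estimate of alpha: sample mean 1.1
theta_by_invariance = math.exp(alpha_ml)   # invariance route: g(alpha_ML)

# Direct route: maximize the re-parameterized likelihood over theta
grid = [1.0 + i / 10000 for i in range(30000)]   # 1.0 .. 4.0
theta_direct = max(grid, key=lambda t: loglik_theta(t, ys))

# Both routes agree up to the grid resolution
assert abs(theta_direct - theta_by_invariance) < 1e-3
```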
Asymptotic Properties#
Under regularity conditions, ML estimators exhibit desirable asymptotic behavior as the sample size \( m \) increases, i.e., as \(m\to \infty\):
Asymptotic Efficiency#
ML estimators are asymptotically efficient, meaning they achieve the CRLB as the sample size \( m \) approaches infinity.
Consistency#
Recall that an estimator \( \hat{\boldsymbol{\alpha}} \) is consistent if it converges in probability to the true parameter value \( \alpha \) as \( m \to \infty \):
\[ \hat{\boldsymbol{\alpha}} \xrightarrow{p} \alpha \]
ML estimators are consistent under certain conditions, meaning they become increasingly accurate as more data are collected.
Asymptotic Normality#
We define that an estimator is asymptotically normal if the distribution of \( \sqrt{m} (\hat{\boldsymbol{\alpha}} - \alpha) \) converges to a normal distribution as \( m \to \infty \):
\[ \sqrt{m} (\hat{\boldsymbol{\alpha}} - \alpha) \xrightarrow{d} \mathcal{N}\left( 0, \frac{1}{F(\alpha)} \right) \]
Asymptotic normality of the ML estimator: its distribution approaches a normal distribution centered at \( \alpha \) with variance \( 1 / (m F(\alpha)) \).
Best Asymptotically Normal (BAN)#
We define that an estimator is BAN if it is asymptotically normal with the smallest possible variance among all consistent estimators.
ML estimators achieve the lowest asymptotic variance permitted by the CRLB, making them BAN estimators.
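These asymptotic properties can be illustrated with a small Monte Carlo sketch (simulation parameters are made up): replicating the ML estimate of a Gaussian mean many times and checking that its variance is close to \( 1/(m F(\alpha)) = \sigma^2/m \):

```python
import random
import statistics

random.seed(0)

sigma = 1.0          # known standard deviation
true_alpha = 0.0     # true mean to be estimated

def ml_estimate(m):
    """ML estimate (sample mean) from m i.i.d. N(true_alpha, sigma^2) draws."""
    ys = [random.gauss(true_alpha, sigma) for _ in range(m)]
    return sum(ys) / m

m = 100
estimates = [ml_estimate(m) for _ in range(2000)]

# Empirical variance of the estimator is close to the CRLB sigma^2 / m = 0.01
empirical_var = statistics.variance(estimates)
```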
Function of Sufficient Statistics#
Recall that a statistic is sufficient for a parameter \( \alpha \) if it captures all the information about \( \alpha \) present in the data \( \vec{\mathbf{y}} \).
ML estimators are functions of sufficient statistics.
This means that the ML estimate can be calculated from these statistics without any loss of information.
Example: Maximum Likelihood Estimation of the Mean in a Gaussian Distribution#
This example, based on [B2. Ex. 10.14], repeats the above example in greater detail (and with different notation).
It also continues from a previous example, C3.7, to compare with Bayes and MAP estimators.
Let’s consider the problem of estimating the true mean \( \mu \) of a Gaussian (normal) distribution with a known variance \( \sigma^2 \), based on \( m \) i.i.d. observations \( \mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_m \).
Each observation \( \mathbf{y}_i \) follows the probability density function (pdf):
\[ p(y_i | \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right) \]
Formulate the Likelihood Function#
Since the observations are independent, the joint pdf (likelihood function) of all observations given \( \mu \) is the product of the individual pdfs:
Simplify the expression:
Compute the Log-Likelihood Function#
Taking the natural logarithm of the likelihood function to simplify differentiation:
Simplify further:
Differentiate the Log-Likelihood with Respect to \( \mu \)#
Compute the derivative:
Calculate the derivative inside the summation:
Therefore, the derivative becomes:
Set the Derivative to Zero to Find the Maximum#
Set the derivative equal to zero to find the maximum likelihood estimate \( \hat{\boldsymbol{\mu}}_{\text{ML}} \).
Since a realization \(\hat{\mu}_{\text{ML}}\) is a root of this equation, we can rewrite it as:
Multiply both sides by \( \sigma^2 \):
Simplify the summation:
Solve for \( \hat{\mu} \):
The maximum likelihood estimate of \( \boldsymbol{\mu} \) is the sample mean \( \hat{\boldsymbol{\mu}}_{\text{ML}} \):
\[ \hat{\mu}_{\text{ML}} = \frac{1}{m} \sum_{i=1}^{m} y_i = \bar{y} \]
Examine the Properties of the Sample Mean as an Estimator#
The sample mean \( \hat{\mu}_{\text{ML}} \) possesses several desirable properties (as discussed above for a general ML estimator):
Unbiasedness#
The expected value of the sample mean equals the true mean, indicating that it is an unbiased estimator.
Consistency#
The estimator converges in probability to the true mean as the sample size increases.
Efficiency#
The sample mean achieves the minimum possible variance among all unbiased estimators of \( \mu \).
It attains the CRLB
\[ \operatorname{Var}(\hat{\mu}_{\text{ML}}) = \frac{\sigma^2}{m} \]
Best Asymptotically Normal (BAN)#
As \( m \to \infty \), the distribution of \( \hat{\mu}_{\text{ML}} \) approaches a normal distribution:
\[ \sqrt{m} (\hat{\mu}_{\text{ML}} - \mu) \xrightarrow{d} \mathcal{N}\left( 0, \sigma^2 \right) \]
It is the best estimator in terms of asymptotic normality.
Sufficiency#
The sample mean is a sufficient statistic for \( \mu \).
Indeed, we have the factorization
\[ p(\vec{y} | \mu) = \underbrace{(2\pi\sigma^2)^{-m/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \bar{y})^2 \right)}_{h(\vec{y})} \cdot \underbrace{\exp\left( -\frac{m (\bar{y} - \mu)^2}{2\sigma^2} \right)}_{g(\bar{y}, \mu)} \]
\( h(\vec{y}) \): This part of the factorization depends only on the data \( \vec{y} \) and not on the parameter \( \mu \).
\( g(\bar{y}, \mu) \): This part depends on the data only through the sample mean \( \bar{y} \) and the parameter \( \mu \).
Thus, according to the Fisher–Neyman factorization theorem, since the joint pdf can be expressed as a product of a function that depends on the data only through \( \bar{y} \) and another function that does not depend on \( \mu \), the sample mean \( \bar{y} \) is a sufficient statistic for \( \mu \).
Furthermore, the sufficiency of \( \bar{y} \) simplifies analysis and computation because it reduces the data \( \mathbf{y} \) to a single summary statistic \( \bar{y} \) without losing information about \( \mu \).
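A quick numerical illustration of sufficiency (with made-up data): two different datasets that share the same sample mean produce the same log-likelihood ratio between any two candidate means, since that ratio depends on the data only through \( \bar{y} \):

```python
import math

def loglik(mu, ys, sigma=1.0):
    """Gaussian log-likelihood with known sigma."""
    return -0.5 * len(ys) * math.log(2 * math.pi * sigma**2) \
        - sum((y - mu) ** 2 for y in ys) / (2 * sigma**2)

a = [1.0, 2.0, 3.0]   # sample mean 2.0
b = [0.0, 2.0, 4.0]   # different data, same sample mean 2.0

# The log-likelihood ratio between two candidate means depends on the data
# only through the sample mean, so it matches for both datasets:
ratio_a = loglik(0.0, a) - loglik(1.0, a)
ratio_b = loglik(0.0, b) - loglik(1.0, b)
assert abs(ratio_a - ratio_b) < 1e-9
```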
Generalization: When the ML Estimator Is Undesirable#
In this scenario, the maximum likelihood estimator coincides with the sample mean, which has all the properties of an excellent estimator.
However, the favorable properties observed here do not always hold for ML estimators in other contexts.
In some situations, the ML estimator may be biased.
Sample Variance#
Consider estimating the variance \( \sigma^2 \) when \( \mu \) is unknown.
The ML estimate of \( \sigma^2 \) is:
\[ \hat{\sigma}^2_{\text{ML}} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{\mu}_{\text{ML}})^2 \]
This estimator is biased because:
\[ E[\hat{\sigma}^2_{\text{ML}}] = \left( \frac{m - 1}{m} \right) \sigma^2 < \sigma^2 \]
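This bias is easy to observe in simulation. A minimal sketch (simulation parameters are made up: \( m = 5 \), \( \sigma^2 = 1 \)):

```python
import random

random.seed(1)

m, reps = 5, 20000

def sigma2_ml():
    """ML variance estimate from m i.i.d. N(0, 1) draws (divides by m)."""
    ys = [random.gauss(0.0, 1.0) for _ in range(m)]
    ybar = sum(ys) / m
    return sum((y - ybar) ** 2 for y in ys) / m

avg = sum(sigma2_ml() for _ in range(reps)) / reps
# avg is close to (m - 1)/m * sigma^2 = 0.8, visibly below the true sigma^2 = 1
```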
Sample Mean Squared#
Estimating \( \mu^2 \) using \( (\hat{\mu}_{\text{ML}})^2 \) results in a biased estimator.
The bias arises because:
\[ E[(\hat{\mu}_{\text{ML}})^2] = \mu^2 + \frac{\sigma^2}{m} \]
which is greater than \( \mu^2 \).
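This second bias can be checked the same way. A minimal Monte Carlo sketch (made-up parameters: \( m = 4 \), \( \mu = 1 \), \( \sigma^2 = 1 \)):

```python
import random

random.seed(2)

m, mu, sigma, reps = 4, 1.0, 1.0, 20000

def mu_hat_squared():
    """Square of the ML mean estimate from m i.i.d. N(mu, sigma^2) draws."""
    ys = [random.gauss(mu, sigma) for _ in range(m)]
    return (sum(ys) / m) ** 2

avg = sum(mu_hat_squared() for _ in range(reps)) / reps
# avg is close to mu^2 + sigma^2/m = 1.25, above the target mu^2 = 1.0
```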
While the ML estimation of the mean in a Gaussian distribution with known variance is an ideal estimator, this is not universally the case for all parameters or distributions.
In some cases, ML estimators can be biased or lack efficiency.
Therefore, it is important to analyze the properties of the ML estimator in each specific context to understand its suitability and potential limitations.