Maximum Likelihood (ML) Estimation#
Bayesian estimation relies on the posterior probability distribution, which in turn requires knowledge of the prior distribution and the likelihood function.
In contrast, ML estimation requires only knowledge of the likelihood function, without the need for any prior distribution or cost function.
Formally, ML estimation focuses solely on maximizing the likelihood function.
Mathematically, the ML estimator is given by:

$$\hat{\theta}_{ML} = \arg\max_{\theta} \, p(\mathbf{x} \mid \theta)$$

where $\mathbf{x}$ denotes the observed data and $\theta$ the unknown parameter.
Determine ML Estimate#
Suppose we have $N$ independent observations $x_1, x_2, \dots, x_N$, each drawn from a distribution with pdf $p(x_n \mid \theta)$ that depends on an unknown parameter $\theta$.
Since the observations are independent, the joint pdf of all observations (i.e., the likelihood function) is the product of the individual pdfs:

$$p(\mathbf{x} \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)$$

The maximum-likelihood estimate $\hat{\theta}_{ML}$ is the value of $\theta$ that maximizes this likelihood function.
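Before turning to the closed-form derivations below, the following sketch shows how an ML estimate can also be obtained numerically by maximizing the log-likelihood with a generic optimizer. The Gaussian model, the specific numbers, and the use of `scipy.optimize.minimize_scalar` are illustrative assumptions rather than part of the original example.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Simulated data: N independent Gaussian observations (assumed model)
rng = np.random.default_rng(0)
true_mean, sigma, N = 2.0, 1.5, 100
x = rng.normal(true_mean, sigma, size=N)

# Negative log-likelihood as a function of the unknown mean
def neg_log_likelihood(mu):
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# Numerical ML estimate: minimize the negative log-likelihood
result = minimize_scalar(neg_log_likelihood, bounds=(-10, 10), method="bounded")
print("numerical ML estimate:", result.x)
print("sample mean          :", x.mean())  # the two should agree closely
```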
Derivation Directly From the Likelihood Function#
The likelihood function, i.e., the joint probability of observing the data given the parameter, can be maximized directly:
1. Compute the derivative of the likelihood function with respect to the parameter and set it equal to zero.
2. Solve the resulting equation for the parameter; the solution is the ML estimate.
Derivation From the Log-Likelihood Function#
Since the logarithm is a monotonic function, we may obtain a simpler derivation by using the log-likelihood function.
Specifically, take the natural logarithm of the likelihood function to obtain the log-likelihood function:

$$\ln p(\mathbf{x} \mid \theta) = \sum_{n=1}^{N} \ln p(x_n \mid \theta)$$

Note that the logarithm turns the product into a sum, which is easier to differentiate.
1. Compute the derivative of the log-likelihood with respect to the parameter and set it equal to zero.
2. Solve the resulting equation for the parameter; the solution is the ML estimate.
Note that this method only works if the (log-)likelihood function is differentiable with respect to the parameter and the maximum occurs at a stationary point.
Example with Normally Distributed Observations#
Assume each observation $x_n$ is drawn independently from a Gaussian distribution with unknown mean $\mu$ and known variance $\sigma^2$.
Each individual PDF is given by

$$p(x_n \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_n - \mu)^2}{2\sigma^2}\right)$$

Method 1: Compute directly from the likelihood function
Thus, the likelihood function is obtained as

$$L(\mu) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_n - \mu)^2}{2\sigma^2}\right)$$

Notice that each term in the product is an exponential function. Thus, the entire product can be rewritten as:

$$L(\mu) = (2\pi\sigma^2)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right)$$

Differentiate $L(\mu)$ with respect to $\mu$:

$$\frac{dL(\mu)}{d\mu} = L(\mu) \cdot \frac{d}{d\mu}\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right)$$

Simplifying the derivative inside:

$$\frac{d}{d\mu}\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right) = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu)$$

Therefore:

$$\frac{dL(\mu)}{d\mu} = \frac{L(\mu)}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu)$$

To find the maximum likelihood estimate (MLE), set this derivative equal to zero:

$$\frac{L(\hat{\mu})}{\sigma^2}\sum_{n=1}^{N}(x_n - \hat{\mu}) = 0$$

Since $L(\hat{\mu}) > 0$, the derivative vanishes only when $\sum_{n=1}^{N}(x_n - \hat{\mu}) = 0$, which gives

$$\hat{\mu}_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$$
Method 2: Compute using the log-likelihood function
The log-likelihood function is

$$\ln L(\mu) = \sum_{n=1}^{N} \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_n - \mu)^2}{2\sigma^2}\right)\right]$$

Simplifying the sum, we have

$$\ln L(\mu) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2$$

Next, we differentiate the log-likelihood function to find its maximum.
Specifically, we compute the derivative with respect to $\mu$:

$$\frac{d\ln L(\mu)}{d\mu} = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}\frac{d}{d\mu}(x_n - \mu)^2$$

Simplify:

$$\frac{d\ln L(\mu)}{d\mu} = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu)$$

Set the derivative equal to zero, assuming $\sigma^2$ is known and positive:

$$\frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \hat{\mu}) = 0$$

Multiply both sides by $\sigma^2$:

$$\sum_{n=1}^{N}(x_n - \hat{\mu}) = 0$$

Simplify the sum:

$$\sum_{n=1}^{N} x_n - N\hat{\mu} = 0$$

Solve for $\hat{\mu}$:

$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n$$

Using either method, the maximum likelihood estimate of the unknown mean is the sample mean:

$$\hat{\mu}_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$$
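For readers who want to verify the algebra of Method 2 symbolically, here is a minimal sketch assuming SymPy; the variable names and the small value of $N$ are illustrative choices.

```python
import sympy as sp

# Symbols: unknown mean mu, known standard deviation sigma, and N observations x1..xN
mu, sigma = sp.symbols("mu sigma", positive=True)
N = 5  # a small N keeps the symbolic expressions readable
x = sp.symbols(f"x1:{N + 1}")

# Log-likelihood of N independent N(mu, sigma^2) observations
log_likelihood = sum(
    sp.log(1 / sp.sqrt(2 * sp.pi * sigma**2)) - (xi - mu) ** 2 / (2 * sigma**2)
    for xi in x
)

# Differentiate with respect to mu, set to zero, and solve
stationary_point = sp.solve(sp.diff(log_likelihood, mu), mu)
print(stationary_point)  # -> the sample mean of x1..x5
```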
Discussion: The Use of Logarithms
The natural logarithm is a monotonic function, meaning it preserves the order of the likelihood values.
Maximizing the log-likelihood is equivalent to maximizing the likelihood itself.
Using the log-likelihood simplifies calculations, especially when dealing with products of probabilities, as it converts them into sums.
For example, in the general case, when we attempt to differentiate a product of multiple functions with respect to $\theta$, we must apply the product rule from calculus. Specifically, for a product of two functions $f(\theta)$ and $g(\theta)$, the derivative is:

$$\frac{d}{d\theta}\big[f(\theta)\,g(\theta)\big] = f'(\theta)\,g(\theta) + f(\theta)\,g'(\theta)$$

Extending this to $N$ terms becomes increasingly complex, as the product rule must be applied iteratively. The number of terms in the derivative grows with $N$, making the calculation cumbersome and error-prone for large $N$.
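Logarithms also help numerically: multiplying many small probabilities underflows in floating point, while summing their logarithms does not. A minimal sketch assuming NumPy, with illustrative numbers:

```python
import numpy as np

# 1000 small probability densities, e.g. evaluated likelihood terms
p = np.full(1000, 1e-5)

print(np.prod(p))         # 0.0 -- the product underflows in floating point
print(np.sum(np.log(p)))  # about -11512.9 -- the log-likelihood is still well defined
```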
Discussion: Applicability
This method is applicable when the derivatives of the likelihood or log-likelihood functions exist.
In cases where the likelihood function is complex or does not have a closed-form solution, numerical optimization techniques may be employed to find the maximum likelihood estimate.
ML estimation provides a practical and efficient method for parameter estimation based solely on observed data, without requiring prior distributions or cost functions.
Bayes vs. ML Estimation#
Bayesian estimation can yield more accurate estimates when the prior distribution and cost functions are correctly specified.
However, it may lead to incorrect inferences if these are inaccurately modeled.
On the other hand, ML estimation is often favored for its simplicity and because it depends solely on the observed data, making it easier to compute.
However, it is sometimes criticized for ignoring prior information about the parameter (when such prior information is available).
MAP vs. ML Estimation#
To intuitively understand the difference between ML estimation and MAP estimation, let’s start with Bayes’ theorem, which relates the posterior probability distribution to the likelihood, the prior, and the evidence:

$$p(\theta \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \theta)\, p(\theta)}{p(\mathbf{x})}$$

Recall that:
- $p(\theta \mid \mathbf{x})$ is the posterior probability of the parameter $\theta$ given the observed data $\mathbf{x}$.
- $p(\mathbf{x} \mid \theta)$ is the likelihood of observing the data $\mathbf{x}$ given the parameter $\theta$.
- $p(\theta)$ is the prior probability of the parameter $\theta$.
- $p(\mathbf{x})$ is the marginal likelihood or evidence, calculated as $p(\mathbf{x}) = \int p(\mathbf{x} \mid \theta)\, p(\theta)\, d\theta$.
MAP Estimation#
In MAP estimation, we aim to find the value of $\theta$ that maximizes the posterior distribution $p(\theta \mid \mathbf{x})$.
This involves solving:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \, p(\theta \mid \mathbf{x})$$

To find this maximum, we can take the derivative of the logarithm of the posterior distribution with respect to $\theta$ and set it to zero:

$$\frac{\partial}{\partial \theta} \ln p(\theta \mid \mathbf{x}) = 0$$

Using Bayes’ theorem, the logarithm of the posterior becomes:

$$\ln p(\theta \mid \mathbf{x}) = \ln p(\mathbf{x} \mid \theta) + \ln p(\theta) - \ln p(\mathbf{x})$$

Taking the derivative with respect to $\theta$:

$$\frac{\partial}{\partial \theta} \ln p(\mathbf{x} \mid \theta) + \frac{\partial}{\partial \theta} \ln p(\theta) - \frac{\partial}{\partial \theta} \ln p(\mathbf{x}) = 0$$

We can observe that the term $\ln p(\mathbf{x})$ does not depend on $\theta$.
Therefore, its derivative is zero:

$$\frac{\partial}{\partial \theta} \ln p(\mathbf{x}) = 0$$

Simplifying, we get the MAP estimation equation:

$$\frac{\partial}{\partial \theta} \ln p(\mathbf{x} \mid \theta) + \frac{\partial}{\partial \theta} \ln p(\theta) = 0$$
ML Estimation as a Special Case of MAP Estimation#
In ML estimation, we seek the value of $\theta$ that maximizes the likelihood function $p(\mathbf{x} \mid \theta)$ alone.
This involves solving:

$$\hat{\theta}_{ML} = \arg\max_{\theta} \, p(\mathbf{x} \mid \theta), \qquad \text{or equivalently} \qquad \frac{\partial}{\partial \theta} \ln p(\mathbf{x} \mid \theta) = 0$$
Connection Between MAP and ML#
If the prior distribution $p(\theta)$ is non-informative (e.g., essentially flat over the region where the likelihood is significant), then $\ln p(\theta)$ is approximately constant with respect to $\theta$.
Consequently, its derivative is nearly zero:

$$\frac{\partial}{\partial \theta} \ln p(\theta) \approx 0$$

Therefore, under a non-informative prior, the MAP estimation equation simplifies to:

$$\frac{\partial}{\partial \theta} \ln p(\mathbf{x} \mid \theta) = 0$$

This shows that the MAP estimate $\hat{\theta}_{MAP}$ reduces to the ML estimate $\hat{\theta}_{ML}$ when the prior is non-informative.
We can say that:
- ML estimation relies solely on the likelihood function $p(\mathbf{x} \mid \theta)$ and does not require prior knowledge about $\theta$.
- MAP estimation incorporates prior information through $p(\theta)$, potentially improving the estimate if the prior is accurate.
- When the prior is non-informative, the influence of the prior diminishes, and the ML and MAP estimates converge.
Therefore:
Advantage of ML Estimation: ML estimation is valuable when no prior information is available, or when we prefer an estimate based purely on observed data.
Limitation of ML Estimation: Without incorporating available prior knowledge, the ML estimate may be less accurate than the MAP estimate when relevant prior information exists.
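The following sketch illustrates this convergence numerically for the Gaussian-mean problem with a conjugate Gaussian prior on the mean; the model, the prior, and the numbers are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma, N = 2.0, 1.0, 20
x = rng.normal(true_mean, sigma, size=N)

# ML estimate: the sample mean
ml_estimate = x.mean()

# MAP estimate with a conjugate Gaussian prior N(mu0, tau2) on the mean
def map_estimate(mu0, tau2):
    # Posterior mode (= posterior mean) for a Gaussian likelihood with known sigma^2
    return (N * x.mean() / sigma**2 + mu0 / tau2) / (N / sigma**2 + 1 / tau2)

print("ML :", ml_estimate)
for tau2 in [0.01, 1.0, 100.0]:
    print(f"MAP (prior variance {tau2:>6}):", map_estimate(mu0=0.0, tau2=tau2))
# As the prior variance increases (non-informative prior), MAP approaches ML.
```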
Properties of ML Estimation#
ML estimators have several important properties under regularity conditions:
Regularity Conditions#
For the following properties to hold, certain regularity conditions must be satisfied:
- The likelihood function $p(\mathbf{x} \mid \theta)$ must be sufficiently smooth (differentiable) with respect to $\theta$.
- The parameter space should be open, and the true parameter $\theta$ should be an interior point.
- The Fisher information $I(\theta)$ must exist and be finite.
- The model should not have singularities or discontinuities in the parameter space.
Efficiency#
Recall that an estimator is efficient if it achieves the lowest possible variance, i.e., the minimum variance, among all unbiased estimators of a parameter $\theta$.
Cramér-Rao Lower Bound (CRLB): The variance of any unbiased estimator $\hat{\theta}$ is bounded below by the CRLB:

$$\mathrm{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

where $I(\theta)$ is the Fisher information, defined as:

$$I(\theta) = E\left[\left(\frac{\partial \ln p(\mathbf{x} \mid \theta)}{\partial \theta}\right)^2\right]$$

If an efficient estimator exists, it is the ML estimator.
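As a quick numerical check of the CRLB for the Gaussian-mean example with known variance (an assumed setup for illustration), the empirical variance of the sample mean over repeated experiments should match $\sigma^2 / N$:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, N, trials = 1.0, 50, 20000

# Monte Carlo: variance of the ML estimator (the sample mean) over repeated experiments
estimates = rng.normal(0.0, sigma, size=(trials, N)).mean(axis=1)
print("empirical variance of ML estimate:", estimates.var())
print("CRLB sigma^2 / N                 :", sigma**2 / N)  # the two should match closely
```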
Invariance under Transformations#
Recall that an estimator is invariant under a transformation if applying the transformation to the parameter estimate yields the estimate of the transformed parameter. We use the following notation:
- Parameter: $\theta$
- Transformation: $\alpha = g(\theta)$
- Estimate of the parameter: $\hat{\theta}$
- Estimate of the transformation: $\hat{\alpha}$
We have that, if $\hat{\theta}_{ML}$ is the ML estimate of $\theta$, then the ML estimate of $\alpha = g(\theta)$ is $\hat{\alpha}_{ML} = g(\hat{\theta}_{ML})$.
Proof
Given that $\hat{\theta}_{ML}$ maximizes the likelihood function $p(\mathbf{x} \mid \theta)$.
We want to prove that the ML estimate of $\alpha = g(\theta)$ is $g(\hat{\theta}_{ML})$.
Let $g$ be an invertible transformation, so that $\theta = g^{-1}(\alpha)$.
The likelihood function for $\alpha$ is

$$L(\alpha) = p\big(\mathbf{x} \mid g^{-1}(\alpha)\big),$$

where $g^{-1}(\alpha)$ is the parameter value corresponding to $\alpha$.
Maximizing $L(\alpha)$ over $\alpha$ is therefore equivalent to maximizing $p(\mathbf{x} \mid \theta)$ over $\theta$; the maximum is attained at $\theta = \hat{\theta}_{ML}$, i.e., at $\alpha = g(\hat{\theta}_{ML})$.
Therefore, $\hat{\alpha}_{ML} = g(\hat{\theta}_{ML})$.
Application. This property allows us to directly obtain the ML estimate of any invertible transformation of $\theta$ by applying the transformation to $\hat{\theta}_{ML}$.
Generalized Likelihood Ratio Test (GLRT): The invariance property of ML estimators facilitates hypothesis testing using the GLRT, where unknown parameters are replaced by their ML estimates.
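A small numerical sketch of the invariance property, assuming the Gaussian-mean model from the earlier example and an illustrative transformation $\alpha = e^{\mu}$: maximizing the likelihood directly over $\alpha$ agrees with transforming the ML estimate of $\mu$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma, N = 1.0, 200
x = rng.normal(2.0, sigma, size=N)

# ML estimate of the mean, then apply the transformation alpha = exp(mu)
mu_ml = x.mean()
alpha_from_invariance = np.exp(mu_ml)

# Directly maximize the likelihood reparameterized in alpha (mu = log(alpha))
def neg_log_likelihood_alpha(alpha):
    return -np.sum(norm.logpdf(x, loc=np.log(alpha), scale=sigma))

alpha_direct = minimize_scalar(neg_log_likelihood_alpha, bounds=(1e-6, 100), method="bounded").x

print(alpha_from_invariance, alpha_direct)  # the two estimates agree (up to solver tolerance)
```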
Asymptotic Properties#
Under regularity conditions, ML estimators exhibit desirable asymptotic behavior as the sample size $N$ grows.
Asymptotic Efficiency#
ML estimators are asymptotically efficient, meaning they achieve the CRLB as the sample size $N$ approaches infinity.
Consistency#
Recall that an estimator $\hat{\theta}$ is consistent if it converges in probability to the true parameter value as $N \to \infty$:

$$\hat{\theta} \xrightarrow{\;p\;} \theta \quad \text{as } N \to \infty$$

ML estimators are consistent under certain conditions, meaning they become increasingly accurate as more data are collected.
Asymptotic Normality#
We define that an estimator is asymptotically normal if the distribution of its estimation error converges to a normal distribution as $N \to \infty$.
Asymptotic normality of the ML estimator: the distribution of the ML estimate approaches a normal distribution centered at the true parameter $\theta$ with variance $1/I(\theta)$, the inverse of the Fisher information:

$$\hat{\theta}_{ML} \;\xrightarrow{\;d\;}\; \mathcal{N}\!\left(\theta,\; \frac{1}{I(\theta)}\right) \quad \text{as } N \to \infty$$
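A Monte Carlo sketch of asymptotic normality for the Gaussian-mean example (an assumed setup): the standardized ML estimates behave approximately like a standard normal variable.

```python
import numpy as np

rng = np.random.default_rng(4)
true_mean, sigma, N, trials = 2.0, 1.0, 100, 50000

# ML estimates (sample means) over many repeated experiments
estimates = rng.normal(true_mean, sigma, size=(trials, N)).mean(axis=1)

# Standardize by the CRLB standard deviation sqrt(sigma^2 / N)
z = (estimates - true_mean) / np.sqrt(sigma**2 / N)
print("mean of z:", z.mean())  # close to 0
print("std  of z:", z.std())   # close to 1, as asymptotic normality predicts
```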
Best Asymptotically Normal (BAN)#
We define that an estimator is BAN if it is asymptotically normal with the smallest possible variance among all consistent estimators.
ML estimators achieve the lowest asymptotic variance permitted by the CRLB, making them BAN estimators.
Function of Sufficient Statistics#
Recall that a statistic $T(\mathbf{x})$ is sufficient for a parameter $\theta$ if it captures all the information about $\theta$ present in the data $\mathbf{x}$.
ML estimators are functions of sufficient statistics.
This means that the ML estimate can be calculated from these statistics without any loss of information.
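To make this concrete for the Gaussian-mean example (an assumed setup), the sketch below evaluates the log-likelihood from the full data set and from the sufficient statistic alone (the sample mean, together with the sample size); the two differ only by a constant that does not depend on the mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sigma, N = 1.0, 30
x = rng.normal(2.0, sigma, size=N)
xbar = x.mean()  # sufficient statistic for the mean (together with N)

def log_lik_full(mu):
    # Log-likelihood evaluated from the full data set
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

def log_lik_sufficient(mu):
    # The mu-dependent part of the log-likelihood, using only (xbar, N)
    return -N * (xbar - mu) ** 2 / (2 * sigma**2)

# The difference is the same constant for every mu: the dependence on mu
# is carried entirely by the sample mean.
for mu in [0.0, 1.0, 2.0, 3.0]:
    print(mu, log_lik_full(mu) - log_lik_sufficient(mu))
```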
Example: Maximum Likelihood Estimation of the Mean in a Gaussian Distribution#
This example, based on [B2. Ex. 10.14], repeats the above example in greater detail.
It also continues from a previous example, C3.7, to compare with Bayes and MAP estimators.
Let’s consider the problem of estimating the true mean $\mu$ of a Gaussian distribution with known variance $\sigma^2$, given $N$ independent observations $x_1, \dots, x_N$.
Each observation $x_n$ is normally distributed with mean $\mu$ and variance $\sigma^2$.
Formulate the Likelihood Function#
Since the observations are independent, the joint pdf (likelihood function) of all observations given the mean $\mu$ is:

$$p(\mathbf{x} \mid \mu) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_n - \mu)^2}{2\sigma^2}\right)$$

Simplify the expression:

$$p(\mathbf{x} \mid \mu) = (2\pi\sigma^2)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right)$$
Compute the Log-Likelihood Function#
Taking the natural logarithm of the likelihood function to simplify differentiation:

$$\ln p(\mathbf{x} \mid \mu) = \ln\left[(2\pi\sigma^2)^{-N/2}\right] - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2$$

Simplify further:

$$\ln p(\mathbf{x} \mid \mu) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2$$
Differentiate the Log-Likelihood with Respect to the Mean#
Compute the derivative:

$$\frac{\partial \ln p(\mathbf{x} \mid \mu)}{\partial \mu} = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}\frac{\partial}{\partial \mu}(x_n - \mu)^2$$

Calculate the derivative inside the summation:

$$\frac{\partial}{\partial \mu}(x_n - \mu)^2 = -2(x_n - \mu)$$

Therefore, the derivative becomes:

$$\frac{\partial \ln p(\mathbf{x} \mid \mu)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu)$$
Set the Derivative to Zero to Find the Maximum#
Set the derivative equal to zero to find the maximum likelihood estimate $\hat{\mu}$:

$$\frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \hat{\mu}) = 0$$

Since the data are a fixed realization of the observations and $\sigma^2 > 0$, we can solve this equation for $\hat{\mu}$.
Multiply both sides by $\sigma^2$:

$$\sum_{n=1}^{N}(x_n - \hat{\mu}) = 0$$

Simplify the summation:

$$\sum_{n=1}^{N} x_n - N\hat{\mu} = 0$$

Solve for $\hat{\mu}$:

$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n$$

The maximum likelihood estimate of $\mu$ is thus the sample mean of the observations.
Examine the Properties of the Sample Mean as an Estimator#
The sample mean $\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n$ has several desirable properties as an estimator of $\mu$:
Unbiasedness#
The expected value of the sample mean equals the true mean, indicating that it is an unbiased estimator.
Consistency#
The estimator converges in probability to the true mean as the sample size increases.
Efficiency#
The sample mean achieves the minimum possible variance among all unbiased estimators of $\mu$. It attains the CRLB:

$$\mathrm{Var}(\hat{\mu}) = \frac{\sigma^2}{N}$$
Best Asymptotically Normal (BAN)#
As $N \to \infty$, the distribution of $\hat{\mu}$ approaches a normal distribution:

$$\hat{\mu} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right)$$

It is the best estimator in terms of asymptotic normality.
Sufficiency#
The sample mean is a sufficient statistic for $\mu$.
Indeed, using $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$ and the identity $\sum_{n=1}^{N}(x_n - \mu)^2 = \sum_{n=1}^{N}(x_n - \bar{x})^2 + N(\bar{x} - \mu)^2$, we have the factorization

$$p(\mathbf{x} \mid \mu) = \underbrace{(2\pi\sigma^2)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \bar{x})^2\right)}_{h(\mathbf{x})} \cdot \underbrace{\exp\left(-\frac{N}{2\sigma^2}(\bar{x} - \mu)^2\right)}_{g(\bar{x},\, \mu)}$$

- $h(\mathbf{x})$: this part of the factorization depends only on the data and not on the parameter $\mu$.
- $g(\bar{x}, \mu)$: this part depends on the data only through the sample mean $\bar{x}$ and the parameter $\mu$.
Thus, according to the Fisher Factorization Theorem, since the joint pdf can be expressed as a product of a function that depends on the data only through $\bar{x}$ and a function that does not depend on $\mu$, the sample mean $\bar{x}$ is a sufficient statistic for $\mu$.
Furthermore, the sufficiency of $\bar{x}$ means that no other statistic computed from the data can provide additional information about $\mu$.
Generalization. ML Estimator as Undesirable Estimator#
In this scenario, the maximum likelihood estimator coincides with the sample mean, which has all the properties of an excellent estimator.
However, the favorable properties observed here do not always hold for ML estimators in other contexts.
In some situations, the ML estimator may be biased.
Sample Variance#
Estimating the variance $\sigma^2$ of a Gaussian distribution when the mean is also unknown is a classic case.
The ML estimate of $\sigma^2$ is:

$$\hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})^2$$

This estimator is biased because:

$$E\left[\hat{\sigma}^2_{ML}\right] = \frac{N-1}{N}\sigma^2 \neq \sigma^2$$
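A quick Monte Carlo sketch of this bias (illustrative setup assumed): the ML variance estimate is smaller than the true variance on average, by the factor $(N-1)/N$.

```python
import numpy as np

rng = np.random.default_rng(6)
true_var, N, trials = 4.0, 10, 100000

data = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
ml_var = data.var(axis=1, ddof=0)        # ML estimate: divide by N
unbiased_var = data.var(axis=1, ddof=1)  # unbiased estimate: divide by N - 1

print("mean of ML estimates       :", ml_var.mean())       # about (N-1)/N * 4 = 3.6
print("mean of unbiased estimates :", unbiased_var.mean())  # about 4.0
```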
Sample Mean Squared#
Estimating $\mu^2$ using $\bar{x}^2$ results in a biased estimator.
The bias arises because:

$$E\left[\bar{x}^2\right] = \mu^2 + \frac{\sigma^2}{N}$$

which is greater than $\mu^2$.
While the ML estimation of the mean in a Gaussian distribution with known variance is an ideal estimator, this is not universally the case for all parameters or distributions.
In some cases, ML estimators can be biased or lack efficiency.
Therefore, it’s important to analyze the properties of the ML estimator in each specific context to understand its suitability and potential limitations.