Maximum Likelihood (ML) Estimation#

Bayesian estimation relies on the posterior probability distribution p(α|y); through Bayes’ theorem, it requires knowledge of both:

  • the prior distribution p(α), and

  • the likelihood function p(y|α) .

In contrast, ML estimation requires knowledge of only

  • the likelihood function p(y|α),

without the need for any prior distribution or cost function.

Formally, the ML estimation focuses solely on maximizing the likelihood function p(y|α) with respect to the parameter α.

Mathematically, the ML estimator is given by:

$$\hat{\alpha}_{\mathrm{ML}}(y) = \arg\max_{\alpha} \, p(y \mid \alpha)$$

Determine ML Estimate#

Suppose we have m independent observations y1, y2, …, ym of a random variable y, each dependent on an unknown parameter α. Each observation has a probability density function (pdf) p(yi|α).

Since the observations are independent, the joint pdf of all observations (i.e., the likelihood function) is the product of the individual pdfs:

$$p(y \mid \alpha) = \prod_{i=1}^{m} p(y_i \mid \alpha)$$

The maximum-likelihood estimate α^ML of the parameter α is determined by finding the maximum of this likelihood function based on the observed data y.

Derivation Directly From the Likelihood Function#

The likelihood function, i.e., the joint probability of observing the data y given the parameter α, is given by

$$L(\alpha) = p(y \mid \alpha) = \prod_{i=1}^{m} p(y_i \mid \alpha)$$

Compute the derivative of the likelihood function with respect to α, and then set the derivative equal to zero:

$$\frac{\partial}{\partial \alpha} L(\alpha) = 0$$

Solve the resulting equation for α to find the ML estimate α^ML.

Derivation From the Log-Likelihood Function#

Since the logarithm is a monotonic function, the derivation is often simpler if we use the log-likelihood function.

Specifically, take the natural logarithm of the likelihood function to obtain the log-likelihood function

$$\ln L(\alpha) = \ln p(y \mid \alpha) = \sum_{i=1}^{m} \ln p(y_i \mid \alpha)$$

Note that the logarithm turns the product into a sum, which is easier to differentiate.

Compute the derivative of the log-likelihood with respect to α, and set the derivative equal to zero:

$$\frac{\partial}{\partial \alpha} \ln L(\alpha) = \sum_{i=1}^{m} \frac{\partial}{\partial \alpha} \ln p(y_i \mid \alpha) = 0$$

Solve the resulting equation for α to find the maximum likelihood estimate α^ML.

Note that this method only works if the (log)-likelihood function is differentiable with respect to α.

Example with Normally Distributed Observations#

Assume each observation yi is normally distributed with unknown mean α and known variance σ2:

$$y_i \sim \mathcal{N}(\alpha, \sigma^2)$$

Each individual PDF is given by

$$p(y_i \mid \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \alpha)^2}{2\sigma^2}\right)$$

Method 1: Compute directly from the likelihood function

Thus, the likelihood function is obtained as

$$L(\alpha) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \alpha)^2}{2\sigma^2}\right)$$

Notice that each term in the product is an exponential function. Thus, the entire product can be rewritten as:

$$L(\alpha) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{m} \exp\!\left(-\sum_{i=1}^{m} \frac{(y_i - \alpha)^2}{2\sigma^2}\right)$$

Differentiate L(α) with respect to α:

$$\frac{dL(\alpha)}{d\alpha} = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{m} \exp\!\left(-\sum_{i=1}^{m} \frac{(y_i - \alpha)^2}{2\sigma^2}\right) \cdot \frac{d}{d\alpha}\!\left[-\sum_{i=1}^{m} \frac{(y_i - \alpha)^2}{2\sigma^2}\right]$$

Simplifying the derivative inside:

$$\frac{d}{d\alpha}\!\left[-\sum_{i=1}^{m} \frac{(y_i - \alpha)^2}{2\sigma^2}\right] = \sum_{i=1}^{m} \frac{(y_i - \alpha)}{\sigma^2}$$

Therefore:

$$\frac{dL(\alpha)}{d\alpha} = L(\alpha) \left(\frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \alpha)\right)$$

To find the maximum likelihood estimate (MLE), set $\frac{dL(\alpha)}{d\alpha} = 0$ and let $\hat{\alpha}$ denote a root:

$$L(\hat{\alpha}) \left(\frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \hat{\alpha})\right) = 0$$

Since $L(\hat{\alpha})$ is always positive, we can divide both sides by it, leading to:

$$\sum_{i=1}^{m} (y_i - \hat{\alpha}) = 0 \quad\Longrightarrow\quad \hat{\alpha} = \frac{1}{m} \sum_{i=1}^{m} y_i$$

Method 2: Compute using the log-likelihood function

The log-likelihood function is

$$\ln L(\alpha) = \sum_{i=1}^{m} \left[-\frac{1}{2} \ln(2\pi\sigma^2) - \frac{(y_i - \alpha)^2}{2\sigma^2}\right]$$

Simplifying the sum, we have

$$\ln L(\alpha) = -\frac{m}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \alpha)^2$$

Next, we differentiate the log-likelihood function to find the value of α that maximizes it.

Specifically, we compute the derivative with respect to α:

$$\frac{\partial}{\partial \alpha} \ln L(\alpha) = -\frac{1}{2\sigma^2} \cdot 2 \sum_{i=1}^{m} (y_i - \alpha) \cdot (-1)$$

Simplify:

$$\frac{\partial}{\partial \alpha} \ln L(\alpha) = \frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \alpha)$$

Set the derivative equal to zero, assuming α^ is the root:

$$\frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \hat{\alpha}) = 0$$

Multiply both sides by σ2:

$$\sum_{i=1}^{m} (y_i - \hat{\alpha}) = 0$$

Simplify the sum:

$$\sum_{i=1}^{m} y_i - m\hat{\alpha} = 0$$

Solve for α^:

$$\hat{\alpha} = \frac{1}{m} \sum_{i=1}^{m} y_i$$

Using either method, the maximum likelihood estimate of the unknown mean α is the sample mean:

$$\hat{\alpha}_{\mathrm{ML}} = \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i$$
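
As a quick numerical sketch (added for illustration, not part of the derivation; the sample size, mean, and variance below are arbitrary choices), we can check that the log-likelihood of synthetic Gaussian data, evaluated over a grid of candidate values of α, peaks at the sample mean:

```python
import numpy as np

# Numerical check: for Gaussian data with known variance, the log-likelihood
# evaluated on a grid of candidate alpha values should peak (up to grid
# resolution) at the sample mean, the closed-form ML estimate.
rng = np.random.default_rng(0)
m, alpha_true, sigma = 50, 2.0, 1.5          # arbitrary illustrative values
y = rng.normal(alpha_true, sigma, size=m)

def log_likelihood(alpha, y, sigma):
    """ln L(alpha) = -m/2 * ln(2*pi*sigma^2) - sum_i (y_i - alpha)^2 / (2*sigma^2)."""
    return (-0.5 * len(y) * np.log(2 * np.pi * sigma**2)
            - np.sum((y - alpha) ** 2) / (2 * sigma**2))

alphas = np.linspace(y.min(), y.max(), 10_001)
ll = np.array([log_likelihood(a, y, sigma) for a in alphas])

print(f"grid argmax  ≈ {alphas[np.argmax(ll)]:.4f}")
print(f"sample mean  = {y.mean():.4f}")
```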

Discussion: The Use of Logarithms

  • The natural logarithm is a monotonic function, meaning it preserves the order of the likelihood values.

  • Maximizing the log-likelihood is equivalent to maximizing the likelihood itself.

  • Using the log-likelihood simplifies calculations, especially when dealing with products of probabilities, as it converts them into sums.

    For example, in the general case, when we attempt to differentiate a product of multiple functions with respect to α, we must apply the product rule from calculus.

    Specifically, for a product of two functions, f(α) and g(α), the derivative is:

    $$\frac{d}{d\alpha}\left[f(\alpha)g(\alpha)\right] = f'(\alpha)g(\alpha) + f(\alpha)g'(\alpha)$$

    Extending this to m terms becomes increasingly complex, as the product rule must be applied iteratively.

    The derivative of a product of m factors is a sum of m terms, each itself a product of m factors, making the calculation cumbersome and error-prone for large m.
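
Beyond easing differentiation, the logarithm also matters numerically. The short sketch below (an added illustration with arbitrary values) shows that the raw likelihood (a product of many densities, each typically less than one) underflows to zero in double precision, while the log-likelihood, being a sum, remains finite:

```python
import numpy as np
from scipy.stats import norm

# Illustration: the likelihood of m observations is a product of m densities
# and underflows for moderately large m, while the log-likelihood (a sum of
# m log-densities) remains representable in double precision.
rng = np.random.default_rng(1)
m, alpha, sigma = 2000, 0.0, 1.0             # arbitrary illustrative values
y = rng.normal(alpha, sigma, size=m)

likelihood = np.prod(norm.pdf(y, loc=alpha, scale=sigma))
log_likelihood = np.sum(norm.logpdf(y, loc=alpha, scale=sigma))

print(likelihood)        # 0.0 due to numerical underflow
print(log_likelihood)    # a finite (large negative) number
```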

Discussion: Applicability

  • This method is applicable when the derivatives of the likelihood or log-likelihood functions exist.

  • In cases where the likelihood function is complex or does not have a closed-form solution, numerical optimization techniques may be employed to find the maximum likelihood estimate.
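
As a minimal sketch of this numerical route (added here for illustration; the Gaussian-mean problem is used only because its closed-form answer lets us verify the result), we can minimize the negative log-likelihood with a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerical ML: minimize the negative log-likelihood with a generic optimizer,
# as one would when no closed-form maximizer is available. Constant terms that
# do not depend on alpha are dropped since they do not affect the argmax.
rng = np.random.default_rng(2)
m, alpha_true, sigma = 100, -1.0, 2.0        # arbitrary illustrative values
y = rng.normal(alpha_true, sigma, size=m)

def neg_log_likelihood(alpha):
    return np.sum((y - alpha) ** 2) / (2 * sigma**2)

res = minimize_scalar(neg_log_likelihood)
print(f"numerical ML estimate:      {res.x:.4f}")
print(f"closed-form (sample mean):  {y.mean():.4f}")
```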

The ML estimation provides a practical and efficient method for parameter estimation

  • based solely on observed data

  • without requiring prior distributions or cost functions.

Bayes vs. ML Estimation#

  • Bayesian estimation can yield more accurate estimates when the prior distribution and cost functions are correctly specified.

    • However, it may lead to incorrect inferences if these are inaccurately modeled.

  • On the other hand, ML estimation is often favored for its simplicity and because it depends solely on the observed data, making it easier to compute.

    • However, it is sometimes criticized for ignoring prior information about the parameter (in cases where such prior information is available).

MAP vs. ML Estimation#

To intuitively understand the difference between ML estimation and MAP estimation, let’s start with Bayes’ theorem, which relates the posterior probability distribution p(α|y) to the likelihood p(y|α) and the prior p(α):

$$p(\alpha \mid y) = \frac{p(y \mid \alpha)\, p(\alpha)}{p(y)}$$

recall that:

  • p(α|y) is the posterior probability of the parameter α given the observed data y.

  • p(y|α) is the likelihood of observing the data y given the parameter α.

  • p(α) is the prior probability of the parameter α.

  • p(y) is the marginal likelihood or evidence, calculated as $p(y) = \int p(y \mid \alpha)\, p(\alpha)\, d\alpha$.

MAP Estimation#

In MAP estimation, we aim to find the value of α that maximizes the posterior distribution p(α|y).

This involves solving:

$$\hat{\alpha}_{\mathrm{MAP}} = \arg\max_{\alpha} \, p(\alpha \mid y)$$

To find this maximum, we can take the derivative of the logarithm of the posterior distribution with respect to α and set it to zero:

$$\frac{\partial}{\partial \alpha} \ln p(\alpha \mid y) = 0$$

Using Bayes’ theorem, the logarithm of the posterior becomes:

$$\ln p(\alpha \mid y) = \ln p(y \mid \alpha) + \ln p(\alpha) - \ln p(y)$$

Taking the derivative with respect to α:

$$\frac{\partial}{\partial \alpha} \ln p(\alpha \mid y) = \frac{\partial}{\partial \alpha} \ln p(y \mid \alpha) + \frac{\partial}{\partial \alpha} \ln p(\alpha) - \frac{\partial}{\partial \alpha} \ln p(y)$$

We can observe that the term lnp(y) is a constant with respect to α because it does not depend on α.

Therefore, its derivative is zero:

$$\frac{\partial}{\partial \alpha} \ln p(y) = 0$$

Simplifying, we get:

$$\frac{\partial}{\partial \alpha} \ln p(\alpha \mid y) = \frac{\partial}{\partial \alpha} \ln p(y \mid \alpha) + \frac{\partial}{\partial \alpha} \ln p(\alpha)$$

ML Estimation as a Special Case of MAP Estimation#

In ML estimation, we seek the value of α that maximizes the likelihood function p(y|α), disregarding the prior p(α):

$$\hat{\alpha}_{\mathrm{ML}} = \arg\max_{\alpha} \, p(y \mid \alpha)$$

This involves solving:

$$\frac{\partial}{\partial \alpha} \ln p(y \mid \alpha) = 0$$

Connection Between MAP and ML#

If the prior distribution p(α) is broad and relatively flat, i.e., has a wide dispersion (e.g., a uniform distribution over a large interval), then ln p(α) is approximately constant with respect to α.

Consequently, its derivative is nearly zero:

$$\frac{\partial}{\partial \alpha} \ln p(\alpha) \approx 0$$

Therefore, under a non-informative prior, the MAP estimation equation simplifies to:

$$\frac{\partial}{\partial \alpha} \ln p(\alpha \mid y) \approx \frac{\partial}{\partial \alpha} \ln p(y \mid \alpha)$$

This shows that the MAP estimate α^MAP approaches the ML estimate α^ML when the prior is non-informative.

We can say that:

  • ML estimation relies solely on the likelihood function p(y|α) and does not require prior knowledge about α.

  • MAP estimation incorporates prior information through p(α), potentially improving the estimate if the prior is accurate.

  • When the prior is non-informative: the influence of the prior diminishes, and the ML and MAP estimates converge.

Therefore:

  • Advantage of ML Estimation:

    • ML estimation is valuable when no prior information is available

    • or when we prefer an estimate based purely on observed data.

  • Limitation of ML Estimation: Without incorporating prior knowledge (when it is available), the ML estimate may be less accurate than the MAP estimate if relevant prior information exists.
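
To make this relationship concrete, here is a small sketch (an added illustration that assumes a Gaussian prior α ~ N(μ0, τ²), for which the MAP estimate has a well-known closed form as a precision-weighted average): as the prior variance τ² grows, i.e., as the prior becomes non-informative, the MAP estimate converges to the ML estimate (the sample mean).

```python
import numpy as np

# MAP vs. ML for a Gaussian likelihood with known variance sigma^2 and an
# assumed Gaussian prior alpha ~ N(mu0, tau2). For this conjugate pair the MAP
# estimate is a precision-weighted combination of the prior mean and the
# sample mean; as tau2 grows, the prior washes out and MAP approaches ML.
rng = np.random.default_rng(3)
m, alpha_true, sigma = 20, 4.0, 1.0          # arbitrary illustrative values
y = rng.normal(alpha_true, sigma, size=m)

ml = y.mean()        # ML estimate (sample mean)
mu0 = 0.0            # prior mean, deliberately far from the true value

for tau2 in [0.1, 1.0, 10.0, 1e3, 1e6]:
    map_est = (m / sigma**2 * ml + mu0 / tau2) / (m / sigma**2 + 1 / tau2)
    print(f"prior variance {tau2:>10.1f}:  MAP = {map_est:.4f}   (ML = {ml:.4f})")
```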

Properties of ML Estimation#

ML estimators have several important properties under regularity conditions:

Regularity Conditions#

For the following properties to hold, certain regularity conditions must be satisfied:

  • The likelihood function p(y|α) must be sufficiently smooth (differentiable) with respect to α.

  • The parameter space should be open, and the true parameter α should be an interior point.

  • The Fisher information F(α) must exist and be finite.

  • The model should not have singularities or discontinuities in the parameter space.

Efficiency#

  • Recall that an estimator is efficient if it achieves the lowest possible variance, i.e., minimum-variance, among all unbiased estimators for a parameter α.

  • Cramér-Rao Lower Bound (CRLB): The variance of any unbiased estimator is bounded below by the CRLB:

    $$\mathrm{Var}(\hat{\alpha}) \geq \frac{1}{F(\alpha)}$$

    where F(α) is the Fisher information defined as:

    $$F(\alpha) = -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial \alpha^2} \ln p(y \mid \alpha)\right]$$
  • If an efficient estimator exists, it is the ML estimator (i.e., the ML procedure will produce it).
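
As a worked instance (filling in the computation for the Gaussian-mean example used throughout this section), the Fisher information of the unknown mean follows directly from the derivative computed earlier:

$$F(\alpha) = -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial \alpha^2} \ln p(y \mid \alpha)\right] = -\,\mathbb{E}\!\left[\frac{\partial}{\partial \alpha}\left(\frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \alpha)\right)\right] = \frac{m}{\sigma^2}$$

so the CRLB is $\mathrm{Var}(\hat{\alpha}) \geq \sigma^2/m$, which the sample mean attains (see the detailed example below).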

Invariance under Transformations#

Recall that an estimator is invariant under a transformation g(α) if the estimator of g(α) is obtained by applying g to the estimator of α

  • Parameter α

  • transformation g(α)

  • estimate of the parameter α^ML

  • estimate of the transformation g^ML

We have that, if g(α) is an invertible function, then

$$\text{ML estimate is invariant:} \quad \hat{g}_{\mathrm{ML}} = g(\hat{\alpha}_{\mathrm{ML}})$$

Proof

Suppose α^ML is the ML estimate of α, and g(α) is an invertible function.

We want to prove that the ML estimate of g(α) is:

$$\hat{g}_{\mathrm{ML}} = \hat{\theta}_{\mathrm{ML}} = g(\hat{\alpha}_{\mathrm{ML}})$$

Let θ=g(α).

The likelihood function for θ is:

$$p(y \mid \theta) = p(y \mid \alpha)$$

where $\alpha = g^{-1}(\theta)$.

Maximizing p(y|θ) with respect to θ is equivalent to maximizing p(y|α) with respect to α.

Therefore, θ^ML=g(α^ML).

Application. This property allows us to directly obtain the ML estimate of any invertible transformation of α by applying the transformation to α^ML.

Generalized Likelihood Ratio Test (GLRT): The invariance property of ML estimators facilitates hypothesis testing using the GLRT, where unknown parameters are replaced by their ML estimates.
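
A small numerical sketch of the invariance property (an added illustration; the transformation g(α) = exp(α) is an arbitrary invertible choice): maximizing the reparameterized likelihood directly over θ = g(α) gives the same answer as transforming the ML estimate of α.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Invariance sketch: for Gaussian data with known sigma, compare
#   (a) g(alpha_hat_ML), i.e. the transformed sample mean, with
#   (b) the numerical maximizer over theta of the reparameterized likelihood
#       p(y | alpha = g^{-1}(theta)), where g(alpha) = exp(alpha).
rng = np.random.default_rng(4)
m, alpha_true, sigma = 200, 1.2, 0.5         # arbitrary illustrative values
y = rng.normal(alpha_true, sigma, size=m)

alpha_ml = y.mean()
g_of_alpha_ml = np.exp(alpha_ml)             # (a) transform the ML estimate

def neg_log_lik_theta(theta):
    alpha = np.log(theta)                    # alpha = g^{-1}(theta)
    return np.sum((y - alpha) ** 2) / (2 * sigma**2)

res = minimize_scalar(neg_log_lik_theta, bounds=(1e-6, 50.0), method="bounded")
print(f"g(alpha_hat_ML)      = {g_of_alpha_ml:.4f}")
print(f"direct theta_hat_ML  = {res.x:.4f}")
```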

Asymptotic Properties#

Under regularity conditions, ML estimators exhibit desirable asymptotic behavior as the sample size m increases, i.e., as m → ∞:

Asymptotic Efficiency#

  • ML estimators are asymptotically efficient, meaning they achieve the CRLB as the sample size m approaches infinity.

Consistency#

  • Recall that an estimator α^ is consistent if it converges in probability to the true parameter value α as m → ∞:

    $$\hat{\alpha} \xrightarrow{p} \alpha$$
  • ML estimators are consistent under certain conditions, meaning they become increasingly accurate as more data are collected.

Asymptotic Normality#

  • We define that an estimator is asymptotically normal if the distribution of √m (α^ − α) converges to a normal distribution as m → ∞:

    $$\sqrt{m}\,(\hat{\alpha} - \alpha) \xrightarrow{d} \mathcal{N}\!\left(0, \frac{1}{F(\alpha)}\right)$$
  • Asymptotic normality of the ML estimator: the distribution of the ML estimator approaches a normal distribution centered at α with variance 1/(mF(α)), where F(α) here denotes the Fisher information of a single observation.
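
The following Monte Carlo sketch (an added illustration using the Gaussian-mean example, for which the per-observation Fisher information is 1/σ², so 1/F(α) = σ²) checks that √m (α^ − α) has approximately zero mean and variance σ²:

```python
import numpy as np

# Monte Carlo sketch of asymptotic normality for the Gaussian-mean example:
# sqrt(m) * (alpha_hat - alpha) should have mean ~ 0 and variance ~ sigma^2,
# since the per-observation Fisher information is F(alpha) = 1 / sigma^2.
rng = np.random.default_rng(5)
m, alpha_true, sigma, n_trials = 500, 3.0, 2.0, 10_000   # arbitrary values

y = rng.normal(alpha_true, sigma, size=(n_trials, m))
alpha_hat = y.mean(axis=1)                   # ML estimate in each trial
z = np.sqrt(m) * (alpha_hat - alpha_true)

print(f"empirical mean of z:     {z.mean():+.4f}   (theory: 0)")
print(f"empirical variance of z: {z.var():.4f}   (theory: sigma^2 = {sigma**2:.1f})")
```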

Best Asymptotically Normal (BAN)#

  • We define that an estimator is BAN if it is asymptotically normal with the smallest possible variance among all consistent estimators.

  • ML estimators achieve the lowest asymptotic variance permitted by the CRLB, making them BAN estimators.

Function of Sufficient Statistics#

  • Recall that a statistic is sufficient for a parameter α if it captures all the information about α present in the data y.

  • ML estimators are functions of sufficient statistics.

  • This means that the ML estimate can be calculated from these statistics without any loss of information.

Example: Maximum Likelihood Estimation of the Mean in a Gaussian Distribution#

This example, based on [B2. Ex. 10.14], repeats the above example in greater detail (and different notation).

It also continues from a previous example, C3.7, to compare with Bayes and MAP estimators.

Let’s consider the problem of estimating the true mean μ of a Gaussian (normal) distribution with a known variance σ2, based on m i.i.d. observations y1, y2, …, ym.

Each observation yi follows the probability density function (pdf):

$$p(y_i \mid \mu) = f_{y_i}(y_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right), \quad \text{for } i = 1, 2, \ldots, m$$

Formulate the Likelihood Function#

Since the observations are independent, the joint pdf (likelihood function) of all observations given μ is the product of the individual pdfs:

$$p(y \mid \mu) = \prod_{i=1}^{m} p(y_i \mid \mu) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{m} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2\right)$$

Simplify the expression:

$$p(y \mid \mu) = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2\right)$$

Compute the Log-Likelihood Function#

Taking the natural logarithm of the likelihood function to simplify differentiation:

$$\ln p(y \mid \mu) = \ln\!\left(\frac{1}{(2\pi\sigma^2)^{m/2}}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2$$

Simplify further:

$$\ln p(y \mid \mu) = -\frac{m}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2$$

Differentiate the Log-Likelihood with Respect to μ#

Compute the derivative:

$$\frac{\partial}{\partial \mu} \ln p(y \mid \mu) = -\frac{1}{2\sigma^2} \frac{\partial}{\partial \mu} \left(\sum_{i=1}^{m} (y_i - \mu)^2\right)$$

Calculate the derivative inside the summation:

$$\frac{\partial}{\partial \mu} (y_i - \mu)^2 = -2(y_i - \mu)$$

Therefore, the derivative becomes:

$$\frac{\partial}{\partial \mu} \ln p(y \mid \mu) = -\frac{1}{2\sigma^2} \left(-2 \sum_{i=1}^{m} (y_i - \mu)\right) = \frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \mu)$$

Set the Derivative to Zero to Find the Maximum#

Set the derivative equal to zero to find the maximum likelihood estimate μ^ML.

Since the estimate μ^ is a root of this equation, we can rewrite it as:

$$\left.\frac{\partial}{\partial \mu} \ln p(y \mid \mu)\right|_{\mu = \hat{\mu}} = \frac{1}{\sigma^2} \sum_{i=1}^{m} (y_i - \hat{\mu}) = 0$$

Multiply both sides by σ2:

$$\sum_{i=1}^{m} (y_i - \hat{\mu}) = 0$$

Simplify the summation:

$$\sum_{i=1}^{m} y_i - m\hat{\mu} = 0$$

Solve for μ^:

$$m\hat{\mu} = \sum_{i=1}^{m} y_i \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{m} \sum_{i=1}^{m} y_i$$

The maximum likelihood estimate of μ is the sample mean μ^ML:

$$\hat{\mu}_{\mathrm{ML}} = \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i$$

Examine the Properties of the Sample Mean as an Estimator#

The sample mean μ^ML possesses several desirable properties (as discussed above for a general ML estimator):

Unbiasedness#

$$\mathbb{E}[\hat{\mu}_{\mathrm{ML}}] = \mu$$

The expected value of the sample mean equals the true mean, indicating that it is an unbiased estimator.
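
Indeed, by linearity of expectation (a one-line check added here for completeness):

$$\mathbb{E}[\hat{\mu}_{\mathrm{ML}}] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}[y_i] = \frac{1}{m} \cdot m\mu = \mu$$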

Consistency#

$$\hat{\mu}_{\mathrm{ML}} \xrightarrow{p} \mu \quad \text{as } m \to \infty$$

The estimator converges in probability to the true mean as the sample size increases.

Efficiency#

  • The sample mean achieves the minimum possible variance among all unbiased estimators of μ.

  • It attains the CRLB

    $$\mathrm{Var}(\hat{\mu}_{\mathrm{ML}}) = \frac{\sigma^2}{m}$$

Best Asymptotically Normal (BAN)#

  • As m → ∞, the distribution of μ^ML approaches a normal distribution:

    $$\sqrt{m}\,(\hat{\mu}_{\mathrm{ML}} - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$
  • It is the best estimator in terms of asymptotic normality.

Sufficiency#

  • The sample mean is a sufficient statistic for μ.

Indeed, we have the factorization as

$$p(y \mid \mu) = \underbrace{\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{m} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{m} y_i^2\right)}_{h(y)} \cdot \underbrace{\exp\!\left(\frac{m\mu\bar{y} - \frac{m\mu^2}{2}}{\sigma^2}\right)}_{g(\bar{y},\,\mu)}$$
  • h(y): This part of the factorization depends only on the data y and not on the parameter μ.

  • g(y¯,μ): This part depends on the data only through the sample mean y¯ and the parameter μ.

Thus, according to the Fisher–Neyman factorization theorem, since the joint pdf can be expressed as a product of a function that depends on the data only through y¯ and another function that does not depend on μ, the sample mean y¯ is a sufficient statistic for μ.

Furthermore, the sufficiency of y¯ simplifies analysis and computation because it reduces the data y to a single summary statistic y¯ without losing information about μ.

Generalization: The ML Estimator Is Not Always Ideal#

In this scenario, the maximum likelihood estimator coincides with the sample mean, which has all the properties of an excellent estimator.

However, the favorable properties observed here do not always hold for ML estimators in other contexts.

In some situations, the ML estimator may be biased.

Sample Variance#

Estimating the variance σ2 when μ is unknown.

  • The ML estimate of σ2 is:

    $$\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{\mu}_{\mathrm{ML}})^2$$
  • This estimator is biased because:

    $$\mathbb{E}[\hat{\sigma}^2_{\mathrm{ML}}] = \left(\frac{m-1}{m}\right) \sigma^2 < \sigma^2$$
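
A quick Monte Carlo sketch of this bias (an added illustration with arbitrary values): averaging the ML variance estimate over many synthetic datasets gives approximately ((m - 1)/m) σ2, not σ2.

```python
import numpy as np

# Monte Carlo sketch of the bias of the ML variance estimator: when the mean
# is also estimated from the data, the average of (1/m) * sum (y_i - ybar)^2
# over many trials is close to ((m - 1) / m) * sigma^2, not sigma^2.
rng = np.random.default_rng(6)
m, mu, sigma2, n_trials = 10, 0.0, 4.0, 100_000          # arbitrary values

y = rng.normal(mu, np.sqrt(sigma2), size=(n_trials, m))
sigma2_ml = y.var(axis=1, ddof=0)        # ML estimator: divide by m
sigma2_unbiased = y.var(axis=1, ddof=1)  # bias-corrected: divide by m - 1

print(f"mean of ML estimates:        {sigma2_ml.mean():.4f}  (theory: {(m - 1) / m * sigma2:.4f})")
print(f"mean of corrected estimates: {sigma2_unbiased.mean():.4f}  (theory: {sigma2:.4f})")
```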

Sample Mean Squared#

  • Estimating μ2 using (μ^ML)2 results in a biased estimator.

  • The bias arises because:

    $$\mathbb{E}[(\hat{\mu}_{\mathrm{ML}})^2] = \mu^2 + \frac{\sigma^2}{m}$$

    which is greater than μ2.
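
    This follows from the variance decomposition (a one-line derivation added for completeness): since $\mathbb{E}[\hat{\mu}_{\mathrm{ML}}] = \mu$ and $\mathrm{Var}(\hat{\mu}_{\mathrm{ML}}) = \sigma^2/m$,

    $$\mathbb{E}[(\hat{\mu}_{\mathrm{ML}})^2] = \mathrm{Var}(\hat{\mu}_{\mathrm{ML}}) + \left(\mathbb{E}[\hat{\mu}_{\mathrm{ML}}]\right)^2 = \frac{\sigma^2}{m} + \mu^2$$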

While the ML estimator of the mean in a Gaussian distribution with known variance is an ideal estimator, this is not universally the case for all parameters or distributions.

In some cases, ML estimators can be biased or lack efficiency.

Therefore, it is important to analyze the properties of the ML estimator in each specific context to understand its suitability and potential limitations.