Estimators Based on Sufficient Statistics#
Statistic#
Let \( T(\vec{\mathbf{y}}) \) be a statistic, which is a function of the observed random variables \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \) generated by a probability density function (PDF) \( p(y_1, y_2, \ldots, y_m; \alpha) \) where \(\alpha\) is an unknown parameter.
Note that when random samples depend on \(\alpha\), a statistic derived from these samples can be used to infer \(\alpha\).
A useful statistic should summarize all the information from the measurements efficiently, often by reducing multiple random variables to a single manageable form.
Such a statistic can serve as a good estimator of \(\alpha\).
For instance:
The estimator \(\hat{\boldsymbol{\alpha}} = \mathbf{y}_1\) does not capture all the information about \(\alpha\).
However, the sample mean-based estimator \(\hat{\boldsymbol{\alpha}} = T(\vec{\mathbf{y}}) \triangleq \bar{\mathbf{y}}\), derived from independent, identically distributed normal random variables, retains all relevant information about \(\alpha\) and is considered a sufficient statistic.
This implies that \( T(\vec{\mathbf{y}}) = \bar{\mathbf{y}} \) encapsulates all the information about \(\boldsymbol{\alpha}\) available in the sample.
Definition of Sufficient Statistic#
In the context of statistical estimation, a sufficient statistic is a function of the data that captures all the information necessary to estimate a parameter.
Mathematically speaking, a statistic \( T(\vec{\mathbf{y}}) \) is considered sufficient for \(\alpha\) if the conditional pdf of the data given \( T(\vec{\mathbf{y}})\), i.e., \(p(y_1, y_2, \ldots, y_m \mid T(\vec{y}))\), does not depend on \(\alpha\).
Formally, if you have a set of observations \(\vec{\mathbf{y}} = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m)\) and a parameter of interest \(\boldsymbol{\alpha}\), then a statistic \(T(\vec{\mathbf{y}})\) is said to be sufficient for \(\boldsymbol{\alpha}\) if the conditional distribution of the data \(\vec{\mathbf{y}}\), given the statistic \(T(\vec{\mathbf{y}})\) and the parameter \(\boldsymbol{\alpha}\), is independent of \(\boldsymbol{\alpha}\), i.e.:
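$$
p(\vec{\mathbf{y}} \mid T(\vec{\mathbf{y}}); \boldsymbol{\alpha}) = p(\vec{\mathbf{y}} \mid T(\vec{\mathbf{y}}))
$$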
This means that once you know \(T(\vec{\mathbf{y}})\), the original data \(\vec{\mathbf{y}}\) provides no additional information about the parameter \(\boldsymbol{\alpha}\). In other words, \(T(\vec{\mathbf{y}})\) captures all the information that \(\vec{\mathbf{y}}\) contains about \(\boldsymbol{\alpha}\).
Determine If a Statistic is Sufficient#
One method to determine whether a statistic \( T(\vec{\mathbf{y}}) \) is sufficient for a parameter \(\alpha\) is the Fisher Factorization Theorem.
According to the theorem, if the probability density function (pdf) \( p(y_1, y_2, \ldots, y_m; \alpha) \) can be factored into two parts:
\( g(T(\vec{y}), \alpha) \): a function that depends on the statistic \( T(\vec{\mathbf{y}}) \) and the parameter \(\alpha\).
\( h(y_1, y_2, \ldots, y_m) \): a function that depends only on the observed data and not on the parameter \(\alpha\).
Then, \( T(\vec{y}) \) is a sufficient statistic for \(\alpha\).
Mathematically, the factorization can be expressed as:
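$$
p(y_1, y_2, \ldots, y_m; \alpha) = g(T(\vec{y}), \alpha) \, h(y_1, y_2, \ldots, y_m)
$$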
Here, \( g(\cdot) \) encapsulates the dependency on the statistic and the parameter, while \( h(\cdot) \) isolates the data dependency, confirming the sufficiency of \( T(\vec{\mathbf{y}}) \).
Therefore, a statistic \( T(\vec{\mathbf{y}}) \) is a sufficient statistic for a parameter \(\alpha\) if it satisfies the Fisher factorization theorem.
The converse is also true. If \( T(\vec{y}) \) is a sufficient statistic for the parameter \(\alpha\), then the probability density function (pdf) \( p(\vec{y}; \alpha) \) can be factored according to the Fisher factorization theorem.
This means that if \( T(\vec{y}) \) is sufficient, the pdf admits such a factorization.
In general, \(T(\vec{\mathbf{y}})\) is a sufficient statistic for \(\boldsymbol{\alpha}\) if and only if the likelihood \(p(\vec{\mathbf{y}}; \boldsymbol{\alpha})\) can be factored into the product of two functions:
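$$
p(\vec{\mathbf{y}}; \boldsymbol{\alpha}) = g(T(\vec{\mathbf{y}}), \boldsymbol{\alpha}) \, h(\vec{\mathbf{y}})
$$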
Here:
\(g(T(\vec{\mathbf{y}}), \boldsymbol{\alpha})\) is a function that depends on the data only through the statistic \(T(\vec{\mathbf{y}})\) and the parameter \(\boldsymbol{\alpha}\),
It encapsulates all the information about the parameter \(\boldsymbol{\alpha}\) through the statistic \(T(\vec{\mathbf{y}})\).
\(h(\vec{\mathbf{y}})\) is a function that depends on the data \(\vec{\mathbf{y}}\) but not on the parameter \(\boldsymbol{\alpha}\).
It does not provide any information about \(\boldsymbol{\alpha}\).
It only contributes to the overall likelihood through the data distribution, not through the parameter estimation.
Example: Sample Mean as a Sufficient Statistic#
Consider the scenario where you have a set of \(m\) independent and identically distributed (i.i.d.) observations \(\vec{\mathbf{y}} = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m)\), where each random variable follows a normal distribution \(\mathcal{N}(\mu, \sigma^2)\), the mean \(\mu\) is the parameter of interest, and \(\sigma^2\) is known.
The likelihood function is:
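$$
p(\vec{y}; \mu) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right)
$$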
This can be rewritten as:
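$$
p(\vec{y}; \mu) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{m} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2 \right)
$$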
The sum of squares is expanded as
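$$
\sum_{i=1}^{m} (y_i - \mu)^2 = \sum_{i=1}^{m} (y_i - \bar{y})^2 + m(\bar{y} - \mu)^2,
$$
where the cross term \( 2(\bar{y} - \mu) \sum_{i=1}^{m} (y_i - \bar{y}) \) vanishes because \( \sum_{i=1}^{m} (y_i - \bar{y}) = 0 \).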
The sample mean is defined as
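$$
\bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i
$$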
Substituting this into the likelihood function, we get:
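$$
p(\vec{y}; \mu) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{m} \exp\left( -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{m} (y_i - \bar{y})^2 + m(\bar{y} - \mu)^2 \right] \right)
$$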
Applying the Fisher Factorization theorem, we get
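$$
p(\vec{y}; \mu) = \underbrace{\exp\left( -\frac{m(\bar{y} - \mu)^2}{2\sigma^2} \right)}_{g(\bar{y},\, \mu)} \cdot \underbrace{\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{m} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \bar{y})^2 \right)}_{h(\vec{y})}
$$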
where
\( g(T(\vec{y}), \alpha) = g(\bar{y}, \mu) \):
Depends on \( \mu \): Through \( \mu \) in \( (\bar{y} - \mu)^2 \)
Depends on \(T(\vec{y}) = \bar{y} \)
\( h(\vec{y}) \):
Depends only on the data \( \vec{y} \): through \( \sum_{i=1}^{m} (y_i - \bar{y})^2 \) and the constant term \( \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^m \) (note that \(\bar{y}\) can be computed from \(\vec{y}\))
Does not depend on \( \mu \)
Thus, this factorization confirms that the sample mean \(\bar{\mathbf{y}}\) is a sufficient statistic for the parameter \(\mu\).
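As a quick numerical sanity check (a minimal NumPy sketch under the same assumptions of i.i.d. normal data with known \(\sigma^2\); the variable names, seed, and values of \(\mu\) below are illustrative), two datasets that share the same sample mean have a likelihood ratio that does not depend on \(\mu\), because the factor \(g(\bar{y}, \mu)\) is common to both and only \(h(\vec{y})\) differs:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0    # known standard deviation
m = 10

# Two different datasets forced to share the same sample mean
y1 = rng.normal(loc=5.0, scale=sigma, size=m)
y2 = rng.normal(loc=5.0, scale=sigma, size=m)
y2 = y2 - y2.mean() + y1.mean()

def gaussian_likelihood(y, mu, sigma):
    """Joint pdf of i.i.d. N(mu, sigma^2) observations y_1, ..., y_m."""
    return np.prod(np.exp(-(y - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2))

# Because both datasets have the same sample mean, the factor g(ybar, mu) is
# identical and the likelihood ratio reduces to h(y1) / h(y2), a constant in mu.
for mu in [3.0, 5.0, 7.0]:
    ratio = gaussian_likelihood(y1, mu, sigma) / gaussian_likelihood(y2, mu, sigma)
    print(f"mu = {mu}: likelihood ratio = {ratio:.6f}")
```

The printed ratio is the same for every value of \(\mu\), which is exactly what the sufficiency of \(\bar{\mathbf{y}}\) predicts.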
Derivation of the likelihood function (joint PDF)#
Note that the likelihood function is the joint PDF of the observations \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \).
Specifically, given that \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \) are independent and identically distributed (i.i.d.) random variables from a normal distribution \( \mathcal{N}(\mu, \sigma^2) \), the joint pdf of these observations can be written as:
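$$
p(y_1, y_2, \ldots, y_m; \mu) = \prod_{i=1}^{m} p(y_i; \mu)
$$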
Since each \( \mathbf{y}_i \) is normally distributed with mean \( \mu \) and variance \( \sigma^2 \), the pdf of each \( y_i \) is:
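$$
p(y_i; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right)
$$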
Thus, the joint pdf is:
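$$
p(y_1, y_2, \ldots, y_m; \mu) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right)
$$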
This can be simplified to:
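$$
p(y_1, y_2, \ldots, y_m; \mu) = \left( \frac{1}{2\pi\sigma^2} \right)^{m/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2 \right)
$$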
This is the joint pdf of the observations \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \), and it is also the likelihood function when considering \( \mu \) as the parameter to be estimated from the data.
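The product form and the simplified closed form are algebraically identical; a small numerical check (a sketch assuming NumPy and SciPy are available, with arbitrary values of \(\mu\), \(\sigma\), and \(m\)) can confirm this:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu, sigma, m = 1.5, 0.8, 8
y = rng.normal(mu, sigma, size=m)

# Joint pdf as the product of the individual N(mu, sigma^2) densities
joint_as_product = np.prod(norm.pdf(y, loc=mu, scale=sigma))

# Simplified closed form: (2*pi*sigma^2)^(-m/2) * exp(-sum((y_i - mu)^2) / (2*sigma^2))
closed_form = (2 * np.pi * sigma**2) ** (-m / 2) * np.exp(-np.sum((y - mu) ** 2) / (2 * sigma**2))

print(joint_as_product, closed_form)
assert np.isclose(joint_as_product, closed_form)
```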
Discussion: Sufficient Statistic \(T(\vec{\mathbf{y}})\) vs. Estimator \(\hat{\boldsymbol{\alpha}}(\vec{\mathbf{y}})\)#
A sufficient statistic is a function of the data that captures all the information available in the data about a particular parameter.
An estimator is a rule or function that provides an estimate of a parameter based on the observed data. It is also typically a statistic (a function of the data) that is used to infer the value of an unknown parameter.
Loosely speaking, a sufficient statistic is a function of the data that can itself serve as an estimator; it often plays a dual role as a summary of the data and as an estimator.
Sufficient Statistic for the Variance#
Consider independent and identically distributed (i.i.d.) Gaussian random variables, where each individual observation \(\mathbf{y}_i\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\) as above.
We know that the sample mean \(\bar{\mathbf{y}}\) is a sufficient statistic for estimating the true mean \(\mu\). Now, what is a sufficient statistic for estimating the true variance \(\sigma^2\)?
A natural candidate is the sample variance \(\mathbf{s}_{\mathbf{y}}^2\), defined as
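$$
\mathbf{s}_{\mathbf{y}}^2 = \frac{1}{m} \sum_{i=1}^{m} \left( \mathbf{y}_i - \bar{\mathbf{y}} \right)^2
$$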
Recall that the likelihood function is given by
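$$
p(\vec{y}; \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{m/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2 \right)
$$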
Given that \( \mu \) is known (or already estimated), our goal is to identify a statistic that encapsulates all information about \( \sigma^2 \) contained in the data \( \vec{y} \).
The realization of the sample variance is
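$$
s_y^2 = \frac{1}{m} \sum_{i=1}^{m} (y_i - \bar{y})^2
$$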
Thus, the sum of squared deviations from the mean is:
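$$
\sum_{i=1}^{m} (y_i - \bar{y})^2 = m \, s_y^2
$$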
Substituting \(\bar{y}\) in place of \(\mu\) (its estimate) and using the expression above for the sum of squared deviations, the likelihood function becomes:
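$$
p(\vec{y}; \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{m/2} \exp\left( -\frac{m \, s_y^2}{2\sigma^2} \right)
$$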
Applying the Fisher Factorization theorem, we obtain
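$$
p(\vec{y}; \sigma^2) = \underbrace{\left( \frac{1}{2\pi\sigma^2} \right)^{m/2} \exp\left( -\frac{m \, s_y^2}{2\sigma^2} \right)}_{g(s_y^2,\, \sigma^2)} \cdot \underbrace{1}_{h(\vec{y})}
$$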
where
\( g(T(\vec{y}), \alpha) = g(s_y^2, \sigma^2) \):
Depends on \( \sigma^2 \): Through the exponential term and the coefficient.
Depends on the data only through \( s_y^2 \): Since \( s_y^2 \) is a function of \( \vec{y} \).
\( h(\vec{y}) = 1 \):
Independence from \( \sigma^2 \): This function does not involve \( \sigma^2 \).
In this case, \( h(\vec{y}) \) is simply 1, meaning it does not contribute any additional information about \( \sigma^2 \).
Thus, the sample variance \( s_{\mathbf{y}}^2 \) is a sufficient statistic for the variance \(\sigma^2\).
In summary:
The sample mean \(\bar{y}\) is a sufficient statistic for the mean \(\mu\) of the distribution.
This means it contains all the necessary information about \(\mu\) present in the data.
Moreover, \(\bar{y}\) is an unbiased estimator of the true mean \(\mu\), meaning that its expected value equals \(\mu\).
The sample variance \( s_{\mathbf{y}}^2 \), on the other hand, is also a sufficient statistic for the variance \(\sigma^2\).
It captures all the relevant information about the variance from the data.
However, when used as an estimator of the true variance \(\sigma^2\), \( s_{\mathbf{y}}^2 \) is a biased estimator.
Specifically, the expectation of \( s_{\mathbf{y}}^2 \) is \(\frac{m-1}{m}\sigma^2\), which is slightly lower than the true variance \(\sigma^2\).
To correct this bias, the unbiased estimator of the variance is \(\frac{m}{m-1} s_{\mathbf{y}}^2\), often denoted as \( s^2_{\text{unbiased}} \).
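A short Monte Carlo sketch (assuming NumPy; the parameter values, sample size, and number of trials below are arbitrary) illustrates the bias of \( s_{\mathbf{y}}^2 \) and the effect of the \(\frac{m}{m-1}\) correction:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma2, m = 0.0, 4.0, 5          # true mean, true variance, sample size
n_trials = 200_000

# n_trials independent samples of size m from N(mu, sigma2)
samples = rng.normal(mu, np.sqrt(sigma2), size=(n_trials, m))

# Biased sample variance s_y^2 (divide by m) vs. Bessel-corrected version (divide by m - 1)
s2_biased = samples.var(axis=1, ddof=0)
s2_unbiased = samples.var(axis=1, ddof=1)

print("Monte Carlo E[s_y^2]:        ", s2_biased.mean())     # close to (m-1)/m * sigma2 = 3.2
print("(m-1)/m * sigma^2:           ", (m - 1) / m * sigma2)
print("Monte Carlo E[s^2_unbiased]: ", s2_unbiased.mean())   # close to sigma2 = 4.0
```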