Estimators Based on Sufficient Statistics#

Statistic#

Let \( T(\vec{\mathbf{y}}) \) be a statistic, i.e., a function of the observed random variables \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \), which are generated according to a probability density function (PDF) \( p(y_1, y_2, \ldots, y_m; \alpha) \), where \(\alpha\) is an unknown parameter.

Note that

  • When random samples depend on \(\alpha\), the statistic derived from these samples is used to infer \(\alpha\).

  • A useful statistic should summarize all the information from the measurements efficiently, often by reducing multiple random variables to a single manageable form.

  • Such a statistic can serve as a good estimator of \(\alpha\).

For instance:

  • The estimator \(\hat{\boldsymbol{\alpha}} = \mathbf{y}_1\) does not capture all the information about \(\alpha\).

  • However, the sample mean-based estimator \(\hat{\boldsymbol{\alpha}} = T(\vec{\mathbf{y}}) \triangleq \bar{\mathbf{y}}\), derived from independent, identically distributed normal random variables, retains all relevant information about \(\alpha\) and is a sufficient statistic (a small numerical comparison of these two estimators follows this list).

  • This implies that \( T(\vec{\mathbf{y}}) = \bar{\mathbf{y}} \) encapsulates all the information about \(\boldsymbol{\alpha}\) available in the sample.
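As a quick numerical illustration of the contrast between these two estimators (a minimal sketch, assuming i.i.d. \(\mathcal{N}(\mu, \sigma^2)\) samples with hypothetical values \(\mu = 2\), \(\sigma = 1\), and \(m = 20\)), a Monte Carlo comparison of their mean squared errors might look as follows:

```python
import numpy as np

# Hypothetical setup: i.i.d. normal samples with assumed mu = 2, sigma = 1, m = 20.
rng = np.random.default_rng(0)
mu, sigma, m, trials = 2.0, 1.0, 20, 100_000

# Draw `trials` independent datasets, each containing m observations.
y = rng.normal(mu, sigma, size=(trials, m))

alpha_hat_single = y[:, 0]       # estimator that uses only the first observation
alpha_hat_mean = y.mean(axis=1)  # estimator based on the sample mean (sufficient statistic)

# Mean squared error of each estimator around the true mean.
print("MSE of y_1:  ", np.mean((alpha_hat_single - mu) ** 2))  # close to sigma^2 = 1
print("MSE of y-bar:", np.mean((alpha_hat_mean - mu) ** 2))    # close to sigma^2 / m = 0.05
```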

Definition of Sufficient Statistic#

In the context of statistical estimation, a sufficient statistic is a function of the data that captures all the information necessary to estimate a parameter.

Mathematically speaking, a statistic \( T(\vec{\mathbf{y}}) \) is sufficient for \(\alpha\) if the conditional pdf of the data given \( T(\vec{\mathbf{y}}) \), i.e., \(p(y_1, y_2, \ldots, y_m \mid T(\vec{y}))\), does not depend on \(\alpha\).

Formally, if you have a set of observations \(\vec{\mathbf{y}} = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m)\) and a parameter of interest \(\boldsymbol{\alpha}\), then a statistic \(T(\vec{\mathbf{y}})\) is said to be sufficient for \(\boldsymbol{\alpha}\) if the conditional distribution of the data \(\vec{\mathbf{y}}\), given the statistic \(T(\vec{\mathbf{y}})\) and the parameter \(\boldsymbol{\alpha}\), is independent of \(\boldsymbol{\alpha}\), i.e.:

\[ \boxed{ P(\vec{\mathbf{y}}|T(\vec{\mathbf{y}}),\boldsymbol{\alpha}) = P(\vec{\mathbf{y}}|T(\vec{\mathbf{y}})) } \]

This means that once you know \(T(\vec{\mathbf{y}})\), the original data \(\vec{\mathbf{y}}\) provides no additional information about the parameter \(\boldsymbol{\alpha}\). In other words, \(T(\vec{\mathbf{y}})\) captures all the information that \(\vec{\mathbf{y}}\) contains about \(\boldsymbol{\alpha}\).
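To make this concrete, here is a small exact check using i.i.d. Bernoulli trials with \(T(\vec{y}) = \sum_i y_i\) (an illustration chosen here because the conditional probabilities can be computed in closed form; it is not part of the running Gaussian example): the probability of any particular sequence given its sum equals \(1/\binom{m}{T}\), whatever the success probability is.

```python
from math import comb

def conditional_prob(y, p):
    """P(y | T(y) = sum(y); p) for i.i.d. Bernoulli(p) observations.

    P(y; p) = p^k (1-p)^(m-k) for a sequence with k ones, and
    P(T = k; p) = C(m, k) p^k (1-p)^(m-k), so the ratio is 1 / C(m, k),
    which does not depend on p.
    """
    m, k = len(y), sum(y)
    joint = p**k * (1 - p) ** (m - k)   # probability of this exact sequence
    marginal = comb(m, k) * joint       # probability that the statistic equals k
    return joint / marginal

y = [1, 0, 1, 1, 0]
for p in (0.2, 0.5, 0.9):
    print(p, conditional_prob(y, p))    # always 1 / C(5, 3) = 0.1, independent of p
```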

Determine If a Statistic is Sufficient#

One method to determine whether a statistic \( T(\vec{\mathbf{y}}) \) is sufficient for a parameter \(\alpha\) is the Fisher Factorization Theorem.

According to the theorem, if the probability density function (pdf) \( p(y_1, y_2, \ldots, y_m; \alpha) \) can be factored into two parts:

  • \( g(T(\vec{\mathbf{y}}), \alpha) \): a function that depends on the data only through the statistic \( T(\vec{\mathbf{y}}) \) and on the parameter \(\alpha\).

  • \( h(y_1, y_2, \ldots, y_m) \): a function that depends only on the observed data and not on the parameter \(\alpha\).

Then, \( T(\vec{\mathbf{y}}) \) is a sufficient statistic for \(\alpha\).

Mathematically, the factorization can be expressed as:

\[ \boxed{ p(y_1, \ldots, y_m; \alpha) = g(T(y_1, \ldots, y_m), \alpha) h(y_1, \ldots, y_m) } \]

Here, \( g(\cdot) \) encapsulates the dependency on the statistic and the parameter, while \( h(\cdot) \) isolates the data dependency, confirming the sufficiency of \( T(\vec{\mathbf{y}}) \).

Therefore, a statistic \( T(\vec{\mathbf{y}}) \) is a sufficient statistic for a parameter \(\alpha\) if it satisfies the Fisher factorization theorem.

The converse is also true. If \( T(\vec{y}) \) is a sufficient statistic for the parameter \(\alpha\), then the probability density function (pdf) \( p(\vec{y}; \alpha) \) can be factored according to the Fisher factorization theorem.

This means that if \( T(\vec{y}) \) is sufficient, the pdf can be expressed as:

\[ p(\vec{y}; \alpha) = g(T(\vec{y}), \alpha) \, h(\vec{y}) \]

In general, \(T(\vec{\mathbf{y}})\) is a sufficient statistic for \(\boldsymbol{\alpha}\) if and only if the probability distribution (or likelihood) \(p(\vec{\mathbf{y}}|\boldsymbol{\alpha})\) can be factored into the product of two functions:

\[ \boxed{ p(\vec{\mathbf{y}}|\boldsymbol{\alpha}) = g(T(\vec{\mathbf{y}}), \boldsymbol{\alpha}) h(\vec{\mathbf{y}}) } \]

Here:

  • \(g(T(\vec{\mathbf{y}}), \boldsymbol{\alpha})\) is a function that depends on the data only through the statistic \(T(\vec{\mathbf{y}})\) and the parameter \(\boldsymbol{\alpha}\),

    • It encapsulates all the information about the parameter \(\boldsymbol{\alpha}\) through the statistic \(T(\vec{\mathbf{y}})\).

  • \(h(\vec{\mathbf{y}})\) is a function that depends on the data \(\vec{\mathbf{y}}\) but not on the parameter \(\boldsymbol{\alpha}\).

    • It does not provide any information about \(\boldsymbol{\alpha}\).

    • It only contributes to the overall likelihood through the data distribution, not through the parameter estimation.
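In practice, a proposed factorization can be sanity-checked numerically: evaluate the likelihood directly and compare it with \(g(T(\vec{y}), \alpha)\,h(\vec{y})\) over randomly drawn data and parameter values. The helper below is only a sketch of this idea; the callables `likelihood`, `g`, `h`, and `T` are placeholders to be supplied by the user (the Gaussian case in the next section can be checked this way):

```python
import numpy as np

def check_factorization(likelihood, g, h, T, y_samples, alphas, rtol=1e-9):
    """Numerically test whether p(y | alpha) == g(T(y), alpha) * h(y).

    All four callables are user-supplied placeholders. Agreement on many
    (y, alpha) pairs supports, but does not prove, the factorization.
    """
    for y in y_samples:
        for alpha in alphas:
            if not np.isclose(likelihood(y, alpha), g(T(y), alpha) * h(y), rtol=rtol):
                return False
    return True
```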

Example: Sample Mean as a Sufficient Statistic#

Consider the scenario where you have a set of \(m\) independent and identically distributed (i.i.d.) observations \(\vec{\mathbf{y}} = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m)\), where each random variable follows a normal distribution \(\mathcal{N}(\mu, \sigma^2)\); the mean \(\mu\) is the parameter of interest, and the variance \(\sigma^2\) is known.

The likelihood function is:

\[ p(\vec{y};\mu, \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \]

This can be rewritten as:

\[ p(y_1, \ldots, y_m; \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^m \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2\right) \]

The sum of squares is expanded as

\[ \sum_{i=1}^{m} (y_i - \mu)^2 = \sum_{i=1}^{m} y_i^2 - 2\mu \sum_{i=1}^{m} y_i + m\mu^2 \]

The sample mean is defined as

\[ \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i \]

Substituting this into the likelihood function, we get:

\[ p(\vec{y};\mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^m \exp\left(-\frac{1}{2\sigma^2} \left( \sum_{i=1}^{m} y_i^2 - 2m\mu \bar{y} + m\mu^2 \right) \right) \]

Using the identity \(\sum_{i=1}^{m} (y_i - \mu)^2 = \sum_{i=1}^{m} (y_i - \bar{y})^2 + m(\bar{y} - \mu)^2\) and applying the Fisher Factorization theorem, we get

\[ p(\vec{y}; \mu, \sigma^2) = \underbrace{\exp\left( -\frac{m(\bar{y} - \mu)^2}{2\sigma^2} \right)}_{g(\bar{y}, \mu)} \cdot \underbrace{\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^m \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \bar{y})^2 \right)}_{h(\vec{y})} \]

where

  • \( g(T(\vec{y}), \alpha) = g(\bar{y}, \mu) \):

    • Depends on \( \mu \): Through \( \mu \) in \( (\bar{y} - \mu)^2 \)

    • Depends on \(T(\vec{y}) = \bar{y} \)

  • \( h(\vec{y}) \):

    • Depends only on the data \( \vec{y} \): Through \( \sum_{i=1}^{m} (y_i - \bar{y})^2 \) and the constant term \( \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^m \) (note that \(\bar{y}\) can be computed from \(\vec{y}\), so \(h\) is indeed a function of the data alone)

    • Does not depend on \( \mu \)

Thus, this factorization confirms that the sample mean \(\bar{\mathbf{y}}\) is a sufficient statistic for the parameter \(\mu\).
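The factorization can also be verified numerically (a minimal sketch with arbitrary assumed values for \(\mu\), \(\sigma\), and simulated data): evaluating \(g(\bar{y}, \mu)\,h(\vec{y})\) reproduces the likelihood computed directly from its product definition.

```python
import numpy as np

# Assumed values for the sketch: mu = 1.5, sigma = 2.0, m = 10 simulated observations.
rng = np.random.default_rng(1)
mu, sigma, m = 1.5, 2.0, 10
y = rng.normal(mu, sigma, size=m)
ybar = y.mean()

# Likelihood evaluated directly as the product of the individual normal pdfs.
p_direct = np.prod(np.exp(-(y - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2))

# The same likelihood via the factorization p = g(ybar, mu) * h(y).
g = np.exp(-m * (ybar - mu) ** 2 / (2 * sigma**2))
h = (2 * np.pi * sigma**2) ** (-m / 2) * np.exp(-np.sum((y - ybar) ** 2) / (2 * sigma**2))

print(p_direct, g * h)  # the two values agree up to floating-point rounding
```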

Derivation of the likelihood function (joint PDF)#

Note that the likelihood function is the joint PDF of the observations \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \).

Specifically, given that \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \) are independent and identically distributed (i.i.d.) random variables from a normal distribution \( \mathcal{N}(\mu, \sigma^2) \), the joint pdf of these observations can be written as:

\[ f(\vec{\mathbf{y}};\mu) = f(\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m; \mu) = \prod_{i=1}^{m} f(\mathbf{y}_i ; \mu) \]

Since each \( \mathbf{y}_i \) is normally distributed with mean \( \mu \) and variance \( \sigma^2 \), the pdf of each \( y_i \) is:

\[ f(\mathbf{y}_i ; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \]

Thus, the joint pdf is:

\[ f(\vec{y} ; \mu) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \]

This can be simplified to:

\[ f(\vec{y} ; \mu) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^m \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2\right) \]

This is the joint pdf of the observations \( \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m \), and it is also the likelihood function when considering \( \mu \) as the parameter to be estimated from the data.
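As a quick check of this simplification (a sketch using scipy.stats.norm with arbitrary assumed values), the product of the individual pdfs and the closed-form expression give the same number:

```python
import numpy as np
from scipy.stats import norm

# Assumed values for the sketch: mu = 0.5, sigma = 1.3, m = 8 simulated observations.
rng = np.random.default_rng(2)
mu, sigma, m = 0.5, 1.3, 8
y = rng.normal(mu, sigma, size=m)

# Joint pdf as the product of the individual normal pdfs, prod_i f(y_i; mu).
f_product = np.prod(norm.pdf(y, loc=mu, scale=sigma))

# Simplified closed form of the joint pdf (the likelihood function).
f_closed = (2 * np.pi * sigma**2) ** (-m / 2) * np.exp(-np.sum((y - mu) ** 2) / (2 * sigma**2))

print(f_product, f_closed)  # identical up to floating-point rounding
```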

Discussion: Sufficient Statistic \(T(\vec{\mathbf{y}})\) vs. Estimator \(\hat{\boldsymbol{\alpha}}(\vec{\mathbf{y}})\)#

  • A sufficient statistic is a function of the data that captures all the information available in the data about a particular parameter.

  • An estimator is a rule or function that provides an estimate of a parameter based on the observed data. It is also typically a statistic (a function of the data) that is used to infer the value of an unknown parameter.

Loosely speaking, a sufficient statistic is a function of the data that can itself serve as an estimator; it often plays a dual role, both summarizing the data and providing an estimate of the parameter.

Sufficient Statistic for The Sample Variance#

Consider independent and identically distributed (i.i.d.) Gaussian random variables, where each individual observation \(\mathbf{y}_i\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\) as above.

We know that the sample mean \(\bar{\mathbf{y}}\) is a sufficient statistic for estimating the true mean \(\mu\). Now, what is a sufficient statistic for estimating the true variance \(\sigma^2\)?

A natural candidate is the sample variance \(\mathbf{s}_{\mathbf{y}}^2\), defined as

\[ \mathbf{s}_{\mathbf{y}}^2 = \frac{1}{m} \sum_{i=1}^{m}(\mathbf{y}_i - \bar{\mathbf{y}})^2 \]

Recall that the likelihood function is given by

\[ p(\vec{y}; \mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^m \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mu)^2 \right) \]

Given that \( \mu \) is known (or already estimated), our goal is to identify a statistic that encapsulates all information about \( \sigma^2 \) contained in the data \( \vec{y} \).

Since \(\mu\) is treated as known, we replace the sample mean \(\bar{y}\) by \(\mu\) in the definition above; the realization of the sample variance is then

\[ s_y^2 = \frac{1}{m} \sum_{i=1}^{m} (y_i - \mu)^2 \]

Thus, the sum of squared deviations from \(\mu\) is:

\[ \sum_{i=1}^{m} (y_i - \mu)^2 = m s_y^2 \]

Substitute the expression for the sum of squared deviations into the likelihood function:

\[ p(\vec{y}; \mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^m \exp\left( -\frac{m s_y^2}{2\sigma^2} \right) \]

Applying the Fisher Factorization theorem, we obtain

\[ p(\vec{y}; \mu, \sigma^2) = \underbrace{\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^m \exp\left( -\frac{m s_y^2}{2\sigma^2} \right)}_{g(s_y^2, \sigma^2)} \cdot \underbrace{1}_{h(\vec{y})} \]

where

  • \( g(T(\vec{y}), \alpha) = g(s_y^2, \sigma^2) \):

    • Depends on \( \sigma^2 \): Through the exponential term and the coefficient.

    • Depends on the data only through \( s_y^2 \): Since \( s_y^2 \) is a function of \( \vec{y} \).

  • \( h(\vec{y}) = 1 \):

    • Independence from \( \sigma^2 \): This function does not involve \( \sigma^2 \).

    • In this case, \( h(\vec{y}) \) is simply 1, meaning it does not contribute any additional information about \( \sigma^2 \).

Thus, the sample variance \( s_{\mathbf{y}}^2 \) is a sufficient statistic for the variance \(\sigma^2\) when the mean \(\mu\) is known.
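As with the sample mean, this factorization can be sanity-checked numerically (a minimal sketch with \(\mu\) treated as known and arbitrary assumed values): since \(h(\vec{y}) = 1\), the term \(g(s_y^2, \sigma^2)\) alone should reproduce the likelihood.

```python
import numpy as np

# Assumed values for the sketch: known mean mu = 0, sigma = 1.7, m = 12 observations.
rng = np.random.default_rng(3)
mu, sigma, m = 0.0, 1.7, 12
y = rng.normal(mu, sigma, size=m)

# Statistic from the derivation: mean squared deviation about the known mean.
s2 = np.mean((y - mu) ** 2)

# Likelihood evaluated directly as the product of the individual normal pdfs.
p_direct = np.prod(np.exp(-(y - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2))

# Likelihood via the factorization p = g(s2, sigma^2) * h(y) with h(y) = 1.
g = (2 * np.pi * sigma**2) ** (-m / 2) * np.exp(-m * s2 / (2 * sigma**2))

print(p_direct, g)  # the two values agree up to floating-point rounding
```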

In summary:

  • The sample mean \(\bar{y}\) is a sufficient statistic for the mean \(\mu\) of the distribution.

    • This means it contains all the necessary information about \(\mu\) present in the data.

    • Moreover, \(\bar{y}\) is an unbiased estimator of the true mean \(\mu\), meaning that its expected value equals \(\mu\).

  • The sample variance \( s_{\mathbf{y}}^2 \), on the other hand, is also a sufficient statistic for the variance \(\sigma^2\).

    • It captures all the relevant information about the variance from the data.

    • However, when used as an estimator of the true variance \(\sigma^2\), \( s_{\mathbf{y}}^2 \) is a biased estimator.

    • Specifically, when \( s_{\mathbf{y}}^2 \) is computed about the sample mean \(\bar{\mathbf{y}}\) (as in the definition above), its expectation is \(\frac{m-1}{m}\sigma^2\), which is slightly lower than the true variance \(\sigma^2\).

    • To correct this bias, the unbiased estimator of the variance is \(\frac{m}{m-1} s_{\mathbf{y}}^2\), often denoted as \( s^2_{\text{unbiased}} \); a short simulation illustrating the bias and its correction follows below.
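The bias and its correction can be illustrated with a short simulation (a minimal sketch assuming \(\mu = 0\), \(\sigma^2 = 4\), and \(m = 5\)):

```python
import numpy as np

# Assumed values for the sketch: mu = 0, sigma^2 = 4, m = 5 observations per dataset.
rng = np.random.default_rng(4)
mu, sigma2, m, trials = 0.0, 4.0, 5, 200_000

y = rng.normal(mu, np.sqrt(sigma2), size=(trials, m))
ybar = y.mean(axis=1, keepdims=True)

s2_biased = np.mean((y - ybar) ** 2, axis=1)   # sample variance with the 1/m factor
s2_corrected = s2_biased * m / (m - 1)         # bias-corrected (unbiased) version

print("mean of s_y^2:           ", s2_biased.mean())     # close to (m-1)/m * sigma^2 = 3.2
print("mean of m/(m-1) * s_y^2: ", s2_corrected.mean())  # close to sigma^2 = 4.0
```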