Lossless Coding of Information Sources#

Goal of Data Compression#

The primary goal of data compression is to represent an information source using the fewest possible bits while ensuring that the original data can be faithfully recovered from its compressed form.

In communication systems, this means:

  • Minimizing data size for efficient transmission or storage.

  • Preserving information integrity, ensuring the original message can be reconstructed without distortion (in lossless compression) or with acceptable distortion (in lossy compression).

The quality of recovery depends on the type of compression used, but the overarching aim is efficiency without compromising usability.

Classification of Data Compression#

Data compression is categorized into two main types:

  • Lossless Compression

    • Exact reconstruction: The original data can be perfectly recovered from the compressed version.

    • Used for text, executable files, and certain types of images (e.g., PNG).

    • Examples: Huffman coding, Arithmetic coding, Lempel-Ziv (LZ77/LZ78), ZIP format.

  • Lossy Compression

    • Irreversible compression: Some data is discarded to achieve higher compression ratios.

    • Used where minor distortions are acceptable, such as in multimedia applications.

    • Examples: JPEG (images), MP3 (audio), MPEG (video).

The choice between lossless and lossy compression depends on the application:

  • Lossless compression is required for data integrity, such as text files, medical images, or software.

  • Lossy compression is preferred for human-perceived data, where small distortions are imperceptible but greatly reduce file size.

Objective of Lossless Compression#

The key goal of lossless compression is to:

  • Minimize the number of bits needed to represent the source data.

  • Ensure perfect reconstruction from the compressed form.

This means that every bit of the original data can be fully recovered without errors. For example:

  • Text compression (e.g., using Huffman coding) ensures that a decompressed file matches the original character-for-character.

  • PNG image compression guarantees that every pixel remains unchanged after decompression.

In information theory, the theoretical limit of lossless compression is determined by the entropy of the source.

Shannon’s source coding theorem establishes that the average number of bits needed per symbol cannot be lower than the entropy of the source.

This concept serves as the foundation for optimal coding techniques such as Huffman coding and Arithmetic coding, which aim to approach this lower bound.

Lossless Source Coding#

Model of a Discrete Memoryless Source (DMS)#

A Discrete Memoryless Source (DMS) generates symbols from a finite alphabet:

\[ \mathcal{X} = \{a_1, a_2, \ldots, a_N\}. \]

Each symbol \(a_i\) appears with probability:

\[ P[X = a_i] = p_i, \quad \text{where} \quad \sum_{i=1}^{N} p_i = 1. \]
  • Memoryless Property: Each symbol is statistically independent of previous symbols.

  • Independent Replicas: A sequence of length \(n\) is generated by \(n\) independent draws from \(X\).

For example, a binary source with:

\[ \mathcal{X} = \{0, 1\}, \quad P[0] = 0.6, \quad P[1] = 0.4 \]

might produce sequences like 001101, where each digit is sampled independently.

Note that the total number of sequences is

\[ |\mathcal{X}^n| = N^n = 2^n \quad \text{for a binary source} \]
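
To make the model concrete, here is a minimal Python sketch (the probabilities are the hypothetical ones from the example above) that draws independent symbols from a binary DMS:

```python
import random

# Hypothetical binary DMS from the example: P[0] = 0.6, P[1] = 0.4.
# Memoryless: every draw is independent of all previous draws.
ALPHABET = [0, 1]
PROBS = [0.6, 0.4]

def dms_sequence(n, seed=0):
    """Draw n independent symbols from the memoryless source."""
    rng = random.Random(seed)
    return rng.choices(ALPHABET, weights=PROBS, k=n)

print("".join(map(str, dms_sequence(20))))  # prints one length-20 realization of the source
```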

Output Sequence Definition#

An output sequence \(\vec{x}\) of length \(n\) from a DMS is:

\[\boxed{ \vec{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n } \]

where each \(x_j \in \mathcal{X}\).

  • Example (Binary Source, \(n = 100\)):
    A possible output sequence:

    \[ \vec{x} = 0100110101 \dots \]
  • Large \(n\) Assumption:

    • As \(n\) increases, statistical regularities described by the Law of Large Numbers dominate the sequence’s behavior.

    • This allows simpler probabilistic analysis by assuming empirical frequencies approach expected probabilities.

Definition of Typical Sequences#

A sequence \(\vec{x}\) is typical if the empirical frequency of each symbol matches its expected probability:

\[ \frac{\text{Occurrences of } a_i}{n} \approx p_i, \quad 1 \leq i \leq N. \]

That is, the symbol \(a_i\) appears approximately:

\[ n p_i \]

times in \(\vec{x}\).

  • Example (Binary Source, \(p_0 = 0.6, p_1 = 0.4\), \(n = 100\)):
    A typical sequence would contain about 60 zeros and 40 ones.

  • The typical set \(\mathcal{A}\) consists of all such sequences.

  • “Approximately” is formalized using probabilistic bounds, but intuitively means small deviations relative to \(n\).

This concept is crucial in source coding theorems, as typical sequences represent high-probability outcomes, allowing efficient compression strategies.
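
The following sketch gives a simple numerical version of this test; the tolerance `delta` is an illustrative parameter, not part of the formal definition:

```python
from collections import Counter

def is_typical(seq, probs, delta=0.05):
    """True if every symbol's empirical frequency is within delta of p_i."""
    n = len(seq)
    counts = Counter(seq)
    return all(abs(counts.get(a, 0) / n - p) <= delta for a, p in probs.items())

probs = {0: 0.6, 1: 0.4}           # binary source from the example
seq = [0] * 58 + [1] * 42          # 58 zeros and 42 ones, n = 100
print(is_typical(seq, probs))      # True: close to the expected 60/40 split
```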

Probability of Typical Sequences#

Law of Large Numbers and Typicality#

The law of large numbers states that as \(n \to \infty\), the empirical frequency of each symbol \(a_i\) in a sequence \(\vec{x}\) converges to its probability \(p_i\) with high probability.

For a Discrete Memoryless Source (DMS):

  • The probability that a sequence \(\vec{x}\) is typical (i.e., \(\vec{x} \in \mathcal{A}\)) approaches 1 as \(n\) grows large.

  • This happens because the fraction of occurrences of each \(a_i\) behaves like a sample mean, which concentrates around \(p_i\) for large \(n\).

Thus, for large \(n\), almost all sequences generated by the source are typical, making the typical set \(\mathcal{A}\) the dominant set of outcomes.

This property is crucial for data compression: we can focus on encoding only the typical sequences, since the atypical ones occur with vanishingly small probability.

Probability of a Typical Sequence#

For a typical sequence \(\vec{x}\) of length \(n\), each symbol \(a_i\) appears approximately:

\[ n_i \approx n p_i. \]

Let \(\boldsymbol{X} = (X_1, X_2, \ldots, X_n)\) denote the random vector of \(n\) source outputs.

The probability of \(\vec{x}\) occurring is:

\[ P[\boldsymbol{X} = \vec{x}] = \prod_{i=1}^{N} p_i^{n_i}. \]

Since \(n_i \approx n p_i\), we approximate:

\[ \log P[\boldsymbol{X} = \vec{x}] \approx \log \prod_{i=1}^{N} p_i^{n p_i}. \]

Using logarithm properties:

\[ \log P[\boldsymbol{X} = \vec{x}] \approx \sum_{i=1}^{N} n p_i \log p_i. \]

Factoring out \(n\):

\[ \log P[\boldsymbol{X} = \vec{x}] \approx n \sum_{i=1}^{N} p_i \log p_i = -n H(X), \]

where:

\[ H(X) = -\sum_{i=1}^{N} p_i \log p_i \]

is the entropy of the source (assuming base 2 for bits).

Exponentiating:

\[\boxed{ P[\boldsymbol{X} = \vec{x}] \approx 2^{-n H(X)}. } \]

This result has critical implications:

  • All typical sequences have nearly the same probability, approximately \(2^{-n H(X)}\).

  • Since the number of typical sequences is about \(2^{n H(X)}\), their total probability approaches \(1\).

  • This insight is key for Shannon’s source coding theorem, as it justifies encoding sequences using approximately \(n H(X)\) bits, achieving optimal compression.
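
A quick numerical check, assuming the same binary source with \(p_0 = 0.6\) and \(p_1 = 0.4\), shows that the log-probability of a typical sequence indeed sits near \(-n H(X)\):

```python
import math

p = {0: 0.6, 1: 0.4}
H = -sum(pi * math.log2(pi) for pi in p.values())        # entropy, ~0.971 bits

n = 100
counts = {0: 60, 1: 40}                                  # an exactly typical composition
log2_prob = sum(counts[a] * math.log2(p[a]) for a in p)  # log2 P[X = x]

print(round(H, 3))            # 0.971
print(round(log2_prob, 1))    # -97.1
print(round(-n * H, 1))       # -97.1, matching the 2^{-n H(X)} approximation
```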

Transmission Rate and Typical Set Size#

Cardinality of the Typical Set#

Since the probability of typical sequences approaches 1 for large \(n\) (i.e., \(P[\mathcal{A}] \approx 1\)), and each typical sequence has probability:

\[ P[\boldsymbol{X} = \vec{x}] \approx 2^{-n H(X)}, \]

the number of typical sequences (cardinality of \(\mathcal{A}\)) is approximately:

\[ |\mathcal{A}| \times 2^{-n H(X)} \approx 1 \quad \Rightarrow \quad \boxed{|\mathcal{A}| \approx 2^{n H(X)}}. \]

This means the number of high-probability sequences grows exponentially with \(n\).

For a binary source with entropy \(H(X) = 1\) bit (e.g., a fair coin), we have:

\[ |\mathcal{A}| \approx 2^{100} \quad \text{for} \quad n = 100. \]

In other words, for a fair coin essentially all of the \(2^{100}\) possible sequences of length \(100\) are typical, and nearly all of the probability mass is concentrated in this set.

Focus on Typical Sequences for Transmission#

For large \(n\), out of all \(|\mathcal{X}|^n = N^n\) possible sequences, the typical subset \(\mathcal{A}\) is almost certain to occur.

  • Non-typical sequences (e.g., all zeros in a fair coin source) have a vanishingly small probability.

  • Thus, for lossless transmission, we only need to encode the typical sequences in \(\mathcal{A}\), ignoring the rest.

This reduces the complexity of encoding, since we do not need to account for atypical sequences that are statistically improbable.

Bits Required and Transmission Rate#

Since:

\[ |\mathcal{A}| \approx 2^{n H(X)} \]

we need at least:

\[ \log_2 |\mathcal{A}| \approx n H(X) \]

bits to encode all typical sequences uniquely.

Thus, the transmission rate (bits per source symbol) is:

\[\boxed{ R \approx \frac{n H(X)}{n} = H(X) \quad \text{bits per source symbol}. } \]

This result aligns directly with Shannon’s source coding theorem, which states:

  • A DMS can be compressed to \(H(X)\) bits per symbol as \(n \to \infty\) without loss.

  • No lower rate suffices for error-free reconstruction.
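
A small numerical illustration, again assuming the biased binary source with \(p_0 = 0.6\), compares the roughly \(n H(X)\) bits needed to index the typical set with the \(n \log_2 N\) bits of an uncoded representation:

```python
import math

p = [0.6, 0.4]
H = -sum(pi * math.log2(pi) for pi in p)   # ~0.971 bits per symbol

n = 1000
raw_bits = int(n * math.log2(len(p)))      # 1000 bits with no compression
typical_bits = math.ceil(n * H)            # ~971 bits to index the typical set

print(raw_bits, typical_bits)              # rate ~ H(X) bits per source symbol
```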

Shannon’s First Theorem#

Statement of Shannon’s First Theorem#

Shannon’s First Theorem, also known as the Lossless Source Coding Theorem, was introduced by Claude Shannon in 1948 and establishes a fundamental result in information theory.

It states:

THEOREM:
Let \(X\) denote a Discrete Memoryless Source (DMS) with entropy \(H(X)\). A lossless source code exists for every rate \(R\) satisfying

\[ R > H(X). \]

Conversely, no lossless source code exists at any rate

\[ R < H(X). \]

This theorem establishes the minimum rate required for lossless compression and is interpreted in two key parts:

  • Achievability: If \(R > H(X)\), then there exists a coding scheme that allows perfect reconstruction of the source output using \(R\) bits per symbol on average.

  • Converse: If \(R < H(X)\), no such lossless code exists, meaning perfect reconstruction is impossible at that rate.

Example:
Consider a binary DMS with entropy:

\[ H(X) = 0.8 \text{ bits per symbol}. \]
  • It can be compressed losslessly at \(R = 0.9\) bits per symbol.

  • It cannot be compressed at \(R = 0.7\) bits per symbol without losing information.
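
For concreteness, a binary DMS reaches \(H(X) = 0.8\) bits per symbol when its symbol probabilities are roughly \(\{0.243, 0.757\}\); the bisection below is a purely illustrative way to recover that value from the binary entropy function:

```python
import math

def h_binary(p):
    """Binary entropy function H_b(p) in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Solve H_b(p) = 0.8 for p in (0, 0.5) by bisection (H_b is increasing there).
lo, hi = 1e-9, 0.5
for _ in range(60):
    mid = (lo + hi) / 2
    if h_binary(mid) < 0.8:
        lo = mid
    else:
        hi = mid

print(round(lo, 3))   # ~0.243, i.e. P = {0.243, 0.757} gives H(X) ~ 0.8 bits
```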

This theorem formalizes the optimal compression limit for any source.

Fundamental Limit and Role of Entropy#

Shannon’s theorem establishes \(H(X)\) as the fundamental limit for lossless source coding.

  • Previously, entropy was defined as a measure of uncertainty or average information content based on the probability distribution of source outputs.

  • Shannon’s theorem transforms this intuition into a precise engineering principle:

    \(H(X)\) is not just a statistical measure; it is the absolute minimum rate for lossless compression.

Implications:

  • If \(H(X) = 1\) bit, then no compression below 1 bit per symbol is possible.

  • If \(H(X) = 0.5\) bits, then a sequence of \(n\) symbols can be compressed to about \(0.5n\) bits on average.

Example: Fair Coin Toss
A fair coin has:

\[ H(X) = 1 \text{ bit}. \]
  • This means each flip carries exactly 1 bit of information.

  • The theorem confirms that a sequence of flips cannot be compressed losslessly below 1 bit per flip.

Shannon’s First Theorem provides the formal limit on how much a DMS can be compressed without loss. It underpins modern compression algorithms, such as:

  • Huffman Coding (variable-length encoding for efficient compression).

  • Arithmetic Coding (near-optimal encoding based on probability models).

  • Lempel-Ziv (LZ) Compression (foundation for ZIP, PNG, and other formats).

This theorem is one of the most fundamental results in information theory, setting the mathematical foundation for data compression and efficient communication systems.
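
As a minimal illustration of one of these techniques, the sketch below builds a Huffman prefix code for a small hypothetical alphabet; for this dyadic distribution the average codeword length exactly matches the entropy, and in general it stays within one bit of \(H(X)\):

```python
import heapq
import math

def huffman_code(probs):
    """Build a Huffman prefix code for a {symbol: probability} dictionary."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)     # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # hypothetical source
code = huffman_code(probs)
H = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(w) for s, w in code.items())

print(code)              # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(H, avg_len)        # entropy 1.75 bits, average codeword length 1.75 bits
```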

Discrete Stationary Sources#

Entropy Limit for DMS#

For a Discrete Memoryless Source (DMS), Shannon’s First Theorem established that the entropy \(H(X)\) defines the minimum rate for lossless compression.

  • In a DMS, each symbol is independent of previous symbols.

  • The entropy per symbol remains constant over time.

  • Typical sequences allow us to directly relate the transmission rate to \(H(X)\).

This independence property simplifies the coding process, as symbols can be encoded separately.

Introduction to Statistically Dependent Sources#

Now, we extend our analysis to discrete stationary sources, where symbol probabilities are not independent but exhibit statistical dependence.

A stationary source has memory, meaning that current outputs may depend on past outputs.

A source is stationary if its joint probability distributions remain unchanged under time shifts, i.e.,

\[ P(X_1, X_2, \ldots, X_k) = P(X_{1+t}, X_{2+t}, \ldots, X_{k+t}) \quad \text{for every block length } k \text{ and every shift } t. \]

This means that the statistical structure remains constant over time.

Examples of stationary sources:

  • Markov Chains (where the probability of the next state depends on the current state).

  • Natural Language Text (where the probability of letters or words depends on context).

  • Speech Signals (where sound patterns evolve over time with dependencies).

In these cases, past symbols provide predictive information, influencing entropy calculations.

Entropy of a Sequence#

To analyze stationary sources, we evaluate the entropy of a sequence of symbols.

For a block of \(k\) random variables:

\[ X_1, X_2, \dots, X_k \]

from a stationary source, the joint entropy follows the chain rule:

\[ \boxed{ H(X_1, X_2, \ldots, X_k) = \sum_{i=1}^{k} H(X_i \mid X_1, X_2, \ldots, X_{i-1}). } \]

where:

  • \(H(X_i \mid X_1, X_2, \ldots, X_{i-1})\) is the conditional entropy of \(X_i\), given the prior sequence.

  • This term measures the uncertainty in \(X_i\) when past symbols are known.

For a DMS (independent symbols):

\[ H(X_i \mid X_1, X_2, \dots, X_{i-1}) = H(X). \]

Thus, joint entropy scales linearly:

\[ H(X_1, X_2, \dots, X_k) = k H(X). \]

However, for stationary sources (dependent symbols):

\[ H(X_i \mid X_1, X_2, \dots, X_{i-1}) \leq H(X). \]
  • The reason is that past symbols provide information, reducing uncertainty.

  • The more predictable the source, the lower the conditional entropy.

  • In extreme cases (fully predictable sources), entropy can approach zero.

Thus, we can see that:

  • Memoryless sources have a constant entropy rate per symbol.

  • Stationary sources may exhibit redundancy, leading to lower entropy rates and better compression opportunities.

This concept is critical in practical applications like speech recognition, text compression (e.g., predictive models in AI), and video encoding, where statistical dependencies are leveraged for efficient representation.
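
To see the chain rule and the effect of memory numerically, consider a hypothetical two-state stationary Markov source that stays in its current state with probability 0.9; the sketch below verifies \(H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)\) and shows that the conditional term is well below \(H(X_1)\):

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical two-state stationary Markov source: stay with probability 0.9.
T = [[0.9, 0.1],
     [0.1, 0.9]]
pi = [0.5, 0.5]                                       # stationary distribution

# Joint distribution of (X_1, X_2): P(i, j) = pi[i] * T[i][j].
joint = [pi[i] * T[i][j] for i in range(2) for j in range(2)]

H1 = H(pi)                                            # H(X_1) = 1 bit
H2_given_1 = sum(pi[i] * H(T[i]) for i in range(2))   # H(X_2 | X_1) ~ 0.469 bits

print(round(H(joint), 3))         # 1.469: joint entropy H(X_1, X_2)
print(round(H1 + H2_given_1, 3))  # 1.469: same value via the chain rule
print(H2_given_1 < H1)            # True: memory reduces per-symbol uncertainty
```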

Entropy Rate#

Entropy per Letter#

For a block of \(k\) symbols:

\[ X_1, X_2, \ldots, X_k, \]

the entropy per symbol is defined as:

\[ \boxed{ H_k(X) = \frac{1}{k} H(X_1, X_2, \ldots, X_k). } \]

This normalization provides the average uncertainty per output over a block of length \(k\). Using the chain rule for entropy:

\[ H_k(X) = \frac{1}{k} \sum_{i=1}^{k} H(X_i \mid X_1, X_2, \ldots, X_{i-1}). \]
  • For small \(k\): \(H_k(X)\) reflects short-term dependencies.

  • For large \(k\): We seek a long-term average to fully characterize the source.

Definition of Entropy Rate#

The entropy rate of a stationary source, denoted as \(H_{\infty}(X)\), is defined as the limiting entropy per symbol as \(k \to \infty\):

\[ \boxed{ H_{\infty}(X) \triangleq \lim_{k \to \infty} H_k(X) = \lim_{k \to \infty} \frac{1}{k} H(X_1, X_2, \ldots, X_k). } \]

For stationary sources, this limit exists due to the time-invariance of their statistical properties. An alternative formulation using the chain rule is:

\[ H_{\infty}(X) = \lim_{k \to \infty} H(X_k \mid X_1, X_2, \ldots, X_{k-1}). \]

For stationary sources the conditional entropy \(H(X_k \mid X_1, \ldots, X_{k-1})\) is non-increasing in \(k\) and converges, and its limit coincides with the per-symbol average above. This formulation emphasizes that the entropy rate quantifies the irreducible uncertainty per symbol, once all dependencies have been accounted for.

Entropy Rate for Memoryless Sources#

For a Discrete Memoryless Source (DMS), where symbols are independent, we have:

\[ H(X_i \mid X_1, X_2, \ldots, X_{i-1}) = H(X), \]

which implies:

\[ H(X_1, X_2, \ldots, X_k) = k H(X). \]

Dividing by \(k\), the entropy per symbol remains constant:

\[ H_k(X) = \frac{k H(X)}{k} = H(X). \]

Thus, the entropy rate simplifies to:

\[ \boxed{ H_{\infty}(X) = \lim_{k \to \infty} H_k(X) = H(X). } \]

This confirms that for a DMS, \(H(X)\) alone fully characterizes the compression limit, consistent with Shannon’s First Theorem.

Implications for Stationary Sources#

For stationary sources with memory, dependencies between symbols cause the conditional entropy to decrease:

\[ H(X_i \mid X_1, \ldots, X_{i-1}) \leq H(X_1). \]

Thus, the entropy rate satisfies:

\[ H_{\infty}(X) \leq H(X_1). \]

This inequality shows that dependencies reduce per-symbol uncertainty, leading to potential compression gains.
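
For a first-order stationary Markov source this limit has a simple closed form: \(H_{\infty}(X)\) is the stationary-distribution-weighted entropy of the transition-matrix rows, since every conditional term beyond the first equals \(H(X_2 \mid X_1)\). The sketch below, reusing the hypothetical two-state chain from the earlier example, shows \(H_k(X)\) decreasing toward this entropy rate, which sits well below \(H(X_1)\):

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical stationary two-state Markov source: stay with probability 0.9.
T = [[0.9, 0.1], [0.1, 0.9]]
pi = [0.5, 0.5]

H1 = H(pi)                                      # H(X_1) = 1 bit
Hc = sum(pi[i] * H(T[i]) for i in range(2))     # H(X_2 | X_1) ~ 0.469 bits

# Chain rule for a first-order Markov chain:
# H(X_1, ..., X_k) = H(X_1) + (k - 1) * H(X_2 | X_1),
# so H_k(X) = H(X_1, ..., X_k) / k decreases toward H_inf(X) = Hc.
for k in (1, 2, 5, 10, 100):
    print(k, round((H1 + (k - 1) * Hc) / k, 3))

print("entropy rate:", round(Hc, 3), "<= H(X_1) =", H1)
```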

Example: English Text Compression

  • In English text, letter dependencies (e.g., “h” often follows “t”) reduce entropy.

  • If letters were independent, the entropy would be \(\log_2 26 \approx 4.7\) bits per letter.

  • However, due to this predictability, the actual entropy rate is much lower (around 1.5 bits per letter, depending on how much context is taken into account).

  • This enables efficient coding using predictive models (e.g., Markov models, Huffman coding, or arithmetic coding).
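
A very rough way to observe this effect is to compare an order-0 (unigram) letter entropy estimate with an order-1 (bigram-conditional) estimate on a sample of text. The sketch below is purely illustrative: on a toy string both estimates are strongly biased, and reliable figures require corpora of millions of characters.

```python
import math
from collections import Counter

def entropy(counter):
    """Entropy in bits of the empirical distribution in a Counter."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

# Toy sample; real estimates of English letter statistics need huge corpora.
text = "the theory of information establishes the limits of data compression"
letters = [ch for ch in text.lower() if ch.isalpha()]

unigrams = Counter(letters)
bigrams = Counter(zip(letters, letters[1:]))

H0 = entropy(unigrams)            # order-0 estimate: letters treated as independent
H1 = entropy(bigrams) - H0        # ~H(X_2 | X_1) via the chain rule, order-1 estimate

print(round(H0, 2), round(H1, 2)) # conditioning on the previous letter lowers the estimate
```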