theory Huffman
imports Main
begin
section ‹Introduction›
subsection ‹Binary Codes \label{binary-codes}›
text ‹
Suppose we want to encode strings over a finite source alphabet to sequences
of bits. The approach used by ASCII and most other charsets is to map each
source symbol to a distinct $k$-bit code word, where $k$ is fixed and is
typically 8 or 16. To encode a string of symbols, we simply encode each symbol
in turn. Decoding involves mapping each $k$-bit block back to the symbol it
represents.
Fixed-length codes are simple and fast, but they generally waste space. If
we know the frequency $w_a$ of each source symbol $a$, we can save space
by using shorter code words for the most frequent symbols. We
say that a (variable-length) code is {\sl optimum\/} if it minimizes the sum
$\sum_a w_a \vthinspace\delta_a$, where $\delta_a$ is the length of the binary
code word for $a$. Information theory tells us that a code is optimum if
for each source symbol $c$ the code word representing $c$ has length
$$\textstyle \delta_c = \log_2 {1 \over p_c}, \qquad
\hbox{where}\enskip p_c = {w_c \over \sum_a w_a}.$$
This number is generally not an integer, so we cannot use it directly.
Nonetheless, the above criterion is a useful yardstick and paves the way for
arithmetic coding \cite{rissanen-1976}, a generalization of the method
presented here.
\def\xabacabad{\xa\xb\xa\xc\xa\xb\xa\xd}%
As an example, consider the source string `$\xabacabad$'. We have
$$p_{\xa} = \tfrac{1}{2},\,\; p_{\xb} = \tfrac{1}{4},\,\;
p_{\xc} = \tfrac{1}{8},\,\; p_{\xd} = \tfrac{1}{8}.$$
The optimum lengths for the binary code words are all integers, namely
$$\delta_{\xa} = 1,\,\; \delta_{\xb} = 2,\,\; \delta_{\xc} = 3,\,\;
\delta_{\xd} = 3,$$
and they are realized by the code
$$C_1 = \{ \xa \mapsto 0,\, \xb \mapsto 10,\, \xc \mapsto 110,\,
\xd \mapsto 111 \}.$$
Encoding `$\xabacabad$' produces the 14-bit code word 01001100100111. The code
$C_1$ is optimum: No code that unambiguously encodes source symbols one at a
time could do better than $C_1$ on the input `$\xabacabad$'. In
particular, with a fixed-length code such as
$$C_2 = \{ \xa \mapsto 00,\, \xb \mapsto 01,\, \xc \mapsto 10,\,
\xd \mapsto 11 \}$$
we need at least 16~bits to encode `$\xabacabad$'.
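To make the comparison concrete, here is a small Haskell sketch (all code
examples in this document are illustrative only and are not part of the
formalization; the names \texttt{codeC1} and \texttt{encode} are ours) that
encodes a string symbol by symbol using a code table:
\begin{verbatim}
import qualified Data.Map as Map

-- The code C1 as a table from symbols to bit strings.
codeC1 :: Map.Map Char String
codeC1 = Map.fromList
  [('a', "0"), ('b', "10"), ('c', "110"), ('d', "111")]

-- Encode a string by concatenating the code words of its symbols.
encode :: Map.Map Char String -> String -> String
encode table = concatMap (table Map.!)
\end{verbatim}
Evaluating \texttt{encode codeC1 "abacabad"} yields the 14-bit string
\texttt{"01001100100111"}; with a table for $C_2$ instead, the same input
requires 16~bits.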
›
subsection ‹Binary Trees›
text ‹
Inside a program, binary codes can be represented by binary trees. For example,
the trees\strut
$$\vcenter{\hbox{\includegraphics[scale=1.25]{tree-abcd-unbalanced.pdf}}}
\qquad \hbox{and} \qquad
\vcenter{\hbox{\includegraphics[scale=1.25]{tree-abcd-balanced.pdf}}}$$
correspond to $C_1$ and $C_2$. The code word for a given
symbol can be obtained as follows: Start at the root and descend toward the leaf
node associated with the symbol one node at a time; generate a 0 whenever the
left child of the current node is chosen and a 1 whenever the right child is
chosen. The generated sequence of 0s and 1s is the code word.
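This procedure is easy to express as a function on trees. The following
Haskell sketch (again ours, not part of the formalization) introduces a tree
type in the spirit of the figures; each constructor carries an integer
weight, whose role is explained below:
\begin{verbatim}
-- A leaf stores a weight and a symbol; an inner node stores a
-- cached weight and its two subtrees (weights are discussed below).
data Tree a = Leaf Int a
            | Node Int (Tree a) (Tree a)

-- Path from the root to the leaf labeled with the given symbol:
-- '0' for each left turn, '1' for each right turn.
codeOf :: Eq a => Tree a -> a -> Maybe String
codeOf (Leaf _ b) a
  | a == b    = Just ""
  | otherwise = Nothing
codeOf (Node _ l r) a =
  case codeOf l a of
    Just bs -> Just ('0' : bs)
    Nothing -> fmap ('1' :) (codeOf r a)
\end{verbatim}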
To avoid ambiguities, we require that only leaf nodes are labeled with symbols.
This ensures that no code word is a prefix of another, thereby eliminating the
source of all ambiguities.%
\footnote{Strictly speaking, there is another potential source of ambiguity.
If the alphabet consists of a single symbol $a$, that symbol could be mapped
to the empty code word, and then any string $aa\ldots a$ would map to the
empty bit sequence, giving the decoder no way to recover the original string's
length. This scenario can be ruled out by requiring that the alphabet has
cardinality 2 or more.}
Codes that have this property are called {\sl prefix codes}. As an example of a
code that does not have this property, consider the code associated with the
tree\strut
$$\vcenter{\hbox{\includegraphics[scale=1.25]{tree-abcd-non-prefix.pdf}}}$$
and observe that `$\xb\xb\xb$', `$\xb\xd$', and `$\xd\xb$' all map to the
code word 111.
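Decoding with a prefix code tree is just as direct: start at the root,
branch on each bit, and emit a symbol (returning to the root) whenever a
leaf is reached. A sketch, reusing the \texttt{Tree} type from above:
\begin{verbatim}
-- Decode a bit string by repeatedly walking from the root to a leaf.
decode :: Tree a -> String -> [a]
decode root = go root
  where
    go (Leaf _ a) bs = a : if null bs then [] else go root bs
    go (Node _ l _) ('0' : bs) = go l bs
    go (Node _ _ r) ('1' : bs) = go r bs
    go _ _ = []  -- bits ran out in mid-symbol: invalid input
\end{verbatim}
Because no code word is a prefix of another, each leaf is reached by exactly
one bit sequence, so \texttt{decode} inverts \texttt{encode}. No such
function can exist for the tree above, where the bit string 111 has three
parses.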
Each node in a code tree is assigned a {\sl weight}. For a leaf node, the
weight is the frequency of its symbol; for an inner node, it is the sum of the
weights of its subtrees. Code trees can be annotated with their weights:\strut
$$\vcenter{\hbox{\includegraphics[scale=1.25]%
{tree-abcd-unbalanced-weighted.pdf}}}
\qquad\qquad
\vcenter{\hbox{\includegraphics[scale=1.25]%
{tree-abcd-balanced-weighted.pdf}}}$$
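In the Haskell sketches, the weight is the integer stored at the root of a
tree, and an inner node can be built with a helper that maintains the
caching invariant just described (the names are again ours):
\begin{verbatim}
-- The weight cached at the root of a tree.
weight :: Tree a -> Int
weight (Leaf w _)   = w
weight (Node w _ _) = w

-- Build an inner node whose weight is the sum of its subtrees' weights.
node :: Tree a -> Tree a -> Tree a
node t1 t2 = Node (weight t1 + weight t2) t1 t2
\end{verbatim}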
For our purposes, it is sufficient to consider only full binary trees (trees
whose inner nodes all have two children). This is because any inner node with
only one child can advantageously be eliminated; for example,\strut
$$\vcenter{\hbox{\includegraphics[scale=1.25]{tree-abc-non-full.pdf}}}
\qquad \hbox{becomes} \qquad
\vcenter{\hbox{\includegraphics[scale=1.25]{tree-abc-full.pdf}}}$$
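The transformation can be sketched on a variant of the \texttt{Tree} type
whose inner nodes may lack children (this type and \texttt{fullify} are
ours, introduced only for illustration):
\begin{verbatim}
-- Trees in which an inner node may have zero, one, or two children.
data Tree0 a = Leaf0 Int a
             | Node0 Int (Maybe (Tree0 a)) (Maybe (Tree0 a))

-- Splice out inner nodes with exactly one child. Code words can
-- only get shorter, so the weighted sum never increases.
fullify :: Tree0 a -> Tree0 a
fullify (Node0 _ (Just t) Nothing)  = fullify t
fullify (Node0 _ Nothing (Just t))  = fullify t
fullify (Node0 w (Just l) (Just r)) =
  Node0 w (Just (fullify l)) (Just (fullify r))
fullify t = t
\end{verbatim}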
›
subsection ‹Huffman's Algorithm›
text ‹
David Huffman \cite{huffman-1952} discovered a simple algorithm for
constructing an optimum code tree for specified symbol frequencies:
Create a forest consisting of only leaf nodes, one for each symbol in the
alphabet, taking the given symbol frequencies as initial weights for the nodes.
Then pick the two trees
$$\vcenter{\hbox{\includegraphics[scale=1.25]{tree-w1.pdf}}}
\qquad \hbox{and} \qquad
\vcenter{\hbox{\includegraphics[scale=1.25]{tree-w2.pdf}}}$$
\noindent\strut
with the lowest weights and replace them with the tree
$$\vcenter{\hbox{\includegraphics[scale=1.25]{tree-w1-w2.pdf}}}$$
Repeat this process until only one tree is left.
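The algorithm fits in a few lines of Haskell. The sketch below reuses
\texttt{weight} and \texttt{node} from above and keeps the forest as a list
sorted by weight (\texttt{sortOn} and \texttt{insertBy} come from
\texttt{Data.List}):
\begin{verbatim}
import Data.List (insertBy, sortOn)
import Data.Ord (comparing)

-- Build an optimum code tree from a nonempty list of
-- (frequency, symbol) pairs.
huffman :: [(Int, a)] -> Tree a
huffman = go . sortOn weight . map (uncurry Leaf)
  where
    go [t]            = t
    go (t1 : t2 : ts) =
      go (insertBy (comparing weight) (node t1 t2) ts)
    go []             = error "huffman: empty alphabet"
\end{verbatim}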
As an illustration, executing the algorithm for the frequencies
$$f_{\xd} = 3,\,\; f_{\xe} = 11,\,\; f_{\xf} = 5,\,\; f_{\xs} = 7,\,\;
f_{\xz} = 2$$
gives rise to the following sequence of states:\strut
\def\myscale{1}%
\setbox\myboxi=\hbox{(9)\strut}%
\setbox\myboxii=\hbox{\includegraphics[scale=\myscale]{tree-prime-step1.pdf}}%
\setbox\myboxiii=\hbox{\includegraphics[scale=\myscale]{tree-prime-step2.pdf}}%
$$(1)\quad\lower\ht\myboxii\hbox{\raise\ht\myboxi\box\myboxii} \qquad\qquad
(2)\enskip\lower\ht\myboxiii\hbox{\raise\ht\myboxi\box\myboxiii}$$
\vskip.5\smallskipamount
\noindent
\setbox\myboxii=\hbox{\includegraphics[scale=\myscale]{tree-prime-step3.pdf}}%
\setbox\myboxiii=\hbox{\includegraphics[scale=\myscale]{tree-prime-step4.pdf}}%
\setbox\myboxiv=\hbox{\includegraphics[scale=\myscale]{tree-prime-step5.pdf}}%
(3)\quad\lower\ht\myboxii\hbox{\raise\ht\myboxi\box\myboxii}\hfill\quad
(4)\quad\lower\ht\myboxiii\hbox{\raise\ht\myboxi\box\myboxiii}\hfill
(5)\enskip\lower\ht\myboxiv\hbox{\raise\ht\myboxi\box\myboxiv\,}
\smallskip
\noindent
Tree~(5) is an optimum tree for the given frequencies.
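With the sketch above, this run corresponds to evaluating
\begin{verbatim}
huffman [(3,'d'), (11,'e'), (5,'f'), (7,'s'), (2,'z')]
\end{verbatim}
The resulting tree has weighted path length
$11 \cdot 1 + 7 \cdot 2 + 5 \cdot 3 + 3 \cdot 4 + 2 \cdot 4 = 60$, which is
also the sum $5 + 10 + 17 + 28$ of the weights of the inner nodes created
along the way.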
›
subsection ‹The Textbook Proof›
text ‹
Why does the algorithm work? In his article, Huffman gave some motivation but
no real proof. For a proof sketch, we turn to Donald Knuth
\cite[pp.~403--404]{knuth-1997}:
\begin{quote}
It is not hard to prove that this method does in fact minimize the weighted
path length (i.e., $\sum_a w_a \vthinspace\delta_a$), by induction on $m$.
Suppose we have $w_1 \le w_2 \le w_3 \le \cdots \le w_m$, where $m \ge 2$, and
suppose that we are given a tree that minimizes the weighted path length.
(Such a tree certainly exists, since only finitely many binary trees with $m$
terminal nodes are possible.) Let $V$ be an internal node of maximum distance
from the root. If $w_1$ and $w_2$ are not the weights already attached to the
children of $V$, we can interchange them with the values that are already
there; such an interchange does not increase the weighted path length. Thus
there is a tree that minimizes the weighted path length and contains the
subtree\strut
$$\vcenter{\hbox{\includegraphics[scale=1.25]{tree-w1-w2-leaves.pdf}}}$$
Now it is easy to prove that the weighted path length of such a tree is
minimized if and only if the tree with
$$\vcenter{\hbox{\includegraphics[scale=1.25]{tree-w1-w2-leaves.pdf}}}
\qquad \hbox{replaced by} \qquad
\vcenter{\hbox{\includegraphics[scale=1.25]{tree-w1-plus-w2.pdf}}}$$
has minimum path length for the weights $w_1 + w_2$, $w_3$, $\ldots\,$, $w_m$.
\end{quote}
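The claim that the interchange does not increase the weighted path length
boils down to a one-line calculation: if a leaf of weight $w_1$ at depth
$\delta_1$ is swapped with a leaf of weight $w \ge w_1$ at depth
$\delta \ge \delta_1$ (the children of $V$ are at maximum depth), the
weighted path length changes by
$$(w_1 \delta + w \delta_1) - (w_1 \delta_1 + w \delta)
  = -(w - w_1)(\delta - \delta_1) \le 0.$$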
\noindent
There is, however, a small oddity in this proof: It is not clear why we must
assert the existence of an optimum tree that contains the subtree
$$\vcenter{\hbox{\includegraphics[scale=1.25]{tree-w1-w2-leaves.pdf}}}$$
Indeed, the formalization works without it.
Cormen et al.\ \cite[pp.~385--391]{cormen-et-al-2001} provide a very similar
proof, articulated around the following propositions:
\begin{quote}
\textsl{\textbf{Lemma 16.2}} \\
Let $C$ be an alphabet in which each character $c \in C$ has frequency $f[c]$.
Let $x$ and $y$ be two characters in $C$ having the lowest frequencies. Then
there exists an optimal prefix code for $C$ in which the codewords for $x$ and
$y$ have the same length and differ only in the last bit.
\medskip
\textsl{\textbf{Lemma 16.3}} \\
Let $C$ be a given alphabet with frequency $f[c]$ defined for each character
$c \in C$. Let $x$ and $y$ be two characters in $C$ with minimum frequency. Let
$C'$ be the alphabet $C$ with characters $x$, $y$ removed and (new) character
$z$ added, so that $C' = C - \{x, y\} \cup \{z\}$; define $f$ for $C'$ as for
$C$, except that $f[z] = f[x] + f[y]$. Let $T'$ be any tree representing an
optimal prefix code for the alphabet $C'$. Then the tree $T$, obtained from
$T'$ by replacing the leaf node for $z$ with an internal node having $x$ and
$y$ as children, represents an optimal prefix code for the alphabet $C$.
\medskip
\textsl{\textbf{Theorem 16.4}} \\
Procedure \textsc{Huffman} produces an optimal prefix code.
\end{quote}
›
subsection ‹Overview of the Formalization›
text ‹
This document presents a formalization of the proof of Huffman's algorithm
written using Isabelle/HOL \cite{nipkow-et-al-2008}. Our proof is based on the
informal proofs given by Knuth and Cormen et al. The development was done
independently of Laurent Th\'ery's Coq proof \cite{thery-2003,thery-2004},
which through its ``cover'' concept represents a considerable departure from
the textbook proof.
The development consists of a little under 100 lemmas and theorems. Most of
them have very short proofs thanks to the extensive use of simplification
rules and custom induction rules. The remaining proofs are written using the
structured proof format Isar \cite{wenzel-2008}.
›
subsection ‹Head of the Theory File›
text ‹
The Isabelle theory starts in the standard way.
\myskip
\noindent
\isacommand{theory} ‹Huffman› \\
\isacommand{imports} ‹Main› \\
\isacommand{begin}
\myskip
\noindent
We attach the ‹simp› attribute to some predefined lemmas to add them to
the default set of simplification rules.
›
declare
Int_Un_distrib [simp]
Int_Un_distrib2 [simp]
max.absorb1 [simp]
max.absorb2 [simp]
section ‹Definition of Prefix Code Trees and Forests \label{trees-and-forests}›
subsection ‹Tree Type›
text ‹
A {\sl prefix code tree\/} is a full binary tree in which leaf nodes are of the
form @{term "Leaf w a"}, where @{term a} is a symbol and @{term w} is the
frequency associated with @{term a}, and inner nodes are of the form
@{term "Node w t⇩1 t⇩2"}, where @{term t⇩1} and @{term t⇩2} are the left and
right subtrees and @{term w} caches the sum of the weights of @{term t⇩1} and
@{term t⇩2}. Prefix code trees are polymorphic on the symbol datatype~@{typ 'a}.
›