Theory Compiler
section "Compiler formalization"
theory Compiler
imports
"HOL-IMP.Big_Step"
"HOL-IMP.Star"
begin
text ‹
\null
\emph{This paper is dedicated to Gaia and Greta, my sweet nieces, who fill my life with love and
happiness.}
\null\null
After introducing the didactic imperative programming language IMP, \<^cite>‹"Nipkow21-1"› specifies
compilation of IMP commands into a lower-level language based on a stack machine, and expounds a
formal verification of that compiler. Exercise 8.4 asks the reader to adjust such proof for a new
compilation target, consisting of a machine language that (i) accesses memory locations through
their addresses instead of variable names, and (ii) maintains a stack in memory via a stack pointer
rather than relying upon a built-in stack. A natural strategy to maximize reuse of the original
proof is keeping the original language as an assembly one and splitting compilation into multiple
steps, namely a source-to-assembly step matching the original compilation process followed by an
assembly-to-machine step. In this way, proving assembly code-machine code equivalence is the only
extant task.
\<^cite>‹"Noce21"› introduces a reasoning toolbox that allows for a compiler correctness proof shorter
than the book's one, as such promising to constitute a further enhanced reference for the formal
verification of real-world compilers. This paper in turn shows that such toolbox can be reused to
accomplish the aforesaid task as well, which demonstrates that the proposed approach also promotes
proof reuse in multi-stage compiler verifications.
The formal proof development presented in this paper consists of two theory files, as follows.
▪ The former theory, briefly referred to as ``the @{text Compiler} theory'', is derived from the
@{text "HOL-IMP.Compiler"} one included in the Isabelle2021-1 distribution \<^cite>‹"Nipkow21-2"›.\\
However, the signature of function @{text bcomp} is modified in the same way as in \<^cite>‹"Noce21"›.
▪ The latter theory, briefly referred to as ``the @{text Compiler2} theory'', is derived from the
@{text Compiler2} one developed in \<^cite>‹"Noce21"›.\\
However, unlike \<^cite>‹"Noce21"›, the original language IMP is considered here, without extending it
with non-deterministic choice. Hence, the additional case pertaining to non-deterministic choice in
the proof of lemma @{text ccomp_correct} is not present any longer.
Both theory files are split into the same subsections as the respective original theories, and only
the most salient differences with respect to the original theories are commented in both of them.
For further information about the formal definitions and proofs contained in this paper, see
Isabelle documentation, particularly \<^cite>‹"Paulson21"›, \<^cite>‹"Nipkow21-3"›, \<^cite>‹"Krauss21"›,
and \<^cite>‹"Nipkow11"›.
›
subsection "List setup"
declare [[coercion_enabled]]
declare [[coercion "int :: nat ⇒ int"]]
declare [[syntax_ambiguity_warning = false]]
abbreviation (output)
"isize xs ≡ int (length xs)"
notation isize ("size")
primrec (nonexhaustive) inth :: "'a list ⇒ int ⇒ 'a" (infixl "!!" 100) where
"(x # xs) !! i = (if i = 0 then x else xs !! (i - 1))"
lemma inth_append [simp]:
"0 ≤ i ⟹
(xs @ ys) !! i = (if i < size xs then xs !! i else ys !! (i - size xs))"
by (induction xs arbitrary: i, auto simp: algebra_simps)
subsection "Instructions and stack machine"
text ‹
Here below, both the syntax and the semantics of the instruction set are defined. As a deterministic
language is considered here, as opposed to the non-deterministic one addressed in \<^cite>‹"Noce21"›,
instruction semantics can be defined via a simple non-recursive function @{text iexec} (identical to
the one used in \<^cite>‹"Nipkow21-1"›, since the instruction set is the same). However, an inductive
predicate @{text iexec_pred}, resembling the @{text iexec} one used in \<^cite>‹"Noce21"› and denoted
by the same infix symbol @{text ↦}, is also defined. Though notation @{text "(ins, cf) ↦ cf'"} is
just an alias for @{text "cf' = iexec ins cf"}, it is used in place of the latter in the definition
of predicate @{text exec1}, which formalizes single-step program execution. The reason is that the
compiler correctness proof developed in the @{text Compiler2} theory of \<^cite>‹"Noce21"› depends on
the introduction and elimination rules deriving from predicate @{text iexec}'s inductive definition.
Thus, the use of predicate @{text iexec_pred} is a trick enabling Isabelle's classical reasoner to
keep using such rules, which restricts the changes to be made to the proofs in the @{text Compiler2}
theory to those required by the change of the compilation target.
The instructions defined by type @{text instr}, which refer to memory locations via variable names,
will keep being used as an assembly language. In order to have a machine language rather referring
to memory locations via their addresses, modeled as integers, an additional type @{text m_instr} of
machine instructions, in one-to-one correspondence with assembly instructions, is introduced. The
underlying idea is to reuse the proofs that source code and compiled (assembly) code simulate each
other built in \<^cite>‹"Nipkow21-2"› and \<^cite>‹"Noce21"›, so that the only extant task is proving
that assembly code and machine code in turn simulate each other. This is nothing but an application
of the \emph{divide et impera} strategy of considering multiple compilation stages mentioned in
\<^cite>‹"Nipkow21-1"›, section 8.5.
In other words, the solution developed in what follows does not require any change to the original
compiler completeness and correctness proofs. This result is achieved by splitting compilation into
multiple steps, namely a source-to-assembly step matching the original compilation process, to which
the aforesaid proofs still apply, followed by an assembly-to-machine step. In this way, to establish
source code-machine code equivalence, the assembly code-machine code one is all that is left to be
proven. In addition to proof reuse, this approach provides the following further advantages.
▪ There is no need to reason about the composition and decomposition of machine code sequences,
which would also involve the composition and decomposition of the respective mappings between
used variables and their addresses (as opposed to what happens with assembly code sequences).
▪ There is no need to change the original compilation functions, modeling the source-to-assembly
compilation step in the current context. In fact, the outputs of these functions are assembly
programs, namely lists of assembly instructions, which are in one-to-one correspondence with
machine ones. Thus, the assembly-to-machine compilation step can easily be modeled as a mapping
of such a list into a machine instruction one, where each referenced variable can be assigned an
unambiguous address based on the position of the first/last instruction referencing it within
the assembly program.
\null
›