Generating a Threaded ARM Interpreter with Templates in C++

I wrote this many years ago but for whatever reason I didn’t publish it. My old C++ article on generating switch statements still seems reasonably popular so perhaps someone will find use in this (for reference when I wrote this I was working off the ARMv5 manual from arm and the 946E-S info from I also realized wordpress mangled this post so I’ve also brushed it up in markdown and converted the post to literate C++ (grab the code here) extractable with sed via.

$ sed -n '/~~~~~{.cpp/,/~~~~~/p' |grep -v '~~~~~' > post.cpp

Recently I started getting into efficient interpreters and binary translation/JIT compilation. Although I’ve long been interested in the more abstract corners of PL and have written a few native compilers I’ve never bothered to learn about current day optimization. Given that I always learn more when I have something to hack on I decided to implement an ARM interpreter, but first…

I figure a good starting point is compilation of papers so I’ll share a few I liked. Again, I’m not in a position to make suggestions, but I thoroughly enjoyed "An Architectural Trail to Threaded-Code Systems" [3] by Kogge. After that take a look at Zaleski’s work on context threading and mixed mode execution [2] [5] . For low level architectural areas "QEMU, a Fast and Portable Dynamic Translator" [1] and "Virtualization Without Direct Execution or Jitting: Designing a Portable Virtual Machine Infrastructure" [4] offer great ideas.

Direct threading is a simple technique whereby instructions are directly "threaded" together using indirect jumps rather than say, a loop with a switch at the top. I find it easier to understand with some example code so I’ve implemented a simple stack machine—winning no awards for cleanliness here

#include <stdlib.h>
#include <stdio.h>

int main() {
        // We're using the GNU C, labels-as-values extension here.
        void *prog[] = {&&PUSH,(void*)6,
        void **vPC = prog;
        void *stack[4], **sp = stack;

        goto **vPC++;
        PUSH:  *sp++ = *vPC++; goto **vPC++;
        MUL:   *(sp-1) = (void*)(*(long*)(sp-1) * *(long*)(sp-2));
               goto **vPC++;;
        PRINT: printf("%li\n", *(long*)sp--); goto **vPC++;;
        HALT:  exit(0);

        return 0;

We have a virtual program counter called vPC that points to the next instruction to interpret within the array prog. In turn, prog stores the sequence of instructions and their operands. This is a stack machine so we have a stack, stack and a stack pointer sp. The big thing to notice here is that we have no loop fetching, decoding, and dispatching instructions. Instead, each instruction body finishes execution by jumping to the address of the next instruction tracked by vPC.

How does this translate to ARM instructions? Clearly because they won’t match up with label addresses like our tiny stack machine above, decode must involve some kind of lookup, preferably into a jump table.

For those unfamiliar every1 ARM instruction is 32 bits wide: how do they map? Every instruction can be conditionally executed (a la setcc) and this condition code is stored in the high nibble, the next nibble serves primarily to identify instructions, we then tend to find opcode selection, and the low 20 bits are mostly reserved for operands. This is of course a rough explanation so I’ve diagrammed the multiply and data processing instruction families

28 24 20 16 12 8 7 4 0
multiply Cond 0 0 0 0 0 0 A S Rd Rn Rs 1 0 0 1 Rm
dataproc Cond 0 0 I Opcode S Rn Rd Operand2

How can we make this into a jump table? While it’s not hard to spot ambiguity in bits 27-20 (consider all zeros in both instructions), the 12 bits 27-20 and 7-4 turn out to be enough to determine every operation. What then, will a sketched implementation of an ARM interpreter look like?

static int decode(int i)
{ return i >> 16 & 0xff0 | i >> 4 & 0xf; }
int main() {
    int32_t prog[] = { i1, i2, ..., iHalt };
    int32_t *vPC = prog;
    void *jmptbl[4096] = { /* instruction addresses */ };

    goto *jmptbl[decode(*vPC++)];
    /* instruction bodies */

Okay, but 4,000 entries? How can we exploit the clustering inherent in instruction families while retaining the possibility of static specialization?

Realistically we only have one choice: templates—we’ll classify instructions using the same bit patterns we would at runtime and then use the preprocessor to enumerate all 4,096 possibilities. Before proceeding notice that actually bits 7-4 of Operand2 and "1001" appear to collide in the above diagrams. If you take a look at the ARM documentation you’ll find that when bit "I" is 0, bit 7 must be 0 too (see A5.1.1 in DDI 0100I), providing resolution; however this must be made explicit to avoid template ambiguity. Luckily we can leverage SFINAE to express this fact, yielding the following template signature

#include <stdlib.h>
#include <iostream>
#include <boost/preprocessor/repeat.hpp>
#include <boost/utility/enable_if.hpp>

struct undefined_instruction {
        static void exec() {
                std::cout << "undefined instruction" << std::endl;
                // raise processor exception

template <int B27,int B26,int B25,int B24,int B23,int B22,int B21,int B20,
          int B7,int B6,int B5,int B4, typename Enable = void>
// Anything not explicitly implemented is an undefined instruction
struct instruction_pattern : undefined_instruction {};

Now we’re ready for a few example instructions (in DDI 0100I see A3.16.2 for multiply, and A3.4.1 for data processing instruction families)

// Multiply
template <int B23,int B22,int B21,int B20>
struct instruction_pattern<0,0,0,0,B23,B22,B21,B20,1,0,0,1> {
        static void exec() {
                std::cout << "mul" << std::endl;

// Data Processing
template <int B24,int B23,int B22,int B21>
struct dataproc : undefined_instruction {};

template <>
struct dataproc<0,0,0,0> {
      static void exec() {
              std::cout << "and" << std::endl;

template <int I,int B24,int B23,int B22,int B21,int S,int B7,int B6,int B5,int B4>
struct instruction_pattern<0,0,I,B24,B23,B22,B21,S,B7,B6,B5,B4,
                           typename boost::disable_if_c<I == 0 && B7 != 0>::type>
  : dataproc<B24,B23,B22,B21> {};
/* more definitions... */

While closer to a solution we still have the problem that a straight line enumeration of integers 1,2,3,… must be passed into our instruction_patterns as sequences of 12 bits. No tricks involved here, we first use a template, to_bits, to convert a value to a sequence of n bits and then we use a second template, with_bin, to apply a third template, f, to this bit sequence (we avoid variadic templates at the minor cost of generality here)

template <int NumberOfBitsInOutputStream, int N>
struct to_bits {
    static const int value = N % 2;
    typedef to_bits<NumberOfBitsInOutputStream - 1, N/2> s;
template <int N> struct to_bits<0,N> { static const int value = N % 2; };

template <int m,
          template <int,int,int,int,int,int,int,int,int,int,int,int,class> class f>
struct with_bin {
    typedef to_bits<12,m> bits;
    typedef f<bits::s::s::s::s::s::s::s::s::s::s::s::value,
             > type;

This implementation, similar to the threaded interpreter we started with will use label addresses as entries in the jump table so we have to generate (a) an address for each call to exec() with an associated indirect jump to the next instruction, and (b) the actual jump table. The boost preprocessor library is used for this step and because loops are limited to 255 iterations we have to make a recursive call rather than the preferable BOOST_PP_REPEAT(4096, OP_ENTRY, 0)

#define STEP() goto *jmptbl[decode(*vPC++)]

// This is part for (a)
#define OP_ENTRY(z, m, n) \
  op_##n##_##m: with_bin<m+n*32, instruction_pattern>::type::exec();\
#define OP_ENTRY_R(z, n, __) BOOST_PP_REPEAT(32,  OP_ENTRY,   n)
#define EMIT_OPS()           BOOST_PP_REPEAT(128, OP_ENTRY_R, 0)

// This is part for (b)
#define JMP_ENTRY(z, m, n)    &&op_##n##_##m,
#define JMP_ENTRY_R(z, n, __) BOOST_PP_REPEAT(32,  JMP_ENTRY,   n)
#define EMIT_JMPTBL()         BOOST_PP_REPEAT(128, JMP_ENTRY_R, 0)

Cool, our implementation is in a testable state, what does main look like?

// We're implementing halt here as supervisor-call-0
template <int B23,int B22,int B21,int B20, int B7,int B6,int B5,int B4>
struct instruction_pattern<1,1,1,1,B23,B22,B21,B20,B7,B6,B5,B4> {
        static void exec() {
                std::cout << "halting";

int main() {
    //                 mul r0,r1,r2, and r0,r1,r0, svc 0
    uint32_t prog[] = { 0xe0000291,   0xe0010000,   0xef000000 };
    uint32_t *vPC = prog;
    void *jmptbl[4096] = { EMIT_JMPTBL() };


You might want to grab some coffee while this is compiling…

$ g++ -O2 -Wa,-ahl=main.s -o main main.cpp

$ ./main

While this technique works and lends itself well to further optimization via context threading (also with gcc forced inline/noinline is easy to control) I’m curious to see how it actually performs on real, multi-megabyte binaries. We should also question whether things have actually been simplified, I personally expect to have a stronger position on this once I make it more of a simulator and less a (bad) disassembler. Furthermore not only is the compile time bad—gcc hits 40s (actually if you compile the code posted without implementing any other opcodes it takes 3s) on my i5-450m and clang++ 3.0 lost my interest 20 minutes in—but the code produced is gigantic. A simpler mechanism relying more on runtime disambiguation will presumably have a smaller footprint—although I could be wrong here too—and I’m curious how this will interact with cache. While I maintain skepticism I plan on using vtune to try out dispatch techniques and continue to play around with compile-time/runtime decode so hopefully I’ll have interesting results for discussion soon.

[1] Bellard, F. 2005. QEMU, a fast and portable dynamic translator. Proceedings of the FREENIX track: 2005 USENIX annual technical conference, april 10-15, 2005, anaheim, ca, USA (2005), 41–46.

[2] Berndl, M. et al. 2005. Context threading: A flexible and efficient dispatch technique for virtual machine interpreters. 3nd IEEE / ACM international symposium on code generation and optimization (CGO 2005), 20-23 march 2005, san jose, ca, USA (2005), 15–26.

[3] Kogge, P.M. 1982. An architectural trail to threaded-code systems. Computer. 15, 3 (Mar. 1982), 22–32. DOI:

[4] Mihocka, D. and Shwartsman, S. 2008. Virtualization without direct execution or jitting: Designing a portable virtual machine infrastructure. 1st workshop on architectural and microarchitectural support for binary translation in isca-35, beijing (2008), 32.

[5] Zaleski, M. et al. 2005. Mixed mode execution with context threading. Proceedings of the 2005 conference of the centre for advanced studies on collaborative research, october 17-20, 2005, toronto, ontario, canada (2005), 305–319.

  1. Yeah, thumb.


Writing an LLVM compiler using continuation passing

Continuation passing compilers have a strong history and Appel’s book is a must read for anyone interested in the area.

Despite the large body of existing work on CPS compilers, I wanted to give an introduction to writing a simple compile function on first order expressions using Idris. My aim is to give just enough information about the subject to support future posts—we won’t bother to make much use of the flexibility compiling with continuations grants.

This post is written as literate Idris (grab the code here) and passes totality checking. Syntax highlighted pieces of code are the extractable bits; unhighlighted code is merely illustrative and probably won’t compile.

To extract the code, try

$ sed -n '/~~~~{.hask/,/~~~~/p' |grep -v '~~~~' > compiling.idr

or for the final LLVM program

$ sed -n '/~~~~{.llvm/,/~~~~/p' |grep -v '~~~~' > a.ll

Compiling a language of addition

Consider a simple language of addition on integers. The language obviously needs an Add term which should take two add-language-sub-expressions; the language will also need support for integer literals/immediates and variables.

namespace Exp
  data AddExp : Type -> Type where
    Var : var -> AddExp var
    Lit : Int -> AddExp var
    Add : AddExp var -> AddExp var -> AddExp var

The goal is to compile add expressions to LLVM assembly. Intuitively this machine will have three instructions (add/print/halt) which all operate on values (i.e. don’t accept complicated expressions as arguments). A value in this case is either an integer literal/immediate (e.g. 2, 3, etc.) or a register (e.g. %x0). The type of values is readily expressed in Idris

data Value : (var : Type) -> Type where
  Var : var -> Value var
  Lit : Int -> Value var

We take registers to be of abstract type var allowing a choice of representation such as strings: e.g. “%tmp0”, “%tmp1”, etc.; a counter: 1, 2, 3 which map to %1, %2, %3, etc.; or any other preferred representation such as the use of real Idris variables.

Cobbling together the type of instructions in Idris requires just three constructors:

  1. Add: add two source values and produce a result in a destination register;
  2. Print: print any value; and
  3. Halt: exit the program.

Although these are broadly sufficient, let’s start with an obvious but problematic encoding of instructions to motivate our ultimate solution

data LlvmInstr : Type -> Type where
  Print1 : Value var -> LlvmInstr var
  Halt1 : LlvmInstr var
  Add1 : var -> Value var -> Value var -> LlvmInstr var

Of course a program is a sequence of instructions so as a necessary extension of LlvmInstr we should consider the type of programs LlvmProgram. Naturally we model programs as lists of instructions which are to be executed in order. This intuitively captures the flat, linear way that assembly programs are written down:

LlvmProgram : Type -> Type
LlvmProgram var = List (LlvmInstr var)

Unfortunately however, the naive approach to compiling addition expressions to a list of llvm instructions will fail. To understand why, consider the difficulty of compiling naked values directly to an assembly instruction

compile1 : AddExp String -> LlvmProgram String
compile1 (Var x) = ?compileVar
compile1 (Lit z) = ?compileLit
compile1 (Add e1 e2) = compile e1
                    ++ compile e2
                    ++ ?addE1E2 {- use results of compile e1/e2 in Add1 instr? -}

There are two obviously identifiable problems with the above

  1. as noted it’s unclear how to compile variables and literals; and
  2. extracting the results of e1 and e2 to add together is unobvious (additionally, we should ask where this result could be placed).

Variables and int values don’t have a natural representation in assembly because values aren’t valid instructions; contrast this with the expression language: values are valid expressions. A naked value in an expression is simply returned, e.g. the expression “2” simply evaluates to the result “2”; however even if we had some return instruction Ret, the compiled program compile (Lit z) = [Ret z] will be wrong. Compiling values to return statements means programs such as Add (Lit 2) (Lit 3) produce

compile1 (Add (Lit 2) (Lit 3))
   = compile1 (Lit 2)
  ++ compile1 (Lit 3)
   = [Ret 2]
  ++ [Ret 3]
   = [Ret 2, Ret 3, ...]

Which immediately returns without performing any addition! What about our second problem: where do the intermediate results of compile e1 and compile e2 go? Where is addition of those two results placed?

A simple solution to both of the above problems is for the compile function to produce not only an LlvmProgram, but additionally a value or register which is where the result of the list of instructions was placed. You probably saw this coming if you’ve written a toy compiler with an explicit flattening stage taking e.g. (x + y) + 2 to tmp1 = x + y and tmp2 = tmp1 + 2. In the case of flattening, tmp1 would need to be bubbled up from the flattening of sub-tree x+y to be used when processing the right sub-tree 2. In Idris a naive attempt at this will fail because we need to generate variable names (like tmp1, tmp2) and have no means of doing so

compile2 : AddExp String -> (String, LlvmProgram)
compile2 (Var x) = (x, [])
compile2 (Lit z) = (?litResult, [])
compile2 (Add e1 e2) = let (r1, is1) = compile e1 in
                       let (r2, is2) = compile e2 in
                       is1 ++ is2 ++ [Add1 ?res (Var r1) (Var r2)]

Annoyingly this generation of fresh names (tmp1, tmp2) for intermediate results requires state. Even a simple counter will fail unless the counter’s value is carefully threaded up and down the recursion tere. If this isn’t apparent consider that we can’t know a priori how many values will be required by a given sub-tree so we can’t properly update the counter when we recurse into the next sub-tree (e.g. what is the value of the counter after recursing into the left sub-tree of unknown size?). The standard functional solution to state threading would introduce a triplet of (Int, String, LlvmProgram) instead of the current pair (String, LlvmProgram). Of course using a state monad might improve the situation but under the hood that’s just continuation based!

The use of continuations begs the question: what do continuations provide other than state threading?

Continuations provide the introduction of fresh variable names (name binding, \tmp1 => ...) which is precisely the problem we’re trying to solve. Instead of e.g. generating some string name “tmp1” we can ask the compiler for a name by using a lambda \tmp1 => func tmp1. Of course this name can never escape above its scope so we can only push information down, not bubble it up (with e.g. a pair (tmp1, [])). This should inform our strategy: don’t use explicit intermediate names such as a concrete destination register in the Add1 instruction. Instead the instruction-list structure of a program should be folded into the LlvmInstr type directly, allowing for intermediate results to be directly passed as a bound variable to the remaining instruction-list/program.

Let’s replace the LlvmInstr type with a type of programs: Llvm. The intermediate name used by the Add1 instruction will be replaced with a continuation from the resulting sum, to the program’s tail.

data Llvm : Type -> Type where
  Add : (v : Value var) -> (w : Value var) -> (program : (z : var) -> Llvm var) -> Llvm var
  Print : Value var -> Llvm var -> Llvm var
  Halt : Llvm var

The Print and Halt statements haven’t changed much. Rather than a program being a list of a instructions we directly embed the list structure into the Llvm type just as promised e.g. Print takes a value to print along with the rest of the program (effectively a list tail) prog : Llvm var. The Add instruction has a continuation from its intermediate result z to the program tail.

Although Add is slightly strange, it’s intuitively read as: add takes two values v, and w, and produces a result z = v+w which is used in the proceeding list of instructions i.e. the rest of the program.

Lets consider a simple transformation from explicit destination registers to the continuation encoding to clarify the technique

  Add1 "%tmp1" (Lit 2) (Lit 3) ::
  Add1 "%tmp2" (Var "%tmp1") (Lit 4) ::
  Add1 "%tmp3" (Var "%tmp1") (Var "%tmp2") ::
  Add (Lit 2) (Lit 3) (\tmp1 =>
  Add (Var tmp1) (Lit 4) (\tmp2 =>
  Add (Var tmp1) (Var tmp2) (\tmp3

Writing compile is trivial using continuations. At each step of compilation, the expression being compiled produces some intermediate result value tmp which is fed into the compilation of the rest of the program

compile : AddExp var -> ((tmp : Value var) -> Llvm var) -> Llvm var
compile (Var x) compileProg = compileProg (Var x)
compile (Lit z) compileProg = compileProg (Lit z)
compile (Add e1 e2) compileProg =
  compile e1 (\tmp1 =>
  compile e2 (\tmp2 =>
  Add tmp1 tmp2 (\result =>
  compileProg (Var result))))

Of course at the end of the day we’re producing a string/file to feed into llvm so the need to generate names can’t ultimately be escaped. Fortunately since we’ve eliminated the tree structure and are left with a linear stream of instructions we no longer need state to track used names. In the case of LLVM it’s particularly easy because a counter can be used to directly generate names such as “%1”. Let’s make a pretty print function so we can compile and run our programs

llvmValue : Value String -> String
llvmValue (Var x) = x
llvmValue (Lit z) = show z

prettyLlvm : Int -> Llvm String -> String
prettyLlvm name (Add v w prog) =
  let result = "%" ++ show name
  in result ++ " = add nsw i32 " ++ llvmValue v ++ ", " ++ llvmValue w ++ "\n" ++
     prettyLlvm (name+1) (prog result) 
prettyLlvm name (Print z prog) =
  "call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @int_format, i64 0, i64 0), i32 " ++ llvmValue z ++ ")\n" ++
  prettyLlvm name prog
prettyLlvm name Halt = "call void @exit(i32 0)\n" ++

We can try this out on a simple program

prog0 : AddExp var
prog0 = Lit 3 `Add` (Lit 2 `Add` Lit 4)

llvm0 : Llvm var
llvm0 = compile prog0 (\result => Print result Halt)

renderedLlvm0 : String
renderedLlvm0 = prettyLlvm 1 llvm0
*m> renderedLlvm0
"%0 = add nsw i32 2, 4\n%1 = add nsw i32 3, %0\ncall i32 (i8*, ...) @printf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @int_format, i64 0, i64 0), i32 %1call void @exit(i32 0)\nunreachable" : String

However this won’t quite compile because the functions printf and exit are undeclared and there’s no main function wrapping the program. These additions are a matter of missing header and footer:

llvmHeader : String
llvmHeader = "@int_format = private constant [3 x i8] c\"%i\\00\"\n" ++
             "declare i32 @printf(i8* nocapture readonly, ...)\n" ++
             "declare void @exit(i32)\n\n"

llvmMain : String -> String
llvmMain prog = "define i32 @main() {\n" ++
                prog ++

That’s everything needed to run the program renderedLlvm0, so let’s test it out.

*m> llvmHeader ++ llvmMain renderedLlvm0
"@int_format = private constant [3 x i8] c\"%i\\00\"\ndeclare i32 @printf(i8* nocapture readonly, ...)\ndeclare void @exit(i32)\n\ndefine i32 @main() {\n%1 = add nsw i32 2, 4\n%2 = add nsw i32 3, %1\ncall i32 (i8*, ...) @printf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @int_format, i64 0, i64 0), i32 %2)\ncall void @exit(i32 0)\nunreachable\n}" : String

Which pretty prints as

@int_format = private constant [3 x i8] c"%i\00"
declare i32 @printf(i8* nocapture readonly, ...)
declare void @exit(i32)

define i32 @main() {
%1 = add nsw i32 2, 4
%2 = add nsw i32 3, %1
call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @int_format, i64 0, i64 0), i32 %2)
call void @exit(i32 0)

Placing this in a file a.ll, compiling and running gives the expected result

$ clang a.ll -o a
$ ./a

As a bonus exemplifying the use of abstract register type var in Llvm here’s an interpreter for the assembly language

evalVal : Value Int -> Int
evalVal (Var z) = z
evalVal (Lit z) = z

eval : Llvm Int -> IO Unit
eval (Add x y prog) = eval (prog (evalVal x + evalVal y))
eval (Print x prog) = do print (evalVal x) 
                         eval prog
eval Halt = pure ()
*m> eval llvm0
io_bind (io_bind (prim_write "9") (\__bindx => io_pure ()))
        (\__bindx => io_pure ()) : IO ()

Generating Switch Statements in C++

If you prefer to skip the english you can find a copy of the source in this post here

Recently I was refactoring some C++ and found myself needing to dispatch one of several calls via a template index. Of course while this is simple when we’re dealing with a static index—this is not so simple if the index is discovered at runtime because the only way to get a static copy is to inspect each possible value. My first intuition was to create a function table because it’s quick to implement and reflects a sort of runtime variant of a template over integers. However I wanted to give the optimizer a chance to inline each specialized call because the majority were a mere single expression and in a performance sensitive section. Since an indirect call has (rather obviously) no hope of being inlined I was left with no other choice than to make each call and its selection explicit. While possible with a recursive chain of if-else statements which has the bonus of an appealing simplicity, I didn’t expect the optimizer to do much with such an approach so I decided on a switch instead (I did actually check and though hopeful, gcc emitted a cascade of conditional jumps—I tried lambdas too which were significantly worse).

The first thing to ask is what a “switch call” will look like.
If we were designing a simple interpreter, we might switch on the operation as in the following

int run_operation(const int operation, const bool printResult,
                  const int x, const int y) {
    int result;
    switch (operation) {
    case ADD: result = x + y; break;
    case SUB: result = x - y; break;
    case MUL: result = x * y; break;
    case DIV: result = x / y; break;
    case MOD: result = x % y; break;
    default: assert(false);
    if (printResult) printf("%i\n", result);
    return result;

This is pretty suggestive of what a switch_ metafunction should look like. It will of course accept a template (my real motivation) which provides the mapping of case numbers to case functions; it should accept a value to switch on; and it should forward any other arguments onto the case bodies. Rewriting run_operation with such a function should look something like this

return switch_<operation_table>::run(operation, printResult, x, y);

Here the template argument int is the return type, the first argument to run is the value to switch on, and the remaining arguments are to be passed into the case’s body.
What will the template operation_table look like? We’d decided it should be a map of case numbers to functions so similar to a real switch we’ll use a (static) sequence of pairs, each with an integral value to match on and a body to be executed. This is enough information for a first stab at the signature of switch_ and its static function run

template <typename Xs>
struct switch_ {
    static R run(int n, zero or more variable parameters);

An example of Xs (the sequence of pairs) is

template <int N>
struct operation_table {
    static int value(const bool printResult, const int x, const int y) {

switch_	<list<
  pair<int_<0>, operation_table<ADD> >,
  pair<int_<1>, operation_table<SUB> >,
  pair<int_<2>, operation_table<MUL> >,
  pair<int_<3>, operation_table<DIV> >,
  pair<int_<4>, operation_table<MOD> >
> >::run(operation, printResult, x, y);

Since you may at this point be asking what we’ve gained, let’s start by providing a convenience function offering some concision

template  <int Lower, int Upper, template <int> class Table>
struct switch_table;

This at least simplifies things

switch_table<0, 4, operation_table>::run(operation, printResult, x, y)

Having defined what we’d like a generated switch to looks like, how do we actually go about implementing it?
A good starting place is in the generation of cases. Notice however that we can’t produce the cases for the switch using templates so we’re forced to fall back to the preprocessor, and while the preprocessor can send information to our templates, the templates obviously can’t send information back to the preprocessor. This is a problem because as everything is currently phrased, the preprocessor doesn’t know the number of cases to emit—the template does. To clarify, in the case of switch_table this is via the first two template arguments, Lower and Upper; and in the case of switch_ this is via the length of Xs.
To circumvent this problem we’ll create a family of metafunctions parametrized by the number of cases they include, template switch_impl, allowing the preprocessor to pass into the template the number of cases it intends to emit. The user exposed metafunction switch_ will then call switch_impl.

#ifndef _metalevel_switch_h_
#define _metalevel_switch_h_

#include <boost/mpl/at.hpp>
#include <boost/mpl/if.hpp>
#include <boost/mpl/list.hpp>
#include <boost/mpl/pair.hpp>
#include <boost/mpl/push_front.hpp>
#include <boost/mpl/size.hpp>

#include <boost/preprocessor/iterate.hpp>
#include <boost/preprocessor/repetition/enum_params.hpp>
#include <boost/preprocessor/repetition/enum_binary_params.hpp>
#include <boost/preprocessor/repetition/repeat.hpp>

#ifdef __GNUC__
#  define always_inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define always_inline __forceinline
#  define always_inline inline



namespace metalevel {

namespace aux {
    template <int NumberOfCases, typename Xs>
    struct switch_impl;
#   define BOOST_PP_FILENAME_1 <switch_impl.hpp>
#   include BOOST_PP_ITERATE()
} // namespace aux

template <typename Xs>
struct switch_
  : aux::switch_impl<boost::mpl::size::type::value - 1, Xs> {};
} // namespace metalevel


So all we’ve really accomplished is delegating most of the work to switch_impl.hpp using boost preprocessor’s file iteration (if you’re unfamiliar, the preprocessor docs found here are good). The above iteratively includes switch_impl.hpp METALEVEL_MAX_SWITCH_SIZE times, each time defining the pass number as a sort of argument to the header.
Before proceeding let’s quickly reiterate what’s going on here: switch_impl now handles switches with every possible number of cases (up to METALEVEL_MAX_SWITCH_SIZE). Then switch_ merely selects the correct number based on the size of its input list. For example switch_impl implements run with case 0 and case 1 while switch_impl implements run with case 0, case 1, case 2, case 3, case 4, case 5, and case 6.
So what does switch_impl.hpp look like?



template <typename Xs>
struct switch_impl<NUMBER_OF_CASES, Xs> {
#   define TYPEDEF_X_SUB_I(_, i, __) \
      typedef typename boost::mpl::at_c<Xs, i>::type x_##i;

#   define METALEVEL_CASE_TABLE_ENTRY(_, i, ts) \
      case boost::mpl::first<x_##i>::type::value: \
        return boost::mpl::second<x_##i>::type::value ts ;

#   define DEFINE_CHOICE_RUN(_, argc, __) \
      template <typename R BOOST_PP_COMMA_IF(argc) \
                BOOST_PP_ENUM_PARAMS(argc, typename T)> \
      static always_inline R \
      run(int n BOOST_PP_COMMA_IF(argc) \
          BOOST_PP_ENUM_BINARY_PARAMS(argc, T, t)) { \
          switch(n) { \
                          METALEVEL_CASE_TABLE_ENTRY, \
                          (BOOST_PP_ENUM_PARAMS(argc, t))) \
          } \
                    DEFINE_CHOICE_RUN, ~)




While a bit obscured by preprocessor iteration, nothing too interesting is actually going on here. Writing out a simple example really illuminates how this works so lets consider a copy of switch_impl when NumberOfCases is 3.

template <>
struct switch_impl<3, Xs> {
    typedef boost::mpl::at_c<Xs, 0>::type x_0;
    typedef boost::mpl::at_c<Xs, 1>::type x_1;
    typedef boost::mpl::at_c<Xs, 2>::type x_2;

    static always_inline R run(int n, T0 t0, T1 t1, T2 t2) {
        switch (n) {
        case boost::mpl::first<x_0>::type::value:
            return boost::mpl::second<x_0>::type::value(t0, t1, t2);
        case boost::mpl::first<x_1>::type::value:
            return boost::mpl::second<x_1>::type::value(t0, t1, t2);
        case boost::mpl::first<x_2>::type::value:
            return boost::mpl::second<x_2>::type::value(t0, t1, t2);

We’re first calling upon the preprocessor to define several variables named x_i that are the ith types in the list Xs, if this is unclear they’re really just terms of the form Xs[i]. Continuing, we use the preprocessor to endow run with variadicness and rather than overloading run by hand we create a macro to define a copy of it taking argc arguments. Then we repeatedly call the macro, varying argc. Each copy of run generates a switch with NUMBER_OF_CASES cases, each of which in turn is generated by METALEVEL_CASE_TABLE_ENTRY. Recalling that the list Xs is a sequence of pairs, this macro simply takes the first item of the ith pair to be the case number, and the second item (which is a function) to be the case body.

At this point we’re basically ready to test things out, all that’s left to define is the convenience function switch_table. The idea here is to enumerate each item, building up a list of pairs to pass into switch_. This takes two steps because we need an extra parameter that tells us when to stop recursing, here we call that parameter Counter

namespace metalevel {
namespace aux {
    template <int Counter, int N, template <int> class Table>
    struct tabulate
      : boost::mpl::push_front<
          tabulate<Counter-1, N-1, Table>,
          boost::mpl::pair<boost::mpl::int_, Table >
        >::type {};

    template <int N, template <int> class Table>
    struct tabulate<0, N, Table>
      : boost::mpl::list<
          boost::mpl::pair<boost::mpl::int_, Table >
        > {};
} // namespace aux

template  class Table>
struct switch_table
  : switch_<aux::tabulate > {};
} // namespace metalevel

Admittedly a bit verbose but overall nothing terribly strange.

Of course we should check that all this trouble actually gets us the assembly we’ve hoped for. The test case we’ll use is just an extension of the interpreter fragment we started with into a full fledged program. For comparison we’ll also include an array of function pointers.

#include <stdio.h>
#include <switch.hpp>

enum { ADD = 0, SUB = 1, MUL = 2, DIV = 3, MOD = 4 };

template <int Operation>
struct interpreter {
    static inline int run(const int x, const int y);

template <> struct interpreter<ADD> {
    static const char *name() { return "ADD"; }
    static inline int run(const int x, const int y)
    { return x + y; }

template <> struct interpreter<SUB> {
    static const char *name() { return "SUB"; }
    static inline int run(const int x, const int y)
    { return x - y; }

template <> struct interpreter<MUL> {
    static const char *name() { return "MUL"; }
    static inline int run(const int x, const int y)
    { return x * y; }

template <> struct interpreter<DIV> {
    static const char *name() { return "DIV"; }
    static inline int run(const int x, const int y)
    { return x / y; }

template <> struct interpreter<MOD> {
    static const char *name() { return "MOD"; }
    static inline int run(const int x, const int y)
    { return x % y; }

template <int Operation>
struct operation_table {
    static always_inline int value(const bool printResult,
                                   const int x, const int y) {
        const int r = interpreter<Operation>::run(x, y);
        if (printResult) printf("operation_table<%s>::run(%i, %i) = %i\n",
                                 interpreter<Operation>::name(), x, y, r);
        return r;

int main() {
    int(*operations[])(const bool,const int,const int) = {
        operation_table<ADD>::value, operation_table<SUB>::value,
        operation_table<MUL>::value, operation_table<DIV>::value,

    bool printResult = true;
    int op,x,y;

    printf("op,x,y: ");
    scanf("%i,%i,%i", &op, &x, &y);

    metalevel::switch_table<0,4, operation_table>::run<int>(op,printResult,x,y);

    return 0;

Since printResult is constant we expect the optimizer to propagate it through the inlined calls and since op, x, and y are dynamic we can be sure switch_table won’t get crushed into a single call to printf.

Lets first look at the result of a call to operations[op](printResult,x,y),

  movslq  52(%rsp), %rax  # load op
  movl    $1, %edi        # set flag to true
  callq   *(%rsp,%rax,8)  # jump to operations[op]

Proceeding with an example function body, consider the case when op = MUL

  .string "operation_table<%s>::run(%i, %i) = %i\n"
  .string "MUL"
  pushq   %rbx
  movl    %esi, %ebx   # load x
  imull   %edx, %ebx   # take x*y
  testb   %dil, %dil   # check if we should printResult
  je      .L7          # jump over print if printResult is zero
  movl    %edx, %ecx   # load y for printing
  movl    %ebx, %r8d   # load x*y for printing
  movl    %esi, %edx   # load x for printing
  movl    $.LC1, %edi  # load format string
  movl    $.LC3, %esi  # load operation name for printing
  xorl    %eax, %eax   # no vector registers used in vararg call
  call    printf
  movl    %ebx, %eax   # return x*y
  popq    %rbx

This matches our expectation that gcc has no real opportunity for contextual optimization.

Moving on to the switch implementation, what does a call to run give us?

  cmpl    $4, 52(%rsp)     # check that op is within case bounds
  ...                      # load x and y for both switch_ and operations[]
  ja      .L13             # jump to default case if op's outside bounds
  mov     52(%rsp), %eax   # load op for jump table
  jmp     *.L19(,%rax,8)   # jump to case body of op
  .quad   .L14             # address of inlined add
  .quad   .L15             # address of inlined sub
  .quad   .L16             # address of inlined mul
  .quad   .L17             # address of inlined div
  .quad   .L18             # address of inlined mod

Which is fairly similar to the array implementation so lets compare the inlined function body when op = MUL

  .string "operation_table<%s>::run(%i, %i) = %i\n"
  .string "MUL"
  movl    %ecx, %r8d       # load y
  movl    %esi, %edx       # load x for printing
  imull   %esi, %r8d       # load x*y for printing
  movl    $.LC3, %esi      # load "MUL" string
  jmp     .L20             # jump to shared body printing the result
  movl    $.LC1, %edi      # load format string
  xorl    %eax, %eax       # no vector registers used in vararg call
  call    printf
  ...                      # run operations[op](printResult,x,y)

Unsurprisingly gcc has done a good job of condensing things and removed the check of printResult entirely.

Now, why would you ever prefer this over a handwritten switch, typical OO polymorphism, or anything else?
In answer to the first question I think this allows a more table driven approach to your standard switch which seems to be an increasingly popular way of solving problems. Additionally if you hadn’t already split your cases into modular components this will force you to. Of course these sorts of questions are always a matter of taste.
Addressing the second and third questions together: not that often. However I do think the utility while niche, is indispensable in generative programming. As a simple example if you have a statically defined map that you’d like to use with a runtime value, a switch may offer good optimization opportunities. Other issues which we addressed above relate to inlining and static extensibility. Ultimately of course if our only goal is performance we should save ourselves embarrassment and just measure things.

A-99 Lists 5 (*) Reverse a list

Reversing a list in a language like Haskell is fairly simple with plenty of solutions. The recursive algorithm we’ll use accumulates the result in a second list as the first is being traversed.

reverse :: [a] -> [a]
reverse xs = reverse' xs []
  where reverse' (x : xs) rxs = reverse' xs (x : rxs)
        reverse' []       rxs = rxs

However, using our Vector type in Agda requires we name the size of the two input lists as well as the output, each of which will clearly be changing every iteration. When looking for the invariant notice that

  1. the length of the output should be a function of the two inputs and
  2. that we’re really required to allow the input lists to range over all the natural numbers.

You’ve probably spotted it already but the sum of the two inputs will always be the size of the original list xs.

Unfortunately a simple transcription of the above Haskell function with the correct lengths still won’t type check for a rather infuriating reason surrounding how simplification of addition works.

reverse : ∀{n}{A : Set} → Vec A n → Vec A n
reverse xs = reverse' xs [] -- problem 1: n + 0 != n
  where reverse' : ∀{m n}{A : Set} → Vec A n → Vec A m → Vec A (n + m)
        reverse' (x  xs) rxs = reverse' xs (x  rxs) -- problem 2: n + suc m != suc (n + m)
        reverse' [] rxs = rxs

Obviously both problems should have solutions given that in fact the sum of n and 0 is n, and that addition is commutative. Unfortunately if you dig out the definition of addition (on my system /usr/share/Agda-2.2.10/lib-0.5/src/Data/Nat.agda) you’ll find the following.

_+_ : 
zero + n = n
suc m + n = suc (m + n)

So agda can see that 0 + n = n and that suc m + n = suc (m + n), but that n + 0 = 0 and m + suc n = suc (m + n) are unobvious.

Fortunately understanding the problem allows us to solve it fairly easily. There is a function subst that allows us to substitute equal terms for one another in our types—thus if we can show each of the above we can invoke subst to massage our function into agda’s liking.

First addressing problem 2 in a straightforward inductive style

n+m+1≡n+1+m : (n m : ) → n + suc m  suc n + m
n+m+1≡n+1+m (suc n) m = cong suc (n+m+1≡n+1+m n m)
n+m+1≡n+1+m 0 m = refl
  1. Show suc n + suc m is suc (suc n) + m.
    By induction (that’s the recursive call) we know that n + suc m is suc n + m and hence that (see the definition of cong) suc (n + suc m) is suc (suc n + m) from which agda deduces our goal.
  2. Show that 0 + suc m is suc 0 + m.
    Agda knows that 0 + suc m is suc m and that suc 0 + m is suc m so our goal is to show that suc m equals itself—which follows from reflexivity.

Having shown the required lemma we should take a closer look at subst to understand its application. Opening up Relation/Binary/PropositionalEquality/Core.agda we see that

subst : {A : Set} → (P : A → Set) → ∀{x y} → x  y → P x → P y
subst P refl p = p

So like magic if we can supply proofs that x is y and that P x holds, subst will give us that the same term has type P y. How did that happen? Uncovering the type of definitional equality gives us the answer (see Relation/Binary/Core.agda).

data _≡_ {A : Set} (x : A) : A → Set where
  refl : x  x

This strange definition actually states that the only inhabitant of the type _≡_ is the proof that a term equals itself. Thus when we hand subst a proof that x ≡ y along with a proof of P x, it deduces that y is defined as x, and hence the same term p is also a proof of P y.

We’re finally at the point of solving reverse’.

reverse' : ∀{n m}{A : Set} → Vector A n → Vector A m → Vector A (n + m)
reverse' {suc n}{m} (x  xs) rxs = subst (λ k → Vector A k) (n+m+1≡n+1+m n m) (reverse' xs (x  rxs))
reverse' [] rxs = rxs

Recalling that the type of reverse’ xs (x ∷ rxs) is Vector A (n + suc m), we would like to show that this is Vector A (suc n + m). Using subst we take P k = Vector A k along with the proof n+m+1≡n+1+m n m showing n + suc m is suc n + m. Reiterating with a table makes this less confusing

P Vector A
x n + suc m
y suc n + m
x ≡ y n+m+1≡n+1+m n m : n + suc m ≡ suc n + m
P x Vector A (n + suc m)
P y Vector A (suc n + m)
p reverse’ xs (x ∷ rxs)

Thus the p we get out is a proof of P y = Vector A (suc n + m) as required.

We’re finally left to show that n + 0 is n. Given that we’ll need the commutativity of addition for problem 6 anyway we may as well take this approach. To show that n + m ≡ m + n we proceed by induction on n (and in the base case n=0 by induction on m).

+-comm : ∀ n m → n + m  m + n
+-comm (suc n) m = subst (λ e → suc (m + n)  e) (sym (n+m+1≡n+1+m m n)) (cong suc (+-comm n m))
+-comm 0 (suc m) = cong suc (+-comm 0 m)
+-comm 0 0 = refl

That first case is bit of a mouthful so we’ll break the substitution down with a table

P _≡_ (suc (m + n))
x suc m + n
y m + suc n
x ≡ y sym (n+m+1≡n+1+m m n) : suc m + n ≡ m + suc n
p cong suc (+-comm n m) : suc n + m ≡ suc m + n

The final solution to reverse should now be simple.

module P5 where
open import Data.Nat using (ℕ; suc; zero)
open import Relation.Binary.PropositionalEquality using (_≡_; refl; sym; subst; cong)

n+m+1≡n+1+m : (n m : ) → n + suc m  suc n + m
n+m+1≡n+1+m (suc n) m = cong suc (n+m+1≡n+1+m n m)
n+m+1≡n+1+m 0 m = refl

+-comm : ∀ n m → n + m  m + n
+-comm (suc n) m = subst (λ e → suc (m + n)  e) (sym (n+m+1≡n+1+m m n)) (cong suc (+-comm n m))
+-comm 0 (suc m) = cong suc (+-comm 0 m)
+-comm 0 0 = refl

reverse : ∀{n}{A : Set} → Vector A n → Vector A n
reverse {n} xs = subst (λ k → Vector A k) (+-comm n 0) (reverse' xs [])
  where reverse' : ∀{n m}{A : Set} → Vector A n → Vector A m → Vector A (n + m)
        reverse' {suc n}{m} (x  xs) rxs = subst (λ k → Vector A k) (n+m+1≡n+1+m n m) (reverse' xs (x  rxs))
        reverse' [] rxs = rxs

A-99 Lists 3 (*) Find the K’th element of a list

Given that inspecting an arbitrary index in a list can result in a runtime failure (what if we request index 12 of a length 2 list) we have two options

  1. return a datatype with an error code such as Maybe or -⊎-, or
  2. restrict the index by the size of the list.

We choose the second option because it’s not traditionally simple or even possible in many languages.

The type used to express this is Fin, the family of finite sets. As an example the type Fin 4 is the collection of numbers less than 4.

data Fin (A : Set) : Set where
  zero : {n : } → Fin (suc n) 
  suc : {n : } → (i : Fin n) → Fin (suc n)

The first constructor says that for any natural number n, zero is less than n+1 (we add one because zero is not less than zero).

The second constructor asserts that if we have a value less than some n, it’s successor is less than n+1.

Equipped with the above type we can express the problem exactly as one would expect: the function element-at is from a list of size n and an index less than n to an element in the list.

module P3 where
open import Data.Nat using (ℕ; suc; zero)
open import Data.Fin using (Fin; suc; zero)

data Vector (A : Set) : (n : ) → Setwhere
  [] : Vector A 0
  _∷_ : ∀{n} → A → Vector A n → Vector A (suc n)
infixr 5 _∷_

element-at : ∀{A}{n} → Vector A n → Fin n → A
element-at [] ()
element-at (x  xs) zero = x
element-at (x  xs) (suc n) = element-at xs n

Topoi 3.8: Products


  1. <pra, prb> = 1a × b
  2. if <f, g> = <k, h> then f = k and g = h
  3. <f ∘ h, g ∘ h> = <f, g> ∘ h
  4. if category 𝒞 has a terminal object 1 and products, then for any 𝒞-object a, a ≅ a × 1 and indeed <1a, Ia> is iso

The arrow 1a × b by definition satisfies

  1. pra ∘ 1a × b = pra and
  2. prb ∘ 1a × b = prb

and thus <pra, prb> = 1a × b.

Clearly pra ∘ <f, g> = pra ∘ <k, h>, hence f = k, moreover since prb ∘ <f, g> = prb ∘ <k, h>, g = h.

At this point we really only have two options for proving the isomorphism: we could (1) find an iso arrow or (2) show that a is a product of a × 1;. Opting for the second we must show the following diagram commutes for all c, f and g
a is a product of ax1
Notice however that we only have one choice of g, namely !c and regardless of <f, g> we have that !a ∘ <f, g> = !c again by the definition of the terminal arrow. Clearly then taking <f, g> = f gives

  1. 1a ∘ f = f and
  2. !a ∘ f = !c

as required. Given that all products are isomorphic we have that a is isomorphic to a × 1.