If you prefer to skip the english you can find a copy of the source in this post here
Recently I was refactoring some C++ and found myself needing to dispatch one of several calls via a template index. Of course while this is simple when we’re dealing with a static index—this is not so simple if the index is discovered at runtime because the only way to get a static copy is to inspect each possible value. My first intuition was to create a function table because it’s quick to implement and reflects a sort of runtime variant of a template over integers. However I wanted to give the optimizer a chance to inline each specialized call because the majority were a mere single expression and in a performance sensitive section. Since an indirect call has (rather obviously) no hope of being inlined I was left with no other choice than to make each call and its selection explicit. While possible with a recursive chain of if-else statements which has the bonus of an appealing simplicity, I didn’t expect the optimizer to do much with such an approach so I decided on a switch instead (I did actually check and though hopeful, gcc emitted a cascade of conditional jumps—I tried lambdas too which were significantly worse).
The first thing to ask is what a “switch call” will look like.
If we were designing a simple interpreter, we might switch on the operation as in the following
int run_operation(const int operation, const bool printResult,
const int x, const int y) {
int result;
switch (operation) {
case ADD: result = x + y; break;
case SUB: result = x - y; break;
case MUL: result = x * y; break;
case DIV: result = x / y; break;
case MOD: result = x % y; break;
default: assert(false);
}
if (printResult) printf("%i\n", result);
return result;
}
This is pretty suggestive of what a switch_ metafunction should look like. It will of course accept a template (my real motivation) which provides the mapping of case numbers to case functions; it should accept a value to switch on; and it should forward any other arguments onto the case bodies. Rewriting run_operation with such a function should look something like this
return switch_<operation_table>::run(operation, printResult, x, y);
Here the template argument int is the return type, the first argument to run is the value to switch on, and the remaining arguments are to be passed into the case’s body.
What will the template operation_table look like? We’d decided it should be a map of case numbers to functions so similar to a real switch we’ll use a (static) sequence of pairs, each with an integral value to match on and a body to be executed. This is enough information for a first stab at the signature of switch_ and its static function run
template <typename Xs>
struct switch_ {
template
static R run(int n, zero or more variable parameters);
};
An example of Xs (the sequence of pairs) is
template <int N>
struct operation_table {
static int value(const bool printResult, const int x, const int y) {
...
}
};
switch_<list<
pair<int_<0>, operation_table<ADD> >,
pair<int_<1>, operation_table<SUB> >,
pair<int_<2>, operation_table<MUL> >,
pair<int_<3>, operation_table<DIV> >,
pair<int_<4>, operation_table<MOD> >
> >::run(operation, printResult, x, y);
Since you may at this point be asking what we’ve gained, let’s start by providing a convenience function offering some concision
template <int Lower, int Upper, template <int> class Table> struct switch_table;
This at least simplifies things
switch_table<0, 4, operation_table>::run(operation, printResult, x, y)
Having defined what we’d like a generated switch to looks like, how do we actually go about implementing it?
A good starting place is in the generation of cases. Notice however that we can’t produce the cases for the switch using templates so we’re forced to fall back to the preprocessor, and while the preprocessor can send information to our templates, the templates obviously can’t send information back to the preprocessor. This is a problem because as everything is currently phrased, the preprocessor doesn’t know the number of cases to emit—the template does. To clarify, in the case of switch_table this is via the first two template arguments, Lower and Upper; and in the case of switch_ this is via the length of Xs.
To circumvent this problem we’ll create a family of metafunctions parametrized by the number of cases they include, template <int NumberOfCases, typename Xs> switch_impl, allowing the preprocessor to pass into the template the number of cases it intends to emit. The user exposed metafunction switch_ will then call switch_impl<length of Xs, Xs>.
#ifndef _metalevel_switch_h_
#define _metalevel_switch_h_
#include <boost/mpl/at.hpp>
#include <boost/mpl/if.hpp>
#include <boost/mpl/list.hpp>
#include <boost/mpl/pair.hpp>
#include <boost/mpl/push_front.hpp>
#include <boost/mpl/size.hpp>
#include <boost/preprocessor/iterate.hpp>
#include <boost/preprocessor/repetition/enum_params.hpp>
#include <boost/preprocessor/repetition/enum_binary_params.hpp>
#include <boost/preprocessor/repetition/repeat.hpp>
#ifdef __GNUC__
# define always_inline __attribute__((always_inline))
#elif defined(_MSC_VER)
# define always_inline __forceinline
#else
# define always_inline inline
#endif
#ifndef METALEVEL_MAX_CHOICE_ARGS
# define METALEVEL_MAX_CHOICE_ARGS 20
#endif
#ifndef METALEVEL_MAX_SWITCH_SIZE
# define METALEVEL_MAX_SWITCH_SIZE 20
#endif
namespace metalevel {
namespace aux {
template <int NumberOfCases, typename Xs>
struct switch_impl;
# define BOOST_PP_ITERATION_LIMITS (0, METALEVEL_MAX_SWITCH_SIZE)
# define BOOST_PP_FILENAME_1 <switch_impl.hpp>
# include BOOST_PP_ITERATE()
} // namespace aux
template <typename Xs>
struct switch_
: aux::switch_impl<boost::mpl::size::type::value - 1, Xs> {};
} // namespace metalevel
#endif
So all we’ve really accomplished is delegating most of the work to switch_impl.hpp using boost preprocessor’s file iteration (if you’re unfamiliar, the preprocessor docs found here are good). The above iteratively includes switch_impl.hpp METALEVEL_MAX_SWITCH_SIZE times, each time defining the pass number as a sort of argument to the header.
Before proceeding let’s quickly reiterate what’s going on here: switch_impl now handles switches with every possible number of cases (up to METALEVEL_MAX_SWITCH_SIZE). Then switch_ merely selects the correct number based on the size of its input list. For example switch_impl<2,Xs> implements run with case 0 and case 1 while switch_impl<7,Xs> implements run with case 0, case 1, case 2, case 3, case 4, case 5, and case 6.
So what does switch_impl.hpp look like?
#ifdef BOOST_PP_IS_ITERATING
#define NUMBER_OF_CASES BOOST_PP_FRAME_ITERATION(1)
template <typename Xs>
struct switch_impl<NUMBER_OF_CASES, Xs> {
# define TYPEDEF_X_SUB_I(_, i, __) \
typedef typename boost::mpl::at_c<Xs, i>::type x_##i;
BOOST_PP_REPEAT(NUMBER_OF_CASES, TYPEDEF_X_SUB_I, ~)
# define METALEVEL_CASE_TABLE_ENTRY(_, i, ts) \
case boost::mpl::first<x_##i>::type::value: \
return boost::mpl::second<x_##i>::type::value ts ;
# define DEFINE_CHOICE_RUN(_, argc, __) \
template <typename R BOOST_PP_COMMA_IF(argc) \
BOOST_PP_ENUM_PARAMS(argc, typename T)> \
static always_inline R \
run(int n BOOST_PP_COMMA_IF(argc) \
BOOST_PP_ENUM_BINARY_PARAMS(argc, T, t)) { \
switch(n) { \
BOOST_PP_REPEAT(NUMBER_OF_CASES, \
METALEVEL_CASE_TABLE_ENTRY, \
(BOOST_PP_ENUM_PARAMS(argc, t))) \
} \
}
BOOST_PP_REPEAT(METALEVEL_MAX_CHOICE_ARGS,
DEFINE_CHOICE_RUN, ~)
# undef DEFINE_CHOICE_RUN
# undef METALEVEL_CASE_TABLE_ENTRY
};
#undef NUMBER_OF_CASES
#endif
While a bit obscured by preprocessor iteration, nothing too interesting is actually going on here. Writing out a simple example really illuminates how this works so lets consider a copy of switch_impl when NumberOfCases is 3.
template <>
struct switch_impl<3, Xs> {
typedef boost::mpl::at_c<Xs, 0>::type x_0;
typedef boost::mpl::at_c<Xs, 1>::type x_1;
typedef boost::mpl::at_c<Xs, 2>::type x_2;
template
static always_inline R run(int n, T0 t0, T1 t1, T2 t2) {
switch (n) {
case boost::mpl::first<x_0>::type::value:
return boost::mpl::second<x_0>::type::value(t0, t1, t2);
case boost::mpl::first<x_1>::type::value:
return boost::mpl::second<x_1>::type::value(t0, t1, t2);
case boost::mpl::first<x_2>::type::value:
return boost::mpl::second<x_2>::type::value(t0, t1, t2);
}
}
};
We’re first calling upon the preprocessor to define several variables named x_i that are the ith types in the list Xs, if this is unclear they’re really just terms of the form Xs[i]. Continuing, we use the preprocessor to endow run with variadicness and rather than overloading run by hand we create a macro to define a copy of it taking argc arguments. Then we repeatedly call the macro, varying argc. Each copy of run generates a switch with NUMBER_OF_CASES cases, each of which in turn is generated by METALEVEL_CASE_TABLE_ENTRY. Recalling that the list Xs is a sequence of pairs, this macro simply takes the first item of the ith pair to be the case number, and the second item (which is a function) to be the case body.
At this point we’re basically ready to test things out, all that’s left to define is the convenience function switch_table. The idea here is to enumerate each item, building up a list of pairs to pass into switch_. This takes two steps because we need an extra parameter that tells us when to stop recursing, here we call that parameter Counter
namespace metalevel {
namespace aux {
template <int Counter, int N, template <int> class Table>
struct tabulate
: boost::mpl::push_front<
tabulate<Counter-1, N-1, Table>,
boost::mpl::pair<boost::mpl::int_, Table >
>::type {};
template <int N, template <int> class Table>
struct tabulate<0, N, Table>
: boost::mpl::list<
boost::mpl::pair<boost::mpl::int_, Table >
> {};
} // namespace aux
template class Table>
struct switch_table
: switch_<aux::tabulate > {};
} // namespace metalevel
Admittedly a bit verbose but overall nothing terribly strange.
Of course we should check that all this trouble actually gets us the assembly we’ve hoped for. The test case we’ll use is just an extension of the interpreter fragment we started with into a full fledged program. For comparison we’ll also include an array of function pointers.
#include <stdio.h>
#include <switch.hpp>
enum { ADD = 0, SUB = 1, MUL = 2, DIV = 3, MOD = 4 };
template <int Operation>
struct interpreter {
static inline int run(const int x, const int y);
};
template <> struct interpreter<ADD> {
static const char *name() { return "ADD"; }
static inline int run(const int x, const int y)
{ return x + y; }
};
template <> struct interpreter<SUB> {
static const char *name() { return "SUB"; }
static inline int run(const int x, const int y)
{ return x - y; }
};
template <> struct interpreter<MUL> {
static const char *name() { return "MUL"; }
static inline int run(const int x, const int y)
{ return x * y; }
};
template <> struct interpreter<DIV> {
static const char *name() { return "DIV"; }
static inline int run(const int x, const int y)
{ return x / y; }
};
template <> struct interpreter<MOD> {
static const char *name() { return "MOD"; }
static inline int run(const int x, const int y)
{ return x % y; }
};
template <int Operation>
struct operation_table {
static always_inline int value(const bool printResult,
const int x, const int y) {
const int r = interpreter<Operation>::run(x, y);
if (printResult) printf("operation_table<%s>::run(%i, %i) = %i\n",
interpreter<Operation>::name(), x, y, r);
return r;
}
};
int main() {
int(*operations[])(const bool,const int,const int) = {
operation_table<ADD>::value, operation_table<SUB>::value,
operation_table<MUL>::value, operation_table<DIV>::value,
operation_table<MOD>::value
};
bool printResult = true;
int op,x,y;
printf("op,x,y: ");
scanf("%i,%i,%i", &op, &x, &y);
metalevel::switch_table<0,4, operation_table>::run<int>(op,printResult,x,y);
operations[op](printResult,x,y);
return 0;
}
Since printResult is constant we expect the optimizer to propagate it through the inlined calls and since op, x, and y are dynamic we can be sure switch_table won’t get crushed into a single call to printf.
Lets first look at the result of a call to operations[op](printResult,x,y),
movslq 52(%rsp), %rax # load op movl $1, %edi # set flag to true callq *(%rsp,%rax,8) # jump to operations[op]
Proceeding with an example function body, consider the case when op = MUL
.LC1: .string "operation_table<%s>::run(%i, %i) = %i\n" ... .LC3: .string "MUL" ... pushq %rbx movl %esi, %ebx # load x imull %edx, %ebx # take x*y testb %dil, %dil # check if we should printResult je .L7 # jump over print if printResult is zero movl %edx, %ecx # load y for printing movl %ebx, %r8d # load x*y for printing movl %esi, %edx # load x for printing movl $.LC1, %edi # load format string movl $.LC3, %esi # load operation name for printing xorl %eax, %eax # no vector registers used in vararg call call printf .L7: movl %ebx, %eax # return x*y popq %rbx ret
This matches our expectation that gcc has no real opportunity for contextual optimization.
Moving on to the switch implementation, what does a call to run give us?
cmpl $4, 52(%rsp) # check that op is within case bounds ... # load x and y for both switch_ and operations[] ja .L13 # jump to default case if op's outside bounds mov 52(%rsp), %eax # load op for jump table jmp *.L19(,%rax,8) # jump to case body of op .L19: .quad .L14 # address of inlined add .quad .L15 # address of inlined sub .quad .L16 # address of inlined mul .quad .L17 # address of inlined div .quad .L18 # address of inlined mod
Which is fairly similar to the array implementation so lets compare the inlined function body when op = MUL
.LC1: .string "operation_table<%s>::run(%i, %i) = %i\n" ... .LC3: .string "MUL" ... .L16: movl %ecx, %r8d # load y movl %esi, %edx # load x for printing imull %esi, %r8d # load x*y for printing movl $.LC3, %esi # load "MUL" string jmp .L20 # jump to shared body printing the result ... .L20: movl $.LC1, %edi # load format string xorl %eax, %eax # no vector registers used in vararg call call printf ... # run operations[op](printResult,x,y)
Unsurprisingly gcc has done a good job of condensing things and removed the check of printResult entirely.
Now, why would you ever prefer this over a handwritten switch, typical OO polymorphism, or anything else?
In answer to the first question I think this allows a more table driven approach to your standard switch which seems to be an increasingly popular way of solving problems. Additionally if you hadn’t already split your cases into modular components this will force you to. Of course these sorts of questions are always a matter of taste.
Addressing the second and third questions together: not that often. However I do think the utility while niche, is indispensable in generative programming. As a simple example if you have a statically defined map that you’d like to use with a runtime value, a switch may offer good optimization opportunities. Other issues which we addressed above relate to inlining and static extensibility. Ultimately of course if our only goal is performance we should save ourselves embarrassment and just measure things.
