There are two common ways to implement single dispatch polymorphism, virtual functions and switch statements. Object oriented design encourages virtual functions since it is much easier to extend and manage, but people often complain about the overhead of virtual calls. I’ve also wondered if the compiler can do any optimizations with C++ virtual functions that are not possible in C with a manually managed v-table.
This post explores the performance of three different implementations, C with v-tables, C++ native virtual functions, and C with switch statements. It uses the Foo object which is just an integer of state and an operation. In C++, the interface (I ignore the issue of creation in this post) is
struct Foo {
virtual int func(int,int*) const = 0;
long state;
};
Ideally, this structure would be presented to the user as an opaque object, thus facilitating a stable ABI. We’ll see that this doesn’t necessarily cause a measurable performance penalty. Note that the state is chosen to be type long so that alignment is equivalent to the alternative implementation
/* Forward declaration in interface header */
typedef struct _p_FooStat *FooStat;
/* Implementation header */
struct _p_FooStat {
FooType type;
long state;
};
which will use a switch statement to determine the type. The version with explicit virtual functions looks like this
/* Forward declaration in interface header */
typedef struct _p_FooVirt *FooVirt;
/* Implementation header */
struct FooOps {
int (*func)(FooVirt,int,int*);
};
struct _p_FooVirt {
struct FooOps *ops;
size_t state;
};
The interface declarations are the typedefs above and these functions
int FooVirtCall(FooVirt,int,int*); int FooStatCall(FooStat,int,int*);
In both cases, Foo can change completely without any change to the ABI. Interface definitions go in the shared library and look like
int FooVirtCall(FooVirt f,int i,int *r)
{
int e;
e = f->ops->func(f,i,r);
if (e) exit(e);
return 0;
}
The error handling is for illustration only, obviously the interface function can do an arbitrary amount of setup before the virtual call and cleanup afterwards. The alternative is
int FooStatCall(FooStat f,int i,int *r)
{
int e = 0;
switch (f->type) {
case FOO_ADD: e = FooStat_Add(f,i,r); break;
case FOO_SUBTRACT: e = FooStat_Subtract(f,i,r); break;
case FOO_MULT: e = FooStat_Mult(f,i,r); break;
}
if (e) exit(e);
return 0;
}
For performance testing, we create 50000 objects with type equal to index modulo 3 and iterate through all of the objects 10000 times. The timing is done using gettimeofday. The architecture is T9300 2.5Ghz Core 2 Duo running Linux 2.6.27.5 compiled using GCC-4.3.2 with -O3.
FooVirtCall: 2.1234 sec 235.4714 M calls / sec
FooStatCall: 2.9190 sec 171.2892 M calls / sec
FooCxxCall: 2.2018 sec 227.0916 M calls / sec
The first important thing to note is that we are executing over 200 million polymorphic function calls per second. Optimal memory bandwidth is 10.6 GB/s so if much more than 48 bytes (6 doubles) are used per function call (assuming the working set is much larger than cache), aggressively minimizing function call overhead is not going to help much. For my purposes, the simplest case uses 8 floating point numbers that cannot be in cache already, hence we’re barely in the realm where optimization of the virtual call matters.
We’ll start by comparing C and C++ virtual calls and come back to FooStatCall later. The C++ version is the obvious implementation using an abstruct base class and virtual functions. The assembly is identical for FooVirtCall and FooCxxCall.
0x0000000000402020 <FooVirtCall+0>: sub $0x8,%rsp
0x0000000000402024 <FooVirtCall+4>: mov (%rdi),%rax
0x0000000000402027 <FooVirtCall+7>: callq *(%rax)
0x0000000000402029 <FooVirtCall+9>: test %eax,%eax
0x000000000040202b <FooVirtCall+11>: jne 0x402034 <FooVirtCall+20>
0x000000000040202d <FooVirtCall+13>: xor %eax,%eax
0x000000000040202f <FooVirtCall+15>: add $0x8,%rsp
0x0000000000402033 <FooVirtCall+19>: retq
0x0000000000402034 <FooVirtCall+20>: mov %eax,%edi
0x0000000000402036 <FooVirtCall+22>: callq 0x4010b8 <exit@plt>
Now, with C++ virtual functions, the header usually contains the struct definition so adding a virtual function or a data member changes the ABI. Also, there isn’t usually any error checking in the interface function. That is, the compiler essentially inlines the following at the call site.
int FooVirtCallTail(FooVirt f,int i,int *r)
{
return f->ops->func(f,i,r);
}
To maintain ABI stability, we keep the definition of struct _p_FooVirt (hence necessarily this interface function) out of the public header.
FooVirtCallTail: 1.7889 sec 279.4943 M calls / sec
The assembly is obviously optimal
0x00007f2ad6c9e130 <FooVirtCallTail+0>: mov (%rdi),%rax
0x00007f2ad6c9e133 <FooVirtCallTail+3>: mov (%rax),%r11
0x00007f2ad6c9e136 <FooVirtCallTail+6>: jmpq *%r11
and is identical to the C++ analogue. How much better can we do if we allow inlining of the interface function as is normally done in C++? Defining the struct and interface function (static inline) in the header, we see
FooVirtCallInlineable: 1.8097 sec 276.2880 M calls / sec
FooVirtCallInlineableTail: 1.7789 sec 281.0789 M calls / sec
(The latter is what you get from C++ by default.) So the error checking version is about as fast as the tail call, which hasn’t really benefited from inlining. The reason for this is that the interface function is trivial and already in cache, so the cost of not inlining it is only one perfectly predicted jmp to a hot location. When there is error checking, the compiler can obviously take advantage of inlining to produce a faster call.
The conclusion here is that using a C-style polymorphism allow us to keep the structure definition and v-table out of the ABI with essentially zero runtime cost compared to putting it in the header. If we do error checking in the interface function, we pay about a 20% price compared to the C++ version where the static interface method could be inlined. In C, if this was an issue, we would put the struct definition and interface function in the public header, thus getting identical assembly to C++.
While there are ways to obtain ABI stability in C++, it’s not the default as it is with C. C also provides convenient introspection and reflection which is not really available in C++. For instance, you can determine whether a given object implements a virtual function by checking whether the function pointer is NULL. It is possible to create new aggregate types by allocating memory for the v-table on a per-object basis and then selectively rewriting parts of the v-table.
Back to the early loser, FooStatCall. There are lots of ways to speed this up. Timing for some variations, including the original:
FooStatCall: 2.9190 sec 171.2892 M calls / sec
FooStatCallTail: 3.8625 sec 129.4489 M calls / sec
FooStatCallArr: 1.7841 sec 280.2485 M calls / sec
FooStatCallInline: 1.6448 sec 303.9879 M calls / sec
FooStatCallInlineable: 2.4053 sec 207.8724 M calls / sec
FooStatCallInlineableTail: 2.8158 sec 177.5725 M calls / sec
FooStatCallAllInline: 1.1480 sec 435.5427 M calls / sec
Strangely, using tail calls is actually slower for this calling sequence. FooStatCallArr is a tail call with an array of static function pointers. The assembly is clearly optimal and reveals why it is basically the same speed as virtual calls.
0x0000000000401e50 <FooStatCallArr+0>: mov (%rdi),%eax
0x0000000000401e52 <FooStatCallArr+2>: mov 0x402460(,%rax,8),%r11
0x0000000000401e5a <FooStatCallArr+10>: jmpq *%r11
The interface function FooStatCallInline is not inlined, rather, it inlines the implementation (the operation is written into each of the cases in the switch statement). Making the interface function inlineable at the call site doesn’t help. Of course the fastest solution is to inline absolutely everything.
The conclusion here is that making switch statements faster than virtual calls requires at a minimum that the implementations be in the same compilation unit as the interface function. In this case, the benefits are still very small; to really win, everything must be inlineable by the user (i.e. the entire implementation must be written in the public header). This is unacceptable in most environments.
If you’d like to try some variations, you can start with this tarball.
The motivation for this post is fast application of tensor product operations like interpolation and differentiation between collocation and quadrature nodes in an finite element method. The mesh is not necessarily homogeneous with respect to topology and spectral order. C++ templates are not directly applicable since we work with large arrays of mixed type. It is useful to have the innermost loop completely unrolled, thus essentially templating over the last dimension in the tensor product. If it was practical to sort by element type, it would be possible to hoist the polymorphism out of the loop over elements (use separate loops over elements of each type). However, changing the order of element traversal negatively impacts data locality so the polymorphic function call is actually cheaper.