How does implementation of polymorphism affect performance?

There are two common ways to implement single dispatch polymorphism, virtual functions and switch statements.  Object oriented design encourages virtual functions since it is much easier to extend and manage, but people often complain about the overhead of virtual calls.  I’ve also wondered if the compiler can do any optimizations with C++ virtual functions that are not possible in C with a manually managed v-table.

This post explores the performance of three different implementations, C with v-tables, C++ native virtual functions, and C with switch statements.  It uses the Foo object which is just an integer of state and an operation.  In C++, the interface (I ignore the issue of creation in this post) is

struct Foo {
  virtual int func(int,int*) const = 0;
  long state;
};

Ideally, this structure would be presented to the user as an opaque object, thus facilitating a stable ABI. We’ll see that this doesn’t necessarily cause a measurable performance penalty.  Note that the state is chosen to be type long so that alignment is equivalent to the alternative implementation

/* Forward declaration in interface header */
typedef struct _p_FooStat *FooStat;

/* Implementation header */
struct _p_FooStat {
  FooType type;
  long state;
};

which will use a switch statement to determine the type.  The version with explicit virtual functions looks like this

/* Forward declaration in interface header */
typedef struct _p_FooVirt *FooVirt;

/* Implementation header */
struct FooOps {
  int (*func)(FooVirt,int,int*);
};
struct _p_FooVirt {
  struct FooOps *ops;
  size_t state;
};

The interface declarations are the typedefs above and these functions

int FooVirtCall(FooVirt,int,int*);
int FooStatCall(FooStat,int,int*);

In both cases, Foo can change completely without any change to the ABI.  Interface definitions go in the shared library and look like

int FooVirtCall(FooVirt f,int i,int *r)
{
  int e;
  e = f->ops->func(f,i,r);
  if (e) exit(e);
  return 0;
}

The error handling is for illustration only, obviously the interface function can do an arbitrary amount of setup before the virtual call and cleanup afterwards.  The alternative is

int FooStatCall(FooStat f,int i,int *r)
{
  int e = 0;
  switch (f->type) {
    case FOO_ADD: e = FooStat_Add(f,i,r); break;
    case FOO_SUBTRACT: e = FooStat_Subtract(f,i,r); break;
    case FOO_MULT: e = FooStat_Mult(f,i,r); break;
  }
  if (e) exit(e);
  return 0;
}

For performance testing, we create 50000 objects with type equal to index modulo 3 and iterate through all of the objects 10000 times.  The timing is done using gettimeofday.  The architecture is T9300 2.5Ghz Core 2 Duo running Linux 2.6.27.5 compiled using GCC-4.3.2 with -O3.


FooVirtCall:     2.1234 sec  235.4714 M calls / sec
FooStatCall:     2.9190 sec  171.2892 M calls / sec
 FooCxxCall:     2.2018 sec  227.0916 M calls / sec

The first important thing to note is that we are executing over 200 million polymorphic function calls per second.  Optimal memory bandwidth is 10.6 GB/s so if much more than 48 bytes (6 doubles) are used per function call (assuming the working set is much larger than cache), aggressively minimizing function call overhead is not going to help much.  For my purposes, the simplest case uses 8 floating point numbers that cannot be in cache already, hence we’re barely in the realm where optimization of the virtual call matters.

We’ll start by comparing C and C++ virtual calls and come back to FooStatCall later.  The C++ version is the obvious implementation using an abstruct base class and virtual functions.  The assembly is identical for FooVirtCall and FooCxxCall.


0x0000000000402020 <FooVirtCall+0>:     sub    $0x8,%rsp
0x0000000000402024 <FooVirtCall+4>:     mov    (%rdi),%rax
0x0000000000402027 <FooVirtCall+7>:     callq  *(%rax)
0x0000000000402029 <FooVirtCall+9>:     test   %eax,%eax
0x000000000040202b <FooVirtCall+11>:    jne    0x402034 <FooVirtCall+20>
0x000000000040202d <FooVirtCall+13>:    xor    %eax,%eax
0x000000000040202f <FooVirtCall+15>:    add    $0x8,%rsp
0x0000000000402033 <FooVirtCall+19>:    retq
0x0000000000402034 <FooVirtCall+20>:    mov    %eax,%edi
0x0000000000402036 <FooVirtCall+22>:    callq  0x4010b8 <exit@plt>

Now, with C++ virtual functions, the header usually contains the struct definition so adding a virtual function or a data member changes the ABI.  Also, there isn’t usually any error checking in the interface function.  That is, the compiler essentially inlines the following at the call site.

int FooVirtCallTail(FooVirt f,int i,int *r)
{
  return f->ops->func(f,i,r);
}

To maintain ABI stability, we keep the definition of struct _p_FooVirt (hence necessarily this interface function) out of the public header.


FooVirtCallTail:     1.7889 sec  279.4943 M calls / sec

The assembly is obviously optimal


0x00007f2ad6c9e130 <FooVirtCallTail+0>: mov    (%rdi),%rax
0x00007f2ad6c9e133 <FooVirtCallTail+3>: mov    (%rax),%r11
0x00007f2ad6c9e136 <FooVirtCallTail+6>: jmpq   *%r11

and is identical to the C++ analogue.  How much better can we do if we allow inlining of the interface function as is normally done in C++? Defining the struct and interface function (static inline) in the header, we see


    FooVirtCallInlineable:     1.8097 sec  276.2880 M calls / sec
FooVirtCallInlineableTail:     1.7789 sec  281.0789 M calls / sec

(The latter is what you get from C++ by default.)  So the error checking version is about as fast as the tail call, which hasn’t really benefited from inlining.  The reason for this is that the interface function is trivial and already in cache, so the cost of not inlining it is only one perfectly predicted jmp to a hot location.  When there is error checking, the compiler can obviously take advantage of inlining to produce a faster call.

The conclusion here is that using a C-style polymorphism allow us to keep the structure definition and v-table out of the ABI with essentially zero runtime cost compared to putting it in the header.  If we do error checking in the interface function, we pay about a 20% price compared to the C++ version where the static interface method could be inlined.  In C, if this was an issue, we would put the struct definition and interface function in the public header, thus getting identical assembly to C++.

While there are ways to obtain ABI stability in C++, it’s not the default as it is with C.  C also provides convenient introspection and reflection which is not really available in C++.  For instance, you can determine whether a given object implements a virtual function by checking whether the function pointer is NULL.  It is possible to create new aggregate types by allocating memory for the v-table on a per-object basis and then selectively rewriting parts of the v-table.

Back to the early loser, FooStatCall.  There are lots of ways to speed this up.  Timing for some variations, including the original:


              FooStatCall:     2.9190 sec  171.2892 M calls / sec
          FooStatCallTail:     3.8625 sec  129.4489 M calls / sec
           FooStatCallArr:     1.7841 sec  280.2485 M calls / sec
        FooStatCallInline:     1.6448 sec  303.9879 M calls / sec
    FooStatCallInlineable:     2.4053 sec  207.8724 M calls / sec
FooStatCallInlineableTail:     2.8158 sec  177.5725 M calls / sec
     FooStatCallAllInline:     1.1480 sec  435.5427 M calls / sec

Strangely, using tail calls is actually slower for this calling sequence.  FooStatCallArr is a tail call with an array of static function pointers.  The assembly is clearly optimal and reveals why it is basically the same speed as virtual calls.


0x0000000000401e50 <FooStatCallArr+0>:  mov    (%rdi),%eax
0x0000000000401e52 <FooStatCallArr+2>:  mov    0x402460(,%rax,8),%r11
0x0000000000401e5a <FooStatCallArr+10>: jmpq   *%r11

The interface function FooStatCallInline is not inlined, rather, it inlines the implementation (the operation is written into each of the cases in the switch statement).  Making the interface function inlineable at the call site doesn’t help.  Of course the fastest solution is to inline absolutely everything.

The conclusion here is that making switch statements faster than virtual calls requires at a minimum that the implementations be in the same compilation unit as the interface function.  In this case, the benefits are still very small; to really win, everything must be inlineable by the user (i.e. the entire implementation must be written in the public header).  This is unacceptable in most environments.

If you’d like to try some variations, you can start with this tarball.

The motivation for this post is fast application of tensor product operations like interpolation and differentiation between collocation and quadrature nodes in an hp finite element method. The mesh is not necessarily homogeneous with respect to topology and spectral order. C++ templates are not directly applicable since we work with large arrays of mixed type. It is useful to have the innermost loop completely unrolled, thus essentially templating over the last dimension in the tensor product. If it was practical to sort by element type, it would be possible to hoist the polymorphism out of the loop over elements (use separate loops over elements of each type). However, changing the order of element traversal negatively impacts data locality so the polymorphic function call is actually cheaper.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.