<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Dual order hp blog</title>
	<atom:link href="http://dohp.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://dohp.wordpress.com</link>
	<description>Just another WordPress.com weblog</description>
	<lastBuildDate>Wed, 22 Apr 2009 14:08:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='dohp.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Dual order hp blog</title>
		<link>http://dohp.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://dohp.wordpress.com/osd.xml" title="Dual order hp blog" />
	<atom:link rel='hub' href='http://dohp.wordpress.com/?pushpress=hub'/>
		<item>
		<title>How does implementation of polymorphism affect performance?</title>
		<link>http://dohp.wordpress.com/2008/10/31/implementation-of-polymorphism/</link>
		<comments>http://dohp.wordpress.com/2008/10/31/implementation-of-polymorphism/#comments</comments>
		<pubDate>Fri, 31 Oct 2008 20:06:14 +0000</pubDate>
		<dc:creator>Jed</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false"></guid>
		<description><![CDATA[There are two common ways to implement single dispatch polymorphism, virtual functions and switch statements.  Object oriented design encourages virtual functions since it is much easier to extend and manage, but people often complain about the overhead of virtual calls.  I&#8217;ve also wondered if the compiler can do any optimizations with C++ virtual functions that [...]]]></description>
			<content:encoded><![CDATA[<p>There are two common ways to implement single dispatch polymorphism: virtual functions and switch statements.  Object oriented design encourages virtual functions since they are much easier to extend and manage, but people often complain about the overhead of virtual calls.  I&#8217;ve also wondered if the compiler can do any optimizations with C++ virtual functions that are not possible in C with a manually managed v-table.</p>
<p>This post explores the performance of three different implementations: C with v-tables, C++ native virtual functions, and C with switch statements.  It uses the <code>Foo</code> object, which is just an integer of state and an operation.  In C++, the interface (I ignore the issue of creation in this post) is</p>
<p><pre class="brush: cpp;">
struct Foo {
  virtual int func(int,int*) const = 0;
  long state;
};
</pre></p>
<p>Ideally, this structure would be presented to the user as an opaque object, thus facilitating a stable ABI.  We&#8217;ll see that this doesn&#8217;t necessarily cause a measurable performance penalty.  Note that the state is chosen to be type <code>long</code> so that alignment is equivalent to the alternative implementation</p>
<p><pre class="brush: cpp;">
/* Forward declaration in interface header */
typedef struct _p_FooStat *FooStat;

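/* Implementation header */
typedef enum { FOO_ADD, FOO_SUBTRACT, FOO_MULT } FooType;

struct _p_FooStat {
  FooType type;   /* selects the case in the switch statement */
  long state;
};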
</pre></p>
<p>which will use a switch statement to determine the type.  The version with explicit virtual functions looks like this</p>
<p><pre class="brush: cpp;">
/* Forward declaration in interface header */
typedef struct _p_FooVirt *FooVirt;

/* Implementation header */
struct FooOps {
  int (*func)(FooVirt,int,int*);
};
struct _p_FooVirt {
  struct FooOps *ops;
  size_t state;
};
</pre></p>
<p>The interface declarations are the <code>typedef</code>s above and these functions</p>
<p><pre class="brush: cpp;">
int FooVirtCall(FooVirt,int,int*);
int FooStatCall(FooStat,int,int*);
</pre></p>
<p>In both cases, <code>Foo</code> can change completely without any change to the ABI.  The interface functions are defined in the shared library and look like</p>
<p><pre class="brush: cpp;">
int FooVirtCall(FooVirt f,int i,int *r)
{
  int e;
  e = f-&gt;ops-&gt;func(f,i,r);
  if (e) exit(e);
  return 0;
}
</pre></p>
<p>The error handling is for illustration only; the interface function can do an arbitrary amount of setup before the virtual call and cleanup afterwards.  The alternative is</p>
<p><pre class="brush: cpp;">
int FooStatCall(FooStat f,int i,int *r)
{
  int e = 0;
  switch (f-&gt;type) {
    case FOO_ADD: e = FooStat_Add(f,i,r); break;
    case FOO_SUBTRACT: e = FooStat_Subtract(f,i,r); break;
    case FOO_MULT: e = FooStat_Mult(f,i,r); break;
  }
  if (e) exit(e);
  return 0;
}
</pre></p>
<p>For performance testing, we create 50000 objects with type equal to index modulo 3 and iterate through all of the objects 10000 times, for a total of 500 million calls.  The timing is done using <code>gettimeofday</code>.  The test machine is a 2.5 GHz Core 2 Duo T9300 running Linux 2.6.27.5, and the code is compiled with GCC 4.3.2 using <code>-O3</code>.</p>
<pre><code>
FooVirtCall:     2.1234 sec  235.4714 M calls / sec
FooStatCall:     2.9190 sec  171.2892 M calls / sec
 FooCxxCall:     2.2018 sec  227.0916 M calls / sec
</code></pre>
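<p>The timing loop described above can be sketched roughly as follows.  This is not the code from the tarball: <code>elapsed_sec</code>, <code>mcalls_per_sec</code>, <code>time_sweeps</code>, and the <code>CallFn</code> type are hypothetical names, and the objects are passed as untyped pointers only to keep the sketch short.</p>
<p><pre class="brush: cpp;">
#include &lt;stddef.h&gt;
#include &lt;sys/time.h&gt;

/* Seconds elapsed between two gettimeofday samples. */
double elapsed_sec(const struct timeval *t0,const struct timeval *t1)
{
  return (double)(t1-&gt;tv_sec - t0-&gt;tv_sec)
       + 1e-6 * (double)(t1-&gt;tv_usec - t0-&gt;tv_usec);
}

/* Millions of calls per second for nobj objects swept nsweep times. */
double mcalls_per_sec(size_t nobj,size_t nsweep,double seconds)
{
  return 1e-6 * (double)nobj * (double)nsweep / seconds;
}

/* Hypothetical signature shared by the interface functions under test. */
typedef int (*CallFn)(void *obj,int i,int *r);

/* Call every object once per sweep and return the elapsed wall-clock time. */
double time_sweeps(CallFn call,void **objs,size_t nobj,size_t nsweep)
{
  struct timeval t0,t1;
  size_t s,k;
  int r = 0;
  gettimeofday(&amp;t0,NULL);
  for (s = 0; s &lt; nsweep; s++)
    for (k = 0; k &lt; nobj; k++)
      call(objs[k],(int)s,&amp;r);
  gettimeofday(&amp;t1,NULL);
  return elapsed_sec(&amp;t0,&amp;t1);
}
</pre></p>
<p>With 50000 objects and 10000 sweeps, an elapsed time of 2.1234 seconds corresponds to the 235.47 M calls / sec reported for <code>FooVirtCall</code>.</p>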
<p>The first important thing to note is that we are executing over 200 million polymorphic function calls per second.  Optimal memory bandwidth is 10.6 GB/s so if much more than 48 bytes (6 <code>double</code>s) are used per function call (assuming the working set is much larger than cache), aggressively minimizing function call overhead is not going to help much.  For my purposes, the simplest case uses 8 floating point numbers that cannot be in cache already, hence we&#8217;re barely in the realm where optimization of the virtual call matters.</p>
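<p>The byte budget quoted above is simply the memory bandwidth divided by the call rate.  A minimal helper makes the arithmetic explicit; <code>bytes_per_call</code> is a hypothetical name, and the inputs are the figures from this post.</p>
<p><pre class="brush: cpp;">
/* Bytes of memory traffic available per call before the loop becomes
   limited by memory bandwidth rather than by call overhead. */
double bytes_per_call(double bandwidth_bytes_per_sec,double calls_per_sec)
{
  return bandwidth_bytes_per_sec / calls_per_sec;
}
</pre></p>
<p>For example, <code>bytes_per_call(10.6e9, 227.0916e6)</code>, using the <code>FooCxxCall</code> rate, gives about 47 bytes, in line with the figure of roughly 6 <code>double</code>s.</p>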
<p>We&#8217;ll start by comparing C and C++ virtual calls and come back to <code>FooStatCall</code> later.  The C++ version is the obvious implementation using an abstract base class and virtual functions.  The assembly is identical for <code>FooVirtCall</code> and <code>FooCxxCall</code>.</p>
<pre><code>
0x0000000000402020 &lt;FooVirtCall+0&gt;:     sub    $0x8,%rsp
0x0000000000402024 &lt;FooVirtCall+4&gt;:     mov    (%rdi),%rax
0x0000000000402027 &lt;FooVirtCall+7&gt;:     callq  *(%rax)
0x0000000000402029 &lt;FooVirtCall+9&gt;:     test   %eax,%eax
0x000000000040202b &lt;FooVirtCall+11&gt;:    jne    0x402034 &lt;FooVirtCall+20&gt;
0x000000000040202d &lt;FooVirtCall+13&gt;:    xor    %eax,%eax
0x000000000040202f &lt;FooVirtCall+15&gt;:    add    $0x8,%rsp
0x0000000000402033 &lt;FooVirtCall+19&gt;:    retq
0x0000000000402034 &lt;FooVirtCall+20&gt;:    mov    %eax,%edi
0x0000000000402036 &lt;FooVirtCall+22&gt;:    callq  0x4010b8 &lt;exit@plt&gt;
</code></pre>
<p>Now, with C++ virtual functions, the header usually contains the struct definition so adding a virtual function or a data member changes the ABI.  Also, there isn&#8217;t usually any error checking in the interface function.  That is, the compiler essentially inlines the following at the call site.</p>
<p><pre class="brush: cpp;">
int FooVirtCallTail(FooVirt f,int i,int *r)
{
  return f-&gt;ops-&gt;func(f,i,r);
}
</pre></p>
<p>To maintain ABI stability, we keep the definition of <code>struct _p_FooVirt</code> (hence necessarily this interface function) out of the public header.</p>
<pre><code>
FooVirtCallTail:     1.7889 sec  279.4943 M calls / sec
</code></pre>
<p>The assembly is obviously optimal</p>
<pre><code>
0x00007f2ad6c9e130 &lt;FooVirtCallTail+0&gt;: mov    (%rdi),%rax
0x00007f2ad6c9e133 &lt;FooVirtCallTail+3&gt;: mov    (%rax),%r11
0x00007f2ad6c9e136 &lt;FooVirtCallTail+6&gt;: jmpq   *%r11
</code></pre>
<p>and is identical to the C++ analogue.  How much better can we do if we allow inlining of the interface function as is normally done in C++?  Defining the struct and interface function (<code>static inline</code>) in the header, we see</p>
<pre><code>
    FooVirtCallInlineable:     1.8097 sec  276.2880 M calls / sec
FooVirtCallInlineableTail:     1.7789 sec  281.0789 M calls / sec
</code></pre>
<p>(The latter is what you get from C++ by default.)  So the error checking version is about as fast as the tail call, which hasn&#8217;t really benefited from inlining.  The reason for this is that the interface function is trivial and already in cache, so the cost of not inlining it is only one perfectly predicted <code>jmp</code> to a hot location.  When there is error checking, the compiler can obviously take advantage of inlining to produce a faster call.</p>
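<p>As a sketch, the inlineable variants could be placed in the public header like this.  The bodies mirror <code>FooVirtCall</code> and <code>FooVirtCallTail</code> above, but the actual header in the tarball may differ; <code>FooVirt_Add</code> is a hypothetical implementation added only so that the example is complete.</p>
<p><pre class="brush: cpp;">
#include &lt;stddef.h&gt;
#include &lt;stdlib.h&gt;

typedef struct _p_FooVirt *FooVirt;

/* With inlining, the struct definitions become part of the public header. */
struct FooOps {
  int (*func)(FooVirt,int,int*);
};
struct _p_FooVirt {
  struct FooOps *ops;
  size_t state;
};

/* Error-checking variant: the check can be inlined at the call site. */
static inline int FooVirtCallInlineable(FooVirt f,int i,int *r)
{
  int e = f-&gt;ops-&gt;func(f,i,r);
  if (e) exit(e);
  return 0;
}

/* Tail-call variant: forwards the return value of the implementation. */
static inline int FooVirtCallInlineableTail(FooVirt f,int i,int *r)
{
  return f-&gt;ops-&gt;func(f,i,r);
}

/* Hypothetical implementation: add the state to the argument. */
static int FooVirt_Add(FooVirt f,int i,int *r)
{
  *r = i + (int)f-&gt;state;
  return 0;
}
</pre></p>
<p>The cost of this layout is that any change to <code>struct _p_FooVirt</code> or <code>struct FooOps</code> now requires recompiling user code.</p>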
<p>The conclusion here is that C-style polymorphism allows us to keep the structure definition and v-table out of the ABI with essentially zero runtime cost compared to putting them in the header.  If we do error checking in the interface function, we pay a price of roughly 17% (2.12 versus 1.81 seconds) compared to the C++ version where the static interface method could be inlined.  In C, if this were an issue, we would put the struct definition and interface function in the public header, thus getting identical assembly to C++.</p>
<p>While there are <a href="http://www.ddj.com/cpp/205918714?pgno=1">ways</a> to obtain ABI stability in C++, it&#8217;s not the default as it is with C.  C-style v-tables also provide a convenient form of introspection and reflection that is not really available in C++.  For instance, you can determine whether a given object implements a virtual function by checking whether the function pointer is <code>NULL</code>.  It is also possible to create new aggregate types by allocating memory for the v-table on a per-object basis and then selectively rewriting parts of the v-table.</p>
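<p>Both ideas can be sketched in a few lines.  The helper names <code>FooVirtHasFunc</code> and <code>FooVirtOverrideFunc</code> and the implementation <code>FooVirt_Add</code> are hypothetical; the sketch assumes the object starts out pointing at a shared v-table that must not be modified.</p>
<p><pre class="brush: cpp;">
#include &lt;stddef.h&gt;
#include &lt;stdlib.h&gt;

typedef struct _p_FooVirt *FooVirt;
struct FooOps {
  int (*func)(FooVirt,int,int*);
};
struct _p_FooVirt {
  struct FooOps *ops;
  size_t state;
};

/* Introspection: nonzero if this object implements func. */
int FooVirtHasFunc(FooVirt f)
{
  return f-&gt;ops != NULL &amp;&amp; f-&gt;ops-&gt;func != NULL;
}

/* Give f a private copy of its v-table and rewrite one slot.
   Returns 0 on success and 1 if allocation fails.  The caller is
   responsible for freeing the private table. */
int FooVirtOverrideFunc(FooVirt f,int (*func)(FooVirt,int,int*))
{
  struct FooOps *ops = malloc(sizeof *ops);
  if (!ops) return 1;
  *ops = *f-&gt;ops;     /* copy every slot of the shared table */
  ops-&gt;func = func;   /* then rewrite the one we want to change */
  f-&gt;ops = ops;
  return 0;
}

/* Hypothetical implementation: add the state to the argument. */
int FooVirt_Add(FooVirt f,int i,int *r)
{
  *r = i + (int)f-&gt;state;
  return 0;
}
</pre></p>
<p>Other objects that still point at the shared table are unaffected by the override, which is what makes per-object customization safe.</p>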
<p>Back to the early loser, <code>FooStatCall</code>.  There are lots of ways to speed this up.  Timing for some variations, including the original:</p>
<pre><code>
              FooStatCall:     2.9190 sec  171.2892 M calls / sec
          FooStatCallTail:     3.8625 sec  129.4489 M calls / sec
           FooStatCallArr:     1.7841 sec  280.2485 M calls / sec
        FooStatCallInline:     1.6448 sec  303.9879 M calls / sec
    FooStatCallInlineable:     2.4053 sec  207.8724 M calls / sec
FooStatCallInlineableTail:     2.8158 sec  177.5725 M calls / sec
     FooStatCallAllInline:     1.1480 sec  435.5427 M calls / sec
</code></pre>
<p>Strangely, using tail calls is actually slower for this calling sequence.  <code>FooStatCallArr</code> is a tail call with an array of static function pointers.  The assembly is clearly optimal and reveals why it is basically the same speed as virtual calls.</p>
<pre><code>
0x0000000000401e50 &lt;FooStatCallArr+0&gt;:  mov    (%rdi),%eax
0x0000000000401e52 &lt;FooStatCallArr+2&gt;:  mov    0x402460(,%rax,8),%r11
0x0000000000401e5a &lt;FooStatCallArr+10&gt;: jmpq   *%r11
</code></pre>
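<p>A source-level sketch that would be consistent with this assembly is shown below.  The table name <code>FooStatTable</code> and the bodies of the three implementations are assumptions made for illustration; only the type names and the enum values come from the code earlier in this post.</p>
<p><pre class="brush: cpp;">
typedef enum { FOO_ADD, FOO_SUBTRACT, FOO_MULT } FooType;
typedef struct _p_FooStat *FooStat;
struct _p_FooStat {
  FooType type;
  long state;
};

/* Hypothetical per-type implementations; each returns 0 on success. */
static int FooStat_Add(FooStat f,int i,int *r)
{ *r = (int)f-&gt;state + i; return 0; }
static int FooStat_Subtract(FooStat f,int i,int *r)
{ *r = (int)f-&gt;state - i; return 0; }
static int FooStat_Mult(FooStat f,int i,int *r)
{ *r = (int)f-&gt;state * i; return 0; }

/* Static table of function pointers indexed by FooType. */
static int (*const FooStatTable[])(FooStat,int,int*) = {
  [FOO_ADD]      = FooStat_Add,
  [FOO_SUBTRACT] = FooStat_Subtract,
  [FOO_MULT]     = FooStat_Mult
};

/* Tail call through the table: load the type, index, jump. */
int FooStatCallArr(FooStat f,int i,int *r)
{
  return FooStatTable[f-&gt;type](f,i,r);
}
</pre></p>
<p>This is effectively a v-table shared by all objects and selected by an integer tag instead of a pointer stored in each object.</p>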
<p>The interface function <code>FooStatCallInline</code> is not itself inlined; rather, it inlines the implementation (the operation is written into each of the cases in the <code>switch</code> statement).  Making the interface function inlineable at the call site doesn&#8217;t help.  Of course the fastest solution is to inline absolutely everything.</p>
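<p>A minimal sketch of what writing the operation into each case might look like follows.  The arithmetic in each case and the <code>default</code> branch are assumptions, since the post does not show the bodies of the operations.</p>
<p><pre class="brush: cpp;">
#include &lt;stdlib.h&gt;

typedef enum { FOO_ADD, FOO_SUBTRACT, FOO_MULT } FooType;
typedef struct _p_FooStat *FooStat;
struct _p_FooStat {
  FooType type;
  long state;
};

/* The interface function is still an ordinary (non-inlined) call, but
   each case contains the operation itself, not a call to it. */
int FooStatCallInline(FooStat f,int i,int *r)
{
  int e = 0;
  switch (f-&gt;type) {
    case FOO_ADD:      *r = (int)f-&gt;state + i; break;
    case FOO_SUBTRACT: *r = (int)f-&gt;state - i; break;
    case FOO_MULT:     *r = (int)f-&gt;state * i; break;
    default:           e = 1; break;   /* unknown type */
  }
  if (e) exit(e);
  return 0;
}
</pre></p>
<p>This only works when the implementations live in the same compilation unit as the interface function, which is the constraint discussed next.</p>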
<p>The conclusion here is that making switch statements faster than virtual calls requires at a minimum that the implementations be in the same compilation unit as the interface function.  In this case, the benefits are still very small; to really win, <em>everything</em> must be inlineable by the user (i.e. the entire implementation must be written in the public header).  This is unacceptable in most environments.</p>
<p>If you&#8217;d like to try some variations, you can start with <a href="http://59A2.org/files/virt.tar.gz">this</a> tarball.</p>
<p>The motivation for this post is fast application of tensor product operations like interpolation and differentiation between collocation and quadrature nodes in an <img src='http://s0.wp.com/latex.php?latex=hp&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='hp' title='hp' class='latex' /> finite element method.  The mesh is not necessarily homogeneous with respect to topology and spectral order.  C++ templates are not directly applicable since we work with large arrays of mixed type.  It is useful to have the innermost loop completely unrolled, thus essentially templating over the last dimension in the tensor product.  If it were practical to sort by element type, it would be possible to hoist the polymorphism out of the loop over elements (using separate loops over elements of each type).  However, changing the order of element traversal negatively impacts data locality, so the polymorphic function call is actually cheaper.</p>
]]></content:encoded>
			<wfw:commentRss>http://dohp.wordpress.com/2008/10/31/implementation-of-polymorphism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/8aedf55ba623f3a7c85e9d834b55002b?s=96&#38;d=identicon" medium="image">
			<media:title type="html">dohp</media:title>
		</media:content>
	</item>
	</channel>
</rss>
