Optimising Compiler Performance: A Case For Devirtualisation

The HRT Beat | Tech Blog

Published May 21, 2024

In the realm of high frequency trading, speed is critical. Programmers working on production systems at firms such as HRT must write code that is highly performant without compromising correctness. To achieve that, we should be aware of how compilers work, the ways they can optimise our code, and potential limitations. Let’s dive a bit deeper into one type of optimisation that compilers perform – devirtualisation.

Introduction – virtual functions

In an inheritance hierarchy, derived classes often specialise some behaviour by overriding certain methods of their base class. This allows for polymorphism: a call made through a reference to the same base type (Shape&) can behave differently depending on the derived type of the object to which it refers (Circle, Square, etc.). In C++, methods whose behaviour can be overridden in this way must be marked as virtual, which instructs the compiler to emit additional logic needed to resolve the proper implementation at runtime. Specifically, for each type with virtual functions, the compiler generates a virtual table (vtable) containing one function pointer for the appropriate implementation of each virtual method of that type. Each object of that type stores a pointer (v-pointer) to the vtable for its type, and virtual method calls are implemented by emitting code that loads the appropriate function pointer from the vtable and calls it.
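
To make this concrete, here is a minimal sketch of such a hierarchy, using the Shape, Circle, and Square names from above (the exact classes are illustrative):

class Shape {
public:
    virtual double area() const = 0;
    virtual ~Shape() = default;
};

class Circle : public Shape {
public:
    explicit Circle(double r) : r_(r) {}
    double area() const override { return 3.141592653589793 * r_ * r_; }
private:
    double r_;
};

class Square : public Shape {
public:
    explicit Square(double s) : s_(s) {}
    double area() const override { return s_ * s_; }
private:
    double s_;
};

// The same call site behaves differently depending on the dynamic type:
// the generated code loads the v-pointer from the object, loads the
// area() entry from the vtable, and calls through that pointer.
double area_of(const Shape& shape) {
    return shape.area();
}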

While virtual functions are an extremely useful feature, they come at a price. The v-pointers make objects larger (which impacts cache efficiency), the vtables take up additional cache space and introduce potential cache misses, the indirect calls can result in branch mispredictions, and the reduced compile-time knowledge prevents inlining and other optimisations. Fortunately, in many cases compilers can give us both flexibility and speed through devirtualisation — deducing which function is needed at compile time and producing machine code that calls it directly instead of going through the vtable. This post will explore when devirtualisation is possible, what the potential gains are, and how to write code for the compiler to optimise this way.

What virtual functions get compiled to

To understand the performance impact of virtualisation, let’s start by examining what a virtual call gets compiled to. Here’s a simple example, in which we call a regular function:

class Base {
public:
    int foo();
};

int f(Base* b) {
    return b->foo() + 0xc0ffee;
}

To inspect what it compiles to, let’s look at the relevant part of the output of objdump -dC -M intel -r after compiling the code with gcc on x86_64:

0000000000000000 <f(Base*)>:
   0:   48 83 ec 08             sub    rsp,0x8
   4:   e8 00 00 00 00          call   9 <f(Base*)+0x9>
   5: R_X86_64_PLT32            Base::foo()-0x4
   9:   48 83 c4 08             add    rsp,0x8
   d:   05 ee ff c0 00          add    eax,0xc0ffee
  12:   c3                      ret

Now, consider a call to a virtual function:

class Base {
public:
    virtual int foo() = 0;
};

int f(Base* b) {
    return b->foo() + 0xc0ffee;
}

Function f in this case compiles to:

0000000000000000 <f(Base*)>:
   0:   48 83 ec 08             sub    rsp,0x8
   4:   48 8b 07                mov    rax,QWORD PTR [rdi]
   7:   ff 10                   call   QWORD PTR [rax]
   9:   48 83 c4 08             add    rsp,0x8
   d:   05 ee ff c0 00          add    eax,0xc0ffee
  12:   c3                      ret

The register rdi initially holds the b argument, and since Base::foo is a virtual method, a lookup in the vtable is necessary. Instead of directly calling a function at a known address, we need two memory accesses: one to obtain the v-pointer and one to obtain the function address. Each memory load is an opportunity for an expensive cache miss. When the call does occur, it is calling an address that was loaded from memory rather than one written into the machine code, which is harder for the processor to predict. Finally, since the compiler doesn’t know what implementation will be invoked, it can’t perform optimisations that involve the context from which the function is called, such as constant propagation and inlining.
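
Conceptually, what the compiler emits for b->foo() corresponds to something like the following C++ (a sketch only; the real vtable layout is defined by the platform ABI, and this reuses Base from the listing above):

using FooFn = int (*)(Base*); // foo's signature, with `this` made explicit

int f_lowered(Base* b) {
    // First load: fetch the v-pointer stored at the start of the object.
    FooFn* vtable = *reinterpret_cast<FooFn**>(b);
    // Second load: fetch the vtable entry for foo() (slot 0 here).
    FooFn foo = vtable[0];
    // Indirect call through a pointer that was just loaded from memory.
    return foo(b) + 0xc0ffee;
}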

Let’s take a look at a benchmark to see how a lack of this optimisation impacts performance.

The benchmark does a simple O(N³) matrix multiplication multiple times. In one case, the function is in the same compilation unit as main, so it can be inlined and constant-propagated; in the other case, it is in a separate compilation unit and compiled on its own. All functions are compiled with -O3 and -mavx. The benchmarks were run with CPU pinning on a modern AMD processor — the first one averaged 19.157s per run, while the second one averaged 19.519s per run. This difference may seem small, but decreasing run time by 2% on the critical path is a big win.

Further, the results are statistically significant; performing a t-test on them gives a p-value of 10⁻⁶⁰.
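
The benchmark source is not reproduced here, but the setup is roughly the following (the matrix size, repetition count, and names are illustrative assumptions, not the original benchmark):

#include <vector>

constexpr int N = 512; // assumed matrix size

// Simple O(N^3) matrix multiplication: c += a * b. In one build this
// lives in the same compilation unit as main (so it can be inlined and
// constant-propagated); in the other it is compiled separately.
void multiply(const std::vector<double>& a,
              const std::vector<double>& b,
              std::vector<double>& c) {
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}

int main() {
    std::vector<double> a(N * N, 1.0), b(N * N, 2.0), c(N * N, 0.0);
    for (int rep = 0; rep < 50; ++rep) // assumed repetition count
        multiply(a, b, c);
    return c[0] > 0.0 ? 0 : 1; // keep the result observable
}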

Compilers are relatively smart

In most real-life cases, the compiler tries to predict which code paths are most likely to be accessed. In certain situations, it can deduce which function needs to be called at compile time and call it directly instead of going through the vtable. Let’s dig into some cases where it’s possible.

Modern compilers can devirtualise a call when they are able to prove that a particular object has a fixed type, such as when we directly create the object of a particular derived type and then call its virtual methods. Devirtualisation is even possible when we have a pointer to the base class, as long as the compiler is sure where the object was created and what type it has.¹ However, this is often not enough in practice, where objects are passed around between different compilation units.
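
For example, in the following sketch (class names are illustrative), the compiler can see exactly where the object was created, so even the call through a base-class pointer can be devirtualised; gcc at -O2 typically compiles test down to a direct constant return:

class Base {
public:
    virtual int foo() { return 1; }
};

class Derived : public Base {
public:
    int foo() override { return 2; }
};

int test() {
    Derived d;
    Base* b = &d;    // a Base*, but the dynamic type is provably Derived
    return b->foo(); // can be devirtualised and inlined to just return 2
}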

To understand why splitting code between different compilation units matters, let’s look at how compilers optimise code. We’ll look closely at gcc, but the general ideas are shared among other implementations as well. In a traditional build workflow, code is first compiled into gcc’s IR, GIMPLE. Then, the optimiser performs various passes through GIMPLE before producing an object file. Multiple object files then get linked into an executable.

This approach makes a lot of sense from a programmer’s point of view – different parts of code are separated into modules and only modules that change need to be recompiled, so the feedback loop is shorter.

However, from the compiler’s point of view, a lot of possible optimisations are missed as a result.

There are two noteworthy compiler flags that change this behaviour and can mitigate these problems.

The first one is -fwhole-program. This flag is useful when optimising small programs that fit into one compilation unit – it tells the compiler that what it is compiling now is the whole program. As a result, it can assume that if nothing derives from a certain class in the current context, then calls to methods on objects of this class do not have to be dynamically resolved through the vtable. Since there are no derived classes that could override the functions, it is safe to devirtualise them. However, fitting everything into one compilation unit can be too restrictive for well-factored programs. To help with that, there is another flag that can result in similar optimisations.

This flag is -flto. LTO stands for link-time optimisation. This flag allows the compiler to treat multiple source files collectively as a single compilation unit for the purpose of optimisation. It significantly changes gcc's output, as it causes it to write GIMPLE into the resulting ELF object file instead of the standard code sections. During linking, all the object files that were created with this flag can be optimised as a single GIMPLE unit, allowing for more aggressive devirtualisation, inlining, constant propagation, and other optimisations. For example, functions from one file can be inlined into another, which can result in propagation of more constants, devirtualisation of more methods, and so on. A noteworthy feature of LTO is that it allows for optimisation across multiple programming languages supported by gcc, as they all get compiled into GIMPLE first, optimised, and then compiled into assembly.
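
As a sketch of how these flags change things (file names and commands are illustrative): with a plain per-file build, gcc must assume that some other compilation unit may derive from Widget, so the call below stays virtual. Built with g++ -O3 -fwhole-program main.cpp, or with g++ -O3 -flto a.cpp b.cpp across several files, the optimiser sees every class in the program and, if no override exists anywhere, may devirtualise the call:

// main.cpp (illustrative)
class Widget {
public:
    virtual void draw();
};

void Widget::draw() {}

void run(Widget& w) {
    // With -fwhole-program or -flto, if no class anywhere in the program
    // derives from Widget and overrides draw(), this indirect call can be
    // replaced with a direct call to Widget::draw.
    w.draw();
}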

Worth mentioning here is the final keyword. It is a way for C++ programmers to hint to the compiler that a method will not be overridden further (or that a class will not be derived from), enabling results similar to -fwhole-program if applied consistently. It is extremely useful in cases where inheritance is mainly used for sharing some base functionality between multiple classes and instances are kept as objects of their final derived types. For example:

class Base {
  public:
    virtual int foo();
    virtual int bar();
};

class Derived1 : public Base {
  public:
    int foo() override;
    int bar() final;
};

int test(Derived1* d1) {
    return d1->foo() + d1->bar();
}

Function test compiles to:

0000000000000000 <test(Derived1*)>:
   0:   55                      push   rbp
   1:   48 89 fd                mov    rbp,rdi
   4:   53                      push   rbx
   5:   48 83 ec 08             sub    rsp,0x8
   9:   48 8b 07                mov    rax,QWORD PTR [rdi]
   c:   ff 10                   call   QWORD PTR [rax]
   e:   48 89 ef                mov    rdi,rbp
  11:   89 c3                   mov    ebx,eax
  13:   e8 00 00 00 00          call   18 <test(Derived1*)+0x18>
  14: R_X86_64_PLT32            Derived1::bar()-0x4
  18:   48 83 c4 08             add    rsp,0x8
  1c:   01 d8                   add    eax,ebx
  1e:   5b                      pop    rbx
  1f:   5d                      pop    rbp
  20:   c3                      ret

We already know from the earlier investigation of function calls that the first call, to foo(), is virtual (since d1 could point to something derived from Derived1, not necessarily a Derived1 itself). However, the second call, to bar(), is not virtual: the method is marked final, so the compiler knows no further override can exist and it is safe to devirtualise.

Help the compiler be smart

In cases where the compiler doesn't have the ability to devirtualise on its own, it is possible to use the Curiously Recurring Template Pattern (CRTP), a C++ idiom important enough to earn a place on cppreference. The basic premise of CRTP is for the derived class to inherit from the base class templated on the derived class itself. This means that for every derived class there is a separate instantiation of the base class, and instead of calling virtual methods on children, the base class can cast this to the derived type and call regular, non-virtual methods. This way, static polymorphism can be implemented for cases where the only calls to would-be virtual methods are made from other methods of the class. A simple example showing CRTP is:

template <typename Derived>
class Base {
  public:
    int foo() {
        return 0xdead0000 + static_cast<Derived*>(this)->bar();
    }
};

class Derived : public Base<Derived> {
  public:
    int bar() {
        return 0xbabe;
    }
};

int test() {
    Derived d;
    return d.foo();
}

Function test compiles (with -O3) into:

0000000000000000 <test()>:
   0:   b8 be ba ad de          mov    eax,0xdeadbabe
   5:   c3                      ret

This shows the compiler was able to optimise out both the call to foo and to bar, and no virtual tables were involved.

Though this method offers performance improvements, it can also significantly limit the way the derived class can be used and make the code more complicated to understand. While sharing similar code across multiple derived classes is possible, there is no real common base class between them. As a result, any functionality that is supposed to be shared between derived classes has to be a template, even if it is not encapsulated in the base class, as the sketch below illustrates.
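
A short sketch of this consequence (the names are hypothetical): Base<Derived1> and Base<Derived2> are unrelated types, so a function meant to work on "any Base" must itself be a template:

template <typename Derived>
class Base {
public:
    int foo() { return static_cast<Derived*>(this)->bar(); }
};

class Derived1 : public Base<Derived1> {
public:
    int bar() { return 1; }
};

class Derived2 : public Base<Derived2> {
public:
    int bar() { return 2; }
};

// There is no single base type to accept by reference; shared logic
// has to be templated over the derived type.
template <typename D>
int call_foo(Base<D>& b) {
    return b.foo();
}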

C++23 offers an interesting alternative to CRTP – deducing this. It allows the Derived template parameter on the class to be replaced by a Self template parameter on just the methods that actually need to call into the derived class. This allows our previous example to be rewritten as something along the lines of:

class Base {
  public:
    template <typename Self>
    int foo(this Self&& self) {
        return 0xdead0000 + self.bar();
    }
};

class Derived : public Base {
  public:
    int bar() {
        return 0xbabe;
    }
};

int test() {
    Derived d;
    return d.foo();
}

An interesting feature of this proposal is that, contrary to CRTP, derived classes have a common base, making it possible to cast up to that class. However, since there is no dynamic dispatch, calling through a reference to the base class deduces Self as the base type, so a method redefined in the derived class still resolves to the base-class implementation, which is counterintuitive and different from the behaviour one might expect.
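
A small sketch of this pitfall (the names are assumed): Self is deduced from the static type of the expression the method is called on, not from the dynamic type of the object:

class Base {
public:
    template <typename Self>
    int foo(this Self&& self) {
        return self.bar();
    }
    int bar() { return 1; }
};

class Derived : public Base {
public:
    int bar() { return 2; } // shadows, but does not override, Base::bar
};

int test() {
    Derived d;
    int a = d.foo(); // Self deduces as Derived&, calls Derived::bar -> 2
    Base& b = d;
    int c = b.foo(); // Self deduces as Base&, calls Base::bar -> 1
    return a + c;
}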

Overall, this proposal offers an alternative to CRTP and makes the resulting code easier to understand, especially in deep inheritance hierarchies, where doubly-derived classes derive from derived classes.

More musings about the compiler trying to be smart and failing

There is one more optimisation worth mentioning, though it can also have harmful effects. It is called speculative devirtualisation, and it works by checking whether the vtable entry equals the address of the implementation deemed most likely to be called. It only works if the compiler is relatively confident about which function to check for; if the guess is correct, a direct call is made instead of a virtual one. This is a great solution when the compiler guesses correctly, as it allows that branch to be potentially inlined (which, as explored previously, brings enormous benefits) and it might help with the binary layout. However, some of the heuristics involved are easy to trigger accidentally, especially in a large codebase shared across many programmers.

As an example, let's say only one implementation is visible at compile time, for instance because it was placed in a header file rather than in a source file. The compiler likes to assume this is the most probable implementation, but it may not be! It may just have been short enough to be reasonably placed in the header file! Or the placement may have been completely arbitrary! Speculating on this implementation can result in very bad branch prediction, cache misses, and unhelpful inlining. While this can be fixed by making none of the implementations visible at compile time and using -flto, C++ programmers should still be aware of these consequences.

To illustrate this interesting case, let’s look at a very simple example.

struct Base {
    long long foo() {
        return bar() + 0xf01dab1e00000000LL;
    }
    virtual long long bar() = 0;
};

struct Derived : public Base {
    long long bar() override {
        return 0x5ca1ab1e;
    }
};

long long test(Base* b) {
    return b->foo();
}

Function test compiles to:

0000000000000000 <test(Base*)>:
   0:   48 8b 07                mov    rax,QWORD PTR [rdi]
   3:   48 8d 0d 00 00 00 00    lea    rcx,[rip+0x0]        # a <test(Base*)+0xa>
   6: R_X86_64_PC32             Derived::bar()-0x4
   a:   48 8b 10                mov    rdx,QWORD PTR [rax]
   d:   48 39 ca                cmp    rdx,rcx
  10:   75 0e                   jne    20 <test(Base*)+0x20>
  12:   48 b8 1e ab a1 5c 1e    movabs rax,0xf01dab1e5ca1ab1e
  19:   ab 1d f0
  1c:   c3                      ret
  1d:   0f 1f 00                nop    DWORD PTR [rax]
  20:   48 83 ec 08             sub    rsp,0x8
  24:   ff d2                   call   rdx
  26:   48 83 c4 08             add    rsp,0x8
  2a:   48 ba 00 00 00 00 1e    movabs rdx,0xf01dab1e00000000
  31:   ab 1d f0
  34:   48 01 d0                add    rax,rdx
  37:   c3                      ret

The compiler has no reason to believe that the provided implementation of bar will be the one most commonly used, but as we can see in the assembly, it loads the v-pointer (QWORD PTR [rdi]), then loads the first vtable entry (QWORD PTR [rax]) and compares it with the address of Derived::bar. If they are not equal, it jumps to a regular indirect call; if they are equal, it simply returns 0xf01dab1e5ca1ab1e, the result of inlining Derived::bar and folding the constants. This layout of the code shows the expectation is that they will be equal, even though the compiler has no real basis for that belief.

Summary

Modern C++ compilers are smart and try hard to find ways to make code faster. As C++ programmers, we should be aware of how compilers were designed and why they behave the way they do, so that we can write code that is likely to be optimised. This involves using certain patterns, such as CRTP, but also avoiding certain pitfalls, such as making only one definition of a virtual function visible. It also always makes sense to measure, measure, and measure the performance of the code some more, and to occasionally look into the assembly that was actually generated, to better understand what the compiler is doing and how to optimise the performance of the resulting binaries.

  1. It is also possible to confuse the compiler with code where it is obvious to a human reader what the type of a particular object is: example here.
