My team at work has spent a long while porting our core technology stack to a variety of platforms. We currently support fifteen, with many more coming soon.

The biggest benefit to come of all of this, however, is an increase in the quality of our code. In addition to dealing with the peculiarities of over fifteen compilers, especially around how some of them deal with aliasing, we get exposure to a great many extra tools. Supporting Xbox 360 gives you access to Microsoft’s static code analyzer, and Mac OS and Linux both provide access to clang’s static analyzer and valgrind.

Valgrind is a collection of tools based around a VM, not entirely unlike the JVM or the .NET CLR, but where the opcodes are the native instructions for your platform. The Valgrind runtime breaks down the instruction stream into its own SSA format, which its plugins can then operate on. One plugin, memcheck, has been especially informative: it keeps track of which memory is initialized (both stack and heap) and which isn’t, catches reads and writes beyond allocation boundaries, and checks that calls to certain standard library functions (like strcat, strncat, strcpy, strncpy, memcpy) are well formed with regard to overlapping memory regions and necessary destination sizes.
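To make the overlap rule concrete, here is a minimal sketch of the kind of test involved. RangesOverlap and CheckedCopy are hypothetical helpers of my own, not part of valgrind, and this is not how memcheck implements its check; it only illustrates the condition that makes a memcpy call invalid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a valgrind API): true when the byte ranges
// [a, a + n) and [b, b + n) share at least one byte.
bool RangesOverlap(const void* a, const void* b, std::size_t n)
{
    const std::uintptr_t pa = reinterpret_cast<std::uintptr_t>(a);
    const std::uintptr_t pb = reinterpret_cast<std::uintptr_t>(b);
    const std::uintptr_t lo = pa < pb ? pa : pb;
    const std::uintptr_t hi = pa < pb ? pb : pa;
    return n != 0 && hi - lo < n;
}

// Hypothetical wrapper: memcpy is undefined for overlapping ranges,
// so fall back to memmove, which is specified to handle them.
void* CheckedCopy(void* dst, const void* src, std::size_t n)
{
    assert(dst != nullptr && src != nullptr);
    return RangesOverlap(dst, src, n) ? std::memmove(dst, src, n)
                                      : std::memcpy(dst, src, n);
}
```

In real code you would normally just call memmove whenever overlap is possible; the point here is only to show what "overlapping" means for a copy of n bytes.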

Why is this handy? Consider this code:

  int AddThree(const int *a)
  {
      return *a + 3;
  }

  int Foo()
  {
      int a;  // never initialized
      return AddThree(&a);
  }

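Most compilers will run right by this code without seeing any problems. Down the road, when the result of Foo() is used in a way that matters (here it is passed to printf, which eventually writes it out), valgrind will point out that the memory was uninitialized. The last two lines of the report, which trace the value back to the stack allocation in Foo(), appear when memcheck is run with --track-origins=yes.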

  ==22064== Syscall param write(buf) points to uninitialised byte(s)
  ==22064==    at 0x2C11BA: write$NOCANCEL (in /usr/lib/system/libsystem_kernel.dylib)
  ==22064==    by 0x17B59D: __sflush (in /usr/lib/system/libsystem_c.dylib)
  ==22064==    by 0x1A6F6C: __sfvwrite (in /usr/lib/system/libsystem_c.dylib)
  ==22064==    by 0x175990: __vfprintf (in /usr/lib/system/libsystem_c.dylib)
  ==22064==    by 0x17118D: vfprintf_l (in /usr/lib/system/libsystem_c.dylib)
  ==22064==    by 0x17A2CF: printf (in /usr/lib/system/libsystem_c.dylib)
  ==22064==    by 0x100000E26: main (test.cpp:26)
  ==22064==  Uninitialised value was created by a stack allocation
  ==22064==    at 0x100000D70: Foo() (test.cpp:10)

Unfortunately, running under valgrind is much more memory and CPU intensive. With memcheck your program will run 20-30x slower than normal, and other plugins are even more demanding. Memcheck also doesn’t get the same information if you use your own replacements for standard library functions, like the aforementioned strcpy, strncpy, strcat, strncat, and memcpy. And although valgrind provides macros for telling the runtime about the state of memory blocks handed out by your own custom allocators, they don’t seem to work quite as well as the built-in support for the system malloc/free functions.
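For reference, this is roughly what annotating a custom allocator looks like. It is a sketch under assumptions, not our production code: BumpArena is a hypothetical toy allocator, and I assume the headers live at valgrind/memcheck.h, which is where they are usually installed. VALGRIND_MAKE_MEM_NOACCESS, VALGRIND_MALLOCLIKE_BLOCK, and VALGRIND_FREELIKE_BLOCK are the real client-request macros; the no-op fallbacks exist only so the code still builds on machines without the headers.

```cpp
#include <cassert>
#include <cstddef>

// Use the real client-request macros when the valgrind headers are present;
// otherwise define no-op stand-ins (an assumption made for portability).
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define HAVE_VALGRIND_HEADERS 1
#  endif
#endif
#ifndef HAVE_VALGRIND_HEADERS
#  define VALGRIND_MAKE_MEM_NOACCESS(addr, len)             ((void)(addr))
#  define VALGRIND_MALLOCLIKE_BLOCK(addr, size, rz, zeroed) ((void)(addr))
#  define VALGRIND_FREELIKE_BLOCK(addr, rz)                 ((void)(addr))
#endif

// Hypothetical bump allocator over a fixed buffer, annotated for memcheck.
class BumpArena
{
public:
    BumpArena()
    {
        // Nothing has been handed out yet, so any access to the buffer is a bug.
        (void)VALGRIND_MAKE_MEM_NOACCESS(storage_, sizeof(storage_));
    }

    void* Allocate(std::size_t size)
    {
        assert(size > 0);
        const std::size_t aligned = (size + kAlign - 1) & ~(kAlign - 1);
        if (aligned > sizeof(storage_) - used_)
            return nullptr;  // out of space
        void* p = storage_ + used_;
        used_ += aligned;
        // Tell memcheck this range now behaves like a fresh malloc block:
        // addressable, but with undefined contents (no red zone, not zeroed).
        VALGRIND_MALLOCLIKE_BLOCK(p, size, 0, 0);
        return p;
    }

    void Release(void* p)
    {
        // The arena never reuses space, but marking the block as freed lets
        // memcheck report use-after-release and leaks for it.
        VALGRIND_FREELIKE_BLOCK(p, 0);
    }

private:
    static constexpr std::size_t kAlign = 16;
    alignas(16) unsigned char storage_[4096];
    std::size_t used_ = 0;
};
```

Outside of valgrind the client requests cost only a few instructions, so annotations like these can generally stay in regular builds.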

In addition to memcheck, valgrind comes with several other plugins. Cachegrind profiles how your program interacts with the processor caches and can also track branch (mis)prediction. Here’s some sample output from a naive matrix transposition function, with stats at the program, function, and source line level. The ‘I’ stats are for the instruction cache, the ‘D’ stats for the data cache. The ‘r’ and ‘w’ columns count total reads and writes, while ‘mr’ and ‘mw’ count read misses and write misses. The miss counts are split between the level 1 cache (I1, D1) and the last-level cache (IL, DL), which could be L2 or L3 depending on your processor’s architecture.

         Ir  I1mr  ILmr        Dr   D1mr   DLmr        Dw      D1mw      DLmw
  8,989,376 1,739 1,491 1,588,449 70,122 67,907 1,210,659 1,049,203 1,049,073  PROGRAM TOTALS
         Ir I1mr ILmr        Dr   D1mr   DLmr        Dw      D1mw      DLmw  file:function
  6,298,642    2    2 1,048,579 65,537 65,537 1,048,580 1,048,576 1,048,576  /Users/max/src/tmp//test.cpp:main
  -- Auto-annotated source: /Users/max/src/tmp//test.cpp
         Ir I1mr ILmr        Dr      D1mr      DLmr        Dw   D1mw   DLmw 
          .    .    .         .         .         .         .      .      .  void transpose(float * __restrict out, const float * __restrict in)
          .    .    .         .         .         .         .      .      .  {
      1,027    0    0         0         0         0         0      0      0      for (int y = 0; y < 1024; ++y)
          .    .    .         .         .         .         .      .      .      {
  1,054,721    0    0         0         0         0         0      0      0          for (int x = 0; x < 1024; ++x)
          .    .    .         .         .         .         .      .      .          {
  5,242,880    0    0 1,048,576 1,048,576 1,048,576 1,048,576 65,536 65,536              out[x * 1024 + y] = in[y * 1024 + x];
          .    .    .         .         .         .         .      .      .          }
          .    .    .         .         .         .         .      .      .      }
          .    .    .         .         .         .         .      .      .  }
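The usual response to numbers like these is to walk the matrix in small tiles, so that the strided side of the copy (the writes to out, which are 1024 floats apart) stays within a handful of cache lines at a time. What follows is a sketch of that idea, not code from our codebase: TransposeBlocked and the tile width kTile are my own hypothetical names, and 16 floats per tile assumes 64-byte cache lines. I haven't included measurements for it, so re-run cachegrind on your own hardware before trusting it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tile width: 16 floats = 64 bytes, a common cache-line size.
constexpr std::size_t kTile = 16;

// Transposes an n x n row-major matrix one kTile x kTile tile at a time.
// Assumes n is a multiple of kTile and that in and out do not overlap.
void TransposeBlocked(float* out, const float* in, std::size_t n)
{
    assert(n % kTile == 0);
    for (std::size_t by = 0; by < n; by += kTile)
    {
        for (std::size_t bx = 0; bx < n; bx += kTile)
        {
            // Within a tile, both in and out touch only kTile distinct rows.
            for (std::size_t y = by; y < by + kTile; ++y)
            {
                for (std::size_t x = bx; x < bx + kTile; ++x)
                {
                    out[x * n + y] = in[y * n + x];
                }
            }
        }
    }
}
```

The result is identical to the naive loop; only the order in which elements are visited changes.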

There is also callgrind, which is similar to cachegrind but generates call graphs as well; helgrind and DRD, which help detect errors in multithreaded code such as data races and incorrect use of threading primitives; and massif and DHAT, which profile your heap (and stack) usage. Finally, there is an experimental tool called SGCheck, which aims to detect global and stack array overruns.

Even though I’ve barely scratched the surface of what valgrind is capable of, I’ve found it to be immensely useful when tracking down memory related bugs. Because it’s free and really easy to use (no special libraries required, just run ‘valgrind foo’), it’s easy to promote to others on your team, and also easy to hook into your automated tests if you have any.

Besides, one can never have too many tools at their disposal!