This is the second part of my API design post, and part three in my series of posts about building a profiling library, which is available at github (released to the public domain). The code for the final implementation will be pushed to this repository tomorrow.

This post is very code-oriented, but I promise to have more pretty pictures next time! I would have liked to go into more detail about the implementation, but my time for coding/writing has been severely limited by my one-month-old daughter.

To recap, these are the API design guidelines I think are important to keep in mind during the implementation:

  • Orthogonal. A function should not have any side effects, and there should be only one way to perform an operation in the system.
  • Contained. Following from the rant in my previous post, we should avoid third party dependencies and prefer to use primitive or well-defined data types.
  • Specialized. A function in an API should perform a single task. Don’t create functions that do completely unrelated tasks depending on the contents of the variables passed to the function.
The first thing that I think about is memory management. Most games tend to have their own memory management and debugging systems, which in my book immediately makes raw new/delete/malloc/free calls invalid in a library. If you need to make allocations, you should provide means in the API to set callbacks (or similar constructs) for memory management to let the library use the same implementation as the game using the library.
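As a sketch of what such a hook could look like (this is illustrative only and not part of the library in this post; the `profile_set_memory_callbacks` name and the callback types are hypothetical):

```c
#include <stdlib.h>

//Hypothetical allocator callback types -- the host application installs
//its own allocator so the library never calls malloc/free directly
typedef void* (*profile_malloc_fn)( size_t size );
typedef void  (*profile_free_fn)( void* ptr );

//Default to the C runtime if the game never installs its own hooks
static profile_malloc_fn _profile_malloc = malloc;
static profile_free_fn   _profile_free   = free;

void profile_set_memory_callbacks( profile_malloc_fn malloc_fn, profile_free_fn free_fn )
{
    _profile_malloc = malloc_fn;
    _profile_free   = free_fn;
}
```

Every internal allocation then goes through `_profile_malloc`/`_profile_free`, so the game's own memory tracking sees the library's allocations too.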
In this small library however, I decided to take the easy route and simply require a pre-allocated memory buffer and limit all in-library memory requirements to that buffer. The drawback is of course that the library can only handle a fixed amount of data, but since it’s also continuously writing the data to an output stream the required buffer size is proportional to the rate at which the data is being written. I find that an 8k buffer is plenty of space for this purpose. The internal data structures are laid out as follows:
typedef struct _profile_block_data   profile_block_data;
typedef struct _profile_block        profile_block;

struct _profile_block_data
{
    uint32_t           id;
    uint32_t           parentid;
    uint32_t           processor;
    uint32_t           thread;
    uint64_t           start;
    uint64_t           end;
    char               name[20];
}; //sizeof( profile_block_data ) == 52
struct _profile_block
{
    profile_block_data data;
    uint32_t           previous;
    uint32_t           sibling;
    uint32_t           child;
}; //sizeof( profile_block ) == 64
The blocks can be arranged in a tree structure using the child, sibling and previous pointers (stored as indices, which would fit in 16 bits but are kept at 32 bits to allow atomic swap instructions). The array of blocks is initialized in the library entry point:
void profile_initialize( char* identifier, void* buffer, profilesize_t size )
{
    profile_block* block = buffer;
    profilesize_t num_blocks = size / sizeof( profile_block );
    profilesize_t i;
    for( i = 0; i < ( num_blocks - 1 ); ++i, ++block )
        block->child = ( i + 1 );
    block->child = 0; //Last block terminates the free list
}
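To make the free-list behaviour concrete, here is a self-contained version of the initialization that can be compiled and run as-is (the structs are repeated from above, and `profilesize_t` is assumed to be a 32-bit size type):

```c
#include <stdint.h>

typedef struct _profile_block_data   profile_block_data;
typedef struct _profile_block        profile_block;

struct _profile_block_data
{
    uint32_t           id;
    uint32_t           parentid;
    uint32_t           processor;
    uint32_t           thread;
    uint64_t           start;
    uint64_t           end;
    char               name[20];
};
struct _profile_block
{
    profile_block_data data;
    uint32_t           previous;
    uint32_t           sibling;
    uint32_t           child; //Doubles as next-free index while the block is unused
};

typedef uint32_t profilesize_t; //Assumed 32-bit size type

void profile_initialize( char* identifier, void* buffer, profilesize_t size )
{
    profile_block* block = buffer;
    profilesize_t num_blocks = size / sizeof( profile_block );
    profilesize_t i;
    (void)identifier; //Stream identifier unused in this sketch
    for( i = 0; i < ( num_blocks - 1 ); ++i, ++block )
        block->child = ( i + 1 );
    block->child = 0; //Last block terminates the free list
}
```

After the call, each unused block's child index points at the next free block, so allocation is just popping the head of this list.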
The library keeps a single block as the root of all currently queued blocks, and the operation of adding a block with no parent becomes a simple task of swapping the child pointer (index) of this root block. To make it thread safe we need to use an atomic compare-and-swap instruction:
static void _profile_put_root_block( profile_block* block )
{
    uint32_t self = (uint32_t)( block - _profile_blocks );
    do
    {
        block->sibling = _profile_root.child;
    } while( !_atomic_cas_32( &_profile_root.child, self, block->sibling ) );
}
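The `_atomic_cas_32` primitive itself is not shown in this post. As a sketch, on GCC/Clang it can be a thin wrapper over a compiler builtin (the argument order is assumed to be destination, new value, expected value, returning nonzero on success):

```c
#include <stdint.h>

//Sketch of the CAS primitive, assuming GCC/Clang __sync builtins.
//Stores val into *dst if *dst still equals ref; returns nonzero on success.
static int _atomic_cas_32( volatile uint32_t* dst, uint32_t val, uint32_t ref )
{
    return __sync_bool_compare_and_swap( dst, ref, val );
}
```

On MSVC the equivalent would be built on `_InterlockedCompareExchange`.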

Each block can also be a subblock inside another block. To push such a block we implement a thread-local variable storing the current active block, and can perform a swap of pointers without worrying about thread safety:

static void _profile_put_simple_block( profile_block* block )
{
    //Add to current block, or if no current add to array
    profile_block* masterblock = _thread_data();
    if( masterblock )
    {
        uint32_t self = (uint32_t)( block - _profile_blocks );
        block->previous = (uint32_t)( masterblock - _profile_blocks );
        block->sibling = masterblock->child;
        if( masterblock->child )
            _profile_blocks[ masterblock->child ].previous = self;
        masterblock->child = self;
    }
    else
    {
        _profile_put_root_block( block );
    }
}

Once we want to flush the queued blocks, we simply iterate the list of queued blocks in the root block and process each tree of blocks. Again, we can get away without a lock by doing an atomic compare-and-swap on the single child pointer of the root block:

uint32_t block;
do
{
    block = _profile_root.child;
} while( !_atomic_cas_32( &_profile_root.child, 0, block ) );
A recursive function can then traverse the block tree and rearrange the child/sibling pointers in order to create a single list of blocks arranged through the child pointer. Since only closed blocks are added to the root list, the traversal can be done without worrying about other threads modifying the data:
static profile_block* _process_profile_block( profile_block* block )
{
    profile_block* leaf = block;
    //[...] process block and write to output stream
    if( block->child )
        leaf = _process_profile_block( _profile_blocks + block->child );
    if( block->sibling && !block->child )
    {
        block->child = block->sibling;
        block->sibling = 0;
        leaf = _process_profile_block( _profile_blocks + block->child );
    }
    if( block->sibling )
    {
        profile_block* subleaf = _process_profile_block( _profile_blocks + block->sibling );
        subleaf->child = block->child;
        block->child = block->sibling;
    }
    return leaf;
}
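To convince ourselves the traversal really flattens a tree into a child-linked list, here is a condensed, self-contained version that can be run directly (only the link fields are kept, and the output step is stubbed out):

```c
#include <stdint.h>

//Condensed block with only the link fields; indices refer to _profile_blocks
typedef struct _profile_block profile_block;
struct _profile_block
{
    uint32_t previous;
    uint32_t sibling;
    uint32_t child;
};

static profile_block _profile_blocks[ 8 ]; //Zero-initialized toy block array

static profile_block* _process_profile_block( profile_block* block )
{
    profile_block* leaf = block;
    //[...] a real implementation writes the block data to the output stream here
    if( block->child )
        leaf = _process_profile_block( _profile_blocks + block->child );
    if( block->sibling && !block->child )
    {
        block->child = block->sibling;
        block->sibling = 0;
        leaf = _process_profile_block( _profile_blocks + block->child );
    }
    if( block->sibling )
    {
        profile_block* subleaf = _process_profile_block( _profile_blocks + block->sibling );
        subleaf->child = block->child;
        block->child = block->sibling;
    }
    return leaf;
}
```

Building a small tree (block 1 with two children, 2 and 3, linked through the sibling index) and processing it leaves the blocks in a single 1 → 2 → 3 chain through the child index, with the returned leaf pointing at the last block in the chain.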
In the end, we get a stream of profiling block data detailing the time spent in selected parts of the code, and the implementation goals are maintained. The code is lock-free with very low overhead, and the single pre-allocated buffer for all storage keeps it free of third-party dependencies. Each function is specialized and free of side effects thanks to thread-local storage and atomic swap instructions.

In the next post I’ll present a simple UI for analyzing the data.