Database programming is fun!

Alex Evans 7:23 pm on March 7, 2011

In recent years, there’s been a resurgence of interest in the low levels of databases, most often in the form of key-value stores. My favourite amongst the current crop is the same chap) and suddenly had a brainwave: Linux implements ‘copy on write’ semantics for memory blocks when a process forks. That means that my database could just ‘fork’ when it wanted to save a checkpoint. The child process gets an unchanging copy of all of the memory, just for the cost of a page-table copy in the OS; any writes in the parent process just cause the OS to do all the work for us, and duplicate the pages that are changing. The child can checkpoint at its leisure.

all fork()s lead to Redis

The simplicity of this was striking – the entire checkpoint algorithm effectively boiled down to fork(); fopen(); fwrite(); exit(); – but, not being a Linux guru, I was worried there must be something wrong that I was missing. A bit of googling led me straight to the then-nascent redis project, which uses the technique itself, or used to: back then it was a single source file of elegant C and hadn’t really been ‘noticed’. I’ve been following it ever since, and I really think its creator @antirez is a bit of a genius…
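For the curious, the fork(); fopen(); fwrite(); exit(); sequence can be sketched roughly as below. This is an illustration under assumptions, not AlexDB's or redis's actual code: the function names checkpoint_async and checkpoint_wait are made up for the example, the state is treated as one contiguous buffer, and the child uses _exit() rather than exit() so that it doesn't flush stdio buffers inherited from the parent.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: snapshot `len` bytes at `state` into `path` from a
 * forked child. Returns the child's pid to the parent, or -1 if fork failed. */
pid_t checkpoint_async(const void *state, size_t len, const char *path)
{
    assert(state != NULL && path != NULL);

    pid_t pid = fork();
    if (pid != 0)
        return pid; /* parent (pid > 0) carries on; pid == -1 means fork failed */

    /* Child: it sees a frozen copy-on-write view of the parent's memory,
     * so it can write at its leisure while the parent keeps mutating. */
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        _exit(1);
    size_t written = fwrite(state, 1, len, f);
    int close_failed = (fclose(f) != 0);
    _exit((written == len && !close_failed) ? 0 : 1);
}

/* Hypothetical helper: block until the checkpoint child finishes.
 * Returns 0 if the child reported success, -1 otherwise. */
int checkpoint_wait(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The parent would call checkpoint_async() whenever it wants a snapshot and keep serving requests; a real server would more likely reap the child from a SIGCHLD handler or a non-blocking waitpid() than block in checkpoint_wait().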

Anyway, beyond finding a cool project to follow, the redis source gave me hope that the fork() technique was going to work; now I just had to tackle the long loading times (ooh, game programming parallel again!)

AlexDB (cringe) has a text serialisation format, making it very easy to read, debug, and filter its write-log files. Originally, the checkpoint files were written & read in this format. However, the time taken to read, parse and act on a 500,000,000 line file was proving prohibitive – and you paid on every boot. That made the compile-edit-test cycle somewhat awful.

The solution I present here for your amusement, as much as your edification:

malloc(34359738368)

I decided that the server should just dump its state to disk with a single fwrite() type command. That way, I can just be disk-bound. But how to deal with pointers?

In true embedded system style, the database program calls malloc() once on boot, grabbing a single 32 gig block of ram. All pointers in the program were replaced (with a bit of C++ template magic and some macros for the C bits) with 32 bit indices, indexing from the single base pointer in steps of 8 bytes. This made the entire memory image completely re-locatable, and had the added advantage that all pointers were now only 4 bytes instead of 8, while still allowing 32 gigs to be addressed. (This was a huge win, in a system where memory savings are EVERYTHING.) The checkpoint code became the desired single fwrite() (in fact, a gzwrite – we ended up compressing, trading a bit of CPU usage to avoid being quite so disk-write-bound) – and the system felt suitably simple, eccentric and fast that I could happily move on to the next problem.
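The arithmetic behind the heading: 2^32 possible indices × 8 bytes per step = 2^35 bytes = 34,359,738,368 bytes, i.e. 32 gigs. A minimal C sketch of the index scheme follows. It is not the original template-and-macro code; the names (ref_t, arena_t, ref_to_ptr, ptr_to_ref, node_t) are invented for illustration, and reserving index 0 as a null value is an assumption of this sketch rather than something the post states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t ref_t;   /* 4-byte index stored in place of an 8-byte pointer */
#define REF_GRANULE 8u    /* each index step covers 8 bytes */
#define REF_NULL    0u    /* assumption: index 0 is reserved to mean "null" */

typedef struct {
    unsigned char *base;  /* the single base pointer, from one big malloc() */
    size_t size;          /* bytes in the block */
} arena_t;

/* Grab the one block on boot. Returns 0 on success, -1 on failure. */
int arena_init(arena_t *a, size_t bytes)
{
    a->base = malloc(bytes);
    a->size = (a->base != NULL) ? bytes : 0;
    return (a->base != NULL) ? 0 : -1;
}

/* Turn an index back into a real pointer, relative to wherever the block
 * happens to live in this process. */
void *ref_to_ptr(const arena_t *a, ref_t r)
{
    if (r == REF_NULL)
        return NULL;
    return a->base + (size_t)r * REF_GRANULE;
}

/* Turn a pointer inside the block into an index. The pointer must be
 * 8-byte aligned relative to the base and within 32-bit index range. */
ref_t ptr_to_ref(const arena_t *a, const void *p)
{
    if (p == NULL)
        return REF_NULL;
    uint64_t off = (uint64_t)((const unsigned char *)p - a->base);
    assert(off % REF_GRANULE == 0);
    assert(off / REF_GRANULE <= UINT32_MAX);
    return (ref_t)(off / REF_GRANULE);
}

/* Example record: a linked-list node that stores an index, not a pointer. */
typedef struct {
    ref_t   next;
    int32_t value;
} node_t;
```

Because nothing inside the block ever stores an absolute address, the bytes can be written out verbatim and read back into a block at a different address; only the base field of arena_t changes.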

(You may ask why I didn’t use mmap() to just keep the big 32 gig block mapped to a file. The answer is – I wasn’t sure how that would work, or how you might control the rate at which the OS flushed pages to disk. I also wanted to be able to have a ‘history’ of checkpoints, written to a temporary file and atomically renamed when they were known good – so that I could avoid coming across ‘half baked’ checkpoints. But I’m sure one could come up with a nice system built around something like mmap. Go ahead!)
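The write-to-a-temporary-file-then-rename idea can be sketched like this. It is a guess at the general shape rather than the real AlexDB code: the function name write_checkpoint_atomically and its parameters are hypothetical, plain fwrite() stands in for the gzwrite mentioned earlier, and the fsync() before the rename is an extra precaution this sketch adds. On POSIX systems rename() replaces the destination atomically, provided both paths are on the same filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to `tmp_path`, and only once the
 * data is safely written rename it over `final_path`. Readers therefore see
 * either the previous checkpoint or the complete new one, never a half-baked
 * file. Returns 0 on success, -1 on failure. */
int write_checkpoint_atomically(const void *data, size_t len,
                                const char *final_path, const char *tmp_path)
{
    assert(data != NULL && final_path != NULL && tmp_path != NULL);

    FILE *f = fopen(tmp_path, "wb");
    if (f == NULL)
        return -1;

    int ok = (fwrite(data, 1, len, f) == len);
    ok = ok && (fflush(f) == 0);          /* push stdio buffers to the kernel */
    ok = ok && (fsync(fileno(f)) == 0);   /* ask the kernel to hit the disk */
    ok = (fclose(f) == 0) && ok;          /* always close, even after an error */

    if (!ok || rename(tmp_path, final_path) != 0) {
        remove(tmp_path);                 /* don't leave a partial file behind */
        return -1;
    }
    return 0;
}
```

Keeping a history of checkpoints would then just be a matter of choosing a fresh final_path (say, one with a sequence number in it) for each call.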

AlexDB went on sprouting more and more features – as these things do – growing HTTP clients and servers, efficient resource serving, image manipulation, and even an embedded ruby interpreter; and while such a large system inevitably involves its fair share of screaming and hate, overall, I’m very happy that we put the effort in to build & understand such a system. The resulting code-base is still smaller (by LOC) and cleaner than its predecessor, despite implementing everything itself, and can be built very simply with few dependencies.

What was the point of this post again? To share a few small insights gleaned while trying to write an efficient database; to praise the genius of the authors & maintainers of such brilliant bits of software as Redis & postgresql – for whom I have a new, experience-based respect; and to encourage game engine programmers to venture into the muddy waters of online server development – there’s so much FUD around it, the only way to dispel the fog is to dive in and find out for yourself.

hey h8ers, you gonna h8 this. so comment!

If you’re the sort of person who’d rather use someone else’s graphics engine, you’ll probably rather use someone else’s database – and that’s fine! I’m surprised you’ve read this far; thank you! But, if you get a kick out of understanding your whole stack, top to bottom – in the time-honoured old school tradition of game development – I can’t recommend the world of data storage programming enough. After all, online, data-driven and community-creating features are going to be writ large in the future of pretty much every type of game. And understanding how your MySQL database works from top to bottom, or writing your own, is no bad thing.
