A colleague of mine has a server performance problem. It’s a regularly scheduled task that has to work through tons of raw data in a database. The task is reasonably fast except in rare cases. Naturally, over time the volume of data will only increase; so what can we do to make it more reliable and predictable?
Okay, first off, I cheated a bit. This problem hasn’t just happened to one colleague, it’s happened to virtually all of them at one point or another. Slow tasks occur naturally due to the inevitable deterioration of software systems over time – more data accumulates, other features get added, the operating system gets patched, but the task still runs on the same old hardware.
When you first write the task it runs great: Compile, build, tune, and it’s done. Then you forget about it for a few months and it gets slower and slower every time it runs. It’s like the frog in the saucepan – because the change is so gradual, you don’t think about it until it reaches the point where it causes pain to everyone.
If you have a chance to design these tasks fresh from the start, there are lots of great ideas about how to build them so they can scale. But let’s spend today’s article talking about what you can do to diagnose and fix a program – one that isn’t yours – without radical surgery or rewriting it from the ground up.
Isolate Your Environment
Before beginning a performance tuning project, you need to isolate your program. It’s generally not a good idea to do performance tuning directly on a live server, but most development environments need careful configuration before they produce useful performance measurements. This may actually be the toughest step in your work: the programs that get ignored (and gradually accumulate performance problems) tend to be the ones with lots of hidden dependencies.
So, let’s take a few first steps:
- Set up an isolated development environment. If you can run the entire program on a single desktop or laptop, great; if not, let’s restore to a new machine or VM. If the program requires multiple machines, go get your disaster recovery plans and use them to restore a working environment. If the plans don’t work, this is a great opportunity to fix them and get them right!
- Use virtualization liberally. Hopefully you have a big VMware or Xen cluster in your office; if not, just pay Amazon or Rackspace to host an instance for a few days. Most importantly, write scripts to set up these servers. Write a script to do all the manual fixups that the environment requires (e.g. installing programs, changing configuration settings, and copying files). Eventually, the goal is to hit a button and have a test / development environment magically appear.
- Once the environment is restored, get your unit tests running. Verify that all your unit tests work. You’d be surprised how often these unit tests fail! Keep at it until your isolated environment passes all the tests. If you don’t have tests, well … now would be a great time to write a few.
- Refactor the program and break out the task you want to improve. I virtually guarantee you that your task is wrapped up in lots of layers of unnecessary code. Try to redo the task so it can be run all by itself, without triggering any other systems. Ideally, it could be a command line program run with a small selection of options.
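As a rough sketch of what such a command line entry point might look like, assuming Python: `run_task` is a hypothetical stand-in for the task you extracted, and the `--batch-size` and `--dry-run` options are illustrative choices rather than anything the original program defines.

```python
import argparse


def run_task(batch_size: int, dry_run: bool) -> int:
    """Hypothetical stand-in for the extracted task; returns items processed."""
    items = list(range(batch_size))  # placeholder for loading real work items
    if dry_run:
        return 0  # items were loaded, but nothing was processed
    return len(items)


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the small set of options; argv=None means use the real command line."""
    parser = argparse.ArgumentParser(description="Run the extracted task on its own.")
    parser.add_argument("--batch-size", type=int, default=100,
                        help="maximum number of items to process in this run")
    parser.add_argument("--dry-run", action="store_true",
                        help="load the work items but do not process them")
    return parser.parse_args(argv)


def main(argv=None) -> int:
    """Run the task once and return a process exit code."""
    args = parse_args(argv)
    processed = run_task(args.batch_size, args.dry_run)
    print(f"processed {processed} items")
    return 0
```

Calling `main(["--batch-size", "5"])` runs the task from a test, while calling `main()` from an `if __name__ == "__main__":` guard reads the options from the real command line. Keeping the parsing separate from the work makes the task easy to trigger without any of the surrounding systems.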
With this in place, you have everything you need to start breaking down the problem.
Monitoring Is People
Next, you need to know how the task is working in order to know what to improve. You need some statistics; meaningful ones, something better than CPU/Memory/Disk utilization. Here are some useful statistics you might want to gather:
- Duration – How long does the task take to run? How long did it take back when everyone used to be happy with it?
- Errors & Warnings – How many errors, if any, does the task encounter when it runs? Do you have a full log output from each time the task runs?
- Frequency – How often is it supposed to run? Is it supposed to run daily, but in practice only runs weekly? Do its users wish it ran continuously?
- Queue size – How many items are waiting to be processed?
- Average processing time – How long does an average item take to process?
Monitoring pairs well with a command line task. I like to build long-running server tasks as command line executables that record their progress in the database when they launch and when they finish. I can then use a clever performance monitoring package to check how long each run took. I also write the task’s console output logs to disk and cycle them out after a reasonable period of time – say 90 days.
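One way to record launch and finish times might look like the sketch below. It assumes SQLite as a stand-in for whatever database the task already uses, and the `task_runs` table and its columns are hypothetical names invented for this example.

```python
import sqlite3
import time
from contextlib import contextmanager

# Hypothetical schema: one row per run of the task.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_runs (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    task_name   TEXT NOT NULL,
    started_at  REAL NOT NULL,
    finished_at REAL,
    status      TEXT NOT NULL DEFAULT 'running',
    error       TEXT
)
"""


def _finish(conn, run_id, status, error):
    """Stamp the finish time and final status on an existing run row."""
    conn.execute(
        "UPDATE task_runs SET finished_at = ?, status = ?, error = ? WHERE id = ?",
        (time.time(), status, error, run_id),
    )
    conn.commit()


@contextmanager
def recorded_run(conn: sqlite3.Connection, task_name: str):
    """Insert a row when the task launches and update it when it finishes."""
    conn.execute(SCHEMA)
    cur = conn.execute(
        "INSERT INTO task_runs (task_name, started_at) VALUES (?, ?)",
        (task_name, time.time()),
    )
    run_id = cur.lastrowid
    conn.commit()
    try:
        yield run_id
    except Exception as exc:
        _finish(conn, run_id, "failed", repr(exc))
        raise  # the failure is recorded, then passed on to the caller
    else:
        _finish(conn, run_id, "ok", None)


def durations(conn: sqlite3.Connection, task_name: str):
    """Return (run id, status, seconds taken) for every finished run."""
    rows = conn.execute(
        "SELECT id, status, finished_at - started_at FROM task_runs "
        "WHERE task_name = ? AND finished_at IS NOT NULL ORDER BY id",
        (task_name,),
    )
    return rows.fetchall()
```

Wrapping the task body in `with recorded_run(conn, "nightly-import"):` leaves a row behind even when the task raises, so a run that never finished shows up as a row still marked `running`. The `durations` query gives the history you need to answer "how long did it take back when everyone was happy with it?"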
Fixing Performance
So now that you’ve got a working environment, try drawing up a flowchart of the steps the application goes through. Break it down to meaningful levels and explain it to the guy or girl sitting next to you, even if that person doesn’t know anything about the program (in fact, if you have to explain it to somebody new, it’s often better – the challenge forces you to clarify your thoughts).
With this flowchart in hand, let’s start trying to figure out what kind of a performance problem we have.
- Sequential tasks that could be run in parallel – Check the dependencies in each stage; can you fire off three steps at once and then just wait for them all to finish?
- Data that can be pre-computed and cached – Is there any work that can be moved outside of the task? For example, let’s imagine my program builds a taxi price chart for London and updates it each week. It doesn’t make sense to have my program query Google Maps for distance information each week; instead, I should gather that data once, cache it, and only update it on demand. Perhaps you should set an “age” limit on each cached data element and have a secondary task go through and purge the old data periodically?
- Work that can be divided – Sooner or later, in every big complex task, you find the core: the gigantic loop that hits every data element in your project. When you finally find this part you’ll know you’re getting close to the end. This is the component that you want to “Map/Reduce”: you have tons of data and a small element of code that must be run on each item of data.
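The three ideas in the list above can each be sketched in a few lines. This is only an illustration, assuming Python and threads: `run_in_parallel`, `process_items`, and `AgedCache` are hypothetical names, and `purge_stale` is one possible reading of what the secondary clean-up task would do.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Hashable, Tuple


def run_in_parallel(steps):
    """Start every independent step at once, then wait for all of them to finish."""
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as executor:
        futures = [executor.submit(step) for step in steps]
        return [future.result() for future in futures]  # same order as steps


def process_items(work, items, workers: int = 10):
    """Apply one small function to every item, several items at a time."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(work, items))  # results keep the input order


class AgedCache:
    """Cache values and treat entries older than max_age_seconds as stale."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_age = max_age_seconds
        self.clock = clock  # injectable so tests can control time
        self._entries: Dict[Hashable, Tuple[float, object]] = {}

    def get(self, key, compute: Callable[[], object]):
        """Return the cached value, calling compute() only if missing or stale."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= self.max_age:
            return entry[1]
        value = compute()
        self._entries[key] = (now, value)
        return value

    def purge_stale(self) -> int:
        """Remove entries past the age limit; returns how many were removed."""
        now = self.clock()
        stale = [key for key, (stored_at, _) in self._entries.items()
                 if now - stored_at > self.max_age]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

In the taxi example, the key might be a pair of locations and `compute` the expensive distance lookup, so the weekly run only pays for routes whose cached distance has aged out. In CPython, threads mainly help when the work waits on I/O such as database or network calls; for CPU-bound work, `ProcessPoolExecutor` offers the same interface.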
What happens if you can’t map/reduce your data? Don’t worry: even if you’re not writing for a Hadoop environment you can still split up the work. Consider writing an inherently parallel version of your program that runs multiple items at once. What prevents you from running two threads at once? Ten? Twenty? What prevents you from having dozens of servers doing this work at the same time?
The answer is likely to be well understood by anyone who does parallel work: synchronization and bottlenecks. You synchronize your tasks by making sure each server doesn’t interfere with the next one, and you identify the bottlenecks by noticing where the pain occurs. Let’s start with some simple pseudocode that could be used as a wrapper around your project:
```
function MyGiganticLoop()
{
    // Retrieve all the cached information I need in order to work
    Setup();

    while (true) {

        // Contact a centralized server to obtain one work item
        var WorkItem = RequestWorkFrom(DISPATCHER_HOST_NAME);
        if (WorkItem != null) {

            // This is my "high intensity" function that takes a lot of processing time
            var WorkResults = DoMyTask(WorkItem);

            // Update the work item with my results
            MarkWorkItemComplete(WorkItem, WorkResults, DISPATCHER_HOST_NAME);

        } else {

            // The queue must be empty - I'll sleep for a reasonable time and check again
            Sleep(1000);
        }
    }
}
```
This code is useful because it uses a centralized dispatcher server to ensure that each task client continues to work as fast as possible. Ideally, the centralized server keeps a list of all the work items that need to be executed. Each time a client contacts it, the dispatcher picks the next item, marks it as “in progress”, and hands it to that client. If the client crashes or fails to complete a work item in time, the dispatcher can return the item to the queue and allow another client to try.
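The dispatcher side of that conversation might be sketched like this, assuming Python and a single in-memory process. The `Dispatcher` class, its method names, and the lease timeout are assumptions made for illustration; a real dispatcher would sit behind a network API and keep its queue somewhere durable.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Optional


class Dispatcher:
    """In-memory work queue that re-queues items whose lease has expired."""

    def __init__(self, items, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.pending: Deque[str] = deque(items)
        self.in_progress: Dict[str, float] = {}  # item -> lease deadline
        self.results: Dict[str, object] = {}
        self.lease_seconds = lease_seconds
        self.clock = clock  # injectable so tests can control time

    def _requeue_expired(self) -> None:
        """Move items whose client ran out of time back onto the queue."""
        now = self.clock()
        expired = [item for item, deadline in self.in_progress.items()
                   if deadline <= now]
        for item in expired:
            del self.in_progress[item]
            self.pending.append(item)

    def request_work(self) -> Optional[str]:
        """Hand out one item and mark it in progress, or None if nothing is pending."""
        self._requeue_expired()
        if not self.pending:
            return None
        item = self.pending.popleft()
        self.in_progress[item] = self.clock() + self.lease_seconds
        return item

    def mark_complete(self, item: str, result: object) -> bool:
        """Record a result; reject reports for items that are not currently leased."""
        if item not in self.in_progress:
            return False
        del self.in_progress[item]
        self.results[item] = result
        return True
```

Expired leases are only noticed when the next client asks for work, which keeps the sketch free of background threads. A late report for an item that has already been handed to someone else is rejected, so each result is recorded once; whether that is the right policy depends on how expensive it is to repeat a work item.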
More importantly, in this pattern, you have the option to scale pretty much linearly (provided your central dispatcher isn’t a performance hog itself). You can simply monitor the work queue and spin up new instances when it falls behind, and shut them down when you catch up.
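The scaling decision itself can start out as simple arithmetic. The sketch below assumes every item takes roughly the same time and that throughput really does scale linearly with the number of workers; `workers_needed` and its parameters are hypothetical names for this example.

```python
import math


def workers_needed(queue_length: int, seconds_per_item: float,
                   target_seconds: float, max_workers: int = 20) -> int:
    """Estimate how many workers would clear the queue within target_seconds.

    Assumes each item costs about seconds_per_item and that work divides evenly.
    """
    if queue_length <= 0:
        return 0  # nothing waiting, so every instance can be shut down
    total_work = queue_length * seconds_per_item
    wanted = math.ceil(total_work / target_seconds)
    return min(max_workers, max(1, wanted))  # at least one, never above the cap
```

For example, 90 queued items at 2 seconds each is 180 seconds of work, so clearing it within 60 seconds calls for 3 workers. The cap matters because the linear assumption breaks down once the dispatcher or the database becomes the bottleneck.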
But what if this doesn’t resolve the issue? You may find, when you get down to this level, that the function DoMyTask(WorkItem) isn’t actually where all the time is being spent. Perhaps there’s a database call that is the culprit. But now you have isolated your environment enough to be sure you’re on the right track.
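A lightweight way to find out where the time goes is to wrap each stage in a timer before reaching for a full profiler. This is a minimal sketch, assuming Python; `timed`, `stage_totals`, and `slowest_first` are hypothetical helper names, and the stage labels are whatever you choose.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)  # stage name -> accumulated seconds


@contextmanager
def timed(stage: str):
    """Add the wall-clock time spent inside the block to stage_totals[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so failed stages are still counted.
        stage_totals[stage] += time.perf_counter() - start


def slowest_first():
    """Return (stage, seconds) pairs sorted from most to least time spent."""
    return sorted(stage_totals.items(), key=lambda pair: pair[1], reverse=True)
```

Wrapping the database call in `with timed("database"):` and the computation in `with timed("compute"):` inside the loop gives per-stage totals over the whole run, which is usually enough to tell whether the culprit is your own code or the query behind it. Python's standard `cProfile` module is the next step if you need function-level detail.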
Oh, and when you’re done, finish up your work with a code review of the change. Walk through every line of code you had to modify and explain it. It will take a long time, but it will be worthwhile. Happy tuning!