I’m continuing the series of actual problematic I/O patterns that I’ve seen, heard about, or debugged in games or game-related situations. The title refers to the fact that these are all real-world situations, they often cause unusual or unexpected behavior, and sometimes they’re just real unusual!
Remember my last post? If you missed it, it might be worth a read, because we’re building directly upon it for this next one.
The Problem
We begin with a similar situation to the last one: a hard disk with two partitions, C and D.
A game has added DLC, and as a result, during certain parts of the game it’s now reading from two partitions on the hard drive (DLC on one, data cached from the optical disc on the other). Performance is unexpectedly bad, though: the team is getting just 2MiB/s from the hard drive.
Looks like the same thing — but it’s not
Of course, fresh from the last post, you’d say it must be the same thing as last time, right? Their I/O must be bouncing back and forth between the partitions.
So you fire off a quick email describing the issue and suggest ways for them to easily issue larger reads.
Unfortunately, it comes right back. Turns out they’ve done all the right things with their I/O. Since they are old hands at making stuff load fast from optical disc, they planned their I/O carefully and it’s entirely made up of large reads – each one is about 1MiB.
Looking through some debug traces, you find out that indeed, 1MiB reads are going down to the filesystem, but only about 2 of them manage to complete per second.
235.832: read 1MiB from C
235.850: read 1MiB from D
236.603: read 1MiB from C
236.728: read 1MiB from D
This doesn’t look like the same problem after all.
…Or is it?
But wait! Let’s dig a bit deeper.
Specifically, let’s go back to the debug traces – always an invaluable tool when figuring out what’s going on in a real-time system. Let’s turn on the display of thread IDs and log both the start and the end of the requests.
235.832: thread 0xCA782080: read 1MiB from C started
235.840: thread 0xCB9F0100: read 1MiB from D complete
235.850: thread 0xCB9F0100: read 1MiB from D started
236.601: thread 0xCA782080: read 1MiB from C complete
236.603: thread 0xCA782080: read 1MiB from C started
236.610: thread 0xCB9F0100: read 1MiB from D complete
236.628: thread 0xCB9F0100: read 1MiB from D started
So, first of all, we can see that game I/O is handled by multiple threads. (In my original case I knew this already, but it’s an important clue.) And these threads appear to be executing I/O against the two partitions simultaneously.
Let’s colorize to make it a bit clearer. I’ll make a manual annotation to call out the big gap in time, and then make each thread a different color and fade the irrelevant detail to gray:
235.832: thread 0xCA782080: read 1MiB from C started
235.840: thread 0xCB9F0100: read 1MiB from D complete
235.850: thread 0xCB9F0100: read 1MiB from D started
- - - gap of 750ms - - -
236.601: thread 0xCA782080: read 1MiB from C complete
236.603: thread 0xCA782080: read 1MiB from C started
236.610: thread 0xCB9F0100: read 1MiB from D complete
236.628: thread 0xCB9F0100: read 1MiB from D started
I’ve focused in on a single pair of reads, one from C and one from D. They start within 20ms of each other. And they both take over 750ms to complete!
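As a quick sanity check, that timing is consistent with the number the team reported: the pair of reads moves 2MiB between roughly 235.83 and 236.61, about 770ms, which works out to something like 2.6MiB/s. Add in the smaller gaps between requests and you land right around the 2MiB/s they were seeing. (Back-of-the-envelope, of course, from just this one window of the trace.)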
What does your mental model of the situation look like? It’s natural to imagine that the filesystem will service the first 1MiB read in full, and then service the second one.
It’s natural… and it’s also wrong.
Here’s what actually happens
This mental model isn’t correct because the filesystem is not the only place where reads can get broken up.
We need to remember that there are layers beneath the filesystem too. In this case there was at least a software encryption layer and a SATA driver layer. One of those layers (my guess would be encryption, but I never did bother to find out for sure) actually took those 1MiB requests and broke them up into smaller pieces: 128KiB pieces, to be precise.
The interesting thing is that the layer doing the splitting did nothing to force a request’s chunks to stay together. So when two 1MiB reads came in on different threads, instead of two 1MiB reads, we effectively got sixteen independent 128KiB reads!
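To make that concrete, here’s a minimal sketch of what such a chunking layer might look like. Every name in it (ChunkRequest, SubmitChunkToDriver, and so on) is invented for illustration, since the real layer was opaque to us; the property that matters is that each large read turns into a string of independent 128KiB submissions, with nothing tying one request’s chunks together.

```cpp
// Hypothetical sketch of a chunking layer like the one described above.
// All names are invented; the real console layer was opaque to us.
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kChunkSize = 128 * 1024;

struct ChunkRequest {
    int           device;  // physical drive; partitions C and D share device 0
    std::uint64_t offset;  // absolute byte offset on that drive
    std::size_t   size;    // always <= kChunkSize
    void*         dest;
};

// Stand-in for the next layer down (e.g. the SATA driver).
void SubmitChunkToDriver(const ChunkRequest& r)
{
    std::printf("chunk: device %d, offset %llu, %zu bytes\n",
                r.device, static_cast<unsigned long long>(r.offset), r.size);
}

// Each caller's 1 MiB request becomes eight chunk submissions. Nothing here
// keeps one caller's chunks together, so when two threads call this at the
// same time, chunks from the C request and the D request interleave at the
// driver and the head seeks across the disk for nearly every 128 KiB.
void ReadThroughChunkingLayer(int device, std::uint64_t offset,
                              void* dest, std::size_t size)
{
    auto* out = static_cast<unsigned char*>(dest);
    while (size > 0) {
        const std::size_t chunk = size < kChunkSize ? size : kChunkSize;
        SubmitChunkToDriver({device, offset, chunk, out});
        offset += chunk;
        out    += chunk;
        size   -= chunk;
    }
}
```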
And as luck would have it, running the I/O simultaneously from two threads like this caused the reads to interleave in exactly the wrong way. Instead of moving the read head just once for these two reads, it would bounce all the way back and forth across the disk eight times.
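Back-of-the-envelope, that’s enough to explain the trace. Sixteen interleaved 128KiB chunks means a cross-partition seek for nearly every chunk, and roughly 770ms spread across sixteen chunks is about 48ms per chunk. At typical hard drive sequential speeds, actually transferring 128KiB takes only a couple of milliseconds, so almost all of that per-chunk cost is plausibly head travel, rotational latency, and per-chunk overhead in the layers above (remember, there’s an encryption layer in the path). The exact split depends on the drive; the point is that the time is dominated by seeking, not by data transfer.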
And hey, look: we’re now back at the same situation as Part 1 after all.
Brutal!
How to fix it
Ideally, if you controlled the entire system, the best fix would be in the layer that does this chunking: have it service an entire read request before switching to any other read request on the same physical device.
(In other words, make the chunking preserve atomicity.)
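If you did own that layer, the change is small. Continuing the hypothetical sketch from earlier (kChunkSize and SubmitChunkToDriver are the invented names from that snippet), a per-drive lock held across the whole chunk loop is enough to keep one request’s chunks together:

```cpp
// Continuing the hypothetical sketch: a per-drive lock held for the whole
// request preserves atomicity at the chunking layer.
#include <cstddef>
#include <cstdint>
#include <mutex>

std::mutex g_driveLock;  // one per physical drive; C and D share this one

void ReadThroughChunkingLayerAtomic(int device, std::uint64_t offset,
                                    void* dest, std::size_t size)
{
    // All eight 128 KiB chunks of a 1 MiB request now reach the driver back
    // to back; a request on the other partition waits its turn instead of
    // interleaving with this one.
    std::lock_guard<std::mutex> hold(g_driveLock);

    auto* out = static_cast<unsigned char*>(dest);
    while (size > 0) {
        const std::size_t chunk = size < kChunkSize ? size : kChunkSize;
        SubmitChunkToDriver({device, offset, chunk, out});
        offset += chunk;
        out    += chunk;
        size   -= chunk;
    }
}
```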
But these days, when you have an actual OS running on the console, chances are you don’t have that level of direct access to the hardware. What then?
Well, the problem is still the same — a race condition caused by simultaneity — and the fix is the same too: use a mutex or semaphore. You just have to move it up higher.
In our case we were able to create an HDD-access mutex of our own and acquire it in the I/O worker threads whenever we issued a read to either partition. The I/O still bounced between partitions, but only after a full 1MiB of data transfer, which means it bounced 1/8 as often: exactly as originally planned, and perfectly acceptable.
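For what it’s worth, here’s roughly the shape of that workaround. The names are invented and the standard library stands in for the console’s native file and threading APIs, so treat this as a sketch of the idea rather than the actual code:

```cpp
// Sketch of the workaround: one mutex shared by every I/O worker thread that
// touches the hard drive, held for the whole of each large read, whichever
// partition it targets. Invented names; <fstream>/<mutex> are stand-ins for
// the console's native APIs.
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <mutex>

std::mutex g_hddMutex;  // guards the one physical drive behind both C and D

// Called from each I/O worker thread.
std::size_t WorkerRead(const char* path, std::uint64_t offset,
                       void* dest, std::size_t size)
{
    // Serialize whole requests, not chunks: the layer below is still free to
    // split this read into 128 KiB pieces, but no other thread can slip a
    // read to the other partition in between them, so the head moves between
    // partitions at most once per 1 MiB instead of once per 128 KiB.
    std::lock_guard<std::mutex> hold(g_hddMutex);

    std::ifstream file(path, std::ios::binary);
    if (!file) return 0;
    file.seekg(static_cast<std::streamoff>(offset));
    file.read(static_cast<char*>(dest), static_cast<std::streamsize>(size));
    return static_cast<std::size_t>(file.gcount());
}
```

Note that the lock covers reads to either partition; a per-partition lock would defeat the purpose, since it’s the single physical drive underneath both of them that needs protecting.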
Problem solved.
Next time
I’m still thinking about a good topic for next time. I had one in mind, but I’ve already discussed it a bit, so it might not be that interesting.
I’ve been on hard drives for two posts; it might be time to give optical drives or flash media a kick in the nads too. I’ve certainly seen my share of craziness from both. Any thoughts? Any questions you’d like answered?