Links to other parts:

In this part I will cover three tests. The first test examines how long it takes to do a set number of allocations wth a given size. The second test examines how long it takes to exhaust thirty-two megabytes of memory with four kilobyte allocations a set number of times. The third test examines how long it takes to allocate a set number of times in a worst case scenario. Each of the tests takes a file pointer as a parameter, so that it can write the results of the test out in a comma seperated value (CSV) format.

Due to my old laptop giving up the ghost I’m having to change the specs for the system I am using. I’ve done the best with what I have to keep the specs as close to the original ones as possible. The system now used for development and testing is an old desktop running Windows XP Pro Service Pack 3, with a Pentium-4 3.0GHz 32-bit processor, 800MHz FSB, and 512MB of RAM.

First, here are a few important details.

static const double MAXALLOCATIONS = 10000;
 
  static const u32 STEPMULTIPLIER = 10;
 
  static const u32 HEADERSIZE = 8; // sizeof(TALLOCATION_HEADER)
 
  static const u32 NUMREPORTS = (u32)ceil(log(MAXALLOCATIONS)/log((d64)STEPMULTIPLIER)) + 1;

The MAXALLOCATIONS is the maximum set number of allocations to run for each test. The STEPMULTIPLIER is the multiplicand to use when stepping up the number of allocations to run for each subsequent iteration (i.e. going from 100 allocations to 1000 allocatons by multiplying 100 * STEPMULTIPLIER). I’ve pre-calculated HEADERSIZE as a convenience, so it doesn’t need to keep being calculated. Also since TALLOCATION_HEADER is declared in a separate .cpp file, with no declaration in the corresponding header, I’ve gone ahead and hard coded the value here. Last the NUMREPORTS represents the number of iterations that will be performed and reported for each test.

The structure for recording the times and reporting them later is a simple collection of values, with the size used for the allocations, and the hours, minutes and seconds for both allocating and freeing the memory. The seconds are stored as a double to keep the milliseconds.

struct Record
 
  {
 
  	s32 allocationSize;
 
  	s32 allocHours;
 
  	s32 allocMinutes;
 
  	d64 allocSeconds;
 
  	s32 freeHours;
 
  	s32 freeMinutes;
 
  	d64 freeSeconds;
 
  };

Two other structures are used to associate the number of allocations, or repetitions of allocating / freeing the memory, performed at the given allocation size. The first structure is used with the first test and stores an array of records, so that differing allocation sizes (1Byte to 32MB, by multiples of 2) can be associated with the number of times the allocation is performed. The second structure is used with the second and third tests, associating a single record with the number of times the test is performed.

struct AllocReportSizes
 
  {
 
  	static const s32 NUMALLOCATIONSIZES = 25;
 
   
 
  	s32 numAllocations;
 
  	Record records[NUMALLOCATIONSIZES];
 
  };
 
   
 
  typedef struct AllocReportExhaustion
 
  {
 
  	s32 numRepetitions;
 
  	Record record;
 
  } AllocWorstCase;

1. How long does it take to allocate X times at size Y?

First thing to do is set up an array of reporting structures, one for each number of allocations to be tested, along with an entry for each size to be reported (1Byte to 32MB). The number of allocations increase by a multiple of STEPMULTIPLIER for each successive reporting structure.

// Allocate the given size, then free it, a specified number of times
 
  bool TestSizes(FILE* pOutputFile)
 
  {
 
  	if (!pOutputFile)
 
  		return false;
 
   
 
  	AllocReportSizes* pReports = new AllocReportSizes[NUMREPORTS];
 
   
 
  	for (u32 i = 0, numAllocs = 1; i < NUMREPORTS; ++i, numAllocs *= STEPMULTIPLIER)
 
  	{
 
  		pReports[i].numAllocations = numAllocs;
 
  		pReports[i].records[0].allocationSize = MEM_1B;
 
  		pReports[i].records[1].allocationSize = MEM_4B;
 
  		pReports[i].records[2].allocationSize = MEM_8B;
 
  		pReports[i].records[3].allocationSize = MEM_16B;
 
  		pReports[i].records[4].allocationSize = MEM_32B;
 
  		pReports[i].records[5].allocationSize = MEM_64B;
 
  		pReports[i].records[6].allocationSize = MEM_128B;
 
  		pReports[i].records[7].allocationSize = MEM_256B;
 
  		pReports[i].records[8].allocationSize = MEM_512B;
 
  		pReports[i].records[9].allocationSize = MEM_1KB;
 
  		pReports[i].records[10].allocationSize = MEM_2KB; // Header space in a block doesn't become an issue until 4KB.
 
  		pReports[i].records[11].allocationSize = MEM_4KB - HEADERSIZE;
 
  		pReports[i].records[12].allocationSize = MEM_8KB - HEADERSIZE;
 
  		pReports[i].records[13].allocationSize = MEM_16KB - HEADERSIZE;
 
  		pReports[i].records[14].allocationSize = MEM_32KB - HEADERSIZE;
 
  		pReports[i].records[15].allocationSize = MEM_64KB - HEADERSIZE;
 
  		pReports[i].records[16].allocationSize = MEM_128KB - HEADERSIZE;
 
  		pReports[i].records[17].allocationSize = MEM_256KB - HEADERSIZE;
 
  		pReports[i].records[18].allocationSize = MEM_512KB - HEADERSIZE;
 
  		pReports[i].records[19].allocationSize = MEM_1MB - HEADERSIZE;
 
  		pReports[i].records[20].allocationSize = MEM_2MB - HEADERSIZE;
 
  		pReports[i].records[21].allocationSize = MEM_4MB - HEADERSIZE;
 
  		pReports[i].records[22].allocationSize = MEM_8MB - HEADERSIZE;
 
  		pReports[i].records[23].allocationSize = MEM_16MB - HEADERSIZE;
 
  		pReports[i].records[24].allocationSize = MEM_32MB - HEADERSIZE;
 
  	}
 
   
 
  	// Write the headers.
 
  	fprintf(pOutputFile, "Action,Num Allocations,MEM_1B,MEM_4B,MEM_8B,MEM_16B,MEM_32B,MEM_64B,MEM_128B,MEM_256B,MEM_512B,MEM_1KB,MEM_2KB,MEM_4KB,MEM_8KB,MEM_16KB,MEM_32KB,MEM_64KB,MEM_128KB,MEM_256KB,MEM_512KB,MEM_1MB,MEM_2MB,MEM_4MB,MEM_8MB,MEM_16MB,MEM_32MB\n");

Next iterate through each of the reporting structures to determine how long it takes to allocate the specified number of allocations with each of the given sizes. The allocation is performed, then the memory is released, so the next allocation doesn’t operate under different conditions. The time tracked is only the time spent in the allocation and free functions, and those two are tracked seperately.

	for (u32 i = 0; i < NUMREPORTS; ++i)
 
  	{
 
  		s32 numAllocations = pReports[i].numAllocations;
 
  		fprintf(pOutputFile, "Allocate,%i", numAllocations);
 
  		for (s32 j = 0; j < AllocReportSizes::NUMALLOCATIONSIZES; ++j)
 
  		{
 
  			s32 allocationSize = pReports[i].records[j].allocationSize;
 
   
 
  			CHighPerfTimer timerAlloc;
 
  			CHighPerfTimer timerFree;
 
   
 
  			void* pMem;
 
  			for (s32 k = 0; k < numAllocations; ++k)
 
  			{
 
  				timerAlloc.Resume();
 
  				pMem = malloc(allocationSize);
 
  				timerAlloc.Pause();
 
  				timerFree.Resume();
 
  				free(pMem);
 
  				timerFree.Pause();
 
  			}
 
   
 
  			timerAlloc.GetTime(pReports[i].records[j].allocHours, pReports[i].records[j].allocMinutes, pReports[i].records[j].allocSeconds);
 
  			timerFree.GetTime(pReports[i].records[j].freeHours, pReports[i].records[j].freeMinutes, pReports[i].records[j].freeSeconds);
 
   
 
  			fprintf(pOutputFile, ",%02i:%02i:%02.4f", pReports[i].records[j].allocHours, pReports[i].records[j].allocMinutes, pReports[i].records[j].allocSeconds);
 
  		}
 
  		fprintf(pOutputFile, "\n");
 
  	}
 
   
 
  	for (u32 i = 0; i < NUMREPORTS; ++i)
 
  	{
 
  		s32 numAllocations = pReports[i].numAllocations;
 
  		fprintf(pOutputFile, "Free,%i", numAllocations);
 
  		for (s32 j = 0; j < AllocReportSizes::NUMALLOCATIONSIZES; ++j)
 
  		{
 
  			fprintf(pOutputFile, ",%02i:%02i:%02.4f", pReports[i].records[j].freeHours, pReports[i].records[j].freeMinutes, pReports[i].records[j].freeSeconds);
 
  		}
 
  		fprintf(pOutputFile, "\n");
 
  	}
 
  	return true;
 
  }

Allocating a bunch of objects of the same size can add up in time spent allocating, and if lots of them are done at the same time, it can be a performance issue. The problem becomes more prevalent with larger allocations when the system has to search for larger blocks of contiguous memory.

When the test is run using the heap system here, allocation times range from less than a millisecond, for 1 Byte being allocated 1 time, to 27.3 milliseconds, for 32 MB being allocated 10000 times. However, when the test is run using the standard memory allocator, allocation times range from less than a millisecond, for 1 Byte being allocated 1 time, to 9 minutes 17 seconds 442.7 milliseconds, for 32 MB being allocated 10000 times. Now that’s a pretty sensational performance improvement, but something a bit more realistic in real world use might be 64KB allocated 1000 times, and with that the improvement still shows through with 0.7 milliseconds and 66.3 milliseconds respectively.

Windows XP cannot be limited to allocating the requested memory from a specific range of addresses, short of implementing a memory allocation system, which lends to increased allocation times while searching for available memory with the standard allocation system, and that is partly what is being demonstrated here.

2. How long does it take to allocate all the memory X times at 4KB per allocation?

Similar to the previous test, an array of reporting structures is setup, but this time it is for each number of repetitions to exhaust the memory. Only the smallest size possible from the system is tested, since larger allocations decrease the amount allocations and time.

// Allocate one block at a time, until memory is full, then free the memory, a specified number of times
 
  bool TestExhaustion(FILE* pOutputFile)
 
  {
 
  	if (!pOutputFile)
 
  		return false;
 
   
 
  	u32 blockSize = _KB(4);
 
  	AllocReportExhaustion* pReports = new AllocReportExhaustion[NUMREPORTS];
 
   
 
  	for (u32 i = 0, numRepetitions = 1; i < NUMREPORTS; ++i, numRepetitions *= STEPMULTIPLIER)
 
  	{
 
  		pReports[i].numRepetitions = numRepetitions;
 
  		pReports[i].record.allocationSize = blockSize - HEADERSIZE; // blockSize - sizeof(TALLOCATION_HEADER)
 
  	}
 
   
 
  	// Write the headers.
 
  	fprintf(pOutputFile, "Action,Num Repetitions,Time to exhaust 32MB using 4KB per allocation\n");

Then the reporting structures are iterated over to determine how long it takes to allocate all the memory, at 4KB per allocation, the specified number of repetitions. Each time the memory is exhausted it is released and the next repetition begins.

	u32 uTotalMem = _MB(32);
 
  	for (u32 i = 0; i < NUMREPORTS; ++i)
 
  	{
 
  		s32 numRepetitions = pReports[i].numRepetitions;
 
  		s32 allocationSize = pReports[i].record.allocationSize;
 
   
 
  		CHighPerfTimer timerAlloc;
 
  		CHighPerfTimer timerFree;
 
   
 
  		void** ppMem = new void*[uTotalMem / blockSize];
 
  		for (s32 j = 0; j < numRepetitions; ++j)
 
  		{
 
  			for (u32 k = 0; k < (uTotalMem / blockSize); ++k)
 
  			{
 
  				timerAlloc.Resume();
 
  				ppMem[k] = malloc(allocationSize);
 
  				timerAlloc.Pause();
 
  			}
 
  			for (u32 k = 0; k < (uTotalMem / blockSize); ++k)
 
  			{
 
  				timerFree.Resume();
 
  				free(ppMem[k]);
 
  				timerFree.Pause();
 
  			}
 
  		}
 
  		timerAlloc.GetTime(pReports[i].record.allocHours, pReports[i].record.allocMinutes, pReports[i].record.allocSeconds);
 
  		timerFree.GetTime(pReports[i].record.freeHours, pReports[i].record.freeMinutes, pReports[i].record.freeSeconds);
 
   
 
  		fprintf(pOutputFile, "Allocate,%i,%02i:%02i:%02.4f\n", numRepetitions, pReports[i].record.allocHours, pReports[i].record.allocMinutes, pReports[i].record.allocSeconds);
 
  		fprintf(pOutputFile, "Free, %i,%02i:%02i:%02.4f\n", numRepetitions, pReports[i].record.freeHours, pReports[i].record.freeMinutes, pReports[i].record.freeSeconds);
 
  	}
 
   
 
  	return true;
 
  }

Allocating all the memory at 4KB per allocation demonstrates some good and bad allocation conditions, though not the absolute worst (that’s saved for the last test), .When no memory is previously allocated, available memory is discovered quickly, and because each subsequent allocation is contiguous whole blocks eventually end up getting examined with a single check, reducing the time to discover available memory. Performing lots of small allocations means that more searches are performed than if larger allocations were made, because larger allocations would exhaust the memory quicker, and more searches adds up to more time spent searching for available memory.

Exhausting 32 MB of memory at 4KB per allocation with the heap system here ranges in time from 10.1 milliseconds, to do it 1 time, to 1 minute 39 seconds 995 milliseconds, to do it 10000 times. When using the standard memory allocation system, the times range from 101.5 milliseconds, to exhaust 32MB of memory 1 time, to 15 minutes 52 seconds 593.3 milliseconds, to do it 10000 times. Even more interesting is the increase in performance of the Free method. To free the same memory that was allocated, the heap system here ranges from 6.1 milliseconds to 1 minute 1 second 891.4 milliseconds, while the standard memory allocation system clocks in a range from 613.6 milliseconds to 1 hour 43 minutes 35 seconds 161.5 milliseconds. At first I didn’t believe it took more than an hour and a half to free the memory, so I re-ran the test and came up with a similar time (1:41:58.453.5). While the idea of exhausting the memory 10000 times in a row may not be a real world problem, it still highlights the time gains that can be made over many allocations and de-allocations with a specialized system.

Allocating all the memory in Windows XP isn’t the same as allocating all the memory in the system described herein, since Windows XP will begin page-swapping once the physical memory has been exhausted. However, the time it takes Windows XP to allocate 32MB worth of memory at 4KB per allocation versus the time it takes the system described herein to do the same can still be compared for time to complete.

3. How long does it take to allocate 8KB of memory X times in the worst case?

As previously done, an array of reporting structures is created, but this one is used to report the time it takes to allocate an 8KB request in a worst case scenario.

// With every other block available, until the end of the memory where two blocks are available, allocate two blocks of memory, a specified number of times
 
  bool TestWorstCase(FILE* pOutputFile)
 
  {
 
  	if (!pOutputFile)
 
  		return false;
 
   
 
  	u32 blockSize = _KB(4);
 
  	AllocWorstCase* pReports = new AllocWorstCase[NUMREPORTS];
 
   
 
  	for (u32 i = 0, numRepetitions = 1; i < NUMREPORTS; ++i, numRepetitions *= STEPMULTIPLIER)
 
  	{
 
  		pReports[i].numRepetitions = numRepetitions;
 
  		pReports[i].record.allocationSize = (2* blockSize) - HEADERSIZE; // (2 * blockSize) - sizeof(TALLOCATION_HEADER)
 
  	}
 
   
 
  	// Write the headers.
 
  	fprintf(pOutputFile, "Action,Num Repetitions,Worst case allocation time for an 8KB block\n");

The worst case sets up the memory to be entirely allocated in 4KB blocks, then goes back through and frees up every other block. At the end of the memory the test ensures an 8KB block is available to fullfill the request. Any searches for free memory will discover there are free blocks within each page and will search through each page to check if enough contiguous blocks exist to fullfill the requested 8KB.

	u32 uTotalMem = _MB(32);
 
  	void** ppMem = new void*[uTotalMem / blockSize];
 
   
 
  	// Allocate all the memory, one block at a time, then free every other block, then free the next to last block,
 
  	// creating two contiguous blocks at the end of the memory.
 
  	for (u32 i = 0; i < uTotalMem / blockSize; ++i)
 
  	{
 
  		ppMem[i] = malloc(blockSize - HEADERSIZE);	// blockSize - sizeof(TALLOCATION_HEADER)
 
  	}
 
  	for (u32 i = 1; i < uTotalMem / blockSize; i += 2)
 
  	{
 
  		free(ppMem[i]);
 
  		ppMem[i] = NULL;
 
  	}
 
  	free(ppMem[uTotalMem / blockSize - 2]);

The test iterates through the reporting structures, performing the 8KB request for the specified number of repetitions in each one. Every time the available 8KB is found at the end of the memory it is immediately deallocated so the test can be performed for the next iteration.

	// Allocate two blocks of memory, with the only two contiguous blocks of memory being the last two, then free it a given number of times.
 
  	for (u32 i = 0; i < NUMREPORTS; ++i)
 
  	{
 
  		s32 numRepetitions = pReports[i].numRepetitions;
 
  		s32 allocationSize = pReports[i].record.allocationSize;
 
   
 
  		CHighPerfTimer timerAlloc;
 
  		CHighPerfTimer timerFree;
 
   
 
  		for (s32 j = 0; j < numRepetitions; ++j)
 
  		{
 
  			void* pMem;
 
  			for (u32 k = 0; k < (uTotalMem / blockSize); ++k)
 
  			{
 
  				timerAlloc.Resume();
 
  				pMem = malloc(allocationSize);
 
  				timerAlloc.Pause();
 
  				timerFree.Resume();
 
  				free(pMem);
 
  				timerFree.Pause();
 
  			}
 
  		}
 
  		timerAlloc.GetTime(pReports[i].record.allocHours, pReports[i].record.allocMinutes, pReports[i].record.allocSeconds);
 
  		timerFree.GetTime(pReports[i].record.freeHours, pReports[i].record.freeMinutes, pReports[i].record.freeSeconds);
 
   
 
  		fprintf(pOutputFile, "Allocate, %i,%02i:%02i:%02.4f\n", numRepetitions, pReports[i].record.allocHours, pReports[i].record.allocMinutes, pReports[i].record.allocSeconds);
 
  		fprintf(pOutputFile, "Free, %i,%02i:%02i:%02.4f\n", numRepetitions, pReports[i].record.freeHours, pReports[i].record.freeMinutes, pReports[i].record.freeSeconds);
 
  		printf("\n");
 
  	}
 
   
 
  	// Free the memory used to setup the testcase
 
  	for (u32 i = 0; i < uTotalMem / blockSize - 1; ++i)
 
  	{
 
  		if (ppMem[i] != NULL)
 
  			free(ppMem[i]);
 
  	}
 
   
 
  	return true;
 
  }

The worst case test gives a sense of the time an allocation can take with this heap system at its worst, and shows how important it is to avoid fragmentation of the memory. Times to perform the test range from 390.5 milliseconds, to complete 1 time, to 1 hour 5 minutes 12 seconds 634.8 milliseconds, to complete 10000 times. In the real world one hopes to never get into a fragmentation predicament like the one setup here, and careful planning of heaps is usually done to help avert this kind of situation.

Since there is no way to limit where Windows XP will allocate memory, and because Windows XP will start page swapping to the harddrive when it runs out of physical memory, this last test isn’t something that can be fairly compared, so it has been skipped for the standard memory allocator.

While this heap system has its advantages, it can be improved upon in a number of ways. Here are three ideas that spring to mind . The heap system could be made thread-safe. Second, small allocations, less than the block size, need to be handled at the sub-block level to avoid wasting space. One way that could be done is to allocate a heap containing a set number of blocks (e.g. 128 blocks (512KB)), then any time a small request comes in, it can be redirected to that heap. This method will require some overhead to track the sub-block allocations. Third, this heap system could be improved for debugging by incorporating tracking information to determine where an allocation was made.

I hope this series has been helpful, and look forward to hearing all the great ideas other people have on how to improve upon what has been shown here.