Emulation has been something I’ve been interested in for a very long time now, ever since I first fired up Pasofami after school that one day and saw Super Mario Brothers running in a window on my 486 PC.  The performance at the time was horribly slow, but seeing my favorite Nintendo games running on my PC was like magic to me.  Since then, emulation techniques have improved drastically and emulators are now quite commonplace — executing Java byte code, for example, can be considered a form of emulation.  Regardless of the actual platform being emulated, the process of changing foreign machine code into native code is the same for every emulator.

Many many moons ago, I developed a Nintendo 64 emulator called Project UnReality.  Because the CPU emulation core I wrote used interpreted emulation, it was also horribly slow.  Basically, an interpreted emulator performs the following steps:

  1. Fetch the next CPU instruction (opcode) at the current instruction pointer,
  2. Decode the opcode, calling the appropriate function which implements that opcode’s functionality,
  3. Increment the instruction pointer,
  4. Repeat until the user quits your emulator because he’s tired of looking at a blank screen, and proceeds to send you hate mail complaining that the emulator doesn’t work.

Although this technique is fairly easy to implement, it doesn’t perform very well.  In order to get any kind of decent speed out of an emulator, a technique known as JIT (Just-In-Time compilation) or dynamic recompilation is used.  This technique recompiles a group of CPU instructions — typically until a branch instruction is encountered — into exact or similar instructions for the native CPU, and then directly executes them.  The recompilation operation itself of course takes time to execute, but these recompiled blocks can be cached and reused next time the emulated CPU executes that specific area of memory, hence providing near real-time performance.

Since describing how to write a complete emulator would easily fill a large book, I’ve decided to just start off with the minimal basics for this first part.  I will show how to implement a recompiler for a very small subset of the 6502 CPU instruction set which will compile into 32-bit Intel x86 code, and then show how to execute the recompiled code in a Win32 application.  Although the recompiler itself will be presented in C++, having a basic understanding of x86 assembly will be helpful.

All of the source code shown here is attached to the end of this post, so you don’t need to worry about copying the code from this text.  I will also attach useful links at the end for those who are interested in doing further research on their own.

There’s a fair bit of ground to cover, so you might want to grab a cup of your favorite beverage before diving in!

Enter: The 6502

The 6502 was a popular CPU for its time, being used in the Atari 2600, Apple IIe, Commodore 64, BBC Micro, the original Nintendo Entertainment System (the CPU for the NES was actually a custom Ricoh chip, but it executed 6502 instructions), and many others.  The important things about this processor that we need to know in order to emulate it are:

  1. The 6502 has three 8-bit registers (accumulator, X, and Y), an 8-bit stack pointer, and an 8-bit flags/status register,
  2. Memory is stored in little endian format (the same as x86 CPUs),
  3. Instructions can be multiple bytes long, but opcodes are exactly 8-bits in size,
  4. The largest accessible memory range is limited to 16-bits (64KB),
  5. The stack is fixed at addresses 0100h-01FFh.

To keep things simple for now, we won’t bother with either the status register or the stack in this part.  Simply knowing where the stack is located in memory is sufficient enough so we don’t attempt to overwrite it.

Let’s start out by writing a simple program that we’d like the initial version of our emulator to execute.  The program below performs both read and write memory accesses, and has a simple add computation.  I’ll comment each line for those of you who aren’t familiar with 6502 assembly:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  
lda #0          ; load 00h into accumulator
 
  ldx #$1F	; load 1Fh into X
 
  sta $3000, x	; store accumulator into memory at 3000h + X
 
   
 
  loop:
 
  lda $3000, x	; load memory at 3000h + X into accumulator
 
  adc #1		; add 1 to accumulator
 
  sta $3000, x	; store accumulator into memory at 3000h + X
 
  jmp loop	; jump back to start of loop

This program first initializes the memory at 301Fh to 00h, then performs a loop which reads the value from 301Fh, adds one, then writes the value back to 301Fh.  When this program is assembled with the base address set to 2000h, the following byte code is generated:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  
2000: a9 00	lda #$00
 
  2002: a2 1F	ldx #$1F
 
  2004: 9d 00 30	sta $3000,x
 
  2007: bd 00 30	lda $3000,x
 
  200a: 69 01	adc #$01
 
  200c: 9d 00 30	sta $3000,x
 
  200f: 4c 07 20	jmp $2007

In order to execute this program, we will need to implement the following six opcodes (since the same STA opcode is used twice):

  1. A9: LDA Immediate
  2. A2: LDX Immediate
  3. 9D: STA Absolute, X
  4. BD: LDA Absolute, X
  5. 69: ADC Immediate
  6. 4C: JMP Absolute

Register Mappings And Memory Layout

The x86 processor has eight 32-bit general purpose registers — which can also be accessed in 8-bit sections — allowing us to map all of the emulated 6502 registers directly to x86 registers.  This means our recompiler has the potential to do a 1-to-1 mapping of instructions, and only access memory when the emulated code requires it.  The register mapping we’ll be using in our implementation is as follows:

6502		x86
 
  ---------------	-------
 
  Accumulator	AL
 
  Stack Pointer	AH
 
  X		BL
 
  Y		BH
 
  Status/Flags	CL
 
  IP		DI

So for example, if our recompiler encounters a TAX instruction (transfer accumulator to X — nothing to do with the evil tax man), it should generate an equivalent x86 instruction of: MOV BL, AL.  This mapping also leaves three 32-bit registers (EDX, ESI, and EBP), and one 8-bit (CH) available for the recompiler to use for any temporary calculations.

As the 6502 doesn’t have any I/O instructions (like IN and OUT on the x86), all I/O is memory mapped.  This means reads or writes to specific memory locations will trigger something in the hardware, such as disk access, graphics/video display, or sound.  Since we are only interested in doing pure CPU emulation in this first part, we won’t deal with memory mapped I/O yet.

This also means emulating the memory is incredibly easy: we can simply allocate a single 64KB block to cover the entire 16-bit accessible address range of the 6502, and be done with it.

Building The Emulator Framework

Armed with the knowledge we have so far, let’s start to lay out our JIT class a bit:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  16
 
  17
 
  18
 
  19
 
  20
 
  21
 
  22
 
  23
 
  24
 
  25
 
  26
 
  27
 
  28
 
  29
 
  30
 
  31
 
  32
 
  33
 
  34
 
  35
 
  36
 
  37
 
  38
 
  39
 
  40
 
  41
 
  42
 
  43
 
  44
 
  45
 
  46
 
  47
 
  48
 
  49
 
  
#include <cstdint>
 
  #include <iostream>
 
  #include <map>
 
  #include "windows.h"
 
   
 
  class JIT
 
  {
 
  public:
 
  	struct CPU6502
 
  	{
 
  		uint8_t		acc, x, y, sp;
 
  		uint8_t		status;
 
  		uint16_t	ip;
 
  	};
 
   
 
  	struct NativeCode
 
  	{
 
  		uint8_t *	ptr;
 
  		int		size;
 
  	};
 
   
 
  public:
 
  	JIT(void)
 
  	{
 
  		// allocate emulated memory and reset the CPU state
 
  		mMemory = new uint8_t[65536];
 
  		memset(&mCPU, 0, sizeof(mCPU));
 
  	}
 
   
 
  	~JIT(void)
 
  	{
 
  		// free emulated memory
 
  		delete []mMemory;
 
  	}
 
   
 
  	CPU6502 &cpu(void)
 
  	{
 
  		return mCPU;
 
  	}
 
   
 
  	uint8_t &operator[](uint16_t offs)
 
  	{
 
  		return mMemory[offs];
 
  	}
 
   
 
  private:
 
  	uint8_t *	mMemory;
 
  	CPU6502		mCPU;
 
  };

Two accessor functions are also provided to allow the application to easily control the emulated CPU and memory.

Implementing The Code Cache

As I mentioned earlier, a dynamic recompiler does not execute a single instruction at a time, but instead batches a group of instructions together and executes them.  Since our intent is to be able to reuse previously compiled code blocks, it makes sense to compile until a branch instruction is reached.  For example, a call to a function will change the CPU’s instruction pointer to a new location, and the function will be recompiled on the initial call.  The next time the function is called, the already compiled block will be reused.  Note that this isn’t restricted to only functions, and will also work for compiling the inner portion of loops as well.

A cache holding these recompiled code blocks can be easily implemented by utilizing std::map.  The 16-bit 6502 Instruction Pointer can be used to map directly to our NativeCode structure for the compiled code at that location:

1
 
  2
 
  
typedef std::map<uint16_t, NativeCode>	CacheMap;
 
  CacheMap mCache;

Our recompiler’s main run() function would then perform the following simple logic:

  1. Check if the cache contains an already recompiled block at the current emulated Instruction Pointer,
  2. If it does not, recompile the code then add it into the cache,
  3. Execute the block of code in the cache.

Which can then be implemented like so:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  
void run(void)
 
  {
 
  	const NativeCode *code;
 
   
 
  	// check to see if the compiled code already exists in our cache
 
  	CacheMap::const_iterator iter = mCache.find(mCPU.ip);
 
  	if (iter != mCache.end())
 
  		code = &iter->second;
 
  	else
 
  		code = compile(mCPU.ip);
 
   
 
  	// execute this code block
 
  	execute(code);
 
  }

Allocating Executable Memory

As of Windows XP SP2, a security feature called PAGE_EXECUTE_READWRITE flag to the VirtualAlloc() function, like so:

1
 
  
ptr = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);

And to free the memory, a call to VirtualFree() will do the job:

1
 
  
VirtualFree(ptr, size, MEM_DECOMMIT | MEM_RELEASE);

Note that unlike the C library free() function which only takes a single parameter, VirtualFree() requires the size of the page to be passed back to it as well.

Recompiling A Block Of Code

Since x86 instructions are of variable length, it is not directly possible to determine the required size of the recompiled buffer.  We’ll solve this by doing two passes on the incoming code: the first will just parse the 6502 instructions, totaling up a size required for the x86 instruction buffer (which we pass to VirtualAlloc()), then the second pass will actually perform the conversion and write the x86 instructions.  On the first pass, we will pass NULL as the output pointer so it knows we only want to calculate the size and not write any output.

At the end of our recompiled block, we will append the x86 RET instruction, which is a single byte of C3h.  This means a simple CALL instruction to the address of our recompiled block will allow us to execute it, and then return control back to our JIT class.  You could also consider the single RET instruction to be similar to a null terminator for a string — if there wasn’t any terminator, the CPU would continue to execute uninitialized memory and swiftly crash the emulator.

After the block has been compiled, we’ll then proceed to insert it into our cache, then return a pointer to the newly created NativeCode structure:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  16
 
  17
 
  18
 
  19
 
  20
 
  21
 
  
const NativeCode *compile(uint16_t addr)
 
  {
 
  	const uint8_t *in = &(*this)[addr];
 
  	NativeCode code;
 
   
 
  	// do a first pass to determine the size of the compiled memory, adding one byte for our native return opcode
 
  	code.size = compileBlock(in, NULL) + 1;
 
   
 
  	// allocate executable memory
 
  	code.ptr = static_cast<uint8_t *>(::VirtualAlloc(NULL, code.size, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE));
 
   
 
  	// now do a second pass, compiling the actual code
 
  	compileBlock(in, code.ptr);
 
   
 
  	// write an x86 return opcode terminator
 
  	code.ptr[code.size - 1] = 0xC3;
 
   
 
  	// and store this block into our cache
 
  	mCache[addr] = code;
 
  	return &mCache[addr];
 
  }

We’ll also want to update our class destructor to ensure the cache is freed properly on destruction:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  
~JIT(void)
 
  {
 
  	// free all executable code in our cache
 
  	for (CacheMap::iterator iter = mCache.begin(); iter != mCache.end(); ++iter)
 
  		::VirtualFree(iter->second.ptr, iter->second.size, MEM_DECOMMIT | MEM_RELEASE);
 
   
 
  	// free emulated memory
 
  	delete []mMemory;
 
  }

Now it’s time for the compileBlock() function.  As I mentioned earlier, each 6502 opcode is fixed at 8-bits in size, so there are a maximum of 256 different possible opcodes.  In reality, the 6502 only has 54, but being able to dispatch individual opcodes into separate functions rather than trying to build everything within a huge switch() block would be ideal.  We’ll use function pointers combined with a lookup table built at compile time to achieve this.   Since we’re only implementing six of the opcodes, the table will look a bit bare:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  16
 
  17
 
  18
 
  19
 
  20
 
  21
 
  22
 
  23
 
  24
 
  25
 
  26
 
  27
 
  28
 
  29
 
  30
 
  31
 
  32
 
  
int compileBlock(const uint8_t *in, uint8_t *out)
 
  {
 
  	typedef int (JIT::*CompileFunc)(const uint8_t *&in, uint8_t *&out, bool &stop);
 
   
 
  	static const CompileFunc opcodeTable[256] = {
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// 00h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// 10h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// 20h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// 30h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	&JIT::jmpabs,	NULL,		NULL,	NULL,		// 40h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// 50h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	&JIT::adcimm,	NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// 60h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// 70h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// 80h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		&JIT::staabx,	NULL,	NULL,		// 90h
 
  		NULL,	NULL,	&JIT::ldximm,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	&JIT::ldaimm,	NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// A0h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		&JIT::ldaabx,	NULL,	NULL,		// B0h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// C0h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// D0h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL,		// E0h
 
  		NULL,	NULL,	NULL,		NULL,	NULL,	NULL,	NULL,	NULL,	NULL,	NULL,		NULL,	NULL,	NULL,		NULL,		NULL,	NULL		// F0h
 
  	};
 
   
 
  	bool stop = false;
 
  	int size = 0;
 
   
 
  	// compile instructions until we're told to stop
 
  	while (!stop)
 
  		size += (*this.*opcodeTable[*in])(in, out, stop);
 
   
 
  	return size;
 
  }

The above strange-looking code works by dereferencing the incoming in pointer to obtain the 6502 opcode, then uses it as a direct lookup into the opcode table.  The function pointer stored at that specific index is then called.  If you look back at the beginning of this post, you’ll see that the 6502 LDX Immediate opcode has a value of A2h.  In the above table, the address of ldximm() is stored in the table at exactly index A2h, allowing the appropriate function to be called.

In all of our functions, three parameters are passed, and they all have a return value of type int.  The input pointer points directly into the emulated 6502 memory area, and is incremented by the size of that 6502 instruction by the function responsible for that opcode (since it knows how large each 6502 instruction is).  The output pointer is either NULL — which means nothing should be written to the output — or it points to a valid location in the already allocated executable memory buffer where x86 instructions should be written.  The function for that opcode is also responsible for incrementing the output pointer as well (since it knows how large each x86 instruction is).  The stop bool value is initialized to false at the start of compilation, and the compiler will stop when it is set to true — for example, at a branch instruction.  Finally, the return value of each function is the required size (in bytes) of the x86 instructions necessary for that 6502 instruction.

Writing The x86 Instructions

Encoding x86 opcodes can be a bit tricky due to all of the different addressing schemes available in the architecture.  Intel’s site provides the official this one.  A full-fledged dynamic recompiler would likely have a nice set of routines to write out a stream of x86 instructions, but that would take a fair amount of code and explanation to cover it all.  Instead, we’ll just encode the x86 byte code by hand for now.

One method to determine what byte code a specific x86 instruction compiles into is to assemble that instruction, then look at the resulting output in a hex editor.  Any x86 assembler will do, but Visual Studio debugger can be (ab)used to  do this as well.  Simply enter the instruction inside of an __asm { } block anywhere in the code, then set a breakpoint on the instruction:

Run the program, and when the breakpoint is hit, right-click and select “Go To Disassembly”, and the debugger will show the address in memory (which we’re not interested in), the byte code (in this case, the LEA instruction is six bytes long), and the disassembled instruction itself:

Right, so let’s get started!

Our first opcode is A9h: LDA Immediate.  It should simply store the 8-bit immediate value into the x86 register AL, where our emulated accumulator resides:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  
int ldaimm(const uint8_t *&in, uint8_t *&out, bool &stop)
 
  {
 
  	if (out != NULL)
 
  	{
 
  		*out++ = 0xB0;		// mov al, imm
 
  		*out++ = in[1];		// <immediate value>
 
  	}
 
   
 
  	in += 2;
 
  	return 2;
 
  }

Our second opcode is A2h: LDX Immediate.  It is nearly identical to LDA Immediate, but it should store the 8-bit immediate value into the x86 register BL, where our emulated X register resides:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  
int ldximm(const uint8_t *&in, uint8_t *&out, bool &stop)
 
  {
 
  	if (out != NULL)
 
  	{
 
  		*out++ = 0xB3;		// mov bl, imm
 
  		*out++ = in[1];		// <immediate value>
 
  	}
 
   
 
  	in += 2;
 
  	return 2;
 
  }

Our third opcode is 9Dh: STA Absolute, X, and is also our first memory opcode.  It stores the accumulator (in AL) to the absolute memory address specified plus an offset in X (in BL).  In order to convert the 6502 memory address into a valid pointer on the native x86 side, we’ll need to add the absolute address specified in this opcode with the base address of our emulated memory block.  This addition can be performed at compile time since the offset is hardcoded, and our emulated memory block isn’t ever going to move.

In order to add the offset in BL, we’ll need to first zero-extend it to a 32-bit register using MOVZX.  We can then do a register plus 32-bit immediate addition in one LEA opcode:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  16
 
  17
 
  18
 
  19
 
  20
 
  21
 
  22
 
  23
 
  24
 
  25
 
  26
 
  27
 
  28
 
  29
 
  
int staabx(const uint8_t *&in, uint8_t *&out, bool &stop)
 
  {
 
  	if (out != NULL)
 
  	{
 
  		// zero-extend X register offset (in bl) to edx
 
  		*out++ = 0x0F;				// movzx edx, bl
 
  		*out++ = 0xB6;
 
  		*out++ = 0xD3;
 
   
 
  		// decode specified absolute address
 
  		uint16_t offs = MAKEWORD(in[1], in[2]);
 
   
 
  		// calculate true address in emulated memory
 
  		uint32_t addr = (uint32_t)(mMemory + offs);
 
   
 
  		// load address plus offset (placed in edx above) into esi
 
  		*out++ = 0x8D;				// lea esi, [edx + immed]
 
  		*out++ = 0xB2;
 
  		*((uint32_t *)out) = addr;		// <immediate value>
 
  		out += 4;
 
   
 
  		// store accumulator (in al) to memory pointed by esi
 
  		*out++ = 0x88;				// mov [esi], al
 
  		*out++ = 0x06;
 
  	}
 
   
 
  	in += 3;
 
  	return 11;
 
  }

Our fourth opcode is BDh: LDA Absolute, X, which is identical to STA Absolute, X above except it reads from emulated memory rather than writing to it:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  16
 
  17
 
  18
 
  19
 
  20
 
  21
 
  22
 
  23
 
  24
 
  25
 
  26
 
  27
 
  28
 
  29
 
  
int ldaabx(const uint8_t *&in, uint8_t *&out, bool &stop)
 
  {
 
  	if (out != NULL)
 
  	{
 
  		// zero-extend X register offset (in bl) to edx
 
  		*out++ = 0x0F;				// movzx edx, bl
 
  		*out++ = 0xB6;
 
  		*out++ = 0xD3;
 
   
 
  		// decode specified absolute address
 
  		uint16_t offs = MAKEWORD(in[1], in[2]);
 
   
 
  		// calculate true address in emulated memory
 
  		uint32_t addr = (uint32_t)(mMemory + offs);
 
   
 
  		// load address plus offset (placed in edx above) into esi
 
  		*out++ = 0x8D;				// lea esi, [edx + immed]
 
  		*out++ = 0xB2;
 
  		*((uint32_t *)out) = addr;		// <immediate value>
 
  		out += 4;
 
   
 
  		// load accumulator (in al) from memory pointed by esi
 
  		*out++ = 0x8A;				// mov al, [esi]
 
  		*out++ = 0x06;
 
  	}
 
   
 
  	in += 3;
 
  	return 11;
 
  }

Our fifth opcode is 69h: ADC Immediate, which simply adds an immediate value to the accumulator (in AL).  Normally, the carry bit would also need to be added, and set upon overflow, but we’re not going to worry about that for now:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  
int adcimm(const uint8_t *&in, uint8_t *&out, bool &stop)
 
  {
 
  	if (out != NULL)
 
  	{
 
  		*out++ = 0x82;		// add al, imm
 
  		*out++ = 0xC0;
 
  		*out++ = in[1];		// <immediate value>
 
  	}
 
   
 
  	in += 2;
 
  	return 3;
 
  }

Finally, our last opcode is 4Ch: JMP Absolute.  We will store the new emulated Instruction Pointer into DI and tell the compiler to stop, as this is a branch instruction:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  
int jmpabs(const uint8_t *&in, uint8_t *&out, bool &stop)
 
  {
 
  	if (out != NULL)
 
  	{
 
  		*out++ = 0x66;		// 16-bit prefix
 
  		*out++ = 0xBF;		// mov di, imm
 
  		*out++ = in[1];		// <immediate value>
 
  		*out++ = in[2];
 
  	}
 
   
 
  	// this is a branch instruction, so stop compiling
 
  	stop = true;
 
  	in += 3;
 
  	return 4;
 
  }

Executing The Compiled Code

Now that the recompiler is done, we can finally write our execute function.  This function should simply load the emulated CPU state into the corresponding x86 registers, call the specified native code, then store the updated emulated CPU state back before returning.  Since we need to explicitly set the CPU registers, this function will need to be done using inline assembly, and wrapped in a PUSHAD/POPAD block to ensure the CPU state of our application is saved and restored properly:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  16
 
  17
 
  18
 
  19
 
  20
 
  21
 
  22
 
  23
 
  24
 
  25
 
  26
 
  27
 
  28
 
  29
 
  30
 
  31
 
  32
 
  33
 
  
void execute(const NativeCode *code)
 
  {
 
  	const uint8_t *native = code->ptr;
 
  	CPU6502 *cpu = &mCPU;
 
   
 
  	__asm {
 
  		// save all registers
 
  		pushad
 
   
 
  		// load the emulated CPU state
 
  		mov	edx, cpu
 
  		mov	al, [edx]CPU6502.acc
 
  		mov	ah, [edx]CPU6502.sp
 
  		mov	bl, [edx]CPU6502.x
 
  		mov	bh, [edx]CPU6502.y
 
  		mov	cl, [edx]CPU6502.status
 
   
 
  		// call the native code block
 
  		call	native
 
   
 
  		// store the updated CPU state back
 
  		mov	edx, cpu
 
  		mov	[edx]CPU6502.ip, di
 
  		mov	[edx]CPU6502.acc, al
 
  		mov	[edx]CPU6502.sp, ah
 
  		mov	[edx]CPU6502.x, bl
 
  		mov	[edx]CPU6502.y, bh
 
  		mov	[edx]CPU6502.status, cl
 
   
 
  		// restore all registers
 
  		popad
 
  	}
 
  }

And that’s it.  Our very basic six-opcode 6502 dynamic recompiler is now completed.

Emulator Execution

Running the emulator is actually quite easy.  Some 6502 code will first need to be loaded into the emulated memory, then the emulated CPU Instruction Pointer needs to be pointed to that address.  Once that is done, simply calling the run() function will execute one step of the emulator, updating the emulated CPU state and memory for that step.  We’ll load the program that we wrote at the beginning of this article into emulated memory at address 2000h:

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  16
 
  17
 
  18
 
  19
 
  20
 
  21
 
  22
 
  23
 
  24
 
  25
 
  
int main(void)
 
  {
 
  	JIT jit;
 
   
 
  	// load the code we want to execute into emulated memory at address 2000h
 
  	uint8_t *ptr = &jit[0x2000];
 
  	*ptr++ = 0xA9; *ptr++ = 0x00;			// lda #$00
 
  	*ptr++ = 0xA2; *ptr++ = 0x1F;			// ldx #$1F
 
  	*ptr++ = 0x9D; *ptr++ = 0x00; *ptr++ = 0x30;	// sta $3000,x
 
  	*ptr++ = 0xBD; *ptr++ = 0x00; *ptr++ = 0x30;	// lda $3000,x
 
  	*ptr++ = 0x69; *ptr++ = 0x01;			// adc #$01
 
  	*ptr++ = 0x9D; *ptr++ = 0x00; *ptr++ = 0x30;	// sta $3000,x
 
  	*ptr++ = 0x4C; *ptr++ = 0x07; *ptr++ = 0x20;	// jmp $2007
 
   
 
  	// set the cpu's instruction pointer to the start of our code
 
  	jit.cpu().ip = 0x2000;
 
   
 
  	// run our code 64 times
 
  	for (int t = 0; t < 64; t++)
 
  		jit.run();
 
   
 
  	// display the memory byte we modified
 
  	std::cout << "memory[0x301F]=" << (int)jit[0x301F] << std::endl;
 
  	return 0;
 
  }

On the first call to run(), the Instruction Pointer is 2000h — which is not in the cache — and so the recompiler will compile the entire code block (from the initial LDA instruction up to the JMP instruction).  If you set a breakpoint on the call native line in the execute() function, then step into the call, you’ll be able to see the completely recompiled code block:

As you can see, the JMP opcode at the end sets the emulated Instruction Pointer (in DI) to 2007h, then returns control back to the application.

On the second call to run(), the Instruction Pointer is 2007h — which is again, not in the cache — and the recompiler will compile the code block from the LDA Absolute, X instruction at 2007h up to the JMP instruction.  The resulting recompiled code block on this pass is of course smaller than the one at 2000h shown above, and the lower half of the x86 instructions (from 0003000Fh onward) are identical:

On all subsequent calls to run(), the Instruction Pointer is 2007h, which is already in the cache.  The emulator simply uses the already compiled code block and executes it directly.

So, after calling run() in a loop 64 times, the emulated memory at 301Fh should contain the value 64:

…and indeed it does.

On Forward!

The recompiler presented here only scratches the surface of what is needed to implement a fully working CPU emulator.  Aside from the remaining opcodes, some of the other things we need to consider are:

  1. Limiting the size of the cache to a fixed size.  Our current implementation recompiles everything, and will continue to eat up memory.  A proper cache would track the number of times a recompiled block is used, and when the cache becomes full, it discards the least-used blocks.
  2. Invalidating cached blocks on write.  When the emulated code writes to a memory location held by a block in the cache, that block should be discarded so it is recompiled when it is accessed.  Self-modifying code (or recompiled code in a block which modifies itself) can be a bit tricky to handle, but is certainly possible.
  3. Memory mapped I/O.  Rather than simply emulating the read or write to a specific memory location, a callback to the user code should be performed if the address is in a memory mapped area.  This allows the emulator to handle hardware devices, such as graphics and sound.

If you really want to get into it, an optimization pass can be performed on the recompiled block, eliminating unnecessary instructions.  For example (at address 000F000Eh above), neither BL or ESI were modified, so the MOVZX and LEA instructions can be removed completely.  You could then go even further and combine the MOV/ADD/MOV triplet into a single ADD instruction.

I’ll be discussing how to handle the issues above in the future parts, but since this post took a bit more time than I had planned, I might interleave them between other posts on different subjects.  Of course, you are encouraged to explore on your own, and I will be more than happy to answer any questions in the comments, should you get stuck.  Here’s the source code for this post, all in one .cpp file:  jit-6502-x86.zip

Happy coding!

Links

1. Online 6502 assembler and emulator in Javascript:
3. Intel Architecture Software Developer’s Manuals:

5. Wikipedia article on JIT Compilers: