How a CPU converts an address specified in a load or store instruction to a physical address is not something that we as games programmers typically need to worry about day-to-day. One exception to that would have been back in ye olde days of 32-bit DOS programs running on the DOS Protected Mode Interface (DPMI). Still, the details are quite interesting, and can be helpful when trying to understand some performance issues. This post attempts to give a working knowledge of how address translation works on the x86 and Cell architectures, without getting too bogged down in the finer details. It also ties up a couple of loose ends from my previous post on memory caches, “Getting Down and Dirty with MESI Caches” (/2011/07/09/getting-down-and-dirty-with-mesi-caches/).
CPUs typically boot up in “real mode”, where addresses specified by the running code are treated directly as real/physical addresses. The operating system then configures and enables address translation. This allows all kinds of useful things, like:
- Protecting the OS from malicious applications, by not allowing the applications to see any sensitive memory areas.
- Protecting application code from modification, and data from execution.
- Virtual memory that maps a large virtual memory space onto a smaller physical one.
- Guard pages to catch stack overflows.
- Linking all applications to a fixed address range, removing the need to relocate them when loading.
x86
The x86 system is conceptually nice and simple, so it's a great place to start before we look at the Cell.
In 64-bit mode, the x86 translates “linear addresses” specified by code to physical addresses. A four-level hierarchy of data structures is accessed by the CPU. The most significant bits of the linear address index into the Page Map Level 4 (PML4). The PML4 Entry specifies the physical address of a Page Directory Pointer Table (PDPT). The next most significant bits of the linear address index into this PDPT. The other levels of the hierarchy, the Page Directory (PD) and Page Table (PT), are accessed in a similar fashion. The remaining least significant bits of the linear address give an offset into the memory page referenced by the final Page Table Entry (PTE).
The bit assignments (bit 0 being the least significant) are

| Bits | Field |
|---|---|
| 47:39 | PML4E index |
| 38:30 | PDPTE index |
| 29:21 | PDE index |
| 20:12 | PTE index |
| 11:0 | offset |
The layout just described is for 4KB pages (2^12 = 4096 bytes of offset). Larger page sizes are supported by dropping the lower levels of the hierarchy and increasing the size of the offset: 2MB pages are mapped by a PDE, and 1GB pages by a PDPTE.
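To make the bit layout concrete, here's a minimal C sketch that splits a 64-bit linear address into its 4KB-page translation fields (the function and its output format are just for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Split a 64-bit linear address into its 4KB-page translation fields.
   Each table index is 9 bits (512 8-byte entries per 4KB table). */
static void split_linear_address(uint64_t la)
{
    uint64_t pml4e  = (la >> 39) & 0x1ff; /* bits 47:39 */
    uint64_t pdpte  = (la >> 30) & 0x1ff; /* bits 38:30 */
    uint64_t pde    = (la >> 21) & 0x1ff; /* bits 29:21 */
    uint64_t pte    = (la >> 12) & 0x1ff; /* bits 20:12 */
    uint64_t offset = la & 0xfff;         /* bits 11:0  */
    printf("PML4E %3llu PDPTE %3llu PDE %3llu PTE %3llu offset 0x%03llx\n",
           (unsigned long long)pml4e, (unsigned long long)pdpte,
           (unsigned long long)pde, (unsigned long long)pte,
           (unsigned long long)offset);
}
```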
So what about bits 63:48? They should be the same as bit 47, i.e. the linear address is sign extended to 64 bits. This is known as the canonical form, and you'll get an exception if the most significant bits are anything else.
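A quick sketch of that check (the helper name is made up):

```c
#include <stdint.h>

/* Returns non-zero if 'la' is canonical, i.e. bits 63:48 are all
   copies of bit 47. The signed right shift sign-extends from bit 47
   (arithmetic shift on every mainstream compiler), so a canonical
   address survives the round trip unchanged. */
static int is_canonical(uint64_t la)
{
    return (uint64_t)((int64_t)(la << 16) >> 16) == la;
}
```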
There are also some legacy addressing modes supported.
Physical Address Extension (PAE) is similar to 64-bit mode, but loses the PML4, and has only a two-bit index (31:30) into the PDPT.
32-bit mode loses both the PML4 and PDPT. Also, PDEs and PTEs halve in size from 8 bytes to 4 bytes, though the tables stay the same size (4KB), so an extra bit is used to index each.
| Bits | Field |
|---|---|
| 31:22 | PDE index |
| 21:12 | PTE index |
| 11:0 | offset |
In all paging modes, the CPU has a single global (physical address) pointer to the top level of the hierarchy. This is stored in Control Register 3 (CR3).
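Putting the 64-bit pieces together, here's a hedged sketch of a full 4KB-page walk done in software. It assumes a hypothetical phys_read64() helper that reads a 64-bit word from a physical address, and it skips the present, permission and no-execute bits that a real walk must check:

```c
#include <stdint.h>

/* Hypothetical helper: read the 64-bit word at a physical address. */
uint64_t phys_read64(uint64_t pa);

/* Walk the four-level hierarchy for a 4KB page, starting from CR3.
   Each entry holds the physical address of the next level in
   bits 51:12; each 9-bit index selects an 8-byte entry. */
static uint64_t translate(uint64_t cr3, uint64_t la)
{
    const uint64_t addr_mask = 0x000ffffffffff000ull; /* bits 51:12 */
    uint64_t pml4e = phys_read64((cr3   & addr_mask) + ((la >> 39) & 0x1ff) * 8);
    uint64_t pdpte = phys_read64((pml4e & addr_mask) + ((la >> 30) & 0x1ff) * 8);
    uint64_t pde   = phys_read64((pdpte & addr_mask) + ((la >> 21) & 0x1ff) * 8);
    uint64_t pte   = phys_read64((pde   & addr_mask) + ((la >> 12) & 0x1ff) * 8);
    return (pte & addr_mask) | (la & 0xfff);
}
```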
When executing in a legacy, non-64-bit mode, another level of address translation is added: segmentation. Instead of code directly specifying a linear address, it works with logical addresses. Segmentation converts logical addresses to linear ones, and the linear address is then converted to a physical address as described above. Logical addressing is performed using a segment selector register (CS, DS, ES, SS, FS or GS). The segment selector references an entry in either the Global Descriptor Table (GDT) or Local Descriptor Table (LDT). The segment descriptors in these tables specify a linear base address that is added to the logical offset (as well as offset limits, and some other security info).
Actually, I kinda lied, just so that segmentation could be ignored to start off with. The FS and GS segment bases do actually work in 64-bit mode, but apart from that, segmentation is not used.
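In code, the legacy logical-to-linear step amounts to a limit check plus an add. A minimal sketch, using an assert where the CPU would raise a fault:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a logical (segment:offset) address to a linear one, using
   the base and limit from the selected segment's descriptor. */
static uint32_t logical_to_linear(uint32_t seg_base, uint32_t seg_limit,
                                  uint32_t offset)
{
    assert(offset <= seg_limit); /* the CPU raises an exception here */
    return seg_base + offset;
}
```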
Looking up all these tables for every memory access would be extremely slow, so instead the results are cached. Two types of caching can be used. Translation Lookaside Buffers (TLBs) cache information for individual pages. Paging-structure caches can also cache the upper levels of the paging hierarchy.
Just as the CPUID instruction can be used to determine dcache and icache configurations, it can also be used to find out the TLB setup. For example, my CPU has the following TLB-related cache info (from CPUID, EAX=2),

| Configuration Descriptor | Meaning |
|---|---|
| 03 | data TLB, 4KB pages, 4-way set associative, 64 entries |
| 5a | data TLB, 2MB/4MB pages, 4-way set associative, 32 entries |
| b2 | code TLB, 4KB pages, 4-way set associative, 64 entries |
| 55 | code TLB, 2MB/4MB pages, fully associative, 7 entries |
| ca | unified 2nd-level TLB, 4KB pages, 4-way set associative, 512 entries |
Notice that small and large pages have separate TLBs.
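If you want to poke at this yourself, here's a minimal sketch using the GCC/Clang <cpuid.h> header to dump the raw leaf 2 descriptor bytes that the table above decodes (decoding them is an exercise best done with the Intel manual open; very recent CPUs report TLB info via other leaves):

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 2: cache and TLB configuration descriptors, packed
       one byte per descriptor into the four result registers. */
    __cpuid(2, eax, ebx, ecx, edx);
    printf("%08x %08x %08x %08x\n", eax, ebx, ecx, edx);
    return 0;
}
```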
Caching of segment base and limit values is done using hidden registers that are loaded only when the segment selector changes. For example, an LES instruction that loads the ES segment selector also performs the GDT/LDT lookup and sets the hidden ESBASE and ESLIMIT registers. There is no specialized memory cache for the GDT and LDT tables as there is for the page tables (the TLB).
All these x86 details are well documented these days (see “Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, System Programming Guide, Part 1”).
Cell

The PowerPC architecture used by the Cell splits translation into two steps: a 64-bit “effective address” (EA) specified by code is first converted to a “virtual address” (VA), and the virtual address is then converted to a “real address” (RA), the PowerPC term for a physical address. The first step is a form of segmentation: the most significant bits of the EA form an effective segment ID (ESID), which is mapped to a virtual segment ID (VSID) to form the virtual address.

A Segment Lookaside Buffer (SLB) is used to cache ESID to VSID mappings. The Cell actually does away with segment registers completely and uses just the SLB cache. Without segment registers, the SLB is completely under software control. The OS handles SLB misses and replaces an appropriate SLB entry so that effective to virtual translation can continue.
The Cell contains 64 SLB entries per PPE hardware thread, and 8 per SPE. Yes, each SPE's DMA controller (aka the memory flow controller (MFC)) uses the same memory translation as the PPE, though this is hidden from the SPU code; instead, the PPU code configures each SPE's built-in memory management unit (MMU).
Once the SLB has been looked up to form the 80-bit virtual address, this is then converted by a paging mechanism to a 62-bit real address.
We saw that the x86 uses a hierarchy of tables to do its mapping to a physical address; the PowerPC architecture takes an interestingly different approach here, with a single flat, hashed page table (HTAB) structure.
Page table entries (PTEs) are organized into groups of 8, called page table entry groups (PTEGs). The HTAB is an array of PTEGs. The most significant bits of the VA form a virtual page number (VPN). The VPN is transformed via two hash functions, a primary and a secondary. Each hashed value gives the index of a PTEG within the HTAB, so there are 16 possible PTEs that are searched to find one that matches the VPN. Because of hash collisions, not all valid PTEs can always be stored in the HTAB at the same time. So if the CPU can't find a PTE that matches the VPN, a page fault exception is triggered so the OS can swap in the correct PTE.
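Here's a hedged sketch of that search, with stand-in types and a stand-in primary hash (the real PowerPC hash folds the VPN differently, though the secondary hash really is the complement of the primary):

```c
#include <stdint.h>
#include <stddef.h>

#define PTES_PER_GROUP 8

typedef struct { uint64_t vpn; uint64_t rpn; int valid; } Pte;
typedef struct { Pte pte[PTES_PER_GROUP]; } Pteg;

/* Stand-in hash functions, just for illustration. */
static uint64_t primary_hash(uint64_t vpn)   { return vpn ^ (vpn >> 11); }
static uint64_t secondary_hash(uint64_t vpn) { return ~primary_hash(vpn); }

/* Search the (power-of-two sized) HTAB for a PTE matching the VPN:
   8 PTEs in the primary group, then 8 in the secondary group. */
static const Pte *htab_lookup(const Pteg *htab, uint64_t pteg_index_mask,
                              uint64_t vpn)
{
    uint64_t hashes[2] = { primary_hash(vpn), secondary_hash(vpn) };
    for (int h = 0; h < 2; h++) {
        const Pteg *group = &htab[hashes[h] & pteg_index_mask];
        for (int i = 0; i < PTES_PER_GROUP; i++)
            if (group->pte[i].valid && group->pte[i].vpn == vpn)
                return &group->pte[i];
    }
    return NULL; /* no match: the CPU raises a page fault */
}
```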
Like the x86 architecture, a TLB is used to cache PTEs. In the case of the Cell, there is a single 1024-entry, 4-way set associative TLB shared by both of the PPE hardware threads. There is also a 256-entry, 4-way set associative TLB per SPE.
The hardware HTAB lookups can actually be disabled, using only the TLB for address translation. This allows software control of loading PTEs into the TLB, in a similar manner to the software-managed SLB. This is not part of the standard PowerPC architecture, but a Cell-specific extension. I'm not 100% sure, but I think the Playstation 3 does use the hardware HTAB.
To speed things up further, the PPE contains another level of caching on top of the SLB and TLB. Effective to real address translation (ERAT) entries directly cache the EA to RA mappings, bypassing the virtual address step completely when there is an ERAT hit.
The PPE contains a data ERAT (D-ERAT) and instruction ERAT (I-ERAT). Both of these are 64-entry, 2-way set associative caches, and are shared by both hardware threads.
ERAT hits are obviously good, but misses suck more than you might first guess. My previous post mentioned how dcache misses don't pay their cost on the load instruction itself, but rather stall the first instruction to use the destination register of the load. Unfortunately an ERAT miss doesn't have this same luxury, and will stall on the load. This is even true for cache prefetches like DCBT, and is one of the reasons why you should always profile prefetching code, and never prefetch NULL. It is not an “error” to prefetch NULL, you won't get an exception, but you will get a D-ERAT miss which can slow things down quite a bit. If a pointer may be NULL, make sure you have a branch around any prefetch, or branch-free select another “safe” pointer like the stack.
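For example, something along these lines (using the portable GCC/Clang prefetch builtin as a stand-in for DCBT; a good compiler can turn the conditional into a branch-free select, but check the generated code):

```c
/* Prefetch 'p', but substitute a known-mapped address (here, a local
   on the stack) when 'p' is NULL, avoiding a D-ERAT miss. */
static inline void safe_prefetch(const void *p)
{
    int safe; /* the stack is pretty much guaranteed to be mapped */
    __builtin_prefetch(p ? p : (const void *)&safe);
}
```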
The “PPC Version 2.02 Book III”, “Chapter 4. Storage Control”, has a detailed description of the PowerPC architecture if you're interested in more details, such as the page sizes and the hash functions. (As an aside, the Intel docs seem to indicate that the x86 caches also use linear address indexing and physical address tagging.)
Well, that concludes a two-post whirlwind tour of memory caching. Hope you found it interesting, though I guess there's only a handful of people like me who get really excited by this stuff :). Not to completely sell out and try to get the blog page hit counts up, but I'm planning some more “practical” topics for the next couple of posts.