Example: Accessing An Array
Understand the operation and performance of a TLB by following the example explained in this lesson.
To make clear the operation of a TLB, let’s examine a simple virtual address trace and see how a TLB can improve its performance. In this example, let’s assume we have an array of 10 4-byte integers in memory, starting at virtual address 100. Assume further that we have a small 8-bit virtual address space, with 16-byte pages; thus, a virtual address breaks down into a 4-bit VPN (there are 16 virtual pages) and a 4-bit offset (there are 16 bytes on each of those pages).
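To make the split concrete, here is a minimal sketch (added for illustration, not part of the original text) of how the 4-bit VPN and 4-bit offset can be pulled out of an 8-bit virtual address with a shift and a mask, using the address 100 from this example:

#include <stdio.h>

int main(void) {
    int va = 100;              // virtual address of the first array element
    int vpn = va >> 4;         // top 4 bits of the 8-bit address: the VPN
    int offset = va & 0xF;     // bottom 4 bits: the offset within the page
    printf("VA %d -> VPN %d, offset %d\n", va, vpn, offset);
    return 0;
}

Running it prints VA 100 -> VPN 6, offset 4, which is exactly where the first array element lives.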
The figure below shows the array laid out on the 16 16-byte pages of the system. As you can see, the array's first entry (a[0]) begins on (VPN=06, offset=04); only three 4-byte integers fit onto that page. The array continues onto the next page (VPN=07), where the next four entries (a[3] ... a[6]) are found. Finally, the last three entries of the 10-entry array (a[7] ... a[9]) are located on the next page of the address space (VPN=08).
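As a quick check of this layout, the following sketch (not part of the lesson; it assumes 4-byte integers starting at virtual address 100 and 16-byte pages, as stated above) computes the VPN and offset of each element:

#include <stdio.h>

int main(void) {
    for (int i = 0; i < 10; i++) {
        int va = 100 + 4 * i;                             // address of a[i]
        printf("a[%d]: VA=%d VPN=%02d offset=%02d\n", i, va, va / 16, va % 16);
    }
    return 0;
}

Its output places a[0] ... a[2] on VPN=06, a[3] ... a[6] on VPN=07, and a[7] ... a[9] on VPN=08, matching the layout described above.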
Example
Now let’s consider a simple loop that accesses each array element, something that would look like this in C:
int sum = 0;
for (i = 0; i < 10; i++) {
    sum += a[i];
}
For the sake of simplicity, we will pretend that the only memory accesses the loop generates are to the array (ignoring the variables i and sum, as well as the instructions themselves). When the first array element (a[0]) is accessed, the CPU will see a load to virtual address 100. The hardware extracts the VPN from this (VPN=06) and uses that to check the TLB for a valid translation. Assuming this is the first time the program accesses the array, the result will be a TLB miss.
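The per-access flow just described can be sketched in code; the names below (translate, tlb_valid, tlb_pfn, page_table) are invented for this illustration, the TLB is modeled as a simple table indexed by VPN, and protection and fault checks are omitted:

#include <stdio.h>
#include <stdbool.h>

#define NUM_PAGES 16                       // 16 virtual pages in the 8-bit address space

bool tlb_valid[NUM_PAGES];                 // toy TLB: one possible entry per VPN
int  tlb_pfn[NUM_PAGES];
int  page_table[NUM_PAGES];                // the page table, set up below

int translate(int va) {
    int vpn = va >> 4, offset = va & 0xF;  // split the virtual address
    if (!tlb_valid[vpn]) {                 // TLB miss: consult the page table...
        printf("VPN %d: miss\n", vpn);
        tlb_pfn[vpn] = page_table[vpn];    // ...and install the translation
        tlb_valid[vpn] = true;
    } else {
        printf("VPN %d: hit\n", vpn);      // TLB hit: translation already cached
    }
    return (tlb_pfn[vpn] << 4) | offset;   // form the physical address
}

int main(void) {
    for (int i = 0; i < NUM_PAGES; i++)
        page_table[i] = i;                 // placeholder mapping for the demo
    translate(100);                        // the load of a[0]: prints "VPN 6: miss"
    return 0;
}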
The next access is to a[1], and there is some good news here: a TLB hit! Because the second element of the array is packed next to the first, it lives on the same page; since we already accessed this page when accessing the first element of the array, the translation is already loaded into the TLB, which is why this access hits. Access to a[2] encounters similar success (another hit) because it too lives on the same page as a[0] and a[1].
Unfortunately, when the program accesses a[3], we encounter another TLB miss. However, once again, the next entries (a[4] ... a[6]) will hit in the TLB, as they all reside on the same page in memory.
Finally, access to a[7] causes one last TLB miss. The hardware once again consults the page table to figure out the location of this virtual page in physical memory and updates the TLB accordingly. The final two accesses (a[8] and a[9]) receive the benefits of this TLB update; when the hardware looks in the TLB for their translations, two more hits result.
Performance due to spatial locality
Let us summarize TLB activity during our ten accesses to the array: miss, hit, hit, miss, hit, hit, hit, miss, hit, hit. Thus, our TLB hit rate, which is the number of hits divided by the total number of accesses, is 70%. Although this is not too high (indeed, we desire hit rates that approach 100%), it is non-zero, which may be a surprise. Even though this is the first time the program accesses the array, the TLB improves performance due to spatial locality. The elements of the array are packed tightly into pages (i.e., they are close to one another in space), and thus only the first access to an element on a page yields a TLB miss.
Also, note the role that page size plays in this example. If the page size had simply been twice as big (32 bytes, not 16), the array access would suffer even fewer misses. As typical page sizes are more like 4KB, these types of dense, array-based accesses achieve excellent TLB performance, encountering only a single miss per page of accesses.
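To put numbers on the page-size point, each new page the scan touches costs exactly one miss, so counting distinct pages touched gives the miss count. A small sketch (added here for illustration, reusing the example's starting address of 100):

#include <stdio.h>

// Misses for the 10-element scan = number of distinct pages touched.
int misses_for_page_size(int page_size) {
    int misses = 0, last_vpn = -1;
    for (int i = 0; i < 10; i++) {
        int vpn = (100 + 4 * i) / page_size;
        if (vpn != last_vpn) {             // first access to a new page: a miss
            misses++;
            last_vpn = vpn;
        }
    }
    return misses;
}

int main(void) {
    printf("16-byte pages: %d misses\n", misses_for_page_size(16));    // 3
    printf("32-byte pages: %d misses\n", misses_for_page_size(32));    // 2
    printf("4KB pages:     %d misses\n", misses_for_page_size(4096));  // 1
    return 0;
}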
Performance due to temporal locality
One last point about TLB performance: if the program, soon after this loop completes, accesses the array again, we’d likely see an even better result, assuming that we have a big enough TLB to cache the needed translations: hit, hit, hit, hit, hit, hit, hit, hit, hit, hit. In this case, the TLB hit rate would be high because of temporal locality, i.e., the quick re-referencing of memory items in time. Like any cache, TLBs rely upon both spatial and temporal locality for success, which are program properties. If the program of interest exhibits such locality (and many programs do), the TLB hit rate will likely be high.
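A short self-contained simulation can reproduce both results. It is only a sketch (added here, not from the lesson): the TLB is modeled as nothing more than a valid bit per VPN, and it is assumed to be large enough to hold all three translations between passes:

#include <stdio.h>
#include <stdbool.h>

bool tlb_valid[16];                        // toy TLB, indexed by VPN, initially empty

int run_pass(void) {
    int hits = 0;
    for (int i = 0; i < 10; i++) {         // the loop's ten accesses, a[0] ... a[9]
        int vpn = (100 + 4 * i) >> 4;
        if (tlb_valid[vpn]) hits++;        // translation already cached: a hit
        else tlb_valid[vpn] = true;        // miss: install the translation
    }
    return hits;
}

int main(void) {
    printf("first pass:  %d/10 hits\n", run_pass());   // 7/10: spatial locality
    printf("second pass: %d/10 hits\n", run_pass());   // 10/10: temporal locality
    return 0;
}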
TIP: USE CACHING WHEN POSSIBLE
Caching is one of the most fundamental performance techniques in computer systems, one that is used again and again to make the “common-case fast” (see “Computer Architecture: A Quantitative Approach” by John Hennessy and David Patterson, Morgan-Kaufmann, 2006, a great book about computer architecture; we have a particular attachment to the classic first edition).

The idea behind hardware caches is to take advantage of locality in instruction and data references. There are usually two types of locality: temporal locality and spatial locality. With temporal locality, the idea is that an instruction or data item that has been recently accessed will likely be re-accessed soon in the future. Think of loop variables or instructions in a loop; they are accessed repeatedly over time. With spatial locality, the idea is that if a program accesses memory at address x, it will likely soon access memory near x. Imagine here streaming through an array of some kind, accessing one element, and then the next. Of course, these properties depend on the exact nature of the program and thus are not hard-and-fast laws but more like rules of thumb.
Hardware caches, whether for instructions, data, or address translations (as in our TLB), take advantage of locality by keeping copies of memory in small, fast on-chip memory. Instead of having to go to a (slow) memory to satisfy a request, the processor can first check if a nearby copy exists in a cache; if it does, the processor can access it quickly (i.e., in a few CPU cycles) and avoid spending the costly time it takes to access memory (many nanoseconds).
You might be wondering: if caches (like the TLB) are so great, why don’t we just make bigger caches and keep all of our data in them? Unfortunately, this is where we run into more fundamental laws like those of physics. If you want a fast cache, it has to be small, as issues like the speed of light and other physical constraints become relevant. Any large cache by definition is slow, and thus defeats the purpose. Thus, we are stuck with small, fast caches; the question that remains is how to best use them to improve performance.