Virtual Memory
Machines have a limited amount of memory, and multiple processes running at the same time can cause issues if memory access is not managed correctly. Another problem is that there might not be enough memory available for all processes at the same time. How should the operating system (OS) handle this?
You could imagine a solution in which, when a process P1 is executed, the entire physical memory is made available to P1. Then, when a different process P2 is executed, the whole memory state is stored on disk, giving full memory access to P2. This approach is far too slow, especially as memory grows.
The OS solves these and other problems by creating an abstraction of physical memory called virtual memory. Before explaining this abstraction, we need to understand how each process sees the memory available to it.
Address Space
Each program has its own address space where its memory state is stored. The space is divided into different regions:
- Text (program code): contains the instructions of the program.
- Data: holds static data such as constants and global variables that are explicitly initialised.
- Heap: is the area of memory that can be dynamically allocated at runtime.
- Stack: is an area that grows and shrinks when functions are executed. Each time a function is called a new stack frame is placed at the top to hold local variables, arguments, and return values. When the function returns, the frame is removed.
In virtual memory, the OS passes the illusion to the process that its address space starts at 0 and comprises the entire physical memory size. As we will see in the following sections, this is just an abstraction. The operating system has to maintain certain data structures and perform specific operations to ensure that the abstraction holds for all running processes when, in reality, the physical memory is being shared by them.
Goals of Virtual Memory
The main goals of virtual memory are:
- Transparency: the running program should not need to know that memory is being virtualised. The program behaves as if it had its own private physical memory.
- Efficiency: the OS should make the virtualisation as efficient as possible, with minimal impact on programs’ speed, and without using too much memory for supporting the abstraction.
- Protection: the OS should make sure that processes cannot access memory that is allocated to a different process. This means that memory is isolated between processes.
Address Translation
During a program’s execution, the virtual memory addresses need to be translated into the corresponding physical addresses. This is a task shared by the hardware and operating system.
Dynamic (hardware-based) Relocation
This idea is often referred to as base and bounds. Within each CPU there are two registers: a base register and a bounds register. When a program starts, the OS decides where in physical memory the address space of that program should start and sets the base register to that value. When a given location needs to be translated, the processor can compute the physical address via the following expression:
physical address = virtual address + base
The bounds register is used to check whether or not the address is within the program’s address space, and ensure the program doesn’t access data outside its allowed region. The value stored in this register can be either the size of the address space or the physical address that represents the end of the address space. In the event illegal access is detected, the CPU raises an exception which will most likely result in the program being terminated.
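Putting the translation and the bounds check together, a minimal sketch in Rust might look like this (the register values in the usage are invented for illustration; here the bounds register holds the size of the address space):

```rust
// Sketch of base-and-bounds translation. The fault type and the
// numbers used below are illustrative, not from any real system.
fn translate(vaddr: u64, base: u64, bounds: u64) -> Result<u64, &'static str> {
    // `bounds` holds the size of the address space in this variant.
    if vaddr >= bounds {
        // The CPU would raise an exception here.
        return Err("protection fault");
    }
    Ok(vaddr + base)
}

fn main() {
    // A process loaded at physical address 32 KiB with a 16 KiB address space.
    let (base, bounds) = (32 * 1024, 16 * 1024);
    assert_eq!(translate(0x100, base, bounds), Ok(32 * 1024 + 0x100));
    // An access past the end of the address space is rejected.
    assert!(translate(20 * 1024, base, bounds).is_err());
}
```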
The following picture shows what physical memory looks like when multiple processes are running.
Hardware Support
The hardware must provide the aforementioned registers to support address translation. They are part of what is usually referred to as the memory management unit (MMU), although modern MMUs don’t exactly follow the base and bounds model.
Also, there must be privileged instructions, executable only by the operating system, for changing the values of these registers. This works because the OS runs in privileged mode, and the CPU must have a way of knowing which mode it is currently in.
The CPU also needs to raise exceptions when a process attempts illegal memory access. The kernel should then take over via an exception handler to figure out how to react accordingly.
Operating System Support
The operating system must also provide new functionality to support this dynamic relocation. One of the things it needs to track is which parts of memory are currently in use and which are free, so that when a new process starts, its address space can be mapped to a free region and that region marked as in use. When the process finishes executing, the region should be marked as free again. Usually, this is tracked in a free list.
Another aspect that requires OS support is the context switch. The values stored in the base and bounds registers need to be saved somewhere and replaced by those of the incoming process. The structure used for storing this data is commonly known as the process control block (PCB).
Segmentation
The mechanism described above wastes memory because the free space between the stack and the heap cannot be used by any other process. Another issue with this approach is that it can be quite hard to run a process whose entire address space does not fit into memory.
One idea for overcoming this issue is called segmentation. Instead of having just one base and bounds pair for the address space, there is a pair for each logical segment of the address space: stack, heap, and code. Since the limits are now independent, these segments can be placed in different regions of physical memory without taking space that is currently unused.
The same checks apply when a process tries to access an illegal address: a segmentation fault is raised when the illegal access is attempted.
Referencing Segments
The virtual address is split into two parts to make it possible to identify which segment the address belongs to.
In this example (assuming a 12-bit virtual address), the top two bits identify the segment and the remaining 10 bits contain the offset within that segment. As an example, consider the following pseudocode for accessing memory:
// extract the segment bits from the top of the virtual address
let segment = (vaddr & SEG_MASK) >> SEG_SHIFT;
// the remaining bits are the offset within that segment
let offset = vaddr & OFFSET_MASK;
if offset >= bounds[segment] {
    // offset past the segment's limit: illegal access
    return Err(ProtectionFault);
} else {
    let phys_addr = base[segment] + offset;
    let register = access_memory(phys_addr);
}
Sharing segments
Processes might share the same code (from libraries, for example), and it would be useful to share the respective segments to save space. Protection bits are used for this purpose: the OS can mark segments as read-only, for example, and safely share the code segments of a library among the different programs currently using it.
Free-Space Management
Over time, free space gets split into pieces of different sizes, and a piece might not be big enough to satisfy a given request for more memory. Memory fragmented in this way is said to suffer from external fragmentation. For example, consider the following image:
In this case, there are 3 chunks of 3 KiB available, but a request for 8 KiB will fail: there are 9 KiB of free space, but they are not contiguous, so the request cannot be satisfied.
Another type of fragmentation, called internal fragmentation can happen when memory is divided into fixed sizes and the allocator hands out chunks that are bigger than what was requested. The unused space constitutes the internal fragmentation.
Strategies
There are several different strategies for managing free space, and in the following sections, we will learn more about some of them.
Best Fit
The best fit strategy is simple: find the chunks of memory that are as big as or bigger than the requested size and return the smallest of the candidates.
The idea is that by always returning a chunk as close as possible to the requested size, less space is wasted. The drawback is the cost of performing an exhaustive search over all potential candidates.
Worst Fit
The worst fit approach is the opposite: find the largest chunk, split it, return the requested amount, and keep the remaining (large) chunk on the free list.
The idea is to leave large chunks free instead of lots of small chunks. Similarly to the best-fit approach, this one can be costly when searching for possible candidates.
First Fit
Instead of searching for all possible candidates, the first fit method returns the requested size from the first chunk that can satisfy the request while keeping the remaining space in the free list.
In this case, it does not pay the cost of a full scan of the free list to find all candidates, but the beginning of the list can become polluted with lots of small objects.
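As a sketch of first fit over a list of (start, size) chunks (illustrative only; a real allocator would also coalesce neighbouring chunks on free, which is omitted here):

```rust
// A minimal first-fit free list. Each entry is (start, size).
fn first_fit(free_list: &mut Vec<(usize, usize)>, request: usize) -> Option<usize> {
    for i in 0..free_list.len() {
        let (start, size) = free_list[i];
        if size >= request {
            if size == request {
                // exact fit: consume the whole chunk
                free_list.remove(i);
            } else {
                // split: hand out the front, keep the remainder on the list
                free_list[i] = (start + request, size - request);
            }
            return Some(start);
        }
    }
    None // no single chunk can satisfy the request
}

fn main() {
    let mut free = vec![(0, 4), (10, 8), (30, 2)];
    // The first chunk big enough is the 8-unit one at 10.
    assert_eq!(first_fit(&mut free, 6), Some(10));
    assert_eq!(free, vec![(0, 4), (16, 2), (30, 2)]);
    // 8 units remain free in total, but none contiguous: external fragmentation.
    assert_eq!(first_fit(&mut free, 9), None);
}
```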
Next Fit
In next fit, the allocator keeps a pointer to the last location it examined. This pointer is used in future lookups to spread the search for free space across the list and avoid the splintering of the beginning of the list mentioned for the first fit method.
Segregated Lists
In this method, the allocator maintains separate free lists for different chunk sizes. When a request for space arrives, the free list that manages blocks that are as close as possible to the requested size is used to fulfil the request. The slab allocator follows this principle with the addition of pre-initialising objects in the segregated free lists to improve performance.
Buddy Allocation
This approach tries to minimise the cost of coalescing space when it is returned. The binary buddy allocator recursively splits a chunk of memory in half until the chunk is just big enough to satisfy the requested size. When a block is freed, the allocator tries to coalesce it with its buddy (the other half from the splitting phase), and this process repeats recursively until a buddy that is still in use is found.
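Two of the computations behind a binary buddy allocator can be sketched as follows (helper names are mine; the buddy trick relies on the fact that, after splitting, every block of size S is S-aligned, so the buddy's address differs in exactly one bit):

```rust
// Requests are rounded up to the next power of two.
fn round_up_pow2(n: usize) -> usize {
    n.next_power_of_two()
}

// A block's buddy is found by flipping the bit corresponding to the
// block size; this only works because blocks of size S are S-aligned.
fn buddy_of(block_addr: usize, block_size: usize) -> usize {
    block_addr ^ block_size
}

fn main() {
    // A 3000-byte request gets a 4096-byte block (internal fragmentation).
    assert_eq!(round_up_pow2(3000), 4096);
    // The buddy of the 4 KiB block at 8192 is the 4 KiB block at 12288,
    // and vice versa, which makes coalescing cheap to compute.
    assert_eq!(buddy_of(8192, 4096), 12288);
    assert_eq!(buddy_of(12288, 4096), 8192);
}
```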
Paging
Memory can become fragmented when it is split into variable-sized pieces, which is the case when segmentation is used.
A different approach that can be used is to divide memory into fixed-size pieces. This is the idea behind paging, where each fixed-size unit is called a page. The corresponding fixed-size block in physical memory is called a page frame.
One of the benefits of this approach is the flexibility it brings. There is no need to make assumptions about how a process uses its address space, which way the heap and stack grow, or how they are used.
This approach also simplifies how the OS manages free space. When a process requests more memory, the OS can simply find the first few free pages that satisfy the request and return them.
The operating system must record the mapping from each virtual page to its page frame and perform address translation when memory is referenced. The mappings are stored in a per-process page table. The virtual address must reserve bits for identifying the virtual page number (VPN), which is then used to translate to the corresponding frame number. For more information about the representation of a page table entry, see [2].
The following pseudocode shows an example of how the memory access might occur:
// extract the virtual page number from the virtual address
let vpn = (vaddr & VPN_MASK) >> SHIFT;
// index into the page table to find this page's entry
let pte_addr = page_table_base_register + vpn * sizeof(PageTableEntry);
let pte = access_memory(pte_addr);
if !pte.is_valid {
    return Err(SegmentationFault);
} else if !can_access(pte.protect_bits) {
    return Err(ProtectionFault);
} else {
    // combine the frame number with the offset within the page
    let offset = vaddr & OFFSET_MASK;
    let phys_addr = (pte.frame_number << PFN_SHIFT) | offset;
    let register = access_memory(phys_addr);
}
Translation-lookaside Buffer (TLB)
Paging requires an extra memory access to consult the page table before obtaining the frame number where a given virtual address resides. This extra access has an evident performance impact, and to overcome it the hardware contains a translation-lookaside buffer (TLB), which caches translations from virtual addresses to physical addresses.
The purpose of introducing a cache is to take advantage of temporal locality: if a piece of data has been accessed recently, it will likely be accessed again soon.
The pseudocode presented above can be adapted to consult the TLB before attempting to fetch the page table address:
let vpn = (vaddr & VPN_MASK) >> SHIFT;
let (tlb_entry, ok) = tlb_lookup(vpn);
if ok {
    // TLB hit: translate directly from the cached entry
    if can_access(tlb_entry.protection_bits) {
        let offset = vaddr & OFFSET_MASK;
        let phys_addr = (tlb_entry.frame_number << PFN_SHIFT) | offset;
        let register = access_memory(phys_addr);
    } else {
        return Err(ProtectionFault);
    }
} else {
    // TLB miss: walk the page table
    let pte_addr = page_table_base_register + vpn * sizeof(PageTableEntry);
    let pte = access_memory(pte_addr);
    if !pte.is_valid {
        return Err(SegmentationFault);
    } else if !can_access(pte.protect_bits) {
        return Err(ProtectionFault);
    } else {
        // cache the translation and retry the instruction, which will now hit
        tlb_insert(vpn, pte.frame_number, pte.protection_bits);
        retry_instruction();
    }
}
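As a toy model of the hit/miss behaviour (a HashMap stands in for the small, fully associative hardware structure; the names are illustrative):

```rust
use std::collections::HashMap;

// A toy TLB mapping VPN -> frame number. Real TLBs are tiny hardware
// caches; this only models lookup and insertion.
struct Tlb {
    entries: HashMap<u64, u64>,
}

impl Tlb {
    fn new() -> Self {
        Tlb { entries: HashMap::new() }
    }
    fn lookup(&self, vpn: u64) -> Option<u64> {
        self.entries.get(&vpn).copied()
    }
    fn insert(&mut self, vpn: u64, pfn: u64) {
        self.entries.insert(vpn, pfn);
    }
}

fn main() {
    let mut tlb = Tlb::new();
    assert_eq!(tlb.lookup(5), None); // miss: walk the page table instead
    tlb.insert(5, 42); // cache the translation found during the walk
    assert_eq!(tlb.lookup(5), Some(42)); // later accesses to VPN 5 hit
}
```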
Some of the issues that should be addressed by a TLB are the following:
- Context-switching: different processes can have the same virtual address pointing to different frames. The TLB needs to distinguish which virtual address to use based on an identifier for the currently running process. This is usually handled by having an address-space identifier that distinguishes such entries in a similar way that a process identifier would.
- Replacement policy: when the TLB is full and a new translation needs to be added, which entry should be evicted? The available policies are similar to those found in other caching systems, LRU being one example.
Page Table Size
Imagine a system with a 32-bit address space and linear page tables, where each page is 4 KiB (2^12 bytes) in size and each page table entry takes 4 bytes. Such an address space has around one million pages (2^32 / 2^12 = 2^20), so the page table would take about 4 MiB. Add to that the fact that the system runs multiple processes, and it is clear that a lot of space is needed to support this model.
An initial idea for solving this problem would be to increase the page size, to 16 KiB (2^14 bytes) for example. With page table entries still 4 bytes in size, the total size of the page table drops to 1 MiB (2^32 / 2^14 = 2^18 entries). The problem, though, is that bigger pages lead to more internal fragmentation (wasted space inside each page).
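The arithmetic in the two paragraphs above can be checked with a small helper (a throwaway function; the name is mine):

```rust
// Size of a linear page table: one entry per virtual page.
fn page_table_bytes(addr_bits: u32, page_bits: u32, pte_size: u64) -> u64 {
    let num_pages = 1u64 << (addr_bits - page_bits); // 2^(addr - page)
    num_pages * pte_size
}

fn main() {
    // 4 KiB pages (2^12): 2^20 entries * 4 bytes = 4 MiB per process.
    assert_eq!(page_table_bytes(32, 12, 4), 4 * 1024 * 1024);
    // 16 KiB pages (2^14): 2^18 entries * 4 bytes = 1 MiB per process.
    assert_eq!(page_table_bytes(32, 14, 4), 1024 * 1024);
}
```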
Segmentation + Paging
A hybrid approach might help in this case. Instead of having a page table for the entire address space, the system could hold one table for each segment. The base register contains the physical address of the page table for that segment, and the bounds register contains the value of the maximum valid page in that segment.
This solution is not perfect, though. It inherits the problems of segmentation, namely the lack of flexibility from assuming a particular usage pattern of the address space. It also reintroduces external fragmentation: memory is still managed in fixed-size units, but page tables can now vary in size depending on how many entries they have, and it might be complicated to find free space for them.
Multi-level Page Tables
The essence of the problem with paging is the amount of page table space wasted on invalid entries (addresses not currently used). It turns out that this waste can be reduced by adding a level of indirection via page table directories.
Each page table directory tracks which page tables are valid, and a page table that is not currently in use is simply not allocated. As an example, consider the following image:
You can see from the image above that multi-level page tables only allocate page-table space in proportion to the amount of address space that is currently in use, making it compact while supporting sparse address spaces.
Another difference between the linear approach and multi-level tables is that, in the linear case, the entire table must reside in a contiguous section of memory. In the 32-bit example presented before, it can be quite challenging to find 4 MiB of contiguous memory to hold the entire page table. In the multi-level case, there is no need for that amount of contiguous memory: when new space needs to be allocated, the system can grab the first free frame, add it to the directory, and let the process use it. The level of indirection permits the free frame to be anywhere.
As for the negative aspects of this approach, there is a time-space trade-off: it introduces an extra memory access on a TLB miss but, in return, saves much more space than the linear case. It is also undeniably more complex.
More than two levels
In the examples given so far, we have only considered the case where a single page directory was enough to map all page tables of a particular address space. Now, let’s examine a much bigger address space, as is the case in 64-bit architectures.
Modern architectures don’t use all 64 bits for memory addressing; they usually restrict it to 48 bits (256 TiB). With 4 KiB pages, we are left with 36 bits for the tables and the directory. A likely split would be 18 bits for the page table and the remaining 18 for the directory. This would mean a page directory with around 256 K entries (2^18), requiring about the same amount of space as an entire linear page table scheme in a 32-bit architecture (1 MiB): 2^18 entries * 4 bytes each.
To avoid all this unnecessary space, 64-bit architectures further split the remaining bits into multiple levels, usually four levels of 9 bits each. The process is the same: if some area of the address space is not in use, the levels leading to those addresses are not allocated.
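As an illustration, the index extraction for a four-level, 48-bit scheme with 4 KiB pages might look like this (the field widths follow the 9-9-9-9-12 split described above; the function name is hypothetical):

```rust
// Split a 48-bit virtual address into four 9-bit table indices and a
// 12-bit page offset, in the style of four-level paging.
fn split_vaddr(vaddr: u64) -> ([u64; 4], u64) {
    let offset = vaddr & 0xfff; // low 12 bits: offset within a 4 KiB page
    let mut indices = [0u64; 4];
    for level in 0..4 {
        // level 0 indexes the top-level directory, level 3 the last-level table
        let shift = 12 + 9 * (3 - level);
        indices[level] = (vaddr >> shift) & 0x1ff; // 9 bits per level
    }
    (indices, offset)
}

fn main() {
    // An address with index 1 at every level and offset 5 within the page:
    let vaddr = (1u64 << 39) | (1 << 30) | (1 << 21) | (1 << 12) | 5;
    assert_eq!(split_vaddr(vaddr), ([1, 1, 1, 1], 5));
}
```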
Inverted Page Tables
In the scheme explained above, a miss can require up to four memory accesses to find a given page. This might not be ideal, and a different approach using inverted page tables can be used instead.
In this approach, the system has a single page table with an entry for each physical page. Finding a page becomes a search through this structure, inspecting each entry’s metadata to find out which process is using the frame, and so on. An exhaustive search is not ideal, so the structure is turned into a hash table: virtual addresses are mapped to a slot via a hash function, and the standard collision resolution process is applied when pages from different processes map to the same slot.
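A sketch of the hashed lookup, keyed by process identifier and VPN (the structure and names are illustrative, not a real kernel's layout; the language's built-in hash map stands in for the hashed table with its collision handling):

```rust
use std::collections::HashMap;

// A toy inverted page table: one shared map for the whole system,
// keyed by (process id, VPN), with the frame number as the value.
struct InvertedPageTable {
    map: HashMap<(u32, u64), u64>,
}

impl InvertedPageTable {
    fn new() -> Self {
        InvertedPageTable { map: HashMap::new() }
    }
    fn map_page(&mut self, pid: u32, vpn: u64, frame: u64) {
        self.map.insert((pid, vpn), frame);
    }
    // The hash lookup replaces an exhaustive scan over all frames.
    fn translate(&self, pid: u32, vpn: u64) -> Option<u64> {
        self.map.get(&(pid, vpn)).copied()
    }
}

fn main() {
    let mut ipt = InvertedPageTable::new();
    ipt.map_page(1, 7, 3);
    ipt.map_page(2, 7, 9); // same VPN, different process, different frame
    assert_eq!(ipt.translate(1, 7), Some(3));
    assert_eq!(ipt.translate(2, 7), Some(9));
    assert_eq!(ipt.translate(3, 7), None); // unmapped: would page-fault
}
```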
Swap
Memory is not infinite, and at some point there will be a need to stash away portions of the address space that currently reside in physical pages but are not presently in use, to make room for parts of the address space that are needed immediately.
The hard disk can be used as a temporary storage location for such pages. Systems reserve a location on the hard disk for this purpose which is often referred to as the swap area.
As an example, imagine physical memory is full and a new region of an address space is accessed. The system will then evict a page from its frame, store it in the swap area, and place the new page in the now-empty frame.
When the page stashed in the swap space is referred to again, the system will notice the page is not present in physical memory (perhaps by reading a present bit), and fetch it from the disk. The act of accessing a page not present in physical memory is commonly referred to as a page fault.
Page-Replacement Policy
Similarly to what happens in the TLB, there must be a policy for evicting pages and replacing them with new ones. The same ideas apply here, as well.
One aspect worth noting is when the replacements should occur. You might expect that pages are replaced only when physical memory is completely full and a new page must be brought in. In reality, systems keep a small amount of memory free by maintaining a high watermark and a low watermark. Whenever the number of free pages drops below the low watermark, the system evicts pages to disk; the high watermark sets how many pages should be freed. This task is carried out in the background, by a swap daemon for example.
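The watermark logic can be sketched as follows (the page counts in the usage are invented; real systems track watermarks per memory zone and with more nuance):

```rust
// How many pages a background swap daemon might evict: once free pages
// drop below the low watermark, free enough to reach the high watermark.
fn pages_to_evict(free: usize, low: usize, high: usize) -> usize {
    if free < low {
        high - free // evict enough to reach the high watermark
    } else {
        0 // enough memory is free: nothing to do
    }
}

fn main() {
    assert_eq!(pages_to_evict(10, 16, 64), 54); // below low: top up to high
    assert_eq!(pages_to_evict(20, 16, 64), 0);  // above low: stay idle
}
```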
One of the benefits of evicting several pages at once is that the system can better utilise the disk by grouping the pages and issuing a single write, reducing seek and rotational overheads and thus increasing the efficiency of the disk.
Other Aspects
In this section, we will list various aspects that might be considered by a system when dealing with virtual memory.
If LRU is used as the page-replacement policy, it might be advantageous to make it scan-resistant, which helps workloads that, while scanning through many pages, refer back to a page accessed at the beginning of the scan. One way to implement this is to split the space into two sections (not necessarily of equal size): young and old. Eviction occurs at the head of the old segment. Whenever a page is added, it first goes into the tail of the old section. Each access moves the page to the head of the young segment, and recency decreases as you move towards the rear of the young part. This strategy is known as the midpoint insertion strategy.
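A miniature sketch of this scheme using two deques (the segment capacity and the demotion rule are illustrative choices, not a specific system's policy):

```rust
use std::collections::VecDeque;

// Midpoint insertion: new pages enter the tail of the old segment;
// an access promotes a page to the head of the young segment;
// eviction takes from the head of the old segment.
struct MidpointLru {
    young: VecDeque<u64>,
    old: VecDeque<u64>,
    young_cap: usize,
}

impl MidpointLru {
    fn new(young_cap: usize) -> Self {
        MidpointLru { young: VecDeque::new(), old: VecDeque::new(), young_cap }
    }
    // A newly added page first goes into the tail of the old section.
    fn add(&mut self, page: u64) {
        self.old.push_back(page);
    }
    // An access moves (or inserts) the page at the head of the young segment.
    fn access(&mut self, page: u64) {
        self.young.retain(|&p| p != page);
        self.old.retain(|&p| p != page);
        self.young.push_front(page);
        if self.young.len() > self.young_cap {
            // the least recent young page ages back into the old segment
            if let Some(demoted) = self.young.pop_back() {
                self.old.push_back(demoted);
            }
        }
    }
    // Eviction takes the page at the head of the old segment.
    fn evict(&mut self) -> Option<u64> {
        self.old.pop_front()
    }
}

fn main() {
    let mut lru = MidpointLru::new(2);
    lru.add(1);
    lru.add(2); // a scan adds pages, but only into the old segment
    lru.access(2); // page 2 is re-accessed and promoted to young
    assert_eq!(lru.evict(), Some(1)); // never re-accessed: evicted first
    assert_eq!(lru.evict(), None); // page 2 survives in the young segment
}
```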
Systems can also prefetch a page they consider likely to be accessed soon, in contrast with demand paging, which only brings a page into memory when it is accessed.
Several processes running at the same time, all using too much memory, can cause a condition called thrashing, in which the system swaps pages in and out far too often. To cope with this, some systems run only a limited number of processes and suspend the others. A more drastic measure, also employed by several operating systems, is to kill some memory-intensive processes to release memory.
Copy-on-write (COW) is a mechanism used to reduce space usage when two processes need the same pages (after fork, for example). Instead of copying the pages into the new address space, the operating system marks them read-only, and both processes use them as usual. When one process tries to write to a page, the OS recognises that COW is in place and only then makes a copy that can be modified without impacting the other process. If neither process ever writes to the page, the OS has saved space.
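The sharing-then-copying behaviour can be modelled in miniature with reference counting (`Rc::make_mut` clones the data only while it is still shared; this is an analogy for the idea, not how a kernel actually implements COW):

```rust
use std::rc::Rc;

type Page = Vec<u8>;

fn main() {
    // "Fork": the child shares the parent's page instead of copying it.
    let parent: Rc<Page> = Rc::new(vec![0u8; 4096]);
    let mut child = Rc::clone(&parent);

    // While the page is only read, both refer to the same allocation.
    assert!(Rc::ptr_eq(&parent, &child));

    // The first write from the child triggers the copy.
    Rc::make_mut(&mut child)[0] = 42;
    assert!(!Rc::ptr_eq(&parent, &child)); // the pages have now diverged
    assert_eq!(parent[0], 0);
    assert_eq!(child[0], 42);
}
```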
Demand zeroing, also referred to as lazy page allocation, is also worth noting. When a new page is added to the address space, the operating system maps it to an inaccessible entry in the page table. Only if the process then reads or writes the page does the OS find a real physical page, zero it, and map it into the process’s address space.
References
- [1] Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. 2018. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books.
- [2] Wiki.osdev.org. 2020. Paging - Osdev Wiki. [online] Available at: https://wiki.osdev.org/Paging#Page_Table [Accessed 9 July 2020].