Modern Operating Systems - Virtual Memory

Why Use Virtual Memory#

General operating systems typically run multiple processes that use the same physical memory resources. When the operating system manages memory, allowing processes to directly use physical memory can lead to several issues:

  • The space initially allocated to a process may prove insufficient, and later requests may not find contiguous free physical memory.
  • Since all processes share the same physical address space, Process A may write to addresses used by Process B.

For efficiency and security reasons, virtual memory is introduced: application processes read and write physical memory through virtual addresses, and the CPU's internal MMU is responsible for converting virtual addresses into specific physical addresses for corresponding read and write operations. The operating system is responsible for establishing a mapping from virtual addresses to physical addresses for each process. Each process can believe it has a contiguous and complete address space.

Most early systems used a segmented virtual memory management approach: the operating system divided the virtual address space and physical address space into segments of different sizes, such as code segments and data segments, and implemented the mapping from virtual addresses to physical addresses through a segment table. However, because segments vary in size, external fragmentation easily accumulates between physical segments.

  • The virtual address consists of: segment number + offset within the segment
  • The MMU finds the segment table using the segment table base register
  • The starting physical address of the corresponding segment is found in the segment table using the segment number
  • The physical address is obtained by adding the offset to the starting address of the physical segment
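The four steps above can be sketched as a small translation routine. This is an illustrative model only, not a real MMU interface; the segment numbers, base addresses, and limits below are made up:

```python
# Hypothetical sketch of segment-based address translation (not a real MMU API).
SEGMENT_TABLE = {
    # segment number -> (base physical address, segment length)
    0: (0x10000, 0x4000),   # e.g. a code segment
    1: (0x80000, 0x2000),   # e.g. a data segment
}

def translate_segmented(segment: int, offset: int) -> int:
    """Return the physical address for (segment number, offset within segment)."""
    base, limit = SEGMENT_TABLE[segment]
    if offset >= limit:          # bounds check: offset must fall inside the segment
        raise MemoryError("segmentation fault: offset beyond segment limit")
    return base + offset

assert translate_segmented(1, 0x10) == 0x80010
```

The bounds check is what makes segment limits enforceable in hardware: an out-of-range offset faults instead of silently touching another segment's memory.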

Virtual Memory Paging#

Modern operating systems typically manage virtual memory using a paging mechanism.

The operating system first divides the virtual address space and physical address space into smaller contiguous pages of equal size, and implements the mapping from virtual addresses to physical addresses through a page table. User processes can use the entire virtual address space, and the kernel maintains a page table for each process, so even if two processes use the same virtual address, there will be no conflict if the physical addresses mapped in the page table are different.

  • The virtual address consists of: page number + offset within the page
  • The MMU finds the corresponding process's page table using the page table base register
  • The starting address of the physical page corresponding to the page number is found in the page table
  • The physical address is obtained by adding the offset to the starting address of the physical page
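The same steps for paging can be sketched as follows; the page-table contents are invented for illustration, and a real page table is an array indexed by page number rather than a dictionary:

```python
PAGE_SIZE = 4096  # 4 KB pages
# Per-process page table: virtual page number -> physical page start address.
PAGE_TABLE = {0: 0x5000, 1: 0x9000, 7: 0x3000}

def translate_paged(vaddr: int) -> int:
    vpn = vaddr // PAGE_SIZE        # page number: the high bits of the virtual address
    offset = vaddr % PAGE_SIZE      # offset within the page: the low bits
    frame = PAGE_TABLE[vpn]         # the MMU looks this up in the current process's table
    return frame + offset

assert translate_paged(1 * PAGE_SIZE + 0x2A) == 0x9000 + 0x2A
```

Because every page has the same size, the division and modulo reduce to simple bit operations in hardware, unlike the per-segment limits of the segmented scheme.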

Multi-level Page Tables#

To compress the physical memory space occupied by page table entries (considering a 64-bit virtual address space with a page size of 4KB using a single page table), multi-level page tables are typically used. Multi-level page tables divide the page number part of the virtual address into several parts, each corresponding to a page table entry. For example, in the AArch64 architecture, the virtual address uses 48 valid bits (i.e., the virtual address space size is 2^48), where the lower 12 bits represent the offset within the page, and the remaining 36 bits are divided into 4 levels of page tables.

  • First, find the physical address of the level 0 page table through the TTBR0_EL1 register
  • Use bits 47-39 to find the entry in the level 0 page table: the physical address of the level 1 page table
  • Use bits 38-30 to find the entry in the level 1 page table: the physical address of the level 2 page table
  • Use bits 29-21 to find the entry in the level 2 page table: the physical address of the level 3 page table
  • Use bits 20-12 to find the entry in the level 3 page table: the corresponding physical page number
  • The actual physical address is obtained by adding the offset (bits 11-0) to the physical page number
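The bit layout above can be made concrete with a short sketch. The arithmetic at the top shows why a flat table is infeasible; the nested dictionaries are an illustrative stand-in for the four levels of physical page-table pages, not real hardware structures:

```python
# Why multi-level: a flat table for a 48-bit space with 4 KB pages would need
# 2**36 entries; at 8 bytes per entry that is 512 GB of page table per process.
assert 2**36 * 8 == 512 * 2**30

def split_va(vaddr: int):
    """Split a 48-bit AArch64 virtual address into 4 page-table indices + offset."""
    offset = vaddr & 0xFFF                 # bits 11-0: offset within the page
    l3 = (vaddr >> 12) & 0x1FF             # bits 20-12 (9 bits per level)
    l2 = (vaddr >> 21) & 0x1FF             # bits 29-21
    l1 = (vaddr >> 30) & 0x1FF             # bits 38-30
    l0 = (vaddr >> 39) & 0x1FF             # bits 47-39
    return l0, l1, l2, l3, offset

def walk(root, vaddr):
    """Walk nested dicts standing in for the 4 levels of page-table pages."""
    l0, l1, l2, l3, offset = split_va(vaddr)
    frame = root[l0][l1][l2][l3]           # a missing key at any level = page fault
    return frame + offset

# A single mapped page: VA 0 -> physical frame at 0x8000.
root = {0: {0: {0: {0: 0x8000}}}}
assert walk(root, 0x123) == 0x8123
```

Each level consumes 9 bits because a 4 KB page-table page holds 512 (2^9) eight-byte entries.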

Typically, in addition to storing physical addresses, page table entries also store some flags, such as whether the mapping is present (valid) and whether the page is dirty. During a multi-level lookup, if an entry at any level is marked not present, the query can be terminated immediately.

Not all page table entries at level i in a multi-level page table point to a level i+1 page table. If a process only uses 4KB of memory space, then only one entry in its level 0 page table truly points to a level 1 page table, only one entry in that level 1 page table points to a level 2 page table, and so on, so only 4 page-table pages are used in total. If the process instead uses a contiguous region of 4KB*64 (256KB), and those 64 pages happen to share the same upper three levels of page-table indices, still only 4 page-table pages are needed, which is very space-efficient. Refer to the diagram below:

(Figure: Address Translation)

Translation Lookaside Buffer (TLB) and Page Table Switching#

To improve translation efficiency, the MMU has a TLB hardware component that caches recently used mappings from virtual page numbers to physical page numbers. Like CPU caches, the TLB can be organized in multiple levels, such as an L1 instruction TLB, an L1 data TLB, and a unified L2 TLB. Generally, the number of entries a TLB can store is limited, on the order of a few thousand. During translation, the MMU first queries the TLB; on a hit it returns the cached mapping directly, and on a miss it performs the multi-level page table walk and writes the result back into the TLB. If the TLB is full, one entry is replaced according to the hardware's policy. Additionally, since multiple processes typically run on the operating system and different processes may use the same virtual addresses, the content cached in the TLB must be kept consistent with the current process's page table: the operating system actively flushes the TLB during process switching (page table switching).
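The lookup-miss-refill cycle can be sketched as a tiny software cache. This is a toy model: real TLBs are fixed-function hardware with set-associative organization, and the LRU replacement here merely stands in for "hardware policy":

```python
from collections import OrderedDict

class TLB:
    """Toy fully-associative TLB with LRU replacement (real hardware policies vary)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # virtual page number -> physical frame
        self.hits = self.misses = 0

    def translate(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)        # refresh this entry's LRU position
            return self.entries[vpn]
        self.misses += 1
        frame = page_table[vpn]                  # slow multi-level page table walk
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)     # evict the least recently used entry
        self.entries[vpn] = frame                # write the result back into the TLB
        return frame

    def flush(self):
        """What the OS does on a page-table switch to keep the TLB consistent."""
        self.entries.clear()
```

After a `flush()`, every translation misses until the working set is cached again, which is exactly the cost the PCID mechanism below tries to avoid.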

Considering that flushing (clearing) the TLB on every process switch leads to a burst of TLB misses when the newly scheduled process starts running, modern CPUs provide a process-tagging feature, such as the Process-Context Identifier (PCID) on x86-64 (amd64). The operating system can assign different PCIDs to different processes and write them into the page table base register and into TLB cache entries; even across a process switch, the TLB can then distinguish different processes' cached entries by comparing the PCID in the current page table base register with the PCID in each entry, without clearing the TLB. However, when modifying a page table, the operating system still needs to flush the affected TLB entries promptly to ensure consistency. The downside of this approach is that the number of TLB entries effectively available to a single process is further reduced.

In the AArch64 architecture, the kernel and user processes generally use different page table base registers, so there is no need to switch page tables when switching to kernel mode (executing system calls). In the x86-64 architecture, although there is only one page table base register, the kernel does not use a separate page table but maps its address space to the high part of each process's address space (effectively allowing multiple user processes to share the same kernel address space), so there is also no need to switch page tables.

Paging and Page Faults#

Note: In the following text, allocation refers to the process requesting a certain size of space from the operating system, which allocates a segment of space in the virtual address space for the process and returns its starting and ending addresses.

During the execution of an application process, a segment of memory space is pre-allocated (for example, using the brk or mmap system calls in Linux). Before any read or write operations are performed on these addresses, the operating system does not map them to real physical pages, meaning that the mapping of the physical pages corresponding to these addresses in the page table does not exist. Upon the first read or write, the CPU checks the page table and finds that the corresponding physical page does not exist, triggering a page fault. At this point, the CPU executes the page fault handler function pre-set by the operating system, finds a suitable physical page for it, and writes the address into the page table. In practice, to reduce the number of page fault triggers, the operating system may map physical pages for multiple adjacent virtual pages simultaneously when executing the page fault handling function.

A process may use more memory than the physical memory size during its execution. If a page fault is triggered in this situation, the operating system can free up physical pages by writing some of them to disk or other lower-level storage, making free physical pages available to the currently running process. This is called paging out: the operating system writes the contents of the chosen physical pages to disk, clears the corresponding page table entries, and records where on disk each page was saved. When a paged-out virtual page is accessed again, the operating system reads the saved content back into a physical page, modifies the process's page table, and remaps the virtual page to the new physical page. This is called paging in.
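The page-out/page-in cycle can be sketched with a fixed pool of frames and a dictionary standing in for the disk. All names are illustrative, and the FIFO victim selection is a deliberately simple stand-in for real replacement policies like LRU approximations:

```python
from collections import deque

NUM_FRAMES = 2             # tiny physical memory: two frames
page_table = {}            # vpn -> frame index (mapped pages only)
frames = {}                # frame index -> page contents
free_frames = deque(range(NUM_FRAMES))
resident = deque()         # FIFO order of resident virtual pages
swap = {}                  # vpn -> contents written out to "disk"

def handle_fault(vpn):
    """Map vpn to a frame, paging out the oldest resident page if necessary."""
    if not free_frames:
        victim = resident.popleft()
        vf = page_table.pop(victim)        # clear the victim's page table entry
        swap[victim] = frames.pop(vf)      # write its contents to disk (page-out)
        free_frames.append(vf)
    frame = free_frames.popleft()
    frames[frame] = swap.pop(vpn, 0)       # page-in saved contents, or a fresh zero page
    page_table[vpn] = frame
    resident.append(vpn)

def write(vpn, value):
    if vpn not in page_table:              # missing mapping triggers the fault handler
        handle_fault(vpn)
    frames[page_table[vpn]] = value

def read(vpn):
    if vpn not in page_table:
        handle_fault(vpn)
    return frames[page_table[vpn]]
```

Touching a third page with only two frames forces the oldest page out to `swap`; reading it again transparently pages it back in.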

The swap partition is the disk partition reserved for paged-out virtual memory. Because disk IO is slow, during paging in the operating system predicts additional pages likely to be accessed soon and reads them in together, reducing the number of IO operations.

When an application accesses a certain virtual page and triggers a page fault, the operating system needs to determine whether this virtual page is in an unallocated state or an allocated but unmapped state. The operating system will only perform the mapping to physical pages for allocated virtual pages. In Linux, the virtual address space of a process is divided into multiple Virtual Memory Areas (VMAs), each with a certain range (from the starting address to the ending address of the area). For example, code, stack, and heap correspond to different VMAs. When a page fault is triggered, the operating system can determine whether it has been allocated by checking whether the virtual page is in a certain VMA.

Virtual Memory Functions#

Shared Memory#

Since virtual pages are mapped to physical pages through the process's page table, the operating system can achieve memory sharing between processes A and B by mapping two virtual pages A1 and B1 to the same physical page P. Any write to the shared memory by one process can be read by the other process.
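This is directly observable from user space. The sketch below uses Python's `multiprocessing.shared_memory` (Python 3.8+); for simplicity both "processes" are simulated by two handles in one process, but attaching by name works identically across real processes:

```python
from multiprocessing import shared_memory

# "Process A" creates a named shared segment; the OS backs it with physical
# pages. "Process B" attaches to the same segment by name, so both virtual
# mappings point at the same physical pages.
a = shared_memory.SharedMemory(create=True, size=16)
b = shared_memory.SharedMemory(name=a.name)   # attach by name

a.buf[0] = 42                                 # a write through one mapping...
assert b.buf[0] == 42                         # ...is visible through the other

b.close()
a.close()
a.unlink()                                    # release the segment
```

No copying or message passing happens here: both handles read and write the same physical page, which is what makes shared memory the fastest inter-process communication channel.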

Copy-On-Write#

Shared memory enables the Copy-On-Write (COW) feature. For example, in Linux a process can create a child process using fork. At the moment of creation the memory contents of parent and child are identical, so Linux only copies the page table for the child without copying any physical pages, and marks the shared pages read-only in both page tables. When either process writes to such a virtual page, a page fault is triggered due to insufficient permissions; the operating system then copies the data from the faulting page to a new physical page, points the faulting process's page table entry at the new page, and restores write permission. Beyond fork, multiple processes can also map the same dynamic link library through shared memory, reducing memory usage.
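The fault-and-copy logic can be sketched with reference-counted frames. Everything here is an illustrative model; a real kernel tracks the refcount in per-frame metadata and the permission bits in the page table entries:

```python
# Toy copy-on-write: frames carry a reference count; a write fault on a shared
# frame copies it first. All structures and names are illustrative.
frames = {}           # frame id -> [data, refcount]
next_frame = [0]

def alloc(data):
    fid = next_frame[0]
    next_frame[0] += 1
    frames[fid] = [data, 1]
    return fid

def fork_page(fid):
    """Share the frame between parent and child; both mappings become read-only."""
    frames[fid][1] += 1
    return fid

def write_page(fid, data):
    """On a write fault to a shared frame, copy it; otherwise write in place."""
    if frames[fid][1] > 1:
        frames[fid][1] -= 1
        fid = alloc(data)       # private copy, with write permission restored
    else:
        frames[fid][0] = data   # sole owner: safe to write directly
    return fid

parent = alloc("hello")
child = fork_page(parent)            # fork: share, don't copy
child = write_page(child, "world")   # first write triggers the copy
assert parent != child
assert frames[parent][0] == "hello" and frames[child][0] == "world"
```

Note that the copy only happens on the first write; pages that are never written (such as code) stay shared forever, which is where fork's efficiency comes from.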

(Figure: Shared Memory and Copy-On-Write)

Memory Deduplication#

Based on COW, the operating system can also actively merge multiple physical pages with the same content into one (periodically), and then modify the mappings of all associated page tables to this unique physical page to improve memory utilization efficiency. This feature is called memory deduplication. In Linux, this functionality is implemented as Kernel Same-page Merging. It can be imagined that using this feature may have some impact on performance, and it may also introduce certain security issues: for example, an attacker can enumerate and construct data, then wait for memory deduplication, and confirm whether deduplication has occurred based on access latency (caused by page faults from COW). If deduplication occurs, the attacker can guess that data identical to the current construction exists in the current physical address space. However, the operating system can avoid this issue by deduplicating only between processes of the same user.
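A deduplication scan can be sketched as grouping frames by a content hash and remapping duplicates to one representative. This is a simplified model of the idea; real KSM also write-protects the merged page (COW) and compares candidates over time rather than in one pass:

```python
import hashlib

# Toy same-page merging: group frames by content hash, then remap duplicate
# mappings to one representative frame and free the rest.
def deduplicate(page_table, frames):
    """page_table: vpn -> frame id; frames: frame id -> bytes. Mutates both."""
    by_content = {}
    for vpn, fid in page_table.items():
        digest = hashlib.sha256(frames[fid]).digest()
        if digest in by_content:
            keeper = by_content[digest]
            if fid != keeper and frames[fid] == frames[keeper]:  # guard against hash collision
                page_table[vpn] = keeper       # remap this mapping to the keeper
        else:
            by_content[digest] = fid
    live = set(page_table.values())
    for fid in list(frames):
        if fid not in live:
            del frames[fid]                    # free the now-unreferenced duplicates

frames = {0: b"zeros", 1: b"zeros", 2: b"data"}
pt = {10: 0, 11: 1, 12: 2}
deduplicate(pt, frames)
assert pt[10] == pt[11]            # both mappings now share one frame
assert len(frames) == 2            # one duplicate frame was freed
```

The byte-for-byte comparison after the hash match is essential: merging on a hash alone would corrupt memory on a collision.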

Memory Compression#

The operating system can reduce the use of physical pages by compressing less frequently used memory data. In Linux, there is a zswap area in memory used to store compressed memory data. After memory data is compressed, it is first placed in zswap, and when physical memory resources are insufficient, zswap will batch swap out to disk. This can reduce or even avoid the disk IO caused by immediate paging out, thereby improving memory utilization efficiency.
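The idea can be sketched with a compressed in-memory pool. This toy uses `zlib` as a stand-in for the compressors zswap actually offers (such as lzo or zstd), and a plain dictionary for the pool:

```python
import zlib

# Toy zswap-like pool: instead of writing a cold page straight to disk, keep a
# compressed copy in memory and decompress it on the next access.
compressed_pool = {}   # vpn -> compressed page bytes

def compress_out(vpn, page: bytes):
    compressed_pool[vpn] = zlib.compress(page)

def fault_in(vpn) -> bytes:
    return zlib.decompress(compressed_pool.pop(vpn))

page = b"A" * 4096                       # highly redundant pages compress well
compress_out(7, page)
assert len(compressed_pool[7]) < len(page)
assert fault_in(7) == page               # recovered without any disk IO
```

The win depends on the data: a page of zeros shrinks to a few bytes, while an already-compressed page may not shrink at all, so real implementations fall back to normal swap-out for incompressible pages.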

Huge Pages#

As mentioned earlier, the number of TLB cache entries is limited, and with a 4KB page size the TLB may not cover a large working set, causing frequent misses and poor efficiency. Therefore, many operating systems provide huge page functionality, increasing the page size to 2MB or even 1GB so that each TLB entry covers more memory. In Linux, the transparent huge page mechanism automatically merges contiguous 4KB pages into a 2MB page.

Using huge pages can significantly reduce TLB cache entry occupancy, thereby improving TLB hit rates. However, huge pages may also reduce memory utilization efficiency; for example, a 2MB huge page that only uses 256KB will lead to significant resource waste.
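The trade-off is easy to quantify. The 1GB working set below is an assumed example:

```python
PAGE_4K, PAGE_2M = 4 * 1024, 2 * 1024 * 1024
working_set = 1024 * 1024 * 1024           # an assumed 1 GB working set

# TLB entries needed to cover the whole working set at each page size:
assert working_set // PAGE_4K == 262_144   # far beyond a few-thousand-entry TLB
assert working_set // PAGE_2M == 512       # fits comfortably in the TLB

# The waste case from the text: a 2 MB huge page backing only 256 KB of data.
used = 256 * 1024
assert PAGE_2M - used == 1792 * 1024       # 1.75 MB of the page is wasted
```

So huge pages trade internal fragmentation for TLB reach: a 512x reduction in entries needed, against up to nearly a full huge page of waste per sparsely used mapping.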

References and Citations#
“Modern Operating Systems: Principles and Implementation”
