Virtual Address Spaces
A VSpace represents a virtual address space. It wraps a hardware page table root (PML4 on x86_64, TTBR0 on aarch64) and a Maple tree that tracks which virtual address regions are mapped. VSpaces are observers: they borrow page views from MemoryObjects and own no physical frames themselves.
VSpace Object Layout
A VSpace is allocated from untyped memory as a single contiguous block:
```
[0 .. 4096)                    — PML4 / TTBR0 page table root (one page)
[4096 .. 4096 + TRACKING_SIZE) — VSpaceTracking (embedded metadata)
```
Total allocation: VSPACE_OBJECT_SIZE = PAGE_SIZE + VSPACE_TRACKING_SIZE.
On x86_64, the kernel copies PML4 entries 256-511 (upper half) from the kernel’s page table root into the new VSpace, so the kernel remains mapped in every address space.
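The kernel-half copy can be sketched as follows. This is a minimal illustration, not the kernel's actual code: `init_user_pml4`, `ENTRIES`, and the entry values are all hypothetical names, and the real routine operates on page-table frames rather than arrays.

```rust
// Illustrative sketch: share the kernel half of the PML4 with a new VSpace.
const ENTRIES: usize = 512;

fn init_user_pml4(kernel_pml4: &[u64; ENTRIES]) -> [u64; ENTRIES] {
    let mut pml4 = [0u64; ENTRIES];
    // Entries 256..512 cover the upper (kernel) half of the 48-bit VA space;
    // copying them keeps the kernel mapped in every address space.
    pml4[256..512].copy_from_slice(&kernel_pml4[256..512]);
    pml4
}

fn main() {
    let mut kernel = [0u64; ENTRIES];
    kernel[300] = 0x9f003; // fake kernel-half entry
    let user = init_user_pml4(&kernel);
    assert_eq!(user[300], 0x9f003); // kernel half shared
    assert_eq!(user[0], 0);         // user half starts empty
}
```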
VSpaceTracking
VSpaceTracking is embedded in the VSpace untyped allocation (seL4 pattern: all kernel object metadata lives in untyped memory).
The kernel VSpace uses static storage instead.
```rust
pub struct VSpaceTracking {
    root: PhysAddr,                  // page table root address
    state: AtomicU8,                 // Active | Dying | Dead
    active_count: AtomicU32,         // CPUs with this VSpace loaded
    active_mask: [AtomicU32; (MAX_CPUS + 31) / 32], // bitmap of active CPUs
    ipi_sent: AtomicU8,              // prevent duplicate IPIs
    retire_gen: AtomicU64,           // generation when retired
    retire_snapshot: [AtomicU64; MAX_CPUS], // per-CPU gen snapshot at retire
    retire_online_cpus: AtomicU32,   // online CPUs at retire time
    waiter_head: UnsafeCell<*mut Tcb>, // VSpaceWait queue (sched lock only)
    asid: AtomicU16,                 // ASID (aarch64), 0 = not allocated
    asid_generation: AtomicU64,      // detect stale ASIDs
    mappings: MapleTree<VmArea>,     // VA range → MO region
}
```
The waiter_head field is accessed only from the scheduler module under the scheduler lock with IRQs disabled.
VmArea
Each mapped region in a VSpace is described by a VmArea:
```rust
#[repr(C)]
pub struct VmArea {            // 24 bytes
    pub mo: *mut MemoryObject, // backing MO
    pub mo_offset: u32,        // page offset within MO
    pub page_count: u32,       // pages in this region
    pub perms: u8,             // RWX packed
    pub _pad: [u8; 7],
}
```
VmAreas are stored in a Maple tree keyed by virtual address range.
The Maple tree nodes are allocated from the PMM with FrameOwner::KernelPrivate { MapleNode }.
VmAreas are purely descriptive — they do not own frames. They record which MO pages are visible at which virtual addresses with which permissions.
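The descriptive role of a VmArea can be shown with a small translation helper. This is a sketch only: `mo_page_index`, `va_start`, and the simplified struct are illustrative, assuming the field layout above and 4 KiB pages.

```rust
// Illustrative only: given a VmArea and a faulting VA, recover the MO page
// index the fault handler must commit. `va_start` stands in for the key the
// Maple tree returned for this area.
const PAGE_SIZE: u64 = 4096;

struct VmArea {
    mo_offset: u32,  // page offset within the backing MO
    page_count: u32, // pages in this region
    perms: u8,       // RWX packed
}

fn mo_page_index(area: &VmArea, va_start: u64, fault_va: u64) -> Option<u64> {
    let page = (fault_va - va_start) / PAGE_SIZE;
    if page < area.page_count as u64 {
        Some(area.mo_offset as u64 + page)
    } else {
        None // fault lies outside this VmArea
    }
}

fn main() {
    let area = VmArea { mo_offset: 16, page_count: 4, perms: 0b110 };
    // Page 2 of the region maps MO page 16 + 2 = 18.
    assert_eq!(mo_page_index(&area, 0x4000_0000, 0x4000_2010), Some(18));
    // Page 5 is past the 4-page region.
    assert_eq!(mo_page_index(&area, 0x4000_0000, 0x4000_5000), None);
}
```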
Page Tables
x86_64
4-level page tables: PML4 → PDPT → PD → PT. 512 entries per level (9 bits each), covering 48-bit virtual addresses.
- PML4 entries 0-255: user pages.
- PML4 entries 256-511: kernel pages (shared from the kernel VSpace).
aarch64
4-level page tables under TTBR0 (user) and TTBR1 (kernel). Separate translation registers eliminate the need for PML4-level sharing. ASID (Address Space Identifier) in TTBR0 bits 63:48 avoids full TLB flushes on switch.
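The ASID placement can be sketched with two helpers. These are illustrative names, assuming only the bit layout stated above (ASID in TTBR0_EL1 bits 63:48, table address in the lower bits).

```rust
// Sketch of packing/unpacking an ASID in a TTBR0_EL1 value (bits 63:48).
fn ttbr0_with_asid(table_root: u64, asid: u16) -> u64 {
    (table_root & 0x0000_FFFF_FFFF_FFFF) | ((asid as u64) << 48)
}

fn asid_of(ttbr0: u64) -> u16 {
    (ttbr0 >> 48) as u16
}

fn main() {
    let ttbr0 = ttbr0_with_asid(0x8_0000, 42);
    assert_eq!(asid_of(ttbr0), 42);
    // The table root survives untouched in the low bits.
    assert_eq!(ttbr0 & 0x0000_FFFF_FFFF_FFFF, 0x8_0000);
}
```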
Page Table Entry Flags
| Bit | Meaning |
|---|---|
| 0 | Entry is valid and backed by a physical frame. |
| 1 | Page is writable. |
| 2 | Accessible from user mode (EL0). |
| 3 | Write-through caching policy. |
| 4 | Caching disabled (used for device MMIO). |
| 9 | Copy-on-write marker. OS-available bit, used by the COW fast-path. |
| 10 | Demand page marker. OS-available bit, meaningful when the entry is not present. |
| 63 | Execute disable (NX on x86_64, UXN/PXN on aarch64). |
Physical address bits: [47:12] masked by ENTRY_ADDR_MASK = 0x000F_FFFF_FFFF_F000.
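A PTE built from these bit assignments can be sketched as follows. `ENTRY_ADDR_MASK` is taken from the text; the constant and helper names are otherwise illustrative.

```rust
// Sketch of encoding a PTE from the bit table above.
const ENTRY_ADDR_MASK: u64 = 0x000F_FFFF_FFFF_F000;
const PTE_PRESENT: u64 = 1 << 0;
const PTE_WRITABLE: u64 = 1 << 1;
const PTE_USER: u64 = 1 << 2;
const PTE_COW: u64 = 1 << 9;
const PTE_DEMAND: u64 = 1 << 10;
const PTE_NX: u64 = 1 << 63;

fn make_pte(phys: u64, flags: u64) -> u64 {
    (phys & ENTRY_ADDR_MASK) | flags
}

fn main() {
    // A demand PTE: not present, demand bit set, no frame yet.
    let demand = make_pte(0, PTE_DEMAND);
    assert_eq!(demand & PTE_PRESENT, 0);
    assert_ne!(demand & PTE_DEMAND, 0);

    // A committed, COW-marked read-only page.
    let cow = make_pte(0x6000, PTE_PRESENT | PTE_COW);
    assert_ne!(cow & PTE_COW, 0);

    // A committed user page, read/write, no-execute.
    let pte = make_pte(0x1234_5000, PTE_PRESENT | PTE_WRITABLE | PTE_USER | PTE_NX);
    assert_eq!(pte & ENTRY_ADDR_MASK, 0x1234_5000);
}
```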
Mapping and Unmapping
VSPACE_MAP_MO (invoke label 0x97)
Map a range of MO pages into the VSpace at a specified virtual address:
1. Validate alignment, address range, and permissions.
2. Insert a `VmArea` into the Maple tree.
3. Add a `ReverseMapEntry` on the backing MO.
4. For each page in the range:
   - If the page is committed in the MO’s radix tree: install a PTE with the physical address and permissions.
   - If uncommitted: install a PTE with `PRESENT = 0`, `DEMAND = 1` (demand marker for lazy commit on first access).
5. If the range is ≥ `RANGE_TLB_GLOBAL_THRESHOLD` (8 pages), use a full TLB flush instead of per-page invalidation.
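The per-page decision in the mapping loop can be sketched as follows. `pte_for_page` and `committed_phys` are illustrative; the real kernel consults the MO's radix tree rather than an `Option`.

```rust
// Sketch of the committed-vs-demand PTE choice from the map loop above.
const PTE_PRESENT: u64 = 1 << 0;
const PTE_DEMAND: u64 = 1 << 10;

fn pte_for_page(committed_phys: Option<u64>, perm_bits: u64) -> u64 {
    match committed_phys {
        // Committed in the MO: install a real translation with permissions.
        Some(phys) => phys | PTE_PRESENT | perm_bits,
        // Uncommitted: demand marker only, resolved on first touch.
        None => PTE_DEMAND,
    }
}

fn main() {
    assert_eq!(pte_for_page(None, 0b110), PTE_DEMAND);
    let pte = pte_for_page(Some(0x3000), 0b10);
    assert_eq!(pte & PTE_PRESENT, PTE_PRESENT);
    assert_eq!(pte & 0xFFFF_F000, 0x3000);
}
```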
VSPACE_UNMAP (invoke label 0x51) for MO-backed ranges
MO-backed ranges are torn down with the generic VSPACE_UNMAP invocation — there is no dedicated VSPACE_UNMAP_MO label.
1. Look up the `VmArea` in the Maple tree by virtual address.
2. Clear the corresponding PTEs.
3. Remove the `ReverseMapEntry` from the MO.
4. Remove the `VmArea` from the Maple tree.
5. Issue TLB invalidation (per-page or full flush based on page count).
VSPACE_MAP_DEMAND / VSPACE_MAP_DEMAND_RANGE (invoke labels 0x59 / 0x5A)
Install demand PTEs (PRESENT = 0, DEMAND = 1) for a single page or a contiguous range, while still registering the VmArea / reverse map metadata. The range variant also carries a VSPACE_FLAG_REGION_KIND byte so that VSPACE_GET_MEM_STATS can classify the range as VmStk / VmExe / VmLib / VmData without an extra map-time round trip.
The first touch of a demand PTE raises a page fault which is resolved by handle_demand_fault() (see Page Fault Handling) — either zero-filling an anonymous frame or committing the backing MO at the faulted page index.
Per-VSpace counter vm_demand_pages: AtomicU64 tracks live demand pages; VSPACE_GET_MEM_STATS surfaces the value as VmDemand.
mmsrv-Coordinated Fault Servicing
For policies that live outside the kernel fast path (user-mode pagers, hot COW frame pools) the VSpace capability exposes a cluster of invocations used by mmsrv:
| Label | Name | Purpose |
|---|---|---|
| | | Inject a freshly-allocated frame into a faulting COW PTE on behalf of a user-mode pager. This is an always-consume donation: the Frame cap is revoked and the frame retagged to |
| | | Accept Frame capabilities and immediately revoke each one, retagging the backing frame to |
| | | Bind a notification raised when the COW pool drops below its refill watermark. |
| | | Accept Frame capabilities and immediately revoke each one, retagging the backing frame to |
| | | Bulk version of |
VSPACE_SHARE_RO_PAGE (invoke label 0x99)
Share a single read-only page from one MO into another VSpace. Used for shared library text pages.
VSPACE_FORK_RANGE (invoke label 0x9A)
Bulk COW fork for the process manager’s fork path:
For each mapped `VmArea` in the source VSpace:

1. Clone the backing MO (`MO_CLONE`, creating a `CowChild`).
2. Map the cloned MO into the child VSpace at the same virtual address.
3. Mark both source and child PTEs as read-only with the `COW` bit set.

This is a single invoke that replaces many individual `MO_CLONE` + `MAP_MO` operations.
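The PTE transformation applied to both sides of the fork can be sketched with one helper. `make_cow` is an illustrative name; the bit positions follow the flag table above.

```rust
// Sketch of the fork-time PTE rewrite: drop write permission, set the COW
// marker, leave the present bit and physical address intact.
const PTE_WRITABLE: u64 = 1 << 1;
const PTE_COW: u64 = 1 << 9;

fn make_cow(pte: u64) -> u64 {
    (pte & !PTE_WRITABLE) | PTE_COW
}

fn main() {
    let pte = 0x2000 | 0b11; // present + writable, frame at 0x2000
    let cow = make_cow(pte);
    assert_eq!(cow & PTE_WRITABLE, 0); // first write now faults
    assert_ne!(cow & PTE_COW, 0);      // fault handler sees the COW marker
    assert_eq!(cow & 1, 1);            // still present and readable
}
```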
CowPool Structure
Each VSpace optionally holds a ring-buffer of pre-donated physical frames used to satisfy COW faults without calling into the PMM allocator on the fast path:
```rust
pub struct CowPool {
    entries: [CowPoolEntry; 510],
    head: u16, // consumer index (wraps at 510)
    tail: u16, // producer index (wraps at 510)
}

pub struct CowPoolEntry {
    phys: PhysAddr, // pre-donated frame (tagged KernelPrivate{CowPool})
}
```
The ring buffer holds up to 510 slots. head == tail indicates an empty pool. The pool is filled via VSPACE_SET_COW_POOL / VSPACE_REPLENISH_COW_POOL and drained by handle_cow_fault_pooled.
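The ring discipline can be sketched in a single-threaded form. This is illustrative only: the real pool stores `PhysAddr` entries in kernel memory and is filled/drained by the invocations above; `u64` and `Vec` stand in here, and one slot is sacrificed so `head == tail` unambiguously means empty.

```rust
// Minimal sketch of the CowPool ring: 510 slots, indices wrap at 510,
// head == tail means empty.
const SLOTS: u16 = 510;

struct CowPool {
    entries: Vec<u64>, // stand-in for [CowPoolEntry; 510]
    head: u16,         // consumer index
    tail: u16,         // producer index
}

impl CowPool {
    fn new() -> Self {
        CowPool { entries: vec![0; SLOTS as usize], head: 0, tail: 0 }
    }

    // Producer side (replenish path): donate one pre-tagged frame.
    fn push(&mut self, phys: u64) -> bool {
        let next = (self.tail + 1) % SLOTS;
        if next == self.head {
            return false; // full
        }
        self.entries[self.tail as usize] = phys;
        self.tail = next;
        true
    }

    // Consumer side (COW fault fast path): take a frame without the PMM.
    fn pop(&mut self) -> Option<u64> {
        if self.head == self.tail {
            return None; // empty: fall back to the allocator
        }
        let phys = self.entries[self.head as usize];
        self.head = (self.head + 1) % SLOTS;
        Some(phys)
    }
}

fn main() {
    let mut pool = CowPool::new();
    assert_eq!(pool.pop(), None);
    assert!(pool.push(0x5000));
    assert_eq!(pool.pop(), Some(0x5000));
    assert_eq!(pool.pop(), None);
}
```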
cow_install_atomic (VSpace Method)
cow_install_atomic is a three-phase atomic protocol that installs a copied frame as a write-enabled PTE inside the faulting VSpace:
1. Reserve — acquire `MO.commit_lock` and write `PHYS_TAG_BUSY` into the MO radix slot. If the slot is already occupied (committed or BUSY by another CPU), return `RaceLost` immediately without modifying the PTE or frame ownership.
2. Publish — write the new PTE with the new physical address and write permission, issue a TLB invalidation, and call `pmm_retain_mapping`.
3. Finalize — replace the `PHYS_TAG_BUSY` entry in the radix with the real physical address, then call `pmm_set_owner(MoData)` as the commit point.
RaceLost semantics: the frame is freed back to the PMM (with its current tag, either KernelPrivate{General} or KernelPrivate{CowPool}). The PTE is never modified. The fault handler decides how to proceed based on which path called cow_install_atomic — see Page Fault Handling for the handle_cow_fault / handle_cow_fault_pooled asymmetry.
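The Reserve phase's race outcome can be sketched as follows. Note the compression: the text describes holding `MO.commit_lock` while checking the slot; this sketch folds the occupancy check into a single compare-and-swap, and `PHYS_TAG_BUSY`'s value, the slot encoding (0 = uncommitted), and the orderings are all illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the Reserve phase: claim the MO radix slot with a BUSY tag,
// or report RaceLost without touching the PTE or frame ownership.
const PHYS_TAG_BUSY: u64 = u64::MAX; // hypothetical tag value

#[derive(Debug, PartialEq)]
enum Reserve {
    Won,
    RaceLost,
}

fn reserve_slot(slot: &AtomicU64) -> Reserve {
    // Only an empty slot (0 here) may be claimed; a committed address or
    // another CPU's BUSY tag means we lost the race.
    match slot.compare_exchange(0, PHYS_TAG_BUSY, Ordering::AcqRel, Ordering::Acquire) {
        Ok(_) => Reserve::Won,
        Err(_) => Reserve::RaceLost,
    }
}

fn main() {
    let slot = AtomicU64::new(0);
    assert_eq!(reserve_slot(&slot), Reserve::Won);
    // A second claimant finds the BUSY tag and backs off.
    assert_eq!(reserve_slot(&slot), Reserve::RaceLost);
}
```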
TLB Shootdown Protocol
When a PTE is modified on an SMP system, other CPUs may have stale cached translations.
Single-Page Invalidation
1. The modifying CPU writes the target virtual address via `set_tlb_shootdown_addr(va)`.
2. Send `IpiKind::TlbShootdown` to all CPUs that have this VSpace active (checked via `active_mask`).
3. Each receiving CPU:
   - Reads the shootdown address.
   - Executes `invlpg` (x86_64) or `TLBI VAE1IS` (aarch64) for that address.
   - Increments its quiescent generation counter.
Range Invalidation
For operations touching ≥ 8 pages, the kernel uses a full TLB flush:
* x86_64: reload CR3 (flushes all non-global TLB entries).
* aarch64: TLBI VMALLE1IS (flush all EL1 entries) + DSB ISH + ISB.
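The per-page versus full-flush decision can be written as a small function. `Flush` and `flush_strategy` are illustrative; the 8-page threshold is the `RANGE_TLB_GLOBAL_THRESHOLD` from the text.

```rust
// Sketch of the flush-strategy decision for a range operation.
const RANGE_TLB_GLOBAL_THRESHOLD: usize = 8;

#[derive(Debug, PartialEq)]
enum Flush {
    PerPage(usize), // invlpg / TLBI VAE1IS per page
    Full,           // CR3 reload / TLBI VMALLE1IS
}

fn flush_strategy(page_count: usize) -> Flush {
    if page_count >= RANGE_TLB_GLOBAL_THRESHOLD {
        Flush::Full
    } else {
        Flush::PerPage(page_count)
    }
}

fn main() {
    assert_eq!(flush_strategy(3), Flush::PerPage(3));
    assert_eq!(flush_strategy(8), Flush::Full);
}
```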
Cross-CPU Synchronization
The active_count and active_mask on VSpaceTracking track which CPUs have this VSpace loaded.
On context switch:
1. The outgoing VSpace’s `active_count` is decremented and its bit in `active_mask` is cleared.
2. The incoming VSpace’s `active_count` is incremented and its bit is set.
TLB shootdown IPIs are only sent to CPUs in the active_mask, avoiding unnecessary interrupts to CPUs running other address spaces.
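The `active_mask` bitmap operations can be sketched directly with standard atomics. `MAX_CPUS = 64` and the helper names are illustrative; the word layout matches the `[AtomicU32; (MAX_CPUS+31)/32]` field above.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Sketch of active_mask maintenance: one bit per CPU, one AtomicU32 word
// per 32 CPUs, updated on context switch and scanned before sending IPIs.
const MAX_CPUS: usize = 64;
const WORDS: usize = (MAX_CPUS + 31) / 32;

fn set_active(mask: &[AtomicU32; WORDS], cpu: usize) {
    mask[cpu / 32].fetch_or(1 << (cpu % 32), Ordering::AcqRel);
}

fn clear_active(mask: &[AtomicU32; WORDS], cpu: usize) {
    mask[cpu / 32].fetch_and(!(1 << (cpu % 32)), Ordering::AcqRel);
}

fn is_active(mask: &[AtomicU32; WORDS], cpu: usize) -> bool {
    mask[cpu / 32].load(Ordering::Acquire) & (1 << (cpu % 32)) != 0
}

fn main() {
    let mask = [AtomicU32::new(0), AtomicU32::new(0)];
    set_active(&mask, 33); // context switch in on CPU 33
    assert!(is_active(&mask, 33)); // shootdown IPI would target CPU 33
    assert!(!is_active(&mask, 1)); // CPU 1 is skipped
    clear_active(&mask, 33); // context switch away
    assert!(!is_active(&mask, 33));
}
```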
Deferred Free and Quiescent Generation
Page table intermediate pages cannot be freed immediately when a VSpace is destroyed, because another CPU might be mid-walk through that page table.
Mechanism
1. When a VSpace is destroyed, its `VSpaceTracking` is moved to the global `DEFERRED_FREE_LIST` (protected by `DEFERRED_FREE_LOCK`).
2. A `retire_gen` is assigned from `GLOBAL_RETIRE_GEN` (monotonically increasing).
3. A snapshot of each CPU’s `PENDING_GENERATION` is saved in `retire_snapshot[cpu]`.
Quiescent Condition
A retired VSpaceTracking can be freed when:
```
for all cpu in 0..retire_online_cpus:
    PENDING_GENERATION[cpu] > retire_snapshot[cpu]
```
This means every CPU has processed at least one scheduler tick (which calls advance_quiescent_gen()) since the VSpace was retired.
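The condition above translates to a one-line check. The function name and slice-based signature are illustrative; the real kernel reads atomics on `VSpaceTracking`.

```rust
// Sketch of the quiescent check: every CPU that was online at retire time
// must have advanced its generation strictly past the retire snapshot.
fn is_quiescent(pending_gen: &[u64], retire_snapshot: &[u64], retire_online_cpus: usize) -> bool {
    (0..retire_online_cpus).all(|cpu| pending_gen[cpu] > retire_snapshot[cpu])
}

fn main() {
    let snapshot = [5, 7, 3];
    // CPU 1 is still at generation 7, equal to its snapshot: not yet safe.
    assert!(!is_quiescent(&[6, 7, 4], &snapshot, 3));
    // All three CPUs have ticked past their snapshots: safe to free.
    assert!(is_quiescent(&[6, 8, 4], &snapshot, 3));
}
```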
Processing
process_deferred_free() runs from the idle thread on each CPU:
1. Acquire `DEFERRED_FREE_LOCK`.
2. For each entry in `DEFERRED_FREE_LIST`:
   - Check the quiescent condition.
   - If met: call `free_pdpt_recursive` / `free_pd_recursive`, which call `pmm_release_mapping` (drop the self-map reference) then `pmm_free(KernelPrivate{PageTable})` on each PDPT, PD, and PT frame. The PML4 / aarch64 L0 root are owned by the parent untyped carve and are intentionally skipped. Remove the entry from the list.
   - If not met: leave the entry in the list for the next check.
3. Release `DEFERRED_FREE_LOCK`.
The idle thread must periodically re-enter this processing. Without it, VSpace teardown paths that wait for cross-CPU quiescence hang indefinitely.
VSpace Lifecycle
When a VSpace transitions to Dying:
1. No new mappings are accepted.
2. An IPI is sent to all CPUs with this VSpace active, requesting them to switch away.
3. Threads blocked on `VSpaceWait` are woken when the last CPU deactivates.
4. `VSpace::cleanup` drains un-consumed CowPool entries via `pmm_free(KernelPrivate{CowPool})` before page-table teardown.
5. Page table pages are added to the deferred free list.
ASID Management (aarch64)
Each VSpace is assigned an Address Space Identifier (ASID) to avoid full TLB flushes on context switch.
- ASID pool: protected by `ASID_LOCK` (tier 6 in the lock hierarchy).
- ASID width: 16 bits (65,536 ASIDs).
- ASID is stored in `TTBR0_EL1` bits 63:48.
- Generation tracking: if the global ASID generation has advanced past the VSpace’s `asid_generation`, the ASID is stale and must be re-allocated.
On x86_64, ASID is not used — CR3 writes implicitly flush the TLB (PCID support is not yet implemented).
Memory Statistics
Two VSpace capability invocations expose per-address-space accounting for procfs-style tooling:
| Label | Name | Purpose |
|---|---|---|
| | | Write a |
| | | Write a |
The kernel maintains the counters directly on the VSpaceTracking:
- `VmSize` tracks every page covered by a VmArea (committed or demand).
- `VmRSS` counts only committed pages (those that hold a frame reference).
- `VmLck` reflects pages pinned through MO flags (`MO_PAGE_FLAG_PINNED`).
- `VmPTE` counts kernel-private page table pages allocated for this VSpace.
- `VmDemand` is the live `vm_demand_pages` counter updated by the demand map / fault paths.
- The region-kind totals are populated from the `region_kind` byte recorded at map time (`VSPACE_FLAG_REGION_KIND_*`).
Because the counters are atomic, callers get a consistent snapshot without taking the VSpace lock.
Related Pages
- Physical Memory — frame allocation for page table pages
- Memory Objects — MO pages mapped by VSpace
- Page Fault Handling — COW, demand, and slow-path fault resolution
- Untyped Memory — VSpace retype from untyped
- Threads — VSpace associated with each TCB
- Architecture — x86_64/aarch64 paging differences