Memory Objects
A MemoryObject (MO) is a page-granular memory abstraction that manages physical pages independently of any address space. VSpaces map MO pages into hardware page tables; the MO tracks which VSpaces observe which pages via reverse maps. MOs support copy-on-write cloning for efficient fork.
MemoryObject Structure
#[repr(C)]
pub struct MemoryObject {
pub header: KernelObject, // obj_type = MemoryObject (11)
pub kind: MoKind, // Anon | CowChild | FileBacked | Shm
pub page_count: u32, // total pages in this MO
pub pages: RadixTree, // 4-level radix tree: page_idx → PhysAddr
pub reverse_maps: ReverseMaps, // VSpace observer tracking
pub cow_parent: u64, // parent MO cap slot (0 = none)
pub first_child: *mut MemoryObject, // head of COW child list
pub next_sibling: *mut MemoryObject, // sibling in parent's child list
pub untyped_phys: u64, // backing untyped physical address
pub commit_lock: SpinLock, // protects radix tree mutations
}
The struct size is computed by MemoryObject::required_bytes() and rounded up to 64-byte cache-line alignment.
Radix tree internal nodes are allocated separately from the PMM, not from the MO’s untyped allocation.
MO Kinds
| Kind | Description |
|---|---|
| `Anon` | Anonymous memory. The standard kind for heap, stack, and general-purpose allocations. |
| `CowChild` | Copy-on-write child. Shares pages with a parent MO until a write fault triggers a private copy. |
| `FileBacked` | File-backed pages (reserved, not yet implemented). Will integrate with the page cache. |
| `Shm` | Shared memory. Multiple VSpaces map the same MO without COW semantics. |
Page Storage: 4-Level Radix Tree
MO pages are stored in a radix tree with the same structural design as hardware page tables.
Node Structure
Each node is one physical page (4,096 bytes) containing 512 entries of 8 bytes each. Each level consumes 9 bits of the page index.
| Entries per node | 512 (4096 / 8) |
|---|---|
| Bits per level | 9 |
| Maximum levels | 4 |
| Maximum capacity | 512⁴ ≈ 68.7 billion entries (256 TB at 4 KB pages) |
Adaptive Depth
The tree depth adapts to the maximum page index used:
| Max index | Levels | Capacity |
|---|---|---|
| ≤ 512 | 1 | 2 MB |
| ≤ 262,144 | 2 | 1 GB |
| ≤ 134,217,728 | 3 | 512 GB |
| > 134,217,728 | 4 | 256 TB |
When an insertion requires a higher index than the current depth supports, the tree grows by inserting new root nodes above the current root. The old root becomes entry 0 of the new root.
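As a sketch of the depth rule (an illustrative helper, not the kernel's code), the tier boundaries fall directly out of the 9-bits-per-level arithmetic:

```rust
// Illustrative sketch: how many radix levels a maximum page index needs
// at 9 bits per level. `levels_for_index` is a hypothetical helper name.
const BITS_PER_LEVEL: u32 = 9;
const MAX_LEVELS: u32 = 4;

/// Smallest depth whose capacity (512^levels entries) covers `max_index`.
fn levels_for_index(max_index: u64) -> u32 {
    let mut levels = 1;
    // One level covers 512 entries, two cover 262,144, and so on.
    while levels < MAX_LEVELS && max_index >= 1u64 << (BITS_PER_LEVEL * levels) {
        levels += 1;
    }
    levels
}

fn main() {
    assert_eq!(levels_for_index(0), 1);         // single node, 2 MB reach
    assert_eq!(levels_for_index(511), 1);       // last index of a 1-level tree
    assert_eq!(levels_for_index(100_000), 2);   // within the 1 GB tier
    assert_eq!(levels_for_index(1_000_000), 3); // within the 512 GB tier
    assert_eq!(levels_for_index(1 << 30), 4);   // needs the full 4 levels
}
```

Growth (adding a root above the old root) only ever increments the result of this calculation by one level at a time.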
Leaf Entries
Leaf entries store physical addresses with tag bits in the low 12 bits (always zero for page-aligned addresses):
| Bit | Name | Meaning |
|---|---|---|
| 0 | `PHYS_TAG_UNTYPED` | Page was allocated from an untyped source (not PMM). |
| 1 | `PHYS_TAG_BUSY` | Commit in-progress sentinel. Another thread reserved this slot and is allocating a frame. |
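The encoding can be sketched as follows. The constant names mirror those used elsewhere on this page; the 12-bit mask width comes from the text above, and the helper names are illustrative:

```rust
// Sketch of the leaf-entry encoding: physical address with tag bits in
// the low 12 bits, which are always zero for a page-aligned address.
const PHYS_TAG_UNTYPED: u64 = 1 << 0; // frame came from an untyped source
const PHYS_TAG_BUSY: u64 = 1 << 1;    // commit in progress; slot reserved
const PHYS_TAG_MASK: u64 = 0xFFF;     // all tag bits (low 12 bits)

fn encode(phys: u64, from_untyped: bool) -> u64 {
    debug_assert_eq!(phys & PHYS_TAG_MASK, 0, "must be page-aligned");
    phys | if from_untyped { PHYS_TAG_UNTYPED } else { 0 }
}

/// A leaf holds a committed page only if it is non-zero and not BUSY.
fn is_committed(entry: u64) -> bool {
    entry != 0 && entry & PHYS_TAG_BUSY == 0
}

fn phys_addr(entry: u64) -> u64 {
    entry & !PHYS_TAG_MASK
}

fn main() {
    let e = encode(0x1234_000, true);
    assert!(is_committed(e));
    assert_eq!(phys_addr(e), 0x1234_000);
    assert!(!is_committed(PHYS_TAG_BUSY)); // reserved, not yet committed
    assert!(!is_committed(0));             // uncommitted
}
```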
Page Commit and Decommit
Commit (MO_COMMIT, label 0x90)
Allocates physical pages for a range of page indices. The dual-source allocation and BUSY sentinel are key to the design.
For each page in the requested range:

1. Acquire `commit_lock`.
2. Read the radix tree leaf entry. If it is already committed (non-zero, non-BUSY), skip this page. If it is BUSY, another thread is committing this page; skip it.
3. Write the `PHYS_TAG_BUSY` sentinel into the leaf to reserve the slot.
4. Release `commit_lock`.
5. Allocate a frame from the dual source: first try the MO's untyped source (if configured); if untyped allocation fails, fall back to the PMM.
6. Reacquire `commit_lock`.
7. Write the physical address (with `PHYS_TAG_UNTYPED` if from untyped) into the radix tree leaf, replacing the BUSY sentinel.
8. Set `FrameOwner::MoData { mo, page_idx }` on the allocated frame via `pmm_transfer`.
9. Release `commit_lock`.

The BUSY sentinel prevents two threads from committing the same page concurrently.
The lock is released during frame allocation (step 5) to avoid holding `commit_lock` across potentially slow untyped operations.
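The per-leaf state transitions above (0 → BUSY → committed) can be modeled single-threaded. `commit_page` and its allocator closures are illustrative stand-ins, with the locking elided:

```rust
// Hypothetical single-threaded model of the commit transitions for one
// radix leaf. Locking is elided; the comments mark where it would sit.
const PHYS_TAG_UNTYPED: u64 = 1 << 0;
const PHYS_TAG_BUSY: u64 = 1 << 1;

#[derive(Debug, PartialEq)]
enum Commit { Done(u64), AlreadyCommitted, BusyElsewhere }

fn commit_page(
    leaf: &mut u64,
    alloc_untyped: impl Fn() -> Option<u64>, // tried first (if configured)
    alloc_pmm: impl Fn() -> Option<u64>,     // fallback source
) -> Option<Commit> {
    match *leaf {
        0 => {} // uncommitted: proceed
        PHYS_TAG_BUSY => return Some(Commit::BusyElsewhere),
        _ => return Some(Commit::AlreadyCommitted),
    }
    *leaf = PHYS_TAG_BUSY; // reserve the slot (commit_lock would be held here)
    // ...commit_lock dropped while allocating...
    let entry = match alloc_untyped() {
        Some(phys) => phys | PHYS_TAG_UNTYPED,
        None => match alloc_pmm() {
            Some(phys) => phys,
            None => {
                *leaf = 0; // both sources exhausted: roll back the sentinel
                return None;
            }
        },
    };
    *leaf = entry; // replace the sentinel (commit_lock reacquired)
    Some(Commit::Done(entry))
}

fn main() {
    let mut leaf = 0u64;
    // Untyped exhausted, PMM fallback; no PHYS_TAG_UNTYPED tag set.
    assert_eq!(commit_page(&mut leaf, || None, || Some(0x7000)),
               Some(Commit::Done(0x7000)));
    // Committing the same page again is a no-op.
    assert_eq!(commit_page(&mut leaf, || None, || Some(0x8000)),
               Some(Commit::AlreadyCommitted));
}
```

Note how the exhaustion path restores the leaf to 0, which is exactly the rollback described under "Commit Exhaustion" below.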
Decommit (MO_DECOMMIT, label 0x91)
Releases physical pages back to their source.
For each committed page in the range:

1. Read the physical address from the radix tree.
2. Iterate the reverse map and unmap the page from all VSpaces that reference it.
3. Free the frame to its source: the untyped allocation (if `PHYS_TAG_UNTYPED` is set) or the PMM.
4. Clear the radix tree leaf entry.
Reverse Mappings
The reverse map tracks which VSpaces have mapped regions of this MO. It enables the kernel to find and update all page table entries that reference an MO page.
Structure
pub struct ReverseMapEntry { // 32 bytes
pub vspace: *mut VSpace, // observing VSpace
pub va_start: u64, // virtual address of mapping start
pub page_count: u32, // pages in this mapping
pub mo_offset: u32, // starting page index within the MO
pub perms: u8, // mapping permissions
pub _pad: [u8; 7],
}
Storage Layout
Each MO has 8 inline reverse map slots (`RMAP_INLINE`).
If those are exhausted, overflow pages are allocated from the PMM with `FrameOwner::MoMeta { subkind: Rmap }`:

- Each overflow page holds `(4096 - 8) / 32 = 127` entries; overflow pages are linked as a singly-linked list.
- The 8-byte per-page overhead is the `next` pointer.
For the common case (MO mapped into 1-2 VSpaces), no overflow allocation is needed.
Operations
- `add(entry)`: Insert into the first empty inline slot, or the first empty slot in an overflow page.
- `remove(vspace, va_start)`: Scan inline slots and overflow pages for a matching entry and clear it.
- `for_each(callback)`: Iterate all non-empty entries (inline + overflow). Used during decommit and destruction.
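The inline-plus-overflow layout and the `add`/`remove` scans can be modeled in safe Rust. The types here are illustrative; the kernel uses raw PMM pages and an intrusive `next` pointer rather than a `Vec`:

```rust
// Illustrative model of the reverse-map storage: 8 inline slots, then
// 127-entry overflow "pages" (a Vec stands in for the linked list).
const RMAP_INLINE: usize = 8;
const OVERFLOW_PER_PAGE: usize = (4096 - 8) / 32; // 127; 8 bytes = next ptr

#[derive(Clone, Copy, PartialEq, Debug)]
struct RmapEntry { vspace: usize, va_start: u64 }

struct ReverseMaps {
    inline: [Option<RmapEntry>; RMAP_INLINE],
    overflow: Vec<[Option<RmapEntry>; OVERFLOW_PER_PAGE]>,
}

impl ReverseMaps {
    fn new() -> Self {
        ReverseMaps { inline: [None; RMAP_INLINE], overflow: Vec::new() }
    }

    /// First empty inline slot, else first empty overflow slot,
    /// else a freshly allocated overflow page.
    fn add(&mut self, e: RmapEntry) {
        for slot in self.inline.iter_mut().chain(self.overflow.iter_mut().flatten()) {
            if slot.is_none() { *slot = Some(e); return; }
        }
        let mut page = [None; OVERFLOW_PER_PAGE];
        page[0] = Some(e);
        self.overflow.push(page);
    }

    /// Scan inline and overflow slots for a matching entry and clear it.
    fn remove(&mut self, vspace: usize, va_start: u64) -> bool {
        for slot in self.inline.iter_mut().chain(self.overflow.iter_mut().flatten()) {
            if *slot == Some(RmapEntry { vspace, va_start }) {
                *slot = None;
                return true;
            }
        }
        false
    }
}

fn main() {
    let mut rm = ReverseMaps::new();
    for i in 0..10 {
        rm.add(RmapEntry { vspace: i, va_start: 0x1000 * i as u64 });
    }
    // 8 inline slots fill first; one overflow page holds the remaining 2.
    assert_eq!(rm.overflow.len(), 1);
    assert!(rm.remove(9, 0x9000));
    assert_eq!(OVERFLOW_PER_PAGE, 127);
}
```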
COW Cloning
MO_CLONE (invoke label 0x93) creates a copy-on-write child MO.
Clone Operation
1. Create a new `MemoryObject` with `kind = CowChild`.
2. Set `cow_parent` to the parent MO's capability slot.
3. Link the child into the parent's `first_child`/`next_sibling` list.
4. Increment the parent MO's reference count.
5. The child's radix tree starts empty — all pages are shared with the parent.
6. Mark both parent and child VSpace mappings as read-only to trap write faults.
Page Resolution with COW Chain
When resolving a page in a CowChild:
1. Check the child's own radix tree. If a local entry exists (non-zero, non-BUSY), return it at depth 0.
2. Walk the `cow_parent` chain, checking each ancestor's radix tree.
3. Return the first matching entry with the chain depth.
pub fn resolve_page_depth(&self, index: usize) -> Option<(u64, usize)> {
    // Check own radix tree first — radix reads and cow_parent reads
    // must both happen under commit_lock.
    self.commit_lock.lock();
    let entry = self.pages.get(index);
    let mut parent_slot = self.cow_parent;
    self.commit_lock.unlock();
    if entry != 0 && (entry & PHYS_TAG_BUSY) == 0 {
        return Some((entry & !PHYS_TAG_MASK, 0));
    }
    // Walk the parent chain, checking each ancestor's radix tree.
    let mut depth = 0;
    while parent_slot != 0 {
        let parent = deref_cow_parent(parent_slot);
        depth += 1;
        parent.commit_lock.lock();
        let entry = parent.pages.get(index);
        parent_slot = parent.cow_parent;
        parent.commit_lock.unlock();
        if entry != 0 && (entry & PHYS_TAG_BUSY) == 0 {
            return Some((entry & !PHYS_TAG_MASK, depth));
        }
    }
    None
}
COW Flatten Threshold
If depth > COW_FLATTEN_THRESHOLD (8), the kernel proactively copies the page into the child’s own radix tree.
This bounds the worst-case chain walk length and prevents deep fork chains from degrading read performance.
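The chain walk plus the flatten step can be shown with a toy model, where `Mo` is an illustrative stand-in (own pages plus an optional parent) rather than the kernel structure:

```rust
use std::collections::HashMap;

// Toy model of COW-chain page resolution with proactive flattening.
const COW_FLATTEN_THRESHOLD: usize = 8;

struct Mo {
    pages: HashMap<usize, u64>, // stands in for the radix tree
    parent: Option<Box<Mo>>,    // stands in for cow_parent
}

/// Returns (phys, depth); copies the page into the child's own tree
/// ("flattening") when the walk exceeded the threshold.
fn resolve(mo: &mut Mo, index: usize) -> Option<(u64, usize)> {
    if let Some(&phys) = mo.pages.get(&index) {
        return Some((phys, 0)); // locally committed
    }
    let mut depth = 0;
    let mut cur = mo.parent.as_deref();
    let mut found = None;
    while let Some(p) = cur {
        depth += 1;
        if let Some(&phys) = p.pages.get(&index) {
            found = Some((phys, depth));
            break;
        }
        cur = p.parent.as_deref();
    }
    if let Some((phys, d)) = found {
        if d > COW_FLATTEN_THRESHOLD {
            mo.pages.insert(index, phys); // future lookups hit depth 0
        }
    }
    found
}

fn main() {
    // Depth-10 chain whose root owns page 0.
    let mut mo = Mo { pages: HashMap::from([(0, 0xA000)]), parent: None };
    for _ in 0..10 {
        mo = Mo { pages: HashMap::new(), parent: Some(Box::new(mo)) };
    }
    assert_eq!(resolve(&mut mo, 0), Some((0xA000, 10)));
    // Depth 10 > 8 triggered a flatten, so the next lookup is local.
    assert_eq!(resolve(&mut mo, 0), Some((0xA000, 0)));
}
```

Note the simplification: this model copies only the address, whereas the kernel copies the page contents into a private frame before installing it.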
cow_install_atomic Transaction Model
cow_install_atomic is the atomic commit primitive used to install a privately-copied frame into a COW child MO. All four phases run under MO.commit_lock:
1. Reserve BUSY: write `PHYS_TAG_BUSY` into the radix slot, claiming it.
2. Publish PTE + TLB + PMM mapping-refs: write the new PTE with write permission, issue a TLB invalidation for the faulting address, and call `pmm_retain_mapping` to account for the new mapping reference.
3. Finalize radix: replace the `PHYS_TAG_BUSY` sentinel with the real physical address.
4. Commit point: `pmm_set_owner(MoData)` transfers the frame from `KernelPrivate{General}` (or `CowPool`) to `MoData`. After this point the frame is fully owned by the MO.
write_entry failure (step 1) rolls back the PHYS_TAG_BUSY reservation and returns the frame to the PMM without touching the PTE or PMM owner state.
A RaceLost result (another CPU already reserved or finalized the slot) is reflected back to the fault handler as "continue" — the handler retries the fault with no frame leaked.
Write Fault Resolution (Kernel Fast-Path)
When a thread writes to a COW-shared page, the page table entry is read-only but the VmArea allows writes:
1. The page fault handler identifies this as a COW fault (the PTE has the COW bit set).
2. Allocate a new frame tagged `KernelPrivate{General}` from the PMM (not from the MO's untyped source).
3. Copy 4,096 bytes from the original page to the new frame.
4. Call `cow_install_atomic`, which atomically:
   - reserves the radix slot with `PHYS_TAG_BUSY` under `MO.commit_lock`;
   - writes the new PTE, issues TLB invalidation, and calls `pmm_retain_mapping`;
   - finalizes the radix slot, replacing `PHYS_TAG_BUSY` with the new physical address;
   - calls `pmm_set_owner(MoData)` — the commit point.
5. If the radix reservation fails (e.g., another CPU already installed a page), the frame is returned to the PMM via `pmm_free` and the PTE is never modified. The fault handler translates a `RaceLost` return from `cow_install_atomic` as "continue" and retries the fault.
COW write faults are resolved entirely in the kernel without IPC to mmsrv. This is critical for fork performance — a deep fork tree with many touched pages would be extremely slow if each fault required an IPC round-trip.
is_local_committed
is_local_committed(index) acquires commit_lock and checks whether a given page index has a committed (non-zero, non-BUSY) entry in the MO’s own radix tree.
It is used by fault paths to determine whether a page is locally committed before walking the COW parent chain — a locally committed page does not need a COW copy.
Invoke Labels
| Label | Name | Description |
|---|---|---|
| 0x90 | `MO_COMMIT` | Commit (allocate) pages for a range. |
| 0x91 | `MO_DECOMMIT` | Decommit (free) pages for a range. |
| — | — | Return the page count. |
| 0x93 | `MO_CLONE` | Create a COW child MO. |
| — | `MO_RESIZE` | Grow or shrink the MO. |
| — | — | Read data from MO pages (kernel-mediated read). |
| — | — | Write data to MO pages (kernel-mediated write). |
| — | — | MO context: check if page is committed. VSpace context: map MO pages into VSpace. Label shared — dispatch decided by the invoked capability's object type. |
| — | — | Return how many live VSpace mappings reference a given page of this MO. |
| — | `MO_UPDATE_PAGE_FLAGS` | MO context: atomically set/clear per-page `FrameMeta.flags` bits. |
| — | — | Bulk COW fork: clone MO range and remap. |
There is no dedicated VSPACE_UNMAP_MO label. MO-backed ranges are torn down with the generic VSPACE_UNMAP (0x51).
Per-Page Flags (MO_UPDATE_PAGE_FLAGS)
MO_UPDATE_PAGE_FLAGS operates on FrameMeta.flags for a single committed page (arg0 = page_index, arg1 = set_mask, arg2 = clear_mask). The previous flag word is returned in TronaResult.value.
| Bit | Name | Meaning |
|---|---|---|
| — | — | Page has been modified since the last cleaner pass. |
| — | — | Page has been accessed recently; cleared by the aging pass. |
| — | — | Page is pinned into memory and counts toward pin accounting. |
| — | — | Page is currently being written back to its backing store. |
| — | — | Page belongs to the active LRU list. |
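The read-modify-write semantics can be sketched as follows. The flag bit positions here are placeholders, since the table above does not pin them down:

```rust
// Sketch of the MO_UPDATE_PAGE_FLAGS read-modify-write: apply the set
// and clear masks to a page's flag word and return the previous value
// (the text says the previous word comes back in TronaResult.value).
const FLAG_DIRTY: u64 = 1 << 0;    // placeholder bit assignment
const FLAG_ACCESSED: u64 = 1 << 1; // placeholder bit assignment

fn update_page_flags(flags: &mut u64, set_mask: u64, clear_mask: u64) -> u64 {
    let prev = *flags;
    *flags = (*flags | set_mask) & !clear_mask;
    prev
}

fn main() {
    let mut flags = FLAG_ACCESSED;
    // Mark the page dirty while clearing the accessed bit (aging pass).
    let prev = update_page_flags(&mut flags, FLAG_DIRTY, FLAG_ACCESSED);
    assert_eq!(prev, FLAG_ACCESSED);
    assert_eq!(flags, FLAG_DIRTY);
}
```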
Destruction
When the last capability to an MO is deleted (release_object triggers destroy()):
1. Iterate the reverse map and unmap all pages from all observing VSpaces. Note: by the time MO `destroy()` fires, some rmap entries may already have been removed by prior VSpace teardowns (commit ba7d104); the iteration must therefore tolerate missing entries gracefully.
2. Walk the radix tree and free every committed page to its source (untyped or PMM).
3. Free radix tree internal nodes back to the PMM.
4. Free reverse map overflow pages back to the PMM.
5. Unlink from the COW parent's child list (if `CowChild`).
6. Decrement the parent MO's reference count.
Lock Ordering
MemoryObject operations participate in the kernel lock hierarchy:
CAP_LOCK → endpoint.lock → ... → commit_lock → ut.alloc_lock
commit_lock protects the radix tree during commit/decommit and is now also held during all radix reads on the COW chain (including reads of self.cow_parent).
ut.alloc_lock (on the backing untyped) protects the untyped watermark during frame allocation.
commit_lock is at the same nesting depth as rmap_lock — these two locks are disjoint and must never be held simultaneously. Full writer ordering: commit_lock → ut.alloc_lock → FRAME_LOCK.
MemoryObject::destroy() runs after release_object() decrements the refcount to zero.
At this point, no concurrent accessors exist.
The destroy path acquires VSpace.lock per reverse-map entry without holding CAP_LOCK (which was released before release_object).
Error Conditions and Edge Cases
Commit Exhaustion (Both Sources Fail)
MO_COMMIT tries the untyped source first, then the PMM.
If both are exhausted:

1. The BUSY sentinel in the radix tree leaf is cleared (the slot reverts to uncommitted).
2. The commit returns `NotEnoughMemory`.
3. The caller (mmsrv or fault handler) must handle the failure, typically by reporting out-of-memory to the application.
The emergency reserve provides a last-resort pool for fault-path commits, but it is small (32 frames) and restricted to fault handlers.
Concurrent Commit to the Same Page
Two threads faulting on the same COW page simultaneously:
1. Thread A acquires `commit_lock`, writes the BUSY sentinel, releases the lock, and starts allocating.
2. Thread B acquires `commit_lock`, sees BUSY, treats the page as uncommitted, and releases the lock.
3. Thread B either retries (if in a fault loop) or returns to userspace (where the fault will re-trigger after A completes).
4. Thread A finishes allocation and writes the physical address, replacing BUSY.
The BUSY sentinel prevents double allocation. No frame is wasted.
The above describes the non-COW commit path. For COW paths, a RaceLost return from cow_install_atomic is translated to "continue" in the fault handler — the faulting thread simply retries the fault rather than blocking. The semantics differ: in the non-COW path Thread B eventually re-executes and may observe Thread A’s result; in the COW path, a RaceLost means another CPU already resolved the COW fault and the faulting thread can proceed.
Deep COW Chain
A chain of fork-without-exec creates increasingly deep COW parent chains:
shell → fork → fork → fork → ... → fork (depth N)
Page resolution walks the chain until it finds the page.
When depth > COW_FLATTEN_THRESHOLD (8), the kernel copies the page into the child’s own radix tree (flattening).
This bounds worst-case resolution to 8 parent hops.
Without flattening, a depth-100 fork chain would require 100 pointer dereferences per page access.
Destruction with Active Mappings
When the last capability to an MO is deleted while VSpaces still map its pages:
1. `destroy()` iterates the reverse map.
2. For each mapping: acquire `VSpace.lock`, clear the PTEs, issue TLB invalidation, and remove the VmArea.
3. After all mappings are removed: free all committed pages to their sources.
4. Free radix tree nodes and reverse map overflow pages.
Threads that access the unmapped pages after destruction take a page fault. The fault handler finds no VmArea and delivers a VM fault to mmsrv, which typically sends SIGSEGV.
Resize Race with Active Commit
MO_RESIZE (shrink) while a commit is in progress on a page that will be truncated:
1. Resize acquires `commit_lock` before modifying `page_count`.
2. Any in-flight commit that holds `commit_lock` must complete before the resize proceeds.
3. After the resize, pages beyond the new `page_count` are decommitted.
Related Pages
- Physical Memory — PMM frame ownership and allocation
- Virtual Address Spaces — page tables, fault handling, MAP_MO/UNMAP_MO
- Page Fault Handling — COW and demand fault resolution
- Untyped Memory — primary frame source for MO commit
- Design Patterns — fork and memory allocation patterns
- Object Model — MemoryObject as kernel object type 11