Memory Objects

A MemoryObject (MO) is a page-granular memory abstraction that manages physical pages independently of any address space. VSpaces map MO pages into hardware page tables; the MO tracks which VSpaces observe which pages via reverse maps. MOs support copy-on-write cloning for efficient fork.

MemoryObject Structure

#[repr(C)]
pub struct MemoryObject {
    pub header: KernelObject,           // obj_type = MemoryObject (11)
    pub kind: MoKind,                   // Anon | CowChild | FileBacked | Shm
    pub page_count: u32,                // total pages in this MO
    pub pages: RadixTree,               // 4-level radix tree: page_idx → PhysAddr
    pub reverse_maps: ReverseMaps,      // VSpace observer tracking
    pub cow_parent: u64,                // parent MO cap slot (0 = none)
    pub first_child: *mut MemoryObject, // head of COW child list
    pub next_sibling: *mut MemoryObject,// sibling in parent's child list
    pub untyped_phys: u64,              // backing untyped physical address
    pub commit_lock: SpinLock,          // protects radix tree mutations
}

The struct size is computed by MemoryObject::required_bytes() and rounded up to 64-byte cache-line alignment. Radix tree internal nodes are allocated separately from the PMM, not from the MO’s untyped allocation.
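required_bytes() itself is not reproduced here; as a minimal sketch, the cache-line rounding it is described as performing is the usual align-up idiom (round_up_to_cache_line is an illustrative name, not the kernel function):

```rust
const CACHE_LINE: usize = 64;

/// Round a byte count up to the next 64-byte cache-line boundary.
/// Works for any power-of-two alignment baked into CACHE_LINE.
fn round_up_to_cache_line(bytes: usize) -> usize {
    (bytes + CACHE_LINE - 1) & !(CACHE_LINE - 1)
}
```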

MO Kinds

Kind         Description
Anon         Anonymous memory. The standard kind for heap, stack, and general-purpose allocations.
CowChild     Copy-on-write child. Shares pages with a parent MO until a write fault triggers a private copy.
FileBacked   File-backed pages (reserved, not yet implemented). Will integrate with the page cache.
Shm          Shared memory. Multiple VSpaces map the same MO without COW semantics.

Page Storage: 4-Level Radix Tree

MO pages are stored in a radix tree with the same structural design as hardware page tables.

Node Structure

Each node is one physical page (4,096 bytes) containing 512 entries of 8 bytes each. Each level consumes 9 bits of the page index.

Entries per node    512 (4096 / 8)
Bits per level      9
Maximum levels      4
Maximum capacity    512⁴ ≈ 68.7 billion entries (256 TB at 4 KB pages)
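Assuming the standard page-table-style split, a page index decomposes into 9-bit per-level indices, most-significant level first (radix_indices is an illustrative helper, not a kernel function):

```rust
/// Split a page index into per-level radix indices (9 bits each,
/// 512 entries per node) for a tree of the given depth.
/// Returned most-significant level first, matching a root-to-leaf walk.
fn radix_indices(page_idx: u64, levels: u32) -> Vec<usize> {
    (0..levels)
        .rev()
        .map(|l| ((page_idx >> (9 * l)) & 0x1FF) as usize)
        .collect()
}
```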

Adaptive Depth

The tree depth adapts to the maximum page index used:

Max index        Levels   Capacity
≤ 512            1        2 MB
≤ 262,144        2        1 GB
≤ 134,217,728    3        512 GB
> 134,217,728    4        256 TB

When an insertion requires a higher index than the current depth supports, the tree grows by inserting new root nodes above the current root. The old root becomes entry 0 of the new root.
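The depth rule in the table above can be sketched as a small helper (levels_for_count is illustrative, not a kernel function):

```rust
/// Tree depth needed to store the given number of pages,
/// with 512 entries per node and each level multiplying capacity by 512.
fn levels_for_count(page_count: u64) -> u32 {
    let mut levels = 1;
    let mut capacity = 512u64; // capacity of a 1-level tree
    while page_count > capacity {
        levels += 1;
        capacity *= 512;
    }
    levels
}
```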

Leaf Entries

Leaf entries store physical addresses with tag bits packed into the low 12 bits, which are always zero for page-aligned addresses and are therefore free to carry tags:

Bit   Name              Meaning
0     PHYS_TAG_UNTYPED  Page was allocated from an untyped source (not the PMM).
1     PHYS_TAG_BUSY     Commit-in-progress sentinel: another thread reserved this slot and is allocating a frame.
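A hedged sketch of how a leaf entry might be decoded, assuming PHYS_TAG_MASK covers the low 12 bits as in resolve_page_depth:

```rust
const PHYS_TAG_UNTYPED: u64 = 1 << 0; // frame came from the MO's untyped source
const PHYS_TAG_BUSY: u64 = 1 << 1;    // commit in progress by another thread
const PHYS_TAG_MASK: u64 = 0xFFF;     // low 12 bits reserved for tags

/// A leaf is committed when it is non-zero and not a BUSY sentinel.
fn is_committed(entry: u64) -> bool {
    entry != 0 && entry & PHYS_TAG_BUSY == 0
}

/// Strip the tag bits to recover the page-aligned physical address.
fn phys_addr(entry: u64) -> u64 {
    entry & !PHYS_TAG_MASK
}
```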

Node Allocation

Internal nodes are allocated from the PMM with FrameOwner::MoMeta { subkind: Radix }. This is separate from the MO’s data page allocation — the MO’s untyped provides data pages, while the PMM provides tree metadata pages.

Memory Overhead

For an MO with N committed pages:

  • Best case (dense, 1 level): 1 node = 4 KB overhead for up to 512 pages.

  • Typical (2 levels): 1 root + ceil(N/512) leaf nodes.

  • Worst case (sparse, 4 levels): up to 3 intermediate nodes per leaf path, but sharing reduces this dramatically.
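The typical-case figure can be computed directly (two_level_nodes is an illustrative helper):

```rust
/// Metadata overhead, in 4 KB nodes, for a dense two-level tree:
/// one root plus ceil(n_pages / 512) leaf nodes.
fn two_level_nodes(n_pages: u64) -> u64 {
    1 + (n_pages + 511) / 512
}
```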

Page Commit and Decommit

Commit (MO_COMMIT, label 0x90)

Allocates physical pages for a range of page indices. The dual-source allocation and BUSY sentinel are key to the design.

For each page in the requested range:

  1. Acquire commit_lock.

  2. Read the radix tree leaf entry.

    1. If already committed (non-zero, non-BUSY): skip.

    2. If BUSY: another thread is committing this page; skip.

  3. Write PHYS_TAG_BUSY sentinel into the leaf to reserve the slot.

  4. Release commit_lock.

  5. Allocate a frame from the dual source:

    1. First, try the MO’s untyped source (if configured).

    2. If untyped allocation fails, fall back to the PMM.

  6. Reacquire commit_lock.

  7. Write the physical address (with PHYS_TAG_UNTYPED if from untyped) into the radix tree leaf, replacing the BUSY sentinel.

  8. Set FrameOwner::MoData { mo, page_idx } on the allocated frame via pmm_transfer.

  9. Release commit_lock.

The BUSY sentinel prevents two threads from committing the same page concurrently. The lock is released during frame allocation (step 5) to avoid holding commit_lock across potentially slow untyped operations.
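The per-page protocol can be modeled outside the kernel with one mutex standing in for commit_lock. MiniMo, commit_page, and alloc are illustrative stand-ins, not kernel APIs; the radix tree is flattened to a Vec, and the exhaustion rollback described later is included:

```rust
use std::sync::Mutex;

const PHYS_TAG_UNTYPED: u64 = 1 << 0;
const PHYS_TAG_BUSY: u64 = 1 << 1;

/// Toy MO: the radix tree is modeled as a flat Vec guarded by one lock.
struct MiniMo {
    pages: Mutex<Vec<u64>>,
}

/// Commit one page. `alloc` stands in for the dual-source allocator and
/// returns (phys_addr, came_from_untyped), or None when both sources fail.
fn commit_page<F>(mo: &MiniMo, idx: usize, alloc: F) -> bool
where
    F: FnOnce() -> Option<(u64, bool)>,
{
    {
        // Steps 1-4: reserve the slot under the lock, then drop the lock.
        let mut pages = mo.pages.lock().unwrap();
        if pages[idx] != 0 {
            return false; // already committed, or another thread holds BUSY
        }
        pages[idx] = PHYS_TAG_BUSY;
    }
    // Step 5: allocate with the lock released (may be slow).
    match alloc() {
        Some((pa, from_untyped)) => {
            // Steps 6-9: publish the frame, replacing the BUSY sentinel.
            let tag = if from_untyped { PHYS_TAG_UNTYPED } else { 0 };
            mo.pages.lock().unwrap()[idx] = pa | tag;
            true
        }
        None => {
            // Both sources exhausted: roll the slot back to uncommitted.
            mo.pages.lock().unwrap()[idx] = 0;
            false
        }
    }
}
```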

Decommit (MO_DECOMMIT, label 0x91)

Releases physical pages back to their source.

For each committed page in the range:

  1. Read the physical address from the radix tree.

  2. Iterate the reverse map and unmap the page from all VSpaces that reference it.

  3. Free the frame: to the untyped source (if PHYS_TAG_UNTYPED) or PMM.

  4. Clear the radix tree leaf entry.

Reverse Mappings

The reverse map tracks which VSpaces have mapped regions of this MO. It enables the kernel to find and update all page table entries that reference an MO page.

Structure

pub struct ReverseMapEntry {    // 32 bytes
    pub vspace: *mut VSpace,   // observing VSpace
    pub va_start: u64,         // virtual address of mapping start
    pub page_count: u32,       // pages in this mapping
    pub mo_offset: u32,        // starting page index within the MO
    pub perms: u8,             // mapping permissions
    pub _pad: [u8; 7],
}

Storage Layout

Each MO has 8 inline reverse map slots (RMAP_INLINE). If those are exhausted, overflow pages are allocated from the PMM with FrameOwner::MoMeta { subkind: Rmap }:

  • Each overflow page holds (4096 - 8) / 32 = 127 entries, linked as a singly-linked list.

  • The 8-byte overhead per page is the next pointer.

For the common case (MO mapped into 1-2 VSpaces), no overflow allocation is needed.
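The stated sizes can be checked with a standalone re-declaration of the entry (the kernel's vspace field is *mut VSpace; it is mirrored here as usize, assuming a 64-bit target):

```rust
use std::mem::size_of;

/// Standalone copy of ReverseMapEntry, laid out as in the document,
/// to verify the 32-byte size and the 127-entries-per-overflow-page figure.
#[repr(C)]
struct ReverseMapEntry {
    vspace: usize,    // *mut VSpace in the kernel (8 bytes on 64-bit)
    va_start: u64,
    page_count: u32,
    mo_offset: u32,
    perms: u8,
    _pad: [u8; 7],
}
```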

Operations

add(entry): Insert into the first empty inline slot, or the first empty slot in an overflow page.

remove(vspace, va_start): Scan inline slots and overflow pages for a matching entry and clear it.

for_each(callback): Iterate all non-empty entries (inline + overflow). Used during decommit and destruction.

COW Cloning

MO_CLONE (invoke label 0x93) creates a copy-on-write child MO.

Clone Operation

  1. Create a new MemoryObject with kind = CowChild.

  2. Set cow_parent to the parent MO’s capability slot.

  3. Link the child into the parent’s first_child / next_sibling list.

  4. Increment the parent MO’s reference count.

  5. The child’s radix tree starts empty — all pages are shared with the parent.

  6. Mark both parent and child VSpace mappings as read-only to trap write faults.

Page Resolution with COW Chain

When resolving a page in a CowChild:

  1. Check the child’s own radix tree. If a local entry exists (non-zero, non-BUSY), return it at depth 0.

  2. Walk the cow_parent chain, checking each ancestor’s radix tree.

  3. Return the first matching entry with the chain depth.

pub fn resolve_page_depth(&self, index: usize) -> Option<(u64, usize)> {
    // Check our own radix tree first; radix reads (and reads of
    // cow_parent) must happen under commit_lock.
    self.commit_lock.lock();
    let entry = self.pages.get(index);
    let mut parent_slot = self.cow_parent;
    self.commit_lock.unlock();
    if entry != 0 && (entry & PHYS_TAG_BUSY) == 0 {
        return Some((entry & !PHYS_TAG_MASK, 0)); // found locally: depth 0
    }
    // Walk the cow_parent chain, checking each ancestor's radix tree.
    let mut depth = 0;
    while parent_slot != 0 {
        let parent = deref_cow_parent(parent_slot);
        depth += 1;
        parent.commit_lock.lock();
        let entry = parent.pages.get(index);
        parent_slot = parent.cow_parent;
        parent.commit_lock.unlock();
        if entry != 0 && (entry & PHYS_TAG_BUSY) == 0 {
            return Some((entry & !PHYS_TAG_MASK, depth));
        }
    }
    None
}

COW Flatten Threshold

If depth > COW_FLATTEN_THRESHOLD (8), the kernel proactively copies the page into the child’s own radix tree. This bounds the worst-case chain walk length and prevents deep fork chains from degrading read performance.

cow_install_atomic Transaction Model

cow_install_atomic is the atomic commit primitive used to install a privately-copied frame into a COW child MO. All four phases run under MO.commit_lock:

  1. Reserve BUSY — write PHYS_TAG_BUSY into the radix slot, claiming it.

  2. Publish PTE + TLB + PMM mapping-refs — write the new PTE with write permission, issue a TLB invalidation for the faulting address, and call pmm_retain_mapping to account for the new mapping reference.

  3. Finalize radix — replace the PHYS_TAG_BUSY sentinel with the real physical address.

  4. Commit point: pmm_set_owner(MoData) — transfer the frame from KernelPrivate{General} (or CowPool) to MoData. After this point the frame is fully owned by the MO.

write_entry failure (step 1) rolls back the PHYS_TAG_BUSY reservation and returns the frame to the PMM without touching the PTE or PMM owner state.

A RaceLost result (another CPU already reserved or finalized the slot) is reflected back to the fault handler as "continue" — the handler retries the fault with no frame leaked.
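A toy model of the transaction's race behavior, with the PTE, TLB, and PMM phases elided as comments (everything except the cow_install_atomic name and PHYS_TAG_BUSY is illustrative):

```rust
use std::sync::Mutex;

const PHYS_TAG_BUSY: u64 = 1 << 1;

/// Outcome of the atomic install, mirroring the two results described above.
#[derive(Debug, PartialEq)]
enum CowInstall {
    Installed,
    RaceLost,
}

/// Simplified model: one lock covers all four phases, and a non-empty
/// slot (BUSY or finalized) means another CPU already won the race.
fn cow_install_atomic(pages: &Mutex<Vec<u64>>, idx: usize, new_pa: u64) -> CowInstall {
    let mut p = pages.lock().unwrap();
    if p[idx] != 0 {
        return CowInstall::RaceLost; // reserved or finalized by another CPU
    }
    p[idx] = PHYS_TAG_BUSY; // phase 1: reserve
    // Phase 2 would publish the PTE, invalidate the TLB, and retain the
    // mapping ref; phase 3 finalizes the slot with the real address:
    p[idx] = new_pa;
    // Phase 4 (the commit point) would transfer frame ownership to MoData.
    CowInstall::Installed
}
```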

Write Fault Resolution (Kernel Fast-Path)

When a thread writes to a COW-shared page, the page table entry is read-only but the VmArea allows writes:

  1. The page fault handler identifies this as a COW fault (PTE has COW bit set).

  2. Allocate a new frame tagged KernelPrivate{General} from the PMM (not from the MO’s untyped source).

  3. Copy 4,096 bytes from the original page to the new frame.

  4. Call cow_install_atomic, which performs the following atomically:

    1. Reserve the radix slot with PHYS_TAG_BUSY under MO.commit_lock.

    2. Write the new PTE, issue TLB invalidation, and call pmm_retain_mapping.

    3. Finalize the radix slot: replace PHYS_TAG_BUSY with the new physical address.

    4. Call pmm_set_owner(MoData) — this is the commit point.

  5. If the radix reservation fails (e.g., another CPU already installed a page): the frame is returned to the PMM via pmm_free, and the PTE is never modified. The fault handler translates a RaceLost return from cow_install_atomic as "continue" and retries the fault.

COW write faults are resolved entirely in the kernel without IPC to mmsrv. This is critical for fork performance — a deep fork tree with many touched pages would be extremely slow if each fault required an IPC round-trip.

is_local_committed

is_local_committed(index) acquires commit_lock and checks whether a given page index has a committed (non-zero, non-BUSY) entry in the MO’s own radix tree. It is used by fault paths to determine whether a page is locally committed before walking the COW parent chain — a locally committed page does not need a COW copy.

Invoke Labels

Label  Name                 Description
0x90   MO_COMMIT            Commit (allocate) pages for a range.
0x91   MO_DECOMMIT          Decommit (free) pages for a range.
0x92   MO_GET_SIZE          Return the page count.
0x93   MO_CLONE             Create a COW child MO.
0x94   MO_RESIZE            Grow or shrink the MO.
0x95   MO_READ              Read data from MO pages (kernel-mediated read).
0x96   MO_WRITE             Write data to MO pages (kernel-mediated write).
0x97   MO_HAS_PAGE / VSPACE_MAP_MO
       MO context: check whether a page is committed. VSpace context: map MO pages
       into a VSpace. Shared label; dispatch is decided by the invoked capability's
       object type.
0x98   MO_GET_MAP_COUNT     Return how many live VSpace mappings reference a given
                            page of this MO.
0x99   MO_UPDATE_PAGE_FLAGS / VSPACE_SHARE_RO_PAGE
       MO context: atomically set/clear FrameMeta.flags bits for a committed page.
       VSpace context: share a single read-only page between MOs. Shared label;
       dispatch is decided by the invoked capability's object type.
0x9A   VSPACE_FORK_RANGE    Bulk COW fork: clone an MO range and remap it.
There is no dedicated VSPACE_UNMAP_MO label. MO-backed ranges are torn down with the generic VSPACE_UNMAP (0x51).

Per-Page Flags (MO_UPDATE_PAGE_FLAGS)

MO_UPDATE_PAGE_FLAGS operates on FrameMeta.flags for a single committed page (arg0 = page_index, arg1 = set_mask, arg2 = clear_mask). The previous flag word is returned in TronaResult.value.

Bit     Name                      Meaning
1 << 0  MO_PAGE_FLAG_DIRTY        Page has been modified since the last cleaner pass.
1 << 1  MO_PAGE_FLAG_REFERENCED   Page has been accessed recently; cleared by the aging pass.
1 << 2  MO_PAGE_FLAG_PINNED       Page is pinned in memory and counts toward VmLck.
1 << 3  MO_PAGE_FLAG_WRITEBACK    Page is currently being written back to its backing store.
1 << 4  MO_PAGE_FLAG_ACTIVE       Page belongs to the active LRU list.
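The set/clear semantics can be sketched as follows. Whether set or clear wins when the masks overlap is an assumption here (this sketch applies clear first, so set wins):

```rust
const MO_PAGE_FLAG_DIRTY: u8 = 1 << 0;
const MO_PAGE_FLAG_REFERENCED: u8 = 1 << 1;

/// Apply set/clear masks to a flag word and return the previous value,
/// matching the described arguments (set_mask, clear_mask) and result.
fn update_page_flags(flags: &mut u8, set_mask: u8, clear_mask: u8) -> u8 {
    let prev = *flags;
    *flags = (prev & !clear_mask) | set_mask;
    prev
}
```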

Destruction

When the last capability to an MO is deleted (release_object triggers destroy()):

  1. Iterate the reverse map and unmap all pages from all observing VSpaces. Note: by the time MO destroy() fires, some rmap entries may already have been removed by prior VSpace teardowns (commit ba7d104). The iteration must therefore tolerate missing entries gracefully.

  2. Walk the radix tree and free every committed page to its source (untyped or PMM).

  3. Free radix tree internal nodes back to PMM.

  4. Free reverse map overflow pages back to PMM.

  5. Unlink from the COW parent’s child list (if CowChild).

  6. Decrement the parent MO’s reference count.

Lock Ordering

MemoryObject operations participate in the kernel lock hierarchy:

CAP_LOCK → endpoint.lock → ... → commit_lock → ut.alloc_lock

commit_lock protects the radix tree during commit/decommit and is now also held during all radix reads on the COW chain (including reads of self.cow_parent). ut.alloc_lock (on the backing untyped) protects the untyped watermark during frame allocation.

commit_lock is at the same nesting depth as rmap_lock — these two locks are disjoint and must never be held simultaneously. Full writer ordering: commit_lock → ut.alloc_lock → FRAME_LOCK.

MemoryObject::destroy() runs after release_object() decrements the refcount to zero. At this point, no concurrent accessors exist. The destroy path acquires VSpace.lock per reverse-map entry without holding CAP_LOCK (which was released before release_object).

Error Conditions and Edge Cases

Commit Exhaustion (Both Sources Fail)

MO_COMMIT tries the untyped source first, then the PMM. If both are exhausted:

  • The BUSY sentinel in the radix tree leaf is cleared (slot reverts to uncommitted).

  • The commit returns NotEnoughMemory.

  • The caller (mmsrv or fault handler) must handle the failure — typically by reporting out-of-memory to the application.

The emergency reserve provides a last-resort pool for fault-path commits, but it is small (32 frames) and restricted to fault handlers.

Concurrent Commit to the Same Page

Two threads faulting on the same COW page simultaneously:

  1. Thread A acquires commit_lock, writes BUSY sentinel, releases lock, starts allocating.

  2. Thread B acquires commit_lock, sees BUSY, treats page as uncommitted, releases lock.

  3. Thread B either retries (if in a fault loop) or returns to userspace (where the fault will re-trigger after A completes).

  4. Thread A finishes allocation, writes the physical address, replaces BUSY.

The BUSY sentinel prevents double allocation. No frame is wasted.

The above describes the non-COW commit path. For COW paths, a RaceLost return from cow_install_atomic is translated to "continue" in the fault handler — the faulting thread simply retries the fault rather than blocking. The semantics differ: in the non-COW path Thread B eventually re-executes and may observe Thread A’s result; in the COW path, a RaceLost means another CPU already resolved the COW fault and the faulting thread can proceed.

Deep COW Chain

A chain of fork-without-exec creates increasingly deep COW parent chains:

shell → fork → fork → fork → ... → fork (depth N)

Page resolution walks the chain until it finds the page. When depth > COW_FLATTEN_THRESHOLD (8), the kernel copies the page into the child’s own radix tree (flattening). This bounds worst-case resolution to 8 parent hops.

Without flattening, a depth-100 fork chain would require 100 pointer dereferences per page access.

Destruction with Active Mappings

When the last capability to an MO is deleted while VSpaces still map its pages:

  1. destroy() iterates the reverse map.

  2. For each mapping: acquires VSpace.lock, clears the PTEs, issues TLB invalidation, removes the VmArea.

  3. After all mappings are removed: frees all committed pages to their sources.

  4. Frees radix tree nodes and reverse map overflow pages.

Threads that access the unmapped pages after destruction take a page fault. The fault handler finds no VmArea and delivers a VM fault to mmsrv, which typically sends SIGSEGV.

Resize Race with Active Commit

MO_RESIZE (shrink) while a commit is in progress on a page that will be truncated:

  • Resize acquires commit_lock before modifying page_count.

  • Any in-flight commit that holds commit_lock must complete before resize proceeds.

  • After resize, pages beyond the new page_count are decommitted.