Known Kernel Debt
This page lists structural debts and deferred work inside the kernite kernel. Every entry here is known and being tracked — these are not unknown bugs. Each entry describes the current state, why it still exists, what it blocks, where to intervene, and when to revisit.
New entries should be added whenever a workaround, latent race, or placeholder is introduced that the author does not intend to fix in the same change.
Memory-Model Audit Status
All blockers H1–H10 are CLOSED. No open blockers exist at current HEAD.
See Memory Model Audit for the full ledger.
Secondary Proof Complications
These are not blocker-level items, but each complicates a mechanized proof:
- Intermediate radix nodes on `write_entry` failure — Memory accounting only; MO teardown frees the nodes. No correctness impact at runtime.
- Pool depth drift on `RaceLost` — `handle_cow_fault_pooled` consumes a pool entry before `cow_install_atomic`; a `RaceLost` result frees the frame without notifying mmsrv. Counter-visible but correctness-neutral.
- First-mapping caller lifetime — The physical address returned by `resolve_page_depth` is valid only inside the `VSpace.lock` window for first-mapping callers. Callers must not escape the lock before using the returned address.
- `cow_parent` slot-reuse window — Between `commit_lock` release and the `deref_cow_parent` call, the CDT may free and reallocate the slot. Bounded by A1 (kernel objects are never freed) and type-checking.
- `handle_cow_fault` behavior change — Now returns `Err(NotMapped)` instead of silent success when the COW PTE has no backing MO.
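The first-mapping lifetime rule above can be sketched as a closure that forces the caller to consume the resolved address inside the lock window. This is a hedged sketch: `VSpace`, `PageTables`, and the resolver shape are stand-ins, not kernite's real API.

```rust
use std::sync::Mutex;

struct PageTables;

// Illustrative stand-in for a VSpace with a lock guarding its page tables.
struct VSpace {
    lock: Mutex<PageTables>,
}

impl VSpace {
    // Resolve and consume the physical address inside one lock window.
    // The closure runs while the guard is held, so the returned address
    // cannot outlive the `VSpace.lock` window.
    fn with_resolved<R>(&self, f: impl FnOnce(u64) -> R) -> R {
        let _guard = self.lock.lock().unwrap(); // lock window opens
        let paddr: u64 = 0x1000; // stand-in for resolve_page_depth(...)
        f(paddr) // address consumed before the guard drops
    } // lock window closes here
}
```

The closure shape makes "do not escape the lock" a structural property rather than a convention the caller has to remember.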
Load-Bearing Assumptions
These assumptions are relied upon by multiple subsystems. If either is ever violated, the reverse-map / destroy logic and publication protocol need audit.
A1. Kernel objects are never freed or reused during system lifetime. Raw pointer identity in reverse-map and destroy logic depends on this property. An object pointer stored in the rmap list remains valid as long as the system is running.
A2. The mmsrv main loop is single-threaded.
MmClient publication uses a staged swap, not an SMP-safe publication protocol.
Introducing parallelism into mmsrv requires revisiting every staged-swap site.
uapi ↔ kernel ABI drift
- Status: Accumulated (as of the `feat/cpu-accounting-procfs` branch). Mitigated by convention: every new syscall or invoke label must be added to both sides in the same change.
Current state
Kernite’s syscall numbers and invoke labels are declared in two places that are not structurally enforced to agree:
| Surface | Declared in | Consumed by |
|---|---|---|
| Syscall numbers (`SYS_*`) | `lib/trona/uapi/consts/kernel.rs` | userland (trona substrate) |
| Syscall dispatch (`Syscall` enum discriminants, same values) | kernel | kernel dispatcher |
| Invoke label constants (`TCB_*`, …) | `lib/trona/uapi/consts/kernel.rs` | userland wrappers |
| Invoke label dispatch (same values, written as numeric literals) | `handle_invoke` | kernel dispatcher |
Because the kernel does not `include!` uapi, the compiler cannot prove the two tables stay in sync. Drift is only detectable at runtime, and usually shows up as `InvalidOperation` or silent no-ops.
Server protocol labels (`PM_*`, `MM_*`, `VFS_*`, `net_*`, `dns_*`) do NOT have this problem — servers use uapi directly, so a rename breaks compilation.
Known concrete drift
- TCB labels `0x44`/`0x45`/`0x47` (`TCB_SET_AFFINITY`, `TCB_READ_REGISTERS`, `TCB_SET_PRIORITY`) exist in the kernel’s `handle_invoke` but are not yet exposed in uapi. Reading uapi alone makes them look "free" — they are not.
- `TCB_UNBIND_NOTIFICATION = 0x4A` only landed in uapi on this branch; it was a kernel-only label before.
- `TICKS_PER_SEC = 100` is defined both in `lib/trona/uapi/consts/kernel.rs` and `kernite/src/sched/mod.rs`.
- `TCB_GET_CPU_TIMES = 0x80` is a workaround: the TCB `0x40`–`0x4F` block is saturated, so the introspection extension landed outside the range. Intended to move back into the TCB range once the label blocks are widened (see PR-5).
Target end state
`lib/trona/uapi/` is the single source of truth; the kernel `include!`s it and the compiler rejects any drift. Numeric `0xNN` literals in `handle_invoke` are replaced with `crate::uapi::kernel::TCB_*` constants so that label renames show up as build errors.
Resolution path (ordered PRs)
- PR-1 (small): Expose the four TCB labels (`0x44`, `0x45`, `0x47`, `0x4A`) and add the matching substrate wrappers. No kernel change.
- PR-2: Add `pub mod uapi { include!("../../lib/trona/uapi/consts/kernel.rs"); }` to the kernel. Rewrite `Syscall` enum variants to use `crate::uapi::kernel::SYS_*` as their discriminants.
- PR-3: Rewrite every `match (ObjectType::X, 0xNN)` arm in `handle_invoke` to use named constants from `crate::uapi::kernel::*`.
- PR-4: Replace `kernite/src/sched/mod.rs::TICKS_PER_SEC` with `pub use crate::uapi::kernel::TICKS_PER_SEC;`.
- PR-5 (ABI-breaking): Widen the label block sizes (e.g. TCB `0x40`–`0x5F`, VSpace `0x60`–`0x7F`, …) and move `TCB_GET_CPU_TIMES` back into the TCB range. Acceptable while vNext is pre-release.
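The PR-2 shape can be sketched roughly as follows — a hypothetical miniature, with illustrative paths and constant values (not kernite's real numbers), of how `include!`-shared constants turn label drift into a compile error:

```rust
// Kernel-side module that would textually include the uapi constants file.
// In the real tree this would be:
//     include!("../../lib/trona/uapi/consts/kernel.rs");
// The values below are illustrative only.
pub mod uapi {
    pub const SYS_YIELD: u64 = 0x01;
    pub const SYS_SEND: u64 = 0x02;
}

// Discriminants come from the shared constants, so renaming or renumbering
// a uapi constant becomes a build error instead of runtime drift.
#[repr(u64)]
pub enum Syscall {
    Yield = uapi::SYS_YIELD,
    Send = uapi::SYS_SEND,
}

// Dispatch against the same named constants — no bare 0xNN literals left
// for the two tables to disagree about.
pub fn dispatch(nr: u64) -> &'static str {
    match nr {
        x if x == uapi::SYS_YIELD => "yield",
        x if x == uapi::SYS_SEND => "send",
        _ => "invalid",
    }
}
```

The key property is that both the enum and the dispatcher reference one declaration; the compiler, not a runtime `InvalidOperation`, reports any mismatch.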
PMM ↔ untyped dual inventory
- Status: Latent race. The x86_64 `handle_demand_fault` / `handle_stack_growth_fault` fast paths were previously disabled to avoid the race; demand paging is now active (H6–H10 closed), so the fast paths run on both architectures. The structural dual-inventory issue remains: aarch64 still runs with the race exposed on concurrent `untyped_retype` + `pmm_alloc` workloads.
Current state
Two allocators describe the same physical frames and do not see each other:
- `FrameAllocator` (`kernite/src/mm/frame.rs`) is a bitmap of free frames; `pmm_alloc` draws from it.
- `UntypedMemory::retype` (`kernite/src/cap/untyped.rs`) is a bump watermark over a contiguous physical range; it checks only that its target is inside the untyped block and does not consult `FrameMeta.owner_tag`.
A frame can simultaneously be "watermarked but not yet retyped" (so retype is about to hand it out) AND "free in the PMM bitmap" (so pmm_alloc can hand it out first). The untyped-side zero-fill then corrupts user data that the PMM-side caller just installed.
Blast radius
- `kernite/src/arch/x86_64/idt.rs` previously skipped the in-kernel demand-fault / stack-growth fast paths; with H6–H10 closed, demand paging is now active and those paths are enabled.
- aarch64 runs the fast path; the race window on concurrent `mmap(MAP_LAZY)` + `untyped_retype` remains.
- The dual-inventory issue still blocks lazy brk / sbrk safety guarantees and a unified cross-arch fault dispatch until the `UntypedReserved` invariant is enforced.
Required invariant
Every physical frame must be in exactly one of three states:
| State | Rule |
|---|---|
| `Free` | Only `pmm_alloc` may hand the frame out. |
| `UntypedReserved` | Only `UntypedMemory::retype` may hand the frame out. |
| Typed owner (`FrameOwner` variant) | Owned by a single capability / object. |
Transitions:
- Boot: all usable RAM starts `Free`; initial untyped creation moves ranges to `UntypedReserved`.
- `retype` on Frame/MO: `UntypedReserved` → typed owner (zero-fill stays).
- `untype` / object destruction: typed owner → `UntypedReserved` (returned to the parent untyped, never to `Free`).
- Top-level untyped revoke: the full range returns to `Free` in one step.
Per-frame `FrameOwner` tags alone are not enough — the bug is that the watermark allocator doesn’t look at them. The range-level invariant above is what `retype` must enforce.
Resolution path
- Add the `UntypedReserved` `FrameOwner` variant (already present on this branch) and tag every frame an untyped covers at creation time.
- Teach `UntypedMemory::retype` to refuse when the next watermark slice is not `UntypedReserved`.
- Teach `pmm_alloc` to assert the drawn frame is `Free` (debug builds) and to skip `UntypedReserved` (release builds).
- Remove the x86_64 `idt.rs` comment that disables the demand / stack-growth fast paths.
- Align aarch64’s exception dispatch order with x86_64.
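A minimal sketch of the invariant the resolution path enforces, with an assumed three-state `FrameOwner` enum (kernite's real type has more variants): `retype` refuses any watermark slice that is not entirely `UntypedReserved`, and `pmm_alloc` hands out only `Free` frames.

```rust
// Assumed three-state owner tag; illustrative, not kernite's real enum.
#[derive(Clone, Copy, PartialEq, Debug)]
enum FrameOwner {
    Free,
    UntypedReserved,
    Typed(u32), // owning capability / object id
}

// retype: refuse unless the whole watermark slice is UntypedReserved,
// so it can never zero-fill a frame the PMM side already handed out.
fn retype_slice(frames: &mut [FrameOwner], owner: u32) -> Result<(), &'static str> {
    if frames.iter().any(|f| *f != FrameOwner::UntypedReserved) {
        return Err("watermark slice not UntypedReserved");
    }
    for f in frames.iter_mut() {
        *f = FrameOwner::Typed(owner);
    }
    Ok(())
}

// pmm_alloc: skip anything that is not Free, UntypedReserved included,
// so the two allocators can no longer hand out the same frame.
fn pmm_alloc(frames: &mut [FrameOwner]) -> Option<usize> {
    let i = frames.iter().position(|f| *f == FrameOwner::Free)?;
    frames[i] = FrameOwner::Typed(0);
    Some(i)
}
```

With both checks in place, the corrupting interleaving (watermarked-but-Free frame handed out twice) is impossible by construction.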
Eager FPU lock-in (forward-looking)
- Status: Policy chosen. The lazy → eager transition completed on `feat/cpu-accounting-procfs`. This entry exists so future work does not reintroduce the race that motivated the switch.
Current state
Both architectures save and restore the full FPU / SIMD register file on every TCB switch:
- x86_64: `CR0.TS` is held at `0` for the kernel’s lifetime. `#NM` (vector 7) is configured as a panic regression guard — hitting it means someone re-introduced lazy switching.
- aarch64: `CPACR_EL1.FPEN` is held at `0b11`. `EC=0x07` traps are a panic regression guard.

`TCB.fpu_used` and the per-CPU `FPU_OWNER` tracking have been removed — the current thread is always the owner.
Why eager was chosen
The previous lazy path had a race between the `#NM` / FP-trap handler and a preemption that occurred between `clear_ts` → `xrstor` → `set_fpu_owner` → return:
- Thread A faults, handler clears TS, restores A’s state, about to return.
- Preemption: scheduler picks B, `save_on_switch` sets TS=1 again.
- A resumes. TS is now 1 but A believes the restore already completed. `#NM` fires at the same RIP. Repeat.
The symptom was RIP drift with apparently-unstable FPU ownership. Making the switch atomic with `set_ts` inside the `#NM` handler is structurally hard with the current dispatch shape.
Why this is "debt"
The eager path costs one XSAVE + one XRSTOR per context switch even when the outgoing thread did not touch the FPU. On CPUs that support XSAVEC / XSAVEOPT, the compressed save is cheap and the cost is small — but on very-hot IPC paths between FP-light threads, it is measurable.
Resolution path (if ever needed)
Not a priority. If lazy FPU becomes worth revisiting:
- The `#NM` handler and the scheduler’s `save_on_switch` must enter and leave `CR0.TS` atomically with respect to preemption. The easiest shape is a lock-free owner CAS rather than a sequence of separate writes.
- The aarch64 equivalent must do the same with `CPACR_EL1.FPEN`.
- The save-on-switch path must be idempotent so that a preemption mid-restore does not double-save an unrelated thread’s state.
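The lock-free owner CAS mentioned in the first point might look roughly like this — a sketch under assumed names (`NO_OWNER`, plain `usize` thread ids); the real kernel would pair the successful CAS with the actual `clear_ts` / `xrstor`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const NO_OWNER: usize = usize::MAX;
static FPU_OWNER: AtomicUsize = AtomicUsize::new(NO_OWNER);

// Claim ownership for `tid` in one compare-exchange. Returns true when this
// call performed the hand-off (and should therefore run the restore); the
// restore/ownership pair is published in a single atomic step, so a
// preemption can no longer observe it half-completed.
fn claim_fpu(tid: usize) -> bool {
    FPU_OWNER
        .compare_exchange(NO_OWNER, tid, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

// Release on context switch; a no-op when `tid` is not the current owner,
// which keeps the save-on-switch path idempotent under mid-restore
// preemption.
fn release_fpu(tid: usize) {
    let _ = FPU_OWNER.compare_exchange(tid, NO_OWNER, Ordering::AcqRel, Ordering::Acquire);
}
```

The design point is that ownership changes hands exactly once per successful CAS, so the `TS=1 / restore-believed-complete` divergence described above cannot arise.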
PerCpuData GS offset synchronization foot-gun
- Status: Structural foot-gun. Hand-maintained contract between `cpu.rs` and `syscall.S`.
Current state
The x86_64 `PerCpuData` struct is `#[repr(C)]` and accessed from assembly via hardcoded `%gs:OFFSET` literals. Rust knows the real offsets, but `syscall.S` embeds numeric offsets that are not derived from the struct at build time.
Current layout (post eager-FPU transition):
| Offset | Field | Assembly access |
|---|---|---|
| 0 | | |
| 8 | | |
| 16 | `saved_rsp` | `%gs:16` |
| 24 | | |
| 32 | | `%gs:32` |
| 40+ | `_reserved` | (unused) |
If a field is added, reordered, or resized without updating the assembly, the stack canary check reads the wrong byte and reports a false-positive `[STACK_CORRUPT]` panic. Silent in CI; loud at runtime.
Files that must stay in sync
- `kernite/src/arch/x86_64/cpu.rs` — struct layout + layout doc comment.
- `kernite/src/arch/x86_64/syscall.S` — header docstring + every `%gs:OFFSET` access.
- `kernite/src/init.rs`, `kernite/src/sched/thread.rs` — doc comments referencing `%gs:32`.
Resolution options
- Short term: Keep the hand-maintained contract but add a compile-time assertion (`const _: () = assert!(offset_of!(PerCpuData, saved_rsp) == 16);`) for every asm-visible field. Prevents silent drift.
- Long term: Replace hardcoded asm offsets with `global_asm!` expressions that reference `offset_of!` at compile time, so the assembly picks up layout changes automatically.
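The short-term guard could look like this. Only `saved_rsp` at offset 16 and an asm-visible slot at offset 32 are pinned down by the text; the other field names here are made up, and `offset_of!` requires Rust 1.77+.

```rust
use std::mem::offset_of;

// Illustrative layout: five asm-visible 8-byte slots, then _reserved.
// Field names other than saved_rsp are assumptions.
#[repr(C)]
struct PerCpuData {
    slot0: u64,
    slot1: u64,
    saved_rsp: u64, // asm reads %gs:16
    slot3: u64,
    slot4: u64, // doc comments reference %gs:32
    _reserved: [u64; 3],
}

// Layout drift now fails the build instead of tripping the runtime
// canary check with a false-positive [STACK_CORRUPT] panic.
const _: () = assert!(offset_of!(PerCpuData, saved_rsp) == 16);
const _: () = assert!(offset_of!(PerCpuData, slot4) == 32);
```

One such `const _` line per asm-visible field, placed next to the struct definition, keeps the contract checked on every compile.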
Rule for editing PerCpuData
- Add new fields at the end, inside `_reserved`. Never in the first five slots.
- If an asm-visible slot MUST change, grep (`rg "gs:"`) across the whole kernel tree and update every match.
- Compile-time assert the offsets of the first five fields in `cpu.rs` near the struct definition.
Page cache scaffolding without implementation
- Status: Placeholder. `OwnerTag::PageCache` and `FrameOwner::PageCache` exist; no code ever produces a frame with either tag.
Current state
- `read(2)` copies through a VFS-internal buffer that is not retained.
- `mmap(MAP_SHARED, file)` makes mmsrv allocate a fresh `MoKind::FileBacked` MO per caller. Two processes that mmap the same file get two distinct MOs and distinct frames.
- `/proc/meminfo` `Cached:` is therefore always `0`, as is the `pages_page_cache` accounting field.
- The three-state PMM invariant (see PMM ↔ untyped dual inventory) carries a permanently-zero `PageCache` term.
Why the scaffolding was kept
The long-term plan is to implement a proper VFS page cache, not to delete the tag. Removing it now would only be undone later; keeping it zero-valued lets future code slot into place without churning the FrameOwner enum.
Resolution outline
- VFS / mmsrv shared registry keyed by `(file_id, offset_range) → MO*` so that multiple opens of the same file share one MO.
- Per-page `FRAME_FLAG_DIRTY` tracking driven by PTE dirty bits, with `msync` / `fsync` / writeback cleaning them.
- Evict path: pick clean `OwnerTag::PageCache` frames (low `map_count`), drop the MO radix entry, rewrite every reverse-mapped PTE to `ENTRY_DEMAND`, return the frame to `Free`.
- Writeback queue: background flushes dirty pages; `fsync` / `msync` are synchronous flushes.
- OOM escalation: try evicting clean cache pages → retry, then fall back to victim kill.
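The shared-registry step of the outline can be sketched as a map from `(file_id, offset)` to an MO handle. All names here (`FileId`, `MoHandle`, `PageCacheRegistry`) are hypothetical stand-ins for whatever VFS / mmsrv would actually share:

```rust
use std::collections::HashMap;

type FileId = u64;
type MoHandle = u32;

#[derive(Default)]
struct PageCacheRegistry {
    // keyed by (file identity, page-aligned offset within the file)
    mos: HashMap<(FileId, u64), MoHandle>,
    next: MoHandle,
}

impl PageCacheRegistry {
    // Return the MO already backing this file range, or create one, so two
    // mmap(MAP_SHARED) callers on the same file end up sharing frames
    // instead of getting distinct MOs.
    fn lookup_or_create(&mut self, file: FileId, offset: u64) -> MoHandle {
        if let Some(&h) = self.mos.get(&(file, offset)) {
            return h;
        }
        self.next += 1;
        self.mos.insert((file, offset), self.next);
        self.next
    }
}
```

This is also where the dependency on an inode-level identity bites: the `FileId` key is only meaningful if two opens of the same file resolve to the same identity.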
Dependencies
- Demand paging must be active (it is, on this branch) so that re-loading an evicted page can fault in from the backing file.
- `MemoryObject.reverse_maps` needs per-page granularity, not per-VmArea granularity, for eviction to know which PTE to clear.
- VFS needs an inode-level identity so that dedup is well-defined.