Threads

A thread is a schedulable execution context represented by a Thread Control Block (TCB). Each TCB is a kernel object created by retyping untyped memory. Threads are bound to a SchedContext for scheduling parameters and a VSpace for their address space.

Thread Control Block

The TCB contains all per-thread kernel state:

| Field | Type | Purpose |
| --- | --- | --- |
| header | KernelObject | Object header (type = Tcb, refcount). |
| state | ThreadState | Current execution state. |
| priority | u64 | Effective deadline (may be boosted by PIP). |
| base_priority | u64 | Original deadline (unaffected by inheritance). |
| pip_donating_to | *mut Tcb | TCB this thread is donating priority to. Refcount-managed via the set_pip_target / clear_pip_target helpers; the raw pointer is never assigned directly. |
| vspace_root | *mut VSpace | Page table root for this thread. |
| cspace_root | *mut CNode | CSpace root CNode. |
| ipc_buffer | u64 | Virtual address of the thread's IPC buffer page. |
| kernel_stack | u64 | Kernel stack for syscall entry. |
| cpu_affinity | u32 | Pinned CPU (0xFFFF_FFFF = any). |
| sc | *mut SchedContext | Bound scheduling context. |
| fault_endpoint | u64 | Capability address of the fault endpoint. |
| bound_notification | *mut Notification | Bound notification for combined IPC wait. |
| notification_dispatcher | u64 | Userspace dispatcher for notification delivery during Call. |
| reply_tcb | *mut Tcb | Saved reply target (for the server-side reply capability). Refcount-managed via the set_reply_tcb / clear_reply_tcb + release_tcb_ref helpers; the raw pointer is never assigned directly. |
| blocked_reason | Option<BlockedReason> | Why the thread is blocked. |
| blocked_endpoint | *mut u8 | Endpoint the thread is blocked on. |
| fpu_state | XSaveArea | Saved FPU/SIMD register state (always restored on context switch; eager FPU). |
| user_ticks | AtomicU64 | Cumulative ticks spent executing in user mode; exported via TCB_GET_CPU_TIMES. |
| system_ticks | AtomicU64 | Cumulative ticks spent executing in kernel mode on behalf of this thread; exported via TCB_GET_CPU_TIMES. |

Thread States

| State | Description |
| --- | --- |
| Inactive | Not schedulable. Newly created threads start here. |
| Ready | In the ready queue, waiting for the scheduler to select it. |
| Running | Currently executing on a CPU. |
| Blocked | Waiting for a specific event. The blocked_reason field identifies the cause. |
| Waiting | Waiting on a notification (legacy alias for notification-specific blocking). |

Blocked Reasons

When a thread enters Blocked state, the blocked_reason identifies the cause:

| Reason | Cause |
| --- | --- |
| SendBlocked | Waiting for a receiver on an endpoint. |
| RecvBlocked | Waiting for a sender on an endpoint. |
| CallSendBlocked | Call's send phase is waiting for a receiver. |
| ReplyWait | Waiting for the server's reply after a Call. |
| FaultBlocked | Waiting for a fault handler to reply. |
| NotificationWait | Waiting on a notification. |
| TimerBlocked | Sleeping via NanoSleep. |
| FutexBlocked | Blocked on a futex wait. |
| FutexTimedBlocked | Blocked on a futex wait with a timeout. |
| SendTimedBlocked | Timed send to an endpoint. |
| RecvTimedBlocked | Timed receive from an endpoint. |
| VSpaceWait | Waiting for a VSpace teardown to complete. |

SchedContext

A SchedContext encapsulates scheduling parameters, decoupled from the thread identity:

pub struct SchedContext {
    pub header: KernelObject,
    pub sc_lock_state: AtomicU8,  // per-SC spinlock
    pub budget: u64,              // max CPU time per period (ticks)
    pub remaining: u64,           // remaining budget in current period
    pub period: u64,              // replenishment interval
    pub deadline: u64,            // absolute deadline
    pub bound_tcb: *mut Tcb,      // bound thread
    pub consumed: u64,            // cumulative consumed time (ticks)
}

A TCB is bound to a SchedContext via SC_BIND (invoke label 0x31). The SchedContext determines when the thread runs; the TCB determines what it executes.
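As a rough illustration of how these fields interact (a sketch, not the kernel's actual code: header, sc_lock_state, and bound_tcb are omitted, and the replenishment policy shown is an assumption), a SchedContext can be charged for CPU time and refilled each period like this:

```rust
// Illustrative model of SchedContext budget accounting.
struct SchedContext {
    budget: u64,    // max CPU time per period (ticks)
    remaining: u64, // remaining budget in the current period
    period: u64,    // replenishment interval
    deadline: u64,  // absolute deadline
    consumed: u64,  // cumulative consumed time (ticks)
}

impl SchedContext {
    /// Charge `ticks` of CPU time; returns true when the budget is exhausted.
    fn charge(&mut self, ticks: u64) -> bool {
        self.remaining = self.remaining.saturating_sub(ticks);
        self.consumed += ticks;
        self.remaining == 0
    }

    /// Refill the budget and push the deadline out by one period.
    fn replenish(&mut self, now: u64) {
        self.remaining = self.budget;
        self.deadline = now + self.period;
    }
}
```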

Priority Inheritance Protocol (PIP)

When a high-priority thread blocks on a resource held by a lower-priority thread, the kernel temporarily boosts the holder’s priority to prevent priority inversion.

Mechanism

  1. Thread A (high priority, i.e. an earlier deadline) sends to an endpoint where thread B (low priority, a later deadline) is the expected receiver but is not yet waiting.

  2. Thread A blocks. The kernel records A.pip_donating_to = B (via set_pip_target, which increments B’s TCB refcount).

  3. Thread B’s effective priority (priority field) is set to A’s deadline.

  4. When B completes the operation and A is unblocked, B’s priority reverts to base_priority, and B is re-sorted in the ready queue (resort_ready_thread) so the scheduler sees the restored priority immediately. On SMP, the holder’s tcb_lock is acquired during reversion to prevent concurrent priority races.
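The boost-and-revert bookkeeping above can be sketched as follows. This is a simplified model: slice indices stand in for TCB pointers, the set_pip_target / clear_pip_target refcounting is elided, and there is no ready-queue re-sort or locking. Under EDF a smaller deadline is more urgent, so boosting means adopting the donor's earlier deadline.

```rust
// Simplified PIP donation model (indices instead of *mut Tcb).
struct Tcb {
    priority: u64,                  // effective deadline
    base_priority: u64,             // original deadline
    pip_donating_to: Option<usize>, // donation target
}

/// `donor` blocks on a resource held by `holder`: record the donation
/// and boost the holder to the donor's deadline if it is earlier.
fn donate(tcbs: &mut [Tcb], donor: usize, holder: usize) {
    tcbs[donor].pip_donating_to = Some(holder);
    let d = tcbs[donor].priority;
    if d < tcbs[holder].priority {
        tcbs[holder].priority = d;
    }
}

/// The holder completes the operation: clear the link and revert.
fn revert(tcbs: &mut [Tcb], donor: usize, holder: usize) {
    tcbs[donor].pip_donating_to = None;
    tcbs[holder].priority = tcbs[holder].base_priority;
}
```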

Properties

  • Donation is transitive: if B is itself blocked on C, the donation propagates.

  • Only one donation per thread is tracked at a time.

  • Reversion is automatic when the blocking reason is resolved, and includes a ready-queue re-sort so the scheduler sees the restored priority without waiting for the next tick.

  • PIP applies to the thread’s TCB priority field, not the SchedContext directly.

  • Both pip_donating_to and reply_tcb are refcount-managed: setting either field increments the target TCB’s refcount, and clearing it decrements. This prevents use-after-free if the target TCB is destroyed while the pointer is still live.

FPU/SIMD State

FPU and SIMD registers use eager context switching: every TCB switch unconditionally saves the outgoing thread’s state and restores the incoming thread’s state, so no per-thread "dirty" flag is needed. The lazy #NM / CPACR trap paths used by previous versions were removed after they raced with the timer interrupt on both architectures.

x86_64

CR0.TS is cleared at boot and held at 0 for the lifetime of the kernel — the #NM vector is never taken for FPU faults. fpu::switch() calls XSAVE on the outgoing TCB and XRSTOR on the incoming TCB. Save area: 832 bytes (x87 + XSAVE header + AVX), 64-byte aligned.

aarch64

CPACR_EL1.FPEN is programmed to 0b11 (no trap from EL0 or EL1). fpu::switch() calls aarch64_fpsimd_save_state / aarch64_fpsimd_restore_state unconditionally. Save area: 528 bytes (32×V128 + FPCR + FPSR + padding), 16-byte aligned.

flush_current() / reload_current() keep the live register file in sync when TCB state is mutated outside a context switch (e.g. TCB_COPY_FPU).
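The eager policy can be illustrated with a toy model: the real save areas are XSaveArea buffers and the save/restore steps are XSAVE/XRSTOR (or the aarch64 equivalents), but a single u64 stands in for the register file here. The point is that every switch saves the outgoing state and restores the incoming state, so no dirty flag is needed.

```rust
use std::collections::HashMap;

// Toy model of eager FPU switching.
struct Cpu {
    live_fpu: u64,              // stands in for the live register file
    saved: HashMap<usize, u64>, // per-thread save areas (fpu_state)
}

impl Cpu {
    /// Unconditional save of the outgoing thread and restore of the
    /// incoming thread (models fpu::switch()).
    fn switch(&mut self, from: usize, to: usize) {
        self.saved.insert(from, self.live_fpu);             // "XSAVE"
        self.live_fpu = *self.saved.get(&to).unwrap_or(&0); // "XRSTOR"
    }
}
```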

Sleep Queue

The sleep queue manages threads with timed operations:

  • NanoSleep (syscall 13): the thread is enqueued with an absolute wakeup time.

  • Timed IPC: the thread is in both the endpoint queue and the sleep queue.

  • Timed futex: the thread is in both the futex hash table and the sleep queue.

On each timer tick, the scheduler checks the sleep queue head and wakes threads whose deadline has passed. Whichever event fires first (sleep expiry or partner arrival) wakes the thread; the other reference is cleaned up.
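A minimal sketch of this tick-time check, using a min-heap keyed on absolute wakeup time (illustrative only; the kernel's actual queue structure is not specified here):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Sleep queue ordered by absolute wakeup time (earliest first).
struct SleepQueue {
    heap: BinaryHeap<Reverse<(u64, usize)>>, // (wake_at, thread id)
}

impl SleepQueue {
    fn new() -> Self {
        Self { heap: BinaryHeap::new() }
    }

    fn enqueue(&mut self, wake_at: u64, tid: usize) {
        self.heap.push(Reverse((wake_at, tid)));
    }

    /// Called on each timer tick: wake every thread whose deadline
    /// has passed, checking only the queue head each iteration.
    fn on_tick(&mut self, now: u64) -> Vec<usize> {
        let mut woken = Vec::new();
        while let Some(&Reverse((t, tid))) = self.heap.peek() {
            if t > now {
                break; // head not yet due; nothing later is due either
            }
            self.heap.pop();
            woken.push(tid);
        }
        woken
    }
}
```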

TCB Operations

| Label | Name | Description |
| --- | --- | --- |
| 0x40 | TCB_CONFIGURE | Set VSpace root, CSpace root, IPC buffer, fault endpoint. |
| 0x41 | TCB_RESUME | Transition from Inactive to Ready. |
| 0x42 | TCB_SUSPEND | Transition to Inactive. |
| 0x43 | TCB_SET_SPACE | Set CSpace root and VSpace root separately. |
| 0x46 | TCB_WRITE_REGISTERS | Write saved register state (IP, SP, arguments). |
| 0x48 | TCB_SET_IPC_BUFFER | Set the IPC buffer virtual address. |
| 0x49 | TCB_BIND_NOTIFICATION | Bind a notification for combined IPC wait. |
| 0x4A | TCB_UNBIND_NOTIFICATION | Break the TCB↔Notification link. |
| 0x4B | TCB_SET_FAULT_HANDLER | Set the fault handler endpoint. |
| 0x4C | TCB_COPY_FPU | Copy FPU/SIMD state from another TCB. |
| 0x4D | TCB_SET_TLS_BASE | Set the TLS base register (FS on x86_64, TPIDR_EL0 on aarch64). |
| 0x4E | TCB_SET_NOTIFICATION_DISPATCHER | Set the notification dispatcher entry point. |
| 0x4F | TCB_GET_SPACE_INFO | Return the VSpace, CSpace, and IPC buffer configuration. |
| 0x80 | TCB_GET_CPU_TIMES | Sample the target thread's cumulative user_ticks and system_ticks counters (assigned outside the 0x4X block because that block is saturated). |
| 0x83 | TCB_SET_ABI_TP | Set the ABI thread pointer (used by fork/TLS handoff). |

Labels 0x44, 0x45, 0x47 are not assigned — the numbering is sparse to leave room for future TCB operations.
TCB_SET_TLS_BASE (0x4D) and TCB_GET_SPACE_INFO (0x4F) serve as kernel-side hooks for fork support and TLS handoff in the substrate worker pool.

Error Conditions and Edge Cases

TCB Destruction While Thread Is Running

When the TCB’s refcount drops to zero (all capabilities deleted and all internal references such as pip_donating_to and reply_tcb cleared), the TCB is destroyed. Both pip_donating_to and reply_tcb increment the target TCB’s refcount; clearing them decrements it.

  • If the thread is in any wait queue (endpoint, notification, futex, sleep), it is dequeued.

  • If the thread is in the ready queue, it is removed.

  • Bound notification and bound SchedContext are unlinked.

  • If the thread is currently running on a CPU, it is marked Inactive and the scheduler switches away at the next reschedule.

ThreadExit

When a thread calls syscall 28 (ThreadExit), the kernel:

  1. Clears the thread’s own scheduler state (removes it from ready/sleep queues if present).

  2. Drops internal references: walks and releases the reply_tcb chain, unbinds the bound notification.

  3. Signals the thread’s fault endpoint to notify any waiting fault handler.

  4. Marks the TCB as exited.

The actual destroy_object call is deferred until flush_deferred_current_release runs (typically at the next reschedule boundary), ensuring the TCB is not freed while it is still referenced by the scheduler’s current-thread pointer.
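The deferral can be modeled as follows (illustrative only: thread IDs stand in for TCB pointers, and destroy_object is reduced to recording the ID). The key invariant is that the exiting thread's TCB is never freed while the scheduler's current-thread pointer still references it.

```rust
// Toy model of deferred TCB release on ThreadExit.
struct Kernel {
    current: Option<usize>, // scheduler's current-thread pointer
    deferred: Vec<usize>,   // TCBs marked exited, awaiting release
    destroyed: Vec<usize>,  // stands in for destroy_object calls
}

impl Kernel {
    /// ThreadExit: mark the current TCB exited but do not free it yet.
    fn thread_exit(&mut self) {
        if let Some(tid) = self.current {
            self.deferred.push(tid);
        }
    }

    /// Reschedule boundary: switch away first, then flush deferred
    /// releases (models flush_deferred_current_release).
    fn reschedule(&mut self, next: Option<usize>) {
        self.current = next;
        for tid in self.deferred.drain(..) {
            self.destroyed.push(tid);
        }
    }
}
```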

PIP Chain Depth Limit

Priority inheritance propagates transitively through pip_donating_to chains. To prevent unbounded traversal, propagation is limited to MAX_PIP_DEPTH (8 levels). If a donation chain exceeds 8 links, the boost stops at the 8th thread — deeper holders retain their base priority.

In practice, chains deeper than two or three links are extremely rare (a chain requires A→B→C where each thread is blocked waiting for the next).
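The bounded chain walk can be sketched like this, with effective deadlines in a slice and pip_donating_to modeled as Option indices (a simplified model, not the kernel's actual traversal code):

```rust
const MAX_PIP_DEPTH: usize = 8;

/// Walk the donation chain from `donor`, boosting each holder to the
/// most urgent (smallest) deadline seen so far, stopping after
/// MAX_PIP_DEPTH links so traversal is bounded.
fn propagate_boost(prio: &mut [u64], next: &[Option<usize>], donor: usize) {
    let mut deadline = prio[donor];
    let mut cur = next[donor];
    for _ in 0..MAX_PIP_DEPTH {
        match cur {
            Some(h) => {
                if deadline < prio[h] {
                    prio[h] = deadline; // boost this holder
                }
                deadline = prio[h];
                cur = next[h];
            }
            None => break, // end of chain
        }
    }
}
```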

PIP With Multiple Donors

The current PIP implementation tracks pip_donation_count (number of threads donating to this holder) but only applies the most recent donor’s priority. When pip_donation_count drops to 0, the holder’s priority reverts to base_priority.

This means if two donors with different priorities donate simultaneously, only the last donor’s priority is applied. The correct priority (minimum of all donors) is approximated but not exactly tracked. This is a known simplification — seL4 has the same limitation.
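The donation-count behavior can be modeled like this (illustrative; only the count and the effective deadline are shown, and the "last donor wins" simplification described above is reproduced deliberately):

```rust
// Toy model of a PIP holder with multiple donors.
struct Holder {
    base: u64,      // base_priority (original deadline)
    effective: u64, // priority (effective deadline)
    donors: u32,    // pip_donation_count
}

impl Holder {
    /// A new donor arrives: only its deadline is applied, even if an
    /// earlier donor was more urgent (known simplification).
    fn donate(&mut self, donor_deadline: u64) {
        self.donors += 1;
        self.effective = donor_deadline;
    }

    /// A donor leaves: revert only when the count reaches zero.
    fn retract(&mut self) {
        self.donors -= 1;
        if self.donors == 0 {
            self.effective = self.base;
        }
    }
}
```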

FPU State On Thread Destruction

The XSaveArea is embedded in the TCB struct, so the FPU save area is reclaimed together with the TCB and no separate allocation has to be freed. Because the switch is eager, the outgoing save happens before the TCB is ever marked inactive, so no in-flight FPU state can outlive the object.

SchedContext Without TCB

A SchedContext can exist without a bound TCB. It has no effect until bound via SC_BIND. Destroying an unbound SchedContext is a no-op (cleanup finds bound_tcb = null).

TCB Without SchedContext

A TCB without a bound SchedContext cannot be scheduled. TCB_RESUME on such a thread moves it to Ready state, but the scheduler skips threads without a valid SC during dequeue.

See Also

  • Scheduler — EDF algorithm, budget management, SMP

  • Endpoints — IPC blocking states

  • Notifications — bound notifications and notification dispatcher

  • Design Patterns — fork and capability delegation patterns

  • Object Model — TCB and SchedContext as kernel object types