# Threads
A thread is a schedulable execution context represented by a Thread Control Block (TCB). Each TCB is a kernel object created by retyping untyped memory. Threads are bound to a SchedContext for scheduling parameters and a VSpace for their address space.
## Thread Control Block
The TCB contains all per-thread kernel state:
| Field | Purpose |
|---|---|
| `header` | Object header (kernel object metadata). |
|  | Current execution state. |
| `priority` | Effective deadline (may be boosted by PIP). |
| `base_priority` | Original deadline (unaffected by inheritance). |
| `pip_donating_to` | TCB this thread is donating priority to. Refcount-managed via `set_pip_target`. |
|  | Page table root for this thread. |
|  | CSpace root CNode. |
|  | Virtual address of the thread’s IPC buffer page. |
|  | Kernel stack for syscall entry. |
|  | Pinned CPU. |
|  | Bound scheduling context. |
|  | Capability address of the fault endpoint. |
|  | Bound notification for combined IPC wait. |
|  | Userspace dispatcher for notification delivery during Call. |
| `reply_tcb` | Saved reply target (for the server-side reply capability). Refcount-managed. |
| `blocked_reason` | Why the thread is blocked. |
|  | Endpoint the thread is blocked on. |
|  | Saved FPU/SIMD register state (always restored on context switch: eager FPU). |
|  | Cumulative ticks spent executing in user mode; sampled via a TCB invocation. |
|  | Cumulative ticks spent executing in kernel mode on behalf of this thread; sampled via a TCB invocation. |
## Thread States

| State | Description |
|---|---|
| `Inactive` | Not schedulable. Newly created threads start here. |
| `Ready` | In the ready queue, waiting for the scheduler to select it. |
| `Running` | Currently executing on a CPU. |
| `Blocked` | Waiting for a specific event. The `blocked_reason` field identifies the cause. |
|  | Waiting on a notification (legacy alias for notification-specific blocking). |
## Blocked Reasons

When a thread enters the `Blocked` state, the `blocked_reason` field identifies the cause. The reasons cover:

- Waiting for a receiver on an endpoint.
- Waiting for a sender on an endpoint.
- A Call’s send phase waiting for a receiver.
- Waiting for the server’s reply after a Call.
- Waiting for a fault handler to reply.
- Waiting on a notification.
- Sleeping via `NanoSleep`.
- Blocked on a futex wait.
- Blocked on a futex wait with timeout.
- A timed send to an endpoint.
- A timed receive from an endpoint.
- Waiting for a VSpace teardown to complete.
## SchedContext

A SchedContext encapsulates scheduling parameters, decoupled from the thread identity:

```rust
pub struct SchedContext {
    pub header: KernelObject,
    pub sc_lock_state: AtomicU8, // per-SC spinlock
    pub budget: u64,             // max CPU time per period (ticks)
    pub remaining: u64,          // remaining budget in current period
    pub period: u64,             // replenishment interval
    pub deadline: u64,           // absolute deadline
    pub bound_tcb: *mut Tcb,     // bound thread
    pub consumed: u64,           // cumulative consumed time (ticks)
}
```

A TCB is bound to a SchedContext via `SC_BIND` (invoke label 0x31). The SchedContext determines when the thread runs; the TCB determines what it executes.
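The interaction between `budget`, `remaining`, `period`, and `consumed` can be modeled with a small sketch. This is not kernel code: `charge_tick` and `replenish` are hypothetical helper names, and the struct is trimmed to the accounting fields only.

```rust
// Minimal model of SchedContext budget accounting (sketch only; the real
// kernel manipulates these fields from the timer interrupt path).
pub struct SchedContext {
    pub budget: u64,    // max CPU time per period (ticks)
    pub remaining: u64, // budget left in the current period
    pub period: u64,    // replenishment interval
    pub deadline: u64,  // absolute deadline
    pub consumed: u64,  // cumulative consumed time (ticks)
}

impl SchedContext {
    /// Charge one tick of execution; returns false once the budget is exhausted.
    pub fn charge_tick(&mut self) -> bool {
        self.consumed += 1;
        self.remaining = self.remaining.saturating_sub(1);
        self.remaining > 0
    }

    /// Refill the budget for the next period and push the deadline out.
    pub fn replenish(&mut self, now: u64) {
        self.remaining = self.budget;
        self.deadline = now + self.period;
    }
}
```

When `charge_tick` returns false the thread is descheduled until the next replenishment, which is what makes the SchedContext (not the TCB) the unit of CPU-time policing.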
## Priority Inheritance Protocol (PIP)
When a high-priority thread blocks on a resource held by a lower-priority thread, the kernel temporarily boosts the holder’s priority to prevent priority inversion.
### Mechanism
1. Thread A (high deadline) sends to an endpoint where thread B (low deadline) is the expected receiver but is not yet waiting.
2. Thread A blocks. The kernel records `A.pip_donating_to = B` (via `set_pip_target`, which increments B’s TCB refcount).
3. Thread B’s effective priority (the `priority` field) is set to A’s deadline.
4. When B completes the operation and A is unblocked, B’s priority reverts to `base_priority`, and B is re-sorted in the ready queue (`resort_ready_thread`) so the scheduler sees the restored priority immediately. On SMP, the holder’s `tcb_lock` is acquired during reversion to prevent concurrent priority races.
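The boost and reversion steps above amount to the following priority arithmetic (a minimal sketch: `donate` and `revert` are illustrative names, and refcounting, locking, and ready-queue re-sorting are elided). Under EDF, a numerically smaller deadline means higher priority:

```rust
// Sketch of PIP boost and reversion using the TCB fields named above.
struct Tcb {
    priority: u64,      // effective deadline (lower = more urgent)
    base_priority: u64, // original deadline, untouched by inheritance
}

/// Donor A blocks on holder B: B inherits A's earlier deadline.
fn donate(donor: &Tcb, holder: &mut Tcb) {
    if donor.priority < holder.priority {
        holder.priority = donor.priority; // boost
    }
}

/// The blocking operation completed: the holder reverts to its own deadline.
fn revert(holder: &mut Tcb) {
    holder.priority = holder.base_priority;
}
```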
### Properties
- Donation is transitive: if B is itself blocked on C, the donation propagates.
- Only one donation per thread is tracked at a time.
- Reversion is automatic when the blocking reason is resolved, and includes a ready-queue re-sort so the scheduler sees the restored priority without waiting for the next tick.
- PIP applies to the thread’s TCB `priority` field, not to the SchedContext directly.
- Both `pip_donating_to` and `reply_tcb` are refcount-managed: setting either field increments the target TCB’s refcount, and clearing it decrements it. This prevents use-after-free if the target TCB is destroyed while the pointer is still live.
## FPU/SIMD State
FPU and SIMD registers use eager context switching: every TCB switch unconditionally saves the outgoing thread’s state and restores the incoming thread’s state, so no per-thread "dirty" flag is needed. The lazy #NM / CPACR trap paths used by previous versions were removed after they raced with the timer interrupt on both architectures.
- x86_64: `CR0.TS` is cleared at boot and held at `0` for the lifetime of the kernel, so the `#NM` vector is never taken for FPU faults. `fpu::switch()` calls `XSAVE` on the outgoing TCB and `XRSTOR` on the incoming TCB. Save area: 832 bytes (x87 + XSAVE header + AVX), 64-byte aligned.
- aarch64: `CPACR_EL1.FPEN` is programmed to `0b11` (no trap from EL0 or EL1). `fpu::switch()` calls `aarch64_fpsimd_save_state` / `aarch64_fpsimd_restore_state` unconditionally. Save area: 528 bytes (32×V128 + FPCR + FPSR + padding), 16-byte aligned.

`flush_current()` / `reload_current()` keep the live register file in sync when TCB state is mutated outside a context switch (e.g. `TCB_COPY_FPU`).
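The eager policy can be sketched as follows. Everything here is illustrative: `FpuArea` and `fpu_switch` stand in for the real `fpu::switch()`, and the 64-byte buffer stands in for the 832-byte (x86_64) or 528-byte (aarch64) save areas:

```rust
// Sketch of eager FPU context switching: every switch unconditionally
// saves the outgoing thread's state and restores the incoming thread's,
// so no per-thread "dirty" flag or trap handler is needed.
struct FpuArea {
    regs: [u8; 64], // placeholder for the real architecture save area
}

fn fpu_switch(live: &mut [u8; 64], outgoing: &mut FpuArea, incoming: &FpuArea) {
    outgoing.regs = *live; // unconditional save (stands in for XSAVE / fpsimd_save)
    *live = incoming.regs; // unconditional restore (stands in for XRSTOR / fpsimd_restore)
}
```

Because the save always happens, there is no window in which the live register file and the TCB save area disagree, which is exactly what removed the race with the timer interrupt that the lazy trap paths suffered from.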
## Sleep Queue
The sleep queue manages threads with timed operations:
- `NanoSleep` (syscall 13): the thread is enqueued with an absolute wakeup time.
- Timed IPC: the thread is in both the endpoint queue and the sleep queue.
- Timed futex: the thread is in both the futex hash table and the sleep queue.
On each timer tick, the scheduler checks the sleep queue head and wakes threads whose deadline has passed. Whichever event fires first (sleep expiry or partner arrival) wakes the thread; the other reference is cleaned up.
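A minimal model of this per-tick wake-up scan, assuming the queue is ordered by absolute wakeup tick (the `SleepQueue` type and its method names are illustrative, not the kernel’s):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Sketch of the sleep queue: a min-heap keyed on absolute wakeup tick,
// so the scheduler only ever inspects the head on each timer tick.
struct SleepQueue {
    q: BinaryHeap<Reverse<(u64, u32)>>, // (wakeup_tick, thread_id)
}

impl SleepQueue {
    fn new() -> Self {
        Self { q: BinaryHeap::new() }
    }

    fn enqueue(&mut self, wakeup: u64, tid: u32) {
        self.q.push(Reverse((wakeup, tid)));
    }

    /// Called on each timer tick: wake every thread whose deadline has passed,
    /// stopping at the first head entry that is still in the future.
    fn wake_expired(&mut self, now: u64) -> Vec<u32> {
        let mut woken = Vec::new();
        while let Some(&Reverse((when, tid))) = self.q.peek() {
            if when > now {
                break;
            }
            self.q.pop();
            woken.push(tid);
        }
        woken
    }
}
```

For timed IPC and timed futexes, the thread id sits in two structures at once; whichever side fires first wakes the thread, and the stale entry in the other structure is the reference that gets cleaned up.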
## TCB Operations

| Label | Name | Description |
|---|---|---|
|  |  | Set VSpace root, CSpace root, IPC buffer, and fault endpoint. |
|  |  | Transition from `Inactive` to a schedulable state. |
|  |  | Transition to `Inactive`. |
|  |  | Set CSpace root and VSpace root separately. |
|  |  | Write saved register state (IP, SP, arguments). |
|  |  | Set the IPC buffer virtual address. |
|  |  | Bind a notification for combined IPC wait. |
|  |  | Break the TCB↔Notification link. |
|  |  | Set the fault handler endpoint. |
|  | `TCB_COPY_FPU` | Copy FPU/SIMD state from another TCB. |
| 0x4D | `TCB_SET_TLS_BASE` | Set the TLS base register. |
|  |  | Set the notification dispatcher entry point. |
| 0x4F | `TCB_GET_SPACE_INFO` | Return VSpace, CSpace, and IPC buffer configuration. |
|  |  | Sample the target thread’s cumulative user-mode and kernel-mode tick counters. |
|  |  | Set the ABI thread pointer (used by fork/TLS handoff). |

Labels 0x44, 0x45, and 0x47 are not assigned; the numbering is sparse to leave room for future TCB operations.

`TCB_SET_TLS_BASE` (0x4D) and `TCB_GET_SPACE_INFO` (0x4F) serve as kernel-side hooks for fork support and TLS handoff in the substrate worker pool.
## Error Conditions and Edge Cases

### TCB Destruction While Thread Is Running
When the TCB’s refcount drops to zero (all capabilities deleted and all internal references such as `pip_donating_to` and `reply_tcb` cleared), the TCB is destroyed. Both `pip_donating_to` and `reply_tcb` increment the target TCB’s refcount; clearing them decrements it.
- If the thread is in any wait queue (endpoint, notification, futex, sleep), it is dequeued.
- If the thread is in the ready queue, it is removed.
- The bound notification and bound SchedContext are unlinked.
- If the thread is currently running on a CPU, it is marked `Inactive` and the scheduler switches away at the next reschedule.
### ThreadExit

When a thread calls syscall 28 (`ThreadExit`), the kernel:
1. Clears the thread’s own scheduler state (removing it from the ready/sleep queues if present).
2. Drops internal references: walks and releases the `reply_tcb` chain and unbinds the bound notification.
3. Signals the thread’s fault endpoint to notify any waiting fault handler.
4. Marks the TCB as exited.
The actual `destroy_object` call is deferred until `flush_deferred_current_release` runs (typically at the next reschedule boundary), ensuring the TCB is not freed while it is still referenced by the scheduler’s current-thread pointer.
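The deferral can be sketched like this (hypothetical `Cpu`, `release_or_defer`, and `flush_deferred` names; in the kernel it is the `destroy_object` call itself that is deferred):

```rust
// Sketch of deferred TCB release: a TCB whose last reference drops while
// it is still the CPU's current thread is parked on a per-CPU list and
// freed only after the scheduler has switched away from it.
struct Cpu {
    deferred: Vec<u32>, // TCB ids awaiting release
}

impl Cpu {
    /// Returns true if the TCB may be destroyed immediately.
    fn release_or_defer(&mut self, tcb: u32, is_current: bool) -> bool {
        if is_current {
            self.deferred.push(tcb); // still referenced by the current-thread pointer
            false
        } else {
            true
        }
    }

    /// Runs at the reschedule boundary, once `current` points elsewhere.
    fn flush_deferred(&mut self) -> Vec<u32> {
        std::mem::take(&mut self.deferred)
    }
}
```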
### PIP Chain Depth Limit

Priority inheritance propagates transitively through `pip_donating_to` chains. To prevent unbounded traversal, propagation is limited to `MAX_PIP_DEPTH` (8 levels). If a donation chain exceeds 8 links, the boost stops at the 8th thread; deeper holders retain their base priority. In practice, chains deeper than 2 or 3 links are extremely rare, since a chain requires A→B→C where each thread is blocked waiting for the next.
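The depth-limited traversal can be sketched as follows (illustrative names: TCBs are modeled as indices into a slice instead of refcounted pointers, and `propagate` stands in for the kernel’s donation path):

```rust
// Sketch of transitive, depth-limited priority donation.
const MAX_PIP_DEPTH: usize = 8;

struct Tcb {
    priority: u64,              // effective deadline (lower = more urgent)
    donating_to: Option<usize>, // index of the holder this thread donates to
}

fn propagate(tcbs: &mut [Tcb], donor: usize) {
    let mut prio = tcbs[donor].priority;
    let mut cur = tcbs[donor].donating_to;
    for _ in 0..MAX_PIP_DEPTH {
        let Some(i) = cur else { break };
        if prio < tcbs[i].priority {
            tcbs[i].priority = prio; // boost this holder
        }
        prio = tcbs[i].priority;
        cur = tcbs[i].donating_to;
    }
    // Holders more than MAX_PIP_DEPTH links away keep their current priority.
}
```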
### PIP With Multiple Donors

The current PIP implementation tracks `pip_donation_count` (the number of threads donating to this holder) but applies only the most recent donor’s priority. When `pip_donation_count` drops to 0, the holder’s priority reverts to `base_priority`. If two donors with different priorities donate simultaneously, only the last donor’s priority takes effect; the correct value (the minimum, i.e. earliest, deadline across all donors) is approximated but not exactly tracked. This is a known simplification; seL4 has the same limitation.
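The donor-counting behavior, including the last-donor-wins approximation, can be sketched as follows (illustrative `Holder` type; only the counting logic is modeled):

```rust
// Sketch of the multiple-donor approximation: a donor count plus the most
// recent donor's priority. The exact minimum over all active donors is
// not retained — the documented simplification.
struct Holder {
    priority: u64,
    base_priority: u64,
    pip_donation_count: u32,
}

impl Holder {
    fn add_donor(&mut self, donor_priority: u64) {
        self.pip_donation_count += 1;
        self.priority = donor_priority; // last donor wins, not the minimum
    }

    fn remove_donor(&mut self) {
        self.pip_donation_count -= 1;
        if self.pip_donation_count == 0 {
            self.priority = self.base_priority; // full reversion
        }
    }
}
```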
### FPU State On Thread Destruction

The `XSaveArea` is embedded in the TCB struct, so the FPU save area is reclaimed together with the TCB and no separate allocation has to be freed. Because the switch is eager, the outgoing save happens before the TCB is ever marked inactive, so no in-flight FPU state can outlive the object.
## Related Pages

- Scheduler: EDF algorithm, budget management, SMP
- Endpoints: IPC blocking states
- Notifications: bound notifications and the notification dispatcher
- Design Patterns: fork and capability delegation patterns
- Object Model: TCB and SchedContext as kernel object types