Scheduler
Kernite uses an Earliest Deadline First (EDF) scheduler with per-thread budget enforcement. Each schedulable thread is bound to a SchedContext that defines its budget, period, deadline, and priority. The scheduler selects the thread with the earliest deadline from the local CPU’s ready queue.
Scheduler Structure
```rust
pub struct Scheduler {
    ready_heads: [*mut Tcb; MAX_CPUS],           // per-CPU ready queue heads
    current: [*mut Tcb; MAX_CPUS],               // per-CPU running thread
    idle: [*mut Tcb; MAX_CPUS],                  // per-CPU idle thread
    pending_enqueue: [AtomicPtr<Tcb>; MAX_CPUS], // deferred enqueue slot
    lock_states: [AtomicU8; MAX_CPUS],           // per-CPU queue locks
    context_switches: [u64; MAX_CPUS],           // per-CPU counters
    timer_ticks: [u64; MAX_CPUS],
    idle_ticks: [u64; MAX_CPUS],
    per_cpu_user_ticks: [u64; MAX_CPUS],         // CPU accounting (see below)
    per_cpu_system_ticks: [u64; MAX_CPUS],
    ipi_reschedules: [u64; MAX_CPUS],
    online_cpus: u32,
    rebalance_counter: u64,
}
```
Each CPU has its own ready queue, current thread pointer, idle thread, and deferred enqueue slot.
Per-CPU state is protected by lock_cpu(cpu), a per-CPU spinlock with TTAS and exponential backoff.
Ready Queue
Each CPU’s ready queue is a singly-linked list of TCBs sorted by deadline (earliest first).
The priority field on the TCB holds the effective EDF deadline (lower value = earlier deadline = higher urgency).
Insertion
insert_sorted() inserts a TCB into the correct position in the sorted list:
- If the queue is empty or the new thread’s deadline is earlier than the head’s, insert at the head.
- Otherwise, walk the list until finding the correct sorted position and insert there.
Insertion is O(N) in the number of ready threads on that CPU. In practice, ready queue depth is small (typically < 10 threads per CPU).
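The sorted insertion can be sketched as follows. This is a minimal model, not the kernel's implementation: the real `insert_sorted()` operates on an intrusive singly-linked list of `*mut Tcb`, whereas this sketch uses a `Vec` of plain structs so it is self-contained and safe.

```rust
// Model of a TCB carrying only the EDF comparison key.
#[derive(Debug, Clone, PartialEq)]
struct Thread {
    id: u32,
    deadline: u64, // effective EDF deadline; lower = earlier = more urgent
}

// Insert `t` so the queue stays sorted by deadline, earliest first.
// O(N) walk, mirroring the insert_sorted() behavior described above.
fn insert_sorted(queue: &mut Vec<Thread>, t: Thread) {
    let pos = queue
        .iter()
        .position(|q| t.deadline < q.deadline) // first strictly later entry
        .unwrap_or(queue.len());               // none later: append at tail
    queue.insert(pos, t);
}
```

Inserting equal deadlines after existing ones (the `<` rather than `<=`) keeps insertion FIFO among threads with the same deadline.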
CPU Selection
When a thread is enqueued, select_target_cpu() chooses which CPU’s ready queue to use:
- If the thread has a specific `cpu_affinity` (not `0xFFFF_FFFF`), always use that CPU.
- Otherwise, prefer `last_cpu` (cache-affinity heuristic).
- Fall back to the current CPU.
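The selection policy above can be sketched as a three-way fallback. This is a simplified model: the `ThreadPlacement` struct is invented for the example, and the real `select_target_cpu()` presumably also validates that the chosen CPU is online, which is elided here.

```rust
const NO_AFFINITY: u32 = 0xFFFF_FFFF; // sentinel from the text: "any CPU"

struct ThreadPlacement {
    cpu_affinity: u32, // NO_AFFINITY unless pinned
    last_cpu: u32,     // last CPU this thread ran on (cache warmth)
}

fn select_target_cpu(t: &ThreadPlacement, current_cpu: u32) -> u32 {
    if t.cpu_affinity != NO_AFFINITY {
        t.cpu_affinity // hard affinity always wins
    } else if t.last_cpu != NO_AFFINITY {
        t.last_cpu // cache-affinity heuristic
    } else {
        current_cpu // never scheduled before: use the enqueuing CPU
    }
}
```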
EDF Algorithm
reschedule() on the local CPU:
- Process the deferred `pending_enqueue` slot (drain threads waiting for context save).
- Dequeue the head of the ready queue (earliest deadline).
- Compare with the current thread:
  - If no ready thread exists, switch to the idle thread.
  - If the ready thread has an earlier deadline than the current thread, preempt: the current thread is re-enqueued via `pending_enqueue`, and the ready thread becomes the new current.
  - Otherwise, the current thread continues.
- The priority field is the comparison key. In EDF, this is the absolute deadline: a thread with priority = 100 has a later deadline than one with priority = 50.
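The core comparison reduces to a single `<` on the deadline. A sketch of the decision (idle and empty-queue cases included), assuming `None` models "no ready thread" and "CPU currently idle" respectively:

```rust
// Should the earliest-deadline ready thread preempt the current one?
fn should_preempt(ready_head_deadline: Option<u64>, current_deadline: Option<u64>) -> bool {
    match (ready_head_deadline, current_deadline) {
        (Some(_), None) => true,     // CPU was idle: always run the ready thread
        (Some(r), Some(c)) => r < c, // strictly earlier deadline wins
        (None, _) => false,          // nothing ready: current (or idle) continues
    }
}
```

The strict `<` means a ready thread with an equal deadline does not preempt, avoiding needless context switches between threads of the same urgency.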
Budget Management
Each thread’s SchedContext tracks CPU time:
| Field | Type | Description |
|---|---|---|
| `budget` | `u64` | Maximum CPU time per period (ticks). |
| `remaining` | `u64` | Ticks remaining in the current period. |
| `period` | `u64` | Replenishment interval (ticks). |
| `deadline` | `u64` | Absolute deadline for the current period. |
| `consumed` | `u64` | Cumulative ticks consumed across all periods. |
Timer Tick Handler
On each timer interrupt (timer_tick()):
- Increment the per-CPU `timer_ticks` counter.
- If the current thread is the idle thread, increment `idle_ticks` and return.
- Decrement the current thread’s `sc.remaining`.
- If `remaining` reaches zero:
  - Mark the SchedContext exhausted.
  - Schedule replenishment at `deadline + period`.
  - Trigger reschedule.
- Check the sleep queue: wake all threads whose `timer_wakeup_ns` has passed (see Sleep Queue).
- Periodically trigger load rebalancing (every N ticks).
| eoi() (end-of-interrupt) is sent before calling any function that might trigger a context switch. If EOI is deferred past a context switch, the interrupt controller blocks all further timer interrupts. |
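The budget step of the tick handler can be sketched as follows. This is a simplified model: the `exhausted` and `replenish_at` fields are assumed bookkeeping (the text describes marking the SchedContext exhausted and scheduling replenishment but does not name the fields), and the return value stands in for "trigger reschedule".

```rust
struct SchedContext {
    remaining: u64,    // ticks left in the current period
    deadline: u64,     // absolute deadline of the current period
    period: u64,       // replenishment interval
    exhausted: bool,   // assumed flag: budget used up
    replenish_at: u64, // assumed field: when `remaining` is refilled
}

// Charge one tick to the running thread. Returns true when the budget
// just ran out and a reschedule must be triggered.
fn charge_tick(sc: &mut SchedContext) -> bool {
    sc.remaining = sc.remaining.saturating_sub(1);
    if sc.remaining == 0 && !sc.exhausted {
        sc.exhausted = true;
        sc.replenish_at = sc.deadline + sc.period; // per the text above
        return true;
    }
    false
}
```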
CPU Accounting
Every timer tick classifies the interrupted context as user, kernel, or idle and increments the appropriate counters:
- If the interrupted CPU was running the idle thread, bump `idle_ticks[cpu]` and skip the per-thread update.
- If the interrupted context was EL0 / ring 3, bump `per_cpu_user_ticks[cpu]` and `sc.consumed`, and atomically bump `Tcb.user_ticks` on the current thread.
- If the interrupted context was EL1 / ring 0 (kernel), bump `per_cpu_system_ticks[cpu]` and `sc.consumed`, and atomically bump `Tcb.system_ticks` on the current thread.
The per-thread user_ticks / system_ticks counters are AtomicU64 so the readout path (the TCB_GET_CPU_TIMES invoke — label 0x80) can sample them without taking the scheduler lock. The per-CPU counters feed SysInfo / SysMemInfo for system-wide procfs views.
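A minimal sketch of the lock-free counter pair: the tick handler increments with relaxed atomics, and a reader samples without any scheduler lock. The struct and method names are invented for the example; only the `AtomicU64` counters and the lock-free readout are from the text.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct TickCounters {
    user_ticks: AtomicU64,
    system_ticks: AtomicU64,
}

impl TickCounters {
    // Called from the tick handler: classify and bump one counter.
    // Relaxed is sufficient because each counter is independent and
    // readers tolerate slightly stale values.
    fn charge(&self, from_user: bool) {
        if from_user {
            self.user_ticks.fetch_add(1, Ordering::Relaxed);
        } else {
            self.system_ticks.fetch_add(1, Ordering::Relaxed);
        }
    }

    // Readout path (e.g. a TCB_GET_CPU_TIMES-style query): no lock taken.
    fn sample(&self) -> (u64, u64) {
        (
            self.user_ticks.load(Ordering::Relaxed),
            self.system_ticks.load(Ordering::Relaxed),
        )
    }
}
```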
Preemption
A reschedule is triggered by:
- Timer tick: budget exhaustion or timed wakeup.
- IPC wake: a blocked thread is unblocked by a partner arriving.
- Yield: explicit `Yield` syscall (syscall 8).
- State change: thread block, unblock, priority change, affinity change.
- IPI: reschedule IPI from another CPU after remote enqueue.
Context Switch
The context switch sequence:
- The current thread’s state is saved (architecture-specific register save).
- If the outgoing thread is still runnable, it is not directly re-enqueued. Instead, it is placed in the `pending_enqueue` atomic slot for this CPU.
- The page table root is switched to the incoming thread’s VSpace.
- The incoming thread’s register state is restored.
- On the next entry to `reschedule()`, `process_pending_enqueue()` drains the deferred slot and inserts the outgoing thread into the ready queue.
The deferred enqueue prevents a race: without it, another CPU could dequeue and switch to the outgoing thread before its registers have been fully saved.
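The single-slot deferral can be sketched with an `AtomicPtr` swap. This is a model, not kernel code: a `Box<u32>` stands in for `*mut Tcb`, and the release/acquire pairing models "registers fully saved before the thread can be picked up elsewhere".

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct PendingSlot(AtomicPtr<u32>);

impl PendingSlot {
    fn new() -> Self {
        PendingSlot(AtomicPtr::new(ptr::null_mut()))
    }

    // Park the outgoing thread. Release ordering publishes the saved
    // register state before the pointer becomes visible.
    fn defer(&self, t: Box<u32>) {
        self.0.store(Box::into_raw(t), Ordering::Release);
    }

    // Drained on the next reschedule(). Acquire pairs with the Release
    // above, so whoever dequeues the thread sees its saved state.
    fn drain(&self) -> Option<Box<u32>> {
        let p = self.0.swap(ptr::null_mut(), Ordering::Acquire);
        if p.is_null() {
            None
        } else {
            Some(unsafe { Box::from_raw(p) })
        }
    }
}
```

The `swap` makes draining idempotent: two racing drains cannot both observe the same thread.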
Lock Protocol
Two distinct locks govern the context switch path:
- `SCHED_IPC_LOCK`: global IPC-path lock, acquired before any IPC/scheduler interaction. Protects the combined IPC operation and scheduler enqueue/context switch sequence.
- `lock_states[cpu]` (aka `sched.lock_cpu`): per-CPU ready-queue spinlock, acquired under `SCHED_IPC_LOCK`. Protects per-CPU queue manipulation.

Sequence:
- `SCHED_IPC_LOCK` is acquired before queue manipulation.
- `lock_states[cpu]` (`sched.lock_cpu`) is acquired under `SCHED_IPC_LOCK` for ready-queue insertion/removal.
- `do_context_switch` releases `SCHED_IPC_LOCK` before the actual `arch::context_switch()` call and reacquires it on resume.
- The `pending_enqueue` slot is drained after reacquiring.
SMP
Per-CPU Lock Ordering
When enqueuing a thread on a remote CPU’s queue, the scheduler must acquire two per-CPU locks. To prevent deadlock, locks are always acquired in CPU-ID order:
- If `target > local`: hold the local lock, acquire the target lock.
- If `target < local`: release the local lock, acquire the target lock, insert, release the target lock, reacquire the local lock.
IPI Reschedule
After enqueuing a thread on a remote CPU, the scheduler evaluates whether an IPI is needed:
- If the target CPU is idle: always send an IPI.
- If the target CPU is running a thread with a later deadline than the newly enqueued thread: send an IPI (preemption).
- If the target CPU’s current thread is in a switch-out path (state != Running): send an IPI to flush pending work.
The target CPU’s IPI handler calls reschedule().
The current[target] read is best-effort (no target lock held).
Stale reads cause either a spurious IPI (harmless) or a missed IPI (timer tick catches it within ~1 ms).
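The three conditions combine into one boolean. A sketch, assuming the caller passes a (possibly stale) snapshot of the target CPU's current thread; the parameter names and the `State` enum are simplifications of the TCB state machine.

```rust
#[derive(PartialEq)]
enum State {
    Running,
    Other, // any switch-out / transitional state
}

// Best-effort IPI decision after a remote enqueue. A stale snapshot
// yields at worst a spurious or missed IPI, as described above.
fn needs_ipi(
    target_is_idle: bool,
    current_deadline: u64,
    current_state: State,
    new_deadline: u64,
) -> bool {
    target_is_idle
        || new_deadline < current_deadline // new thread would preempt
        || current_state != State::Running // flush pending switch-out work
}
```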
Idle Thread
Each CPU has a dedicated idle thread. The idle loop performs three operations:
- Deferred free processing: advance the quiescent generation counter and free VSpace page table pages whose generation has been observed by all CPUs.
- Pending enqueue drain: process the `pending_enqueue` slot.
- Halt: `hlt` (x86_64) or `wfi` (aarch64) until the next interrupt.
| The idle thread must periodically re-enter scheduler lock processing. Pending VSpace deactivations require each CPU to observe an IPI-driven state change. Without idle re-entry, delete and syscall paths that depend on cross-CPU quiescence can hang indefinitely. |
Sleep Queue
The sleep queue is a global singly-linked list sorted by wakeup time (ascending).
It is protected by SLEEP_LOCK (independent of the per-CPU scheduler locks).
A NEXT_WAKEUP_NS atomic variable caches the head’s wakeup time for fast skip in timer_tick():
if now_ns < NEXT_WAKEUP_NS, no sleep queue processing is needed.
Operations
- `insert(tcb)`: insert into the sorted list by `timer_wakeup_ns`. O(N).
- `remove(tcb)`: remove a specific TCB. O(N) list walk. Used when a thread is woken by another event before its sleep expires.
- `check_wakeups(now_ns)`: pop all threads from the head whose `timer_wakeup_ns <= now_ns`. For each woken thread, clear `blocked_reason`, set state to `Ready`, and enqueue. Handles timed IPC cleanup: if the thread was in both an endpoint queue and the sleep queue, it is removed from the endpoint queue as well.
SchedContext Operations
| Label | Name | Description |
|---|---|---|
| … | … | Set budget, period, and priority. |
| … | … | Bind a SchedContext to a TCB. |
Error Conditions and Edge Cases
Bootstrap TCB
The bootstrap TCB represents kmain’s transient boot context.
It has no dedicated per-thread kernel stack and must never be scheduled again after the first context switch away from it.
If `enqueue_unlocked()` detects the bootstrap TCB, it sets the thread to Inactive and returns without enqueuing.
Enqueue Race: Thread Still Running on Another CPU
When a thread is woken (e.g., IPC partner arrives) but is still the architectural current[cpu] on another CPU (mid-switch-out), directly enqueuing it would allow a second CPU to dequeue and switch to it before its registers are saved.
The scheduler detects this via find_running_cpu() and routes the thread into that CPU’s pending_enqueue atomic slot.
An IPI nudges the target CPU to process the pending slot promptly.
IPI Read Race (Best-Effort)
The current[target] read in the IPI decision logic is best-effort — no target lock is held during the read.
Stale reads produce two outcomes:
- Spurious IPI: the target CPU finds nothing to preempt. Harmless (costs one interrupt).
- Missed IPI: the next timer tick (within ~10 ms) or periodic rebalance catches the enqueued thread.
Per-CPU Lock Ordering Inversion
Enqueuing on a remote CPU requires two per-CPU locks.
If target < local, the scheduler must release the local lock first, acquire target, insert, release target, reacquire local.
This avoids ABBA deadlock between two CPUs simultaneously cross-enqueuing.
Budget Exhaustion During IPC
If a thread’s budget expires while it is blocked in IPC (e.g., waiting in RecvBlocked), the thread remains blocked — budget exhaustion only preempts running threads.
The thread’s budget is replenished at its next deadline + period, and it becomes schedulable again when both unblocked and replenished.
Idle Thread Starvation of Deferred Free
If no timer interrupts reach the idle thread (e.g., all CPUs are busy with user threads), deferred VSpace free entries accumulate.
The periodic rebalance counter in timer_tick() ensures deferred free processing occurs even on busy CPUs, but with lower frequency than idle processing.
Related Pages
- Threads: TCB structure, thread states, priority inheritance
- Endpoints: IPC blocking and wake interaction with scheduler
- Design Patterns: timed operations pattern
- Architecture: context switch assembly, timer setup