Scheduler

Kernite uses an Earliest Deadline First (EDF) scheduler with per-thread budget enforcement. Each schedulable thread is bound to a SchedContext that defines its budget, period, deadline, and priority. The scheduler selects the thread with the earliest deadline from the local CPU’s ready queue.

Scheduler Structure

pub struct Scheduler {
    ready_heads: [*mut Tcb; MAX_CPUS],            // per-CPU ready queue heads
    current: [*mut Tcb; MAX_CPUS],                // per-CPU running thread
    idle: [*mut Tcb; MAX_CPUS],                   // per-CPU idle thread
    pending_enqueue: [AtomicPtr<Tcb>; MAX_CPUS],  // deferred enqueue slot
    lock_states: [AtomicU8; MAX_CPUS],            // per-CPU queue locks
    context_switches: [u64; MAX_CPUS],            // per-CPU counters
    timer_ticks: [u64; MAX_CPUS],
    idle_ticks: [u64; MAX_CPUS],
    per_cpu_user_ticks: [u64; MAX_CPUS],          // CPU accounting (see below)
    per_cpu_system_ticks: [u64; MAX_CPUS],
    ipi_reschedules: [u64; MAX_CPUS],
    online_cpus: u32,
    rebalance_counter: u64,
}

Each CPU has its own ready queue, current thread pointer, idle thread, and deferred enqueue slot. Per-CPU state is protected by lock_cpu(cpu), a per-CPU spinlock with TTAS and exponential backoff.
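The per-CPU lock can be sketched as a minimal TTAS spinlock with exponential backoff. This is a hedged model of lock_cpu() — the unlock_cpu name and the backoff cap are assumptions, and the real lock presumably also handles interrupt masking:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical TTAS (test-and-test-and-set) spin with exponential backoff.
fn lock_cpu(state: &AtomicU8) {
    let mut backoff: u32 = 1;
    loop {
        // Test with a plain load first so contended waiters spin on a shared
        // cache line instead of issuing RMW traffic on every iteration.
        if state.load(Ordering::Relaxed) == 0
            && state
                .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
        {
            return;
        }
        for _ in 0..backoff {
            std::hint::spin_loop();
        }
        backoff = (backoff * 2).min(1 << 10); // cap the backoff window
    }
}

fn unlock_cpu(state: &AtomicU8) {
    state.store(0, Ordering::Release);
}
```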

Ready Queue

Each CPU’s ready queue is a singly-linked list of TCBs sorted by deadline (earliest first). The priority field on the TCB holds the effective EDF deadline (lower value = earlier deadline = higher urgency).

Insertion

insert_sorted() inserts a TCB into the correct position in the sorted list:

  1. If the queue is empty or the new thread’s deadline is earlier than the head’s, insert at the head.

  2. Otherwise, walk the list until finding the correct sorted position and insert there.

Insertion is O(N) in the number of ready threads on that CPU. In practice, ready queue depth is small (typically < 10 threads per CPU).
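A safe model of the insertion logic (the kernel walks an intrusive *mut Tcb list; here a Vec of deadlines stands in for the per-CPU queue):

```rust
// Hypothetical model of insert_sorted(): keep the ready queue ordered by
// ascending deadline; ties go after existing peers (FIFO among equals).
fn insert_sorted(queue: &mut Vec<u64>, deadline: u64) {
    let pos = queue
        .iter()
        .position(|&d| d > deadline) // first strictly later deadline
        .unwrap_or(queue.len());     // none later: append at the tail
    queue.insert(pos, deadline);
}
```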

Dequeue

dequeue_for_cpu_unlocked() pops the head of the queue — O(1). The head always has the earliest deadline.

Removal

remove_from_ready_queue_unlocked() removes a specific TCB from the queue. Uses an O(1) guard via the ready_queued flag before doing the O(N) list walk.

CPU Selection

When a thread is enqueued, select_target_cpu() chooses which CPU’s ready queue to use:

  1. If the thread has a specific cpu_affinity (not 0xFFFF_FFFF), always use that CPU.

  2. Otherwise, prefer last_cpu (cache affinity heuristic).

  3. Fall back to the current CPU.
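The selection order above can be sketched as a pure function; the Option-wrapped last_cpu and the online-range check are modeling assumptions, not confirmed kernel details:

```rust
const NO_AFFINITY: u32 = 0xFFFF_FFFF;

// Hypothetical sketch of select_target_cpu(): hard affinity wins, then the
// cache-affinity heuristic, then the local CPU.
fn select_target_cpu(
    cpu_affinity: u32,
    last_cpu: Option<u32>,
    current_cpu: u32,
    online_cpus: u32,
) -> u32 {
    if cpu_affinity != NO_AFFINITY {
        return cpu_affinity; // 1. pinned: always honor affinity
    }
    if let Some(last) = last_cpu {
        if last < online_cpus {
            return last; // 2. prefer the CPU whose caches are still warm
        }
    }
    current_cpu // 3. fall back to the enqueuing CPU
}
```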

EDF Algorithm

reschedule() on the local CPU:

  1. Process the deferred pending_enqueue slot (drain threads waiting for context save).

  2. Dequeue the head of the ready queue (earliest deadline).

  3. Compare with the current thread:

    • If no ready thread exists, switch to the idle thread.

    • If the ready thread has an earlier deadline than the current thread, preempt: the current thread is re-enqueued via pending_enqueue, and the ready thread becomes the new current.

    • Otherwise, the current thread continues.

The priority field is the comparison key. In EDF, this is the absolute deadline — a thread with priority = 100 has a later deadline than one with priority = 50.
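The three-way decision in steps 2–3 can be modeled as a pure function over deadlines (None meaning "no thread"); an illustrative sketch, not the kernel's exact control flow:

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    SwitchToIdle, // no ready thread and nothing runnable
    RunReady,     // ready head wins (preempt current if one was running)
    KeepCurrent,  // current thread's deadline is no later: keep running it
}

// Hypothetical model of the reschedule() comparison; lower value = earlier
// absolute deadline = more urgent.
fn edf_decide(ready_head: Option<u64>, current: Option<u64>) -> Decision {
    match (ready_head, current) {
        (None, None) => Decision::SwitchToIdle,
        (None, Some(_)) => Decision::KeepCurrent,
        (Some(_), None) => Decision::RunReady,
        (Some(r), Some(c)) if r < c => Decision::RunReady,
        _ => Decision::KeepCurrent,
    }
}
```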

Budget Management

Each thread’s SchedContext tracks CPU time:

Field      Type  Description
budget     u64   Maximum CPU time per period (ticks).
remaining  u64   Ticks remaining in the current period.
period     u64   Replenishment interval (ticks).
deadline   u64   Absolute deadline for the current period.
consumed   u64   Cumulative ticks consumed across all periods. Used to feed back into /proc-style CPU accounting and to diagnose runaway budget usage.
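The fields above might map onto a Rust definition like this — a hypothetical sketch; the kernel's actual SchedContext may carry additional state:

```rust
// Hypothetical layout of SchedContext, mirroring the field table above.
pub struct SchedContext {
    pub budget: u64,    // maximum CPU time per period (ticks)
    pub remaining: u64, // ticks remaining in the current period
    pub period: u64,    // replenishment interval (ticks)
    pub deadline: u64,  // absolute deadline for the current period
    pub consumed: u64,  // cumulative ticks consumed across all periods
}
```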

Timer Tick Handler

On each timer interrupt (timer_tick()):

  1. Increment the per-CPU timer_ticks counter.

  2. If the current thread is the idle thread, increment idle_ticks and return.

  3. Decrement the current thread’s sc.remaining.

  4. If remaining reaches zero:

    1. Mark the SchedContext exhausted.

    2. Schedule replenishment at deadline + period.

    3. Trigger reschedule.

  5. Check the sleep queue: wake all threads whose timer_wakeup_ns has passed (see Sleep Queue).

  6. Periodically trigger load rebalancing (every N ticks).

eoi() (end-of-interrupt) is sent before calling any function that might trigger a context switch: if the EOI were deferred past a context switch, the pending interrupt would never be acknowledged, and the interrupt controller would block all further timer interrupts on that CPU.
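Steps 3–4 amount to charging one tick against the current thread's SchedContext; a hedged sketch (charge_tick is a hypothetical name):

```rust
// Hypothetical model of the per-tick budget charge: consume one tick and
// report whether the budget is now exhausted. The caller then marks the
// SchedContext exhausted, schedules replenishment, and triggers reschedule.
fn charge_tick(remaining: &mut u64, consumed: &mut u64) -> bool {
    *consumed += 1;
    *remaining = remaining.saturating_sub(1);
    *remaining == 0
}
```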

CPU Accounting

Every timer tick classifies the interrupted context as user, kernel, or idle and increments the appropriate counters:

  • If the interrupted CPU was running the idle thread, bump idle_ticks[cpu] and skip the per-thread update.

  • If the interrupted context was EL0 / ring 3, bump per_cpu_user_ticks[cpu] and sc.consumed, and atomically bump Tcb.user_ticks on the current thread.

  • If the interrupted context was EL1 / ring 0 (kernel), bump per_cpu_system_ticks[cpu] and sc.consumed, and atomically bump Tcb.system_ticks on the current thread.

The per-thread user_ticks / system_ticks counters are AtomicU64 so the readout path (the TCB_GET_CPU_TIMES invoke — label 0x80) can sample them without taking the scheduler lock. The per-CPU counters feed SysInfo / SysMemInfo for system-wide procfs views.

Replenishment

When the replenishment time arrives:

  1. Set remaining = budget.

  2. Compute new deadline = now + period.

  3. Update the thread’s priority to the new deadline.

  4. Re-enqueue the thread in the ready queue (sorted by new deadline).
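A minimal sketch of steps 1–3 (replenish is a hypothetical name; note that the new deadline doubles as the EDF sort key used for the re-enqueue in step 4):

```rust
// Hypothetical replenishment: refill the budget and push the absolute
// deadline one period past `now`. Returns (remaining, deadline).
fn replenish(budget: u64, period: u64, now: u64) -> (u64, u64) {
    let remaining = budget;      // 1. remaining = budget
    let deadline = now + period; // 2. deadline = now + period (new EDF key)
    (remaining, deadline)
}
```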

Preemption

A reschedule is triggered by:

  • Timer tick: budget exhaustion or timed wakeup.

  • IPC wake: a blocked thread is unblocked by a partner arriving.

  • Yield: explicit Yield syscall (syscall 8).

  • State change: thread block, unblock, priority change, affinity change.

  • IPI: reschedule IPI from another CPU after remote enqueue.

Context Switch

The context switch sequence:

  1. The current thread’s state is saved (architecture-specific register save).

  2. If the outgoing thread is still runnable, it is not directly re-enqueued. Instead, it is placed in the pending_enqueue atomic slot for this CPU.

  3. The page table root is switched to the incoming thread’s VSpace.

  4. The incoming thread’s register state is restored.

  5. On the next entry to reschedule(), process_pending_enqueue() drains the deferred slot and inserts the outgoing thread into the ready queue.

The deferred enqueue prevents a race: without it, another CPU could dequeue and switch to the outgoing thread before its registers have been fully saved.
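The deferred slot can be modeled as a single AtomicPtr per CPU: the switch-out path publishes the outgoing TCB, and the drain side swaps it back out, so the slot holds at most one thread. A hedged sketch — the real kernel's memory orderings and occupied-slot handling may differ:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Tcb {
    deadline: u64,
}

// Publish the outgoing TCB; returns false if the slot was already occupied
// (the caller must then handle the collision).
fn defer_enqueue(slot: &AtomicPtr<Tcb>, tcb: *mut Tcb) -> bool {
    slot.compare_exchange(ptr::null_mut(), tcb, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

// Drain side, called from reschedule(): take whatever is pending, if anything.
fn process_pending_enqueue(slot: &AtomicPtr<Tcb>) -> Option<*mut Tcb> {
    let p = slot.swap(ptr::null_mut(), Ordering::AcqRel);
    if p.is_null() { None } else { Some(p) }
}
```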

Lock Protocol

Two distinct locks govern the context switch path:

  • SCHED_IPC_LOCK — global IPC-path lock, acquired before any IPC/scheduler interaction. Protects the combined IPC operation and scheduler enqueue/context switch sequence.

  • lock_states[cpu] (aka sched.lock_cpu) — per-CPU ready-queue spinlock, acquired under SCHED_IPC_LOCK. Protects per-CPU queue manipulation.

Sequence:

  1. SCHED_IPC_LOCK is acquired before queue manipulation.

  2. lock_states[cpu] (sched.lock_cpu) is acquired under SCHED_IPC_LOCK for ready-queue insertion/removal.

  3. do_context_switch releases SCHED_IPC_LOCK before the actual arch::context_switch() call and reacquires it on resume.

  4. The pending_enqueue slot is drained after reacquiring.

SMP

Per-CPU Lock Ordering

When enqueuing a thread on a remote CPU’s queue, the scheduler must acquire two per-CPU locks. To prevent deadlock, locks are always acquired in CPU-ID order:

  • If target > local: hold local lock, acquire target lock.

  • If target < local: release local lock, acquire target lock, insert, release target lock, reacquire local lock.
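The two branches can be made concrete as a step trace (a hypothetical model that assumes the local lock is held on entry and must be held again on exit):

```rust
// Hypothetical trace of the lock operations for a remote enqueue, showing
// that the pair of per-CPU locks is always taken in ascending CPU-ID order.
fn cross_enqueue_steps(local: u32, target: u32) -> Vec<String> {
    let mut steps = Vec::new();
    if target > local {
        // Ascending order is already safe: take target while holding local.
        steps.push(format!("lock {}", target));
        steps.push(format!("insert on {}", target));
        steps.push(format!("unlock {}", target));
    } else {
        // target < local: drop local first so the lower-numbered lock is
        // never requested while the higher-numbered one is held.
        steps.push(format!("unlock {}", local));
        steps.push(format!("lock {}", target));
        steps.push(format!("insert on {}", target));
        steps.push(format!("unlock {}", target));
        steps.push(format!("lock {}", local));
    }
    steps
}
```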

IPI Reschedule

After enqueuing a thread on a remote CPU, the scheduler evaluates whether an IPI is needed:

  1. If the target CPU is idle: always send IPI.

  2. If the target CPU is running a thread with a later deadline than the newly enqueued thread: send IPI (preemption).

  3. If the target CPU’s current thread is in a switch-out path (state != Running): send IPI to flush pending work.

The target CPU’s IPI handler calls reschedule().

The current[target] read is best-effort (no target lock held). Stale reads cause either a spurious IPI (harmless) or a missed IPI (the next timer tick catches it).
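The decision reduces to three OR'd conditions; a sketch with assumed parameter names (current_deadline is the best-effort read of the target's running thread, and current_running mirrors the state != Running check):

```rust
// Hypothetical model of the IPI decision after a remote enqueue.
fn needs_ipi(
    target_idle: bool,
    current_deadline: u64,
    current_running: bool,
    new_deadline: u64,
) -> bool {
    target_idle                            // 1. idle target: always kick
        || new_deadline < current_deadline // 2. earlier deadline: preempt
        || !current_running                // 3. mid-switch-out: flush pending work
}
```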

Load Rebalancing

The scheduler periodically rebalances by checking CPU queue depths. Only threads with cpu_affinity = 0xFFFF_FFFF (no affinity) are eligible for migration. Rebalancing moves threads from overloaded CPUs to underloaded ones.

Idle Thread

Each CPU has a dedicated idle thread. The idle loop performs three operations:

  1. Deferred free processing: advance the quiescent generation counter and free VSpace page table pages whose generation has been observed by all CPUs.

  2. Pending enqueue drain: process the pending_enqueue slot.

  3. Halt: hlt (x86_64) or wfi (aarch64) until the next interrupt.

The idle thread must periodically re-enter scheduler lock processing: pending VSpace deactivations require each CPU to observe an IPI-driven state change, and without idle re-entry, delete and syscall paths that depend on cross-CPU quiescence can hang indefinitely.

Sleep Queue

The sleep queue is a global singly-linked list sorted by wakeup time (ascending). It is protected by SLEEP_LOCK (independent of the per-CPU scheduler locks).

A NEXT_WAKEUP_NS atomic variable caches the head’s wakeup time for fast skip in timer_tick(): if now_ns < NEXT_WAKEUP_NS, no sleep queue processing is needed.
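The fast path can be sketched as a single relaxed load (a hedged model; the kernel's actual ordering and the helper's name are assumptions):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Cache of the sleep queue head's wakeup time; u64::MAX means "queue empty".
static NEXT_WAKEUP_NS: AtomicU64 = AtomicU64::new(u64::MAX);

// Hypothetical fast-path check in timer_tick(): most ticks return false
// here and never touch SLEEP_LOCK.
fn sleep_queue_needs_service(now_ns: u64) -> bool {
    now_ns >= NEXT_WAKEUP_NS.load(Ordering::Relaxed)
}
```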

Operations

insert(tcb)

Insert into the sorted list by timer_wakeup_ns. O(N).

remove(tcb)

Remove a specific TCB. O(N) list walk. Used when a thread is woken by another event before its sleep expires.

check_wakeups(now_ns)

Pop all threads from the head whose timer_wakeup_ns ≤ now_ns. For each woken thread, clear blocked_reason, set state to Ready, and enqueue. Handles timed IPC cleanup: if the thread was in both an endpoint queue and the sleep queue, it is removed from the endpoint queue as well.
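The wakeup sweep can be modeled over a Vec sorted ascending by wakeup time (the kernel uses an intrusive list; the Vec and the function shape are modeling assumptions):

```rust
// Hypothetical model of check_wakeups(): pop every head entry whose wakeup
// time has passed, returning the woken set earliest-first. The sorted order
// means the sweep stops at the first still-sleeping thread.
fn check_wakeups(queue: &mut Vec<u64>, now_ns: u64) -> Vec<u64> {
    let split = queue
        .iter()
        .position(|&t| t > now_ns) // first entry still in the future
        .unwrap_or(queue.len());
    queue.drain(..split).collect()
}
```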

SchedContext Operations

Label  Name          Description
0x30   SC_CONFIGURE  Set budget, period, and priority.
0x31   SC_BIND       Bind a SchedContext to a TCB.

Error Conditions and Edge Cases

Bootstrap TCB

The bootstrap TCB represents kmain’s transient boot context. It has no dedicated per-thread kernel stack and must never be scheduled again after the first context switch away from it. If enqueue_unlocked() detects the bootstrap TCB, it sets the thread to Inactive and returns without enqueuing.

Enqueue Race: Thread Still Running on Another CPU

When a thread is woken (e.g., IPC partner arrives) but is still the architectural current[cpu] on another CPU (mid-switch-out), directly enqueuing it would allow a second CPU to dequeue and switch to it before its registers are saved.

The scheduler detects this via find_running_cpu() and routes the thread into that CPU’s pending_enqueue atomic slot. An IPI nudges the target CPU to process the pending slot promptly.

IPI Read Race (Best-Effort)

The current[target] read in the IPI decision logic is best-effort — no target lock is held during the read. Stale reads produce two outcomes:

  • Spurious IPI: the target CPU finds nothing to preempt. Harmless (costs one interrupt).

  • Missed IPI: the next timer tick (within ~10 ms) or periodic rebalance catches the enqueued thread.

Per-CPU Lock Ordering Inversion

Enqueuing on a remote CPU requires two per-CPU locks. If target < local, the scheduler must release the local lock first, acquire target, insert, release target, reacquire local. This avoids ABBA deadlock between two CPUs simultaneously cross-enqueuing.

Budget Exhaustion During IPC

If a thread’s budget expires while it is blocked in IPC (e.g., waiting in RecvBlocked), the thread remains blocked — budget exhaustion only preempts running threads. The thread’s budget is replenished at its next deadline + period, and it becomes schedulable again when both unblocked and replenished.

Idle Thread Starvation of Deferred Free

If no timer interrupts reach the idle thread (e.g., all CPUs are busy with user threads), deferred VSpace free entries accumulate. The periodic rebalance counter in timer_tick() ensures deferred free processing occurs even on busy CPUs, but with lower frequency than idle processing.

See Also

  • Threads — TCB structure, thread states, priority inheritance

  • Endpoints — IPC blocking and wake interaction with scheduler

  • Design Patterns — timed operations pattern

  • Architecture — context switch assembly, timer setup