x86_64

This page documents the x86_64 architecture backend in kernite/src/arch/x86_64/.

Module Structure

File         Responsibility
mod.rs       Backend entry point, init() sequence, timer abstraction, I/O port wrappers.
boot.rs      BSP bootstrap: GDT, IDT, APIC, ACPI, paging, CPUID initialization in order.
gdt.rs       Global Descriptor Table with Task State Segment (TSS). Ring 0/3 code and data segments.
idt.rs       Interrupt Descriptor Table: 256 entries covering exceptions (0-31) and hardware IRQs (32+). Syscalls use the syscall instruction, not an IDT vector.
apic.rs      Local APIC (timer, IPI, EOI) and I/O APIC (IRQ routing, level/edge trigger, masking).
acpi.rs      ACPI MADT (CPU enumeration), MCFG (PCIe ECAM base), HPET detection.
paging.rs    4-level page tables (PML4 → PDPT → PD → PT), direct physical map setup, identity map management.
ap_boot.rs   Application processor trampoline: real mode → protected mode → long mode. Uses ACPI MADT to find APs.
context.rs   Context switch: save/restore of rsp, rbx, rbp, r12-r15, rflags, plus the page table switch.
cpu.rs       Per-CPU PerCpuData block accessed via the %gs segment: CPU ID, kernel stack, saved user RSP, invoke sequence, stack canary.
cpuid.rs     CPUID feature detection: SSE, AVX, XSAVE, SMAP, SMEP, 1G pages, FSGSBASE.
fpu.rs       Eager FPU/SSE/AVX context switch: XSAVE/XRSTOR on every TCB switch, CR0.TS held at 0.
pit.rs       Programmable Interval Timer: used for APIC timer calibration and as a fallback timer.
uaccess.rs   SMAP/SMEP control: stac/clac for user memory access windows.

Initialization Order

init() follows a strict dependency-driven sequence:

  1. gdt::init() — GDT must exist before IDT can reference code segments.

  2. cpu::init_bsp() — per-CPU data (%gs base) must be set after GDT reload (which clobbers GS).

  3. idt::init() — must be ready before any interrupts fire.

  4. mm::init() — frame allocator needs boot info memory map.

  5. acpi::init() — parse MADT for CPU topology, MCFG for PCIe.

  6. apic::init() — configure Local APIC and I/O APIC. Timer stays masked.

  7. pit::init() — calibrate APIC timer frequency against PIT.

  8. paging::init() — build kernel page tables, establish direct physical map.

  9. cpuid::detect() — detect optional features (SMAP, SMEP, XSAVE).

Timer interrupts start later via start_timer() after the scheduler is ready.
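
The ordering constraints named in the list can be written down and checked mechanically. The following is a minimal sketch, not the kernel's code: the Step enum, INIT_ORDER, MUST_PRECEDE, and order_is_valid are hypothetical names, and only the three dependencies the list states explicitly are encoded.

```rust
/// Boot steps, named after the calls in the list above (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Step { Gdt, CpuBsp, Idt, Mm, Acpi, Apic, Pit, Paging, Cpuid }

/// The documented order of init().
pub const INIT_ORDER: [Step; 9] = [
    Step::Gdt, Step::CpuBsp, Step::Idt, Step::Mm, Step::Acpi,
    Step::Apic, Step::Pit, Step::Paging, Step::Cpuid,
];

/// (earlier, later) pairs taken from the rationale given in the list.
pub const MUST_PRECEDE: [(Step, Step); 3] = [
    (Step::Gdt, Step::Idt),    // IDT gates reference GDT code segments
    (Step::Gdt, Step::CpuBsp), // the GDT reload clobbers GS
    (Step::Apic, Step::Pit),   // the PIT calibrates an already configured APIC timer
];

fn position(order: &[Step], step: Step) -> Option<usize> {
    order.iter().position(|&s| s == step)
}

/// Returns true when `order` contains both steps of every pair, in the right order.
pub fn order_is_valid(order: &[Step]) -> bool {
    MUST_PRECEDE.iter().all(|&(first, second)| {
        match (position(order, first), position(order, second)) {
            (Some(i), Some(j)) => i < j,
            _ => false,
        }
    })
}
```

A check like order_is_valid(&INIT_ORDER) could run in a unit test so that a reordering of init() that breaks a stated dependency is caught early.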

GDT and TSS

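The GDT holds six descriptors in seven 8-byte slots: the null descriptor, four code/data segments, and one 16-byte TSS descriptor: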

Index  Selector  Purpose
0      0x00      Null descriptor.
1      0x08      Kernel code (64-bit, DPL 0).
2      0x10      Kernel data (DPL 0).
3      0x18      User code (64-bit, DPL 3).
4      0x20      User data (DPL 3).
5-6    0x28      TSS descriptor (16 bytes, points to per-CPU TSS).

The TSS provides:

  • RSP0 — kernel stack pointer loaded on ring 3 → ring 0 transition. Updated on every context switch to point to the current thread’s kernel stack.

  • IST entries — interrupt stack table for double fault and NMI handlers.
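
The selector values in the table are the descriptor index shifted left by three; the low two bits carry the requested privilege level, so ring 3 code uses the user selectors with RPL 3 (0x1B and 0x23). The sketch below shows that arithmetic and the two-slot encoding of a 64-bit TSS descriptor. The constant and function names are illustrative assumptions, not taken from gdt.rs.

```rust
/// Build a segment selector: index in bits 3 and up, TI = 0 (GDT), RPL in bits 0-1.
pub const fn selector(index: u16, rpl: u16) -> u16 {
    (index << 3) | (rpl & 0b11)
}

// Hypothetical constant names; the indices follow the table above.
pub const KERNEL_CS: u16 = selector(1, 0);
pub const KERNEL_DS: u16 = selector(2, 0);
pub const USER_CS: u16 = selector(3, 3);
pub const USER_DS: u16 = selector(4, 3);
pub const TSS_SEL: u16 = selector(5, 0);

/// Encode a 64-bit "available TSS" descriptor as two consecutive GDT slots.
pub fn tss_descriptor(base: u64, limit: u32) -> (u64, u64) {
    let mut low = 0u64;
    low |= (limit as u64) & 0xFFFF;              // limit[15:0]
    low |= (base & 0x00FF_FFFF) << 16;           // base[23:0]
    low |= 0x9u64 << 40;                         // type 0x9: 64-bit TSS, available
    low |= 1u64 << 47;                           // present bit
    low |= (((limit as u64) >> 16) & 0xF) << 48; // limit[19:16]
    low |= ((base >> 24) & 0xFF) << 56;          // base[31:24]
    let high = base >> 32;                       // base[63:32]; upper half reserved
    (low, high)
}

/// Recover the base address from an encoded descriptor (used to check the layout).
pub fn tss_base(low: u64, high: u64) -> u64 {
    ((low >> 16) & 0x00FF_FFFF) | (((low >> 56) & 0xFF) << 24) | (high << 32)
}
```

The second slot is why the TSS row occupies indices 5-6: in long mode a system descriptor needs 16 bytes to hold a 64-bit base.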

IDT

256 entries:

  • Entries 0-31: CPU exceptions (divide error, debug, NMI, breakpoint, overflow, …, page fault, …, SIMD floating point).

  • Entry 14 (page fault): dispatches to the generic VSpace fault handler after constructing PageFaultInfo from CR2 and the error code.

  • Entries 32+: Hardware IRQs routed through I/O APIC.

  • No dedicated syscall vector — syscalls use the syscall instruction (MSR-based, not IDT).
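
To illustrate the page fault path, the sketch below decodes the architectural error code bits into a structure. The field names are assumptions for illustration; the real PageFaultInfo may carry different or additional fields. The bit positions themselves are defined by the architecture.

```rust
/// Hypothetical stand-in for the kernel's PageFaultInfo.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PageFaultInfo {
    pub addr: u64,               // faulting linear address, read from CR2
    pub present: bool,           // bit 0: fault on a present page (protection violation)
    pub write: bool,             // bit 1: access was a write
    pub user: bool,              // bit 2: access originated in user mode
    pub reserved_bit: bool,      // bit 3: reserved bit set in a paging structure
    pub instruction_fetch: bool, // bit 4: access was an instruction fetch
}

impl PageFaultInfo {
    /// Build the info from the CR2 value and the error code pushed by the CPU.
    pub fn from_hw(cr2: u64, error_code: u64) -> Self {
        PageFaultInfo {
            addr: cr2,
            present: (error_code & (1 << 0)) != 0,
            write: (error_code & (1 << 1)) != 0,
            user: (error_code & (1 << 2)) != 0,
            reserved_bit: (error_code & (1 << 3)) != 0,
            instruction_fetch: (error_code & (1 << 4)) != 0,
        }
    }
}
```

For example, an error code of 0b00110 describes a user-mode write to a page that is not present, which is the usual shape of a demand-paging fault.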

APIC

Local APIC

  • Memory-mapped at the standard 0xFEE0_0000 physical address (remapped through direct map).

  • Timer: one-shot or periodic mode, calibrated against the PIT during boot.

  • EOI: written to the EOI register after every interrupt handler completes.

  • IPI: sends inter-processor interrupts for reschedule, TLB shootdown, and shutdown.

eoi() must be sent before any code that might trigger a context switch. Until the EOI is written, the Local APIC keeps the vector marked in service and holds back further interrupts of the same or lower priority, including the timer.
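
The calibration against the PIT comes down to simple arithmetic: let the APIC timer count down for a known PIT interval and divide the elapsed ticks by the interval length. The following is a sketch under assumptions; the function names are hypothetical and the actual calibration window used by pit.rs is not stated here.

```rust
/// Input clock of the legacy PIT in Hz.
pub const PIT_HZ: u64 = 1_193_182;

/// PIT reload value for a delay of `ms` milliseconds, if it fits the 16-bit counter.
pub fn pit_reload_for_ms(ms: u64) -> Option<u16> {
    let count = PIT_HZ * ms / 1000;
    if count == 0 || count > u16::MAX as u64 {
        None
    } else {
        Some(count as u16)
    }
}

/// APIC timer ticks per millisecond, given the initial count written before the
/// PIT window and the current count read back after it. The timer counts down.
pub fn apic_ticks_per_ms(initial: u32, remaining: u32, window_ms: u32) -> Option<u32> {
    if window_ms == 0 {
        return None;
    }
    let elapsed = initial.checked_sub(remaining)?;
    Some(elapsed / window_ms)
}

/// Initial count that makes a periodic APIC timer fire `hz` times per second.
pub fn initial_count_for_hz(ticks_per_ms: u32, hz: u32) -> Option<u32> {
    if hz == 0 {
        return None;
    }
    let count = (ticks_per_ms as u64) * 1000 / (hz as u64);
    if count > u32::MAX as u64 { None } else { Some(count as u32) }
}
```

Because the PIT counter is 16 bits wide, a single window cannot exceed roughly 54 ms, which is why pit_reload_for_ms returns None for longer delays.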

I/O APIC

  • Routes external hardware IRQs (keyboard, COM1, PCI devices) to Local APIC interrupt vectors.

  • Supports level-triggered (PCI) and edge-triggered (ISA) delivery modes.

  • ioapic_unmask(irq) / ioapic_mask(irq) control per-IRQ delivery.

  • Shared IRQ support: multiple IrqHandler objects can register for the same IRQ line.
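
Each I/O APIC input is configured through a 64-bit redirection table entry. The sketch below packs the fields that matter for the behaviour described above (vector, trigger mode, polarity, mask, destination). The function names are illustrative and are not the signatures of ioapic_mask() or ioapic_unmask(); fixed delivery and physical destination mode are assumed.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Trigger { Edge, Level }

const POLARITY_LOW: u64 = 1 << 13;
const TRIGGER_LEVEL: u64 = 1 << 15;
const MASKED: u64 = 1 << 16;

/// Register index of the low dword of the redirection entry for `irq`;
/// the high dword lives at the next index.
pub const fn redir_index_low(irq: u8) -> u32 {
    0x10 + 2 * irq as u32
}

/// Pack a redirection entry (fixed delivery, physical destination mode).
pub fn redir_entry(vector: u8, dest_apic_id: u8, trigger: Trigger,
                   active_low: bool, masked: bool) -> u64 {
    let mut entry = vector as u64; // bits 0-7: interrupt vector
    if active_low {
        entry |= POLARITY_LOW;
    }
    if trigger == Trigger::Level {
        entry |= TRIGGER_LEVEL;
    }
    if masked {
        entry |= MASKED;
    }
    entry | ((dest_apic_id as u64) << 56) // bits 56-63: destination APIC ID
}

/// Set or clear only the mask bit, leaving the routing untouched.
pub fn set_masked(entry: u64, masked: bool) -> u64 {
    if masked { entry | MASKED } else { entry & !MASKED }
}
```

Masking and unmasking then reduce to a read-modify-write of one bit, which keeps the vector and trigger configuration stable across handler registration.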

Paging

4-level page tables: PML4 → PDPT → PD → PT.

  • User pages: PML4 entries 0-255 (lower canonical half).

  • Kernel pages: PML4 entries 256-511 (upper canonical half, shared across all VSpaces).

  • Direct physical map: starts at PHYS_MAP_OFFSET (0xFFFF_8000_0000_0000), uses 2 MB large pages where possible.

  • Identity map: temporary 1:1 mapping used during boot for AP trampoline code. Removed by clear_boot_identity_map() after all APs boot.

Page table flags: see VSpace for the complete PTE flag definitions.
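
The split between user and kernel PML4 entries follows directly from how a virtual address is decomposed: nine index bits per level above a 12-bit page offset. The sketch below shows the decomposition and the direct-map translation. PHYS_MAP_OFFSET is the value quoted above; the other names are assumptions for illustration.

```rust
pub const PHYS_MAP_OFFSET: u64 = 0xFFFF_8000_0000_0000;

/// Table indices and page offset of a virtual address (4 KiB pages).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Indices {
    pub pml4: usize,
    pub pdpt: usize,
    pub pd: usize,
    pub pt: usize,
    pub offset: usize,
}

/// Split a virtual address into its four table indices and page offset.
pub fn split(va: u64) -> Indices {
    Indices {
        pml4: ((va >> 39) & 0x1FF) as usize,
        pdpt: ((va >> 30) & 0x1FF) as usize,
        pd: ((va >> 21) & 0x1FF) as usize,
        pt: ((va >> 12) & 0x1FF) as usize,
        offset: (va & 0xFFF) as usize,
    }
}

/// Kernel addresses use PML4 entries 256-511.
pub fn is_kernel_half(va: u64) -> bool {
    split(va).pml4 >= 256
}

/// Translate a physical address to its alias in the direct physical map.
pub fn phys_to_virt(pa: u64) -> u64 {
    PHYS_MAP_OFFSET + pa
}
```

With these definitions, PHYS_MAP_OFFSET lands exactly on PML4 entry 256, and the Local APIC page at physical 0xFEE0_0000 is reachable at 0xFFFF_8000_FEE0_0000 through the direct map.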

SMP

AP Trampoline

Application processors start in real mode. The trampoline code (ap_tramp.S) performs:

  1. Real mode → protected mode (set PE in CR0, load GDT).

  2. Protected mode → long mode (set PAE, PGE in CR4; set LME in EFER; set PG in CR0).

  3. Jump to 64-bit kernel code using the BSP’s page tables.

  4. Initialize per-CPU data (%gs segment, TSS, Local APIC).

  5. Enter the idle loop.

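CPU topology is discovered via the ACPI MADT (Multiple APIC Description Table). Each AP is started with a SIPI (Startup IPI) whose 8-bit vector encodes the 4 KiB page number of the trampoline’s physical address, so the trampoline must be page-aligned and located below 1 MiB.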
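
A SIPI does not carry a full address: its vector field holds a physical page number, and the AP begins executing in real mode at that page. The sketch below derives the vector and the ICR words for one target CPU. The function names and the example address 0x8000 are assumptions; the actual trampoline location is not stated on this page.

```rust
const DELIVERY_STARTUP: u32 = 0b110 << 8; // ICR delivery mode: Start-Up
const LEVEL_ASSERT: u32 = 1 << 14;        // ICR level: assert

/// SIPI vector for a trampoline at `phys`, or None if the address cannot be
/// expressed (not 4 KiB aligned, or not below 1 MiB).
pub fn sipi_vector(phys: u32) -> Option<u8> {
    if phys & 0xFFF != 0 || phys >= 0x10_0000 {
        return None;
    }
    Some((phys >> 12) as u8)
}

/// Low dword of the ICR for a SIPI with the given vector.
pub fn icr_low_startup(vector: u8) -> u32 {
    DELIVERY_STARTUP | LEVEL_ASSERT | vector as u32
}

/// High dword of the ICR: destination APIC ID in bits 24-31 (xAPIC mode).
pub fn icr_high(dest_apic_id: u8) -> u32 {
    (dest_apic_id as u32) << 24
}

/// Physical address at which the AP starts executing for a given vector.
pub fn start_address(vector: u8) -> u32 {
    (vector as u32) << 12
}
```

With a trampoline at the hypothetical address 0x8000, sipi_vector returns 8 and the low ICR dword is 0x4608. The high dword is written first because the write to the low dword is what sends the IPI.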

IPI Types

Kind          Purpose
Reschedule    Wake a remote CPU to run reschedule() (thread enqueued on its ready queue).
TlbShootdown  Invalidate a TLB entry on a remote CPU after page table modification.
Shutdown      Request orderly CPU halt during system shutdown.
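
One way to model the three kinds is an enum that maps to an interrupt vector and back. The vector numbers below (0xF0-0xF2) are purely hypothetical, since this page does not state which vectors the kernel assigns; the ICR layout for a fixed-delivery IPI is architectural.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IpiKind { Reschedule, TlbShootdown, Shutdown }

impl IpiKind {
    /// Hypothetical vector assignment, chosen only for this example.
    pub const fn vector(self) -> u8 {
        match self {
            IpiKind::Reschedule => 0xF0,
            IpiKind::TlbShootdown => 0xF1,
            IpiKind::Shutdown => 0xF2,
        }
    }

    /// Reverse mapping, as an interrupt handler would need it.
    pub fn from_vector(vector: u8) -> Option<IpiKind> {
        match vector {
            0xF0 => Some(IpiKind::Reschedule),
            0xF1 => Some(IpiKind::TlbShootdown),
            0xF2 => Some(IpiKind::Shutdown),
            _ => None,
        }
    }
}

/// ICR (high, low) dwords for a fixed-delivery IPI to one CPU,
/// physical destination mode, level asserted.
pub fn icr_words(kind: IpiKind, dest_apic_id: u8) -> (u32, u32) {
    let high = (dest_apic_id as u32) << 24;
    let low = kind.vector() as u32 | (1 << 14);
    (high, low)
}
```

Keeping the mapping in one place means the sender and the receiving handler cannot disagree about which vector means which request.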

Syscall Entry

The syscall instruction is configured via MSRs:

  • STAR — segment selectors for kernel/user mode.

  • LSTAR — kernel entry point address (syscall_entry in syscall.S).

  • SFMASK — flags cleared on entry (interrupts disabled).

Entry sequence in syscall.S:

  1. Hardware saves the user return RIP in RCX and the user RFLAGS in R11.

  2. swapgs to the kernel GS base (which points at PerCpuData).

  3. Stash the caller’s user RSP in PerCpuData.saved_rsp (%gs:16).

  4. Load the current thread’s kernel stack pointer from PerCpuData.kernel_stack (%gs:8) into RSP.

  5. Check syscall number: if 2 (Call) or 3 (ReplyRecv), jump to the fastpath.

  6. Otherwise, save all user registers on the kernel stack and call syscall_handle_rust().

  7. On return, restore user RSP from %gs:16, swapgs back, and execute sysretq.
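
Two pieces of the entry path are plain data manipulation and can be sketched outside the assembly stub: the effect of the flag mask on RFLAGS, and the fastpath decision on the syscall number. The MSR addresses are architectural; the constant and function names are assumptions, and the real dispatch lives in syscall.S.

```rust
// MSR addresses (architectural).
pub const MSR_STAR: u32 = 0xC000_0081;
pub const MSR_LSTAR: u32 = 0xC000_0082;
pub const MSR_SFMASK: u32 = 0xC000_0084;

/// RFLAGS.IF: maskable interrupts enabled.
pub const RFLAGS_IF: u64 = 1 << 9;

/// RFLAGS as seen by the first instruction of the entry stub: every bit set
/// in SFMASK is cleared. The user value itself is preserved in R11.
pub fn rflags_on_entry(user_rflags: u64, sfmask: u64) -> u64 {
    user_rflags & !sfmask
}

// Syscall numbers named in the entry sequence above.
pub const SYS_CALL: u64 = 2;
pub const SYS_REPLY_RECV: u64 = 3;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EntryPath { Fast, Slow }

/// Step 5 of the entry sequence: Call and ReplyRecv take the fastpath,
/// everything else goes through the full register save.
pub fn classify(syscall_nr: u64) -> EntryPath {
    match syscall_nr {
        SYS_CALL | SYS_REPLY_RECV => EntryPath::Fast,
        _ => EntryPath::Slow,
    }
}
```

Masking IF matters because the stub runs on the user stack until step 4 completes; an interrupt taken in that window would push its frame onto user-controlled memory.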

PerCpuData Layout

syscall.S and the rest of the kernel share the layout below. Any change to PerCpuData must update the corresponding %gs:OFFSET references in syscall.S.

Offset  Field                  Purpose
0       cpu_id: u32 (+4B pad)  Logical CPU index (BSP = 0). Read with a 32-bit load so the padding is never interpreted.
8       kernel_stack: u64      Top-of-stack pointer loaded on syscall entry.
16      saved_rsp: u64         User RSP saved by the syscall stub between swapgs pairs.
24      invoke_seq: u64        Monotonic counter incremented on each Invoke; used as a diagnostic correlation ID.
32      stack_canary: u64      Per-CPU stack canary seeded from RDSEED/RDRAND during BSP/AP init.
40-127  _reserved: [u64; 11]   Reserved for future growth.
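
Because the assembly stub hard-codes %gs:8 and %gs:16, a compile-time check on the Rust side is one way to keep the two in sync. The following is a sketch, assuming the struct is declared #[repr(C)] with the fields listed above; the OFF_* constant names are hypothetical, and std::mem::offset_of! requires Rust 1.77 or later.

```rust
/// Sketch of the per-CPU block; field names follow the layout table above.
#[repr(C)]
pub struct PerCpuData {
    pub cpu_id: u32, // repr(C) inserts 4 bytes of padding after this field
    pub kernel_stack: u64,
    pub saved_rsp: u64,
    pub invoke_seq: u64,
    pub stack_canary: u64,
    _reserved: [u64; 11],
}

// Offsets the assembly stub relies on (hypothetical constant names).
pub const OFF_KERNEL_STACK: usize = 8;
pub const OFF_SAVED_RSP: usize = 16;

// Compile-time checks: the build fails if a field moves.
const _: () = assert!(std::mem::offset_of!(PerCpuData, kernel_stack) == OFF_KERNEL_STACK);
const _: () = assert!(std::mem::offset_of!(PerCpuData, saved_rsp) == OFF_SAVED_RSP);

/// Offsets of the named fields, in declaration order.
pub fn field_offsets() -> [usize; 5] {
    [
        std::mem::offset_of!(PerCpuData, cpu_id),
        std::mem::offset_of!(PerCpuData, kernel_stack),
        std::mem::offset_of!(PerCpuData, saved_rsp),
        std::mem::offset_of!(PerCpuData, invoke_seq),
        std::mem::offset_of!(PerCpuData, stack_canary),
    ]
}
```

With assertions like these, inserting a field ahead of kernel_stack turns into a build error instead of a silently corrupted stack pointer on the next syscall.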

FPU / SSE / AVX

Eager context switching:

  1. At boot, CR0.TS is cleared (and CR0.MP kept set). CR0.TS stays at 0 for the lifetime of the kernel, so the #NM vector is never taken for FPU faults.

  2. On every TCB switch, fpu::switch() calls XSAVE on the outgoing TCB and XRSTOR on the incoming TCB unconditionally — there is no "fpu_used" fast-exit.

  3. flush_current() / reload_current() keep the live register file in sync after TCB_COPY_FPU or similar kernel-side mutations.

XSAVE/XRSTOR fall back to FXSAVE/FXRSTOR when CPUID does not advertise XSAVE.

Save area: XSaveArea, 832 bytes, 64-byte aligned (legacy x87/SSE region: 512 + XSAVE header: 64 + AVX: 256).
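
The size and alignment of the save area can be expressed directly in the type. The sketch below assumes the standard (non-compacted) XSAVE format with only the x87, SSE, and AVX components enabled; the field names and the XSAVE_MASK constant are illustrative and may not match fpu.rs.

```rust
/// Sketch of an XSAVE area for x87 + SSE + AVX state (standard format).
#[repr(C, align(64))]
pub struct XSaveArea {
    pub legacy: [u8; 512], // FXSAVE-compatible region: x87 and SSE state
    pub header: [u8; 64],  // XSAVE header (XSTATE_BV, XCOMP_BV, reserved)
    pub avx: [u8; 256],    // upper halves of YMM0-YMM15
}

// State-component bits, as used in XCR0 and the XSAVE/XRSTOR mask.
pub const XCR0_X87: u64 = 1 << 0;
pub const XCR0_SSE: u64 = 1 << 1;
pub const XCR0_AVX: u64 = 1 << 2;

/// Mask covering the three components stored in XSaveArea.
pub const XSAVE_MASK: u64 = XCR0_X87 | XCR0_SSE | XCR0_AVX;

// Compile-time checks on the documented size and alignment.
const _: () = assert!(std::mem::size_of::<XSaveArea>() == 832);
const _: () = assert!(std::mem::align_of::<XSaveArea>() == 64);

impl XSaveArea {
    /// An all-zero area; a real initial state would also set defaults
    /// such as the x87 control word and MXCSR.
    pub const fn zeroed() -> Self {
        XSaveArea { legacy: [0; 512], header: [0; 64], avx: [0; 256] }
    }
}
```

The 64-byte alignment is a hardware requirement: XSAVE and XRSTOR fault on a misaligned operand. Since 832 is a multiple of 64, the struct needs no tail padding.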

SMAP / SMEP

If CPUID indicates support:

  • SMEP (Supervisor Mode Execution Prevention): prevents kernel from executing user pages. Enabled via CR4.SMEP.

  • SMAP (Supervisor Mode Access Prevention): prevents kernel from reading/writing user pages unless explicitly allowed. Enabled via CR4.SMAP. The stac instruction temporarily opens user access; clac closes it.

The uaccess module provides controlled user memory access windows.
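
The feature checks and the effect of stac/clac can be modelled with a few bit operations. The bit positions below are architectural (CPUID leaf 7 subleaf 0 EBX, CR4, RFLAGS.AC); the function names are assumptions for illustration and are not the uaccess API.

```rust
// CPUID.(EAX=7, ECX=0):EBX feature bits.
pub const CPUID7_EBX_SMEP: u32 = 1 << 7;
pub const CPUID7_EBX_SMAP: u32 = 1 << 20;

// CR4 enable bits.
pub const CR4_SMEP: u64 = 1 << 20;
pub const CR4_SMAP: u64 = 1 << 21;

/// RFLAGS.AC: set by stac, cleared by clac.
pub const RFLAGS_AC: u64 = 1 << 18;

/// Enable each protection in CR4 only when CPUID advertises it.
pub fn cr4_with_protections(cr4: u64, cpuid7_ebx: u32) -> u64 {
    let mut value = cr4;
    if cpuid7_ebx & CPUID7_EBX_SMEP != 0 {
        value |= CR4_SMEP;
    }
    if cpuid7_ebx & CPUID7_EBX_SMAP != 0 {
        value |= CR4_SMAP;
    }
    value
}

/// Whether a supervisor-mode data access to a user page is permitted:
/// always without SMAP, otherwise only inside a stac/clac window.
pub fn user_access_allowed(cr4: u64, rflags: u64) -> bool {
    cr4 & CR4_SMAP == 0 || rflags & RFLAGS_AC != 0
}

/// Model of stac: open the access window.
pub fn stac(rflags: u64) -> u64 { rflags | RFLAGS_AC }

/// Model of clac: close the access window.
pub fn clac(rflags: u64) -> u64 { rflags & !RFLAGS_AC }
```

Keeping the stac/clac window as short as possible limits the code that runs with user memory accessible, which is the point of routing such accesses through a single module.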