# x86_64

This page documents the x86_64 architecture backend in `kernite/src/arch/x86_64/`.
## Module Structure

| File | Responsibility |
|---|---|
| | Backend entry point. |
| | BSP bootstrap: GDT, IDT, APIC, ACPI, paging, CPUID initialization in order. |
| `gdt.rs` | Global Descriptor Table with Task State Segment (TSS). Ring 0/3 code and data segments. |
| `idt.rs` | Interrupt Descriptor Table — 256 entries covering exceptions (0-31), IRQs (32+), and syscall. |
| `apic.rs` | Local APIC (timer, IPI, EOI) + I/O APIC (IRQ routing, level/edge trigger, masking). |
| `acpi.rs` | ACPI MADT (CPU enumeration), MCFG (PCIe ECAM base), HPET detection. |
| `paging.rs` | 4-level page tables (PML4 → PDPT → PD → PT), direct physical map setup, identity map management. |
| `ap_tramp.S` | Application processor trampoline: real mode → protected mode → long mode. Uses the ACPI MADT to find APs. |
| | Context switch: CPU register state save/restore. |
| `cpu.rs` | Per-CPU data (`PerCpuData`). |
| `cpuid.rs` | CPUID feature detection: SSE, AVX, XSAVE, SMAP, SMEP, 1G pages, FSGSBASE. |
| `fpu.rs` | FPU/SSE/AVX eager context switch — XSAVE/XRSTOR on every TCB switch, with FXSAVE/FXRSTOR fallback. |
| `pit.rs` | Programmable Interval Timer — used for APIC timer calibration, fallback timer. |
| `uaccess.rs` | SMAP/SMEP control: `stac`/`clac` user access windows. |
## Initialization Order

`init()` follows a strict dependency-driven sequence:

1. `gdt::init()` — the GDT must exist before the IDT can reference code segments.
2. `cpu::init_bsp()` — per-CPU data (`%gs` base) must be set after the GDT reload (which clobbers `GS`).
3. `idt::init()` — must be ready before any interrupts fire.
4. `mm::init()` — the frame allocator needs the boot info memory map.
5. `acpi::init()` — parse the MADT for CPU topology, the MCFG for PCIe.
6. `apic::init()` — configure the Local APIC and I/O APIC. The timer stays masked.
7. `pit::init()` — calibrate the APIC timer frequency against the PIT.
8. `paging::init()` — build kernel page tables, establish the direct physical map.
9. `cpuid::detect()` — detect optional features (SMAP, SMEP, XSAVE).

Timer interrupts start later via `start_timer()` after the scheduler is ready.
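Assuming each module exposes the entry point listed above, the sequencing can be sketched as ordinary Rust. The module bodies here are stand-in stubs that only record call order; the real bodies live in `kernite/src/arch/x86_64/`.

```rust
use std::sync::Mutex;

// Records the order in which the stub init functions run.
static ORDER: Mutex<Vec<&'static str>> = Mutex::new(Vec::new());
fn record(step: &'static str) {
    ORDER.lock().unwrap().push(step);
}

mod gdt { pub fn init() { crate::record("gdt"); } }
mod cpu { pub fn init_bsp() { crate::record("cpu"); } }
mod idt { pub fn init() { crate::record("idt"); } }
mod mm { pub fn init() { crate::record("mm"); } }
mod acpi { pub fn init() { crate::record("acpi"); } }
mod apic { pub fn init() { crate::record("apic"); } }
mod pit { pub fn init() { crate::record("pit"); } }
mod paging { pub fn init() { crate::record("paging"); } }
mod cpuid { pub fn detect() { crate::record("cpuid"); } }

pub fn init() {
    gdt::init();      // segments first: the IDT references them
    cpu::init_bsp();  // %gs base after the GDT reload
    idt::init();      // handlers ready before any interrupt fires
    mm::init();       // frame allocator from the boot memory map
    acpi::init();     // MADT / MCFG parsing
    apic::init();     // Local + I/O APIC, timer still masked
    pit::init();      // APIC timer calibration
    paging::init();   // kernel tables + direct physical map
    cpuid::detect();  // optional feature flags
}

fn main() {
    init();
    let order = ORDER.lock().unwrap();
    assert_eq!(
        &order[..],
        &["gdt", "cpu", "idt", "mm", "acpi", "apic", "pit", "paging", "cpuid"][..]
    );
}
```

Keeping the sequence in one function makes the dependency chain auditable in a single screenful rather than scattered across module constructors.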
## GDT and TSS

The GDT contains a null descriptor, four code/data segments, and a TSS descriptor:

| Index | Selector | Purpose |
|---|---|---|
| 0 | `0x00` | Null descriptor. |
| 1 | `0x08` | Kernel code (64-bit, DPL 0). |
| 2 | `0x10` | Kernel data (DPL 0). |
| 3 | `0x1B` | User code (64-bit, DPL 3). |
| 4 | `0x23` | User data (DPL 3). |
| 5-6 | `0x28` | TSS descriptor (16 bytes, points to the per-CPU TSS). |

Selector values are index × 8, with RPL 3 encoded in the user-mode selectors.

The TSS provides:

- `RSP0` — the kernel stack pointer loaded on a ring 3 → ring 0 transition. Updated on every context switch to point at the current thread's kernel stack.
- `IST` entries — the interrupt stack table for the double fault and NMI handlers.
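The 64-bit TSS layout that `RSP0` and the `IST` entries live in is fixed by the architecture. A minimal sketch — field names are illustrative, but the offsets and the 104-byte size are architectural, not taken from the kernel source:

```rust
use std::mem::{offset_of, size_of};

/// 64-bit Task State Segment, exactly as the CPU expects it (104 bytes).
#[repr(C, packed)]
struct Tss {
    _reserved0: u32,
    rsp0: u64,          // kernel stack loaded on ring 3 -> ring 0
    rsp1: u64,
    rsp2: u64,
    _reserved1: u64,
    ist: [u64; 7],      // interrupt stack table (IST1..IST7)
    _reserved2: u64,
    _reserved3: u16,
    iomap_base: u16,    // offset of the I/O permission bitmap
}

fn main() {
    // RSP0 sits at byte offset 4 because of the leading 32-bit reserved field.
    assert_eq!(offset_of!(Tss, rsp0), 4);
    assert_eq!(offset_of!(Tss, ist), 36);
    assert_eq!(offset_of!(Tss, iomap_base), 102);
    assert_eq!(size_of::<Tss>(), 104);
}
```

`repr(C, packed)` is required: without `packed`, Rust would pad `rsp0` to an 8-byte boundary and the hardware would read garbage.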
## IDT

256 entries:

- Entries 0-31: CPU exceptions (divide error, debug, NMI, breakpoint, overflow, …, page fault, …, SIMD floating point).
- Entry 14 (page fault): dispatches to the generic VSpace fault handler after constructing `PageFaultInfo` from CR2 and the error code.
- Entries 32+: hardware IRQs routed through the I/O APIC.
- No dedicated syscall vector — syscalls use the `syscall` instruction (MSR-based, not IDT).
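The error-code bits the CPU pushes for vector 14 are architecturally defined, so the CR2 + error code → `PageFaultInfo` step can be sketched as below. Only the type name `PageFaultInfo` comes from the text above; the field set shown here is hypothetical.

```rust
/// Hypothetical shape of PageFaultInfo; the error-code bit positions
/// are architecturally defined, the field names are illustrative.
#[derive(Debug)]
struct PageFaultInfo {
    addr: u64,         // faulting linear address, read from CR2
    present: bool,     // bit 0: protection violation (set) vs. non-present page (clear)
    write: bool,       // bit 1: write access (set) vs. read (clear)
    user: bool,        // bit 2: fault occurred while in ring 3
    instr_fetch: bool, // bit 4: fault caused by an instruction fetch
}

fn decode_page_fault(cr2: u64, error_code: u64) -> PageFaultInfo {
    PageFaultInfo {
        addr: cr2,
        present: error_code & (1 << 0) != 0,
        write: error_code & (1 << 1) != 0,
        user: error_code & (1 << 2) != 0,
        instr_fetch: error_code & (1 << 4) != 0,
    }
}

fn main() {
    // Ring 3 write to a non-present page: error code 0b110.
    let info = decode_page_fault(0x7FFF_DEAD_B000, 0b110);
    assert!(!info.present && info.write && info.user && !info.instr_fetch);
}
```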
## APIC

### Local APIC

- Memory-mapped at the standard `0xFEE0_0000` physical address (remapped through the direct map).
- Timer: one-shot or periodic mode, calibrated against the PIT during boot.
- EOI: written to the EOI register after every interrupt handler completes.
- IPI: sends inter-processor interrupts for reschedule, TLB shootdown, and shutdown.

`eoi()` must be sent before any code that might trigger a context switch: a delayed EOI blocks further timer interrupts on the local APIC.
### I/O APIC

- Routes external hardware IRQs (keyboard, COM1, PCI devices) to Local APIC interrupt vectors.
- Supports level-triggered (PCI) and edge-triggered (ISA) delivery modes.
- `ioapic_unmask(irq)` / `ioapic_mask(irq)` control per-IRQ delivery.
- Shared IRQ support: multiple `IrqHandler` objects can register for the same IRQ line.
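A minimal sketch of shared-IRQ dispatch, assuming a hypothetical `IrqHandler` trait and registry — the kernel's actual interface may differ:

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical trait mirroring the IrqHandler objects mentioned above.
trait IrqHandler {
    fn handle(&mut self, irq: u8);
}

/// Several handlers may claim one IRQ line; dispatch walks all of them,
/// which matters for level-triggered PCI lines shared between devices.
#[derive(Default)]
struct IrqRegistry {
    handlers: HashMap<u8, Vec<Box<dyn IrqHandler>>>,
}

impl IrqRegistry {
    fn register(&mut self, irq: u8, h: Box<dyn IrqHandler>) {
        self.handlers.entry(irq).or_default().push(h);
    }
    fn dispatch(&mut self, irq: u8) {
        if let Some(list) = self.handlers.get_mut(&irq) {
            for h in list {
                h.handle(irq);
            }
        }
    }
}

/// Test handler that counts how often it fires.
struct Counter(Rc<Cell<u32>>);
impl IrqHandler for Counter {
    fn handle(&mut self, _irq: u8) { self.0.set(self.0.get() + 1); }
}

fn main() {
    let hits = Rc::new(Cell::new(0));
    let mut reg = IrqRegistry::default();
    reg.register(11, Box::new(Counter(hits.clone())));
    reg.register(11, Box::new(Counter(hits.clone())));
    reg.dispatch(11);
    assert_eq!(hits.get(), 2); // both handlers on the shared line ran
}
```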
## Paging

4-level page tables: PML4 → PDPT → PD → PT.

- User pages: PML4 entries 0-255 (lower canonical half).
- Kernel pages: PML4 entries 256-511 (upper canonical half, shared across all VSpaces).
- Direct physical map: starts at `PHYS_MAP_OFFSET` (`0xFFFF_8000_0000_0000`), uses 2 MB large pages where possible.
- Identity map: temporary 1:1 mapping used during boot for AP trampoline code. Removed by `clear_boot_identity_map()` after all APs boot.

Page table flags: see VSpace for the complete PTE flag definitions.
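The 9-bits-per-level index split and the direct-map translation can be sketched as follows; only `PHYS_MAP_OFFSET` comes from the text above, the helper names are illustrative:

```rust
/// Base of the direct physical map (first slot of the kernel half).
const PHYS_MAP_OFFSET: u64 = 0xFFFF_8000_0000_0000;

// Each translation level consumes 9 bits of the virtual address:
// PML4 bits 47:39, PDPT 38:30, PD 29:21, PT 20:12.
fn pml4_index(va: u64) -> usize { ((va >> 39) & 0x1FF) as usize }
fn pdpt_index(va: u64) -> usize { ((va >> 30) & 0x1FF) as usize }
fn pd_index(va: u64)   -> usize { ((va >> 21) & 0x1FF) as usize }
fn pt_index(va: u64)   -> usize { ((va >> 12) & 0x1FF) as usize }

/// With the direct map established, any physical address is reachable
/// at a fixed offset — no temporary mappings needed.
fn phys_to_virt(pa: u64) -> u64 { PHYS_MAP_OFFSET + pa }

fn main() {
    // The direct map base lands exactly at PML4 slot 256 (start of the kernel half).
    assert_eq!(pml4_index(PHYS_MAP_OFFSET), 256);
    // The top of the lower canonical half still goes through user slots 0-255.
    assert_eq!(pml4_index(0x0000_7FFF_FFFF_F000), 255);
    // A 2 MB large page covers one PD slot: the PD index advances
    // once per 2 MB of physical address inside the direct map.
    assert_eq!(pd_index(phys_to_virt(2 * 1024 * 1024)) - pd_index(phys_to_virt(0)), 1);
}
```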
## SMP

### AP Trampoline

Application processors start in real mode. The trampoline code (`ap_tramp.S`) performs:

1. Real mode → protected mode (set PE in CR0, load a GDT).
2. Protected mode → long mode (set PAE and PGE in CR4; set LME in EFER; set PG in CR0).
3. Jump to 64-bit kernel code using the BSP's page tables.
4. Initialize per-CPU data (`%gs` segment, TSS, Local APIC).
5. Enter the idle loop.

CPU topology is discovered via the ACPI MADT (Multiple APIC Description Table). Each AP is started with a SIPI (Startup IPI) directed to the trampoline's physical address.
## Syscall Entry

The `syscall` instruction is configured via MSRs:

- `STAR` — segment selectors for kernel/user mode.
- `LSTAR` — kernel entry point address (`syscall_entry` in `syscall.S`).
- `SFMASK` — flags cleared on entry (interrupts disabled).

Entry sequence in `syscall.S`:

1. Hardware saves the user `RCX` (return RIP) and `R11` (return RFLAGS).
2. `swapgs` to the kernel `GS` base (which points at `PerCpuData`).
3. Stash the caller's user RSP in `PerCpuData.saved_rsp` (`%gs:16`).
4. Load the current thread's kernel stack pointer from `PerCpuData.kernel_stack` (`%gs:8`) into `RSP`.
5. Check the syscall number: if 2 (Call) or 3 (ReplyRecv), jump to the fastpath.
6. Otherwise, save all user registers on the kernel stack and call `syscall_handle_rust()`.
7. On return, restore the user RSP from `%gs:16`, `swapgs` back, and execute `sysretq`.
## PerCpuData Layout

`syscall.S` and the rest of the kernel share the layout below. Any change to `PerCpuData` must update the corresponding `%gs:OFFSET` references in `syscall.S`.

| Offset | Field | Purpose |
|---|---|---|
| 0 | | Logical CPU index (BSP = 0). 32-bit load to avoid reading into the next field. |
| 8 | `kernel_stack` | Top-of-stack pointer loaded on syscall entry. |
| 16 | `saved_rsp` | User RSP saved by the syscall stub on entry and restored before `sysretq`. |
| | | Monotonic counter incremented on each … |
| | | Per-CPU stack canary seeded from RDSEED/RDRAND during BSP/AP init. |
| | | Reserved for future growth. |
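A `repr(C)` sketch consistent with the offsets above. `kernel_stack` and `saved_rsp` (and their `%gs` offsets) are documented in this page; the remaining field names and positions are illustrative placeholders.

```rust
use std::mem::offset_of;

/// Sketch of PerCpuData matching the %gs:OFFSET references in syscall.S.
#[repr(C)]
struct PerCpuData {
    cpu_id: u32,        // %gs:0  — read with a 32-bit load
    _pad: u32,          // keeps kernel_stack at offset 8
    kernel_stack: u64,  // %gs:8  — loaded into RSP on syscall entry
    saved_rsp: u64,     // %gs:16 — user RSP stashed by the syscall stub
    counter: u64,       // monotonic counter (placement illustrative)
    stack_canary: u64,  // seeded from RDSEED/RDRAND (placement illustrative)
    _reserved: [u64; 2],
}

fn main() {
    // These offsets are the contract with the hand-written assembly.
    assert_eq!(offset_of!(PerCpuData, cpu_id), 0);
    assert_eq!(offset_of!(PerCpuData, kernel_stack), 8);
    assert_eq!(offset_of!(PerCpuData, saved_rsp), 16);
}
```

Compile-time offset assertions like these are a cheap way to keep a Rust struct and an assembly stub from drifting apart.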
## FPU / SSE / AVX

Eager context switching:

- At boot, `CR0.TS` is cleared (and `CR0.MP` kept set). `CR0.TS` stays at 0 for the lifetime of the kernel, so the `#NM` vector is never taken for FPU faults.
- On every TCB switch, `fpu::switch()` calls XSAVE on the outgoing TCB and XRSTOR on the incoming TCB unconditionally — there is no "fpu_used" fast-exit.
- `flush_current()` / `reload_current()` keep the live register file in sync after `TCB_COPY_FPU` or similar kernel-side mutations.

XSAVE/XRSTOR fall back to FXSAVE/FXRSTOR when CPUID does not advertise XSAVE.

Save area: `XSaveArea` — 832 bytes, 64-byte aligned (legacy x87/SSE region: 512 + XSAVE header: 64 + AVX state: 256).
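The 512 + 64 + 256 breakdown can be checked with a `repr(C, align(64))` sketch; the region names are illustrative, the sizes follow the standard XSAVE layout:

```rust
/// Sketch of the 832-byte save area described above.
#[repr(C, align(64))]
struct XSaveArea {
    legacy: [u8; 512],  // FXSAVE-compatible x87/SSE region
    header: [u8; 64],   // XSAVE header (XSTATE_BV, XCOMP_BV, reserved)
    ymm_hi: [u8; 256],  // high 128 bits of YMM0-YMM15 (AVX component)
}

fn main() {
    // 832 is already a multiple of 64, so align(64) adds no tail padding.
    assert_eq!(std::mem::size_of::<XSaveArea>(), 832);
    assert_eq!(std::mem::align_of::<XSaveArea>(), 64);
}
```

The 64-byte alignment is not optional: XSAVE/XRSTOR fault with #GP if the save area is misaligned.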
## SMAP / SMEP

If CPUID indicates support:

- SMEP (Supervisor Mode Execution Prevention): prevents the kernel from executing user pages. Enabled via `CR4.SMEP`.
- SMAP (Supervisor Mode Access Prevention): prevents the kernel from reading/writing user pages unless explicitly allowed. Enabled via `CR4.SMAP`. The `stac` instruction temporarily opens user access; `clac` closes it.

The `uaccess` module provides controlled user memory access windows.
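One common way to keep `stac`/`clac` windows balanced is an RAII guard. This is a hypothetical sketch with a stand-in AC flag, not the kernel's actual uaccess API:

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for the CPU's AC flag, which stac sets and clac clears.
    static AC_FLAG: Cell<bool> = Cell::new(false);
}

fn stac() { AC_FLAG.with(|f| f.set(true)); }  // stub for the stac instruction
fn clac() { AC_FLAG.with(|f| f.set(false)); } // stub for the clac instruction

/// Hypothetical RAII guard: user memory is accessible only while the guard
/// lives, so an early return cannot leave the access window open.
struct UserAccess;

impl UserAccess {
    fn open() -> UserAccess {
        stac();
        UserAccess
    }
}

impl Drop for UserAccess {
    fn drop(&mut self) {
        clac();
    }
}

fn main() {
    assert!(!AC_FLAG.with(|f| f.get()));
    {
        let _window = UserAccess::open();
        assert!(AC_FLAG.with(|f| f.get())); // user pages accessible here
    }
    assert!(!AC_FLAG.with(|f| f.get())); // window closed on drop
}
```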
## Related Pages

- aarch64 — the other supported architecture
- Architecture — architecture abstraction layer
- Memory Layout — address space split and direct physical map
- Boot Sequence — initialization order