# Threads, TLS, and Worker Pool
Three closely-related substrate modules make up the threading infrastructure:
| Module | Lines | Role |
|---|---|---|
| `substrate/tls.rs` | 788 | The thread descriptor pool, `ThreadLocalBlock`, and static TLS initialization |
| `substrate/worker.rs` | 1,094 | A personality-neutral worker pool and the `run_workers()` event loop |
| `substrate/pending.rs` | 206 | The `PendingRequest` table for deferred replies |
Together they are what makes multi-threaded servers possible. This page walks through them in that order.
## `ThreadDesc` — the pool
Every substrate-managed thread is represented by a ThreadDesc struct in a fixed-size per-process pool:
```rust
pub const MAX_THREADS: usize = 64;
```
The maximum is shared with the pthread layer — `pthread_create` cannot create more than 64 live threads per process, because the pool would overflow.
The pool is statically allocated inside substrate/tls.rs, so no heap is required.
Each ThreadDesc carries:
| Field | Purpose |
|---|---|
| `state` | Atomic state (`TD_FREE` → `TD_LIVE` → `TD_EXITED` → `TD_REAPING`) driving the slot lifecycle |
| `generation` | Incremented every time the slot is reused. The `(pool_index, generation)` pair is what `pthread_self()` packs into a `pthread_t`. |
| `ipc_context` | Pointer to this thread's IPC context and buffer |
| `tcb_cap` | Capability slot for the kernel TCB that backs this thread |
| `stack_base` / `stack_size` | The stack region allocated for this thread, used at thread exit to unmap it |
| `owner` | A tag recording which personality created the thread (POSIX, worker, Win32, bare runtime) |
| `personality_data` | An opaque pointer where the owning personality hangs extended state |
The owner tag lets substrate dispatch thread exit differently depending on who created the thread.
POSIX threads go through pthread_exit; worker threads go through WorkerLoopControl::Exit; Win32 threads go through the future trona_win32 exit path; bare-runtime threads just die.
The personality_data pointer is how POSIX threads add their own extended state — PosixThreadExt (detached flag, procmgr tid, cleanup handler stack, cancellation state) lives at that pointer, allocated from the stack of the thread-creating call.
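The four-way exit dispatch on the owner tag can be sketched as follows. This is a minimal illustration with assumed names (`ThreadOwner`, `exit_path_name`); only the four-way split itself comes from the text above.

```rust
// Hypothetical sketch of owner-tag dispatch at thread exit.
// `ThreadOwner` and `exit_path_name` are assumed names for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ThreadOwner {
    Posix,  // exits through pthread_exit
    Worker, // exits through WorkerLoopControl::Exit
    Win32,  // future trona_win32 exit path
    Bare,   // bare-runtime threads just die
}

fn exit_path_name(owner: ThreadOwner) -> &'static str {
    match owner {
        ThreadOwner::Posix => "pthread_exit",
        ThreadOwner::Worker => "WorkerLoopControl::Exit",
        ThreadOwner::Win32 => "trona_win32 exit",
        ThreadOwner::Bare => "plain thread teardown",
    }
}
```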
## `ThreadLocalBlock` — the per-thread static state
Each thread has one ThreadLocalBlock, allocated at thread creation time and accessed through the thread pointer (FS base on x86_64, tpidr_el0 on aarch64).
The block layout is fixed and known to the compiler so that #[thread_local] variables land at predictable offsets.
Key fields:
| Field | Meaning |
|---|---|
| self pointer | A pointer that equals the thread pointer itself, as the TLS ABI expects at the base of the block |
| `errno` | Per-thread `errno` value |
| `ipc_context` | Per-thread `IpcContext`, reached through `current_ipc_ctx()` |
| thread index | Index into the `ThreadDesc` pool for this thread |
| signal mask | Current blocked-signal mask used by `sigprocmask` and signal delivery |
| cancel state | Deferred / asynchronous / disabled — used by `pthread_setcanceltype` and `pthread_setcancelstate` |
| cancel pending | Set by `pthread_cancel` to request cancellation at the next cancellation point |
| `tls_base` | Pointer to the static TLS block (the per-thread copy of the `PT_TLS` template) |
Total size is small — around 256 bytes — so every thread pays a modest fixed allocation for TLS.
## Static TLS initialization
ELF static TLS is handled entirely outside substrate: rtld parses PT_TLS segments during image load, computes the static TLS layout, and writes the result into the _trona_tls* weak symbols documented in substrate Overview.
Substrate’s TLS module then reads those symbols at thread creation time.
The sequence is:
- At rtld time, `ld-trona.so` computes the per-module TLS offsets (x86_64 uses negative offsets from TP; aarch64 uses positive offsets plus a 16-byte TP header) and writes `trona_tls_template`, `trona_tls_filesz`, `trona_tls_memsz`, `trona_tls_align`, `trona_tls_module_count`, and `trona_tls_modules[]`.
- At thread creation time, `substrate/tls.rs` allocates a new stack + TLS block and calls `initialize_static_tls_for_tp(tp)`. This function copies `[trona_tls_template, trona_tls_template + __trona_tls_filesz)` into the thread's static TLS area, zero-fills `[filesz, memsz)`, and writes the thread pointer.
- On thread entry (`tcb_resume` with the new TLS base), the kernel loads the thread pointer into the appropriate register and starts executing at the user-specified entry point.
On aarch64 there is an extra wrinkle: the first 16 bytes below TP hold a TLS header ({tls_base, reserved}) that the _tlsdesc_static_resolver PLT resolver walks during dynamic TLS access.
Substrate sets up those 16 bytes as part of initialize_static_tls_for_tp; the x86_64 path does not have this header.
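The copy-and-zero step of the second stage can be illustrated in isolation. This is a simplified, freestanding sketch (plain slices instead of the real `_trona_tls*` symbols and raw thread-pointer writes):

```rust
/// Simplified model of static-TLS initialization: copy `filesz` bytes
/// from the template, zero-fill the remainder up to `memsz`.
/// The real code in substrate/tls.rs works on raw memory behind the
/// thread pointer; this sketch returns a Vec for clarity.
fn init_static_tls(template: &[u8], filesz: usize, memsz: usize) -> Vec<u8> {
    assert!(filesz <= template.len());
    assert!(filesz <= memsz);
    let mut tls = vec![0u8; memsz];
    // Copy the initialized portion, i.e. the .tdata image.
    tls[..filesz].copy_from_slice(&template[..filesz]);
    // The tail [filesz, memsz) stays zero, i.e. the .tbss part.
    tls
}
```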
## Thread identity
The current thread can be looked up through three accessors:
```rust
pub fn current_ipc_ctx() -> *mut IpcContext;
pub unsafe fn current_thread_desc() -> *mut ThreadDesc;
pub fn current_thread_index() -> u64;
```
current_ipc_ctx() is the most frequently called — every IPC primitive in IPC starts with it.
It walks the thread pointer → ThreadLocalBlock → ipc_context, falling back to the global __trona_ipc_ctx if TLS has not yet been initialized (during very early CRT startup).
current_thread_index() returns the zero-based index in the thread pool and is the cheapest way to identify the current thread.
pthread_self() uses it to build the (pool_index, generation) pthread_t handle.
current_thread_desc() is unsafe because the caller must guarantee the TLS block has been set up — calling it before initialize_static_tls_for_tp has run yields a null pointer.
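The (pool_index, generation) handle mentioned above can be packed into a single integer. The exact bit layout trona_posix uses is not specified here, so the 32/32 split below is an assumption for illustration:

```rust
// Hypothetical (pool_index, generation) handle packing; a 32/32 bit
// split is assumed. The generation guards against a pool slot having
// been reused for a different thread since the handle was issued.
fn make_handle(pool_index: u32, generation: u32) -> u64 {
    ((generation as u64) << 32) | pool_index as u64
}

fn handle_index(handle: u64) -> u32 {
    (handle & 0xffff_ffff) as u32
}

fn handle_generation(handle: u64) -> u32 {
    (handle >> 32) as u32
}
```

A stale handle is detected by comparing `handle_generation` against the slot's current generation before touching the slot.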
## `allocate_thread()` and fork handling
Creating a new thread is a two-step process from the substrate side:
- **Reserve a pool slot.** `allocate_thread(owner)` scans the `ThreadDesc` pool for a `TD_FREE` slot and CAS-transitions it to `TD_LIVE`. If the pool is full, it returns `None` and the caller propagates the failure.
- **Initialize the slot.** The caller fills in `stack_base`, `stack_size`, `tcb_cap`, `ipc_context.ipc_buffer`, and `personality_data`, then calls the personality-specific entry path (e.g. `PM_THREAD_CREATE` for POSIX threads).
Thread exit runs in reverse: the personality layer marks the slot TD_EXITED, the reaping thread transitions TD_EXITED → TD_REAPING → TD_FREE (bumping generation in the process), and the stack is unmapped through posix_munmap.
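The reserve/reap lifecycle can be modeled with plain atomics. This is a self-contained sketch with assumed field names, not the real substrate/tls.rs code:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Slot states mirroring the lifecycle described in the text.
const TD_FREE: u32 = 0;
const TD_LIVE: u32 = 1;
const TD_EXITED: u32 = 2;
const TD_REAPING: u32 = 3;

const MAX_THREADS: usize = 64;

struct ThreadSlot {
    state: AtomicU32,
    generation: AtomicU32,
}

struct ThreadPool {
    slots: [ThreadSlot; MAX_THREADS],
}

impl ThreadPool {
    fn new() -> Self {
        ThreadPool {
            slots: std::array::from_fn(|_| ThreadSlot {
                state: AtomicU32::new(TD_FREE),
                generation: AtomicU32::new(0),
            }),
        }
    }

    /// allocate_thread: scan for a TD_FREE slot, CAS it to TD_LIVE.
    /// Returns the slot index, or None when the pool is full.
    fn allocate(&self) -> Option<usize> {
        for (i, slot) in self.slots.iter().enumerate() {
            if slot
                .state
                .compare_exchange(TD_FREE, TD_LIVE, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return Some(i);
            }
        }
        None
    }

    /// Reaping: TD_EXITED -> TD_REAPING -> TD_FREE, bumping the
    /// generation so stale (index, generation) handles stop matching.
    fn reap(&self, i: usize) {
        let slot = &self.slots[i];
        slot.state.store(TD_EXITED, Ordering::Release);
        slot.state.store(TD_REAPING, Ordering::Release);
        slot.generation.fetch_add(1, Ordering::AcqRel);
        slot.state.store(TD_FREE, Ordering::Release);
    }
}
```

The CAS in `allocate` is what makes concurrent thread creation safe: two racing callers can never claim the same slot.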
There is also a post_fork_child hook that the x86_64 and aarch64 fork.S trampolines call immediately after a successful fork.
Its job is to reinitialize the child’s thread pool (every thread except the forking one is dead in the child) and reset per-process counters — the forking thread stays alive but needs a fresh ThreadDesc index in the new process.
See fork.S and Linker Scripts for how the trampoline reaches this hook.
## Worker pool — `substrate/worker.rs`
Every SaltyOS server — VFS, procmgr, mmsrv, namesrv, dnssrv, SaltyFS, the network stack — runs on top of the same run_workers() event loop from substrate/worker.rs.
The loop is personality-neutral: it knows about TronaMsg, endpoints, and notifications, and nothing about POSIX or Win32.
### WorkerConfig
A server declares its worker pool through a WorkerConfig struct:
```rust
pub struct WorkerConfig {
    pub worker_count: u32,     // including worker #0
    pub endpoints: *const Cap, // endpoints every worker receives on
    pub endpoint_count: usize, // length of the endpoints array
    pub untyped: Cap,          // source untyped for TCB / stack frame allocation
    pub self_tcb: Cap,         // the main thread's TCB (becomes worker #0)
    pub self_sc: Cap,          // the main thread's sched context
    pub stack_pages: u64,      // pages per worker stack
    pub pool_budget_us: u64,   // optional shared sched budget
    pub pool_period_us: u64,   // optional shared sched period
    pub cspace_depth: u8,
}
```
Setting worker_count = 4 with two endpoints and sixteen-page stacks is typical for a medium-traffic server.
worker_count = 1 is valid — it turns run_workers into a plain single-threaded event loop on the calling thread.
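The "medium-traffic" shape can be sketched concretely. This is illustrative only: `Cap` is stubbed as a plain `u64` slot number, the capability values are placeholders, and treating 0 as "no explicit budget/period" is an assumed convention.

```rust
// Illustrative stub: real code gets `Cap` and `WorkerConfig` from substrate.
type Cap = u64;

pub struct WorkerConfig {
    pub worker_count: u32,
    pub endpoints: *const Cap,
    pub endpoint_count: usize,
    pub untyped: Cap,
    pub self_tcb: Cap,
    pub self_sc: Cap,
    pub stack_pages: u64,
    pub pool_budget_us: u64,
    pub pool_period_us: u64,
    pub cspace_depth: u8,
}

fn medium_traffic_config(endpoints: &[Cap]) -> WorkerConfig {
    WorkerConfig {
        worker_count: 4, // worker #0 is the calling thread
        endpoints: endpoints.as_ptr(),
        endpoint_count: endpoints.len(),
        untyped: 10,  // placeholder capability slot
        self_tcb: 11, // placeholder capability slot
        self_sc: 12,  // placeholder capability slot
        stack_pages: 16, // sixteen-page stacks
        pool_budget_us: 0, // assumed: 0 = no explicit budget
        pool_period_us: 0, // assumed: 0 = no explicit period
        cspace_depth: 64,  // placeholder depth
    }
}
```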
### The handler signature
Every worker calls back into a single user-provided handler:
```rust
type WorkerHandler = fn(
    in_msg: *const TronaMsg,
    badge: u64,
    source: u64,
    reply: *mut TronaMsg,
) -> WorkerLoopControl;
```
`WorkerLoopControl` is a three-valued enum:

- `Reply` — send the contents of `reply` back to the caller, then wait for the next request.
- `NoReply` — skip the reply; jump straight to the next receive. Used when the handler decides to defer the response (for example, adding the request to the `PendingRequest` table for later completion).
- `Exit` — the worker is done. Worker #0 must never return `Exit` — if it does, substrate falls into an infinite yield loop because the main thread has no stack to unwind to.
The handler is the only server-specific code; everything else — thread startup, endpoint receive, reply-recv transitions, exit bookkeeping — lives in substrate.
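A minimal handler might look like the following sketch. `TronaMsg` and `WorkerLoopControl` are stubbed locally, and the request labels (`REQ_PING`, `REQ_SLOW_OP`) are hypothetical; the real types come from substrate and the labels from the server's own protocol.

```rust
// Stub of substrate's message type for this self-contained example.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct TronaMsg {
    label: u64,
    args: [u64; 4],
}

#[derive(Debug, PartialEq)]
enum WorkerLoopControl {
    Reply,
    NoReply,
    Exit,
}

// Hypothetical request labels for illustration.
const REQ_PING: u64 = 1;
const REQ_SLOW_OP: u64 = 2;

fn handler(
    in_msg: *const TronaMsg,
    badge: u64,
    _source: u64,
    reply: *mut TronaMsg,
) -> WorkerLoopControl {
    // Safety: the worker loop guarantees both pointers are valid
    // for the duration of the call.
    let msg = unsafe { &*in_msg };
    let out = unsafe { &mut *reply };
    match msg.label {
        REQ_PING => {
            // Immediate answer: fill `reply`; reply_recv sends it.
            out.label = 0;       // success
            out.args[0] = badge; // echo who asked
            WorkerLoopControl::Reply
        }
        REQ_SLOW_OP => {
            // Would park the request in the PendingRequest table here
            // and answer later through the saved reply cap.
            WorkerLoopControl::NoReply
        }
        _ => {
            out.label = u64::MAX; // unknown request
            WorkerLoopControl::Reply
        }
    }
}
```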
### Run loop structure
run_workers() does three things:
- Spawn worker threads 1 through `worker_count - 1`. Each gets its own stack from a freshly-retyped frame, a new TCB, and enters the shared event loop.
- Convert the calling thread into worker #0 and enter the same event loop.
- Never return. Worker #0 does not have a "done" state; when the server shuts down, the process as a whole exits.
Inside the loop, each worker does:
```rust
loop {
    // First iteration: plain receive. After that: reply_recv.
    if first_iteration {
        recv_any_ctx(ctx, endpoints, count, &msg, &badge, &source);
    } else {
        reply_recv_any_ctx(ctx, endpoints, count, &reply, &msg, &badge, &source);
    }
    let control = handler(&msg, badge, source, &mut reply);
    match control {
        Reply => {
            // Loop back — reply_recv will send `reply` and block for next.
        }
        NoReply => {
            // Set reply = empty, loop back — the "reply" will be a no-op.
        }
        Exit => {
            if worker_id == 0 { yield_forever(); }
            else { reap_self(); }
        }
    }
}
```
Notice that recv_any_ctx is only used for the first iteration — subsequent iterations go through reply_recv_any_ctx, which is the fastpath-eligible primitive.
This is a measurable win for server throughput: a worker that is busy handling requests never pays the slowpath for recv.
### FIFO dispatch
The kernel handles dispatch across multiple workers by waking the longest-waiting (FIFO-first) waiter on each endpoint.
When a request arrives on an endpoint, the kernel picks the oldest waiter currently blocked in reply_recv_any on that endpoint and hands the request to that worker.
This gives fair throughput across workers without any userspace scheduling logic.
## Pending request table — `substrate/pending.rs`
Long-running server operations cannot hold a worker hostage.
If a VFS read has to wait on disk I/O for 10ms, tying up a worker for that whole time would cause head-of-line blocking for every other request on that endpoint.
The solution is the PendingRequest table.
A server that accepts a request but cannot respond immediately:
- Stores a `PendingRequest { request_id, wait_reason, reply_cap, client_badge, … }` in the `pending.rs` table.
- Saves the client's reply capability via `cnode_save_caller` so it can reply later even after handling other requests.
- Returns `WorkerLoopControl::NoReply` from the worker handler.
- When the underlying resource becomes ready, the server looks up the pending record by `request_id`, builds a reply, invokes the saved reply cap, and removes the record.
pending.rs is just 206 lines — it is a small fixed-size table with a spinlock around it, plus helpers for allocation, lookup, and removal.
The table is small (tens of entries per server) because real workloads rarely have more than a few tens of concurrent in-flight async operations.
The key contract is that PendingRequest.client_badge matches the original caller badge, so the server can double-check on reply that it is responding to the right client — catching the race where a client exits and its reply cap becomes stale.
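The shape of such a table can be sketched as a fixed-size array behind a lock. Field names are assumptions based on the description, `reply_cap` is stubbed as a `u64`, and the real pending.rs uses a spinlock rather than a `Mutex`:

```rust
use std::sync::Mutex;

// Sketch of a fixed-size pending-request table with assumed field names.
#[derive(Clone, Copy)]
struct PendingRequest {
    request_id: u64,
    client_badge: u64,
    reply_cap: u64, // stubbed; really a saved reply capability slot
}

struct PendingTable {
    // Tens of entries suffice for realistic in-flight async workloads.
    slots: Mutex<[Option<PendingRequest>; 32]>,
}

impl PendingTable {
    fn new() -> Self {
        PendingTable { slots: Mutex::new([None; 32]) }
    }

    /// Park a request; returns false when the table is full.
    fn insert(&self, req: PendingRequest) -> bool {
        let mut slots = self.slots.lock().unwrap();
        for slot in slots.iter_mut() {
            if slot.is_none() {
                *slot = Some(req);
                return true;
            }
        }
        false
    }

    /// Complete a request: remove it by id, but only when the badge
    /// matches, catching the stale-reply race described above.
    fn take(&self, request_id: u64, client_badge: u64) -> Option<PendingRequest> {
        let mut slots = self.slots.lock().unwrap();
        for slot in slots.iter_mut() {
            if let Some(req) = *slot {
                if req.request_id == request_id && req.client_badge == client_badge {
                    *slot = None;
                    return Some(req);
                }
            }
        }
        None
    }
}
```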
## Why this all lives in substrate
All three modules sit at the substrate layer rather than in trona_posix or in individual servers because:
- **Worker pools precede POSIX.** Several servers (init, namesrv, early rsrcsrv) run before the POSIX personality is wired up. They still need a worker loop.
- **TLS precedes everything.** `current_ipc_ctx()` is on the hot path of every syscall, so it has to live in the lowest layer.
- **Fork needs a neutral hook.** The fork trampoline is assembly in `posix/arch/` but the post-fork reset runs in Rust — and substrate is the lowest layer that Rust code from `fork.S` can call into.
By putting the infrastructure in substrate, trona_posix’s pthread layer, the future trona_win32 thread shim, and every bare Rust server all get the same building blocks without each reinventing the thread pool.
## Related pages
- Synchronization Primitives — the `Mutex`/`RWLock`/`Condvar` used inside this thread infrastructure and consumed by its users.
- IPC — the `recv_any_ctx`/`reply_recv_any_ctx` primitives the worker loop calls.
- POSIX Threads — the trona_posix layer that wraps the `ThreadDesc` pool with pthread semantics.
- fork.S and Linker Scripts — the arch-specific entry into `post_fork_child`.