Regular Expressions
lib/basalt/c/src/regex.rs (~900 lines) is a self-contained POSIX regular expression engine.
It implements regcomp, regexec, regerror, regfree, supports both Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE), handles capture groups, and is pure Rust with no external dependencies.
This page covers the supported syntax, the compilation and matching strategy, the error reporting, and the GNU extensions that are intentionally not implemented.
API Surface
typedef struct {
size_t re_nsub; // number of capture groups
void *priv; // opaque internal state
} regex_t;
typedef struct {
regoff_t rm_so; // start offset (-1 for unmatched group)
regoff_t rm_eo; // end offset
} regmatch_t;
int regcomp(regex_t *preg, const char *pattern, int cflags);
int regexec(const regex_t *preg, const char *string,
size_t nmatch, regmatch_t pmatch[], int eflags);
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
void regfree(regex_t *preg);
The regex_t struct is opaque to user code; basaltc declares it as a small fixed-size struct in <regex.h> and stores its real state in a heap-allocated RegexState referenced by the priv pointer.
Compilation Flags
| Flag | Meaning |
|---|---|
|
Use extended regular expressions (ERE). Default is basic (BRE). |
|
Case-insensitive match. Letter classes are widened on compile. |
|
|
|
Do not record capture group offsets. Compile is faster and |
regcomp returns 0 on success or a REG_* error code on failure.
Execution Flags
| Flag | Meaning |
|---|---|
|
The match starts at a position where |
|
|
BRE vs ERE
Basic Regular Expressions (BRE) is the older syntax used by grep, sed, and vi.
Extended Regular Expressions (ERE) is the syntax used by egrep, awk, and most modern regex libraries.
The differences:
| Construct | BRE | ERE |
|---|---|---|
Alternation |
|
|
Grouping |
|
|
Repetition once or more |
|
|
Repetition zero or one |
|
|
Repetition count |
|
|
Backreference |
|
|
* (zero or more) is the same in both modes.
basaltc compiles BRE and ERE through the same parser; the cflag toggles which characters are metacharacters.
Supported Metacharacters
POSIX character classes inside […]:
-
[:alpha:],[:upper:],[:lower:]— letters -
[:digit:],[:xdigit:]— digits / hex digits -
[:alnum:]— alphanumeric -
[:space:],[:blank:]— whitespace -
[:punct:],[:graph:],[:print:]— punctuation / printing -
[:cntrl:]— control characters
Each POSIX class consults the same rune table as <ctype.h> (see Locale, ctype and wchar).
Because basaltc is C-locale-only, the classes match only the ASCII subsets.
Compilation Strategy
regcomp parses the pattern into an intermediate AST, then compiles the AST into a small bytecode for the matcher to execute.
The bytecode is stored inside the heap-allocated RegexState referenced by regex_t::priv.
Steps:
-
Tokenize — walk the pattern, classifying each byte as a metacharacter or literal based on the BRE/ERE flag.
-
Parse — build an AST of nodes (
Char,AnyChar,Anchor,Repeat,Group,Alternation,CharClass,Backref). -
Compile — emit a flat bytecode array. Branches use forward jumps that are patched after the destination is known.
-
Allocate
RegexStatewith the bytecode, the capture group count, and the cflag. -
Set
regex_t::re_nsubto the number of(constructs encountered. -
Set
regex_t::privto theRegexStatepointer.
The implementation is hand-written; there is no parser generator or external regex crate.
Matching Strategy
regexec runs an NFA simulation against the bytecode.
At each input position, it maintains a set of active states (the "thread set") and steps each one byte by byte.
When a thread reaches an Accept instruction, it records the match span; the matcher continues looking for the leftmost-longest match (POSIX semantics).
Capture groups are tracked per-thread: each thread carries an array of (start, end) offsets, one per group.
At the end of matching, the winning thread’s offsets are copied into the user’s pmatch array (up to nmatch entries).
For patterns without backreferences, the NFA simulation is O(N×M) where N is the input length and M is the pattern length. Backreferences require backtracking and can be exponential in the worst case — basaltc does not have a guard against pathological patterns; ports that compile untrusted patterns should add their own timeout.
Error Reporting
regerror(errcode, preg, errbuf, errbuf_size) writes a human-readable error message into errbuf and returns the length the message would have had if errbuf_size were unlimited:
| Error code | Meaning |
|---|---|
|
Invalid |
|
Generic syntax error |
|
Repetition without a target (e.g., |
|
Mismatched |
|
Mismatched |
|
Invalid collating element (basaltc does not support collation, so |
|
|
|
Trailing backslash |
|
Mismatched |
|
Invalid range in |
|
Out of memory |
|
Invalid backreference (e.g., |
|
|
regfree
regfree(preg) releases the heap memory pointed to by preg→priv and zeroes the struct.
It must be called for every successful regcomp to avoid leaking the bytecode and the parser scratch space.
GNU Extensions Not Supported
-
\d/\w/\sand their uppercase complements — Perl/PCRE syntax. basaltc accepts the escape but treats it as the literal letter (so\dmatches the letterd, not a digit). -
Lookahead / lookbehind (
(?=…),(?⇐…), etc.) — basaltc rejects them as unsupported. -
Non-capturing groups (
(?:…)) — basaltc rejects them. -
Possessive quantifiers (
*`, `+) — not supported. -
Named captures (
(?<name>…),(?P<name>…)) — not supported. -
Unicode property classes (
\p{Letter}) — not supported.
The intent is to provide POSIX BRE/ERE coverage at the level needed for grep/sed/awk ports.
Software requiring PCRE features should link against a PCRE port instead of using the basaltc regex API.
Thread Safety
regcomp, regexec, regerror, regfree are all reentrant — they take a regex_t pointer as the first argument and operate only on that struct.
A single regex_t cannot be shared across threads (the matcher mutates the thread set during execution); compile separate regex_t instances per thread, or wrap calls in an external lock.
regerror writes only into the caller-supplied buffer and is fully reentrant.
Related Pages
-
Strings and Memory —
strstrandstrpbrkare simpler patterns that may be enough -
Locale, ctype and wchar — POSIX classes use the rune table
-
trona Boundary — regex is in the "pure Rust" bucket; no external dependencies