Regular Expressions

lib/basalt/c/src/regex.rs (~900 lines) is a self-contained POSIX regular expression engine. It implements regcomp, regexec, regerror, regfree, supports both Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE), handles capture groups, and is pure Rust with no external dependencies. This page covers the supported syntax, the compilation and matching strategy, the error reporting, and the GNU extensions that are intentionally not implemented.

API Surface

typedef struct {
    size_t  re_nsub;     // number of capture groups
    void   *priv;        // opaque internal state
} regex_t;

typedef struct {
    regoff_t rm_so;      // start offset (-1 for unmatched group)
    regoff_t rm_eo;      // end offset
} regmatch_t;

int    regcomp(regex_t *preg, const char *pattern, int cflags);
int    regexec(const regex_t *preg, const char *string,
               size_t nmatch, regmatch_t pmatch[], int eflags);
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
void   regfree(regex_t *preg);

The regex_t struct is opaque to user code; basaltc declares it as a small fixed-size struct in <regex.h> and stores its real state in a heap-allocated RegexState referenced by the priv pointer.

Compilation Flags

Flag Meaning

REG_EXTENDED

Use extended regular expressions (ERE). Default is basic (BRE).

REG_ICASE

Case-insensitive match. Letter classes are widened on compile.

REG_NEWLINE

. does not match newline; ^ and $ match around newlines, not just at string boundaries.

REG_NOSUB

Do not record capture group offsets. Compile is faster and regexec’s `nmatch/pmatch arguments are ignored.

regcomp returns 0 on success or a REG_* error code on failure.

Execution Flags

Flag Meaning

REG_NOTBOL

The match starts at a position where ^ should not match (typically used when matching mid-string).

REG_NOTEOL

$ does not match at the end of string.

BRE vs ERE

Basic Regular Expressions (BRE) is the older syntax used by grep, sed, and vi. Extended Regular Expressions (ERE) is the syntax used by egrep, awk, and most modern regex libraries. The differences:

Construct BRE ERE

Alternation

| (escaped) — though POSIX BRE does not formally support alternation; basaltc accepts the GNU-style escape

| (literal)

Grouping

\( \)

( )

Repetition once or more

\+ (GNU extension; basaltc accepts it)

+

Repetition zero or one

\? (GNU extension; basaltc accepts it)

?

Repetition count

\{m,n\}

\{m,n\}

Backreference

\1 through \9

\1 through \9 (POSIX leaves this implementation-defined; basaltc supports them in both modes)

* (zero or more) is the same in both modes.

basaltc compiles BRE and ERE through the same parser; the cflag toggles which characters are metacharacters.

Supported Metacharacters

Token Meaning

.

Any single character (any byte; or any byte except newline if REG_NEWLINE)

^

Start of string (or start of line with REG_NEWLINE)

$

End of string (or end of line with REG_NEWLINE)

*

Zero or more of the previous atom

+ (ERE)

One or more of the previous atom

? (ERE)

Zero or one of the previous atom

{m}, \{m,\}, \{m,n\} (ERE)

Counted repetition

[…​]

Character class. Supports ranges ([a-z]), negation ([^abc]), POSIX classes (, , , etc.)

(…​) (ERE)

Capture group

\(…​\) (BRE)

Capture group

\1 to \9

Backreference to capture group N

\\, \., \*, etc.

Literal escape

\b, \B

GNU word boundary extensions — accepted by basaltc

\<, \>

GNU word-start / word-end extensions — accepted by basaltc

\w, \W, \d, \D, \s, \S

Perl-style shorthand classes — not supported, treated as literal escaped letters

POSIX character classes inside […​]:

  • [:alpha:], [:upper:], [:lower:] — letters

  • [:digit:], [:xdigit:] — digits / hex digits

  • [:alnum:] — alphanumeric

  • [:space:], [:blank:] — whitespace

  • [:punct:], [:graph:], [:print:] — punctuation / printing

  • [:cntrl:] — control characters

Each POSIX class consults the same rune table as <ctype.h> (see Locale, ctype and wchar). Because basaltc is C-locale-only, the classes match only the ASCII subsets.

Compilation Strategy

regcomp parses the pattern into an intermediate AST, then compiles the AST into a small bytecode for the matcher to execute. The bytecode is stored inside the heap-allocated RegexState referenced by regex_t::priv.

Steps:

  1. Tokenize — walk the pattern, classifying each byte as a metacharacter or literal based on the BRE/ERE flag.

  2. Parse — build an AST of nodes (Char, AnyChar, Anchor, Repeat, Group, Alternation, CharClass, Backref).

  3. Compile — emit a flat bytecode array. Branches use forward jumps that are patched after the destination is known.

  4. Allocate RegexState with the bytecode, the capture group count, and the cflag.

  5. Set regex_t::re_nsub to the number of ( constructs encountered.

  6. Set regex_t::priv to the RegexState pointer.

The implementation is hand-written; there is no parser generator or external regex crate.

Matching Strategy

regexec runs an NFA simulation against the bytecode. At each input position, it maintains a set of active states (the "thread set") and steps each one byte by byte. When a thread reaches an Accept instruction, it records the match span; the matcher continues looking for the leftmost-longest match (POSIX semantics).

Capture groups are tracked per-thread: each thread carries an array of (start, end) offsets, one per group. At the end of matching, the winning thread’s offsets are copied into the user’s pmatch array (up to nmatch entries).

For patterns without backreferences, the NFA simulation is O(N×M) where N is the input length and M is the pattern length. Backreferences require backtracking and can be exponential in the worst case — basaltc does not have a guard against pathological patterns; ports that compile untrusted patterns should add their own timeout.

Error Reporting

regerror(errcode, preg, errbuf, errbuf_size) writes a human-readable error message into errbuf and returns the length the message would have had if errbuf_size were unlimited:

Error code Meaning

REG_BADBR

Invalid \{m,n\} count

REG_BADPAT

Generic syntax error

REG_BADRPT

Repetition without a target (e.g., * at start of pattern)

REG_EBRACE

Mismatched {/}

REG_EBRACK

Mismatched [/]

REG_ECOLLATE

Invalid collating element (basaltc does not support collation, so [[.ch.]] returns this)

REG_ECTYPE

Invalid POSIX class name (e.g., )

REG_EESCAPE

Trailing backslash

REG_EPAREN

Mismatched (/)

REG_ERANGE

Invalid range in […​] (e.g., [z-a])

REG_ESPACE

Out of memory

REG_ESUBREG

Invalid backreference (e.g., \3 when there are only 2 groups)

REG_NOMATCH

regexec returned no match (not really an error, but a status code)

regfree

regfree(preg) releases the heap memory pointed to by preg→priv and zeroes the struct. It must be called for every successful regcomp to avoid leaking the bytecode and the parser scratch space.

GNU Extensions Not Supported

  • \d / \w / \s and their uppercase complements — Perl/PCRE syntax. basaltc accepts the escape but treats it as the literal letter (so \d matches the letter d, not a digit).

  • Lookahead / lookbehind ((?=…​), (?⇐…​), etc.) — basaltc rejects them as unsupported.

  • Non-capturing groups ((?:…​)) — basaltc rejects them.

  • Possessive quantifiers (*`, `+) — not supported.

  • Named captures ((?<name>…​), (?P<name>…​)) — not supported.

  • Unicode property classes (\p{Letter}) — not supported.

The intent is to provide POSIX BRE/ERE coverage at the level needed for grep/sed/awk ports. Software requiring PCRE features should link against a PCRE port instead of using the basaltc regex API.

Thread Safety

regcomp, regexec, regerror, regfree are all reentrant — they take a regex_t pointer as the first argument and operate only on that struct. A single regex_t cannot be shared across threads (the matcher mutates the thread set during execution); compile separate regex_t instances per thread, or wrap calls in an external lock.

regerror writes only into the caller-supplied buffer and is fully reentrant.