Testing¶
Accepted.
Compiler phases are black-box testable through deterministic dumps:
- lexer snapshots use .tokens.snap
- parser snapshots use .ast.snap
- structured diagnostics use .diag.snap
- rendered diagnostics use .stderr.snap
- IR snapshots use .ir.snap
- C backend snapshots use .c.snap
- interpreter execution snapshots use .run.snap when execution output or trap reports are part of the contract
Snapshots are plain text. Output ordering must be deterministic. Behavior changes should update or add tests in the narrowest relevant phase.
Phase responsibilities and snapshot ownership are described in Phase Boundaries.
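The suffix convention above can be sketched as a small lookup. This is only an illustration: the dictionary keys, the `snapshot_path` helper, and the assumption that a snapshot sits next to its fixture file are hypothetical, not something this document specifies.

```python
from pathlib import PurePosixPath

# Suffixes come from the list above; the keys are illustrative output names.
SNAPSHOT_SUFFIXES = {
    "tokens": ".tokens.snap",
    "ast": ".ast.snap",
    "diag": ".diag.snap",
    "stderr": ".stderr.snap",
    "ir": ".ir.snap",
    "c": ".c.snap",
    "run": ".run.snap",
}


def snapshot_path(fixture: str, output: str) -> PurePosixPath:
    """Return the snapshot path next to a fixture (assumed layout)."""
    source = PurePosixPath(fixture)
    # Drop the fixture's own extension, then append the phase suffix.
    return source.with_name(source.stem + SNAPSHOT_SUFFIXES[output])


print(snapshot_path("diagnostics/parser-recovery.ct", "diag"))
```

Keeping the mapping in one table means a new phase dump needs one new entry rather than a new naming rule.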
Public Contract¶
Snapshot output is part of the public contract. Any intentional behavior change should update the relevant snapshot in the narrowest phase.
Test Strategy¶
Compiler behavior should be verified through the narrowest public surface that proves the rule:
- phase unit tests cover small deterministic algorithms when a snapshot would be too broad;
- artifact snapshots cover phase contracts such as tokens, AST, SIR, IR, generated C, structured diagnostics, rendered diagnostics, and hook order;
- runtime snapshots cover interpreter results, trap reports, and test-harness output;
- source-to-binary smoke tests cover generated C compilation and execution when a supported C toolchain is available;
- cross-cutting regression suites cover diagnostics, target facts, cleanup/error interactions, comptime demand/cache behavior, and module/source identity;
- negative tests cover deterministic rejection of invalid V1 forms and deferred V2/Far Future syntax that shares V1 surface area.
TDD Workflow¶
New compiler behavior should default to tracer-bullet TDD:
- Add one failing test or snapshot fixture for one observable behavior.
- Implement the smallest real path that makes it pass.
- Add the next behavior only after the previous one is green.
- Refactor only while the relevant suite is green.
Tests should describe language/compiler behavior through public interfaces: CLI commands, phase dumps, diagnostics, runtime output, generated C, or source-to-binary smoke results. Unit tests may cover local algorithms, but they should not become the primary specification for language behavior.
Avoid horizontal test passes where a whole slice's snapshots are written before implementation. Snapshot suites should grow one behavior at a time so expected output reflects real compiler behavior, not imagined implementation shape.
Fixture Organization¶
Test fixtures should be organized by product behavior and compiler surface, not by implementation slice number. Slice names are planning scaffolding and must not appear in fixture paths, snapshot names, diagnostics, or user-facing test output.
Suggested top-level fixture groups:
- syntax;
- diagnostics;
- sema;
- ir;
- runtime;
- backend-c;
- modules;
- comptime;
- interop;
- standard-library;
- regressions;
- smoke.
A fixture introduced during a slice should be named for the rule it proves, such as backend-c/target-facts, diagnostics/parser-recovery, or runtime/checked-traps.
Snapshot Scope¶
Snapshots should be small, focused, and cheap to review. A snapshot should usually prove one rule or one closely related rule family. Large end-to-end snapshots are reserved for smoke tests, integration examples, and regressions where the interaction between phases is the behavior under test.
The harness should reduce authoring work:
- one fixture should be able to request multiple relevant outputs without duplicating source;
- default fixture metadata should cover the common target, safety mode, command, and expected phase outputs;
- update mode should refresh only the snapshots owned by the fixture;
- mismatch output should show the smallest useful diff and the command needed to refresh it;
- tests should avoid asserting noisy implementation details when a narrower artifact proves the behavior.
Prefer one precise fixture over a broad example when both catch the same regression. Prefer a smoke test only when the value comes from proving phase composition.
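As a rough sketch of the mismatch behavior described above, the following uses Python's standard `difflib` to produce a small diff followed by a refresh hint. The `mismatch_report` helper, the sample diagnostic text, and the refresh command string are all made up for illustration; the real harness command is not specified here.

```python
import difflib


def mismatch_report(snapshot_name: str, expected: str, actual: str, refresh_cmd: str) -> str:
    """Return "" when the snapshot matches, else a small diff plus a refresh hint."""
    if expected == actual:
        return ""
    diff = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile=f"{snapshot_name} (expected)",
        tofile=f"{snapshot_name} (actual)",
        lineterm="",
        n=1,  # one line of context keeps the diff small
    )
    return "\n".join([*diff, f"refresh with: {refresh_cmd}"])


report = mismatch_report(
    "parser-recovery.diag.snap",
    "error E0001\nnote: a\n",  # placeholder snapshot content
    "error E0002\nnote: a\n",
    "harness --update diagnostics/parser-recovery.ct",  # hypothetical command
)
print(report)
```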
Fixture Directives¶
Most fixtures should rely on harness defaults. When a fixture needs non-default behavior, put a small directive block at the top of the .ct file:
/*@test
outputs = ast, diag, stderr
expect = fail
*/
fn main() void {
    const x =
}
The directive block is harness metadata, not Catalyst source semantics. The lexer and parser treat it as an ordinary block comment.
Directive rules:
- the block header is exactly /*@test;
- entries are key = value lines;
- comma-separated values are allowed only for keys that declare lists;
- unknown keys are hard harness errors;
- duplicate keys are hard harness errors unless the key explicitly allows repetition;
- defaults cover the common target, safety mode, expected success, and narrowest useful output;
- sidecar manifests should be reserved for generated fixtures or cases where embedding metadata would obscure the source being tested.
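The directive rules above could be enforced by a parser along these lines. This is a minimal sketch, not the harness implementation: `parse_directives` and `DirectiveError` are hypothetical names, treating `outputs` and `requires` as the list-valued keys is an assumption, and so is skipping blank lines inside the block.

```python
KNOWN_KEYS = {"outputs", "expect", "target", "safety", "requires"}
LIST_KEYS = {"outputs", "requires"}  # assumption: which keys declare lists


class DirectiveError(Exception):
    """Hard harness error raised for a malformed directive block."""


def parse_directives(source: str) -> dict:
    lines = source.splitlines()
    # The header must match exactly; anything else means "no directive block".
    if not lines or lines[0] != "/*@test":
        return {}  # harness defaults apply
    result = {}
    for line in lines[1:]:
        text = line.strip()
        if text == "*/":
            return result
        if not text:
            continue
        key, sep, value = text.partition("=")
        key, value = key.strip(), value.strip()
        if not sep:
            raise DirectiveError(f"expected 'key = value', got {text!r}")
        if key not in KNOWN_KEYS:
            raise DirectiveError(f"unknown key {key!r}")
        if key in result:
            raise DirectiveError(f"duplicate key {key!r}")
        if key in LIST_KEYS:
            result[key] = [item.strip() for item in value.split(",")]
        elif "," in value:
            raise DirectiveError(f"key {key!r} does not accept a list")
        else:
            result[key] = value
    raise DirectiveError("unterminated /*@test block")


fixture = "/*@test\noutputs = ast, diag, stderr\nexpect = fail\n*/\nfn main() void {\n    const x =\n}\n"
print(parse_directives(fixture))
```

Failing loudly on unknown or duplicate keys keeps a typo such as `output = ast` from silently falling back to defaults.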
Initial directive keys:
| Key | Meaning |
|---|---|
| outputs | Snapshot outputs such as tokens, ast, sir, ir, c, diag, stderr, run, or hook-order. |
| expect | pass or fail. |
| target | Target triple or default. |
| safety | Safety mode such as checked or unchecked. |
| requires | External capability such as c-toolchain; unmet requirements produce deterministic skip output. |
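One way to read the requires key is as a capability probe that yields a stable skip line when unmet. The sketch below is hypothetical throughout: the probe table, the compiler names it looks for, and the `skip: ...` line format are assumptions chosen for illustration.

```python
import shutil

# Hypothetical capability probes; the candidate compiler names are an assumption.
CAPABILITY_PROBES = {
    "c-toolchain": lambda: any(shutil.which(cc) for cc in ("cc", "gcc", "clang")),
}


def skip_reason(requires, probes=CAPABILITY_PROBES):
    """Return a stable skip line for the first unmet requirement, or None."""
    for capability in sorted(requires):  # sorted so the reported reason is stable
        probe = probes.get(capability)
        if probe is None:
            raise ValueError(f"unknown capability {capability!r}")
        if not probe():
            return f"skip: missing capability {capability}"
    return None


# Injecting a fake probe shows the skip text without depending on the host.
print(skip_reason(["c-toolchain"], {"c-toolchain": lambda: False}))
```

The skip line names only the capability, not host paths or tool versions, so it stays identical across machines.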
Diagnostic Snapshots¶
Diagnostics have two snapshot layers:
- .diag.snap records the structured technical diagnostic stream: severity, stable code, primary span, secondary spans, labels, notes, hints, and deterministic machine-readable details.
- .stderr.snap records selected rendered stderr output produced from the structured diagnostic stream and source store.
Most diagnostic behavior should be tested with .diag.snap. Use .stderr.snap when the behavior under test is human presentation: source excerpt layout, label placement, note/hint wording, color-mode boundaries, or multi-diagnostic rendering.
Rendered stderr snapshots should use a deterministic non-color mode by default. Color output can have focused renderer tests, but phase tests should not depend on terminal styling.
Determinism¶
Generated output must be deterministic. Tests should not depend on map iteration order, pointer addresses, host-specific paths, or nondeterministic formatting.
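To illustrate two of the hazards named above, map iteration order and host-specific paths, here is a small sketch of a dump routine. The `dump_symbols` function and its symbol-to-path input are invented for the example and do not describe the compiler's real dump format.

```python
from pathlib import PurePosixPath


def dump_symbols(symbols: dict, source_root: str) -> str:
    """Render a symbol table with sorted keys and root-relative paths."""
    lines = []
    for name in sorted(symbols):  # never rely on insertion or hash order
        # Strip the host-specific prefix so the dump is identical on every machine.
        relative = PurePosixPath(symbols[name]).relative_to(source_root)
        lines.append(f"{name} {relative}")
    return "\n".join(lines) + "\n"


local = dump_symbols(
    {"main": "/home/a/proj/src/main.ct", "add": "/home/a/proj/src/math.ct"},
    "/home/a/proj",
)
ci = dump_symbols(
    {"add": "/tmp/ci/src/math.ct", "main": "/tmp/ci/src/main.ct"},
    "/tmp/ci",
)
print(local == ci)
```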
Lexer and parser snapshots should be target-independent. Sema/SIR, IR, backend, interpreter, and comptime-related snapshots should include the active target and safety mode whenever those phases can depend on them. This keeps checked-vs-unchecked and cross-target behavior inspectable.
Comptime evaluation snapshots or diagnostics should be reused within one compilation only when all of the following are identical: the evaluated semantic instance, comptime arguments, active target, active safety mode, captured scope context, and active semantic environment.
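The reuse rule above amounts to a cache key with one field per listed component. The sketch below assumes each component can be reduced to a hashable value; `ComptimeKey`, `evaluate_cached`, and the sample values are hypothetical and only show the shape of the key.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)  # frozen makes instances hashable, so they can be dict keys
class ComptimeKey:
    semantic_instance: str
    comptime_args: tuple
    target: str
    safety: str
    scope_context: str
    semantic_env: str


def evaluate_cached(cache: Dict[ComptimeKey, str], key: ComptimeKey,
                    evaluate: Callable[[], str]) -> str:
    """Evaluate once per distinct key; reuse only on an exact key match."""
    if key not in cache:
        cache[key] = evaluate()
    return cache[key]


cache: Dict[ComptimeKey, str] = {}
# Placeholder values; only the safety mode differs between the two keys.
checked = ComptimeKey("max(T)", ("i32",), "x86_64-linux", "checked", "module:math", "env:0")
unchecked = ComptimeKey("max(T)", ("i32",), "x86_64-linux", "unchecked", "module:math", "env:0")

evaluate_cached(cache, checked, lambda: "result for checked")
evaluate_cached(cache, checked, lambda: "never evaluated")  # cache hit, not called
evaluate_cached(cache, unchecked, lambda: "result for unchecked")
print(len(cache))
```

Because the safety mode is part of the key, a checked result is never handed to an unchecked evaluation of the same instance.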