This directory holds frozen runs of the twelve test targets in ../README.md, kept in the repo so a release run can be diffed against them. The current set is v0.7.7-claude48/: all twelve targets on the Praxen v0.7.7 skill under Anthropic Claude Opus 4.8, against the intent-level Worker Remits, frozen via a median-of-3 process (see its BASELINE.md). It is the comparison point for the pre-release regression review (see ../README.md, “What a release review looks like”). The same skill on Opus 4.7 is retained as v0.7.7-sequential/ for diff archaeology.
baselines/
  README.md                                ← this file
  owasp_coverage.py                        ← cross-baseline OWASP-coverage HTML report generator
  owasp-coverage-report.html               ← committed snapshot; live at GitHub Pages (link below)
  v0.7.7-claude48/                         ← CURRENT: all twelve targets, Praxen v0.7.7 on Opus 4.8 (median-of-3, schema 2.0)
  v0.7.7-sequential/                       ← retired: same skill on Opus 4.7; kept for diff archaeology
  v0.7.4-sequential/                       ← retired: see CHANGELOG [0.7.7]
  v0.7.0-sequential/                       ← retired: see CHANGELOG [0.7.4]
    BASELINE.md                            ← summary table, provenance, how to compare
    <target>/
      <target>-findings-<date>.json        ← the canonical record (the thing you diff)
      <target>-analysis-<timestamp>.html   ← the rendered report
      <target>-analysis-<timestamp>.txt    ← the plain-text summary
  v0.4-parallel/                           ← historical: the Phase-2 parallel-path evaluation gate
    GATE-NOTES.md                          ← the A/B record and the "drop the parallel path" verdict
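Since the findings JSON is the record you diff, a release review can start from a structural comparison of two such files. The following is a minimal sketch, not the project's own tooling: `flatten`, `diff_findings`, and `diff_files` are hypothetical helper names, and the code assumes nothing about the findings schema beyond it being valid JSON.

```python
import json
from pathlib import Path


def flatten(node, prefix=""):
    """Flatten nested JSON into a {path: leaf value} mapping."""
    if isinstance(node, dict) and node:
        items = node.items()
    elif isinstance(node, list) and node:
        items = enumerate(node)
    else:
        # Scalars and empty containers are treated as leaves.
        return {prefix or "/": node}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, f"{prefix}/{key}"))
    return flat


def diff_findings(baseline, candidate):
    """Report paths removed from, added to, or changed against the baseline."""
    old, new = flatten(baseline), flatten(candidate)
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def diff_files(baseline_path, candidate_path):
    """Load two findings files from disk and diff them."""
    def load(path):
        return json.loads(Path(path).read_text(encoding="utf-8"))
    return diff_findings(load(baseline_path), load(candidate_path))
```

List entries are compared by position, so a run that merely reorders its findings shows up as a set of changes. A schema-aware diff would key findings on a stable identifier instead, if the schema provides one.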
When a Praxen release legitimately moves the calibration, the findings schema changes, or the reference model changes, the suite is re-run and re-frozen under a new vX.Y-<variant>/ directory, the previous set is retired, and the pointer in ../README.md is updated.
The current v0.7.7-claude48/ set re-froze the suite on Opus 4.8 using the median-of-3 characterization process; see its BASELINE.md for the per-target table, the model-lean analysis, and the five band changes. The retired v0.7.7-sequential/ set is the same skill on Opus 4.7: the eleven cold runs that validated the SKILL Pre-flight Step 5 + multi-component remit guidance (PR #42) and the Step 4 source-inferred log files (PR #43) shipped in [0.7.7]. It is kept on disk for diff archaeology (see v0.7.7-sequential/BASELINE.md).
Earlier sets were retired in successive re-baselines: v0.7.4-sequential/ (the 0.7.4-skill cold runs, which validated the deterministic-Step-10 + Step-9.9-emission-discipline changes and are also kept on disk for diff archaeology), v0.7.0-sequential/, v0.3-sequential/, v0.2-sequential/, the partial v0.6-sequential/, and the same-content v0.6.3-sequential/.
v0.4-parallel/ is not a baseline set — it is the record of the Phase-2 parallel-analysis gate (design/V2_HARVEST_PLAN.md §5), whose verdict was to drop the parallel path. It is kept as a historical decision record.
The renderer is deterministic — a baseline’s committed HTML/TXT re-render byte-for-byte from its committed JSON, and tests/render/test_render.py enforces that on every run:
python3 skills/behavior-verifier/render.py \
--findings tests/baselines/v0.7.7-claude48/<target>/<target>-findings-<date>.json \
--template skills/behavior-verifier/report_template.html \
--out-html /tmp/<target>.html --out-txt /tmp/<target>.txt
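The same byte-for-byte check can be run by hand for a single target. The sketch below is an illustration only and is not the contents of tests/render/test_render.py: the helper names are hypothetical, and it assumes it is run from the repository root with the `render.py` flags shown above.

```python
import filecmp
import subprocess
import sys
import tempfile
from pathlib import Path

RENDERER = "skills/behavior-verifier/render.py"
TEMPLATE = "skills/behavior-verifier/report_template.html"


def render_command(findings, out_html, out_txt):
    """Build the render.py invocation shown above as an argument list."""
    return [
        sys.executable, RENDERER,
        "--findings", str(findings),
        "--template", TEMPLATE,
        "--out-html", str(out_html),
        "--out-txt", str(out_txt),
    ]


def same_bytes(a, b):
    """Compare file contents, not just size and mtime."""
    return filecmp.cmp(a, b, shallow=False)


def rerender_matches(findings, committed_html, committed_txt):
    """Re-render into a temp directory and compare against the committed files."""
    with tempfile.TemporaryDirectory() as tmp:
        out_html = Path(tmp) / "report.html"
        out_txt = Path(tmp) / "report.txt"
        subprocess.run(render_command(findings, out_html, out_txt), check=True)
        return same_bytes(out_html, committed_html) and same_bytes(out_txt, committed_txt)
```

Calling `rerender_matches(...)` with a target's committed findings JSON, HTML, and TXT paths should return `True` for any frozen baseline if the renderer is deterministic as described.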
A committed snapshot of the aggregate view lives at owasp-coverage-report.html and is served live by GitHub Pages. Each per-target card links to both the agent’s source repository and the per-target Praxen baseline analysis HTML, so the report doubles as a navigable index of what the suite tests and what the analyses found. The report also includes a horizontal bar chart for each OWASP Top 10 and a methodology note.
The snapshot is produced by owasp_coverage.py, which walks every <target>/<target>-findings-*.json in the chosen baseline set and sums the per-finding owasp_llm / owasp_agentic primary scalars. Regenerate it whenever the baselines change:
# regenerate the committed snapshot in place (canonical form)
python3 tests/baselines/owasp_coverage.py \
--baseline-dir tests/baselines/v0.7.7-claude48 \
--out tests/baselines/owasp-coverage-report.html
# or render somewhere else for ad-hoc browsing
python3 tests/baselines/owasp_coverage.py --out /tmp/owasp-coverage.html
No external dependencies — pure Python 3 stdlib + inline CSS.
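The aggregation step described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in owasp_coverage.py: `tally_owasp` is a hypothetical name, and the layout it reads (a top-level `findings` list whose entries carry scalar `owasp_llm` and `owasp_agentic` values) is an assumption about the findings schema.

```python
import json
from collections import Counter
from pathlib import Path


def tally_owasp(baseline_dir):
    """Count primary OWASP categories across every findings file in a set."""
    llm, agentic = Counter(), Counter()
    # One directory per target, each holding <target>-findings-*.json.
    for path in sorted(Path(baseline_dir).glob("*/*-findings-*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        # ASSUMPTION: findings sit in a top-level "findings" list.
        for finding in doc.get("findings", []):
            # ASSUMPTION: each primary category is a plain scalar on the finding.
            if finding.get("owasp_llm"):
                llm[finding["owasp_llm"]] += 1
            if finding.get("owasp_agentic"):
                agentic[finding["owasp_agentic"]] += 1
    return llm, agentic
```

Called as `tally_owasp("tests/baselines/v0.7.7-claude48")`, it returns two `Counter` objects, one per OWASP list, from which per-category bar charts could be drawn.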
Ad-hoc and mid-development re-run reports for individual targets are not committed. They regenerate on every run and drift between analyses, so only the named, version-pinned baseline set is committed. The owasp-coverage-report.html snapshot above is the one committed aggregate view; it is treated like the bundled example reports in examples/, as a stable, browsable artifact that regenerates deterministically from its inputs.