praxen

Baseline — v0.7.7-sequential

Frozen runs of all eleven test targets (../../README.md) on the Praxen v0.7.7 skill, against the intent-level Worker Remits (tests/remits/*.md). The eleven runs were produced cold on 2026-05-29 against the release/0.7.7 branch (off dev) with version-source-of-truth bumped to 0.7.7, sources fresh-cloned from upstream. Retires the previous v0.7.4-sequential/ set, which is no longer the comparison point but is kept on disk for diff archaeology.

What changed since v0.7.4-sequential

How this run compares to v0.7.4-sequential

| Metric | v0.7.4 | v0.7.7 | Delta |
|---|---|---|---|
| Total findings (across 11 targets) | 135 | 109 | −19.3% |
| Weighted RAISE deltas per target | | | all within ±0.45 |
| Targets in expected band | 11/11 | 11/11 | |
| Critical themes preserved | yes | yes | |
| Inferred log rows surfaced | n/a | 9 (across 6 targets) | new behavior |

The 19.3% drop in total findings (135 to 109, about 2.4 fewer per target on average) is within natural run-to-run variance: the SKILL’s Build an AI Red Team calibration note documents ±2–3 swings per severity bucket per blind run. Per-target weighted RAISE is stable: 10 of 11 targets are within ±0.15, and the largest delta is deepagents-cli at −0.45 (at the boundary of natural variance, scope unchanged). Every Critical theme catalogued in ../../README.md’s per-target notes is present in the v0.7.7 set, so the “no Critical theme dropped” hard gate passes for all 11.
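The checks described above can be sketched in a few lines of Python. This is an illustration only, not the project's actual gate code: the function names and the sample theme labels are hypothetical, and the delta is assumed to be computed as new minus old. Only the 135 and 109 totals and the ±0.45 bound come from this baseline.

```python
def percent_delta(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def dropped_critical_themes(expected: set[str], observed: set[str]) -> set[str]:
    """Themes catalogued for a target that are missing from the new run."""
    return expected - observed


def within_band(old_weighted: float, new_weighted: float, tolerance: float = 0.45) -> bool:
    """True when the weighted RAISE delta stays inside the tolerance.

    A tiny epsilon absorbs floating-point noise at the boundary.
    """
    return abs(new_weighted - old_weighted) <= tolerance + 1e-9


# Totals taken from the comparison table.
print(f"{percent_delta(135, 109):.1f}%")  # -19.3%

# Hypothetical theme labels, for illustration only.
expected = {"prompt-injection", "unsandboxed-exec"}
observed = {"prompt-injection", "unsandboxed-exec", "secrets-in-logs"}
print(dropped_critical_themes(expected, observed))  # set()
```

An empty set from `dropped_critical_themes` corresponds to the hard gate passing for that target; extra themes in the new run do not fail it.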

The new inferred log-file rows fired correctly on six targets: openhands (3), yaah (2), and airline-customer-service / sweep / devika / deepagents-cli (1 each) — 9 inferred rows total across the suite. The remaining five targets had no logging infrastructure visible in the in-scope source files.

The eleven baselines

Sorted by weighted RAISE, ascending.

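| Target | Critical | High | Medium | Low | Info | Weighted RAISE | Maturity |
|---|---|---|---|---|---|---|---|
| finbot | 7 | 4 | 4 | 0 | 0 | 0.45 | Absent |
| helperbot | 3 | 4 | 1 | 0 | 0 | 0.45 | Absent |
| devika | 6 | 6 | 3 | 0 | 0 | 0.60 | Absent |
| openai-customer-service | 4 | 4 | 2 | 0 | 0 | 0.90 | Absent |
| sweep | 4 | 3 | 2 | 0 | 0 | 0.90 | Absent |
| langchain-sql | 2 | 5 | 2 | 0 | 0 | 1.00 | Ad hoc |
| autogen-code-executor | 4 | 4 | 1 | 0 | 0 | 1.15 | Ad hoc |
| aider | 3 | 5 | 2 | 0 | 0 | 1.40 | Ad hoc |
| deepagents-cli | 0 | 3 | 3 | 0 | 0 | 2.00 | Partial |
| openhands | 2 | 3 | 4 | 1 | 1 | 2.05 | Partial |
| yaah | 0 | 3 | 4 | 0 | 0 | 2.15 | Partial |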
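As a cross-check on the table above, the per-target severity counts can be summed and the sort order verified. The rows below are transcribed by hand from the table; the tuple layout is an illustrative choice, not a format the skill emits.

```python
# (target, critical, high, medium, low, info, weighted), transcribed from the table.
ROWS = [
    ("finbot", 7, 4, 4, 0, 0, 0.45),
    ("helperbot", 3, 4, 1, 0, 0, 0.45),
    ("devika", 6, 6, 3, 0, 0, 0.60),
    ("openai-customer-service", 4, 4, 2, 0, 0, 0.90),
    ("sweep", 4, 3, 2, 0, 0, 0.90),
    ("langchain-sql", 2, 5, 2, 0, 0, 1.00),
    ("autogen-code-executor", 4, 4, 1, 0, 0, 1.15),
    ("aider", 3, 5, 2, 0, 0, 1.40),
    ("deepagents-cli", 0, 3, 3, 0, 0, 2.00),
    ("openhands", 2, 3, 4, 1, 1, 2.05),
    ("yaah", 0, 3, 4, 0, 0, 2.15),
]

# Sum the five severity columns of every row.
total = sum(sum(row[1:6]) for row in ROWS)
weighted = [row[6] for row in ROWS]

print(total)                         # 109
print(weighted == sorted(weighted))  # True
```

The total matches the 109 findings reported in the comparison table, and the weighted column is in ascending order as stated.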

Run notes

The Full Suite Run was executed in two sessions across 2026-05-28 → 2026-05-29 because the first attempt coincided with the Claude 4.8 release and tripped concurrency-induced watchdog stalls. The pattern that worked was parallel subagents in batches of 4, with explicit heartbeat discipline at the top of each prompt and the Step 9.9 disk-write mandate emphasized, plus a continuation subagent for any agent that died mid-findings-loop. The continuation reads the on-disk draft, completes the remaining findings, positives, and log_files, then runs manifest_to_findings.py and render.py.
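The batch-plus-continuation pattern can be sketched roughly as below. This is not the orchestration actually used for the run: `run_target` and `continue_target` are hypothetical callables standing in for "launch a subagent" and "resume from the on-disk draft", and threads are used here only to keep the example self-contained. The batch size of 4 is the one figure taken from the notes.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 4  # batch size reported in the run notes


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_suite(targets, run_target, continue_target):
    """Run targets batch by batch; hand any failed target to a continuation.

    Both callables are hypothetical placeholders supplied by the caller.
    """
    results = {}
    for batch in batches(targets, BATCH_SIZE):
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            futures = {name: pool.submit(run_target, name) for name in batch}
        # Leaving the `with` block waits for every future in the batch.
        for name, future in futures.items():
            if future.exception() is None:
                results[name] = future.result()
            else:
                # The first attempt died; resume instead of starting over.
                results[name] = continue_target(name)
    return results


def demo_run(name):
    """Stand-in worker that fails for one made-up target name."""
    if name == "t3":
        raise RuntimeError("stalled")
    return "done"


outcome = run_suite([f"t{i}" for i in range(1, 12)], demo_run, lambda name: "continued")
print(outcome["t1"], outcome["t3"])  # done continued
```

Waiting for each batch to finish before starting the next keeps at most four workers active at a time, which is the property the notes credit with avoiding the stalls.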

All eleven outputs went through manifest_to_findings.py and render.py cleanly; the byte-identity gate in tests/render/test_render.py passes 242/0 against this set.
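A byte-identity gate of this kind comes down to comparing a fresh render against the stored one byte for byte. The helper below is a generic sketch and makes no claim about how tests/render/test_render.py is written; the function name and the temporary files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def same_bytes(path_a: Path, path_b: Path) -> bool:
    """Compare two files by the SHA-256 digest of their raw bytes."""
    digest_a = hashlib.sha256(path_a.read_bytes()).hexdigest()
    digest_b = hashlib.sha256(path_b.read_bytes()).hexdigest()
    return digest_a == digest_b


with tempfile.TemporaryDirectory() as tmp:
    stored = Path(tmp) / "stored.html"
    fresh = Path(tmp) / "fresh.html"
    stored.write_bytes(b"<html>report</html>\n")
    fresh.write_bytes(b"<html>report</html>\n")
    identical = same_bytes(stored, fresh)
    # Even a line-ending change is enough to break byte identity.
    fresh.write_bytes(b"<html>report</html>\r\n")
    changed = same_bytes(stored, fresh)

print(identical, changed)  # True False
```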

How to compare

```bash
# Diff a single target's findings JSON across baseline sets
diff <(python3 -m json.tool tests/baselines/v0.7.4-sequential/finbot/finbot-findings-2026-05-26.json) \
     <(python3 -m json.tool tests/baselines/v0.7.7-sequential/finbot/finbot-findings-2026-05-29.json)

# Re-render any baseline from its JSON (byte-identical re-render is enforced by tests/render/test_render.py)
python3 skills/behavior-verifier/render.py \
  --findings tests/baselines/v0.7.7-sequential/<target>/<target>-findings-<date>.json \
  --template skills/behavior-verifier/report_template.html \
  --out-html /tmp/<target>.html --out-txt /tmp/<target>.txt
```
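Where process substitution is unavailable (for example outside bash), a similar normalized diff can be produced in Python. This sketch assumes only that the findings files are valid JSON. The helper name is hypothetical, and sorting keys is a choice made here so that key order does not show up as noise, which differs slightly from the default `json.tool` output.

```python
import difflib
import json


def json_diff(old_text: str, new_text: str,
              old_label: str = "old", new_label: str = "new") -> str:
    """Return a unified diff of two JSON documents after normalizing them."""
    def normalize(text: str) -> list[str]:
        # Re-serialize with fixed indentation and sorted keys.
        return json.dumps(json.loads(text), indent=4, sort_keys=True).splitlines()

    lines = difflib.unified_diff(
        normalize(old_text), normalize(new_text),
        fromfile=old_label, tofile=new_label, lineterm="",
    )
    return "\n".join(lines)


# Tiny made-up documents stand in for the real findings files.
print(json_diff('{"a": 1, "b": 2}', '{"b": 2, "a": 3}'))
```

In practice the two strings would come from reading the baseline files, for instance with `pathlib.Path(...).read_text()`.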