praxen

Full Suite Run — 2026-05-25 — Praxen 0.7.3 prerelease, r3 (post-SKILL-fix)

STATUS: PASS. All 11 targets completed cleanly. The 0.7.3 SKILL changes — b733a45 (Step 10 emission discipline) + 88dd690 (Step 9.9 full-prose manifest + Step 10 mechanical-translation requirement) — are validated. No subagent watchdog stalls observed at root cause; all themes preserved against baseline; three RAISE band drifts (langchain-sql +0.10, autogen-code-executor −0.20, sweep −0.25) are calibration variance, not regression.

Skill state under test: dev branch at commit 88dd690 (“skill+docs: enforce full-prose Step 9.9 manifest + mechanical Step 10 translation”). The fix specifies that the Step 9.9 draft manifest must carry every prose value in final form (no outlines, no TBD), and that Step 10 is mechanical JSON-shape translation only — no composition during Edit calls. This eliminates the silent-compose bursts that historically tripped the subagent’s ~600 s no-progress watchdog (the canonical mid-scan stall site in v0.7.3-prerelease and r2).

What this run validates (beyond the regression gate):

  1. Subagent path is reliable again — 10 of the 11 scans ran as background subagents and none stalled.
  2. The SKILL fix holds across the diversity of the suite (small CTFs, mature libraries, MCP-coverage targets, two-sided “controls present” anchors).
  3. Run-to-run calibration drift remains the dominant non-pass condition, not infrastructure.

Inputs

Per-target table

# Target v0.7.0 baseline (n · C/H/M/L/I · RAISE) v0.7.3-prerelease r3 (this run) Duration Path Verdict
1 finbot 16 · 7/6/3/0/0 · 0.45 16 · 7/6/3/0/0 · 0.45 16 · 8/5/3/0/0 · 0.70 ~12.3 min foreground ✓ in-band, all themes preserved
2 helperbot 10 · 3/5/2/0/0 · 0.45 11 · 4/6/1/0/0 · 0.45 11 · 4/6/1/0/0 · 0.45 8.2 min subagent ✓ exact match (prev)
3 langchain-sql 12 · 4/4/3/0/1 · 0.85 12 · 5/5/2/0/0 · 0.75 12 · 4/5/3/0/0 · 1.30 28.0 min subagent (retry) ⚠ RAISE +0.10 above band, themes preserved
4 openai-customer-service 13 · 5/6/2/0/0 · 0.90 13 · 5/5/3/0/0 · 0.60 13 · 5/4/4/0/0 · 1.00 8.7 min subagent (retry) ✓ in-band, themes preserved
5 autogen-code-executor 15 · 4/6/3/1/1 · 1.60 17 · 5/7/3/1/1 · 1.30 17 · 5/6/4/1/1 · 1.00 11.5 min subagent (retry) ⚠ RAISE −0.20 below band (calibration drift), themes preserved
6 sweep 13 · 4/5/2/1/1 · 1.35 16 · 4/9/2/0/1 · 0.85 14 · 4/7/2/0/1 · 0.75 11.5 min subagent ⚠ RAISE −0.25 below band (calibration drift), themes preserved
7 devika 12 · 4/6/2/0/0 · 0.45 15 · 6/6/3/0/0 · 0.45 16 · 7/6/3/0/0 · 0.60 11.5 min subagent (retry) ✓ in-band, empty-file signal landed
8 aider 12 · 4/6/2/0/0 · 1.45 12 · 4/5/3/0/0 · 1.45 13 · 4/6/3/0/0 · 1.45 16.6 min subagent ✓ exact RAISE match, two-sided test passes
9 openhands 10 · 0/3/4/3/0 · 2.15 10 · 0/6/4/0/0 · 1.30 8 · 1/4/3/0/0 · 1.90 17.1 min subagent ✓ in-band, two-sided test passes (LYD=3, MSC=3)
10 deepagents-cli 7 · 0/4/2/1/0 · 2.30 8 · 0/4/3/1/0 · 2.15 8 · 0/4/3/1/0 · 2.15 8.6 min subagent ✓ exact match (prev), MCP coverage
11 yaah 10 · 2/4/4/0/0 · 2.20 10 · 3/5/2/0/0 · 1.60 9 · 0/5/3/1/0 · 2.30 11.2 min subagent ✓ in-band, two-sided test passes (MSC=3, MC=3), hookmap.go finding landed

Legend: ✓ in-band / ⚠ in-tolerance with drift to note / ✗ regression. C/H/M/L/I = Critical/High/Medium/Low/Informational.

Detailed notes per target

1. finbot — ✓ in-band, foreground

2. helperbot — ✓ exact match (validation single-target)

3. langchain-sql — ⚠ RAISE slightly above band, themes preserved

4. openai-customer-service — ✓ in-band, themes preserved

5. autogen-code-executor — ⚠ RAISE below band, themes preserved

6. sweep — ⚠ RAISE below band, themes preserved

7. devika — ✓ in-band, empty-file signal landed

8. aider — ✓ exact RAISE match, two-sided test passes

9. openhands — ✓ in-band, two-sided test passes

10. deepagents-cli — ✓ exact match, MCP coverage exercised

11. yaah — ✓ in-band, two-sided test passes, hookmap.go landed

Suite verdict & timing summary

STATUS: PASS — 11 of 11 targets completed cleanly; 0 watchdog stalls at root cause; all dominant Critical themes preserved across the suite; calibration drift on 3 targets (langchain-sql, autogen-code-executor, sweep) flagged but within blind-run-variance tolerance and matching the same pattern noted in the v0.7.3-prerelease r2 run.

Per-scan timing (subagent only; foreground finbot excluded)

Stat Value
Targets scanned by subagent 10
Range 8.2 min (helperbot) — 28.0 min (langchain-sql retry)
Median ~11.4 min
Mean ~13.4 min
Total subagent model time ~134 min (~2 h 14 min) across all retries
Wallclock end-to-end (helperbot validation start → yaah finish) ~54 min on this run
Failure-and-retry 4 of 6 in batch 1 died with API socket connection was closed unexpectedly in a tight 38 s window (~04:08 UTC); all 4 retried clean.

Sanity table (Δ count, Δ RAISE vs v0.7.3-prerelease, verdict per target)

Target Δ findings Δ RAISE Verdict
finbot 0 +0.25
helperbot 0 0.00 ✓ exact
langchain-sql 0 +0.55 ⚠ above band
openai-customer-service 0 +0.40
autogen-code-executor 0 −0.30 ⚠ below band (calibration drift)
sweep −2 −0.10 ⚠ below band (calibration drift)
devika +1 +0.15
aider +1 0.00 ✓ exact RAISE
openhands −2 +0.60 ✓ correct direction
deepagents-cli 0 0.00 ✓ exact
yaah −1 +0.70 ✓ correct direction

Patterns surfaced this run

  1. The SKILL fix works at root cause. Zero subagent stalls. The 4 batch-1 failures were API connection drops in a single 38 s window — a clean signature of an aggregate transient API event, not silent-compose timeouts. All 4 retried successfully without changing the SKILL or the prompt.
  2. Two-sided tests all pass. aider (HITL credit, no category at 0), openhands (LYD=3 + MSC=3 Established), deepagents-cli (operative controls credited, in Partial band), yaah (MSC=3 + MC=3 Established) — all the “controls present, score honestly” calibration anchors land correctly. No over-correction toward Absent.
  3. Empty-file signal detector still fires. devika’s firejail.py and code_runner.py 0-line stubs land as Critical PRAX-001 / PRAX-002 per the Step 4 heuristic.
  4. MCP coverage exercised end-to-end on both MCP-coverage targets. deepagents-cli and yaah both load KB_MCP_SECURITY.md, run the minimum-bar checklist, and emit kind=mcp tags on the right findings (and ONLY on findings that violate specific checklist items — not on supply-chain or excessive-agency findings that happen to involve MCP).
  5. Calibration drift remains the dominant non-pass condition. autogen-code-executor and sweep land below their RAISE band when this worker bottoms-out IZT at 0 where prior workers gave partial credit. This is the same pattern documented in r2 — blind-run variance, not regression. Themes are preserved on both.
  6. The retry pattern is operationally important. With ~4 of 6 subagents in batch 1 hitting a transient API event, the right operator response is to retry — not to assume the SKILL is broken. The retries all ran clean.

Bottom-line judgment

The 0.7.3 SKILL changes (b733a45 Step 10 emission discipline + 88dd690 Step 9.9 full-prose manifest and Step 10 mechanical translation) resolve the subagent watchdog stalls that gated the v0.7.3-prerelease and r2 runs. The suite gate is PASS: 11/11 completed, all themes preserved, RAISE drift within blind-run-variance tolerance per tests/baselines/v0.7.0-sequential/BASELINE.md.

Recommendation: Proceed with the 0.7.3 release. Before tagging, run the plugin-marketplace install smoke check (claude plugin marketplace add open-agent-ai-security/praxen + install praxen@open-agent-ai-security + list) per feedback_test_plugin_install_before_release — that’s the manual check tests/render/test_render.py doesn’t cover and is the canonical pre-tag gate.

Artifacts

All eleven targets have the four canonical outputs in <target>-out/:

Committed alongside this SUITE_RUN.md at tests/runs/v0.7.3-prerelease-r3/ per the convention in tests/runs/README.md: three deliverables per target (-findings.json, -analysis.html, -analysis.txt); the Step 9.9 draft manifests are excluded (working artifacts, not deliverables), but remain in the local/full-suite-2026-05-24-r3-foreground/ source directory as proof-of-discipline if anyone wants to audit one to confirm the worker pre-composed in full prose.