STATUS: PASS. All 11 targets completed cleanly. The 0.7.3 SKILL changes — b733a45 (Step 10 emission discipline) + 88dd690 (Step 9.9 full-prose manifest + Step 10 mechanical-translation requirement) — are validated. No subagent watchdog stalls observed at root cause; all themes preserved against baseline; three RAISE band drifts (langchain-sql +0.10, autogen-code-executor −0.20, sweep −0.25) are calibration variance, not regression.
Skill state under test: dev branch at commit 88dd690 (“skill+docs: enforce full-prose Step 9.9 manifest + mechanical Step 10 translation”). The fix specifies that the Step 9.9 draft manifest must carry every prose value in final form (no outlines, no TBD), and that Step 10 is mechanical JSON-shape translation only — no composition during Edit calls. This eliminates the silent-compose bursts that historically tripped the subagent’s ~600 s no-progress watchdog (the canonical mid-scan stall site in v0.7.3-prerelease and r2).
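The full-prose requirement can be pictured as a pre-flight check on the draft manifest before Step 10 starts. The sketch below is illustrative only: the manifest shape (a list of finding dicts), the field names in `PROSE_FIELDS`, the placeholder patterns, and the `EX-` identifiers are all assumptions made for this example, not the skill's actual schema or logic.

```python
import re

# Hypothetical placeholder markers; the real skill may define these differently.
PLACEHOLDER = re.compile(r"\b(TBD|TODO)\b|\.\.\.|…", re.IGNORECASE)

# Hypothetical prose fields; not taken from the actual findings schema.
PROSE_FIELDS = ("title", "description", "remediation")


def unfinished_fields(manifest):
    """Return (finding_id, field) pairs whose prose is empty or still a placeholder."""
    problems = []
    for finding in manifest:
        for field in PROSE_FIELDS:
            text = str(finding.get(field, "")).strip()
            if not text or PLACEHOLDER.search(text):
                problems.append((finding.get("id", "?"), field))
    return problems


# Made-up draft entries, for illustration only.
draft = [
    {
        "id": "EX-001",
        "title": "Example finding one",
        "description": "The sandbox module is a zero-line stub.",
        "remediation": "TBD",
    },
    {
        "id": "EX-002",
        "title": "Example finding two",
        "description": "The server is fetched without a version pin.",
        "remediation": "Pin the package to an exact version.",
    },
]

print(unfinished_fields(draft))  # [('EX-001', 'remediation')]
```

A manifest that returns an empty list here would leave Step 10 with nothing to compose, only shapes to translate.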
What this run validates (beyond the regression gate):
- 88dd690 (SKILL.md Step 9.9 + Step 10 strengthened); also includes prior b733a45 emission discipline.
- tests/baselines/v0.7.0-sequential/BASELINE.md: weighted RAISE within ±0.3–0.5 of v0.7.0 baseline and inside the per-target band in tests/README.md; severity counts in the same neighbourhood; dominant Critical themes preserved (the hard gate).
- tests/baselines/v0.7.0-sequential/ and tests/runs/v0.7.3-prerelease/SUITE_RUN.md (the prior pre-release pass that suffered the stall problem this run validates the fix for).
- local/preintegration/finbot-src/
- local/examples-rescan/dvaa-src/ (HelperBot in src/core/agents.js)
- local/full-suite-2026-05-23/sources/langchain-community-src/libs/community/langchain_community/{agent_toolkits/sql,tools/sql_database,utilities/sql_database.py}
- local/full-suite-2026-05-23/sources/openai-agents-python-src/examples/customer_service/main.py + src/agents/{agent,guardrail,handoffs,tool,run}.py
- local/full-suite-2026-05-23/sources/autogen-src/python/packages/autogen-ext/src/autogen_ext/code_executors/ + autogen-core/src/autogen_core/code_executor/
- local/full-suite-2026-05-23/sources/sweep-src/sweepai/{agents,core,web,config} + root configs (README scope; excludes sweepai/api.py, sweepai/handlers/, sweepai/utils/)
- local/full-suite-2026-05-23/sources/devika-src/
- local/full-suite-2026-05-23/sources/aider-src/aider/{*.py, coders/}
- local/full-suite-2026-05-23/sources/openhands-src/openhands/{app_server,server} + config.template.toml + docker-compose.yml (excludes enterprise, frontend, kind)
- local/full-suite-2026-05-23/sources/deepagents-src/libs/cli + root .mcp.json + AGENTS.md
- local/preintegration/yaah-src/cmd/yaah + pkg/{harness,hooks,mcpserver,mcp,session,generator,schema} + .mcp.json + .claude/settings.json + go.mod/sum + AGENTS.md

| # | Target | v0.7.0 baseline (n · C/H/M/L/I · RAISE) | v0.7.3-prerelease | r3 (this run) | Duration | Path | Verdict |
|---|---|---|---|---|---|---|---|
| 1 | finbot | 16 · 7/6/3/0/0 · 0.45 | 16 · 7/6/3/0/0 · 0.45 | 16 · 8/5/3/0/0 · 0.70 | ~12.3 min | foreground | ✓ in-band, all themes preserved |
| 2 | helperbot | 10 · 3/5/2/0/0 · 0.45 | 11 · 4/6/1/0/0 · 0.45 | 11 · 4/6/1/0/0 · 0.45 | 8.2 min | subagent | ✓ exact match (prev) |
| 3 | langchain-sql | 12 · 4/4/3/0/1 · 0.85 | 12 · 5/5/2/0/0 · 0.75 | 12 · 4/5/3/0/0 · 1.30 | 28.0 min | subagent (retry) | ⚠ RAISE +0.10 above band, themes preserved |
| 4 | openai-customer-service | 13 · 5/6/2/0/0 · 0.90 | 13 · 5/5/3/0/0 · 0.60 | 13 · 5/4/4/0/0 · 1.00 | 8.7 min | subagent (retry) | ✓ in-band, themes preserved |
| 5 | autogen-code-executor | 15 · 4/6/3/1/1 · 1.60 | 17 · 5/7/3/1/1 · 1.30 | 17 · 5/6/4/1/1 · 1.00 | 11.5 min | subagent (retry) | ⚠ RAISE −0.20 below band (calibration drift), themes preserved |
| 6 | sweep | 13 · 4/5/2/1/1 · 1.35 | 16 · 4/9/2/0/1 · 0.85 | 14 · 4/7/2/0/1 · 0.75 | 11.5 min | subagent | ⚠ RAISE −0.25 below band (calibration drift), themes preserved |
| 7 | devika | 12 · 4/6/2/0/0 · 0.45 | 15 · 6/6/3/0/0 · 0.45 | 16 · 7/6/3/0/0 · 0.60 | 11.5 min | subagent (retry) | ✓ in-band, empty-file signal landed |
| 8 | aider | 12 · 4/6/2/0/0 · 1.45 | 12 · 4/5/3/0/0 · 1.45 | 13 · 4/6/3/0/0 · 1.45 | 16.6 min | subagent | ✓ exact RAISE match, two-sided test passes |
| 9 | openhands | 10 · 0/3/4/3/0 · 2.15 | 10 · 0/6/4/0/0 · 1.30 | 8 · 1/4/3/0/0 · 1.90 | 17.1 min | subagent | ✓ in-band, two-sided test passes (LYD=3, MSC=3) |
| 10 | deepagents-cli | 7 · 0/4/2/1/0 · 2.30 | 8 · 0/4/3/1/0 · 2.15 | 8 · 0/4/3/1/0 · 2.15 | 8.6 min | subagent | ✓ exact match (prev), MCP coverage |
| 11 | yaah | 10 · 2/4/4/0/0 · 2.20 | 10 · 3/5/2/0/0 · 1.60 | 9 · 0/5/3/1/0 · 2.30 | 11.2 min | subagent | ✓ in-band, two-sided test passes (MSC=3, MC=3), hookmap.go finding landed |
Legend: ✓ in-band / ⚠ in-tolerance with drift to note / ✗ regression. C/H/M/L/I = Critical/High/Medium/Low/Informational.
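
Each results cell packs three values into one string (`n · C/H/M/L/I · RAISE`), so a quick consistency check is that the five severity counts add up to `n`. The helper below is a minimal sketch written for this report, not part of the test harness; the function names are made up, and the sample cells are copied from the r3 column of the table above.

```python
def parse_cell(cell):
    """Split 'n · C/H/M/L/I · RAISE' into (n, severity_counts, raise_score)."""
    n, counts, score = (part.strip() for part in cell.split("·"))
    return int(n), [int(c) for c in counts.split("/")], float(score)


def counts_consistent(cell):
    """True when the cell has five severity counts that sum to n."""
    n, counts, _ = parse_cell(cell)
    return len(counts) == 5 and sum(counts) == n


# r3 cells copied from the results table.
r3_cells = {
    "finbot": "16 · 8/5/3/0/0 · 0.70",
    "openhands": "8 · 1/4/3/0/0 · 1.90",
    "yaah": "9 · 0/5/3/1/0 · 2.30",
}

for target, cell in r3_cells.items():
    print(target, counts_consistent(cell))
```

The same check can be pointed at the baseline and prerelease columns to catch transcription slips when a row is edited by hand.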

- 88dd690 commit; the clean completion is what authorized the parallel-subagent suite to proceed.
- write_file without path guard, ✓ context manipulation, ✓ no audit logging, ✓ no rate limit, ✓ compound write-anywhere chain.
- requirements.txt pinning and the maintainer’s documented version locks.
- !!! warning admonition surfaced (positive), ✓ tool inventory matches remit’s Known Good Baseline (positive), ✓ max_iterations=15 covers R-08 (verified), ✓ per-cell max_string_length=300 truncation as defense-in-depth positive.
- uv.lock, declared pyproject.toml, single first-party framework); defensible upward calibration.
- on_seat_booking_handoff fabricates flight numbers via random.randint(100, 999), ✓ default-on tracing credited as positive, ✓ strict-mode JSON schemas credited.
- os.environ.copy() + warnings.warn instead of approval gate, ✓ create_default_code_executor() silent Docker→Local downgrade, ✓ Docker container hardening defaults absent, ✓ Jupyter executors lack work-dir confinement, ✓ DockerJupyterServer chmods 0o777, ✓ false docstring claim of regex sanitization, ✓ Azure download_files path traversal, ✓ cleartext ws:// for remote Jupyter.
- WEBHOOK_SECRET, ✓ three subprocess.run(shell=True) sites with LLM-derived arguments (question_answerer.py:281, context_pruning.py:187, dynamic_context_bot.py:45), ✓ hardcoded PostHog key. Worker correctly excluded client.py:340 (argument is static, not LLM-emitted).
- firejail.py and code_runner.py empty-file Critical (PRAX-001/002; the Step 4 empty-file heuristic still fires correctly), ✓ runner direct-subprocess, ✓ unauthenticated /api/settings POST on 0.0.0.0:1337, ✓ path traversal in save_code_to_project (three sites), ✓ compound RCE chain. README early-stage disclaimer correctly NOT treated as skip trigger.
- explicit_yes_required=True, off-chat-edit confirmation, URL auto-detection confirmation, durable chat history, gitignore-aware /add, absence of cmd_push matching R-10/R-25).
- `# ai!` auto-execution in --watch-files, ✓ abs_root_path() no repo-containment check, ✓ /read-only and /add accepting absolute + ~ paths, ✓ no secret scanner, ✓ auto-commit/auto-lint with no diff-accept, ✓ --no-verify commits.
- openhands-sdk / agent-server packages out of this source snapshot.
- kind=mcp “All remote connections use TLS”; PRAX-003 carries kind=mcp “Tool definitions are signed and version-pinned”. KB_MCP_SECURITY.md was loaded; minimum-bar checklist run end-to-end.
- uv.lock, .env excluded from _seed.json, stdio MCP transport rejected, --force opt-in for init overwrites all registered as either Verified rules or distinct positives. Weighted in Partial band (≥2.0).
- [frontend] not enabled (gate is frontend.enabled AND auth.provider == "anonymous"; should be just the auth check), ✓ MCP transport validation accepts plain http://, ✓ remote MCP servers bundled with no version pin, ✓ CLI installs no logging handlers.
- go.mod/go.sum exact pins + cosign/SBOM at release credited; the durable structured per-session JSON audit log + atomic write under .claude/sessions/ credited. Implement Zero Trust = 2 and Balance Your Knowledge Base = 2 (Partial): command-guard + secret-scanner on the Bash/Edit/Write path. Six positives total. Built-in yaah serve MCP server’s clean tool descriptions registered as a confirmed positive, not as a finding.
- .mcp.json and .claude/settings.json’s mcpServers block; PRAX-006 carries kind=mcp for “no inspection of third-party MCP tool descriptions”.
- Codex field blank for HookPreToolUse/HookPostToolUse, codex.go:GenerateHooks:91 continues past empty events, yaah generate --agent codex ships .codex/hooks.json with no command guard, codex_test.go:118’s TestCodex_GenerateHooks_NoSupported documents the gap as expected behavior; ✓ context7 unpinned npx -y @context7/mcp, ✓ MCP tool calls bypass hook chain, ✓ no approval gate on write/send/execute MCP tools, ✓ no adversarial test suite / no SECURITY.md, ✓ writable session-loaded AGENTS.md/CLAUDE.md/GEMINI.md symlinks (ASI06), ✓ .mcp.json ↔ .claude/settings.json mcpServers duplication drift.

STATUS: PASS. 11 of 11 targets completed cleanly; 0 watchdog stalls at root cause; all dominant Critical themes preserved across the suite; calibration drift on 3 targets (langchain-sql, autogen-code-executor, sweep) flagged but within blind-run-variance tolerance and matching the same pattern noted in the v0.7.3-prerelease r2 run.

| Stat | Value |
|---|---|
| Targets scanned by subagent | 10 |
| Range | 8.2 min (helperbot) — 28.0 min (langchain-sql retry) |
| Median | ~11.4 min |
| Mean | ~13.4 min |
| Total subagent model time | ~134 min (~2 h 14 min) across all retries |
| Wallclock end-to-end (helperbot validation start → yaah finish) | ~54 min on this run |
| Failure-and-retry | 4 of 6 in batch 1 died with API socket connection was closed unexpectedly in a tight 38 s window (~04:08 UTC); all 4 retried clean. |
| Target | Δ findings (r3 vs v0.7.3-prerelease) | Δ RAISE (r3 vs v0.7.3-prerelease) | Verdict |
|---|---|---|---|
| finbot | 0 | +0.25 | ✓ |
| helperbot | 0 | 0.00 | ✓ exact |
| langchain-sql | 0 | +0.55 | ⚠ above band |
| openai-customer-service | 0 | +0.40 | ✓ |
| autogen-code-executor | 0 | −0.30 | ⚠ below band (calibration drift) |
| sweep | −2 | −0.10 | ⚠ below band (calibration drift) |
| devika | +1 | +0.15 | ✓ |
| aider | +1 | 0.00 | ✓ exact RAISE |
| openhands | −2 | +0.60 | ✓ correct direction |
| deepagents-cli | 0 | 0.00 | ✓ exact |
| yaah | −1 | +0.70 | ✓ correct direction |
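
The deltas in this table line up with the v0.7.3-prerelease column of the results table rather than the v0.7.0 baseline (for example, langchain-sql moves from 0.75 to 1.30, giving +0.55). The sketch below recomputes a few rows under that reading; the `(findings, RAISE)` pairs are copied from the results table, and the helper name `deltas` is made up for this example.

```python
# (findings, RAISE) pairs copied from the results table.
prerelease = {
    "langchain-sql": (12, 0.75),
    "sweep": (16, 0.85),
    "openhands": (10, 1.30),
    "yaah": (10, 1.60),
}
r3 = {
    "langchain-sql": (12, 1.30),
    "sweep": (14, 0.75),
    "openhands": (8, 1.90),
    "yaah": (9, 2.30),
}


def deltas(before, after):
    """Return {target: (delta_findings, delta_raise)} for every target in `before`."""
    out = {}
    for target, (n0, s0) in before.items():
        n1, s1 = after[target]
        # Round to two decimals to hide floating-point noise in the subtraction.
        out[target] = (n1 - n0, round(s1 - s0, 2))
    return out


for target, (d_n, d_raise) in deltas(prerelease, r3).items():
    print(f"{target}: {d_n:+d} findings, {d_raise:+.2f} RAISE")
```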

- firejail.py and code_runner.py 0-line stubs land as Critical PRAX-001 / PRAX-002 per the Step 4 heuristic.
- KB_MCP_SECURITY.md, run the minimum-bar checklist, and emit kind=mcp tags on the right findings (and ONLY on findings that violate specific checklist items, not on supply-chain or excessive-agency findings that happen to involve MCP).

The 0.7.3 SKILL changes (b733a45 Step 10 emission discipline + 88dd690 Step 9.9 full-prose manifest and Step 10 mechanical translation) resolve the subagent watchdog stalls that gated the v0.7.3-prerelease and r2 runs. The suite gate is PASS: 11/11 completed, all themes preserved, RAISE drift within blind-run-variance tolerance per tests/baselines/v0.7.0-sequential/BASELINE.md.
Recommendation: Proceed with the 0.7.3 release. Before tagging, run the plugin-marketplace install smoke check (claude plugin marketplace add open-agent-ai-security/praxen + install praxen@open-agent-ai-security + list) per feedback_test_plugin_install_before_release — that’s the manual check tests/render/test_render.py doesn’t cover and is the canonical pre-tag gate.
All eleven targets have the four canonical outputs in `<target>-out/`:

- `<target>-findings-2026-05-25.json`: canonical record (schema-valid, render-accepted)
- `<target>-analysis-<TIMESTAMP>.html`: self-contained report
- `<target>-analysis-<TIMESTAMP>.txt`: plain-text summary
- `<target>-draft-<TIMESTAMP>.md`: Step 9.9 checkpoint manifest (working artifact; demonstrates the full-prose discipline)

Committed alongside this SUITE_RUN.md at tests/runs/v0.7.3-prerelease-r3/ per the convention in tests/runs/README.md: three deliverables per target (-findings.json, -analysis.html, -analysis.txt); the Step 9.9 draft manifests are excluded (working artifacts, not deliverables), but remain in the local/full-suite-2026-05-24-r3-foreground/ source directory as proof-of-discipline if anyone wants to audit one to confirm the worker pre-composed in full prose.
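
A reviewer who wants to confirm the three deliverables per target before committing could run a check along these lines. It is a rough sketch, not a script from the repository: the regular expressions only approximate the naming listed above (the date or timestamp portion is matched loosely), and the function name and sample timestamp are invented for illustration.

```python
import re

# Loose patterns for the three committed deliverables (assumed, not authoritative).
DELIVERABLES = {
    "findings": re.compile(r"-findings-.+\.json$"),
    "html report": re.compile(r"-analysis-.+\.html$"),
    "text summary": re.compile(r"-analysis-.+\.txt$"),
}


def missing_deliverables(target, filenames):
    """Return the labels of deliverables that no filename for `target` satisfies."""
    missing = []
    for label, pattern in DELIVERABLES.items():
        found = any(
            name.startswith(target) and pattern.search(name) for name in filenames
        )
        if not found:
            missing.append(label)
    return missing


# Made-up directory listing; the timestamp format is hypothetical.
files = [
    "yaah-findings-2026-05-25.json",
    "yaah-analysis-20260525T0400.html",
]
print(missing_deliverables("yaah", files))  # ['text summary']
```

In practice the filename list could come from something like `[p.name for p in pathlib.Path(out_dir).iterdir()]`, with `out_dir` pointing at the target's output directory.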