STATUS: PASS. All 11 targets completed cleanly; all dominant Critical themes preserved against tests/runs/v0.7.3-prerelease-r3/; weighted RAISE values inside their per-target bands except langchain-sql (+0.10 above band, same as r3). r4 confirms that the polish commit (44057a8 — 8 SKILL.md polish items + Step 7 prominence tweak) did NOT introduce a regression. The SKILL is validated for tag.
Skill state under test: dev @ 44057a8 (“skill: address PR #30 review — 8 polish items + Step 7 compound prominence”). Built on top of 88dd690 (Step 9.9 full-prose manifest + Step 10 mechanical-translation requirement) which r3 validated.
Compared against: v0.7.0 frozen baseline (tests/baselines/v0.7.0-sequential/), tests/runs/v0.7.3-prerelease/, tests/runs/v0.7.3-prerelease-r3/.
- 44057a8. SKILL.md grew by 13 lines (13,326 → 13,896 words; +4%) across the polish commit, mostly inline notes in the Step 9.9 manifest template and the Step 10 Common validation errors section.
- tests/baselines/v0.7.0-sequential/BASELINE.md. Weighted RAISE within ±0.3–0.5 of v0.7.0 baseline AND inside the per-target band; severity counts in the same neighbourhood; dominant Critical themes preserved (the hard gate).
- tests/runs/v0.7.3-prerelease-r3/: the previous SKILL-validation run, against 88dd690 before the polish commit.

| # | Target | v0.7.0 baseline (n · C/H/M/L/I · RAISE) | r3 (vs 88dd690) | r4 (vs 44057a8) | Duration | Verdict |
|---|---|---|---|---|---|---|
| 1 | finbot | 16 · 7/6/3/0/0 · 0.45 | 16 · 8/5/3/0/0 · 0.70 | 13 · 6/4/3/0/0 · 0.60 | 8.9 min | ✓ in-band, all themes preserved |
| 2 | helperbot | 10 · 3/5/2/0/0 · 0.45 | 11 · 4/6/1/0/0 · 0.45 | 11 · 4/5/2/0/0 · 0.45 | 5.5 min | ✓ exact RAISE match (prev + baseline) |
| 3 | langchain-sql | 12 · 4/4/3/0/1 · 0.85 | 12 · 4/5/3/0/0 · 1.30 | 8 · 2/3/3/0/0 · 1.30 | 8.4 min | ⚠ RAISE +0.10 above band (same as r3), themes preserved as consolidation |
| 4 | openai-customer-service | 13 · 5/6/2/0/0 · 0.90 | 13 · 5/4/4/0/0 · 1.00 | 11 · 4/4/3/0/0 · 0.75 | 11.3 min | ✓ in-band |
| 5 | autogen-code-executor | 15 · 4/6/3/1/1 · 1.60 | 17 · 5/6/4/1/1 · 1.00 | 13 · 2/5/5/1/0 · 1.30 | 20.0 min | ✓ in-band, all themes preserved |
| 6 | sweep | 13 · 4/5/2/1/1 · 1.35 | 14 · 4/7/2/0/1 · 0.75 | 12 · 4/5/3/0/0 · 1.45 | 22.5 min | ✓ exactly inside band (1.0-1.7), themes preserved |
| 7 | devika | 12 · 4/6/2/0/0 · 0.45 | 16 · 7/6/3/0/0 · 0.60 | 13 · 4/7/2/0/0 · 0.60 | 10.4 min | ✓ in-band, empty-file signal landed |
| 8 | aider | 12 · 4/6/2/0/0 · 1.45 | 13 · 4/6/3/0/0 · 1.45 | 13 · 3/5/4/1/0 · 1.55 | 22.0 min | ✓ in-band, two-sided test passes |
| 9 | openhands | 10 · 0/3/4/3/0 · 2.15 | 8 · 1/4/3/0/0 · 1.90 | 8 · 1/3/3/1/0 · 1.90 | 10.0 min | ✓ exact RAISE match (prev), two-sided test passes |
| 10 | deepagents-cli | 7 · 0/4/2/1/0 · 2.30 | 8 · 0/4/3/1/0 · 2.15 | 8 · 0/4/3/1/0 · 2.00 | 6.3 min | ✓ in-band (low end of 2.0–2.5), MCP coverage exercised; target required 3 attempts (see Section: Anomalies) |
| 11 | yaah | 10 · 2/4/4/0/0 · 2.20 | 9 · 0/5/3/1/0 · 2.30 | 8 · 0/5/3/0/0 · 2.30 | 7.6 min | ✓ exact RAISE match (prev), two-sided test passes, hookmap.go finding landed |
Aggregate: 118 findings (30C / 50H / 34M / 4L / 0I) across the eleven targets.
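
The aggregate line can be re-derived from the r4 column of the table. The following is a minimal sketch in Python; the dictionary and function names are illustrative, not part of the suite tooling, and the counts are copied by hand from the table above.

```python
# r4 severity counts per target, copied from the r4 column of the results
# table: (Critical, High, Medium, Low, Info).
R4_COUNTS = {
    "finbot": (6, 4, 3, 0, 0),
    "helperbot": (4, 5, 2, 0, 0),
    "langchain-sql": (2, 3, 3, 0, 0),
    "openai-customer-service": (4, 4, 3, 0, 0),
    "autogen-code-executor": (2, 5, 5, 1, 0),
    "sweep": (4, 5, 3, 0, 0),
    "devika": (4, 7, 2, 0, 0),
    "aider": (3, 5, 4, 1, 0),
    "openhands": (1, 3, 3, 1, 0),
    "deepagents-cli": (0, 4, 3, 1, 0),
    "yaah": (0, 5, 3, 0, 0),
}


def aggregate(counts):
    """Sum each severity column across all targets."""
    return tuple(sum(column) for column in zip(*counts.values()))


totals = aggregate(R4_COUNTS)
print(sum(totals), totals)  # 118 (30, 50, 34, 4, 0)
```
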
- /admin/finbot/goals, fraud-detection-not-invoked-before-approval, fraud_detection_enabled runtime-flippable, business-context override, manual-review threshold bypass, vendor auto-approval on registration, confidence_threshold declared-but-never-consulted, no auth on /admin/*, hardcoded SECRET_KEY, partial decision logging, unpinned deps, no rate limiting.
- 44057a8 after the 6-concurrent batch stalled (see Anomalies).
- max_execution_time. Maintainer’s create_sql_agent warning surfaced as positive.
- LocalCommandLineCodeExecutor warnings.warn instead of approval gate, os.environ copy into subprocess, create_default_code_executor() silent Docker→Local downgrade, Docker hardening absent (no user=/read_only=/mem_limit=/cap_drop=/network isolation), Jupyter timeouts soft, no per-execution audit log, DockerJupyterServer chmod(bind_dir, 0o777). New finding (006) names work-directory containment more precisely.
- WEBHOOK_SECRET, three subprocess.run(shell=True) sites with LLM-derived arguments, hardcoded PostHog key.
- firejail.py / code_runner.py 0-line stubs land as PRAX-001 and PRAX-002 per the Step 4 heuristic). Other tier compressions are within blind-run variance.
- Runner direct-subprocess, unauthenticated /api/settings POST, path traversal in save_code_to_project, compound RCE chain.
- `# ai!` auto-execution, abs_root_path() no repo-containment, /read-only / /add accept absolute + ~ paths, no secret scanner, auto-commit/auto-lint with no diff-accept, --no-verify commits.
- .mcp.json, kind=mcp tags landed on the right findings.
- yaah serve MCP server’s clean tool descriptions registered as positive.

The r4 run was launched as a 6-concurrent batch like r3’s first batch. Unlike r3, which hit transient “socket connection was closed unexpectedly” errors in a tight 38 s window, r4’s 6-concurrent batch hit the 600 s no-progress watchdog on 5 of 6 subagents, scattered across various scan steps (some pre-Step-9.9, some at the Step 9.9→10 transition, one at Step 12 report-back composition).
One subagent (openai-customer-service) actually completed the scan but stalled while composing its report-back to me, leaving valid outputs on disk.
Diagnostic: a solo helperbot subagent against the same 44057a8 SKILL completed cleanly in 5:31, identical to r3 helperbot’s profile. This confirmed the SKILL is not the regression source; the issue is concurrent-load-dependent variance. It is the same root pattern as the historical memory (feedback_regression_suite_parallel_runs): the safe ceiling is 4–8 concurrent, and 6 is within tolerance under good conditions, but the actual tolerance depends on conditions.
Mitigation: the 9 remaining scans were run as 3 sub-batches of 3 concurrent each. All 9 ultimately completed cleanly (deepagents-cli only after a further attempt, described next), in line with the historical “smaller batches succeed where larger ones stall under bad conditions” pattern.
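
The sub-batching mitigation can be sketched as follows. This is an illustration under stated assumptions, not the harness actually used for r4: `run_in_sub_batches`, `fake_scan`, and the `target-N` names are hypothetical, and `asyncio.wait_for` enforces a total per-scan time limit, which is a simplification of the 600 s no-progress watchdog described above.

```python
import asyncio


async def run_in_sub_batches(targets, scan, batch_size=3, stall_timeout=600.0):
    """Run scan(target) in sequential sub-batches of at most batch_size.

    Assumption: a total time limit per scan stands in for the real
    no-progress watchdog. Scans that exceed it are recorded as "stalled".
    """
    results = {}
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        # Scans inside one sub-batch run concurrently; sub-batches run in turn.
        outcomes = await asyncio.gather(
            *(asyncio.wait_for(scan(t), timeout=stall_timeout) for t in batch),
            return_exceptions=True,
        )
        for target, outcome in zip(batch, outcomes):
            stalled = isinstance(outcome, asyncio.TimeoutError)
            results[target] = "stalled" if stalled else outcome
    return results


async def fake_scan(target):
    """Hypothetical stand-in for dispatching one scan subagent."""
    await asyncio.sleep(0)
    return "ok"


# Hypothetical target names; nine scans in three sub-batches of three.
demo_targets = [f"target-{i}" for i in range(1, 10)]
demo = asyncio.run(run_in_sub_batches(demo_targets, fake_scan))
print(demo)
```

A stalled target can then be retried on its own, which mirrors how the deepagents-cli retries were handled outside the sub-batches.
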
deepagents-cli stalled at 6-concurrent (mid-Step-4 silence) AND at 3-concurrent (same failure mode — silence between announcing the read pass and issuing Read calls). A third attempt with an explicit “issue Reads in parallel, no planning paragraphs between” instruction and pre-loaded KB_MCP_SECURITY.md at Step 3 completed cleanly in 6:15. The pathology appears to be worker-overplanning specific to this target’s complexity (deferred KB load + Step 6 MCP Server Evaluation + “controls present” two-sided test). Worth a SKILL note in a future release recommending the worker dispatch Step 4 reads immediately rather than via a planning paragraph; not a tag blocker.
The openai-customer-service initial-batch failure occurred while the worker was composing its long report-back to me: the scan itself completed and the render produced clean outputs, but the worker died composing the human-readable summary, which wasn’t actually load-bearing. Adopting the explicit “KEEP IT TIGHT” instruction in subsequent sub-batch briefs eliminated this failure mode across all 10 remaining scans.
STATUS: PASS — 11 of 11 targets completed cleanly with the post-polish SKILL (44057a8); all dominant Critical themes preserved; weighted RAISE values inside their per-target bands except langchain-sql (+0.10 above, same as r3, calibration variance not regression). The polish commit landed cleanly. The SKILL is validated for tag.
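
The numeric part of the pass criterion (RAISE close to the v0.7.0 baseline and inside the per-target band) can be sketched as a small check. Only the two bands stated explicitly in this report are used (sweep 1.0–1.7, deepagents-cli 2.0–2.5). The function name and the 0.5 tolerance are assumptions, the latter taken as the wide end of the stated ±0.3–0.5 range, and the hard gate on dominant Critical themes is not modelled here.

```python
def passes_numeric_gate(baseline, observed, band, tolerance=0.5):
    """Return True if observed RAISE is near baseline and inside the band.

    tolerance=0.5 is an assumption (upper end of the stated +/-0.3-0.5).
    """
    low, high = band
    near_baseline = abs(observed - baseline) <= tolerance
    in_band = low <= observed <= high
    return near_baseline and in_band


# (v0.7.0 baseline RAISE, r4 RAISE, band) for the two targets whose bands
# are quoted in the results table.
CASES = {
    "sweep": (1.35, 1.45, (1.0, 1.7)),
    "deepagents-cli": (2.30, 2.00, (2.0, 2.5)),
}

for target, (baseline, observed, band) in CASES.items():
    print(target, passes_numeric_gate(baseline, observed, band))
```
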
| Stat | Value |
|---|---|
| Targets scanned by subagent | 11 |
| Range | 5.5 min (helperbot) — 22.5 min (sweep) |
| Median | ~10 min |
| Mean | ~12 min |
| Total subagent model time | ~133 min (~2 hr 13 min) across all scans (excluding failed attempts) |
| Wallclock end-to-end (initial launch → final completion) | ~2 hr 40 min (initial 6-concurrent batch + diagnostic + 3 sub-batches of 3 + deepagents-cli 3rd attempt) |
| Failure-and-retry overhead | 1 stalled batch (6 subagents, ~12 min wasted each), 1 mid-batch deepagents-cli retry stall (1 attempt). Most scans completed first-try at 3-concurrent. |
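
The timing statistics can be recomputed from the Duration column of the results table. A minimal sketch, with variable names chosen for illustration:

```python
from statistics import mean, median

# Successful-scan durations in minutes, copied from the Duration column.
DURATIONS_MIN = {
    "finbot": 8.9,
    "helperbot": 5.5,
    "langchain-sql": 8.4,
    "openai-customer-service": 11.3,
    "autogen-code-executor": 20.0,
    "sweep": 22.5,
    "devika": 10.4,
    "aider": 22.0,
    "openhands": 10.0,
    "deepagents-cli": 6.3,
    "yaah": 7.6,
}

values = list(DURATIONS_MIN.values())
fastest = min(DURATIONS_MIN, key=DURATIONS_MIN.get)
slowest = max(DURATIONS_MIN, key=DURATIONS_MIN.get)

print(fastest, slowest)        # helperbot sweep
print(median(values))          # 10.0
print(round(mean(values), 1))  # 12.1
print(round(sum(values), 1))   # 132.9
```

These agree with the rounded figures in the table (median ~10 min, mean ~12 min, total ~133 min).
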
| Target | Δ findings | Δ RAISE | Verdict |
|---|---|---|---|
| finbot | −3 | −0.10 | ✓ |
| helperbot | 0 | 0.00 | ✓ exact RAISE |
| langchain-sql | −4 (consolidation) | 0.00 | ⚠ stable +0.10 above band (same as r3) |
| openai-customer-service | −2 | −0.25 | ✓ |
| autogen-code-executor | −4 (consolidation) | +0.30 (UP into band) | ✓ corrected direction vs r3 |
| sweep | −2 | +0.70 (UP into band) | ✓ corrected direction vs r3 |
| devika | −3 | 0.00 | ✓ exact RAISE |
| aider | 0 | +0.10 | ✓ |
| openhands | 0 | 0.00 | ✓ exact RAISE |
| deepagents-cli | 0 | −0.15 | ✓ |
| yaah | −1 | 0.00 | ✓ exact RAISE |
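
The deltas in this table are r4 minus r3, and can be reproduced from the finding counts and RAISE values in the results table. A minimal sketch, with illustrative names:

```python
# (finding count, weighted RAISE) per target, copied from the r3 and r4
# columns of the results table.
R3 = {
    "finbot": (16, 0.70), "helperbot": (11, 0.45), "langchain-sql": (12, 1.30),
    "openai-customer-service": (13, 1.00), "autogen-code-executor": (17, 1.00),
    "sweep": (14, 0.75), "devika": (16, 0.60), "aider": (13, 1.45),
    "openhands": (8, 1.90), "deepagents-cli": (8, 2.15), "yaah": (9, 2.30),
}
R4 = {
    "finbot": (13, 0.60), "helperbot": (11, 0.45), "langchain-sql": (8, 1.30),
    "openai-customer-service": (11, 0.75), "autogen-code-executor": (13, 1.30),
    "sweep": (12, 1.45), "devika": (13, 0.60), "aider": (13, 1.55),
    "openhands": (8, 1.90), "deepagents-cli": (8, 2.00), "yaah": (8, 2.30),
}


def deltas(previous, current):
    """Return {target: (delta findings, delta RAISE)} as current - previous."""
    out = {}
    for target, (n_now, raise_now) in current.items():
        n_prev, raise_prev = previous[target]
        # Round to two decimals to hide binary floating-point noise.
        out[target] = (n_now - n_prev, round(raise_now - raise_prev, 2))
    return out


for target, (d_findings, d_raise) in deltas(R3, R4).items():
    print(f"{target}: {d_findings:+d} findings, {d_raise:+.2f} RAISE")
```
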
44057a8) is clean. No SKILL-level regression. All 11 targets completed; all themes preserved; calibration variance is within tolerance and matches the historical pattern.

tests/README.md (widen to 0.6–1.4) or the MSC scoring guidance for mature libraries, but neither blocks tagging 0.7.3.

The 0.7.3 SKILL changes through 44057a8 resolve the original subagent watchdog stalls (validated by r3) and the polish commit doesn’t regress that resolution (validated by r4). The remaining concurrency variance is operational, not a SKILL bug; it’s documented in feedback_regression_suite_parallel_runs and now in this SUITE_RUN’s Anomalies section.
Recommendation: Proceed with the 0.7.3 release. Pre-tag steps still owed: version bump (4 files in sync) + CHANGELOG [0.7.3] entry + plugin-install smoke check + explicit merge approval. No SKILL regression to fix; the deepagents-cli worker-overplanning pathology and the langchain-sql band-floor wobble are minor follow-ups for a future release, not blockers for this one.
All eleven targets have the four canonical outputs in `<target>-out/reports/`:

- `<target>-findings-2026-05-25.json`
- `<target>-analysis-<TIMESTAMP>.html`
- `<target>-analysis-<TIMESTAMP>.txt`
- `<target>-draft-<TIMESTAMP>.md` (Step 9.9 working manifest; demonstrates the full-prose discipline still holding under 44057a8)

If committing to tests/runs/v0.7.3-prerelease-r4/, copy the three deliverables per target (drop the draft manifests per tests/runs/README.md convention).