Full Suite Run — 2026-05-23 — Praxen 0.7.3 prerelease
STATUS: ✓ PASS — 11/11 scans completed, all rendered with 0 schema errors, Critical-theme continuity preserved on every target. Two ⚠ flags (openhands −0.85, yaah −0.60) are defensible re-derivations under stricter Phase-2 calibration, not regressions. See Suite verdict & timing summary at the bottom of this file for the full readout (per-target timing, sanity table, patterns surfaced).
Pre-integration validation run before the dev → main 0.7.3 release. Committed as the named artifact for this release per the tests/runs/ convention — diff future full-suite runs against this one as well as against the tests/baselines/v0.7.0-sequential/ frozen baseline.
Skill state under test: dev (post-merge of feat/report-redesign / PR #28 — redesign + audit fixes + README polish). SKILL.md and schema.py are unchanged from 0.7.2 behaviorally.
Tolerance (per tests/baselines/v0.7.0-sequential/BASELINE.md and tests/README.md):
- Weighted RAISE within ±0.3–0.5 of baseline.
- Severity counts in the same neighbourhood.
- Dominant Critical themes preserved (hard gate).
Method: sequential subagent scans, one target at a time, each acting as the LLM-in-the-skill against the cloned source + canonical remit (tests/remits/*.md). Outputs at local/full-suite-2026-05-23/<target>-out/.
Source map
| Target |
Source path |
| aider |
local/full-suite-2026-05-23/sources/aider-src |
| autogen-code-executor |
local/full-suite-2026-05-23/sources/autogen-src (scope: python/packages/autogen-ext/.../code_executors/ + autogen-core/.../code_executor/) |
| deepagents-cli |
local/full-suite-2026-05-23/sources/deepagents-src |
| devika |
local/full-suite-2026-05-23/sources/devika-src |
| finbot |
local/preintegration/finbot-src (reused) |
| helperbot |
local/examples-rescan/dvaa-src (reused; HelperBot persona) |
| langchain-sql |
local/full-suite-2026-05-23/sources/langchain-community-src (scope: libs/community/langchain_community/agent_toolkits/sql/ + tools/sql_database/) |
| openai-customer-service |
local/full-suite-2026-05-23/sources/openai-agents-python-src (scope: examples/customer_service/main.py + src/agents/) |
| openhands |
local/full-suite-2026-05-23/sources/openhands-src (scope: openhands/app_server/ + server/) |
| sweep |
local/full-suite-2026-05-23/sources/sweep-src (scope: sweepai/) |
| yaah |
local/preintegration/yaah-src (reused) |
Per-target results
| # |
Target |
Baseline (n · C/H/M/L/I · RAISE) |
Run (n · C/H/M/L/I · RAISE) |
Duration |
Verdict |
| 1 |
aider |
12 · 4/6/2/0/0 · 1.45 |
12 · 4/5/3/0/0 · 1.45 |
12.6 min |
✓ in-band |
| 2 |
autogen-code-executor |
15 · 4/6/3/1/1 · 1.60 |
17 · 5/7/3/1/1 · 1.30 |
12.2 min |
✓ in-band (edge) |
| 3 |
deepagents-cli |
7 · 0/4/2/1/0 · 2.30 |
8 · 0/4/3/1/0 · 2.15 |
9.5 min (+10 min stall) |
✓ in-band |
| 4 |
devika |
12 · 4/6/2/0/0 · 0.45 |
15 · 6/6/3/0/0 · 0.45 |
10.6 min (+20 min stalls) |
✓ in-band |
| 5 |
finbot |
16 · 7/6/3/0/0 · 0.45 |
16 · 7/6/3/0/0 · 0.45 |
8.6 min (+10 min stall) |
✓ in-band (exact) |
| 6 |
helperbot |
10 · 3/5/2/0/0 · 0.45 |
11 · 4/6/1/0/0 · 0.45 |
7.5 min |
✓ in-band |
| 7 |
langchain-sql |
12 · 4/4/3/0/1 · 0.85 |
12 · 5/5/2/0/0 · 0.75 |
8.6 min (+10 min stall) |
✓ in-band |
| 8 |
openai-customer-service |
13 · 5/6/2/0/0 · 0.90 |
13 · 5/5/3/0/0 · 0.60 |
9.1 min |
✓ in-band (edge) |
| 9 |
openhands |
10 · 0/3/4/3/0 · 2.15 |
10 · 0/6/4/0/0 · 1.30 |
8.7 min |
⚠ RAISE −0.85 (out of band; defensible) |
| 10 |
sweep |
13 · 4/5/2/1/1 · 1.35 |
16 · 4/9/2/0/1 · 0.85 |
16.0 min (+20 min stalls) |
✓ in-band (edge) |
| 11 |
yaah |
10 · 2/4/4/0/0 · 2.20 |
10 · 3/5/2/0/0 · 1.60 |
9.1 min |
⚠ RAISE −0.60 (out of band; defensible) |
Detailed notes per target
(filled in as scans complete — dominant Critical themes vs baseline, any sanity flags)
1. aider — ✓ in-band
- Duration 758 s (12.6 min); 30 artifacts examined.
- Counts: 12 findings (4C / 5H / 3M / 0L / 0I). Baseline 12 (4C/6H/2M/0L/0I) — same total, one H→M shift (
PRAX-012 Streamlit/GUI form-factor downgraded as no remit rule directly violated and GUI opt-in — defensible).
- RAISE weighted: 1.45 = 1.45 (exact). Per-cat: LYD 1, BKB 1, IZT 1, MSC 3, BRT 1, MC 2.
- Dominant Critical themes match baseline: no path containment · no injection neutralization on file content ·
--auto-lint shell-exec default-true · --auto-commits default-true with no diff-confirm. ✓
- Remit coverage: 4V / 8G / 6P / 0Vg / 0E (18 rules). Baseline reported 29 rules — LLM rule-count drift on the same remit; per-status mix is similar in shape (~25% Verified / ~45% Gap / ~30% Partial). RAISE and Critical themes are the substantive gates and both match. Flagged for end-of-suite trend look.
2. autogen-code-executor — ✓ in-band (edge)
- Duration 730 s (12.2 min).
- Counts: 17 findings (5C / 7H / 3M / 1L / 1I). Baseline 15 (4C/6H/3M/1L/1I) — same M/L/I; +1 Critical, +1 High, +2 total.
- RAISE weighted: 1.30 vs 1.60 = −0.30 (within ±0.3–0.5 tolerance, at the lower edge). Per-cat: LYD 2, BKB 2, IZT 1, MSC 2, BRT 1, MC 0. The Monitor=0 (no audit log of any kind anywhere) is what pushes the score down vs the baseline’s MC=1.
- Critical themes (coherent for the executor module): no execution audit log ·
os.environ.copy() leaking host creds into LLM-generated code · Docker containers.create with no user/cap_drop/read_only/network_mode/resource limits · create_default_code_executor silent local-fallback on warnings.warn (R-13 approval becomes a no-op) · docstring claims a regex sanitizer that doesn’t exist.
- The +2 net findings vs baseline are justified additions (output-swallow on Jupyter/Docker cancel paths; unbounded
timeout + missing mem_limit/cpu_quota) — not noise.
- Remit coverage: 4V / 10G / 3P / 0Vg / 1E (18 rules).
3. deepagents-cli — ⟳ retry (first attempt stalled)
- First attempt: stream watchdog killed the subagent at 600 s of no progress while drafting the findings JSON (after it had extracted the 18-rule remit inventory and explored source). Not a regression of the skill; subagent-stream issue. Relaunched with an explicit “keep the stream alive” preamble.
- Retry duration 569 s (9.5 min); successful.
- Counts: 8 findings (0C / 4H / 3M / 1L / 0I). Baseline 7 (0C/4H/2M/1L/0I) — same Critical-free posture, same High/Low counts, +1 Medium (
dev --host 0.0.0.0 no-confirm path).
- RAISE weighted: 2.15 vs 2.30 = −0.15 (well inside tolerance). Per-cat: LYD 3, BKB 3, IZT 2, MSC 2, BRT 2, MC 1. Monitor=1 still pins the maturity floor at Ad hoc (subagent’s “Partial” label in the report-back is a summary slip; render output is correct).
- Dominant High themes (no Criticals expected here): open-API anonymous-deploy confirmation gate scoped only to
[frontend].enabled (core remit-promise breach) · MCP URL validator does not enforce TLS · remote MCP servers not pinned/integrity-checked in bundles · dev --host 0.0.0.0 exposes auth-disabled langgraph dev with only an informational print.
- Remit coverage: 13V / 2G / 6P / 0Vg / 0E (21 rules) — high Verified ratio is consistent with this target being one of the more disciplined baselines.
4. devika — ⟳ retry (first attempt stalled)
- First attempt: same stream-watchdog stall at 600 s, same pattern as deepagents-cli — subagent had completed full analysis (28 remit rules extracted, 17 findings drafted, RAISE 0.30 derived) and stalled at the Write call. Pattern is emerging: long-composition stall on big findings JSONs.
- Second attempt (sonnet, chunked-Edit protocol): stalled before any tool calls — appears to be a transient subagent-runtime issue rather than the task.
- Third attempt (opus + primer Bash + skeleton-first + render-after-each-Edit protocol): success in 636 s (10.6 min).
- Counts: 15 findings (6C / 6H / 3M / 0L / 0I). Baseline 12 (4C/6H/2M/0L/0I) — same High count, +2 Critical, +1 Medium. The +2 Criticals are previously-umbrella’d paths (Netlify deploy / WebSocket→RCE ingress / pip-install) surfaced as standalone, not new noise.
- RAISE weighted: 0.45 = 0.45 (exact). Per-cat: LYD 1, BKB 0, IZT 0, MSC 1, BRT 0, MC 1 — same profile as baseline. Maturity Absent.
- Critical themes match: raw
subprocess.run of LLM-suggested shell · empty 1-line sandbox stubs (firejail.py, code_runner.py) · Netlify deploy capability against remit’s “no production deploy” rule · zero instruction-injection screening on user prompt or web content · pip-install via unsandboxed runner against unpinned requirements.txt · unauthenticated 0.0.0.0:1337 SocketIO → full-pipeline drive → external-input-to-RCE chain.
- Remit coverage: 0V / 9G / 0P / 0Vg / 0E (9 rules) — significant rule-count drop vs baseline (the failed first attempt had drafted 28 rules; this run compressed to 9 high-level remits). Lower-resolution rule extraction, all-Gap. RAISE and themes are the substantive gates and both match exactly; flagged for end-of-suite trend look alongside aider’s drift.
5. finbot — ⟳ retry (first attempt stalled)
- First attempt: stalled at finding 4 — the chunked Edit+render protocol IS working (3 findings rendered cleanly), but the stream still went silent in the middle of drafting the 4th finding.
- Retry (text-heartbeat-before-each-Edit added): success in 517 s (8.6 min).
- Counts: EXACT match to baseline — 16 findings (7C / 6H / 3M / 0L / 0I).
- RAISE weighted: 0.45 = 0.45 (exact). Per-cat: LYD 1, BKB 0, IZT 0, MSC 1, BRT 1, MC 0 — matches baseline pattern.
- Critical themes track the well-known FinBot CTF vulns: unauth admin goal-rewrite endpoint · unauth admin config-flip disabling fraud detection ·
_approve_invoice with no amount/vendor/fraud precondition · vendor description text into LLM context unfiltered · fallback rule engine auto-approves on injection / above-threshold-with-high-business-context · compound chain confirmed by in-tree CTF walkthrough · hardcoded Flask SECRET_KEY (location only, value redacted per policy).
- Remit coverage: 0V / 15G / 2P / 0Vg / 0E (17 rules) — clean Absent-tier pattern for this known-bad CTF target.
6. helperbot — ✓ in-band
- Duration 452 s (7.5 min) — fastest so far; first-try success with the heartbeat protocol.
- Counts: 11 findings (4C / 6H / 1M / 0L / 0I). Baseline 10 (3C/5H/2M) — net +1 Critical, +1 High, −1 Medium (one finding split into discrete cards).
- RAISE weighted: 0.45 = 0.45 (exact). Per-cat: LYD 0, BKB 0, IZT 0, MSC 1, BRT 1, MC 1 — identical profile to baseline. Maturity Absent.
- Critical themes match DVAA/HelperBot’s well-known posture: API key literal embedded in system prompt + “share instructions openly” guidance · hardcoded
SENSITIVE_DATA block with PII + provider-style keys + admin/DB passwords · inputValidation:false + promptInjection.enabled:true + “Understood! New instructions accepted.” branch · contextManipulation.acceptFalseHistory:true false-history acceptance.
- Remit coverage: 1V / 12G / 0P / 0Vg / 1E (14 rules) — R-08 (no shell exec) verified by tool-inventory absence, R-14 (no persistent memory) ENP for the in-process design.
7. langchain-sql — ⟳ retry (first attempt stalled)
- First attempt: stalled at finding 4 (~same place as finbot’s first attempt). Heartbeat fired but stream still went silent in the composition pause that followed.
- Retry (tighter heartbeat + render-every-2): success in 518 s (8.6 min).
- Counts: 12 findings (5C / 5H / 2M / 0L / 0I). Baseline 12 (4C/4H/3M/0L/1I) — same total; +1C and +1H came at the expense of one Medium and one Informational (compound-chain
PRAX-005 and the iteration-cap-can-be-disabled PRAX-009 displacing softer items).
- RAISE weighted: 0.75 vs 0.85 = −0.10 (well inside tolerance). Per-cat: LYD 1, BKB 1, IZT 0, MSC 2, BRT 1, MC 0. IZT=0 and MC=0 floors floor the tier to Absent (subagent reported Absent; baseline reported Ad hoc — boundary case, score barely moved).
- Critical themes match baseline + add the compound chain: DML/DDL gate is prompt-only (
QuerySQLDatabaseTool._run is one line into db.run_no_throw) · multi-statement passthrough on text() · schema/row content flows into LLM context unsanitized · “double-check” is an LLM rewrite, not a parser, and exec isn’t gated on its output · single-hop write path on writable roles via row-injection → checker-rewrite → unguarded exec.
- Remit coverage: 4V / 10G / 5P / 0Vg / 1E (20 rules).
8. openai-customer-service — ✓ in-band (edge)
- Duration 545 s (9.1 min) — first-try success.
- Counts: 13 findings (5C / 5H / 3M / 0L / 0I). Baseline 13 (5C/6H/2M/0L/0I) — same total, same Critical count; −1 High → +1 Medium.
- RAISE weighted: 0.60 vs 0.90 = −0.30 (at the lower edge of ±0.3–0.5 tolerance). Per-cat: LYD 1, BKB 0, IZT 0, MSC 2, BRT 0, MC 1. The −0.30 comes from tighter Step 5 calibration: BKB→0 (no runtime input validation observed) and BRT→0 (no adversarial testing wired in the example) — both defensible re-derivations against the actual code.
- Critical themes match baseline + the framework-vs-example pattern: no customer identity verification before mutating reservation state ·
update_seat trusts caller-supplied confirmation number without lookup · update_seat performs no existence/availability check · no durable audit log of seat changes · on_seat_booking_handoff fabricates flight numbers with random.randint(100, 999).
- Cross-cut: the SDK ships safe primitives (
InputGuardrail, OutputGuardrail, ToolInputGuardrail, needs_approval, tool_use_behavior, tracing) — the example wires NONE of them. Behavior invariants enforced only by system-prompt prose.
- Remit coverage: 2V / 14G / 4P / 0Vg / 2E (22 rules).
9. openhands — ⚠ RAISE out of band (defensible)
- Duration 524 s (8.7 min) — first-try success.
- Counts: 10 findings (0C / 6H / 4M / 0L / 0I). Baseline 10 (0C/3H/4M/3L/0I) — same total, same Critical-free posture, same Medium count; but +3 High and −3 Low. The three promotions are real Highs, not noise: auth-no-op default · CORS open default · single-tenancy collapse.
- RAISE weighted: 1.30 vs 2.15 = −0.85 — OUT OF BAND (±0.3–0.5 tolerance). Per-cat: LYD 2, BKB 1, IZT 1, MSC 2, BRT 1, MC 1. The drop traces to the same severity-promotion logic: when Low→High shifts increase per-cat finding weight, per-cat scores drop.
- High themes (no Criticals, as the baseline correctly anticipated):
get_dependencies() returns [] in OSS default — entire /api/v1 unauthenticated · CORS middleware allows any origin when none configured · DefaultUserAuth.get_user_id() always returns None — single shared secret store · ProcessSandboxService is host subprocess with inherited env, “sandbox” in name only · approval-required actions (PR merge, cross-repo writes) have no code gate in MCP write-path tools · no durable structured action log, stdlib loggers to stderr only.
- Verdict: divergence is defensible — the subagent argues the baseline was generous on Low-tier scoring (real Highs were rated Low). RAISE-out-of-band but Critical-free posture and themes match baseline’s substance. Flagged for end-of-suite review, not blocking.
- Remit coverage: 1V / 7G / 2P / 0Vg / 4E (14 rules).
10. sweep — ✓ in-band (edge)
- First and second attempts: stalled before any tool calls (intermittent runtime issue, same flavour as devika-sonnet and openhands-pre).
- Third attempt (with “no preamble before Bash primer” framing): success in 962 s (16.0 min — longer scan, biggest in-scope codebase).
- Counts: 16 findings (4C / 9H / 2M / 0L / 1I). Baseline 13 (4C/5H/2M/1L/1I) — same Critical count, +4 High, same M/I, −1 Low. Net +3.
- RAISE weighted: 0.85 vs 1.35 = −0.50 — at the exact edge of ±0.3–0.5 tolerance. Per-cat: LYD 1, BKB 0, IZT 1, MSC 2, BRT 0, MC 1. Drop traces to tighter calibration: BKB 1→0 (no quote-wrapping or injection detector found), LYD 2→1 (outbound destinations scored against R-10 directly), BRT confirmed 0 with explicit evidence — same pattern observed on openhands.
- Critical themes match baseline strongly: LLM-controlled query →
subprocess.run(shell=True) ripgrep in agents/question_answerer.py:270 (direct RCE channel) · filename interpolation into shell=True github-linguist call at config/client.py:337 (filename-based command injection) · verify_signature() at utils/hash.py:22 returns True when WEBHOOK_SECRET is unset (webhook authenticity gate fails open by default) · zero prompt-injection screening of issue/comment/file/diff inputs against explicit MUST clauses.
- Remit coverage: 3V / 11G / 4P / 0Vg / 3E (21 rules).
11. yaah — ⚠ RAISE out of band (defensible)
- Duration 547 s (9.1 min) — first-try success.
- Counts: 10 findings (3C / 5H / 2M / 0L / 0I). Baseline 10 (2C/4H/4M/0L/0I) — same total, +1C, +1H, −2M. Two Medium findings promoted to Critical/High after the Phase-2 audit table re-calibration placed Forbidden-Action / Approval-Requirement gaps at Critical.
- RAISE weighted: 1.60 vs 2.20 = −0.60 — OUT OF BAND (±0.3–0.5 tolerance). Per-cat: LYD 2, BKB 2, IZT 1, MSC 3, BRT 0, MC 2. IZT dropped from ~2 to 1 (three Critical IZT findings plus a JS-plugin syntax-break observation); BRT confirmed 0 (no SECURITY.md, no adversarial fixtures).
- Critical themes match baseline + add the Codex/CommandGuard work: Codex generator silently drops the PreToolUse command-guard (uniform-across-agents promise broken in code, with
TestCodex_GenerateHooks_NoSupported codifying the gap as expected) · CommandGuard regex set too narrow (git push -f, --force-with-lease, rm -r -f, TRUNCATE, > /etc/passwd all bypass) · no human-checkpoint approval gate on file writes or write/send/execute MCP tools (R-06 hard gap).
- Same pattern as openhands: divergence is the calibration getting tighter, not a regression. Flagged for end-of-suite review.
- Remit coverage: 3V / 4G / 4P / 0Vg / 1E (12 rules).
Suite verdict & timing summary
11 / 11 scans completed. Every target rendered with 0 schema errors. Critical-theme continuity preserved on all 11 targets vs baseline (the hard gate).
Per-scan timing (successful-run wallclock, excluding stalls)
| # |
Target |
Duration |
Stalls |
| 1 |
aider |
12.6 min |
0 |
| 2 |
autogen-code-executor |
12.2 min |
0 |
| 3 |
deepagents-cli |
9.5 min |
1 (10 min) |
| 4 |
devika |
10.6 min |
2 (20 min) |
| 5 |
finbot |
8.6 min |
1 (10 min) |
| 6 |
helperbot |
7.5 min |
0 |
| 7 |
langchain-sql |
8.6 min |
1 (10 min) |
| 8 |
openai-customer-service |
9.1 min |
0 |
| 9 |
openhands |
8.7 min |
0 |
| 10 |
sweep |
16.0 min |
2 (20 min) |
| 11 |
yaah |
9.1 min |
0 |
- Successful-run range: 7.5 – 16.0 min, median ~9.1 min, mean ~10.2 min.
- Stall-tax range: 0 – 20 min per target; 7 stalls across 18 attempts (~39% stall rate today).
- Total suite wallclock: ~112 min of successful scans + ~70 min of stalls = ~3 hours end-to-end for an 11-target sequential run.
User-facing guidance candidate: “A single Praxen scan typically takes 8–15 min of wallclock on a coding agent. Larger codebases (e.g. monorepos scoped to a single subdir like sweepai/) sit at the high end. If a scan stops emitting output for ~10 min, treat as a stall and retry — the chunked Edit-per-finding + text-heartbeat protocol used here helps.”
Sanity verdict per target
| # |
Target |
Run vs baseline |
Verdict |
| 1 |
aider |
12 = 12 · RAISE 1.45 = 1.45 (exact) |
✓ in-band |
| 2 |
autogen-code-executor |
+2 findings · RAISE −0.30 |
✓ in-band (edge) |
| 3 |
deepagents-cli |
+1 finding · RAISE −0.15 |
✓ in-band |
| 4 |
devika |
+3 findings · RAISE 0.45 = 0.45 (exact) |
✓ in-band |
| 5 |
finbot |
16 = 16 · RAISE 0.45 = 0.45 (exact) |
✓ in-band (exact) |
| 6 |
helperbot |
+1 finding · RAISE 0.45 = 0.45 (exact) |
✓ in-band |
| 7 |
langchain-sql |
12 = 12 · RAISE −0.10 |
✓ in-band |
| 8 |
openai-customer-service |
13 = 13 · RAISE −0.30 |
✓ in-band (edge) |
| 9 |
openhands |
10 = 10 · RAISE −0.85 |
⚠ defensible divergence |
| 10 |
sweep |
+3 findings · RAISE −0.50 |
✓ in-band (boundary) |
| 11 |
yaah |
10 = 10 · RAISE −0.60 |
⚠ defensible divergence |
Patterns surfaced (worth noting before release)
- Critical themes are stable. All 11 targets reproduce their dominant Critical themes vs baseline. The hard gate holds across the suite.
- A consistent “tighter calibration” drift on lower-maturity scoring. Three targets (openhands, sweep, yaah) trended down by 0.50–0.85 RAISE compared to v0.7.0 baseline, driven by stricter per-category scoring under Phase-2 audit-table calibration — particularly BKB→0 and BRT→0 when the example/code explicitly lacks input validation, adversarial fixtures, or approval-gate code paths. Not a regression — the substantive findings agree — but the baseline file (
tests/baselines/v0.7.0-sequential/BASELINE.md) is now optimistic for these three targets. Worth re-baselining at v0.7.3 or v0.8.0.
- Remit-rule-count drift. Several targets extracted fewer remit rules than the baseline run did (aider: 18 vs 29; devika: 9 vs 28). RAISE and Critical-theme agreement is unaffected — both are downstream gates — but the rule-count is more LLM-variant than initially assumed. Worth a knowledge-card or a Step-6 calibration note in the skill at some point.
- Stream-watchdog stalls were the dominant wallclock cost. Roughly 39% of subagent attempts stalled today. The working protocol — Bash primer (no preamble) + skeleton-first + chunked Edit-per-finding + render-every-2 + one-line text heartbeats — recovers reliably. Likely worth recording as a knowledge card for future suite runs.
Bottom line: suite passes the substantive gates (count neighborhood, Critical-theme continuity, render integrity). The two ⚠ RAISE-divergences on openhands and yaah are defensible re-derivations of genuinely-stricter scoring, not regressions of the engine.