praxen

Full Suite Run — 2026-05-23 — Praxen 0.7.3 prerelease

STATUS: ✓ PASS — 11/11 scans completed, all rendered with 0 schema errors, Critical-theme continuity preserved on every target. Two ⚠ flags (openhands −0.85, yaah −0.60) are defensible re-derivations under stricter Phase-2 calibration, not regressions. See Suite verdict & timing summary at the bottom of this file for the full readout (per-target timing, sanity table, patterns surfaced).

Pre-integration validation run before the dev → main 0.7.3 release. Committed as the named artifact for this release per the tests/runs/ convention — diff future full-suite runs against this one as well as against the tests/baselines/v0.7.0-sequential/ frozen baseline.

Skill state under test: dev (post-merge of feat/report-redesign / PR #28 — redesign + audit fixes + README polish). SKILL.md and schema.py are unchanged from 0.7.2 behaviorally.

Tolerance (per tests/baselines/v0.7.0-sequential/BASELINE.md and tests/README.md):

Method: sequential subagent scans, one target at a time, each acting as the LLM-in-the-skill against the cloned source + canonical remit (tests/remits/*.md). Outputs at local/full-suite-2026-05-23/<target>-out/.

Source map

Target Source path
aider local/full-suite-2026-05-23/sources/aider-src
autogen-code-executor local/full-suite-2026-05-23/sources/autogen-src (scope: python/packages/autogen-ext/.../code_executors/ + autogen-core/.../code_executor/)
deepagents-cli local/full-suite-2026-05-23/sources/deepagents-src
devika local/full-suite-2026-05-23/sources/devika-src
finbot local/preintegration/finbot-src (reused)
helperbot local/examples-rescan/dvaa-src (reused; HelperBot persona)
langchain-sql local/full-suite-2026-05-23/sources/langchain-community-src (scope: libs/community/langchain_community/agent_toolkits/sql/ + tools/sql_database/)
openai-customer-service local/full-suite-2026-05-23/sources/openai-agents-python-src (scope: examples/customer_service/main.py + src/agents/)
openhands local/full-suite-2026-05-23/sources/openhands-src (scope: openhands/app_server/ + server/)
sweep local/full-suite-2026-05-23/sources/sweep-src (scope: sweepai/)
yaah local/preintegration/yaah-src (reused)

Per-target results

# Target Baseline (n · C/H/M/L/I · RAISE) Run (n · C/H/M/L/I · RAISE) Duration Verdict
1 aider 12 · 4/6/2/0/0 · 1.45 12 · 4/5/3/0/0 · 1.45 12.6 min ✓ in-band
2 autogen-code-executor 15 · 4/6/3/1/1 · 1.60 17 · 5/7/3/1/1 · 1.30 12.2 min ✓ in-band (edge)
3 deepagents-cli 7 · 0/4/2/1/0 · 2.30 8 · 0/4/3/1/0 · 2.15 9.5 min (+10 min stall) ✓ in-band
4 devika 12 · 4/6/2/0/0 · 0.45 15 · 6/6/3/0/0 · 0.45 10.6 min (+20 min stalls) ✓ in-band
5 finbot 16 · 7/6/3/0/0 · 0.45 16 · 7/6/3/0/0 · 0.45 8.6 min (+10 min stall) ✓ in-band (exact)
6 helperbot 10 · 3/5/2/0/0 · 0.45 11 · 4/6/1/0/0 · 0.45 7.5 min ✓ in-band
7 langchain-sql 12 · 4/4/3/0/1 · 0.85 12 · 5/5/2/0/0 · 0.75 8.6 min (+10 min stall) ✓ in-band
8 openai-customer-service 13 · 5/6/2/0/0 · 0.90 13 · 5/5/3/0/0 · 0.60 9.1 min ✓ in-band (edge)
9 openhands 10 · 0/3/4/3/0 · 2.15 10 · 0/6/4/0/0 · 1.30 8.7 min ⚠ RAISE −0.85 (out of band; defensible)
10 sweep 13 · 4/5/2/1/1 · 1.35 16 · 4/9/2/0/1 · 0.85 16.0 min (+20 min stalls) ✓ in-band (edge)
11 yaah 10 · 2/4/4/0/0 · 2.20 10 · 3/5/2/0/0 · 1.60 9.1 min ⚠ RAISE −0.60 (out of band; defensible)

Detailed notes per target

(filled in as scans complete — dominant Critical themes vs baseline, any sanity flags)

1. aider — ✓ in-band

2. autogen-code-executor — ✓ in-band (edge)

3. deepagents-cli — ⟳ retry (first attempt stalled)

4. devika — ⟳ retry (first attempt stalled)

5. finbot — ⟳ retry (first attempt stalled)

6. helperbot — ✓ in-band

7. langchain-sql — ⟳ retry (first attempt stalled)

8. openai-customer-service — ✓ in-band (edge)

9. openhands — ⚠ RAISE out of band (defensible)

10. sweep — ✓ in-band (edge)

11. yaah — ⚠ RAISE out of band (defensible)


Suite verdict & timing summary

11 / 11 scans completed. Every target rendered with 0 schema errors. Critical-theme continuity preserved on all 11 targets vs baseline (the hard gate).

Per-scan timing (successful-run wallclock, excluding stalls)

# Target Duration Stalls
1 aider 12.6 min 0
2 autogen-code-executor 12.2 min 0
3 deepagents-cli 9.5 min 1 (10 min)
4 devika 10.6 min 2 (20 min)
5 finbot 8.6 min 1 (10 min)
6 helperbot 7.5 min 0
7 langchain-sql 8.6 min 1 (10 min)
8 openai-customer-service 9.1 min 0
9 openhands 8.7 min 0
10 sweep 16.0 min 2 (20 min)
11 yaah 9.1 min 0

User-facing guidance candidate: “A single Praxen scan typically takes 8–15 min of wallclock on a coding agent. Larger codebases (e.g. monorepos scoped to a single subdir like sweepai/) sit at the high end. If a scan stops emitting output for ~10 min, treat as a stall and retry — the chunked Edit-per-finding + text-heartbeat protocol used here helps.”

Sanity verdict per target

# Target Run vs baseline Verdict
1 aider 12 = 12 · RAISE 1.45 = 1.45 (exact) ✓ in-band
2 autogen-code-executor +2 findings · RAISE −0.30 ✓ in-band (edge)
3 deepagents-cli +1 finding · RAISE −0.15 ✓ in-band
4 devika +3 findings · RAISE 0.45 = 0.45 (exact) ✓ in-band
5 finbot 16 = 16 · RAISE 0.45 = 0.45 (exact) ✓ in-band (exact)
6 helperbot +1 finding · RAISE 0.45 = 0.45 (exact) ✓ in-band
7 langchain-sql 12 = 12 · RAISE −0.10 ✓ in-band
8 openai-customer-service 13 = 13 · RAISE −0.30 ✓ in-band (edge)
9 openhands 10 = 10 · RAISE −0.85 ⚠ defensible divergence
10 sweep +3 findings · RAISE −0.50 ✓ in-band (boundary)
11 yaah 10 = 10 · RAISE −0.60 ⚠ defensible divergence

Patterns surfaced (worth noting before release)

  1. Critical themes are stable. All 11 targets reproduce their dominant Critical themes vs baseline. The hard gate holds across the suite.
  2. A consistent “tighter calibration” drift on lower-maturity scoring. Three targets (openhands, sweep, yaah) trended down by 0.50–0.85 RAISE compared to v0.7.0 baseline, driven by stricter per-category scoring under Phase-2 audit-table calibration — particularly BKB→0 and BRT→0 when the example/code explicitly lacks input validation, adversarial fixtures, or approval-gate code paths. Not a regression — the substantive findings agree — but the baseline file (tests/baselines/v0.7.0-sequential/BASELINE.md) is now optimistic for these three targets. Worth re-baselining at v0.7.3 or v0.8.0.
  3. Remit-rule-count drift. Several targets extracted fewer remit rules than the baseline run did (aider: 18 vs 29; devika: 9 vs 28). RAISE and Critical-theme agreement is unaffected — both are downstream gates — but the rule-count is more LLM-variant than initially assumed. Worth a knowledge-card or a Step-6 calibration note in the skill at some point.
  4. Stream-watchdog stalls were the dominant wallclock cost. Roughly 39% of subagent attempts stalled today. The working protocol — Bash primer (no preamble) + skeleton-first + chunked Edit-per-finding + render-every-2 + one-line text heartbeats — recovers reliably. Likely worth recording as a knowledge card for future suite runs.

Bottom line: suite passes the substantive gates (count neighborhood, Critical-theme continuity, render integrity). The two ⚠ RAISE-divergences on openhands and yaah are defensible re-derivations of genuinely-stricter scoring, not regressions of the engine.