Praxen Pre-Release Test Plan

This is Praxen’s regression test suite. It has three named tiers; pick the one that matches what you’re doing (see Test tiers below). Before every dev → main release PR, run the Full Suite Run: all twelve targets, either via parallel subagent (4–8 concurrent, ~90 min end-to-end) or sequential foreground (~3–4 hours end-to-end). Then diff every target against the latest frozen baseline in baselines/ and against the per-target bands in this document (see What a release review looks like). Ad-hoc mid-development re-run reports are not kept here, because they regenerate and drift. Two kinds of run are committed: the named, version-pinned frozen baselines under baselines/ (the reference a release is graded against), and the named pre-release Full Suite Runs under runs/ (the evidence that a specific release candidate cleared the bar).

Directory contents

README.md — this file
remits/ — the Worker Remits developed for each test agent. Reusable; do not change between analyses.
baselines/ — frozen, committed runs. The current set is baselines/v1.1-claude48/ — the 12-target v1.0.2-claude48 set (Praxen 1.0.x skill under Anthropic Claude Opus 4.8, median-of-3 against the intent-level Worker Remits) with OWASP classification re-tagged under the corrected 1.1 knowledge bases; detection and every RAISE score are byte-identical, so the median-of-3 carries over (the prior baselines/v1.0.2-claude48/ and baselines/v0.7.7-claude48/ are now archival). It is the comparison point for the release review. (baselines/v0.7.7-sequential/ — the same skill on Opus 4.7 — baselines/v0.7.4-sequential/, and baselines/v0.7.0-sequential/ are retired and kept for diff archaeology; baselines/v0.4-parallel/ keeps the historical Phase-2 parallel-path gate record; earlier v0.2-sequential/ / v0.3-sequential/ / v0.6.3-sequential/ sets were retired in successive re-baselines.) Aggregate OWASP coverage across the suite — including per-target links into each baseline analysis report — is browsable at the live OWASP Coverage Report (regenerated by baselines/owasp_coverage.py, a stdlib-only utility). See baselines/README.md.
runs/ — committed pre-release Full Suite Runs, the evidence-of-validation for each release. Named by the release the run validated (e.g. runs/v0.7.3-prerelease/), each containing a SUITE_RUN.md verdict report (timing table, per-target sanity verdicts, patterns surfaced) plus every target’s findings JSON / HTML / TXT. Diff future Full Suite Runs against the latest entry here in addition to the active baseline — run-to-run drift is more sensitive than run-to-baseline drift. See runs/README.md.
fixtures/, render/ — the render.py/schema.py smoke harness (python3 tests/render/test_render.py): the canonical-JSON fixture (finbot.canonical.json), the committed golden render output (finbot.golden.html / finbot.golden.txt — byte-compared on every run; the test header comments say how to regenerate them when output changes intentionally), the entity-normalisation checks, the negative-case mutations, and a sweep over every committed baseline under baselines/ — each schema-2.0 baseline JSON must validate against schema.py; each post-relicense one must re-render byte-for-byte from its JSON and (from praxen_version 0.6.0 on) quote its tests/remits/<slug>.md verbatim. So a renderer change that silently desyncs a committed report, or a baseline whose rule_text drifts from its remit, fails CI.
CI runs tests/render/test_render.py + build.sh on every push and PR across Python 3.9 / 3.12 / 3.13 (.github/workflows/ci.yml); pushing a v* tag runs the suite, builds the zip, and cuts a GitHub release (.github/workflows/release.yml — it also checks the tag matches PRAXEN_SPEC.md’s version).

Test tiers

Three named tiers, escalating in scope and wallclock. Pick the one that matches what you’re doing.

Smoke harness — `python3 tests/render/test_render.py` (~30 s, runs in CI)

Renderer + schema validator + golden-byte checks. Sweeps every committed baseline JSON through schema.py, re-renders each post-relicense baseline byte-for-byte from its JSON, validates the canonical FinBot fixture’s golden HTML/TXT outputs, exercises entity normalisation + negative-case mutations, and gates every showcase example under examples/ the same way on schema validation and byte-identical re-render (its *-findings.json → committed *-analysis.html / .txt). CI runs this on every push and PR across Python 3.9 / 3.12 / 3.13 (.github/workflows/ci.yml). A failure here means a renderer change silently desynced a committed report or example, or a baseline’s rule_text drifted from its remit. No Praxen analysis runs — this tier doesn’t scan any agents.

Single-target scan (~10 min)

One Praxen analysis against one target, end-to-end. The fastest way to sanity-check a skill edit. HelperBot has the suite’s most stable score and is the fastest to scan; FinBot is the canonical “deliberately vulnerable” anchor. See How to run a single-target scan below for the procedure.

Full Suite Run (~90 min parallel subagent, ~3–4 hr sequential foreground)

All twelve targets against the release candidate, with timing data captured and every target’s findings preserved. Mandatory before every dev → main release PR; recommended before any substantial change to SKILL.md, schema.py, or the knowledge base. Produces a verdict report (timing table + per-target sanity-vs-baseline + patterns surfaced) committed as a named runs/<release>-prerelease/ directory. See Full Suite Run protocol below for the two invocation paths (parallel subagent, sequential foreground).

Calibration posture

The skill scores conservatively, in both directions: a control that is present in the repo but defeated — off by default, trivially bypassable, or living in a framework the agent never invokes — earns its RAISE category nothing; a control that is operative on the agent’s path — even a human-in-the-loop confirmation, even an inherited framework default the agent doesn’t disable — earns the category Partial (2) or Established (3), even when there are findings about its gaps. Gaps are findings, not reasons to zero a category. Most targets here land in Absent (0) to Ad hoc (1) per category; the well-engineered ones (OpenHands) reach Established (3) in the categories where their controls are real.

Blind-run scoring carries inherent variance — the same target re-analyzed from scratch typically lands within ±0.3–0.5 of its previous weighted score, and the severity counts swing by ±2–3 per bucket (judgment differs on borderline 0↔1 / 2↔3 category calls, and on Critical↔High classification). Most targets are tighter than that (std ≤ 0.15 over the three runs behind the current baseline); a minority with operative-but-imperfect controls swing more — uagents is the widest (σ ≈ 0.25). The per-target bands below are wide for that reason and should be read as a gross-regression check, not a tolerance: the frozen baseline in baselines/v1.1-claude48/ is the precise comparison point, and the theme coverage (no Critical theme dropped) is the hard gate.

Gate vs. advisory (formalized 2026-07-16, #48 item 4). The release gate is theme coverage: no material finding dropped, no Critical theme missed, structural checks green. The weighted RAISE score is advisory — reported, compared against its band, and investigated when it lands far outside, but a score wobble alone neither fails nor passes a release. This makes operational what this file has long said descriptively (“the numbers wobble; the themes shouldn’t”), and it matches the tool’s public framing: an expert-assisted review, not a deterministic gate. A score that lands well outside its band with no Praxen change to explain it, a dropped material finding, or a missed critical theme, is a regression; a single in-band wobble is not. (For the operator-facing version of this, see Understanding Run-to-Run Variability.)
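The gate/advisory split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Praxen's own tooling: the TargetResult shape, the review_target name, and the idea of matching Critical themes by plain string labels are all hypothetical (in practice theme matching is a reviewer judgement made on theme and rule text).

```python
from dataclasses import dataclass, field


@dataclass
class TargetResult:
    """Hypothetical per-target summary; not the real findings-JSON schema."""
    weighted: float                                     # weighted RAISE score (0-5)
    critical_themes: set = field(default_factory=set)   # reviewer-assigned theme labels


def review_target(baseline, run, band):
    """Return (gate_passed, dropped_themes, advisory_notes) for one target.

    Gate: every Critical theme in the baseline is still covered by the run.
    Advisory: the weighted score is only compared against its band and flagged.
    """
    low, high = band
    dropped = sorted(baseline.critical_themes - run.critical_themes)
    notes = []
    if not low <= run.weighted <= high:
        notes.append(
            f"weighted {run.weighted:.2f} outside band {low:.2f}-{high:.2f}: investigate"
        )
    return (not dropped, dropped, notes)


if __name__ == "__main__":
    # Illustrative values only: the score is out of band, but no theme is dropped.
    baseline = TargetResult(0.75, {"theme-a", "theme-b"})
    run = TargetResult(0.95, {"theme-a", "theme-b", "theme-c"})
    print(review_target(baseline, run, (0.60, 0.90)))
```

In the usage example the gate passes because both baseline themes are still covered; the out-of-band score only produces an advisory note to investigate, which mirrors the rule that a score wobble alone neither fails nor passes a release.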

Re-baselining (multi-run characterization)

A baseline freeze should not rest on a single run — parts of the analysis are LLM judgement, so one run’s weighted score carries the variance above, and a single full-suite snapshot can catch several targets at simultaneous high/low draws and mis-state where they sit. When re-baselining (a deliberate calibration change, a schema change, or a reference-model change), characterize each target over three independent runs on identical inputs (skill, sources, remits held constant), then:

Freeze the median run per target as the committed exemplar (one real, unedited run — keeps the byte-render gate honest).
Set each band from the 3-run mean ± observed spread, not from any single run. Distinguish stable-but-offset targets (low std, mean outside the old band → move the band) from noisy-but-centred ones (mean in-band, high std → widen the band).
Diff by theme and rule text, not by R-NN id — rule IDs are re-derived each run and are not stable across runs/models.
Name the new set baselines/vX.Y-<variant>/ (e.g. v0.7.7-claude48 for the Opus 4.8 re-freeze), retire the previous set in place, and update the pointer + bands here. The v1.0.2-claude48 BASELINE.md is the worked example of a median-of-3 characterization (the current v1.1-claude48 set carries that freeze over unchanged via re-tag — see its own BASELINE.md for the re-tag method).
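The freeze-and-band steps above can be sketched as follows. Several things here are assumptions rather than anything taken from Praxen's tooling: the function name, the 0.15 cut-off for "low std" (borrowed from the std ≤ 0.15 remark under Calibration posture), reading "observed spread" as the max-minus-min range, and the "keep band" / "review by hand" outcomes for the two cases the list does not name.

```python
import statistics

LOW_STD = 0.15  # assumption: cut-off between "stable" and "noisy"


def characterize(scores, old_band):
    """Summarise three weighted RAISE scores from independent runs of one target."""
    if len(scores) != 3:
        raise ValueError("expected exactly three runs")
    median = statistics.median(scores)   # with three values this is one of the runs
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    spread = max(scores) - min(scores)   # assumption: "observed spread" = range
    in_band = old_band[0] <= mean <= old_band[1]
    stable = std <= LOW_STD
    if stable and not in_band:
        action = "move band"             # stable-but-offset
    elif not stable and in_band:
        action = "widen band"            # noisy-but-centred
    elif stable:
        action = "keep band"             # assumption: case not named in the list
    else:
        action = "review by hand"        # assumption: case not named in the list
    return {
        "exemplar_index": scores.index(median),   # the run to freeze, unedited
        "mean": mean,
        "std": std,
        "proposed_band": (mean - spread, mean + spread),
        "action": action,
    }
```

With illustrative scores of 2.0, 1.7 and 2.3 against an old band of 1.85–2.15, the mean sits in-band but the std is about 0.3, so the sketch reports "widen band" and picks the 2.0 run as the exemplar.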

Pre-release checklist

Build the candidate release zip: ./build.sh from the repo root; run the render smoke harness (python3 tests/render/test_render.py). CI already runs both on the PR across Python 3.9/3.12/3.13 — confirm it’s green.
Verify the plugin-marketplace install path. From a terminal, run claude plugin marketplace add open-agent-ai-security/praxen, then claude plugin install praxen@open-agent-ai-security, then claude plugin list. Confirm three things: the marketplace parses with no schema error; praxen installs at the release version; and the behavior-verifier skill is invokable afterwards. (The in-session /plugin ... slash commands do the same thing, but the claude plugin subcommand is non-interactive and argument-driven, so it runs the same way on every interface — use it here for a deterministic check.) As a fast pre-flight, claude plugin validate . from the repo root checks .claude-plugin/marketplace.json and plugin.json against the schema without touching your installed plugins. This is a manual check on purpose — those manifests are validated by Claude Code’s own marketplace schema, which no Python test in this repo exercises. A regression here doesn’t fail CI; it just silently breaks install — exactly how the bare-"." plugin source shipped in v0.6.1 and broke marketplace install for every tagged release until 0.6.2. Run it before every revision.
For each of the twelve targets below, either:
- Scan the already-built zip against the target workspace (confirms the distributed zip works), or
- Scan from the repo’s skills/ directly (confirms skill edits land correctly).
Full compare against the baseline. For every target, diff the new findings JSON against baselines/<latest>/<target>/…-findings-*.json (currently v1.1-claude48): weighted RAISE within ±0.3–0.5 of the baseline and inside the per-target band below; severity counts in the same neighbourhood; dominant pattern / themes still covered (no Critical theme dropped) — this last one is the hard gate. (See What a release review looks like for the per-report checks.)
Any regression — a material finding dropped, a critical theme missed, a weighted score well outside the band, a target that drifts far from the baseline with no Praxen change to explain it — blocks the release. An in-band shift, or a deliberate calibration/detection change that moves the numbers, is fine: note it in the release notes and re-freeze a new baselines/<next-version>-<variant>/ via the multi-run characterization above (and update the pointer + the affected bands below).
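The numeric part of the compare step can be sketched as below. Scores and severity counts are passed in as plain values because the findings-JSON field names are not given in this document. The function name, the input shape, and the default tolerances (0.5 on the weighted score and 3 per severity bucket, taken from the upper ends of the variance figures under Calibration posture) are assumptions. Theme coverage, the hard gate, is a reviewer judgement and is deliberately left out.

```python
SEVERITIES = ("Critical", "High", "Medium", "Low")


def numeric_drift(baseline, run, band, score_tol=0.5, count_tol=3):
    """Return human-readable flags; an empty list means "same neighbourhood".

    baseline / run: dicts like {"weighted": 0.75, "counts": {"Critical": 4, ...}}
    (a hypothetical shape, not the real findings schema).
    """
    flags = []
    delta = run["weighted"] - baseline["weighted"]
    if abs(delta) > score_tol:
        flags.append(f"weighted moved {delta:+.2f} from baseline")
    if not band[0] <= run["weighted"] <= band[1]:
        flags.append(f"weighted {run['weighted']:.2f} outside band")
    for sev in SEVERITIES:
        diff = run["counts"].get(sev, 0) - baseline["counts"].get(sev, 0)
        if abs(diff) > count_tol:
            flags.append(f"{sev} count moved {diff:+d}")
    return flags
```

A non-empty result is a prompt to look closer, not a verdict: whether a flagged shift is a regression still depends on whether a Praxen change explains it.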

How to run a single-target scan

Scan the upstream source, not examples/. The two scan inputs are a remit (remits/<target>.md) and the target’s upstream source cloned from the Source: URL below. The repo’s examples/ directory holds finished demo reports, not agent source — pointing a scan there scans Praxen’s own output, not the agent.

For each target:

Clone or re-extract the target repository (URLs below).
Stage the workspace scope — the paths inside the target repo that constitute the agent code (notes below for each target).
Create an analysis working directory, e.g., /tmp/<target>_scan/reports/.
Copy the corresponding remit from remits/ into the analysis working directory as WORKER_REMIT.md.
Open a Claude Code session with the working directory as CWD.
Instruct Claude Code to read skills/behavior-verifier/SKILL.md and analyze the workspace path.
Review <target>-analysis-<timestamp>.html in reports/.
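The two filesystem steps in the procedure above (create the working directory, copy the remit in as WORKER_REMIT.md) can be scripted. This is a minimal sketch; stage_scan and its parameters are hypothetical names, and cloning the target and driving the Claude Code session remain manual.

```python
import shutil
from pathlib import Path


def stage_scan(target, remits_dir, scan_root):
    """Create <scan_root>/<target>_scan/reports/ and copy the remit in as WORKER_REMIT.md."""
    remit = Path(remits_dir) / f"{target}.md"
    if not remit.is_file():
        raise FileNotFoundError(f"no remit for target {target!r}: {remit}")
    reports = Path(scan_root) / f"{target}_scan" / "reports"
    reports.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(remit, reports / "WORKER_REMIT.md")
    return reports
```

For example, stage_scan("helperbot", "tests/remits", "/tmp") would prepare /tmp/helperbot_scan/reports/ when run from the repo root, assuming the remit lives at tests/remits/helperbot.md.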

On Codex: the same single-target flow works — link the skill per docs/installation.md Option B and invoke $praxen:behavior-verifier against the workspace (same remit + source inputs). Quickstart has a worked codex exec example.

Full Suite Run protocol

The Full Suite Run validates a release candidate against all twelve targets. Typical wallclock: 8–15 min per scan, ~2 hours of model time across the 12 scans. End-to-end: ~90 min via parallel subagent (4–8 concurrent), ~3–4 hours via sequential foreground.

When to run

Mandatory before every dev → main release PR.
Recommended before any substantial change to SKILL.md, schema.py, or the knowledge base.
Not needed for renderer-only changes (the smoke harness covers those), documentation changes, or refactors that don’t touch the analysis path.

How to invoke

Two paths, both supported.

Parallel subagent (canonical for full-suite throughput) — launch each scan as a background general-purpose agent. The scans are independent and run concurrently, capped at 4–8 at a time (more has overloaded the environment and tripped no-progress watchdogs in past runs). Write each run’s outputs to a durable path, not /tmp — a scan’s /tmp working directory can be reaped mid-run; a gitignored local/<run-name>/<target>-out/ is the canonical pattern. End-to-end ~90 min for the 12 scans on a quiet system.
Sequential foreground — open one Claude Code session per target and drive each scan to completion before moving to the next. Slowest in wallclock (~3–4 hr end-to-end) but the most observable; useful for debugging a single target, validating a SKILL change one scan at a time, or runs where you want to watch each scan live. No watchdog concerns.

A single scan on its own (one target, one session) needs neither precaution: there is nothing to cap, and a /tmp working directory is fine.
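The concurrency cap on the parallel path can be illustrated with a bounded worker pool. This is only a sketch of the capping idea: run_suite is a hypothetical name, and run_scan is a hypothetical callable standing in for "launch one scan and block until it finishes"; how a subagent is actually launched is outside what this code shows.

```python
from concurrent.futures import ThreadPoolExecutor


def run_suite(targets, run_scan, max_concurrent=4):
    """Run run_scan(target) for every target, at most max_concurrent at a time.

    run_scan is a hypothetical stand-in for launching one scan and waiting for it.
    """
    if not 1 <= max_concurrent <= 8:
        raise ValueError("keep concurrency at 8 or below")
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(run_scan, targets))   # preserves target order
    return dict(zip(targets, results))
```

The pool size is the whole mechanism: scans beyond the cap wait in the queue until a slot frees up, rather than all twelve starting at once.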

For each target in either path: working directory under a durable, gitignored path (local/<run-name>/<target>-out/, never /tmp); CWD = that directory; copy the remit from tests/remits/<target>.md to WORKER_REMIT.md; clone or extract the target source to a sibling path; instruct the session (or subagent) to read skills/behavior-verifier/SKILL.md and analyse the workspace. When the four canonical outputs (*-draft-*.md working artifact + *-findings-*.json, *-analysis-*.html, *-analysis-*.txt deliverables) are present and both manifest_to_findings.py and render.py exited clean, the scan is done.
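The "four canonical outputs are present" half of the done-check can be sketched with glob patterns. The patterns are the ones quoted above; the function name is hypothetical, and the sketch does not verify that manifest_to_findings.py and render.py exited clean, which still has to be confirmed from the scan itself.

```python
from pathlib import Path

# Glob patterns for the four canonical outputs named above.
CANONICAL_OUTPUTS = (
    "*-draft-*.md",
    "*-findings-*.json",
    "*-analysis-*.html",
    "*-analysis-*.txt",
)


def missing_outputs(out_dir):
    """Return the patterns that match no file in out_dir (empty list = all present)."""
    out = Path(out_dir)
    return [pattern for pattern in CANONICAL_OUTPUTS if not any(out.glob(pattern))]
```

An empty return value means every pattern matched at least one file in the target's output directory.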

Subagent watchdog — where stalls can still happen

Subagent runs have a no-progress watchdog (~600 s) that kills the run when no tool call fires inside the window. Step 10 is no longer a stall site (it’s a single deterministic script invocation), but Step 9.9’s manifest emission is — composing the full manifest internally before the first Write fires can exceed the budget on a large scan. The SKILL’s Step 9.9 prescribes the mitigation: write a skeleton, then Edit-append each rule and each finding with a one-line text heartbeat between Edits. If you observe stalls in a Full Suite Run, check whether the worker followed that discipline before treating it as an infrastructure issue.

Verdict report (`SUITE_RUN.md`)

Each Full Suite Run produces a SUITE_RUN.md in the run’s output directory. The expected shape:

Header: STATUS line (PASS / FAIL / blocked-on-flag), one-line summary, the release this run validated.
Inputs: skill state under test, tolerance spec reference, source map (target → workspace path + scope).
Per-target table: # | target | baseline (n · C/H/M/L/I · RAISE) | run (same) | duration | verdict.
Detailed notes per target: duration, severity-count delta, RAISE delta, per-category RAISE breakdown, dominant Critical themes vs baseline, remit-coverage stat counts, any sanity flags.
Suite verdict & timing summary (closing block): per-scan timing table (range / median / mean), sanity table (Δ count, Δ RAISE, verdict per target), patterns surfaced (rule-count drift, calibration drift, stalls), bottom-line judgment.
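Two of the report's building blocks, the "n · C/H/M/L/I · RAISE" cell and the range / median / mean timing summary, are simple enough to sketch. The function names and the counts-dict shape are hypothetical, and reading n as the total finding count (and I as an informational bucket) is an assumption.

```python
import statistics

SEVERITY_ORDER = ("C", "H", "M", "L", "I")


def summary_cell(counts, raise_score):
    """Format one 'n · C/H/M/L/I · RAISE' cell from a counts dict keyed by C/H/M/L/I."""
    ordered = [counts.get(key, 0) for key in SEVERITY_ORDER]
    return f"{sum(ordered)} · {'/'.join(map(str, ordered))} · {raise_score:.2f}"


def timing_summary(minutes):
    """Return the range / median / mean of per-scan durations in minutes."""
    return {
        "range": (min(minutes), max(minutes)),
        "median": statistics.median(minutes),
        "mean": statistics.mean(minutes),
    }
```

For instance, summary_cell({"C": 4, "H": 3, "M": 3, "L": 1}, 0.75) yields "11 · 4/3/3/1/0 · 0.75", using HelperBot's approximate baseline counts purely as illustrative input.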

The committed copy from the v0.7.3 prerelease run — runs/v0.7.3-prerelease/SUITE_RUN.md — is the reference template.

After the run

Commit the run as a named tests/runs/<release>-prerelease/ directory: SUITE_RUN.md at the root, plus one <target>-out/ subdirectory per target containing the three deliverables (*-findings.json, *-analysis.html, *-analysis.txt). See runs/README.md for the layout convention.

Test targets

Ordered from simplest (intentionally-vulnerable CTF) to most complex (active production agent). Run them in order for a release; the earlier analyses catch skill-execution issues fast, the later analyses exercise subtle detection.

1. HelperBot — DVAA training agent

Remit: remits/helperbot.md
Source: https://github.com/opena2a-org/damn-vulnerable-ai-agent (HelperBot persona in src/core/agents.js)
Scope: a minimal workspace containing agents.js, vulnerabilities.js, index.js, and the LLM client files. The HelperBot definition is in agents.js lines ~43-78.
Notes: Intentionally vulnerable training agent from the DVAA platform. Smaller and simpler than FinBot, so a good quick smoke test. Exercises common findings (input validation, system-prompt API-key embed, write_file without path guard, context manipulation, no audit logging, no rate limit). The most stable weighted score in the suite (dead-flat 0.75 across all three baseline runs).
Baseline expectation: ≈ 4 Critical / 3 High / 3 Medium / 1 Low (frozen median), weighted 0.75 / 5.0, band 0.60–0.90 (Absent). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

2. FinBot — OWASP Agentic AI CTF

Remit: remits/finbot.md
Source: https://github.com/OWASP-ASI/finbot-ctf-demo
Scope: full repo root (the agent code is small: a Flask + SQLAlchemy app)
Notes: Deliberately vulnerable CTF agent. Autonomous invoice processor. Praxen should catch runtime-mutable goal overrides, unauthenticated admin endpoints, fraud-detection toggles, business-context bypass of manual-review thresholds, invoice-description injection into LLM context, and the goal-hijack → autonomous-payment compound chain. The canonical “deliberately insecure agent” test: if Praxen fails to produce a dense Critical cluster here, something is broken.
Baseline expectation: ≈ 5 Critical / 4 High / 6 Medium (frozen median), weighted 0.90 / 5.0, band 0.75–1.05 (Absent). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

3. CraftBot — self-hosted personal AI agent

Remit: remits/craftbot.md
Source: https://github.com/CraftOS-dev/CraftBot
Scope: the self-hosted agent core (task execution, the scheduler / proactive-run loop, the local memory store, the “Living UI” project import/run/process manager, the operator-enabled integrations / skills / MCP surface, and the local control-plane HTTP / backend API).
Notes: A single-operator, BYOK “remote employee” personal agent (operator-selected LLM) that acts solely for the operator who runs it, never as a service exposed to untrusted third parties. Praxen should exercise the operator-owned-secret and host-execution surfaces: arbitrary shell/code execution carrying the operator’s full environment with no isolation or approval gate; imported/marketplace Living-UI projects and third-party MCP servers running unsandboxed; the local control-plane HTTP surface; inbound-messaging auto-reply that treats an unverified sender as the operator; and plaintext BYOK secret storage. The densest of the Phase-1 additions.
Baseline expectation: ≈ 5 Critical / 5 High / 4 Medium (frozen median), weighted 1.15 / 5.0, band 1.00–1.45 (Ad hoc). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

4. OpenAI Agents SDK — Customer Service Example

Remit: remits/openai-customer-service.md
Source: https://github.com/openai/openai-agents-python (examples/customer_service/main.py + the agents SDK snapshot in src/agents/)
Scope: the customer_service example + enough of the SDK to reason about handoffs, guardrails, and tool approval.
Notes: Demonstrates the “framework ships guardrails; example uses none” pattern. Praxen should find that the SDK has InputGuardrail, OutputGuardrail, needs_approval, is_enabled, and input_filter, and that examples/customer_service/main.py wires in zero of them. It should also flag the on_seat_booking_handoff fabricating a flight number via random.randint(). The weighted score is judgment-sensitive here: how much credit the SDK’s default tracing and strict-schema tool args earn toward the example agent’s score is a real swing between blind runs, so the finding set (guardrails not used, audit log absent, raw-model-arg mutations) is the stable signal.
Baseline expectation: ≈ 2 Critical / 4 High / 4 Medium (frozen median), weighted 1.60 / 5.0, band 1.45–1.75 (Ad hoc). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

5. Agentforce Help Agent (salesforce-help-agent-accelerator)

Remit: remits/salesforce-help-agent-accelerator.md
Source: https://github.com/salesforce/help-agent-accelerator
Scope: the two shipped components analyzed together: the haaHelpAgent Agentforce agent (topic routing + knowledge retrieval via AnswerQuestionsWithKnowledge; the RAISE subject) and the haaInlineEnhancedChat LWC/JS UI host (Embedded Messaging bootstrap, session state machine, localStorage) where it carries security implications (session handling, CORS).
Notes: A platform-managed Agentforce customer-service chatbot: a constrained, single-tool target (knowledge Q&A only; no DML, shell, filesystem, or external HTTP). Praxen should surface the platform-boundary risks rather than code-execution ones: indirect prompt injection via retrieved Knowledge-article content, system-prompt / topic-instruction disclosure, localStorage session-state manipulation in third-party embeds, and CORS / Trusted-Domains misconfiguration. It must credit the platform’s managed guardrails as real controls; a score deep in the Absent band would be over-corrected on a deliberately narrow agent.
Baseline expectation: ≈ 1 Critical / 3 High / 5 Medium / 1 Low (frozen median), weighted 1.70 / 5.0, band 1.55–1.85 (Ad hoc). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

6. Aider — interactive pair programming agent

Remit: remits/aider.md
Source: https://github.com/Aider-AI/aider
Scope: aider/*.py (top-level) + aider/coders/.
Notes: Mature, production-quality agent with a developer-in-the-loop safety model. The findings are subtle: # ai! comment auto-execution in --watch-files, abs_root_path() has no repo-containment check, /read-only and /add accept absolute and ~ paths, no secret scanner, auto-commit/auto-lint after every edit with no diff-accept prompt, --no-verify commits. Two-sided test: Praxen must produce actionable findings and must register the confirm-prompt / human-in-the-loop model as a real (if bypassable) control. A weighted score in the Absent band (< 1.0) for this target means the scoring is over-corrected and treating a legitimate safety design as theater. Also a Jinja2 evidence-block test: Aider’s prompt templates contain template-delimiter sequences, and render.py neutralises them so they can’t collide with template placeholders.
Baseline expectation: ≈ 1 Critical / 2 High / 3 Medium (frozen median), weighted 2.00 / 5.0, band 1.85–2.15 (Partial; dead-flat across all three baseline runs). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

7. AutoGen Code Executor

Remit: remits/autogen-code-executor.md
Source: https://github.com/microsoft/autogen (python/packages/autogen-ext/src/autogen_ext/code_executors/ + python/packages/autogen-core/src/autogen_core/code_executor/)
Scope: the 5 executor implementations (local, docker, docker_jupyter, jupyter, azure) + the core abstraction.
Notes: “Defaults undermine sandbox” pattern. Praxen should find: LocalCommandLineCodeExecutor uses warnings.warn instead of an approval gate and copies the parent’s full os.environ into the subprocess; create_default_code_executor() silently downgrades Docker→Local on a UserWarning; Docker containers default to no user=/read_only=/mem_limit=/cap_drop=/network isolation; Jupyter timeouts are soft; no per-execution audit log.
Baseline expectation: ≈ 1 Critical / 5 High / 5 Medium (frozen median), weighted 2.00 / 5.0, band 1.85–2.15 (Partial). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

8. uAgents — Fetch.ai agent framework runtime

Remit: remits/uagents.md
Source: https://github.com/fetchai/uAgents
Scope: the framework runtime: the Python uagents + uagents-core packages (cryptographic identity/signing, wallet/ledger client, ASGI inbound server, Almanac registration + resolver, typed message dispatch, key-value storage). Evaluates the runtime’s default posture handed to every deployed agent, not any single deployed agent.
Notes: A model-agnostic framework runtime target. The security question is what default posture the framework hands each agent built on it. Praxen should reason about cryptographic key material at rest (agent/wallet private keys, seed phrases), inbound-envelope authentication and replay protection, and the default network exposure of the agent HTTP server and its inspector / administrative endpoints. Its real crypto-identity and signature-verification controls must register. The widest run-to-run variance in the suite (σ 0.245), so read its band as a gross-regression check, not a tolerance.
Baseline expectation: ≈ 1 Critical / 2 High / 5 Medium (frozen median), weighted 2.00 / 5.0, band 1.70–2.30 (Partial). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

9. yaah — agent-config harness (MCP-path coverage, “controls present” end)

Remit: remits/yaah.md
Source: https://github.com/dirien/yet-another-agent-harness
Scope: the harness itself (cmd/yaah, pkg/{harness,hooks,mcpserver,mcp,session,generator,schema}) plus the root .mcp.json, .claude/settings.json, go.mod, go.sum, AGENTS.md. Exclude .claude/skills/*/references/examples/, website/, qa/ except where a finding cites them.
Notes: The suite’s clearest “controls present, score honestly” case (alongside OpenHands), and one of two MCP-coverage targets. yaah is a Go CLI that generates config for four coding agents and ships a built-in security toolset: a deterministic command-guard hook (blocks rm -rf /, force-push to main, DROP TABLE, mkfs, raw disk writes on every Bash call), a 13-pattern secret scanner on every edit, a structured per-session audit log, a built-in yaah serve MCP server with clean tool descriptions, exact-pinned Go deps. A healthy run must exercise SKILL.md Step 6 end-to-end on its .mcp.json (and the mcpServers block in .claude/settings.json), following KB_MCP_SECURITY.md → checklist → mcp-tagged findings, and must credit the operative controls: Manage Your Supply Chain and Monitor Continuously at Established (3) (the go.mod/go.sum pins; the real session audit log), Implement Zero Trust / Balance Your Knowledge Base at Partial (2) (the command-guard + secret-scanner run on the agent’s path), and the built-in MCP server’s clean descriptions registered as a positive, not a finding. A weighted score in the Absent band (< 1.0) here, or zeroing the categories the hooks/audit-log cover, means the scoring is over-corrected.
Headline finding (the one High): pkg/generator/hookmap.go leaves PreToolUse/PostToolUse blank for the Codex CLI target, so yaah generate --agent codex ships a config with none of the advertised hooks. This is a policy-implementation divergence the run must catch from reading hookmap.go.
Other expected findings: context7 MCP server launched via unpinned npx -y @context7/mcp (silent-update vector); MCP tool calls fall outside the PreToolUse/PostToolUse hooks → not in the session log and ungated; no tool-poisoning check / output sanitization on the third-party servers; auto-managed AGENTS.md (with CLAUDE.md/GEMINI.md symlinks) is a session-loaded, regenerable surface; no SECURITY.md.
Baseline expectation: ≈ 0 Critical / 4 High / 4 Medium (frozen median), weighted 2.30 / 5.0, band 2.15–2.45 (Partial). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

10. OpenHands — autonomous software engineering platform

Remit: remits/openhands.md
Source: https://github.com/All-Hands-AI/OpenHands
Scope: the openhands/ core as it stands today: app_server/ (the V1 control plane) and server/, plus config.template.toml and docker-compose.yml. The agentic core (controller/ / runtime/ / llm/ / mcp/ and the agent-event loop) has been extracted to the separate openhands-sdk / agent-server packages and is out of this source snapshot. Exclude enterprise/, frontend/, kind/.
Notes: The suite’s “mature agent scores honestly” anchor. The current openhands/ repo is the V1 app-server control plane; the agentic core has moved to separate packages, so several strong remit clauses (sandbox path-escape rejection, tool-arg clamping, step caps, commit-content scanning) legitimately come back Enforcement-Not-Possible from this source snapshot. Praxen should still find the control-plane gaps: the OSS app server registers no auth middleware, so the whole V1 API, including the secrets endpoint that exposes stored git tokens, is unauthenticated by default; CORS falls open (allows any origin) when no origins are configured; the host-process runtime backend runs the agent-server unisolated; skills / micro-agents are loaded into agent context with no content-trust check; no durable app-server action log. Its real operative controls (the sandboxed runtime, OAuth-scoped integrations, the structured session record) must still register: Limit Your Domain and Manage Your Supply Chain at Established (3). A run where every category came back ≤ 1 means the scoring is over-corrected.
Baseline expectation: ≈ 1 Critical / 3 High / 6 Medium / 1 Low (frozen median), weighted 2.30 / 5.0, band 2.15–2.45 (Partial). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

11. Deep Agents CLI — agent harness (MCP-path coverage)

Remit: remits/deepagents-cli.md
Source: https://github.com/langchain-ai/deepagents
Scope: the libs/cli package (deepagents-cli), now a deploy-only bundler, plus the config it reads and produces: libs/cli’s pyproject.toml and lockfile, any root .mcp.json, .github/, AGENTS.md. Exclude the libs/deepagents SDK internals, libs/acp, libs/evals, libs/partners, and examples/ except where a finding cites them.
Notes: An MCP-coverage target and, alongside OpenHands, a “controls present, score honestly” case. As of v0.1.x, deepagents-cli is a deploy-only bundler: it scaffolds a project (init), runs it locally (dev), and bundles then ships it to a managed deployment platform (deploy); the interactive coding-assistant surface moved to a separate deepagents-code package. A healthy run must exercise SKILL.md Step 6 “MCP Server Evaluation” on the root .mcp.json: load knowledge/KB_MCP_SECURITY.md, apply the minimum-bar checklist, emit { "kind": "mcp", … } findings. Praxen should find: the unauthenticated-API confirmation gate fires only when a frontend is configured, so an anonymous-auth deploy with no frontend ships an open API silently; deploy validates MCP transport type but not that http/sse endpoints use TLS; remote MCP servers are carried into the bundle with no version pin; the deploy tooling installs no logging. The “remote MCP URLs MUST use TLS” rule is a restored policy clause and now correctly surfaces the missing scheme check as a Critical. Its operative controls (bundling only the project’s declared sources, the anonymous-deploy confirmation prompt, committed pinned lockfiles) must register: a weighted score in the Absent band (< 1.0) for this target means the scoring is over-corrected.
Baseline expectation: ≈ 1 Critical / 0 High / 4 Medium (frozen median), weighted 2.70 / 5.0, band 2.55–2.85 (Partial). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

12. Hermes — multi-component LLM agent + desktop control layer

Remit: remits/hermes-agent-desktop.md
Source (Agent): https://github.com/NousResearch/hermes-agent (pinned at commit b1a2540; the repo is date-versioned off main with no semver release, nearest tag v2026.5.29, so pin the SHA for reproducibility)
Source (Desktop): https://github.com/fathah/hermes-desktop (v0.5.1, commit 4e8388a)
Scope: both workspaces analyzed together as one agent. Agent: the Python hermes-agent tree (gateway + platform adapters, tools/ incl. approval.py / mcp_tool.py / skills_guard.py / osv_check.py, hermes_logging.py). Desktop: the hermes-desktop Electron/TS app (ssh-tunnel.ts / ssh-remote.ts, analytics.ts, main/renderer split). Exclude vendored deps and build output.
Notes: The suite’s first multi-component target and its first real-world agent shipping a disclosed security posture (SECURITY.md). One combined Worker Remit names the in-process LLM Agent as the primary RAISE subject and gives the Desktop operator layer its own sub-headings within each section, so this is the regression anchor for the SKILL.md multi-component remit path. Step 6 MCP evaluation runs against a launch mechanism (env-filter + tool-description-poisoning scanner) rather than a static .mcp.json. Praxen should find: the default-local terminal backend runs LLM-emitted commands on the host while untrusted web/email/MCP surfaces are ingested (OS isolation is opt-in; the agent’s own SECURITY.md §2 names this exact gap); dangerous-shell auto-approval fails open in non-interactive / non-gateway / non-cron contexts; PostHog telemetry defaults on (opt-out) for official Desktop builds; SSH uses accept-new TOFU with no non-root remote-user guard. It must also credit the operative controls (fail-closed adapter/API auth, the un-overridable credential blocklist (GHSA fix), exact-pinned deps + supply-chain CI, redacting rotating logs, the hardline command blocklist) so the score lands at the Partial→Established boundary, not Absent. A weighted score below the Partial floor here means the scoring over-corrected on the disclosed default-isolation gap and ignored the real controls. Judgment-sensitive: whether the default-isolation gap is scored Critical or High is the one call that moves the weighted score (2.75 Partial ↔ 3.15 Established across the three baseline runs); the finding set is stable.
Baseline expectation: ≈ 0 Critical / 1 High / 4 Medium (frozen median), weighted 2.85 / 5.0, band 2.70–3.15 (Partial, at the Partial→Established boundary). Frozen at baselines/v1.0.2-claude48/; carried unchanged (scores byte-identical) into the current v1.1-claude48 set.

What a release review looks like

The release review is a full comparison: run all twelve targets and diff each against the latest frozen baseline, baselines/v1.1-claude48/.

Compare against the baseline (do this first)

Dominant pattern / themes still covered — the hard gate. The numbers wobble; the themes shouldn’t. A target that drops a material finding or misses a Critical theme is a regression, regardless of where the weighted score lands.
Weighted RAISE within ±0.3–0.5 of the baseline number, and inside the per-target band above — advisory (#48 item 4): a breach prompts investigation, not automatic failure; an in-band score excuses nothing if a theme dropped.
Severity counts in the same neighbourhood — advisory. Small drifts and Critical↔High reclassifications are normal blind-run variance.
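The three checks above can be sketched as a small comparison function. This is a rough illustration, not part of the Praxen tooling: the Baseline class, review_target, and the theme labels are hypothetical, and the 0.5 tolerance is just the upper end of the range quoted above. The point it shows is the asymmetry: a dropped theme is a regression on its own, while score drift only raises advisories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    weighted: float              # frozen median weighted RAISE score
    band: tuple[float, float]    # per-target band (low, high)
    themes: frozenset[str]       # hypothetical labels for the material findings


def review_target(baseline: Baseline, run_weighted: float, run_themes: set[str],
                  tolerance: float = 0.5) -> dict:
    """Apply the hard gate (themes) and the advisory checks (score) to one target."""
    missing = sorted(baseline.themes - set(run_themes))
    advisories = []
    low, high = baseline.band
    if not low <= run_weighted <= high:
        advisories.append(f"weighted {run_weighted:.2f} outside band {low:.2f}-{high:.2f}")
    if abs(run_weighted - baseline.weighted) > tolerance:
        advisories.append(f"weighted {run_weighted:.2f} drifted more than {tolerance} from baseline")
    # An in-band score excuses nothing: only the theme check decides "regression".
    return {"regression": bool(missing), "missing_themes": missing, "advisories": advisories}


if __name__ == "__main__":
    # Numbers from the Deep Agents CLI entry; theme labels are made up for the example.
    deepagents = Baseline(
        weighted=2.70,
        band=(2.55, 2.85),
        themes=frozenset({"open-api-no-frontend", "mcp-no-tls-check",
                          "mcp-unpinned", "no-deploy-logging"}),
    )
    run = {"open-api-no-frontend", "mcp-unpinned", "no-deploy-logging"}
    print(review_target(deepagents, 2.72, run))
```

In the example run the score sits comfortably inside the band, yet the result is still a regression because the TLS theme is missing.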

Then, for each target, open the HTML report and check:

Structural correctness

All three output files landed (.html, .json, .txt). render.py (Step 11) exited 0 — if it did, the HTML is guaranteed marker-free and the JSON passed schema.py validation (footer/remit counts, anchor resolution, RAISE category set, weighted-overall sanity all checked).
The *-findings.json validates against skills/behavior-verifier/schema.py (python3 -c "import sys; sys.path.insert(0,'skills/behavior-verifier'); import schema, json; schema.validate(json.load(open(PATH)))"), and behavior_summary, the six raise_posture.categories (with rationales), the two intro_band summaries, and remit_coverage.rules are all populated.
Report renders without errors in a browser (static HTML, no external fetches); footer counts match the Findings Register.
Re-rendering the JSON with render.py reproduces the committed HTML byte-for-byte (the renderer is deterministic).
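Two of these structural checks are easy to script. The sketch below is a hedged illustration using only the standard library: missing_outputs and is_byte_identical are hypothetical helpers, the file names in the demo are invented, and it deliberately does not invoke render.py, whose command-line interface is not described here. Re-render by whatever means the skill documents, then compare the result against the committed HTML.

```python
import hashlib
import tempfile
from pathlib import Path

EXPECTED_SUFFIXES = (".html", ".json", ".txt")


def missing_outputs(run_dir: Path) -> list[str]:
    """Return each expected suffix that has no non-empty file in run_dir."""
    present = {
        p.suffix for p in run_dir.iterdir() if p.is_file() and p.stat().st_size > 0
    }
    return [suffix for suffix in EXPECTED_SUFFIXES if suffix not in present]


def is_byte_identical(committed: Path, rerendered: Path) -> bool:
    """Compare two files by the SHA-256 digest of their raw bytes."""
    def digest(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()
    return digest(committed) == digest(rerendered)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        # Hypothetical file names; only the suffixes matter to the check.
        (run_dir / "target-report.html").write_text("<html></html>")
        (run_dir / "target-findings.json").write_text("{}")
        print(missing_outputs(run_dir))  # the .txt output was never written
```

Comparing digests rather than parsed content is intentional: the requirement is byte-for-byte reproduction, so whitespace or attribute-order differences should count as a mismatch.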

Finding quality

The Behavior Summary narrative reads as diagnostic, not templated.
Every Critical / High finding has specific file:line evidence; recommended actions name the file and the change, not generic advice.
Finding tags carry the full OWASP category name (tags[].label = LLM01 — Prompt Injection, not LLM01); policy_rule_text quotes the exact remit text; policy_rule_ids references real R-NN rules from remit_coverage.
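The tag and rule-ID checks lend themselves to a quick lint over the findings JSON. The following is a sketch under assumptions: the per-finding shape (a tags list of objects with a label, and a policy_rule_ids list) follows the field names mentioned above, but the exact schema lives in schema.py and may differ. The label pattern (code, separator, category name) and the helper name finding_problems are hypothetical.

```python
import re

# Assumed label form: an OWASP code, a spaced em dash (U+2014), then the category name.
FULL_LABEL = re.compile(r"^[A-Z]+\d+ \u2014 \S.*$")


def finding_problems(finding: dict, known_rule_ids: set[str]) -> list[str]:
    """List label and rule-ID problems for a single finding."""
    problems = []
    for tag in finding.get("tags", []):
        label = tag.get("label", "")
        if not FULL_LABEL.match(label):
            problems.append(f"short or malformed tag label: {label!r}")
    for rule_id in finding.get("policy_rule_ids", []):
        # A rule ID is "real" only if remit_coverage actually lists it.
        if rule_id not in known_rule_ids:
            problems.append(f"unknown rule id: {rule_id}")
    return problems


if __name__ == "__main__":
    known = {"R-01", "R-02"}  # in practice, collected from remit_coverage
    finding = {"tags": [{"label": "LLM01"}], "policy_rule_ids": ["R-02", "R-17"]}
    for problem in finding_problems(finding, known):
        print(problem)
```

Whether policy_rule_text quotes the remit exactly still needs a comparison against the remit file itself, which this sketch leaves out.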

RAISE Maturity Posture section (end of report)

Weighted score is consistent with the baseline: inside the per-target band and within the advisory tolerance.
Maturity label matches the score (Absent / Ad hoc / Partial / Established / Strong / Exemplary).
Rubric table is present and unmodified.
No traffic-light coloring on category cards (uniform blue styling).
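The label-versus-score check can be expressed as a lookup over band boundaries. Treat the sketch below with caution: this section states only the Absent cut-off (< 1.0), so the remaining boundaries in ASSUMED_CUTS are placeholders chosen to be consistent with the scores quoted above (2.75 reads as Partial, 3.15 as Established). Replace them with the values from the rubric table before relying on the result.

```python
from bisect import bisect_right

LABELS = ("Absent", "Ad hoc", "Partial", "Established", "Strong", "Exemplary")

# ASSUMPTION: only the 1.0 boundary is stated in this document; the others are
# placeholders and must be checked against the rubric table in the report.
ASSUMED_CUTS = (1.0, 2.0, 3.0, 4.0, 5.0)


def expected_label(score: float, cuts: tuple[float, ...] = ASSUMED_CUTS) -> str:
    """Map a weighted score on the 0-5 scale to its maturity label."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the 0-5 scale")
    # bisect_right counts how many lower boundaries the score has reached.
    return LABELS[bisect_right(cuts, score)]


def label_matches(score: float, reported_label: str) -> bool:
    """True when the label printed in the report agrees with the score."""
    return expected_label(score) == reported_label


if __name__ == "__main__":
    print(label_matches(2.70, "Partial"))
    print(label_matches(3.15, "Partial"))
```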

Secrets discipline

No literal API keys, tokens, or passwords appear in the HTML or JSON; any credential is referenced as [REDACTED — pattern at file:line].
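A coarse scan can back up the manual read. The sketch below is an assumption-laden starting point rather than a complete detector: the three patterns are illustrative examples of common credential shapes, find_secrets is a hypothetical helper, and a clean result does not prove the absence of secrets. Values that begin with a bracketed redaction placeholder are skipped by the assignment pattern.

```python
import re

# Illustrative patterns only; extend them for the credential types your targets use.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    # key = value style assignments; a value starting with "[" (a redaction
    # placeholder) is not matched.
    "assigned_secret": re.compile(
        r"(?i)\b(?:api[_-]?key|token|password)\b\s*[:=]\s*[\"']?[^\s\"'\[]{8,}"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        "password = [REDACTED \u2014 pattern at app.py:12]",
        'api_key: "sk_live_abcdef123456"',
    ])
    print(find_secrets(sample))
```

Run it over both the HTML and the JSON for each target; any hit is a prompt to inspect the line, not an automatic failure.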

If any check fails, investigate before releasing. Finding-count shifts within baseline bands are expected; theme-level coverage regressions are not.

Notes on the test set composition

The twelve targets deliberately span a spectrum:

Intentionally vulnerable (FinBot, HelperBot) — calibration anchors. Findings here should be dense and unambiguous.
Enterprise / vendor SaaS agent (Agentforce) and self-hosted general assistant (CraftBot) — real shipped agents (Salesforce Knowledge-article RAG; a local build-your-own-tools assistant), added in 1.0.2 to broaden real-world coverage.
Framework runtime (uAgents) — evaluates a multi-agent framework’s default posture (identity/signing, Almanac registration, typed dispatch) rather than one deployed agent.
Framework + example pattern (OpenAI CS) — exercises the “guardrails shipped, not used” detection.
Framework defaults pattern (AutoGen, OpenHands) — exercises the “sandbox exists but defaults bypass it” detection.
Production agent, solo-maintainer territory (Aider) — exercises subtle and novel finding detection; the target most likely to produce disclosure-worthy output.
Production agent, well-funded team (OpenHands) — the ceiling of what well-engineered agents look like today. Establishes realistic maturity-scale interpretation.
Agent harnesses with a real MCP surface (Deep Agents CLI, yaah) — keep the MCP Server Evaluation path (.mcp.json discovery → KB_MCP_SECURITY.md → minimum-bar checklist → mcp-tagged findings) under regression, and exercise bidirectional calibration from both ends: Deep Agents CLI is “strong primitives, permissive defaults” (don’t over-credit); yaah is “controls genuinely operative” (don’t zero them).
Multi-component, real-world agent with a disclosed posture (Hermes Agent + Desktop) — the highest-maturity target and the regression anchor for the combined-remit path (one remit, two trust layers) and for scoring an agent that ships its own SECURITY.md: Praxen must surface the disclosed gaps as findings and credit the real controls, landing at the Partial→Established boundary rather than over-correcting in either direction.

A release that produces solid reports on all twelve has been validated across the full range of agent postures we’ve encountered.
