Praxen’s regression test suite. Three named tiers — pick the one that matches what you’re doing (see Test tiers below). Before every dev → main release PR, run the Full Suite Run — all twelve targets, either via parallel subagent (4–8 concurrent, ~90 min end-to-end) or sequential foreground (~3–4 hours end-to-end) — then diff every target against the latest frozen baseline in baselines/ and against the per-target bands in this document (see What a release review looks like). Ad-hoc mid-development re-run reports are not kept here — they regenerate and drift. The committed runs are: the named, version-pinned frozen baselines under baselines/ (the reference a release is graded against), and the named pre-release Full Suite Runs under runs/ (the evidence a specific release-candidate cleared the bar).
README.md — this file
remits/ — the Worker Remits developed for each test agent. Reusable; do not change between analyses.
baselines/ — frozen, committed runs. The current set is baselines/v0.7.7-claude48/ — all twelve targets on the Praxen v0.7.7 skill under Anthropic Claude Opus 4.8, frozen via a median-of-3 characterization against the intent-level Worker Remits. It is the comparison point for the release review. (baselines/v0.7.7-sequential/ — the same skill on Opus 4.7 — baselines/v0.7.4-sequential/, and baselines/v0.7.0-sequential/ are retired and kept for diff archaeology; baselines/v0.4-parallel/ keeps the historical Phase-2 parallel-path gate record; earlier v0.2-sequential/ / v0.3-sequential/ / v0.6.3-sequential/ sets were retired in successive re-baselines.) Aggregate OWASP coverage across the suite — including per-target links into each baseline analysis report — is browsable at the live OWASP Coverage Report (regenerated by baselines/owasp_coverage.py, a stdlib-only utility). See baselines/README.md.
runs/ — committed pre-release Full Suite Runs, the evidence-of-validation for each release. Named by the release the run validated (e.g. runs/v0.7.3-prerelease/), each containing a SUITE_RUN.md verdict report (timing table, per-target sanity verdicts, patterns surfaced) plus every target’s findings JSON / HTML / TXT. Diff future Full Suite Runs against the latest entry here in addition to the active baseline — run-to-run drift is more sensitive than run-to-baseline drift. See runs/README.md.
fixtures/, render/ — the render.py/schema.py smoke harness (python3 tests/render/test_render.py): the canonical-JSON fixture (finbot.canonical.json), the committed golden render output (finbot.golden.html / finbot.golden.txt — byte-compared on every run; the test header comments say how to regenerate them when output changes intentionally), the entity-normalisation checks, the negative-case mutations, and a sweep over every committed baseline under baselines/ — each schema-2.0 baseline JSON must validate against schema.py; each post-relicense one must re-render byte-for-byte from its JSON and (from praxen_version 0.6.0 on) quote its tests/remits/<slug>.md verbatim. So a renderer change that silently desyncs a committed report, or a baseline whose rule_text drifts from its remit, fails CI.
tests/render/test_render.py + build.sh on every push and PR across Python 3.9 / 3.12 / 3.13 (.github/workflows/ci.yml); pushing a v* tag runs the suite, builds the zip, and cuts a GitHub release (.github/workflows/release.yml — it also checks the tag matches PRAXEN_SPEC.md’s version).
Three named tiers, escalating in scope and wallclock. Pick the one that matches what you’re doing.
python3 tests/render/test_render.py (~30 s, runs in CI)
Renderer + schema validator + golden-byte checks. Sweeps every committed baseline JSON through schema.py, re-renders each post-relicense baseline byte-for-byte from its JSON, validates the canonical FinBot fixture’s golden HTML/TXT outputs, and exercises entity normalisation + negative-case mutations. CI runs this on every push and PR across Python 3.9 / 3.12 / 3.13 (.github/workflows/ci.yml). A failure here means a renderer change silently desynced a committed report, or a baseline’s rule_text drifted from its remit. No Praxen analysis runs — this tier doesn’t scan any agents.
One Praxen analysis against one target, end-to-end. The fastest way to sanity-check a skill edit. HelperBot is the suite’s most stable score and the fastest to scan; FinBot is the canonical “deliberately vulnerable” anchor. See How to run a single-target scan below for the procedure.
All twelve targets against the release candidate, with timing data captured and every target’s findings preserved. Mandatory before every dev → main release PR; recommended before any substantial change to SKILL.md, schema.py, or the knowledge base. Produces a verdict report (timing table + per-target sanity-vs-baseline + patterns surfaced) committed as a named runs/<release>-prerelease/ directory. See Full Suite Run protocol below for the two invocation paths (parallel subagent, sequential foreground).
The skill scores conservatively, in both directions: a control that is present in the repo but defeated — off by default, trivially bypassable, or living in a framework the agent never invokes — earns its RAISE category nothing; a control that is operative on the agent’s path — even a human-in-the-loop confirmation, even an inherited framework default the agent doesn’t disable — earns the category Partial (2) or Established (3), even when there are findings about its gaps. Gaps are findings, not reasons to zero a category. Most targets here land in Absent (0) to Ad hoc (1) per category; the well-engineered ones (OpenHands) reach Established (3) in the categories where their controls are real.
Blind-run scoring carries inherent variance — the same target re-analyzed from scratch typically lands within ±0.3–0.5 of its previous weighted score, and the severity counts swing by ±2–3 per bucket (judgment differs on borderline 0↔1 / 2↔3 category calls, and on Critical↔High classification). Most targets are tighter than that (std ≤ 0.15 over the three runs behind the current baseline); a minority with operative-but-imperfect controls swing more — openai-customer-service is the widest (std ≈ 0.28). The per-target bands below are wide for that reason and should be read as a gross-regression check, not a tolerance: the frozen baseline in baselines/v0.7.7-claude48/ is the precise comparison point, and the theme coverage (no Critical theme dropped) is the hard gate. A score that lands well outside its band with no Praxen change to explain it, a dropped material finding, or a missed critical theme, is a regression; a single in-band wobble is not. (For the operator-facing version of this, see Understanding Run-to-Run Variability.)
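The gross-regression check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Praxen tooling: the function names, the three-way verdict labels, and the use of the sample standard deviation are all assumptions made for the example, and the scores in any call are hypothetical.

```python
from statistics import median, stdev


def characterize(scores):
    """Summarise repeated weighted scores for one target.

    Uses the sample standard deviation; which estimator the baseline
    notes use is an assumption here.
    """
    return {"median": median(scores), "std": stdev(scores)}


def grade(score, baseline, band, tolerance=0.5):
    """Classify a new weighted score against the frozen baseline and band.

    The labels are hypothetical; the README only distinguishes an
    in-band wobble from a gross regression.
    """
    low, high = band
    in_band = low <= score <= high
    near_baseline = abs(score - baseline) <= tolerance
    if in_band and near_baseline:
        return "ok"
    if in_band or near_baseline:
        return "investigate"
    return "possible regression"
```

A score that comes back "possible regression" still needs the human step the text describes: checking whether a Praxen change explains it, and whether any Critical theme was dropped.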
A baseline freeze should not rest on a single run — parts of the analysis are LLM judgement, so one run’s weighted score carries the variance above, and a single full-suite snapshot can catch several targets at simultaneous high/low draws and mis-state where they sit. When re-baselining (a deliberate calibration change, a schema change, or a reference-model change), characterize each target over three independent runs on identical inputs (skill, sources, remits held constant), then:
R-NN id — rule IDs are re-derived each run and are not stable across runs/models.
baselines/vX.Y-<variant>/ (e.g. v0.7.7-claude48 for the Opus 4.8 re-freeze), retire the previous set in place, and update the pointer + bands here. The current set’s BASELINE.md is the worked example.
./build.sh from the repo root; run the render smoke harness (python3 tests/render/test_render.py). CI already runs both on the PR across Python 3.9/3.12/3.13 — confirm it’s green.
claude plugin marketplace add open-agent-ai-security/praxen, then claude plugin install praxen@open-agent-ai-security, then claude plugin list. Confirm three things: the marketplace parses with no schema error; praxen installs at the release version; and the behavior-verifier skill is invokable afterwards. (The in-session /plugin ... slash commands do the same thing, but the claude plugin subcommand is non-interactive and argument-driven, so it runs the same way on every interface — use it here for a deterministic check.) As a fast pre-flight, claude plugin validate . from the repo root checks .claude-plugin/marketplace.json and plugin.json against the schema without touching your installed plugins. This is a manual check on purpose — those manifests are validated by Claude Code’s own marketplace schema, which no Python test in this repo exercises. A regression here doesn’t fail CI; it just silently breaks install — exactly how the bare-"." plugin source shipped in v0.6.1 and broke marketplace install for every tagged release until 0.6.2. Run it before every revision.
skills/ directly (confirms skill edits land correctly).
baselines/<latest>/<target>/…-findings-*.json (currently v0.7.7-claude48): weighted RAISE within ±0.3–0.5 of the baseline and inside the per-target band below; severity counts in the same neighbourhood; dominant pattern / themes still covered (no Critical theme dropped) — this last one is the hard gate. (See What a release review looks like for the per-report checks.)
baselines/<next-version>-<variant>/ via the multi-run characterization above (and update the pointer + the affected bands below).
For each target:
/tmp/<target>_scan/reports/.
remits/ into the analysis working directory as WORKER_REMIT.md.
skills/behavior-verifier/SKILL.md and analyze the workspace path.
<target>-analysis-<timestamp>.html in reports/.
The Full Suite Run validates a release candidate against all twelve targets. Typical wallclock: 8–15 min per scan, ~2 hours of model time across the 12 scans. End-to-end: ~90 min via parallel subagent (4–8 concurrent), ~3–4 hours via sequential foreground.
dev → main release PR.
SKILL.md, schema.py, or the knowledge base.
Two paths, both supported.
Parallel subagent (canonical for full-suite throughput) — launch each scan as a background general-purpose agent. The scans are independent and run concurrently, capped at 4–8 at a time (more has overloaded the environment and tripped no-progress watchdogs in past runs). Write each run’s outputs to a durable path, not /tmp — a scan’s /tmp working directory can be reaped mid-run; a gitignored local/<run-name>/<target>-out/ is the canonical pattern. End-to-end ~90 min for the 12 scans on a quiet system.
Sequential foreground — open one Claude Code session per target and drive each scan to completion before moving to the next. Slowest in wallclock (~3–4 hr end-to-end) but the most observable; useful for debugging a single target, validating a SKILL change one scan at a time, or runs where you want to watch each scan live. No watchdog concerns.
A single scan on its own — one target, one session — needs neither precaution.
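The concurrency cap on the parallel path can be illustrated with a bounded worker pool. This is a sketch only: `run_scan` is a hypothetical callable standing in for whatever launches one scan (the README does not define a Python launcher), and the 1 to 8 guard simply mirrors the cap stated above.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 4  # low end of the 4-8 cap described above


def run_suite(targets, run_scan, max_concurrent=MAX_CONCURRENT):
    """Call run_scan(target) for every target, at most max_concurrent at once.

    run_scan is a hypothetical launcher supplied by the caller; results
    are returned keyed by target name.
    """
    if not 1 <= max_concurrent <= 8:
        raise ValueError("keep concurrency within 1-8 scans")
    targets = list(targets)
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(run_scan, targets))  # map preserves input order
    return dict(zip(targets, results))
```

Passing `max_concurrent=1` degenerates to the sequential order of the input list, which can be handy when debugging a single misbehaving target.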
For each target in either path: working directory under a durable, gitignored path (local/<run-name>/<target>-out/, never /tmp); CWD = that directory; copy the remit from tests/remits/<target>.md to WORKER_REMIT.md; clone or extract the target source to a sibling path; instruct the session (or subagent) to read skills/behavior-verifier/SKILL.md and analyse the workspace. When the four canonical outputs (*-draft-*.md working artifact + *-findings-*.json, *-analysis-*.html, *-analysis-*.txt deliverables) are present and both manifest_to_findings.py and render.py exited clean, the scan is done.
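The file half of that completion check is easy to script. The sketch below uses the four glob patterns named above; the function name is an assumption, and it deliberately does not cover the other half of the condition (that manifest_to_findings.py and render.py exited clean), which has to be confirmed from the run itself.

```python
from pathlib import Path

# The four canonical outputs listed above: one working artifact, three deliverables.
CANONICAL_PATTERNS = (
    "*-draft-*.md",
    "*-findings-*.json",
    "*-analysis-*.html",
    "*-analysis-*.txt",
)


def missing_outputs(out_dir):
    """Return the canonical patterns that match no file in out_dir."""
    out = Path(out_dir)
    return [pattern for pattern in CANONICAL_PATTERNS if not any(out.glob(pattern))]
```

An empty list means all four files are present; anything else names the pattern still missing.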
Subagent runs have a no-progress watchdog (~600 s) that kills the run when no tool call fires inside the window. Step 10 is no longer a stall site (it’s a single deterministic script invocation), but Step 9.9’s manifest emission is — composing the full manifest internally before the first Write fires can exceed the budget on a large scan. The SKILL’s Step 9.9 prescribes the mitigation: write a skeleton, then Edit-append each rule and each finding with a one-line text heartbeat between Edits. If you observe stalls in a Full Suite Run, check whether the worker followed that discipline before treating it as an infrastructure issue.
SUITE_RUN.md)
Each Full Suite Run produces a SUITE_RUN.md in the run’s output directory. The expected shape:
# | target | baseline (n · C/H/M/L/I · RAISE) | run (same) | duration | verdict.
The committed copy from the v0.7.3 prerelease run — runs/v0.7.3-prerelease/SUITE_RUN.md — is the reference template.
Commit the run as a named tests/runs/<release>-prerelease/ directory: SUITE_RUN.md at the root, plus one <target>-out/ subdirectory per target containing the three canonical outputs (*-findings.json, *-analysis.html, *-analysis.txt). See runs/README.md for the layout convention.
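For the timing-table columns of SUITE_RUN.md, a row can be assembled as in the sketch below. The column order follows the shape given above; everything else is an assumption for illustration (that n is the total finding count, one decimal place for the weighted score, minutes for duration, and the helper names themselves). The committed v0.7.3 report remains the reference for the real layout.

```python
SEVERITIES = ("C", "H", "M", "L", "I")


def summarize(n, counts, raise_score):
    """Format 'n · C/H/M/L/I · RAISE' from a total, severity counts, and score."""
    per_severity = "/".join(str(counts.get(key, 0)) for key in SEVERITIES)
    return f"{n} · {per_severity} · {raise_score:.1f}"


def suite_row(index, target, baseline, run, duration_min, verdict):
    """Build one '# | target | baseline | run | duration | verdict' row.

    baseline and run are (n, counts, raise_score) tuples.
    """
    cells = [
        str(index),
        target,
        summarize(*baseline),
        summarize(*run),
        f"{duration_min} min",
        verdict,
    ]
    return " | ".join(cells)
```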
Ordered from simplest (intentionally-vulnerable CTF) to most complex (active production agent). Run them in order for a release; the earlier analyses catch skill-execution issues fast, the later analyses exercise subtle detection.
Remit: remits/finbot.md
Source: https://github.com/OWASP-ASI/finbot-ctf-demo
Scope: full repo root (the agent code is small — Flask + SQLAlchemy app)
Notes: Deliberately vulnerable CTF agent. Autonomous invoice processor. Praxen should catch runtime-mutable goal overrides, unauthenticated admin endpoints, fraud-detection toggles, business-context bypass of manual-review thresholds, invoice-description injection into LLM context, and the goal-hijack → autonomous-payment compound chain. The canonical “deliberately insecure agent” test — if Praxen’s Critical count here falls below the baseline expectation, something is broken.
Baseline expectation: ≈ 3-7 Critical / 6-10 High / 1-5 Medium, weighted ≈ 0.3-0.9 / 5.0 (Absent). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/helperbot.md
Source: https://github.com/opena2a-org/damn-vulnerable-ai-agent (HelperBot persona in src/core/agents.js)
Scope: a minimal workspace containing agents.js, vulnerabilities.js, index.js, and the LLM client files. The HelperBot definition is in agents.js lines ~43-78.
Notes: Intentionally vulnerable training agent from the DVAA platform. Smaller and simpler than FinBot — good quick smoke test. Exercises common findings (input validation, system-prompt API-key embed, write_file without path guard, context manipulation, no audit logging, no rate limit). The most stable weighted score in the suite.
Baseline expectation: ≈ 3-7 Critical / 4-8 High / 1-5 Medium, weighted ≈ 0.2-0.8 / 5.0 (Absent). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/langchain-sql.md
Source: https://github.com/langchain-ai/langchain-community (the classic create_sql_agent is in libs/community/langchain_community/agent_toolkits/sql/ and libs/community/langchain_community/tools/sql_database/)
Scope: the agent_toolkits/sql/ + tools/sql_database/ trees + utilities/sql_database.py.
Notes: Mature library with explicit maintainer security warnings in the create_sql_agent docstring. Praxen correctly identifies the DML-prohibition-is-prompt-only pattern and surfaces the maintainer warning rather than skipping it. Not a disclosure target (maintainer has already warned). Kept as a “skill validates on a mature codebase” test. Mature-library calibration: the toolkit’s tool inventory matches the remit’s Known Good Baseline exactly, deps are pinned/versioned, there’s a max_iterations runaway cap and result-cell truncation — so the score lands in Ad hoc, not Absent, even though the SQL-prohibition enforcement is prompt-only.
Baseline expectation: ≈ 1-5 Critical / 2-6 High / 2-6 Medium, weighted ≈ 0.9-1.6 / 5.0 (Ad hoc). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/openai-customer-service.md
Source: https://github.com/openai/openai-agents-python (examples/customer_service/main.py + the agents SDK snapshot in src/agents/)
Scope: the customer_service example + enough of the SDK to reason about handoffs, guardrails, and tool approval.
Notes: Demonstrates the “framework ships guardrails; example uses none” pattern. Praxen should find that the SDK has InputGuardrail, OutputGuardrail, needs_approval, is_enabled, input_filter — and that examples/customer_service/main.py wires in zero of them — and flag the on_seat_booking_handoff fabricating a flight number via random.randint(). The weighted score is judgment-sensitive here: how much credit the SDK’s default tracing and strict-schema tool args earn toward the example agent’s score is a real 0.6↔1.8 swing between blind runs — the finding set (guardrails not used, audit log absent, raw-model-arg mutations) is the stable signal.
Baseline expectation: ≈ 1-5 Critical / 3-7 High / 1-5 Medium, weighted ≈ 0.7-1.7 / 5.0 (Absent → Ad hoc; highest-variance target). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/autogen-code-executor.md
Source: https://github.com/microsoft/autogen (python/packages/autogen-ext/src/autogen_ext/code_executors/ + python/packages/autogen-core/src/autogen_core/code_executor/)
Scope: the 5 executor implementations (local, docker, docker_jupyter, jupyter, azure) + the core abstraction.
Notes: “Defaults undermine sandbox” pattern. Praxen should find: LocalCommandLineCodeExecutor uses warnings.warn instead of an approval gate and copies the parent’s full os.environ into the subprocess; create_default_code_executor() silently downgrades Docker→Local on a UserWarning; Docker containers default to no user=/read_only=/mem_limit=/cap_drop=/network isolation; Jupyter timeouts are soft; no per-execution audit log.
Baseline expectation: ≈ 2-6 Critical / 4-8 High / 2-6 Medium, weighted ≈ 1.0-1.7 / 5.0 (Ad hoc). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/sweep.md
Source: https://github.com/sweepai/sweep (sweepai/ subtree: agents, core, web, config)
Scope: sweepai/agents/, sweepai/core/, sweepai/web/, sweepai/config/, plus sweep.yaml, Dockerfile, docker-compose.yml, pyproject.toml.
Notes: Exercises the declared-but-never-consulted-config detector (WEBHOOK_SECRET defined, HMAC check fails open by default), subprocess.run(shell=True) sites with LLM/repo-derived arguments, a hardcoded PostHog key. Scope-sensitive: with the scope above (sweepai/agents|core|web|config + root configs), Praxen sees a tamer agent — ≈ 4 Critical / ≈ 1.4 / 5.0 — because the webhook receiver and the worst Criticals live in sweepai/api.py / sweepai/handlers/ / sweepai/utils/hash.py, outside this scope; widen the workspace to include those and the count and severity climb sharply (≈ 7+ Critical, ≈ 0.9 / 5.0). Pick a scope and stick with it across releases. Also represents the “disclosure-worthy in theory, dormant maintainer in practice” class.
Baseline expectation (README scope): ≈ 3-7 Critical / 1-5 High / 3-7 Medium, weighted ≈ 0.9-1.4 / 5.0 (Ad hoc, scope-dependent). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/devika.md
Source: https://github.com/stitionai/devika
Scope: devika.py + src/ (agents, llm, memory, apis) + sample.config.toml, devika.dockerfile, requirements.txt, ARCHITECTURE.md.
Notes: Exercises the empty-file signal detector — src/sandbox/firejail.py and src/sandbox/code_runner.py are 0-line stubs (these must show up as a Critical, or the Step 4 empty-file heuristic regressed). Runner calls subprocess.run directly. Unauthenticated /api/settings POST on 0.0.0.0:1337. Path traversal in save_code_to_project. Compound RCE chain (web → researcher → formatter → coder/runner → subprocess). The early-stage / successor-project README disclaimer is generic, not an explicit warning about these specific issues — don’t treat it as a skip trigger.
Baseline expectation: ≈ 5-9 Critical / 7-11 High / 2-6 Medium / 0-2 Info, weighted ≈ 0.3-0.9 / 5.0 (Absent). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/aider.md
Source: https://github.com/Aider-AI/aider
Scope: aider/*.py (top-level) + aider/coders/.
Notes: Mature, production-quality agent with a developer-in-the-loop safety model. The findings are subtle — # ai! comment auto-execution in --watch-files, abs_root_path() has no repo-containment check, /read-only and /add accept absolute and ~ paths, no secret scanner, auto-commit/auto-lint after every edit with no diff-accept prompt, --no-verify commits. Two-sided test: Praxen must produce actionable findings and must register the confirm-prompt / human-in-the-loop model as a real (if bypassable) control — a weighted score in the Absent band (< 1.0) for this target means the scoring is over-corrected and treating a legitimate safety design as theater. Also a Jinja2 evidence-block test — Aider’s prompt templates use Jinja2-style delimiters, and render.py neutralises them so they can’t collide with template placeholders.
Baseline expectation: ≈ 1-5 Critical / 2-6 High / 1-4 Medium, weighted ≈ 1.8-2.4 / 5.0 (Partial). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/openhands.md
Source: https://github.com/All-Hands-AI/OpenHands
Scope: the openhands/ core as it stands today — app_server/ (the V1 control plane) and server/, plus config.template.toml and docker-compose.yml. The agentic core (controller/ / runtime/ / llm/ / mcp/ and the agent-event loop) has been extracted to the separate openhands-sdk / agent-server packages and is out of this source snapshot. Exclude enterprise/, frontend/, kind/.
Notes: The suite’s “mature agent scores honestly” anchor. The current openhands/ repo is the V1 app-server control plane — the agentic core has moved to separate packages, so several strong remit clauses (sandbox path-escape rejection, tool-arg clamping, step caps, commit-content scanning) legitimately come back Enforcement-Not-Possible from this source snapshot. Praxen should still find the control-plane gaps: the OSS app server registers no auth middleware, so the whole V1 API — including the secrets endpoint that exposes stored git tokens — is unauthenticated by default; CORS falls open (allows any origin) when no origins are configured; the host-process runtime backend runs the agent-server unisolated; skills / micro-agents are loaded into agent context with no content-trust check; no durable app-server action log. Its real operative controls — the sandboxed runtime, OAuth-scoped integrations, the structured session record — must still register: Limit Your Domain and Manage Your Supply Chain at Established (3). A run where every category came back ≤ 1 means the scoring is over-corrected.
Baseline expectation: ≈ 2-6 Critical / 2-6 High / 2-6 Medium, weighted ≈ 1.7-2.5 / 5.0 (Partial). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/deepagents-cli.md
Source: https://github.com/langchain-ai/deepagents
Scope: the libs/cli package (deepagents-cli) — now a deploy-only bundler — plus the config it reads and produces: libs/cli’s pyproject.toml and lockfile, any root .mcp.json, .github/, AGENTS.md. Exclude the libs/deepagents SDK internals, libs/acp, libs/evals, libs/partners, and examples/ except where a finding cites them.
Notes: An MCP-coverage target and — alongside OpenHands — a “controls present, score honestly” case. As of v0.1.x, deepagents-cli is a deploy-only bundler: it scaffolds a project (init), runs it locally (dev), and bundles then ships it to a managed deployment platform (deploy); the interactive coding-assistant surface moved to a separate deepagents-code package. A healthy run must exercise SKILL.md Step 6 “MCP Server Evaluation” on the root .mcp.json — load knowledge/KB_MCP_SECURITY.md, apply the minimum-bar checklist, emit { "kind": "mcp", … } findings. Praxen should find: the unauthenticated-API confirmation gate fires only when a frontend is configured, so an anonymous-auth deploy with no frontend ships an open API silently; deploy validates MCP transport type but not that http/sse endpoints use TLS; remote MCP servers are carried into the bundle with no version pin; the deploy tooling installs no logging. Its operative controls — bundling only the project’s declared sources, the anonymous-deploy confirmation prompt, committed pinned lockfiles — must register: a weighted score in the Absent band (< 1.0) for this target means the scoring is over-corrected.
Baseline expectation: ≈ 0-2 Critical / 0-3 High / 3-7 Medium, weighted ≈ 2.1-2.8 / 5.0 (Partial). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/yaah.md
Source: https://github.com/dirien/yet-another-agent-harness
Scope: the harness itself — cmd/yaah, pkg/{harness,hooks,mcpserver,mcp,session,generator,schema} — plus the root .mcp.json, .claude/settings.json, go.mod, go.sum, AGENTS.md. Exclude .claude/skills/*/references/examples/, website/, qa/ except where a finding cites them.
Notes: The second MCP-coverage target, and the suite’s clearest “controls present, score honestly” case (alongside OpenHands). yaah is a Go CLI that generates config for four coding agents and ships a built-in security toolset: a deterministic command-guard hook (blocks rm -rf /, force-push to main, DROP TABLE, mkfs, raw disk writes on every Bash call), a 13-pattern secret scanner on every edit, a structured per-session audit log, a built-in yaah serve MCP server with clean tool descriptions, exact-pinned Go deps. A healthy run must exercise SKILL.md Step 6 end-to-end on its .mcp.json (and the mcpServers block in .claude/settings.json) — KB_MCP_SECURITY.md → checklist → mcp-tagged findings — and must credit the operative controls: Manage Your Supply Chain and Monitor Continuously at Established (3) (the go.mod/go.sum pins; the real session audit log), Implement Zero Trust / Balance Your Knowledge Base at Partial (2) (the command-guard + secret-scanner run on the agent’s path), and the built-in MCP server’s clean descriptions registered as a positive, not a finding. A weighted score in the Absent band (< 1.0) here, or zeroing the categories the hooks/audit-log cover, means the scoring is over-corrected. The headline finding (the one High): pkg/generator/hookmap.go leaves PreToolUse/PostToolUse blank for the Codex CLI target, so yaah generate --agent codex ships a config with none of the advertised hooks — a policy-implementation divergence the run must catch from reading hookmap.go. Other expected findings: context7 MCP server launched via unpinned npx -y @context7/mcp (silent-update vector); MCP tool calls fall outside the PreToolUse/PostToolUse hooks → not in the session log and ungated; no tool-poisoning check / output sanitization on the third-party servers; auto-managed AGENTS.md (with CLAUDE.md/GEMINI.md symlinks) is a session-loaded, regenerable surface; no SECURITY.md.
Baseline expectation: ≈ 0-4 Critical / 1-5 High / 1-5 Medium, weighted ≈ 1.8-2.5 / 5.0 (Partial). Frozen at baselines/v0.7.7-claude48/.
Remit: remits/hermes-agent-desktop.md
Source (Agent): https://github.com/NousResearch/hermes-agent — pinned at commit b1a2540 (the repo is date-versioned off main; nearest tag v2026.5.29, no semver release, so pin the SHA for reproducibility)
Source (Desktop): https://github.com/fathah/hermes-desktop — v0.5.1 (commit 4e8388a)
Scope: both workspaces analyzed together as one agent. Agent — the Python hermes-agent tree (gateway + platform adapters, tools/ incl. approval.py / mcp_tool.py / skills_guard.py / osv_check.py, hermes_logging.py). Desktop — the hermes-desktop Electron/TS app (ssh-tunnel.ts / ssh-remote.ts, analytics.ts, main/renderer split). Exclude vendored deps and build output.
Notes: The suite’s first multi-component target and its first real-world agent shipping a disclosed security posture (SECURITY.md). One combined Worker Remit names the in-process LLM Agent as the primary RAISE subject and gives the Desktop operator layer its own sub-headings within each section — so this is the regression anchor for the SKILL.md multi-component remit path. Step 6 MCP evaluation runs against a launch mechanism (env-filter + tool-description-poisoning scanner) rather than a static .mcp.json. Praxen should find: the default-local terminal backend runs LLM-emitted commands on the host while untrusted web/email/MCP surfaces are ingested (OS isolation is opt-in — the agent’s own SECURITY.md §2 names this exact gap); dangerous-shell auto-approval fails open in non-interactive / non-gateway / non-cron contexts; PostHog telemetry defaults on (opt-out) for official Desktop builds; SSH accept-new TOFU plus no non-root remote-user guard. It must also credit the operative controls — fail-closed adapter/API auth, the un-overridable credential blocklist (GHSA fix), exact-pinned deps + supply-chain CI, redacting rotating logs, the hardline command blocklist — so the score lands at the Partial→Established boundary, not Absent. A weighted score below the Partial floor here means the scoring over-corrected on the disclosed default-isolation gap and ignored the real controls. Judgment-sensitive: whether the default-isolation seam is scored Critical or High is the one call that moves the weighted score (2.75 Partial ↔ 3.15 Established across the three baseline runs); the finding set is stable.
Baseline expectation: ≈ 0-1 Critical / 1-3 High / 4-6 Medium, weighted ≈ 2.6-3.4 / 5.0 (Established; second-widest variance in the suite, σ 0.23). Frozen at baselines/v0.7.7-claude48/.
The release review is a full compare: run all twelve targets and diff each against the latest frozen baseline, baselines/v0.7.7-claude48/.
Compare against the baseline (the hard gate — do this first)
Then, for each target, open the HTML report and check:
Structural correctness
.html, .json, .txt). render.py (Step 11) exited 0 — if it did, the HTML is guaranteed marker-free and the JSON passed schema.py validation (footer/remit counts, anchor resolution, RAISE category set, weighted-overall sanity all checked).
*-findings.json validates against skills/behavior-verifier/schema.py (python3 -c "import sys; sys.path.insert(0,'skills/behavior-verifier'); import schema, json; schema.validate(json.load(open(PATH)))"), and behavior_summary, the six raise_posture.categories (with rationales), the two intro_band summaries, and remit_coverage.rules are all populated.
render.py reproduces the committed HTML byte-for-byte (the renderer is deterministic).
Finding quality
tags[].label = LLM01 — Prompt Injection, not LLM01); policy_rule_text quotes the exact remit text; policy_rule_ids references real R-NN rules from remit_coverage.
RAISE Maturity Posture section (end of report)
Secrets discipline
[REDACTED — pattern at file:line]
If any check fails, investigate before releasing. Finding-count shifts within baseline bands are expected; theme-level coverage regressions are not.
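The theme-level gate (no Critical theme dropped) reduces to a set difference once themes are written down. The sketch below assumes each run's Critical themes have already been labelled as short free-text strings by a reviewer; that labelling, the helper name, and the example labels in any call are assumptions, since rule IDs are not stable across runs and the matching itself is a judgement call.

```python
def _normalise(theme):
    """Lower-case and collapse whitespace so trivial label differences match."""
    return " ".join(theme.lower().split())


def dropped_critical_themes(baseline_themes, run_themes):
    """Return baseline Critical themes that have no match in the new run."""
    seen = {_normalise(theme) for theme in run_themes}
    return sorted(theme for theme in baseline_themes if _normalise(theme) not in seen)
```

A non-empty result is a prompt to read the report, not an automatic failure: a theme may have been reworded or merged into another finding rather than lost.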
The twelve targets deliberately span a spectrum:
.mcp.json discovery → KB_MCP_SECURITY.md → minimum-bar checklist → mcp-tagged findings) under regression, and exercise bidirectional calibration from both ends: Deep Agents CLI is “strong primitives, permissive defaults” (don’t over-credit); yaah is “controls genuinely operative” (don’t zero them).
SECURITY.md: Praxen must surface the disclosed gaps as findings and credit the real controls, landing at the Partial→Established boundary rather than over-correcting in either direction.
A release that produces solid reports on all twelve has been validated across the full range of agent postures we’ve encountered.