Frozen runs of all eleven test targets (../../README.md) on the Praxen v0.7.7 skill, against the intent-level Worker Remits (tests/remits/*.md). The eleven runs were produced cold on 2026-05-29 against the release/0.7.7 branch (off dev) with version-source-of-truth bumped to 0.7.7, sources fresh-cloned from upstream. Retires the previous v0.7.4-sequential/ set, which is no longer the comparison point but is kept on disk for diff archaeology.
RotatingFileHandler / FileHandler, Node.js winston / pino file transports, Go log.SetOutput / zap file sinks, or language-equivalent log-routing configuration), the scanner now infers the runtime log file locations and records each with mtime: "unknown" and status: "inferred". These rows give the operator an accurate picture of where runtime logs will appear on a deployed instance and lift Monitor Continuously scoring on source-only scans where it had previously bottomed at 0/1.

inferred rows in the log-files table communicate the situation; a “no logging” finding is warranted only when there is no logging infrastructure at all.

[Inferred] rather than writing MUST NOT clauses based on assumed scope. Multi-component-deployment guidance covers when to combine vs split remits and how to structure a combined remit (scope note in Mission designating the primary RAISE subject, sub-headings within existing sections). This change does NOT affect scan-time behavior: pre-flight only runs during remit authoring.

findings.schema.json enum cleanup. log_files row status enum is now ["active", "inferred"]. "inferred" is new (PR #43); "new" was removed as a vestige, since it had appeared exactly once across all committed baselines (the v0.7.4 aider scan) and that one use was a misclassification of what should have been "inferred". The scanner is read-only and has no scan-start-time comparison logic, so “freshly created this run” was never a semantically distinguishable case from "active". schema_version stays at "2.0".

log-status-inferred CSS class (muted color), id="logs" anchor on the Discovered Log Files section, and a Logs entry in the jump nav.

manifest_to_findings.py and the four knowledge bases (KB_RAISE_SCANNING, KB_LLM_TOP10, KB_AGENTIC_TOP10, KB_MCP_SECURITY) are byte-identical to 0.7.6.

| Metric | v0.7.4 | v0.7.7 | Delta |
|---|---|---|---|
| Total findings (across 11 targets) | 135 | 109 | −19.3% |
| Weighted RAISE deltas per target | — | — | all within ±0.45 |
| Targets in expected band | 11/11 | 11/11 | — |
| Critical themes preserved | yes | yes | — |
| Inferred log rows surfaced | n/a | 9 (across 6 targets) | new behavior |
The 19.3% drop in total findings is within natural run-to-run variance (the SKILL’s Build an AI Red Team calibration note documents ±2–3 swings per severity bucket per blind run). Per-target weighted RAISE is stable: 10 of 11 targets are within ±0.15, and the largest delta is deepagents-cli at −0.45 (boundary of natural variance, scope unchanged). Every Critical theme catalogued in ../../README.md’s per-target notes is present in the v0.7.7 set, so the “no Critical theme dropped” hard gate passes for all 11.
The new inferred log-file rows fired correctly on six targets: openhands (3), yaah (2), and openai-customer-service, sweep, devika, and deepagents-cli (1 each), for 9 inferred rows total across the suite. The remaining five targets had no logging infrastructure visible in the in-scope source files.
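The inferred-row tally can be checked per target from the findings JSON. The following is a minimal sketch, assuming `log_files` is a top-level list whose rows carry a `status` field (the nesting and the `path` field are assumptions; only `mtime`, `status`, and `schema_version` are named in this document). The sample payload is made up for illustration and is not taken from any baseline.

```python
import json
from collections import Counter


def count_log_statuses(findings: dict) -> Counter:
    """Tally log_files rows by status ("active" or "inferred")."""
    # Assumption: log_files sits at the top level of the findings document.
    rows = findings.get("log_files", [])
    return Counter(row.get("status", "missing") for row in rows)


# Hypothetical payload; the "path" key is an illustrative field name.
sample = json.loads("""
{
  "schema_version": "2.0",
  "log_files": [
    {"path": "logs/app.log", "mtime": "unknown", "status": "inferred"},
    {"path": "logs/agent.log", "mtime": "unknown", "status": "inferred"},
    {"path": "run.log", "mtime": "2026-05-29T10:00:00Z", "status": "active"}
  ]
}
""")

counts = count_log_statuses(sample)
print(counts["inferred"], counts["active"])
```

Summing `counts["inferred"]` over the eleven findings files should reproduce the suite-wide figure if the layout assumption holds.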
Sorted by weighted RAISE, ascending.
| Target | Critical | High | Medium | Low | Info | Weighted | Maturity |
|---|---|---|---|---|---|---|---|
| finbot | 7 | 4 | 4 | 0 | 0 | 0.45 | Absent |
| helperbot | 3 | 4 | 1 | 0 | 0 | 0.45 | Absent |
| devika | 6 | 6 | 3 | 0 | 0 | 0.60 | Absent |
| openai-customer-service | 4 | 4 | 2 | 0 | 0 | 0.90 | Absent |
| sweep | 4 | 3 | 2 | 0 | 0 | 0.90 | Absent |
| langchain-sql | 2 | 5 | 2 | 0 | 0 | 1.00 | Ad hoc |
| autogen-code-executor | 4 | 4 | 1 | 0 | 0 | 1.15 | Ad hoc |
| aider | 3 | 5 | 2 | 0 | 0 | 1.40 | Ad hoc |
| deepagents-cli | 0 | 3 | 3 | 0 | 0 | 2.00 | Partial |
| openhands | 2 | 3 | 4 | 1 | 1 | 2.05 | Partial |
| yaah | 0 | 3 | 4 | 0 | 0 | 2.15 | Partial |
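As a quick arithmetic check, the severity counts in the table above can be summed to confirm the suite total and the delta against v0.7.4. This sketch only restates numbers already in this document; the variable names are illustrative.

```python
# Severity counts copied from the table: (Critical, High, Medium, Low, Info).
COUNTS = {
    "finbot": (7, 4, 4, 0, 0),
    "helperbot": (3, 4, 1, 0, 0),
    "devika": (6, 6, 3, 0, 0),
    "openai-customer-service": (4, 4, 2, 0, 0),
    "sweep": (4, 3, 2, 0, 0),
    "langchain-sql": (2, 5, 2, 0, 0),
    "autogen-code-executor": (4, 4, 1, 0, 0),
    "aider": (3, 5, 2, 0, 0),
    "deepagents-cli": (0, 3, 3, 0, 0),
    "openhands": (2, 3, 4, 1, 1),
    "yaah": (0, 3, 4, 0, 0),
}
BASELINE_TOTAL = 135  # v0.7.4 total findings, from the summary table

per_target = {name: sum(row) for name, row in COUNTS.items()}
total = sum(per_target.values())
pct_change = (total - BASELINE_TOTAL) / BASELINE_TOTAL * 100

print(total, round(pct_change, 1))  # prints: 109 -19.3
```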
The Full Suite Run was executed in two sessions across 2026-05-28 → 2026-05-29 because the first attempt coincided with the Claude 4.8 release and tripped concurrency-induced watchdog stalls. The pattern that worked: parallel subagents in batches of 4 with explicit heartbeat-discipline at the top of each prompt, Step 9.9 disk-write mandate emphasized, plus a continuation-subagent pattern for any agent that died mid-findings-loop (continuation reads the on-disk draft, completes remaining findings + positives + log_files, runs manifest_to_findings.py + render.py).
airline-customer-service-agent corrected to openai-customer-service on freeze), autogen-code-executor, deepagents-cli.

manifest_to_findings.py: sweep.

All eleven outputs went through manifest_to_findings.py and render.py cleanly; the byte-identity gate in tests/render/test_render.py passes 242/0 against this set.
# Diff a single target's findings JSON across baseline sets
diff <(python3 -m json.tool tests/baselines/v0.7.4-sequential/finbot/finbot-findings-2026-05-26.json) \
     <(python3 -m json.tool tests/baselines/v0.7.7-sequential/finbot/finbot-findings-2026-05-29.json)

# Re-render any baseline from its JSON (byte-identical re-render is enforced by tests/render/test_render.py)
python3 skills/behavior-verifier/render.py \
  --findings tests/baselines/v0.7.7-sequential/<target>/<target>-findings-<date>.json \
  --template skills/behavior-verifier/report_template.html \
  --out-html /tmp/<target>.html --out-txt /tmp/<target>.txt
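Where process substitution is unavailable, the findings diff can be approximated in Python with the standard library. This is a sketch, not part of the repository: the helper names are hypothetical, and unlike plain `json.tool` it sorts keys, so key reordering does not show up as a change. The two payloads are invented stand-ins for real findings files.

```python
import difflib
import json


def normalized_lines(payload: dict) -> list[str]:
    """Serialize with sorted keys and fixed indentation for a stable diff."""
    return json.dumps(payload, indent=2, sort_keys=True).splitlines()


def diff_findings(old: dict, new: dict) -> list[str]:
    """Return unified-diff lines between two findings documents."""
    return list(
        difflib.unified_diff(
            normalized_lines(old),
            normalized_lines(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


# Hypothetical payloads; real use would json.load() the two baseline files.
old = {"schema_version": "2.0", "log_files": [{"status": "new"}]}
new = {"schema_version": "2.0", "log_files": [{"status": "inferred"}]}

for line in diff_findings(old, new):
    print(line)
```

An empty result means the two documents are equal after normalization, which is a weaker check than the byte-identity gate in tests/render/test_render.py.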