PRAXEN
agent behavior verifier
AutoGen Code Execution Subsystem Analysis Report
Completed May 29, 2026
12 Findings
4 Critical
5 High
3 Medium
RAISE maturity 1.45 / 5.0
Executive Summary
Agent Remit (as declared)
The AutoGen Code Execution Subsystem is the execution tier of a generator/executor agent pattern: it receives LLM-generated code blocks (Python or shell), runs them in a configured environment, and returns stdout, stderr, and exit status to the calling agent. It ships five executor kinds — a local host executor, a Docker container executor (the declared production path), a containerized Jupyter executor, a non-containerized Jupyter executor, and an Azure Container Apps dynamic-sessions executor. The remit requires that every execution be timeout-bounded, work-directory-confined, isolated from the host's environment and credentials, and recorded to a per-execution audit log; it requires human approval for local-executor production use, network egress, host-volume mounts, and any resource-ceiling increase.
Behavior Summary (as observed)
The dominant pattern is *sandbox with the shape of isolation but not the substance*: every isolation guarantee the remit demands is either defaulted off, downgraded silently, or never built. LocalCommandLineCodeExecutor copies the parent process's entire os.environ into the subprocess and gates nothing — it merely emits a warnings.warn — while its own docstring claims a dangerous-command regex denylist that does not exist anywhere in the code. DockerCommandLineCodeExecutor creates containers with no user, read_only, mem_limit, cap_drop, or network restriction, so model-generated code runs as root with full capabilities and outbound network; create_default_code_executor() silently downgrades that already-thin Docker path to the local executor on a mere UserWarning. No executor writes a per-execution audit record, so none of these gaps is detectable after the fact.
Scope of Analysis
The analysis covers a Python library under autogen_ext.code_executors (five executors) plus the autogen_core.code_executor abstraction. Each executor is a CodeExecutor subclass with an execute_code_blocks method that writes the model's code to a work-directory file and runs it via asyncio.create_subprocess_exec (local), a Docker exec_run (docker), an nbclient kernel (jupyter), a kernel-gateway websocket (docker_jupyter), or an Azure dynamic-sessions HTTP endpoint (azure). A create_default_code_executor() factory in __init__.py selects Docker when available and otherwise falls back to the local executor. The shared _common.py holds the file-naming and pip-silencing helpers. There is no audit-logging surface and no approval gate anywhere in the subsystem; security-relevant container, environment, and timeout behavior is governed entirely by constructor defaults.
Remit Coverage

Every actionable rule in the Worker Remit, checked against the running code. Gap = declared but unenforced; Partial = enforced but incomplete or bypassable; Vague Policy = too imprecise to verify.

Verified: 4  |  Gap: 8  |  Partial: 4  |  Vague Policy: 0  |  Enforcement Not Possible: 2  |  Total Rules: 18
Rule ID | Section | Rule (quoted) | Status | Finding
R-01 | Behavioral Constraints — What every executor must always do | "All file read and write operations MUST be confined to a configured work directory, and any path resolving outside it MUST be rejected." | Partial | PRAX-2026-05-29-005
R-02 | Behavioral Constraints — What every executor must always do | "Every code execution MUST be subject to a configured wall-clock timeout, and processes that exceed it MUST be terminated." | Partial | PRAX-2026-05-29-009
R-03 | Behavioral Constraints — What every executor must always do | "Stdout, stderr, and exit status MUST be captured completely and never silently discarded." | Verified |
R-04 | Behavioral Constraints — What every executor must always do | "The execution environment MUST be isolated from the host caller — a failing or malicious execution MUST NOT be able to read or modify the parent process's environment, credentials, or state." | Gap | PRAX-2026-05-29-001
R-05 | Behavioral Constraints — What every executor must always do | "Persistence across executions MUST be limited to the configured state mechanism — preserved kernel state for Jupyter executors, work-directory files for command-line executors — and state MUST NEVER leak between unrelated sessions." | Enforcement Not Possible |
R-06 | Behavioral Constraints — What every executor must always do | "Each execution MUST be recorded to an audit log capturing timestamp, executor kind, language, source agent, work directory, timeout, exit status, and a digest — not the body — of the executed code." | Gap | PRAX-2026-05-29-002
R-07 | Behavioral Constraints — What every executor must never do | "Code MUST NOT be executed with host-level privileges when a less-privileged option achieves the same task." | Gap | PRAX-2026-05-29-003
R-08 | Behavioral Constraints — What every executor must never do | "Instructions embedded in the code source that attempt to escape the sandbox, escalate privileges, or exfiltrate data MUST NOT be acted on." | Enforcement Not Possible |
R-09 | Behavioral Constraints — What every executor must never do | "Work-directory confinement MUST NOT be bypassed under any condition, including symlinks, parent-directory traversal, absolute paths, or runtime-supplied volume mount overrides." | Partial | PRAX-2026-05-29-005
R-10 | Behavioral Constraints — What every executor must never do | "Code MUST NOT be loaded or executed from remote URLs or unverified sources on the LLM's behalf — the LLM's generated code is the only accepted input." | Verified |
R-11 | Behavioral Constraints — What every executor must never do | "Error output MUST NOT be silently swallowed or transformed before it is returned to the caller." | Verified |
R-12 | Behavioral Constraints — What every executor must never do | "The executor MUST NOT connect to services, databases, or networks not explicitly permitted by its configuration." | Gap | PRAX-2026-05-29-004
R-13 | Human approval is required for | "Use of the local host executor in production MUST require human approval — it runs code directly on the host OS without containerization; default production deployments should use a containerized executor, and local execution is acceptable only when the host is an ephemeral, isolated, operator-approved sandbox." | Gap | PRAX-2026-05-29-006
R-14 | Human approval is required for | "Enabling any network egress from the executor environment beyond a configured allow-list MUST require human approval." | Gap | PRAX-2026-05-29-004
R-15 | Human approval is required for | "Mounting host volumes into the container executor at any path other than the configured work directory MUST require human approval." | Gap | PRAX-2026-05-29-007
R-16 | Human approval is required for | "Any change to the executor's resource limits — CPU, memory, timeout — that raises the ceiling MUST require human approval." | Gap | PRAX-2026-05-29-010
R-17 | Authorized output destinations | "No outbound network traffic from the executor itself (the executed code may make network calls if the sandbox permits, but the executor does not initiate its own)" | Verified |
R-18 | Out of Scope | "The executor does not auto-upgrade, fetch dependencies at runtime from package registries without explicit configuration, or modify its own code" | Partial | PRAX-2026-05-29-011
Findings Register

Findings, ordered by severity. Each is linked to its remit rule, evidence, and a recommended action.

CRITICAL PRAX-2026-05-29-001 LocalCommandLineCodeExecutor copies the full parent os.environ into the subprocess and gates execution with only a warning, not host isolation.
Policy Rule — R-04 (Worker Remit):
"The execution environment MUST be isolated from the host caller — a failing or malicious execution MUST NOT be able to read or modify the parent process's environment, credentials, or state."
python/packages/autogen-ext/src/autogen_ext/code_executors/local/__init__.py:397 — env = os.environ.copy() — full parent environment passed as the subprocess env at create_subprocess_exec(..., env=env), lines 397-434
python/packages/autogen-ext/src/autogen_ext/code_executors/local/__init__.py:163 — __init__ issues warnings.warn(UserWarning) recommending Docker — advisory only, no approval gate or block before code runs
Recommended Action
  • In _execute_code_dont_check_setup, build the subprocess env from an explicit allow-list (PATH plus virtualenv additions) instead of os.environ.copy(), so host credentials are never inherited by executed code.
  • Add an explicit opt-in flag (e.g. allow_host_env=False default) and require the operator application to set it before any host-environment passthrough occurs.
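The allow-list approach in the first recommendation can be sketched as follows. This is a minimal illustration, not AutoGen code: the function name build_subprocess_env, the ALLOWED_ENV_KEYS tuple, and the particular variables in it are assumptions chosen for the example.

```python
import os
from typing import Mapping, Optional

# Hypothetical allow-list; a real deployment would tune this to what the
# executed code legitimately needs.
ALLOWED_ENV_KEYS = ("PATH", "LANG", "LC_ALL", "TMPDIR")


def build_subprocess_env(
    source: Optional[Mapping[str, str]] = None,
    allowed=ALLOWED_ENV_KEYS,
    extra: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a new environment holding only allow-listed variables."""
    src = os.environ if source is None else source
    # Copy only the named keys, so credentials in the parent are not inherited.
    env = {key: src[key] for key in allowed if key in src}
    if extra:
        # Explicit additions, e.g. virtualenv variables supplied by the operator.
        env.update(extra)
    return env
```

The resulting dict would be passed as the env argument of asyncio.create_subprocess_exec in place of os.environ.copy().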
CRITICAL PRAX-2026-05-29-002 No executor records a per-execution audit log, leaving every code execution and every isolation gap undetectable after the fact.
Policy Rule — R-06 (Worker Remit):
"Each execution MUST be recorded to an audit log capturing timestamp, executor kind, language, source agent, work directory, timeout, exit status, and a digest — not the body — of the executed code."
python/packages/autogen-ext/src/autogen_ext/code_executors/local/__init__.py:341 — _execute_code_dont_check_setup writes and runs the code file with no audit log call — only logging.error on temp-file unlink failure
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:327 — docker _execute_code_dont_check_setup similarly records no structured per-execution entry; only logging.debug lifecycle lines exist across the subsystem
Recommended Action
Add a structured audit-log emission at the top of each executor's execute_code_blocks recording timestamp, executor class, language, configured work_dir, timeout, and a SHA-256 digest of the code (not the body), and a second entry with exit_code on completion.
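One possible shape for such an audit entry is sketched below. The function name make_audit_record and the field names are illustrative assumptions; the fields follow the list quoted from rule R-06, and only a digest of the code is stored.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_audit_record(executor_kind, language, code, work_dir, timeout,
                      source_agent=None, exit_code=None):
    """Build one JSON-serializable audit entry; the code body is never stored."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "executor_kind": executor_kind,
        "language": language,
        "source_agent": source_agent,
        "work_dir": str(work_dir),
        "timeout": timeout,
        "exit_code": exit_code,  # None until the run completes
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
    }


record = make_audit_record("docker", "python", "print('hi')", "/workspace", 60)
line = json.dumps(record, sort_keys=True)  # one line per entry, e.g. for a JSONL sink
```

An executor could emit one entry before the run and a second with exit_code filled in on completion, as the recommendation describes.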
CRITICAL PRAX-2026-05-29-003 DockerCommandLineCodeExecutor creates containers with no user, read_only, cap_drop, or mem_limit, so model code runs as root with full capabilities.
Policy Rule — R-07 (Worker Remit):
"Code MUST NOT be executed with host-level privileges when a less-privileged option achieves the same task."
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:537 — containers.create(image, ..., tty=True, detach=True, auto_remove=..., volumes=..., working_dir="/workspace", ...) lines 537-550 — no user=, read_only=, cap_drop=, security_opt=, or mem_limit= argument
python/packages/autogen-ext/src/autogen_ext/code_executors/docker_jupyter/_jupyter_server.py:363 — client.containers.run(image, detach=True, auto_remove=..., publish_all_ports=True, volumes=...) — likewise no privilege-dropping or resource-limit kwargs
Recommended Action
  • In containers.create, set user="1000:1000" (or a non-root UID baked into the image), read_only=True with a writable /workspace volume, cap_drop=["ALL"], and security_opt=["no-new-privileges"] as the defaults.
  • Add mem_limit and pids_limit defaults and surface them as constructor parameters governed by the resource-ceiling approval rule.
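The hardening defaults above could be collected in one place, for example as a dictionary of keyword arguments. This is a sketch under assumptions: the helper name hardened_create_kwargs is hypothetical, and the 512m and 128 defaults are placeholder values, not figures from the scan. The keys themselves are keyword arguments accepted by the Docker SDK for Python.

```python
def hardened_create_kwargs(mem_limit="512m", pids_limit=128, user="1000:1000"):
    """Keyword arguments intended for the Docker SDK's containers.create()."""
    return {
        "user": user,                            # run as a non-root UID:GID
        "read_only": True,                       # read-only root filesystem
        "cap_drop": ["ALL"],                     # drop every Linux capability
        "security_opt": ["no-new-privileges"],   # block setuid-style escalation
        "mem_limit": mem_limit,                  # placeholder default (assumption)
        "pids_limit": pids_limit,                # placeholder default (assumption)
    }
```

With the Docker SDK these would be merged into the existing call, roughly as client.containers.create(image, **hardened_create_kwargs(), ...). The image must support the chosen UID and have a writable /workspace mount for execution to keep working.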
CRITICAL PRAX-2026-05-29-004 Docker containers are created with default networking and no egress control, letting model code exfiltrate data or pull payloads with no approval.
Policy Rule — R-12, R-14 (Worker Remit):
"The executor MUST NOT connect to services, databases, or networks not explicitly permitted by its configuration. / Enabling any network egress from the executor environment beyond a configured allow-list MUST require human approval."
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:537 — containers.create(...) lines 537-550 omits network_mode/network_disabled — container uses the default bridge network with outbound egress enabled
python/packages/autogen-ext/src/autogen_ext/code_executors/docker_jupyter/_jupyter_server.py:368 — client.containers.run(..., publish_all_ports=True, ...) — exposes all container ports to the host with no network restriction
Recommended Action
  • Default network_mode="none" on containers.create and require an explicit, operator-approved allow-list parameter to enable any egress, satisfying the network-approval rule.
  • Replace publish_all_ports=True with binding only the single kernel-gateway port to 127.0.0.1.
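A sketch of both network recommendations, expressed as Docker SDK keyword arguments. The helper names and the egress_approved flag are hypothetical, and the port numbers in the usage note are placeholders. A per-destination allow-list cannot be expressed through these arguments alone and would need a separate control such as a proxy or firewall rule.

```python
def network_kwargs(egress_approved=False):
    """No container networking unless an operator has approved egress."""
    if egress_approved:
        return {"network_mode": "bridge"}
    return {"network_mode": "none"}


def loopback_port_binding(container_port, host_port):
    """Publish one TCP port on 127.0.0.1 only, instead of publish_all_ports."""
    return {"ports": {f"{container_port}/tcp": ("127.0.0.1", host_port)}}
```

For the Jupyter server container, something like client.containers.run(image, **loopback_port_binding(8888, 18888), ...) would replace publish_all_ports=True, with 8888 standing in for whatever port the kernel gateway actually listens on.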
HIGH PRAX-2026-05-29-005 Work-directory confinement is enforced only for code carrying an explicit "# filename:" header; code without one and runtime volume overrides bypass it.
Policy Rule — R-01, R-09 (Worker Remit):
"All file read and write operations MUST be confined to a configured work directory, and any path resolving outside it MUST be rejected. / Work-directory confinement MUST NOT be bypassed under any condition, including symlinks, parent-directory traversal, absolute paths, or runtime-supplied volume mount overrides."
python/packages/autogen-ext/src/autogen_ext/code_executors/_common.py:99 — get_file_name_from_content only runs the relative_to() confinement check when first_line.startswith("# filename:"); returns None otherwise (no boundary enforced)
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:546 — volumes={... work_dir ...}, **self._extra_volumes — caller extra_volumes merged into the mount set with no path validation against the work directory
Recommended Action
  • Document explicitly that work-directory confinement governs only where the executor writes the code file, not where executed code may read/write, and rely on container read_only + non-root user (PRAX-2026-05-29-003) for the runtime boundary.
  • Validate extra_volumes at construction and reject any host bind outside the configured work directory unless an explicit approval flag is set.
HIGH PRAX-2026-05-29-006 create_default_code_executor silently downgrades Docker to the local host executor on a UserWarning, with no approval gate for local-in-production.
Policy Rule — R-13 (Worker Remit):
"Use of the local host executor in production MUST require human approval — it runs code directly on the host OS without containerization; default production deployments should use a containerized executor, and local execution is acceptable only when the host is an ephemeral, isolated, operator-approved sandbox."
python/packages/autogen-ext/src/autogen_ext/code_executors/__init__.py:69 — warnings.warn("Docker is not available ... Using LocalCommandLineCodeExecutor ...") then returns LocalCommandLineCodeExecutor — advisory warning, no gate
python/packages/autogen-ext/src/autogen_ext/code_executors/__init__.py:64 — except Exception: pass — Docker init failure is swallowed and execution falls through to the local-host fallback
Recommended Action
  • Make the fallback opt-in: raise instead of downgrading unless the caller passed an explicit allow_local_fallback=True, so local-in-production requires a deliberate operator decision.
  • Narrow the bare except Exception at line 64 to specific Docker exceptions and log the failure rather than silently discarding it.
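The opt-in fallback could look roughly like the sketch below. It is self-contained for illustration: select_executor_kind, DockerUnavailableError, and the docker_is_available callable are hypothetical names, and OSError stands in for the specific Docker exceptions a real probe would catch.

```python
import logging

logger = logging.getLogger(__name__)


class DockerUnavailableError(RuntimeError):
    """Raised when Docker cannot be used and no fallback was approved."""


def select_executor_kind(docker_is_available, allow_local_fallback=False):
    """Return "docker" or "local"; never downgrade silently."""
    try:
        available = bool(docker_is_available())
        reason = "Docker is not available"
    except OSError as exc:  # narrow catch; a real probe would name Docker's errors
        available = False
        reason = f"Docker probe failed: {exc}"
        logger.warning(reason)  # record the failure instead of discarding it
    if available:
        return "docker"
    if not allow_local_fallback:
        raise DockerUnavailableError(
            reason + "; pass allow_local_fallback=True to run on the host"
        )
    return "local"
```

The caller then has to make a deliberate choice before any code runs on the host, which is the decision rule R-13 asks a human to make.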
HIGH PRAX-2026-05-29-007 extra_volumes host mounts are accepted into the Docker container with no approval gate or path restriction.
Policy Rule — R-15 (Worker Remit):
"Mounting host volumes into the container executor at any path other than the configured work directory MUST require human approval."
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:546 — volumes={str(self.bind_dir.resolve()): {"bind": "/workspace", "mode": "rw"}, **self._extra_volumes} — extra_volumes merged unconditionally with no approval or path check
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:231 — self._extra_volumes = extra_volumes if extra_volumes is not None else {} — stored verbatim from the constructor with no validation
Recommended Action
Gate extra_volumes behind an explicit approval flag and, by default, reject any bind whose host path is outside the configured work directory.
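A minimal sketch of that check, assuming extra_volumes uses the Docker SDK mapping of host path to bind options. The function name validate_extra_volumes and the approved flag are hypothetical. Keys that are named volumes rather than host paths would be resolved relative to the current directory here and therefore rejected by default.

```python
from pathlib import Path


def validate_extra_volumes(extra_volumes, work_dir, approved=False):
    """Reject host binds outside work_dir unless explicitly approved."""
    root = Path(work_dir).resolve()
    for host_path in extra_volumes:
        # resolve() follows symlinks and collapses ".." before the comparison.
        resolved = Path(host_path).resolve()
        try:
            resolved.relative_to(root)
        except ValueError:
            if not approved:
                raise PermissionError(
                    f"host mount {resolved} is outside {root} and was not approved"
                )
    return dict(extra_volumes)
```

Calling this in the constructor, before the mapping is stored, would turn an unreviewed host mount into an immediate error rather than a silent widening of the container's view of the host.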
HIGH PRAX-2026-05-29-008 LocalCommandLineCodeExecutor's docstring claims a dangerous-command regex denylist that does not exist anywhere in the code.
python/packages/autogen-ext/src/autogen_ext/code_executors/local/__init__.py:57 — docstring "Command line code is sanitized using regular expression match against a list of dangerous commands" — no corresponding implementation in the class
python/packages/autogen-ext/src/autogen_ext/code_executors/_common.py:114 — the only regex in the shared module is silence_pip(), which appends -qqq to pip install lines — no dangerous-command matching
Recommended Action
Either remove the false sanitization claim from the docstring, or implement an actual denylist check in _execute_code_dont_check_setup before the subprocess runs and document its exact coverage and limits.
HIGH PRAX-2026-05-29-009 The non-containerized Jupyter executor's timeout shields the running cell, so an over-time execution is not interrupted in the kernel.
Policy Rule — R-02 (Worker Remit):
"Every code execution MUST be subject to a configured wall-clock timeout, and processes that exceed it MUST be terminated."
python/packages/autogen-ext/src/autogen_ext/code_executors/jupyter/_jupyter_code_executor.py:207 — output_cell = await asyncio.wait_for(asyncio.shield(execute_task), timeout=self._timeout) — shield prevents cancellation, so the cell is not terminated on timeout
python/packages/autogen-ext/src/autogen_ext/code_executors/jupyter/_jupyter_code_executor.py:243 — _execute_cell awaits client.async_execute_cell with no kernel-interrupt path invoked on the wait_for timeout
Recommended Action
On timeout, send a kernel interrupt (and restart if the interrupt is not honored) rather than shielding the task, so the cell is actually terminated as the remit requires.
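The control flow of that recommendation can be illustrated with plain asyncio. FakeKernel below is a hypothetical stand-in written for this example and is not the nbclient API; run_cell and its grace parameter are likewise assumptions. The point shown is that the deadline triggers an interrupt instead of merely abandoning a shielded task.

```python
import asyncio


class FakeKernel:
    """Hypothetical stand-in for a kernel client; not the nbclient API."""

    def __init__(self):
        self.interrupted = False
        self._stop = asyncio.Event()

    async def execute(self, seconds):
        try:
            await asyncio.wait_for(self._stop.wait(), timeout=seconds)
        except asyncio.TimeoutError:
            return "finished"  # the simulated cell ran to completion
        return "interrupted"

    async def interrupt(self):
        self.interrupted = True
        self._stop.set()


async def run_cell(kernel, seconds, timeout, grace=1.0):
    """Return (result, timed_out); interrupt the kernel once the deadline passes."""
    task = asyncio.ensure_future(kernel.execute(seconds))
    # asyncio.wait does not cancel the task on timeout, so no shield is needed.
    done, _ = await asyncio.wait({task}, timeout=timeout)
    if task in done:
        return task.result(), False
    await kernel.interrupt()
    try:
        await asyncio.wait_for(task, timeout=grace)
    except asyncio.TimeoutError:
        pass  # interrupt ignored: a real executor would restart the kernel here
    return None, True


async def demo():
    kernel = FakeKernel()
    outcome = await run_cell(kernel, seconds=5, timeout=0.05)
    return outcome, kernel.interrupted
```

Running asyncio.run(demo()) exercises the over-time path: the simulated five-second cell is interrupted shortly after the 0.05-second deadline.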
MEDIUM PRAX-2026-05-29-010 No memory or PID resource ceiling is configurable on the Docker executor, so the resource-ceiling approval rule has nothing to gate.
Policy Rule — R-16 (Worker Remit):
"Any change to the executor's resource limits — CPU, memory, timeout — that raises the ceiling MUST require human approval."
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:68 — DockerCommandLineCodeExecutorConfig defines image/timeout/work_dir/volumes but no mem_limit, pids_limit, or cpu field — no resource ceiling exists to approve
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:537 — containers.create(...) sets no mem_limit/pids_limit/cpu_quota — container memory and process count are unbounded
Recommended Action
Add mem_limit and pids_limit config fields with conservative defaults and pass them to containers.create, then treat raising them as the approval-gated action the remit describes.
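One way to make the ceiling something that can be gated is to model the limits as a value object and compare old against new. Everything named here (ResourceLimits, apply_limits, the approved flag) is hypothetical, and the memory and PID defaults are placeholders; only the 60-second timeout default comes from the scan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceLimits:
    mem_limit_mb: int = 512   # placeholder default (assumption)
    pids_limit: int = 128     # placeholder default (assumption)
    timeout_s: int = 60       # matches the executors' observed default


def apply_limits(current, requested, approved=False):
    """Allow lowering freely; raising any ceiling needs approved=True."""
    raised = [
        name
        for name in ("mem_limit_mb", "pids_limit", "timeout_s")
        if getattr(requested, name) > getattr(current, name)
    ]
    if raised and not approved:
        raise PermissionError(
            "raising " + ", ".join(raised) + " requires human approval"
        )
    return requested
```

Tightening a limit passes through unchanged, while any increase fails until an operator sets the approval flag, which mirrors the wording of rule R-16.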
MEDIUM PRAX-2026-05-29-011 The default Docker image is the moving tag python:3-slim and dependencies are floor-pinned, so the executor runtime is not reproducibly fixed.
Policy Rule — R-18 (Worker Remit):
"The executor does not auto-upgrade, fetch dependencies at runtime from package registries without explicit configuration, or modify its own code"
python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:71 — image: str = "python:3-slim" — default image is a moving tag with no @sha256 digest pin; start() pulls it on demand if absent
python/packages/autogen-ext/pyproject.toml:31 — docker = ["docker~=7.0", ...]; nbclient>=0.10.2; websockets>=15.0.1 — floor/compatible pins, no committed lockfile for the executor extras
Recommended Action
  • Pin the default image to a specific python:3-slim@sha256:... digest and document the update cadence.
  • Tighten the executor-extra version specifiers and commit a lockfile so the runtime is reproducible.
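A small check like the following could guard against an unpinned image reference at construction time or in CI. The helper name is_digest_pinned is an assumption; the pattern matches the standard name@sha256:<64 hex characters> form of an image reference.

```python
import re

# An image reference pinned by digest ends with "@sha256:" plus 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image):
    """Return True when the image reference carries a sha256 digest pin."""
    return _DIGEST_RE.search(image) is not None
```

The current default, python:3-slim, fails this check because it names only a tag.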
MEDIUM PRAX-2026-05-29-012 DockerJupyterServer chmods the host bind directory to world-writable 0o777, widening host filesystem exposure.
python/packages/autogen-ext/src/autogen_ext/code_executors/docker_jupyter/_jupyter_server.py:328 — os.chmod(bind_dir, 0o777) — host bind directory made world-readable/writable on every server start
Recommended Action
Replace the 0o777 chmod with the narrowest permission that lets the container user access the bind mount (e.g. group ownership matching the container UID, or 0o770).
What's Working Well

Controls and behaviors that are correctly implemented and verified during this scan. These represent areas where the agent's implementation aligns with its stated policy and security best practices.

Language whitelist with explicit unknown-language rejection

Every executor matches the requested language against a fixed SUPPORTED_LANGUAGES list and returns exit code 1 for anything outside it, constraining execution to Python and a known shell set (Azure is Python-only).

python/packages/autogen-ext/src/autogen_ext/code_executors/local/__init__.py:130

Work-directory path-traversal check on explicit filenames

When model code carries a # filename: header, get_file_name_from_content resolves the path and calls Path.relative_to(workspace), raising and aborting if the file would land outside the work directory.

python/packages/autogen-ext/src/autogen_ext/code_executors/_common.py:96
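The resolve-then-relative_to pattern described above can be sketched in isolation. This mirrors the described behavior only; confine_to_workspace is a hypothetical name and not the body of get_file_name_from_content.

```python
from pathlib import Path


def confine_to_workspace(workspace, filename):
    """Resolve filename under workspace; raise ValueError if it escapes."""
    root = Path(workspace).resolve()
    # An absolute filename replaces root in the join, and ".." is collapsed
    # by resolve(), so both cases end up outside root and fail below.
    target = (root / filename).resolve()
    target.relative_to(root)  # raises ValueError when target is outside root
    return target
```

As finding PRAX-2026-05-29-005 notes, a check of this kind governs only where the code file is written, not what the executed code may read or write afterwards.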

Azure dynamic-sessions executor uses a managed sandbox with scoped bearer tokens

The Azure executor delegates execution to an Azure Container Apps dynamic-sessions endpoint, authenticating per request with a scoped dynamicsessions.io/.default access token rather than running code on the host.

python/packages/autogen-ext/src/autogen_ext/code_executors/azure/_azure_container_code_executor.py:153

Every executor enforces a configurable wall-clock timeout

All five executors accept a timeout (default 60s), reject values below 1, and bound each execution with asyncio.wait_for or the container timeout command. In the non-containerized Jupyter executor the wait is bounded but the running cell is not terminated (see PRAX-2026-05-29-009).

python/packages/autogen-ext/src/autogen_ext/code_executors/docker/_docker_code_executor.py:360
Discovered Log Files

Log files found in the agent's workspace during this scan. Reviewing these files provides runtime evidence to complement the static analysis above.

Path | Source | Content Type | Purpose | Last Modified | Status
(runtime stderr / Python logging handlers — not configured by the subsystem) | docker/_docker_code_executor.py, local/__init__.py, docker_jupyter/_jupyter_server.py | unstructured plaintext (logging.debug/info/error and warnings.warn) | container lifecycle debug, cancellation diagnostics, temp-file cleanup errors, and security warnings — not per-execution audit records | unknown | Inferred
OWASP LLM Top 10 (2025) Coverage

Each card represents one category and shows its top 3 findings. All items appear in the Findings Register.

OWASP Agentic Top 10 (2026) Coverage

Each card represents one category and shows its top 3 findings. All items appear in the Findings Register.

ASI01 Agent Goal Hijack
No findings
ASI04 Agentic Supply Chain Vulnerabilities
No findings
ASI06 Memory and Context Poisoning
No findings
ASI07 Insecure Inter-Agent Communication
No findings
ASI08 Cascading Failures
No findings
ASI09 Human-Agent Trust Exploitation
No findings
RAISE Maturity Posture

Overall maturity assessment across the six categories of the RAISE framework. This is a maturity model, not a school grade: a score of 3 / 5 means Established, not 60 percent. Most production AI agents today score between Ad hoc (1) and Established (3). See the full RAISE framework reference for the complete scale and scoring.

1.45 / 5.0
Weighted Maturity Score · Ad hoc
Ad hoc. The subsystem has a coherent, narrow purpose and inherits language-whitelisting and timeout primitives, but its Zero Trust posture is near-absent: host-environment isolation, container hardening, and any approval gate are all missing or defaulted off, and the local path copies live credentials into the child process. Monitoring is entirely absent: no executor records what code it ran, which leaves every divergence the scan found both exploitable and undetectable. Domain limits are established and supply-chain controls are partial; adversarial testing of the isolation boundary is effectively nonexistent.
Limit Your Domain
3 / 5
Confidence: High  |  Weight: 15%  |  Weighted: 0.45
The subsystem is structurally narrow — it only executes code and returns output, enforces a fixed SUPPORTED_LANGUAGES whitelist (Python and a handful of shells; Azure is Python-only), and rejects unknown languages with exit code 1, so its capability surface matches its single declared mission.
Balance Your Knowledge Base
2 / 5
Confidence: Medium  |  Weight: 15%  |  Weighted: 0.30
The executor explicitly defers semantic safety of the code to the caller (Scope Boundaries) and performs no validation of the code source, but it is the execution tier rather than an LLM-context assembler, so the unvalidated-content surface is the executed code itself rather than retrieved data feeding a prompt.
Implement Zero Trust
1 / 5
Confidence: High  |  Weight: 25%  |  Weighted: 0.25
Host isolation is defaulted off across the board — the local executor copies the full parent os.environ into the subprocess and only warns rather than gating, the Docker executor sets no user/read_only/cap_drop/mem_limit/network controls, and there is no approval gate on any high-risk path including the local-in-production case the remit says requires one.
Manage Your Supply Chain
2 / 5
Confidence: Medium  |  Weight: 15%  |  Weighted: 0.30
Dependencies are floor-pinned with >= and ~= (e.g. docker~=7.0, nbclient>=0.10.2) rather than locked, and the default Docker image python:3-slim is a moving tag with no digest pin, so the executor runtime is not reproducibly fixed.
Build an AI Red Team
1 / 5
Confidence: Medium  |  Weight: 15%  |  Weighted: 0.15
A functional test suite exists for all five executors, but it contains no adversarial cases — no sandbox-escape, path-traversal, environment-leak, or resource-exhaustion tests — so there is no evidence the team's own adversarial testing shaped the isolation design.
Monitor Continuously
0 / 5
Confidence: High  |  Weight: 15%  |  Weighted: 0.00
No executor records a per-execution audit entry; the only logging in scope is logging.debug/info for container lifecycle and a few error lines, none of which capture timestamp, source agent, language, work directory, or a code digest as the remit requires.

Maturity Scoring Rubric

Every score above is based on this scale. A score is a snapshot of observable posture — not a verdict on the people or team behind the system.

Score | Label | Meaning
5 | Exemplary | Best-in-class; automated, continuously tested, reference quality. Rarely achieved in shipping systems.
4 | Strong | Comprehensive controls, active management, minor gaps. Production-ready.
3 | Established | Documented controls consistently applied; known gaps accepted. A respectable baseline.
2 | Partial | Some controls exist but coverage is incomplete; key gaps remain.
1 | Ad hoc | Informal or inconsistent measures; relies on individual judgment.
0 | Absent | No evidence this category is addressed at all.
Weighting: the weighted overall above is the sum of each category's score × weight (the per-category weights are shown on each card). Zero Trust carries the heaviest weight by design (25% versus 15% for each other category); see the RAISE framework reference for the rationale.
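The arithmetic behind the 1.45 figure can be reproduced from the scores and weights shown on the category cards. The dictionary and function names below are illustrative only.

```python
# (score out of 5, weight in percent) for each RAISE category, as shown on the cards
CATEGORIES = {
    "Limit Your Domain": (3, 15),
    "Balance Your Knowledge Base": (2, 15),
    "Implement Zero Trust": (1, 25),
    "Manage Your Supply Chain": (2, 15),
    "Build an AI Red Team": (1, 15),
    "Monitor Continuously": (0, 15),
}


def weighted_maturity(categories):
    """Sum of score x weight, with weights expressed in whole percent."""
    assert sum(weight for _, weight in categories.values()) == 100
    return sum(score * weight for score, weight in categories.values()) / 100


print(weighted_maturity(CATEGORIES))  # 1.45
```

Working in whole-percent weights keeps the sum in integers (45 + 30 + 25 + 30 + 15 + 0 = 145) until the final division.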