faq_lookup_tool) or a seat-booking agent (updates seat assignments on existing confirmed reservations via update_seat). The remit requires customer identity to be confirmed before any reservation mutation, requires every confirmation number to be verified against the authoritative reservation record, and requires every seat change to be written to a durable audit log. Flight numbers, confirmation numbers, and passenger identifiers MUST come from the authoritative reservation system and must never be generated by the agent. Refunds, fare changes, special-handling flags, and unverified-identity requests must reach a human checkpoint.

InputGuardrail/OutputGuardrail, per-tool needs_approval/is_enabled, and handoff input_filter, yet examples/customer_service/main.py instantiates every agent with the default empty guardrail lists and every tool with default needs_approval=False. The result is that the seat-mutation path runs end to end with no identity verification, no confirmation-number ownership check, no approval gate, and no audit log, directly contradicting four explicit remit MUST clauses. Compounding this, the on_seat_booking_handoff() hook fabricates the flight number with random.randint(100, 999), violating the remit's prohibition on agent-generated identifiers and feeding a fabricated value straight into the tool that writes reservation state.

openai-agents 0.17.4). The analyzed subject is examples/customer_service/main.py, which defines three Agent instances (triage, FAQ, seat-booking) wired together with handoff() calls and two @function_tool functions. The SDK itself ships robust safety primitives: InputGuardrail/OutputGuardrail (src/agents/guardrail.py), per-tool needs_approval and is_enabled (src/agents/tool.py), and handoff input_filter/is_enabled (src/agents/handoffs/__init__.py). The example wires none of them. The seat-booking handoff hook on_seat_booking_handoff() fabricates a flight number with random.randint(), and update_seat() mutates reservation context directly from model-supplied arguments with no identity check, no ownership check, and no audit write.

Every actionable rule in the Worker Remit, checked against the running code. Gap = declared but unenforced; Partial = enforced but incomplete or bypassable; Vague Policy = too imprecise to verify.
| Rule ID | Section | Rule (quoted) | Status | Finding |
|---|---|---|---|---|
| R-01 | Authorized Capabilities | "The triage agent receives each incoming customer request and routes it to the appropriate specialist agent, and MUST NOT access customer data directly." | Verified | — |
| R-02 | Authorized Capabilities | "The FAQ agent answers questions about airline policies — baggage, seating, wifi, and similar — using only the curated FAQ dataset, and MUST NOT speculate or draw on general knowledge outside that dataset." | Partial | PRAX-2026-05-29-007 |
| R-03 | Behavioral Constraints | "Customer identity MUST be confirmed before any action that mutates reservation state." | Gap | PRAX-2026-05-29-001 |
| R-04 | Behavioral Constraints | "A submitted confirmation number MUST be verified as belonging to the authenticated customer before that reservation is read or modified." | Gap | PRAX-2026-05-29-002 |
| R-05 | Behavioral Constraints | "Seat updates MUST be limited to seats that exist on the flight and are available — not already assigned to another passenger." | Gap | PRAX-2026-05-29-006 |
| R-06 | Behavioral Constraints | "Every seat change MUST be recorded to a durable audit log capturing the time, the customer identity, the confirmation number, the old and new seat, and the agent that performed the change." | Gap | PRAX-2026-05-29-004 |
| R-07 | Behavioral Constraints | "A request that falls outside the current agent's scope MUST be returned to the triage agent." | Partial | PRAX-2026-05-29-007 |
| R-08 | Behavioral Constraints | "Reservations or seat assignments for any customer other than the authenticated session user MUST NOT be modified." | Gap | PRAX-2026-05-29-002 |
| R-09 | Behavioral Constraints | "Instructions embedded in free-text customer input that attempt to override an agent's role, scope, or identity MUST NOT be accepted or acted on." | Gap | PRAX-2026-05-29-005 |
| R-10 | Behavioral Constraints | "Instructions embedded in tool outputs — FAQ content, seat records, flight data — that attempt to change an agent's behavior MUST NOT be accepted." | Gap | PRAX-2026-05-29-005 |
| R-11 | Behavioral Constraints | "Ticket purchases, payments, or fares MUST NOT be created, cancelled, modified, or refunded — those flows belong to a separate billing system." | Verified | — |
| R-12 | Behavioral Constraints | "Flight numbers, confirmation numbers, and passenger identifiers MUST come from the authoritative reservation system and MUST NOT be generated by the agent at runtime." | Gap | PRAX-2026-05-29-003 |
| R-13 | Behavioral Constraints | "Handoffs between agents MUST NOT escalate privileges or bypass identity verification." | Partial | PRAX-2026-05-29-001 |
| R-14 | Behavioral Constraints | "A seat change MUST NOT be issued without first verifying the confirmation number against the authoritative reservation record." | Gap | PRAX-2026-05-29-002 |
| R-15 | Human approval is required for | "Refunds, compensation, rebooking on a different flight, or any change affecting fare or ticket value MUST reach a human checkpoint before proceeding." | Enforcement Not Possible | — |
| R-16 | Human approval is required for | "Requests involving unaccompanied minors, medical assistance, or other special-handling flags on a reservation MUST reach a human checkpoint before proceeding." | Enforcement Not Possible | — |
| R-17 | Human approval is required for | "Any request where the customer's identity cannot be verified MUST be escalated to a human." | Gap | PRAX-2026-05-29-001 |
| R-18 | Authorized output destinations | "The agents MUST NOT send email, SMS, webhooks, or any other outbound external communication." | Verified | — |
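Several of the Gap rows in the table (R-03, R-04, R-05, R-08, R-12, R-14) describe preconditions that would have to hold before a seat is written. The following is a minimal, framework-agnostic sketch of those checks in one place. It is not the example's code and not an SDK API: `Reservation`, `RESERVATIONS`, `SEAT_MAP`, `SeatUpdateRejected`, `guarded_update_seat`, and all sample values are hypothetical stand-ins for an authoritative reservation system.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    confirmation_number: str
    customer_id: str
    flight_number: str
    seat: str


class SeatUpdateRejected(Exception):
    """Raised when a precondition for a seat change is not met."""


# Hypothetical in-memory stand-ins for the authoritative reservation
# system and the per-flight seat map. All values are illustrative.
RESERVATIONS = {
    "ABC123": Reservation("ABC123", "cust-1", "FLT-204", "12A"),
    "XYZ789": Reservation("XYZ789", "cust-2", "FLT-204", "14C"),
}
SEAT_MAP = {"FLT-204": {"12A", "12B", "14C", "15D"}}


def guarded_update_seat(verified_customer_id, confirmation_number, new_seat):
    """Apply a seat change only if every remit precondition holds."""
    # R-03 / R-17: no verified identity means no mutation.
    if not verified_customer_id:
        raise SeatUpdateRejected("identity not verified; escalate to a human")

    # R-14: the confirmation number must exist in the authoritative record.
    reservation = RESERVATIONS.get(confirmation_number)
    if reservation is None:
        raise SeatUpdateRejected("unknown confirmation number")

    # R-04 / R-08: the reservation must belong to the verified customer.
    if reservation.customer_id != verified_customer_id:
        raise SeatUpdateRejected("reservation belongs to another customer")

    # R-12: the flight number is read from the record, never generated.
    flight = reservation.flight_number

    # R-05: the seat must exist on the flight and be unassigned.
    if new_seat not in SEAT_MAP.get(flight, set()):
        raise SeatUpdateRejected("seat does not exist on this flight")
    taken = {
        r.seat
        for r in RESERVATIONS.values()
        if r.flight_number == flight
        and r.confirmation_number != confirmation_number
    }
    if new_seat in taken:
        raise SeatUpdateRejected("seat is already assigned")

    old_seat = reservation.seat
    reservation.seat = new_seat
    return old_seat, new_seat
```

Returning the old and new seat gives the caller what a durable audit record (R-06) would need; raising on every failed precondition keeps the write unreachable unless all checks pass.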
Findings, ordered by severity — each linked to its remit rule, evidence, and a recommended action. Tag chips jump to the relevant entry in the RAISE framework, the OWASP LLM Top 10, or the OWASP Agentic Top 10.
CRITICAL PRAX-2026-05-29-001 Seat-mutating tool runs with no customer-identity verification anywhere in the agent path.
"Customer identity MUST be confirmed before any action that mutates reservation state. / Handoffs between agents MUST NOT escalate privileges or bypass identity verification. / Any request where the customer's identity cannot be verified MUST be escalated to a human."
passenger_name before any handoff to the seat-booking agent, and gate update_seat behind needs_approval or an input guardrail that halts on unverified identity.

CRITICAL PRAX-2026-05-29-002 Confirmation number is never verified against the authoritative reservation record before a seat is modified.
"A submitted confirmation number MUST be verified as belonging to the authenticated customer before that reservation is read or modified. / Reservations or seat assignments for any customer other than the authenticated session user MUST NOT be modified. / A seat change MUST NOT be issued without first verifying the confirmation number against the authoritative reservation record."
update_seat(), verify the confirmation number against the authoritative reservation system and confirm it belongs to the authenticated session user before mutating any seat state; reject otherwise.

CRITICAL PRAX-2026-05-29-003 Handoff hook fabricates the flight number with random.randint() instead of reading the authoritative reservation.
"Flight numbers, confirmation numbers, and passenger identifiers MUST come from the authoritative reservation system and MUST NOT be generated by the agent at runtime."
random.randint() fabrication in on_seat_booking_handoff() with a lookup that retrieves the real flight number from the reservation system keyed on the verified confirmation number.

HIGH PRAX-2026-05-29-004 Seat changes are never written to a durable audit log, contradicting an explicit remit MUST clause.
"Every seat change MUST be recorded to a durable audit log capturing the time, the customer identity, the confirmation number, the old and new seat, and the agent that performed the change."
update_seat() capturing timestamp, verified identity, confirmation number, old and new seat, and the acting agent before returning.

HIGH PRAX-2026-05-29-005 No input or tool-output guardrail defends against prompt injection in customer free-text or tool results.
"Instructions embedded in free-text customer input that attempt to override an agent's role, scope, or identity MUST NOT be accepted or acted on. / Instructions embedded in tool outputs — FAQ content, seat records, flight data — that attempt to change an agent's behavior MUST NOT be accepted."
InputGuardrail to the triage agent (and an OutputGuardrail where appropriate) that detects and halts on role-override / scope-override instructions in customer text and tool outputs.

HIGH PRAX-2026-05-29-008 Compound: untrusted customer input reaches a fabricated-identifier seat mutation with no guardrail, identity check, or audit.
"Customer identity MUST be confirmed before any action that mutates reservation state. / A submitted confirmation number MUST be verified as belonging to the authenticated customer before that reservation is read or modified. / Flight numbers, confirmation numbers, and passenger identifiers MUST come from the authoritative reservation system and MUST NOT be generated by the agent at runtime."
update_seat with needs_approval, and write a durable audit record.

MEDIUM PRAX-2026-05-29-006 update_seat() applies no validity or availability check on the requested seat.
"Seat updates MUST be limited to seats that exist on the flight and are available — not already assigned to another passenger."
new_seat against the flight's authoritative seat map and confirm availability before writing the assignment in update_seat().

MEDIUM PRAX-2026-05-29-007 FAQ grounding and scope-return are enforced by prompt instructions only, with no code gate.
"The FAQ agent answers questions about airline policies — baggage, seating, wifi, and similar — using only the curated FAQ dataset, and MUST NOT speculate or draw on general knowledge outside that dataset. / A request that falls outside the current agent's scope MUST be returned to the triage agent."
OutputGuardrail on the FAQ agent that verifies answers originate from faq_lookup_tool output, and treat scope-return as a code-enforced routing decision rather than a prompt suggestion.

Controls and behaviors that are correctly implemented and verified during this scan. These represent areas where the agent's implementation aligns with its stated policy and security best practices.
Deterministic, dataset-bounded FAQ tool
`faq_lookup_tool` is a deterministic keyword lookup that returns scripted answers or "I'm sorry, I don't know". It cannot speculate, which makes it the correct grounding primitive for the FAQ role.
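To illustrate why a deterministic keyword lookup cannot speculate, here is a minimal sketch of the pattern. It is not the example's actual `faq_lookup_tool`: the function name `lookup_faq`, the keyword groups, and the bracketed answer placeholders are assumptions made for illustration; only the fallback phrase comes from the scan.

```python
# Fallback phrase as reported in the scan; returned for anything unmatched.
FALLBACK = "I'm sorry, I don't know"

# Hypothetical curated dataset: keyword groups mapped to scripted answers.
# The bracketed strings are placeholders, not real airline policy text.
FAQ_ENTRIES = [
    (("bag", "baggage"), "[curated baggage policy text]"),
    (("seat", "seating"), "[curated seating policy text]"),
    (("wifi",), "[curated wifi policy text]"),
]


def lookup_faq(question: str) -> str:
    """Return a scripted answer if a keyword matches, else the fallback."""
    text = question.lower()
    for keywords, answer in FAQ_ENTRIES:
        if any(keyword in text for keyword in keywords):
            return answer
    # No model call and no free-form generation: unmatched input gets
    # the fixed fallback string.
    return FALLBACK
```

Because every return value is a string literal from the table or the fallback, the tool's output space is closed. Whether the agent's final reply stays inside that space is a separate question that the tool alone does not answer.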
SDK ships unused but available safety primitives
The underlying OpenAI Agents SDK provides `InputGuardrail`/`OutputGuardrail`, per-tool `needs_approval`/`is_enabled`, and handoff `input_filter`. The controls needed to close every finding here exist in the framework and only need wiring.
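As a rough picture of what an approval gate in front of a tool does, the sketch below implements one in plain Python. It deliberately does not use the SDK and should not be read as the `needs_approval` API: `requires_approval`, `ApprovalRequired`, `change_seat`, and the approver callbacks are hypothetical names chosen for illustration.

```python
import functools


class ApprovalRequired(Exception):
    """Raised when a gated tool is called without approval."""


def requires_approval(approver):
    """Wrap a tool so it runs only if approver(tool_name, arguments) is True.

    `approver` is a hypothetical callback standing in for a human
    checkpoint or a policy decision.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(**arguments):
            if not approver(func.__name__, arguments):
                raise ApprovalRequired(
                    f"{func.__name__} blocked pending approval"
                )
            return func(**arguments)

        return wrapper

    return decorator


def deny_by_default(tool_name, arguments):
    # Fail closed: nothing runs until a reviewer explicitly approves.
    return False


@requires_approval(deny_by_default)
def change_seat(confirmation_number, new_seat):
    # Illustrative tool body; a real tool would write reservation state.
    return f"Seat for {confirmation_number} set to {new_seat}"
```

With `deny_by_default`, calling `change_seat(confirmation_number="ABC123", new_seat="12B")` raises `ApprovalRequired` instead of running the tool body. The point of the sketch is the fail-closed default, which is the opposite of a tool whose approval flag is left at `False`.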
Pinned, bounded dependency ranges with committed lockfile
The project constrains its core dependencies with upper bounds (`openai>=2.36.0,<3`, `pydantic>=2.12.2,<3`) and ships a committed `uv.lock`, giving a reproducible supply chain.
Log files found in the agent's workspace during this scan. Reviewing these files provides runtime evidence to complement the static analysis above.
| Path | Source | Content Type | Purpose | Last Modified | Status |
|---|---|---|---|---|---|
| (SDK tracing backend) | agents.trace() context manager wrapping each turn in examples/customer_service/main.py | developer telemetry spans | Captures conversation/turn trace spans for debugging; not a durable seat-change audit record | unknown | Inferred |
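Since trace spans are not a durable seat-change record, a separate audit write would be needed. The sketch below shows one simple way to do that, assuming an append-only JSON Lines file is acceptable as the durable store. The function name `append_seat_change`, the field names, and the file format are assumptions; the set of captured fields follows the remit's audit-log rule (time, customer identity, confirmation number, old and new seat, acting agent).

```python
import json
import os
from datetime import datetime, timezone


def append_seat_change(path, customer_id, confirmation_number,
                       old_seat, new_seat, agent_name):
    """Append one seat-change record to a JSON Lines audit file."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "confirmation_number": confirmation_number,
        "old_seat": old_seat,
        "new_seat": new_seat,
        "agent": agent_name,
    }
    # Open in append mode so earlier records are never rewritten.
    with open(path, "a", encoding="utf-8") as audit_file:
        audit_file.write(json.dumps(record) + "\n")
        # Flush and fsync so the record reaches disk before returning.
        audit_file.flush()
        os.fsync(audit_file.fileno())
    return record
```

Calling this before the tool returns its result means a seat change is never reported to the customer without a corresponding record on disk. A production system would more likely write to a database or log service with access controls, which this sketch does not attempt.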
Each card represents one category and shows its top 3 findings. All items are listed in the Findings section.
Overall maturity assessment across the six categories of the RAISE framework. This is a maturity model, not a school grade: a score of 3 / 5 means Established, not 60 percent. Most production AI agents today score between Ad hoc (1) and Established (3). See the full RAISE framework reference for the complete scale and scoring.
Maturity Scoring Rubric
Every score above is based on this scale. A score is a snapshot of observable posture — not a verdict on the people or team behind the system.
| Score | Label | Meaning |
|---|---|---|
| 5 | Exemplary | Best-in-class; automated, continuously tested, reference quality. Rarely achieved in shipping systems. |
| 4 | Strong | Comprehensive controls, active management, minor gaps. Production-ready. |
| 3 | Established | Documented controls consistently applied; known gaps accepted. A respectable baseline. |
| 2 | Partial | Some controls exist but coverage is incomplete; key gaps remain. |
| 1 | Ad hoc | Informal or inconsistent measures; relies on individual judgment. |
| 0 | Absent | No evidence this category is addressed at all. |