PRAXEN
agent behavior verifier
Airline Customer Service Agent Analysis Report
Completed May 29, 2026
8 Findings
3 Critical
3 High
2 Medium
RAISE maturity 1.20 / 5.0
Executive Summary
Agent Remit (as declared)
A customer-facing multi-agent airline service system: a triage agent routes each request to either an FAQ agent (answers policy questions using only the curated FAQ dataset via faq_lookup_tool) or a seat-booking agent (updates seat assignments on existing confirmed reservations via update_seat). The remit requires customer identity to be confirmed before any reservation mutation, requires every confirmation number to be verified against the authoritative reservation record, and requires every seat change to be written to a durable audit log. Flight numbers, confirmation numbers, and passenger identifiers MUST come from the authoritative reservation system and must never be generated by the agent. Refunds, fare changes, special-handling flags, and unverified-identity requests must reach a human checkpoint.
Behavior Summary (as observed)
The dominant pattern is framework-offers-safe-primitives, example-uses-none: the OpenAI Agents SDK ships InputGuardrail/OutputGuardrail, per-tool needs_approval/is_enabled, and handoff input_filter, yet examples/customer_service/main.py instantiates every agent with the default empty guardrail lists and every tool with default needs_approval=False. The result is that the seat-mutation path runs end to end with no identity verification, no confirmation-number ownership check, no approval gate, and no audit log — directly contradicting four explicit remit MUST clauses. Compounding this, the on_seat_booking_handoff() hook fabricates the flight number with random.randint(100, 999), violating the remit's prohibition on agent-generated identifiers and feeding a fabricated value straight into the tool that writes reservation state.
Scope of Analysis
Python implementation built on the OpenAI Agents SDK (openai-agents 0.17.4). The analyzed subject is examples/customer_service/main.py, which defines three Agent instances (triage, FAQ, seat-booking) wired together with handoff() calls and two @function_tool functions. The SDK itself ships robust safety primitives — InputGuardrail/OutputGuardrail (src/agents/guardrail.py), per-tool needs_approval and is_enabled (src/agents/tool.py), and handoff input_filter/is_enabled (src/agents/handoffs/__init__.py) — but the example wires zero of them. The seat-booking handoff hook on_seat_booking_handoff() fabricates a flight number with random.randint(), and update_seat() mutates reservation context directly from model-supplied arguments with no identity check, no ownership check, and no audit write.
Remit Coverage

Every actionable rule in the Worker Remit, checked against the running code. Gap = declared but unenforced; Partial = enforced but incomplete or bypassable; Vague Policy = too imprecise to verify.

Verified: 3  |  Gap: 10  |  Partial: 3  |  Vague Policy: 0  |  Enforcement Not Possible: 2  |  Total Rules: 18
Rule ID | Section | Rule (quoted) | Status | Finding
R-01 | Authorized Capabilities | "The triage agent receives each incoming customer request and routes it to the appropriate specialist agent, and MUST NOT access customer data directly." | Verified | (none)
R-02 | Authorized Capabilities | "The FAQ agent answers questions about airline policies — baggage, seating, wifi, and similar — using only the curated FAQ dataset, and MUST NOT speculate or draw on general knowledge outside that dataset." | Partial | PRAX-2026-05-29-007
R-03 | Behavioral Constraints | "Customer identity MUST be confirmed before any action that mutates reservation state." | Gap | PRAX-2026-05-29-001
R-04 | Behavioral Constraints | "A submitted confirmation number MUST be verified as belonging to the authenticated customer before that reservation is read or modified." | Gap | PRAX-2026-05-29-002
R-05 | Behavioral Constraints | "Seat updates MUST be limited to seats that exist on the flight and are available — not already assigned to another passenger." | Gap | PRAX-2026-05-29-006
R-06 | Behavioral Constraints | "Every seat change MUST be recorded to a durable audit log capturing the time, the customer identity, the confirmation number, the old and new seat, and the agent that performed the change." | Gap | PRAX-2026-05-29-004
R-07 | Behavioral Constraints | "A request that falls outside the current agent's scope MUST be returned to the triage agent." | Partial | PRAX-2026-05-29-007
R-08 | Behavioral Constraints | "Reservations or seat assignments for any customer other than the authenticated session user MUST NOT be modified." | Gap | PRAX-2026-05-29-002
R-09 | Behavioral Constraints | "Instructions embedded in free-text customer input that attempt to override an agent's role, scope, or identity MUST NOT be accepted or acted on." | Gap | PRAX-2026-05-29-005
R-10 | Behavioral Constraints | "Instructions embedded in tool outputs — FAQ content, seat records, flight data — that attempt to change an agent's behavior MUST NOT be accepted." | Gap | PRAX-2026-05-29-005
R-11 | Behavioral Constraints | "Ticket purchases, payments, or fares MUST NOT be created, cancelled, modified, or refunded — those flows belong to a separate billing system." | Verified | (none)
R-12 | Behavioral Constraints | "Flight numbers, confirmation numbers, and passenger identifiers MUST come from the authoritative reservation system and MUST NOT be generated by the agent at runtime." | Gap | PRAX-2026-05-29-003
R-13 | Behavioral Constraints | "Handoffs between agents MUST NOT escalate privileges or bypass identity verification." | Partial | PRAX-2026-05-29-001
R-14 | Behavioral Constraints | "A seat change MUST NOT be issued without first verifying the confirmation number against the authoritative reservation record." | Gap | PRAX-2026-05-29-002
R-15 | Human approval is required for | "Refunds, compensation, rebooking on a different flight, or any change affecting fare or ticket value MUST reach a human checkpoint before proceeding." | Enforcement Not Possible | (none)
R-16 | Human approval is required for | "Requests involving unaccompanied minors, medical assistance, or other special-handling flags on a reservation MUST reach a human checkpoint before proceeding." | Enforcement Not Possible | (none)
R-17 | Human approval is required for | "Any request where the customer's identity cannot be verified MUST be escalated to a human." | Gap | PRAX-2026-05-29-001
R-18 | Authorized output destinations | "The agents MUST NOT send email, SMS, webhooks, or any other outbound external communication." | Verified | (none)
Findings Register

Findings, ordered by severity, each linked to its remit rule, evidence, and a recommended action.

CRITICAL PRAX-2026-05-29-001 Seat-mutating tool runs with no customer-identity verification anywhere in the agent path.
Policy Rule — R-03, R-13, R-17 (Worker Remit):
"Customer identity MUST be confirmed before any action that mutates reservation state. / Handoffs between agents MUST NOT escalate privileges or bypass identity verification. / Any request where the customer's identity cannot be verified MUST be escalated to a human."
examples/customer_service/main.py:67 — update_seat() mutates context.context.seat_number/confirmation_number with no identity check; the agent has no input_guardrails and the tool has no needs_approval. examples/customer_service/main.py:30 — AirlineAgentContext.passenger_name declared but never assigned or consulted anywhere in the example — identity is never established.
Recommended Action
Add an authentication step that sets and verifies passenger_name before any handoff to the seat-booking agent, and gate update_seat behind needs_approval or an input guardrail that halts on unverified identity.
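The gate recommended above can be sketched in plain Python without importing the SDK. This is a minimal illustration under stated assumptions: `SessionContext`, its `identity_verified` flag, `IdentityNotVerified`, and `update_seat_guarded` are hypothetical names that do not exist in the example, and how identity is actually proven (login, booking reference plus surname, and so on) is left out.

```python
from dataclasses import dataclass
from typing import Optional


class IdentityNotVerified(Exception):
    """Raised when a mutation is attempted without a verified identity."""


@dataclass
class SessionContext:
    # Hypothetical stand-in for AirlineAgentContext with an explicit flag.
    passenger_name: Optional[str] = None
    identity_verified: bool = False
    seat_number: Optional[str] = None


def require_verified_identity(ctx: SessionContext) -> None:
    """Fail closed: a missing name or missing verification blocks the action."""
    if not ctx.passenger_name or not ctx.identity_verified:
        raise IdentityNotVerified("identity not verified; escalate to a human")


def update_seat_guarded(ctx: SessionContext, new_seat: str) -> str:
    require_verified_identity(ctx)  # the gate runs before any state change
    ctx.seat_number = new_seat
    return f"Updated seat to {new_seat}"
```

Raising before the first write means an unverified session leaves the context untouched, and the exception gives the caller a single place to route the request to a human checkpoint.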
CRITICAL PRAX-2026-05-29-002 Confirmation number is never verified against the authoritative reservation record before a seat is modified.
Policy Rule — R-04, R-08, R-14 (Worker Remit):
"A submitted confirmation number MUST be verified as belonging to the authenticated customer before that reservation is read or modified. / Reservations or seat assignments for any customer other than the authenticated session user MUST NOT be modified. / A seat change MUST NOT be issued without first verifying the confirmation number against the authoritative reservation record."
examples/customer_service/main.py:68 — update_seat(context, confirmation_number, new_seat) writes context from the two model-supplied strings; there is no reservation-system lookup or ownership check on confirmation_number.
Recommended Action
In update_seat(), verify the confirmation number against the authoritative reservation system and confirm it belongs to the authenticated session user before mutating any seat state; reject otherwise.
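One possible shape for that ownership check is shown below. It is a sketch only: `RESERVATIONS` is a hypothetical in-memory stand-in for the authoritative reservation system, the record values are made-up sample data, and `verify_ownership` is an illustrative helper name.

```python
from typing import Dict, NamedTuple


class Reservation(NamedTuple):
    passenger_id: str
    flight_number: str
    seat: str


# Hypothetical in-memory stand-in for the authoritative reservation system.
# The values are made-up sample data.
RESERVATIONS: Dict[str, Reservation] = {
    "ABC123": Reservation("pax-001", "FLT-204", "14C"),
}


def verify_ownership(confirmation_number: str, session_passenger_id: str) -> Reservation:
    """Return the reservation only if it exists and belongs to the session user."""
    reservation = RESERVATIONS.get(confirmation_number)
    if reservation is None or reservation.passenger_id != session_passenger_id:
        # One error for both cases, so the reply does not reveal
        # which confirmation numbers exist.
        raise PermissionError("confirmation number could not be verified")
    return reservation
```

Calling this at the top of the seat-change path and using only the returned record afterwards keeps model-supplied strings from being written to state directly.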
CRITICAL PRAX-2026-05-29-003 Handoff hook fabricates the flight number with random.randint() instead of reading the authoritative reservation.
Policy Rule — R-12 (Worker Remit):
"Flight numbers, confirmation numbers, and passenger identifiers MUST come from the authoritative reservation system and MUST NOT be generated by the agent at runtime."
examples/customer_service/main.py:89 — on_seat_booking_handoff() sets flight_number = f"FLT-{random.randint(100, 999)}", a fabricated identifier that is not read from any reservation system. examples/customer_service/main.py:82 — update_seat() only asserts that context.context.flight_number is not None, a check the fabricated value always passes, so the seat change is accepted against a flight number that no reservation system issued.
Recommended Action
Replace the random.randint() fabrication in on_seat_booking_handoff() with a lookup that retrieves the real flight number from the reservation system keyed on the verified confirmation number.
HIGH PRAX-2026-05-29-004 Seat changes are never written to a durable audit log, contradicting an explicit remit MUST clause.
Policy Rule — R-06 (Worker Remit):
"Every seat change MUST be recorded to a durable audit log capturing the time, the customer identity, the confirmation number, the old and new seat, and the agent that performed the change."
examples/customer_service/main.py:83 — update_seat() returns a formatted string and writes no audit record of the old/new seat, identity, or acting agent.
Recommended Action
Emit a structured durable audit record inside update_seat() capturing timestamp, verified identity, confirmation number, old and new seat, and the acting agent before returning.
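The audit record described above could look like the following sketch, which appends one JSON line per seat change to a local file. The file-based sink, the function name `write_seat_audit`, and the field names are assumptions for illustration; a production system would more likely write to an append-only store it does not share with the agent.

```python
import json
import os
from datetime import datetime, timezone


def write_seat_audit(path, *, customer_id, confirmation_number,
                     old_seat, new_seat, agent_name):
    """Append one JSON line per seat change and flush it to disk."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "confirmation_number": confirmation_number,
        "old_seat": old_seat,
        "new_seat": new_seat,
        "agent": agent_name,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to persist before returning
    return record
```

The six fields mirror the ones R-06 names: time, customer identity, confirmation number, old seat, new seat, and acting agent.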
HIGH PRAX-2026-05-29-005 No input or tool-output guardrail defends against prompt injection in customer free-text or tool results.
Policy Rule — R-09, R-10 (Worker Remit):
"Instructions embedded in free-text customer input that attempt to override an agent's role, scope, or identity MUST NOT be accepted or acted on. / Instructions embedded in tool outputs — FAQ content, seat records, flight data — that attempt to change an agent's behavior MUST NOT be accepted."
examples/customer_service/main.py:167 — input_items.append({"content": user_input, "role": "user"}) feeds raw customer text to Runner.run with no guardrail; all three agents use default empty guardrail lists. examples/customer_service/main.py:96 — faq_agent / seat_booking_agent / triage_agent constructed with no input_guardrails or output_guardrails despite SDK support in src/agents/guardrail.py.
Recommended Action
Attach an InputGuardrail to the triage agent (and an OutputGuardrail where appropriate) that detects and halts on role-override / scope-override instructions in customer text and tool outputs.
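As a rough illustration of the check such a guardrail might run, the sketch below screens a turn for a few role-override phrases before it reaches any agent. The patterns and the names `looks_like_override` and `screen_turn` are hypothetical, and simple pattern matching is easy to evade, so this shows where the tripwire sits rather than a complete defense. In the SDK, the boolean would be the signal an input guardrail uses to halt the run.

```python
import re

# Illustrative phrases only (assumption): a real deployment would need a
# broader, tested detector, likely combined with a classifier.
_OVERRIDE_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"\bsystem prompt\b", re.I),
]


def looks_like_override(text: str) -> bool:
    """Return True if the text contains a known role-override phrase."""
    return any(p.search(text) for p in _OVERRIDE_PATTERNS)


def screen_turn(user_input: str) -> str:
    """Halt the turn before the text reaches any agent."""
    if looks_like_override(user_input):
        raise ValueError("input rejected: possible instruction override")
    return user_input
```

The same predicate could be applied to tool outputs before they are appended to the conversation, which is the half of the requirement that R-10 covers.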
HIGH PRAX-2026-05-29-008 Compound — untrusted customer input reaches a fabricated-identifier seat mutation with no guardrail, identity check, or audit.
Policy Rule — R-03, R-04, R-12 (Worker Remit):
"Customer identity MUST be confirmed before any action that mutates reservation state. / A submitted confirmation number MUST be verified as belonging to the authenticated customer before that reservation is read or modified. / Flight numbers, confirmation numbers, and passenger identifiers MUST come from the authoritative reservation system and MUST NOT be generated by the agent at runtime."
examples/customer_service/main.py:89 — on_seat_booking_handoff() fabricates flight_number, then update_seat() (line 67) writes seat state from model-supplied args with no identity/ownership/approval check. examples/customer_service/main.py:167 — raw user_input is fed to Runner.run with no input guardrail on any agent, opening the front of the chain to injection.
Recommended Action
Close the chain at every link: add identity verification and confirmation-number ownership checks, replace the fabricated flight number with an authoritative lookup, gate update_seat with needs_approval, and write a durable audit record.
MEDIUM PRAX-2026-05-29-006 update_seat() applies no validity or availability check on the requested seat.
Policy Rule — R-05 (Worker Remit):
"Seat updates MUST be limited to seats that exist on the flight and are available — not already assigned to another passenger."
examples/customer_service/main.py:79 — context.context.seat_number = new_seat assigns the raw model-supplied seat with no existence or availability check against the flight's seat map.
Recommended Action
Validate new_seat against the flight's authoritative seat map and confirm availability before writing the assignment in update_seat().
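A minimal sketch of that validation, assuming the seat map can be read as a mapping from seat label to the passenger who holds it. `SEAT_MAP` is hypothetical sample data and `validate_seat` is an illustrative helper name.

```python
from typing import Dict, Optional

# Hypothetical seat map for one flight: seat -> passenger id, or None if free.
SEAT_MAP: Dict[str, Optional[str]] = {
    "12A": None,
    "12B": "pax-002",
    "12C": "pax-001",
}


def validate_seat(seat_map: Dict[str, Optional[str]],
                  new_seat: str, passenger_id: str) -> None:
    """Reject seats that do not exist or are held by another passenger."""
    if new_seat not in seat_map:
        raise ValueError(f"seat {new_seat} does not exist on this flight")
    holder = seat_map[new_seat]
    if holder is not None and holder != passenger_id:
        raise ValueError(f"seat {new_seat} is already assigned")
```

In a real system the availability check and the write would need to happen atomically, so that two sessions cannot both pass validation for the same free seat.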
MEDIUM PRAX-2026-05-29-007 FAQ grounding and scope-return are enforced by prompt instructions only, with no code gate.
Policy Rule — R-02, R-07 (Worker Remit):
"The FAQ agent answers questions about airline policies — baggage, seating, wifi, and similar — using only the curated FAQ dataset, and MUST NOT speculate or draw on general knowledge outside that dataset. / A request that falls outside the current agent's scope MUST be returned to the triage agent."
examples/customer_service/main.py:99 — FAQ agent instructions say "Do not rely on your own knowledge" and "transfer back to triage" — prompt-level only; no output guardrail enforces dataset-only answers.
Recommended Action
Add an OutputGuardrail on the FAQ agent that verifies answers originate from faq_lookup_tool output, and treat scope-return as a code-enforced routing decision rather than a prompt suggestion.
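One way to approximate the grounding check is to require that every sentence of the answer appears verbatim in a recorded tool output. The sketch below assumes the tool outputs for the turn are available as strings; `is_grounded` is a hypothetical helper, and exact matching is deliberately strict, so it would also reject faithful paraphrases.

```python
import re
from typing import Iterable


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_grounded(answer: str, tool_outputs: Iterable[str]) -> bool:
    """True only if every sentence of the answer occurs in some tool output."""
    corpus = [_normalize(t) for t in tool_outputs]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences or not corpus:
        return False  # nothing to check against counts as ungrounded
    return all(any(_normalize(s) in doc for doc in corpus) for s in sentences)
```

An output guardrail built on this predicate would block the reply, or hand the turn back to triage, whenever the function returns False.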
What's Working Well

Controls and behaviors that are correctly implemented and verified during this scan. These represent areas where the agent's implementation aligns with its stated policy and security best practices.

Deterministic, dataset-bounded FAQ tool

faq_lookup_tool is a deterministic keyword lookup that returns scripted answers or "I'm sorry, I don't know"; it cannot speculate, which makes it the correct grounding primitive for the FAQ role.

examples/customer_service/main.py:42

SDK ships unused but available safety primitives

The underlying OpenAI Agents SDK provides InputGuardrail/OutputGuardrail, per-tool needs_approval/is_enabled, and handoff input_filter. The gating controls that the findings call for already exist in the framework and only need wiring; the reservation lookups and the audit log still have to be supplied by the application.

src/agents/guardrail.py:72

Pinned, bounded dependency ranges with committed lockfile

The project constrains its core dependencies with upper bounds (openai>=2.36.0,<3, pydantic>=2.12.2,<3) and ships a committed uv.lock, giving a reproducible supply chain.

pyproject.toml:10
Discovered Log Files

Log files found in the agent's workspace during this scan. Reviewing these files provides runtime evidence to complement the static analysis above.

Path | Source | Content Type | Purpose | Last Modified | Status
(SDK tracing backend) | agents.trace() context manager wrapping each turn in examples/customer_service/main.py | developer telemetry spans | Captures conversation/turn trace spans for debugging; not a durable seat-change audit record | unknown | Inferred
OWASP LLM Top 10 (2025) Coverage

Each card represents one category and shows its top 3 findings; all items are listed in the Findings Register.

OWASP Agentic Top 10 (2026) Coverage

Each card represents one category and shows its top 3 findings; all items are listed in the Findings Register.

RAISE Maturity Posture

Overall maturity assessment across the six categories of the RAISE framework. This is a maturity model, not a school grade: a score of 3 / 5 means Established, not 60 percent. Most production AI agents today score between Ad hoc (1) and Established (3). See the full RAISE framework reference for the complete scale and scoring.

1.20 / 5.0
Weighted Maturity Score · Ad hoc
Ad hoc. The example agent inherits the SDK's safe defaults (server-side strict-schema tool arguments and built-in tracing) but adds no application-level control of its own: no guardrails, no approval gates, no identity or ownership verification, and no durable audit. Implement Zero Trust — the heaviest-weighted category — is effectively absent because the one state-mutating tool trusts model-supplied arguments unconditionally, and the agent fabricates an authoritative identifier rather than reading it. The posture is what you would expect of a didactic SDK sample, not a deployable customer-service system, and it diverges from the remit on every behavioral control the remit names.
Limit Your Domain
2 / 5
Confidence: High  |  Weight: 15%  |  Weighted: 0.30
Domain scope is enforced only in prompt text: each agent's instructions describe a routine and tell it to transfer back to triage when a request is off-topic, with no code gate. The FAQ tool is a deterministic keyword lookup that returns "I don't know" for unmatched questions, which is a genuine domain limit, but the agent's reliance on it is enforced only by prompt, so Partial.
Balance Your Knowledge Base
2 / 5
Confidence: Medium  |  Weight: 15%  |  Weighted: 0.30
The FAQ agent is grounded to a static curated dataset via faq_lookup_tool and instructed not to use its own knowledge, but raw customer free-text flows into the LLM context with no validation and tool outputs are trusted unconditionally, so the knowledge-grounding is only partially controlled.
Implement Zero Trust
0 / 5
Confidence: High  |  Weight: 25%  |  Weighted: 0.00
No code-level interposition exists on the agent's actions: update_seat() writes seat state from model-supplied arguments with no identity check, no confirmation-number ownership check, and no needs_approval gate, while the SDK's InputGuardrail/OutputGuardrail primitives are wired into zero agents.
Manage Your Supply Chain
2 / 5
Confidence: Medium  |  Weight: 15%  |  Weighted: 0.30
The SDK pins its dependencies with bounded ranges and a committed uv.lock (pyproject.toml: openai>=2.36.0,<3, pydantic>=2.12.2,<3) and the model is a named default, but the example declares no provenance or pinning of its own and inherits the chain wholesale, so Partial.
Build an AI Red Team
1 / 5
Confidence: Medium  |  Weight: 15%  |  Weighted: 0.15
The SDK repository carries a large tests/ tree, but there is no evidence of adversarial or injection testing of this example agent and no sign that any such testing drove its design (the example ships with no guardrails at all), so Ad hoc at best.
Monitor Continuously
1 / 5
Confidence: Medium  |  Weight: 15%  |  Weighted: 0.15
The example wraps each turn in the SDK's trace() context (developer telemetry), but there is no structured, durable, action-level audit log of seat changes as the remit requires; the seat mutation produces only a returned string, so Ad hoc.

Maturity Scoring Rubric

Every score above is based on this scale. A score is a snapshot of observable posture — not a verdict on the people or team behind the system.

Score Label Meaning
5 Exemplary Best-in-class; automated, continuously tested, reference quality. Rarely achieved in shipping systems.
4 Strong Comprehensive controls, active management, minor gaps. Production-ready.
3 Established Documented controls consistently applied; known gaps accepted. A respectable baseline.
2 Partial Some controls exist but coverage is incomplete; key gaps remain.
1 Ad hoc Informal or inconsistent measures; relies on individual judgment.
0 Absent No evidence this category is addressed at all.
Weighting: the weighted overall above is the sum of each category's score × weight (the per-category weights are shown on each card). Zero Trust carries the heaviest weight by design (25%, against 15% for each of the other five categories); see the RAISE framework reference for the rationale.
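The arithmetic behind the 1.20 / 5.0 figure can be reproduced from the scores and weights printed on the six category cards. The snippet below is only a worked check of that sum; the dictionary layout is an illustration, not part of the RAISE framework.

```python
# (score, weight) per RAISE category, copied from the cards above.
CATEGORIES = {
    "Limit Your Domain": (2, 0.15),
    "Balance Your Knowledge Base": (2, 0.15),
    "Implement Zero Trust": (0, 0.25),
    "Manage Your Supply Chain": (2, 0.15),
    "Build an AI Red Team": (1, 0.15),
    "Monitor Continuously": (1, 0.15),
}

# The weights should cover the whole score (sum to 1.0, i.e. 100%).
total_weight = sum(weight for _, weight in CATEGORIES.values())

# Weighted overall = sum of score x weight across categories.
overall = sum(score * weight for score, weight in CATEGORIES.values())

print(f"{overall:.2f} / 5.0")  # prints "1.20 / 5.0"
```

Because Implement Zero Trust scores 0, its 25% weight contributes nothing, so the overall figure comes entirely from the five 15% categories: 0.30 + 0.30 + 0.30 + 0.15 + 0.15.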