
1. Agentic AI Engineering

Purpose

Operational handbook for building, testing and managing agentic AI systems (Collaboration Modes 4-5).

When to use this?

You are building an AI system that autonomously executes actions (Mode 4-5) and need guidance on orchestration, tool design and failure management.

1. Purpose

This module describes the engineering practices for building, testing and managing agentic AI systems (Collaboration Mode 4-5). Where AI Architecture defines the strategic pattern, this document provides the operational guide: orchestration, protocols, tool design, failure modes, observability and cost management.

Prerequisite

First read AI Collaboration Modes and the acceptance criteria for Mode 4-5. Every technical choice in this document is determined by the mode and risk profile.

DORA: context engineering for AI-accessible internal data [so-28]

The DORA AI Capabilities Model (2025) identifies AI-accessible internal data as one of the seven capabilities that amplify AI adoption. DORA defines this as context engineering: connecting AI tools to internal codebases, documentation and wikis — not just prompt engineering. For agentic systems this means: invest in MCP servers, structured knowledge bases and domain-specific context files so that agents understand the organisational context. See External Evidence: DORA.


2. Orchestration Patterns

Select an orchestration pattern based on task complexity and risk. Always start with the simplest pattern that works.

Single Agent

[User/Trigger] → [Agent + Tools] → [Result]

One LLM with direct access to a set of tools. Suitable for well-scoped tasks with limited action radius.

When to use: Tasks with a clear goal, limited tool set, low to moderate complexity.
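The single-agent pattern can be sketched as a plain loop: the model picks the next action, the runtime executes it against an allowlisted tool set, and a hard step limit bounds the run. This is a minimal illustration; `plan_next_action` and the `lookup_order` tool are stand-ins for a real LLM call and a real tool.

```python
# Minimal single-agent loop: one model, a small allowlisted tool set,
# and a hard iteration limit. plan_next_action is a stub standing in
# for the LLM decision step.

from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
}

def plan_next_action(goal: str, history: list[str]) -> tuple[str, str]:
    # Placeholder for the LLM: returns (tool_name, argument), or
    # ("finish", answer) once the goal is met.
    if not history:
        return ("lookup_order", "42")
    return ("finish", history[-1])

def run_agent(goal: str, max_steps: int = 5) -> str:
    history: list[str] = []
    for _ in range(max_steps):          # hard iteration limit (see Failure Modes)
        action, arg = plan_next_action(goal, history)
        if action == "finish":
            return arg
        if action not in TOOLS:         # deny-by-default: unknown tools are rejected
            raise PermissionError(f"tool not allowed: {action}")
        history.append(TOOLS[action](arg))
    raise RuntimeError("iteration limit reached")
```

The same loop skeleton scales down to most single-agent tasks; only the planner and the tool set change.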

Multi-Agent (Supervisor)

[Trigger] → [Supervisor Agent] → [Specialist Agent A] → [Result A] ─┐
                               → [Specialist Agent B] → [Result B] ─┴→ [Merge] → [Final Result]

A supervisor agent distributes work across specialised sub-agents. Each sub-agent has a scoped mandate and its own tool set.

When to use: Complex tasks requiring multiple areas of expertise, or tasks that can be parallelised.
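The fan-out/merge step can be sketched with plain `asyncio`: the supervisor dispatches to specialists in parallel and merges their outputs. The sub-agents here are stubs standing in for LLM-backed agents with their own tool sets.

```python
# Supervisor fan-out sketch: specialist sub-agents run in parallel,
# then a merge step combines their results. Agents are illustrative stubs.

import asyncio

async def research_agent(task: str) -> str:
    # Scoped mandate: gather facts only.
    return f"findings for {task!r}"

async def writer_agent(task: str) -> str:
    # Scoped mandate: produce a draft only.
    return f"draft for {task!r}"

async def supervisor(task: str) -> str:
    # Fan out to the specialists, then merge their outputs.
    results = await asyncio.gather(research_agent(task), writer_agent(task))
    return " | ".join(results)

result = asyncio.run(supervisor("quarterly report"))
```

In a real system the merge step is itself often an LLM call that reconciles conflicting sub-results rather than a string join.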

Handoff Pattern

[Agent A] → [Handoff Point] → [Agent B] → [Handoff Point] → [Agent C]

Responsibility transfers between agents as the context evolves. Each agent processes a specific phase.

When to use: Sequential workflows with clear phase boundaries (e.g. analysis → plan → execution → review).
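A handoff chain reduces to a pipeline over a shared context object: each agent enriches the context and passes it on at the phase boundary. The phase functions below are illustrative stubs.

```python
# Handoff sketch: each agent handles one phase and hands an enriched
# context to the next. Phase agents are stubs for LLM-backed agents.

def analysis_agent(ctx: dict) -> dict:
    return {**ctx, "analysis": "schema change needed"}

def planning_agent(ctx: dict) -> dict:
    return {**ctx, "plan": ["migrate schema", "deploy"]}

def execution_agent(ctx: dict) -> dict:
    return {**ctx, "executed": ctx["plan"]}

PHASES = [analysis_agent, planning_agent, execution_agent]

context: dict = {"ticket": "T-7"}
for phase in PHASES:    # each loop boundary is a handoff point
    context = phase(context)
```

The handoff point is a natural place for validation or human review before the next phase starts.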

Selection Matrix

| Pattern | Complexity | Risk | Cost | Recommended for |
|---|---|---|---|---|
| Single Agent | Low | Low-Moderate | Lowest | Well-scoped tasks, Mode 4 |
| Supervisor | High | Moderate-High | Higher | Parallel expertise, Mode 4-5 |
| Handoff | Moderate | Moderate | Moderate | Sequential workflows, Mode 4 |

3. Protocols and Standards

Model Context Protocol (MCP)

MCP is an open standard (Anthropic, 2024) that defines how agents connect to external tools, data sources and APIs. MCP provides:

  • Standardised tool descriptions: Tools are described in a uniform schema so that any MCP-compatible agent can invoke them.
  • Transport layers: Stdio (local) and Streamable HTTP (network).
  • Security model: Server identity, capability registration and permission management.

Recommendation: Design new internal APIs with MCP compatibility. This prevents vendor lock-in and makes tools reusable across agent frameworks.
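An MCP tool declaration pairs a name and description with a JSON Schema for the input, so any MCP-compatible agent can discover and invoke it. The tool itself (`get_invoice`) is a made-up example; the field shape follows the MCP tool definition (`name`, `description`, `inputSchema`).

```python
# A tool description in the shape MCP uses. The "get_invoice" tool is
# illustrative; inputSchema is standard JSON Schema.

get_invoice_tool = {
    "name": "get_invoice",
    "description": "Fetch a single invoice by its identifier.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "invoice_id": {
                "type": "string",
                "description": "Invoice identifier",
            },
        },
        "required": ["invoice_id"],
    },
}
```

Because the schema is machine-readable, the same declaration serves both the agent (tool selection) and the runtime (input validation).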

Agent-to-Agent (A2A) Protocol

A2A (Google, 2025; Linux Foundation) is an open standard for communication between agents from different frameworks or vendors. Agents publish their capabilities and negotiate interaction modalities.

When relevant: In multi-agent systems that combine agents from different teams or vendors.


4. Tool Design for Agents

Design Principles

  1. Allowlist-first: Only explicitly permitted tools are available. Deny-by-default.
  2. Progressive disclosure: Give the agent a short tool index; load extended descriptions only when needed. This limits token consumption.
  3. Atomic actions: Each tool does exactly one thing. Do not combine "read and write" in a single tool.
  4. Idempotent where possible: Repeated invocation of the same tool with the same input should produce the same result as a single invocation, with no additional side effects.
  5. Sandbox execution: Tools run in an isolated environment without direct access to production data (see Technical Controls).
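Principles 1-4 can be sketched as a deny-by-default registry of atomic tools: registration alone is not enough, a tool must also be on the allowlist to be invocable. Tool names here are illustrative.

```python
# Deny-by-default tool registry: only allowlisted tools can be invoked,
# even if they are registered. Tools are atomic (one action each) and
# the read tool is idempotent.

ALLOWLIST = {"read_ticket"}

def read_ticket(ticket_id: str) -> dict:
    # Atomic and idempotent: reads one ticket, never writes.
    return {"id": ticket_id, "status": "open"}

def close_ticket(ticket_id: str) -> dict:
    # Write action: registered but NOT allowlisted, so unreachable.
    return {"id": ticket_id, "status": "closed"}

REGISTRY = {"read_ticket": read_ticket, "close_ticket": close_ticket}

def invoke(tool: str, arg: str) -> dict:
    if tool not in ALLOWLIST:
        raise PermissionError(f"tool {tool!r} is not on the allowlist")
    return REGISTRY[tool](arg)

# A blocked invocation surfaces as an explicit, loggable error:
blocked = False
try:
    invoke("close_ticket", "T-1")
except PermissionError:
    blocked = True
```

Keeping the allowlist separate from the registry makes the permitted surface auditable at a glance.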

Code Execution Pattern

Instead of direct tool invocations, an agent can write code that calls tools. This offers:

  • On-demand tool loading (lower baseline token costs)
  • Complex logic in a single step (filtering, transformation)
  • Better traceability (code is inspectable)

Risk: Requires strict sandboxing. Use only with Mode 5 governance.
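The pattern can be illustrated with generated code running against a restricted namespace. Note loudly: `exec` with a stripped namespace is an illustration of the idea only, not a real sandbox, and must not be relied on for isolation (see the risk note above).

```python
# Code-execution pattern sketch: the agent emits a small script that
# calls tools exposed in a restricted namespace. ILLUSTRATION ONLY --
# exec() with stripped builtins is not a security boundary.

def lookup_price(sku: str) -> float:
    # Illustrative tool: returns a known price or 0.0.
    return {"A1": 9.99}.get(sku, 0.0)

AGENT_CODE = """
prices = [lookup_price(s) for s in ("A1", "B2")]
result = sum(p for p in prices if p > 0)
"""

# Only the tools the agent may use are visible to the generated code.
namespace = {"__builtins__": {}, "lookup_price": lookup_price, "sum": sum}
exec(AGENT_CODE, namespace)
total = namespace["result"]
```

The benefit is visible here: two tool calls plus filtering and aggregation collapse into one inspectable step instead of three model round-trips.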


5. Agent Memory

Agents that perform long-running tasks or work across multiple sessions require memory. We distinguish four types:

| Type | Description | Storage Medium | Example |
|---|---|---|---|
| Token memory | Context window contents (system prompt, conversation history, tool results) | In-context | Running conversation |
| Episodic | Specific events: what happened, when, with what result | Database/file | "Previous deployment failed due to schema mismatch" |
| Semantic | General knowledge, facts, relationships | Knowledge base/RAG | Company policy, product documentation |
| Procedural | Learned skills and operational knowledge | Configuration/prompts | Optimal sequence of deployment steps |

Recommendation: Start with token memory + RAG (semantic). Only add episodic memory when the agent performs recurring tasks and needs to learn from previous results.
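When episodic memory does become necessary, it can start as simply as an append-only event log the agent queries on later runs. A real system would persist this in a database; the in-memory version below is a sketch.

```python
# Episodic memory sketch: append-only log of task outcomes that the
# agent can recall on later runs. In production this lives in a database.

from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    outcome: str    # "success" or "failure"
    detail: str

@dataclass
class EpisodicMemory:
    episodes: list = field(default_factory=list)

    def record(self, task: str, outcome: str, detail: str) -> None:
        self.episodes.append(Episode(task, outcome, detail))

    def recall(self, task: str) -> list:
        # Earlier episodes for the same task, e.g. so the agent can
        # avoid repeating a deployment that failed on a schema mismatch.
        return [e for e in self.episodes if e.task == task]

memory = EpisodicMemory()
memory.record("deploy", "failure", "schema mismatch")
```

The recall step is typically injected into the prompt at the start of a recurring task, alongside semantic (RAG) context.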


6. Failure Modes and Mitigation

Agentic systems fail qualitatively differently from traditional software. The patterns below require specific mitigation.

| Failure Mode | Description | Impact | Mitigation |
|---|---|---|---|
| Infinite loop | Agent continuously generates subtasks or repeats the same action | Cost explosion, system load | Hard iteration limit per task; circuit breaker on token budget |
| Hallucination escalation | Hallucinated output becomes input for the next step, errors compound | Unreliable results that appear correct | Multi-step validation; intermediate fact-checks; cross-validation between models |
| Scope creep | Agent interprets mandate more broadly than intended | Unauthorised actions | Explicit scope boundaries in system prompt + tool allowlist |
| Tool misuse | Agent invokes tools in unintended combinations or sequences | Data corruption, unwanted side effects | Log and validate tool invocations against permitted sequences |
| Cascade failure | Error in sub-agent propagates through the entire system | System-wide disruption | Isolation per agent; error boundaries; graceful degradation |
| Silent degradation | Quality gradually declines without visible error messages | Unnoticed poor output | Periodic Golden Set validation; acceptance rate monitoring |

Rule of thumb

Every failure mode must have a corresponding alert in the monitoring dashboard. No mitigation without a measurable signal.
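The infinite-loop mitigation (hard iteration limit plus token-budget circuit breaker) can be sketched as a guard called once per agent step. The limits below are illustrative, not recommendations; in practice they follow from the task's risk profile and cost model.

```python
# Circuit breaker sketch for the infinite-loop failure mode: trips on
# either a step limit or a token budget, whichever is hit first.
# Limits are illustrative.

class BudgetExceeded(RuntimeError):
    pass

class CircuitBreaker:
    def __init__(self, max_steps: int = 25, max_tokens: int = 100_000):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def check(self, tokens_used: int) -> None:
        # Call once per agent step, before executing the next action.
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")

# A runaway loop is stopped deterministically:
breaker = CircuitBreaker(max_steps=3, max_tokens=1_000)
tripped = False
try:
    for _ in range(10):
        breaker.check(tokens_used=200)
except BudgetExceeded:
    tripped = True
```

Tripping the breaker is itself a measurable signal: count trips per day and alert on the trend, per the rule of thumb above.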


7. Observability

Why Agent Observability Is Different

Traditional monitoring measures what happens (latency, errors, throughput). Agent observability must also measure why something happens: what decisions did the agent make, which tools did it invoke, and what was the reasoning?

Minimum Telemetry

| Data Point | Description | Purpose |
|---|---|---|
| Decision trail | Per step: input, reasoning, chosen action, confidence score | Audit, debugging |
| Tool invocations | Which tool, with which parameters, result, duration | Cost analysis, fault detection |
| Escalation events | When and why the agent escalated to a human | Scope validation |
| Token consumption | Per step and per session | Cost management |
| Session outcome | Success/fail, elapsed time, number of steps | Quality monitoring |

OpenTelemetry

OpenTelemetry has established standardised semantic conventions for AI agent observability. Use these conventions to implement vendor-independent tracing. This makes it possible to analyse agent behaviour regardless of the underlying framework.


8. Cost Management

Agentic systems have a fundamentally different cost model from traditional AI applications. Usage costs account for only approximately 20% of total cost of ownership.

TCO Structure

| Cost Category | Share | Control Measure |
|---|---|---|
| Inference (API tokens) | ~20% | Prompt caching, model tiering |
| Data preparation and integration | ~25% | Standardised pipelines |
| Governance and compliance | ~20% | Proportional governance per risk level |
| Monitoring and tuning | ~15% | Automated alerts, SLO monitoring |
| Training and onboarding | ~20% | Reusable patterns and documentation |

Optimisation Techniques

  • Prompt caching: If an agent always uses the same system prompt, the provider can cache those tokens, reducing input costs for the cached portion by roughly 90% and latency by roughly 75%.
  • Model tiering: Route simple tasks to a cheaper model; complex tasks to a more capable model.
  • Dynamic iteration limits: Set the maximum number of steps based on task complexity, not as a fixed number.
  • Hard budget cap: Technical limit per task/session/day (see Technical Controls).
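Model tiering can be as simple as a routing function with a complexity heuristic. The model names and the thresholds below are placeholders; real routers often use a classifier or the task's declared risk level instead of string length.

```python
# Model-tiering sketch: route cheap, simple tasks to a small model and
# complex tasks to a capable one. Names and thresholds are illustrative.

CHEAP_MODEL = "small-model"
CAPABLE_MODEL = "large-model"

def pick_model(task: str, tool_count: int) -> str:
    # Crude heuristic: long tasks or many tools imply complexity.
    is_complex = len(task) > 200 or tool_count > 3
    return CAPABLE_MODEL if is_complex else CHEAP_MODEL
```

Because routing happens per step, a single agent session can mix tiers: planning on the capable model, routine tool calls on the cheap one.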

9. Agent Testing

Test Strategy

Agent testing goes beyond functional tests. We test across four dimensions:

| Dimension | What to Test | Method |
|---|---|---|
| Quality | Task completion, correct tool selection, reasoning quality | Golden Set scenarios |
| Performance | Latency, throughput, resource usage | Load tests |
| Safety | Prompt injection, scope violation, tool misuse | Adversarial tests, red teaming |
| Cost | Token consumption per task, cost per successful result | Cost benchmarks |

Adversarial Scenarios (mandatory for Mode 4-5)

  • Scope test: Give the agent an assignment outside its mandate. Expected: refusal or escalation.
  • Loop test: Create a situation that could lead to infinite repetition. Expected: stop after iteration limit.
  • Conflicting instructions: Provide contradictory context. Expected: escalation, not guessing.
  • Tool misuse: Offer tools the agent should not use. Expected: no invocation.
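The scope test above can be written as an ordinary automated test: feed the agent an out-of-mandate assignment and assert on refusal or escalation. `agent_respond` is a stub for the system under test; a real test would call the deployed agent.

```python
# Adversarial scope test sketch: an out-of-mandate task must yield a
# refusal or escalation, never execution. agent_respond is a stub for
# the agent under test.

MANDATE = {"summarise", "classify"}

def agent_respond(task_type: str) -> str:
    # Stand-in for the real agent: escalates anything outside its mandate.
    if task_type not in MANDATE:
        return "escalate"
    return "executed"

def test_scope_violation_escalates():
    # Expected: refusal or escalation, per the scope-test scenario.
    assert agent_respond("delete_records") in {"refuse", "escalate"}

def test_in_scope_task_executes():
    assert agent_respond("summarise") == "executed"

test_scope_violation_escalates()
test_in_scope_task_executes()
```

Run these in CI like any other regression test, and re-run them after every prompt or tool-set change: scope behaviour is as fragile as code.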

10. Agentic AI Engineering Checklist


  • Orchestration pattern is selected and documented
  • Tool allowlist is defined and enforced
  • Sandbox environment is set up for tool execution
  • Iteration limits and budget caps are configured
  • Failure modes are identified with corresponding alerts
  • Decision trail (audit trail) is active per agent step
  • Escalation path to human is defined and tested
  • Adversarial tests are completed and documented
  • Cost model is established (TCO, not just inference)
  • OpenTelemetry or equivalent tracing is implemented