AI Agents in Production
Despite research enthusiasm for sophisticated multi-agent orchestration, real-world production deployments of AI agents are characterised by simplicity, controllability, and conservative design — with evaluation and reliability as the dominant challenges.
What It Is
The UC Berkeley “Measuring Agents in Production” paper (December 2025) reports the first large-scale systematic study of AI agents operating in real organisational environments, drawing on a survey of 306 practitioners and 20 in-depth case-study interviews across 26 industries. It documents the gap between the agent designs explored in research and what organisations actually deploy.
The Google DeepMind “Intelligent AI Delegation” paper (February 2026) complements this with a theoretical framework for how agents should decompose and delegate tasks — addressing the orchestration challenge that production teams are grappling with.
Why It Matters (for Organisations)
AI agents — systems that can take actions autonomously over extended sequences to complete goals — are the primary delivery mechanism through which AI capability translates into organisational productivity. The Berkeley study establishes the practical baseline: what problems are organisations actually solving with agents today, what approaches work, and where the friction is. This is essential knowledge for organisations deciding where and how to deploy agents.
The delegation framework from DeepMind addresses a key unsolved problem in multi-agent systems: how to safely and effectively decompose complex tasks across agents (and between agents and humans), transfer appropriate authority at each step, and handle failures gracefully. As agents become more capable and are deployed in more complex workflows, intelligent delegation becomes the critical engineering challenge.
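The decomposition, authority transfer, and failure handling described above can be illustrated with a minimal sketch. This is not the DeepMind framework itself; every name here (`Subtask`, `Delegatee`, `delegate`, the authority labels, the stub agents) is a hypothetical chosen for illustration. It assumes a task has already been decomposed into subtasks, that each delegatee carries an explicit set of granted authorities, and that any subtask no authorised agent can complete is escalated to a human.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subtask:
    name: str
    required_authority: str  # hypothetical labels such as "read" or "write"


@dataclass
class Delegatee:
    name: str
    granted_authority: set[str]
    run: Callable[[Subtask], str]


def delegate(
    subtask: Subtask,
    candidates: list[Delegatee],
    escalate: Callable[[Subtask], str],
) -> tuple[str, str]:
    """Try each authorised delegatee in order; escalate if none succeeds."""
    for candidate in candidates:
        # Never hand a subtask to an agent that lacks the required authority.
        if subtask.required_authority not in candidate.granted_authority:
            continue
        try:
            return candidate.name, candidate.run(subtask)
        except Exception:
            # Simplification: any failure moves on to the next candidate.
            continue
    # No authorised agent succeeded, so responsibility returns to a human.
    return "human", escalate(subtask)


# Stub agents standing in for real systems (assumptions, not real APIs).
def flaky_agent(task: Subtask) -> str:
    raise RuntimeError("tool call failed")


def steady_agent(task: Subtask) -> str:
    return f"done: {task.name}"


def human_review(task: Subtask) -> str:
    return f"escalated: {task.name}"


agents = [
    Delegatee("drafter", {"read"}, flaky_agent),
    Delegatee("executor", {"read", "write"}, steady_agent),
]
plan = [
    Subtask("summarise ticket", "read"),
    Subtask("issue refund", "write"),
    Subtask("delete account", "admin"),
]
results = [delegate(subtask, agents, human_review) for subtask in plan]
```

In this sketch the first subtask falls through from the failing `drafter` to `executor`, and the third is escalated because no agent holds the `admin` authority. A real system would also need to record who was accountable for each outcome, which this example omits.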
Evidence & Examples
- Top reasons organisations build production agents: increasing productivity, reducing human task-hours, automating routine labour, increasing client satisfaction, reducing human training requirements (2512.04123v1.pdf)
- Production agents are “typically built using simple, controllable approaches”, not the sophisticated multi-agent architectures dominant in research (2512.04123v1.pdf)
- Top development challenges cited by practitioners: evaluation and measurement of agent performance, handling edge cases, maintaining reliability across diverse inputs, and managing user trust and expectations (2512.04123v1.pdf)
- DeepMind argues existing delegation methods rely on “simple heuristics” that cannot adapt dynamically to environmental changes or handle unexpected failures, a significant limitation for high-stakes deployments (2602.11865v1.pdf)
- The DeepMind delegation framework covers: task allocation, transfer of authority and accountability, clear role and boundary specification, clarity of intent, and trust establishment between delegator and delegatee. It applies to both human-to-AI and AI-to-AI delegation (2602.11865v1.pdf)
- DeepMind’s framework is explicitly positioned for “the emerging agentic web”, an infrastructure-level framing that anticipates AI agents as participants in broader networked systems, not just internal tools (2602.11865v1.pdf)
Tensions & Open Questions
- Evaluation as the unsolved problem: The most commonly cited challenge in the Berkeley study is measuring whether AI agents are actually working well. Unlike traditional software (where outputs are deterministic), agent outputs vary, failure modes are subtle, and success metrics require domain expertise to define. This evaluation gap is the primary barrier to confident production deployment at scale.
- Simple vs. capable trade-off: The preference for “simple, controllable” approaches in production creates a tension with the more capable agentic architectures available. Organisations are trading capability for reliability — which may be appropriate for current risk profiles but may leave significant value on the table.
- Human oversight design: As agents take on more complex and consequential tasks, the question of how humans stay appropriately involved (not over-supervising, not under-supervising) becomes critical. The delegation framework addresses this architecturally, but the organisational and cultural norms for appropriate human-agent oversight are not yet established.
- Trust calibration: Users and operators tend to either over-trust agents (accepting outputs without sufficient scrutiny) or under-trust them (re-checking every output, negating efficiency gains). Building correctly calibrated trust is as much a change management challenge as a technical one.
- 🔴 TODO: Are there examples of production deployments where organisations have moved from simple to sophisticated agentic architectures successfully? What triggered the move and what safeguards were required?
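One way to make the evaluation problem above concrete: because agent outputs vary from run to run, a single pass/fail test says little, so a common approach is to run the same input several times and score each output with a domain-specific checker. The sketch below is an illustration under stated assumptions, not a method taken from the Berkeley study. `pass_rate`, `stub_agent`, `checker`, and the 0.9 threshold are all hypothetical, and the stub cycles through fixed outputs to stand in for a non-deterministic agent.

```python
from itertools import cycle
from typing import Callable


def pass_rate(
    agent: Callable[[str], str],
    checker: Callable[[str, str], bool],
    prompt: str,
    trials: int,
) -> float:
    """Run the agent `trials` times on one prompt and return the fraction that pass."""
    passes = sum(1 for _ in range(trials) if checker(prompt, agent(prompt)))
    return passes / trials


# Stub standing in for a real agent: its output varies between calls.
_outputs = cycle(["4", "4", "four", "5"])


def stub_agent(prompt: str) -> str:
    return next(_outputs)


def checker(prompt: str, output: str) -> bool:
    # Domain knowledge lives here: both "4" and "four" count as correct.
    return output.strip() in {"4", "four"}


rate = pass_rate(stub_agent, checker, "What is 2 + 2?", trials=8)
meets_bar = rate >= 0.9  # hypothetical deployment threshold
```

With this stub, six of the eight runs pass, so `rate` is 0.75 and the agent falls short of the assumed 0.9 bar. Writing `checker` is the hard part in practice, since it encodes the domain expertise the bullet on evaluation refers to.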
Related Concepts
Workflow Redesign Around AI · AI Delegation and Multi-Agent Systems · Agentic AI Fundamentals · Skill Partnerships Human-AI