Built Your Agentic Workflow? Now What?

June 2, 2026

Getting an agentic workflow to run is the easy part. Getting it to run reliably, safely, and accountably in an enterprise is where most teams discover they’ve been building on sand.

I’ve had a version of the same conversation many times over the past year. A team has built something impressive: agents that browse, reason, write code, call external services, spin up specialist sub-agents, and complete multi-step tasks without human intervention. The demo works. The GitHub repo is real. The ROI narrative is persuasive. And then someone in the room asks the question that changes the tone: “What happens when this fails in production?”

The momentum is undeniable. Gartner projects agentic AI will appear in 33% of enterprise software applications by 2028, up from less than 1% in 2024. But the same Gartner research predicts that over 40% of agentic AI projects will be cancelled by the end of 2027, due to escalating costs, unclear business value, or inadequate risk controls. That gap is not a technology problem. It’s an architecture problem.

Production agentic systems introduce failure modes, trust problems, and governance gaps that have no direct analogue in conventional software. These systems don’t just execute. They reason. They act. And unlike a microservice, they don’t fail with a clean stack trace. They fail creatively. This article is about the checks and balances that close that gap, organised across three pillars that any team shipping agents into production should be able to answer for.

PILLAR I

Operational integrity

Observability, failure recovery, and state consistency across long-running agent workflows.

PILLAR II

Security and trust

Access control, multi-agent trust boundaries, compliance surface area, and data governance.

PILLAR III

Governance at scale

Cost controls, behavioural testing, SLA design, versioning, and change management for agent systems.

Pillar one

Operational integrity

Here’s what makes agentic systems different from everything that came before. Traditional software fails in predictable ways: a null pointer exception, a network timeout, an out-of-memory error. You get a stack trace. You know what broke and roughly why. Agentic systems fail in reasoning space. An agent may complete every tool call successfully, produce plausible-looking output, and still have taken entirely the wrong action. Your operational infrastructure must account for that possibility at every layer.

Observability and auditability: Ask yourself a simple question: can you reconstruct what your agent did and why? Not just the final output. The full decision chain — which tools were called, in what order, with what inputs, and what intermediate reasoning produced each step. In regulated industries, this isn’t optional. It’s a prerequisite for any human review process, and increasingly for regulatory defensibility too.

TRIGGER / USER REQUEST
natural language intent or system event

ORCHESTRATOR LAYER
task decomposition · agent routing · step sequencing [trace_id issued]

AGENT EXECUTION LAYER
tool calls · LLM inference · subagent spawning [span per tool call logged]

TOOL / INTEGRATION LAYER
APIs · databases · external services · message brokers

EMITTED → Distributed trace · Decision log · Cost meter · Audit record

Table 1: Agentic observability stack

The key architectural requirement here is a persistent trace identifier that propagates through every layer: orchestrator, agent, tool call, and external system. Any incident should be reconstructable from a single ID. Without it, post-hoc debugging of a multi-agent workflow is effectively forensic archaeology. And nobody wants to do that at 3 a.m.
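To make the trace requirement concrete, here is a minimal sketch of one trace ID flowing from orchestrator to agent to tool call. It assumes a single Python process and uses the standard library's `contextvars`; the names `start_trace`, `record_span`, `call_tool`, and `AUDIT_LOG` are hypothetical, and a real deployment would more likely emit spans to a tracing backend and carry the ID across process boundaries in message headers.

```python
import contextvars
import uuid

# Holds the trace ID for the current workflow run (hypothetical name).
_trace_id = contextvars.ContextVar("trace_id", default=None)

# Stand-in for a durable, queryable log store.
AUDIT_LOG = []


def start_trace():
    """Orchestrator layer: issue one trace ID per workflow run."""
    trace_id = uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def record_span(layer, name, **fields):
    """Append one span tagged with the current trace ID."""
    span = {
        "trace_id": _trace_id.get(),
        "span_id": uuid.uuid4().hex[:8],
        "layer": layer,
        "name": name,
        **fields,
    }
    AUDIT_LOG.append(span)
    return span


def call_tool(name, **inputs):
    """Tool layer: log the call and its inputs before executing."""
    record_span("tool", name, inputs=inputs)
    return {"ok": True}  # placeholder result for the sketch


def run_agent(task):
    """Agent layer: log the planning step, then call a tool."""
    record_span("agent", "plan", task=task)
    return call_tool("search_reports", query=task)


def spans_for(trace_id):
    """Reconstruct an incident from a single ID."""
    return [s for s in AUDIT_LOG if s["trace_id"] == trace_id]


trace = start_trace()
run_agent("summarise Q3 churn")
for span in spans_for(trace):
    print(span["layer"], span["name"])
```

The point of the sketch is that no function takes the trace ID as an argument: it is ambient context, so a tool author cannot forget to pass it along.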

Failure modes and recovery: I often describe agentic failure modes as ‘creative failures’ — and not in a good way. A tool call that returns a plausible but incorrect result won’t throw an exception. The agent will use it. A reasoning loop that partially executes a multi-step task and then crashes leaves the world in a state that’s neither committed nor rolled back. It’s a category of failure that conventional error handling simply wasn’t designed for. Recovery patterns must be built in deliberately, not bolted on later.

FAILURE TYPE	WHAT HAPPENS	MITIGATION PATTERN
Reasoning drift	Goal substitution: agent optimises for proxy metric, not original intent	Anchored system prompts + human-in-the-loop (HITL) approval gates
Partial execution	System in inconsistent state; neither committed nor rolled back	Saga / compensating transaction pattern; durable workflow checkpointing
Tool hallucination	Downstream logic built on fabricated data; silent corruption	Tool result schema validation; confidence thresholds + fallback
Cascading tool calls	Quota exhaustion, rate limits hit, runaway cost accumulation	Max-step limits; circuit breakers; per-workflow budget caps

Table 2: Agentic failure taxonomy and mitigations
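The partial-execution row is the one most easily sketched in code. Below is a minimal, in-memory illustration of the saga pattern: each step is paired with a compensating action, and a failure triggers the compensations for the completed steps in reverse order. The step names and the `run_saga` helper are hypothetical, and the sketch assumes compensations themselves succeed; a production workflow engine would also persist the saga log so that recovery survives a crash.

```python
class SagaFailed(Exception):
    """Raised after the completed steps have been compensated."""


def run_saga(steps):
    """Run (name, action, compensate) triples in order.

    If an action raises, undo the steps that already completed,
    most recent first, then raise SagaFailed.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            for _, undo in reversed(completed):
                undo()
            rolled_back = [n for n, _ in reversed(completed)]
            raise SagaFailed(f"{name} failed; compensated {rolled_back}") from exc
        completed.append((name, compensate))
    return [n for n, _ in completed]


# Hypothetical three-step workflow whose last step fails.
state = {"reserved": False, "charged": False}


def reserve():
    state["reserved"] = True


def release():
    state["reserved"] = False


def charge():
    state["charged"] = True


def refund():
    state["charged"] = False


def notify():
    raise RuntimeError("notification service unavailable")


steps = [
    ("reserve_stock", reserve, release),
    ("charge_card", charge, refund),
    ("notify_customer", notify, lambda: None),
]

try:
    run_saga(steps)
except SagaFailed as err:
    print(err)
print(state)
```

After the failed run, both flags in `state` are back to `False`: the world is rolled back rather than left half-committed.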

State management and idempotency: You need durable state stores rather than in-memory context, explicit checkpoint markers, and idempotent tool implementations so that a resumed workflow doesn’t duplicate side effects. Think about what happens when an agent resends an invoice because it had no durable record of having already succeeded. That’s not hypothetical. That’s exactly what happens when teams treat idempotency as a nice-to-have rather than a correctness requirement.
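The invoice scenario can be sketched with an idempotency key recorded in a durable store. This is an illustration under assumptions, not a complete solution: `run_once` and `send_invoice` are hypothetical names, and SQLite in memory stands in for whatever durable store the workflow actually uses.

```python
import sqlite3

# In-memory for the sketch; a real workflow would use a file or server database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS completed ("
    "idempotency_key TEXT PRIMARY KEY, result TEXT)"
)

sent = []  # stands in for the external side effect


def send_invoice(invoice_id):
    """Hypothetical tool with a side effect that must not be repeated."""
    sent.append(invoice_id)
    return f"sent:{invoice_id}"


def run_once(key, fn, *args):
    """Execute fn at most once per key; replay the stored result otherwise."""
    row = conn.execute(
        "SELECT result FROM completed WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # already done: no second side effect
    result = fn(*args)
    conn.execute("INSERT INTO completed VALUES (?, ?)", (key, result))
    conn.commit()
    return result


# The first call sends; a resumed workflow calling again does not.
first = run_once("invoice:INV-42:send", send_invoice, "INV-42")
second = run_once("invoice:INV-42:send", send_invoice, "INV-42")
print(first, second, sent)
```

One caveat worth stating: a crash between the side effect and the insert still leaves a window for duplication. Where the downstream API accepts an idempotency key, passing the same key through lets the receiver deduplicate as well.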

Pillar two

Security and trust

Here’s a framing that I find useful: agents are principals. They act in the world. They hold credentials, make decisions, and invoke systems on behalf of users and organisations. That means the security model that governs them can’t be retrofitted from user-centric IAM. It must be designed from the ground up for the specific properties of autonomous action.

Access control and least-privilege for agents: The most common mistake I see in early agentic deployments is giving agents the same permissions as the humans they’re acting for. An agent that can read everything a user can access and write everywhere a user can modify has an attack surface as large as that user’s entire footprint, multiplied across every user it serves. The blast radius of a prompt injection or reasoning failure is then limited only by what those users can reach. And this is not a theoretical risk. OWASP’s 2025 Top 10 for LLM Applications ranks prompt injection as the #1 critical vulnerability, present in over 73% of production AI deployments assessed in security audits. In late 2025, researchers demonstrated how a malicious instruction hidden in an email caused an AI agent to compose and send a resignation letter on behalf of the user, instead of the out-of-office reply it was asked to write. OpenAI, responding to the finding, acknowledged that prompt injection “may never be fully solved.” That’s not a reassuring statement when your agent has write access to your production systems.

IDENTITY PROVIDER Issues short-lived agent credentials · no persistent tokens in agent context
AGENT · agent_id: acct-analyst · scope: read: reports
✓ PERMITTED ACTIONS (scoped · expiring · audited)	✗ DENIED ACTIONS (hard-deny · no escalation path)
read: analytics_reports	write: production_db
write: draft_summaries	delete: any_record
call: summarization_api	read: pii_tables
read: public_schemas	call: payment_api
HUMAN APPROVAL GATE triggered when scope boundary approached

Table 3: Agent identity and least-privilege authorisation model

The principle is least-privilege at the agent scope level: task-scoped credentials, short-lived tokens, and hard deny lists with no escalation path. If an agent needs broader permissions to complete a task, it should escalate to a human — not acquire permissions silently while nobody’s watching.
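As a rough sketch of that rule, the check below returns one of three outcomes: allow, deny, or escalate to a human. The action strings follow the permitted and denied lists in Table 3; the `AgentCredential` class and `authorize` function are hypothetical, and a real system would typically delegate this decision to a policy engine rather than application code.

```python
import time
from dataclasses import dataclass

# Hard deny list: checked first, and nothing can override it.
HARD_DENY = frozenset({
    "write:production_db",
    "delete:any_record",
    "read:pii_tables",
    "call:payment_api",
})


@dataclass(frozen=True)
class AgentCredential:
    """Task-scoped, short-lived credential (hypothetical shape)."""
    agent_id: str
    scopes: frozenset
    expires_at: float  # Unix timestamp


def authorize(cred, action, now=None):
    """Return 'allow', 'deny', or 'escalate' for a requested action."""
    now = time.time() if now is None else now
    if action in HARD_DENY:
        return "deny"      # no escalation path
    if now >= cred.expires_at:
        return "deny"      # expired credentials grant nothing
    if action in cred.scopes:
        return "allow"
    return "escalate"      # outside scope: route to the human approval gate


cred = AgentCredential(
    agent_id="acct-analyst",
    scopes=frozenset({
        "read:analytics_reports",
        "write:draft_summaries",
        "call:summarization_api",
        "read:public_schemas",
    }),
    expires_at=time.time() + 900,  # 15 minutes, an illustrative lifetime
)

for action in ("read:analytics_reports", "read:pii_tables", "write:shared_drive"):
    print(action, "->", authorize(cred, action))
```

The ordering matters: the deny list is evaluated before the scope check, so a misconfigured scope can never grant a hard-denied action.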

Multi-agent trust boundaries: When agents call other agents, as they increasingly do in mesh architectures, the trust model gets interesting. Agent A invoking Agent B is not the same as a service calling another service. The calling agent may be carrying injected instructions, acting on corrupted context, or operating outside its intended scope. Agent B cannot simply inherit the trust of the caller. In a well-designed system, it won’t.

ORCHESTRATOR verified identity · signed task token
AGENT A Research / Retrieval	AGENT B Analysis / Reasoning	AGENT C Action / Write
TRUST RULES
· Agent-to-agent calls must carry a signed delegation token from the orchestrator, not the calling agent’s own credential
· Receiving agent validates scope independently; inherited trust is a vulnerability, not a feature

Every agent in a mesh should validate its own authorisation independently, treating messages from other agents with the same scepticism it would apply to an external API call.

Table 4: Multi-agent trust boundary model
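A signed delegation token can be illustrated with the standard library alone. The sketch below assumes a shared secret between the orchestrator and the receiving agents, which is a simplification: anyone holding the key can also mint tokens, so real deployments would more likely use asymmetric signatures. The token format, `issue_token`, and `verify_token` are all hypothetical.

```python
import hashlib
import hmac
import json
import time

# Hypothetical shared secret; in practice, load it from a secrets manager.
ORCHESTRATOR_KEY = b"demo-key-do-not-use"


def issue_token(agent_id, scopes, ttl_seconds=60, now=None):
    """Orchestrator: sign a short-lived delegation for one agent."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"sub": agent_id, "scopes": sorted(scopes), "exp": now + ttl_seconds},
        sort_keys=True,
    ).encode()
    signature = hmac.new(ORCHESTRATOR_KEY, payload, hashlib.sha256).hexdigest()
    return payload.hex() + "." + signature


def verify_token(token, required_scope, now=None):
    """Receiving agent: check signature, expiry, and scope independently."""
    now = time.time() if now is None else now
    try:
        payload_hex, signature = token.split(".")
        payload = bytes.fromhex(payload_hex)
    except ValueError:
        return False  # malformed token
    expected = hmac.new(ORCHESTRATOR_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False  # not signed by the orchestrator, or tampered with
    claims = json.loads(payload)
    return now < claims["exp"] and required_scope in claims["scopes"]


token = issue_token("agent-a", {"read:reports"})
print(verify_token(token, "read:reports"))   # scope granted by the orchestrator
print(verify_token(token, "write:reports"))  # scope never delegated
```

Note that the receiving agent never consults the caller's own credential: the only thing it trusts is a claim it can verify against the orchestrator's key.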

Compliance, data residency and privacy: This one catches teams off guard more than any other. An agent that retrieves customer records, sends them to an LLM provider for analysis, writes outputs to a cloud store, and triggers a notification has just created four potential jurisdictional touchpoints, each with its own data residency, retention, and consent implications. Most teams only map this out after something goes wrong.

GDPR, HIPAA, and SOC 2 frameworks don’t yet provide direct guidance on autonomous agents. Until they do, the enterprise burden is to instrument each data movement in an agent workflow, classify the data at each step, and enforce residency constraints at the integration layer. Not as an afterthought post-deployment.
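One way to picture enforcement at the integration layer is a check that runs before every data movement and refuses transfers the policy does not allow. Everything in this sketch is an assumption for illustration: the data classes, region names, and policy table are invented, and real residency rules depend on the applicable regulation and contracts.

```python
# Hypothetical policy: regions each data class may be sent to.
RESIDENCY_POLICY = {
    "pii": {"eu-west"},
    "internal": {"eu-west", "us-east"},
    "public": {"eu-west", "us-east", "ap-south"},
}

DATA_MOVEMENT_LOG = []  # stand-in for an audit record store


class ResidencyViolation(Exception):
    pass


def check_transfer(step, classification, destination_region):
    """Record a data movement, or refuse it if the policy forbids it."""
    allowed = RESIDENCY_POLICY.get(classification, set())  # unknown class: deny
    if destination_region not in allowed:
        raise ResidencyViolation(
            f"{step}: {classification} data may not go to {destination_region}"
        )
    entry = {"step": step, "classification": classification,
             "region": destination_region}
    DATA_MOVEMENT_LOG.append(entry)
    return entry


# The touchpoints from the example above, with invented regions.
check_transfer("retrieve_customer_records", "pii", "eu-west")
try:
    check_transfer("send_to_llm_provider", "pii", "us-east")
except ResidencyViolation as err:
    print("blocked:", err)
```

Unknown classifications are denied by default, so a new data type has to be classified before an agent can move it anywhere.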

Pillar three

Governance at scale

Here’s something that often surprises teams: agents are not static software. Their behaviour is partially a function of model weights, prompt versions, tool schemas, and runtime context — all of which can change independently. Governance at scale means treating every one of those variables as a versioned, managed artifact with defined change procedures, cost accountability, and behavioural test coverage. The urgency is real: Deloitte’s 2025 Emerging Technology Trends study found that while 38% of organisations are piloting agentic solutions, only 11% are actively running them in production. That gap points directly to unresolved governance, not unproven technology.

Cost controls and resource governance: An agentic workflow that loops unexpectedly for 40 iterations instead of 4 doesn’t just produce wrong output. It produces a surprisingly large invoice. LLM token costs, third-party API calls, and compute charges are all susceptible to unbounded growth when agents misbehave. Governance needs to happen at the workflow level, not just at the infrastructure level.

COST \ RISK	LOW RISK	HIGH RISK
LOW COST	✓ STANDARD OPS: Retrieval / read agents – standard rate limiting + cost attribution	⚠ MONITOR CLOSELY: Reasoning-heavy decision loops – audit trail + confidence thresholds + HITL
HIGH COST	COST-ATTRIBUTE & ALERT: Batch processing agents – quota + throttle + cost dashboard	⛔ HARD BUDGET CAP: Multi-step write workflows – per-run cap + kill switch + HITL gate

Table 5: Agentic cost governance framework
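The hard budget cap in the bottom-right quadrant can be sketched as a per-run guard that is consulted before every step. The `RunBudget` class and the numbers are hypothetical; costs are tracked in integer cents to avoid floating-point drift, and the sketch assumes each step's cost can be estimated before it runs.

```python
class BudgetExceeded(Exception):
    pass


class RunBudget:
    """Per-run cap on both step count and spend (hypothetical helper)."""

    def __init__(self, max_steps, max_cost_cents):
        self.max_steps = max_steps
        self.max_cost_cents = max_cost_cents
        self.steps = 0
        self.cost_cents = 0

    def reserve(self, estimated_cost_cents):
        """Call before each step; raises instead of letting the step run."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} reached")
        if self.cost_cents + estimated_cost_cents > self.max_cost_cents:
            raise BudgetExceeded(f"cost cap {self.max_cost_cents} cents reached")
        self.steps += 1
        self.cost_cents += estimated_cost_cents


def run_workflow(budget, planned_steps, cost_per_step_cents):
    """Simulate an agent loop that wants to run planned_steps iterations."""
    completed = 0
    try:
        for _ in range(planned_steps):
            budget.reserve(cost_per_step_cents)
            completed += 1  # the real tool or LLM call would go here
    except BudgetExceeded as stop:
        return completed, str(stop)  # kill switch: stop and report why
    return completed, "finished"


# A loop that tries 40 iterations against a 4-step cap.
print(run_workflow(RunBudget(max_steps=4, max_cost_cents=100), 40, 3))
```

Reserving before the step, rather than charging after it, means the step that would break the cap never executes.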

Testing and behavioural validation: Unit tests don’t map to agent behaviour. You can’t assert that a reasoning system returns the correct output for a given input — the space of inputs is unbounded, the outputs are probabilistic, and the path through the workflow matters as much as the result. Testing agentic systems requires a different discipline entirely, and most teams underinvest here until after a production incident forces the issue.

The emerging practice combines golden path replay (recording known-good trajectories and detecting deviation), adversarial probing (deliberate prompt injections and edge-case inputs), and sandboxed production simulation where agents operate against mirrored environments with real data shapes but no write access to live systems.
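Golden path replay can be sketched as a comparison between a recorded trajectory and a fresh run. The version below is deliberately coarse and rests on an assumption: it compares only the tool name and argument names at each step, since argument values and generated text vary between runs. The trajectory format and the `first_deviation` helper are hypothetical.

```python
def signature(trajectory):
    """Reduce each step to (tool name, sorted argument names)."""
    return [(step["tool"], tuple(sorted(step["args"]))) for step in trajectory]


def first_deviation(golden, actual):
    """Return the first step where the run departs from the golden path."""
    g_sig, a_sig = signature(golden), signature(actual)
    for index, (expected, got) in enumerate(zip(g_sig, a_sig)):
        if expected != got:
            return {"step": index, "expected": expected, "got": got}
    if len(g_sig) != len(a_sig):
        # One trajectory is a strict prefix of the other.
        index = min(len(g_sig), len(a_sig))
        return {
            "step": index,
            "expected": g_sig[index] if index < len(g_sig) else None,
            "got": a_sig[index] if index < len(a_sig) else None,
        }
    return None  # same path


golden = [
    {"tool": "fetch_report", "args": {"report_id": "r1"}},
    {"tool": "summarise", "args": {"text": "..."}},
    {"tool": "save_draft", "args": {"body": "..."}},
]
actual = [
    {"tool": "fetch_report", "args": {"report_id": "r7"}},
    {"tool": "summarise", "args": {"text": "..."}},
    {"tool": "send_email", "args": {"to": "all@example.com", "body": "..."}},
]

print(first_deviation(golden, actual))
```

Run in CI, a non-`None` result after a prompt or model change is a signal to review the run, not proof of a regression: some deviations are improvements.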

SLA design and SLO accountability: If an agentic workflow is in the critical path of a business process, it needs formal service level objectives. Latency budgets, availability targets, and degradation strategies — including fallback to a simpler model, a human-routed path, or a cached result — should be designed before go-live, not discovered during the first production incident.
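A degradation strategy of this kind can be written down as an ordered list of tiers. The sketch below tries each tier in turn, moves on when one fails, and routes to a human queue when nothing succeeds or the latency budget is spent. The tier names and handlers are hypothetical, and the budget is only checked between tiers: this code cannot interrupt a handler that is already running, so real handlers would need their own timeouts.

```python
import time


def run_with_fallbacks(tiers, request, latency_budget_s):
    """Try (name, handler) tiers in order within a latency budget."""
    deadline = time.monotonic() + latency_budget_s
    attempts = []
    for name, handler in tiers:
        if time.monotonic() >= deadline:
            attempts.append((name, "skipped: budget spent"))
            continue
        try:
            result = handler(request)
        except Exception as exc:
            attempts.append((name, f"failed: {exc}"))
            continue
        attempts.append((name, "ok"))
        return {"served_by": name, "result": result, "attempts": attempts}
    # Nothing succeeded: degrade to the human-routed path.
    return {"served_by": "human_queue", "result": None, "attempts": attempts}


def primary_model(request):
    raise TimeoutError("primary model timed out")


def simpler_model(request):
    return f"short answer to: {request}"


def cached_result(request):
    return "cached answer"


tiers = [
    ("primary_model", primary_model),
    ("simpler_model", simpler_model),
    ("cached_result", cached_result),
]

outcome = run_with_fallbacks(tiers, "refund status for order 1182", latency_budget_s=2.0)
print(outcome["served_by"], outcome["attempts"])
```

Recording which tier served each request gives the SLO dashboard a degradation rate to track alongside latency and availability.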

Versioning and change management: When an agent’s system prompt changes, its behaviour changes. When a tool schema is updated, every agent that calls it is affected. When a new model version is deployed, behavioural regressions are possible without any code change at all. This is one of the more counterintuitive realities of agentic systems. Enterprise-grade agent management requires version-controlled agent definitions, semantic versioning for tool contracts, and the ability to roll back a deployed agent independently of the application it serves.
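As a small illustration of semantic versioning for tool contracts, the sketch below pins each tool an agent depends on to a version and flags deployed tools that are no longer compatible. It assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release suffixes, and the manifest shape and tool names are hypothetical.

```python
def parse_version(version):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(required, deployed):
    """Same major version, and at least the required minor/patch."""
    req, dep = parse_version(required), parse_version(deployed)
    return dep[0] == req[0] and dep >= req


# Hypothetical version-controlled agent definition.
AGENT_DEFINITION = {
    "agent": "acct-analyst",
    "version": "1.4.0",
    "tool_contracts": {
        "summarization_api": "2.1.0",
        "analytics_reports": "1.3.2",
    },
}


def incompatible_tools(definition, deployed_tools):
    """List pinned tools that are missing or have a breaking version."""
    return sorted(
        name
        for name, required in definition["tool_contracts"].items()
        if name not in deployed_tools
        or not is_compatible(required, deployed_tools[name])
    )


deployed = {"summarization_api": "3.0.0", "analytics_reports": "1.4.0"}
print(incompatible_tools(AGENT_DEFINITION, deployed))
```

A check like this can gate deployment: a major-version bump in a tool contract blocks the rollout until the agent definition is updated and its behavioural tests have been rerun.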

OPERATIONAL INTEGRITY

☐ Distributed trace ID propagated across all agent layers

☐ Decision provenance log persisted and queryable

☐ Compensating actions defined for every write operation

☐ Durable state store with explicit checkpoint markers

☐ Circuit breakers on all tool integrations

☐ HITL escalation path for confidence threshold breaches

SECURITY & TRUST

☐ Agent identity scoped to task, not user role

☐ Short-lived credentials; no persistent tokens in context

☐ Hard deny list with no escalation path

☐ Agent-to-agent delegation validated independently

☐ Data classification at each integration boundary

☐ PII/PHI handling mapped to compliance framework

GOVERNANCE AT SCALE

☐ Per-workflow budget cap and kill switch in place

☐ Cost attributed to team, workflow, and task

☐ Golden path regression test suite running in CI

☐ Adversarial probe scenarios documented and scheduled

☐ SLOs defined with explicit degradation strategy

☐ Agent definitions versioned; rollback procedure tested

What frameworks give you and what they don’t

The three pillars above describe what enterprise-ready agentic infrastructure requires. It’s worth being equally precise about what the leading frameworks really provide, because the gap between the two is exactly where most production failures happen.

The popular frameworks each solve a real problem, and they solve it well.

LangGraph gives you graph-based orchestration: explicit state machines where every node, edge, and transition is defined in code. It has built-in checkpointing backed by SQLite, Redis, or Postgres, and first-class human-in-the-loop support via interrupt nodes. It’s running in production at LinkedIn and Uber, with LangSmith providing a solid observability layer on top.

CrewAI gives you role-based multi-agent composition: agents as crew members with defined goals and tools, wired together rapidly with minimal boilerplate. It’s the fastest path from idea to working prototype, and CrewAI claimed deployment across 60% of Fortune 500 companies as of its v1.0 GA in October 2025.

Microsoft Agent Framework, the unified successor to AutoGen and Semantic Kernel, with GA targeted for end of Q1 2026, gives you graph-based workflows, checkpointing, and HITL on top of deep Azure integration, with native enterprise identity via Entra Agent ID.

A fourth category has emerged alongside these orchestration frameworks: platforms that approach the problem from the infrastructure layer up. Solace Agent Mesh, which reached general availability in December 2025, is the most distinct example. Built on Solace’s event broker — technology that enterprises like Bosch, RBC Capital Markets, and Heineken already run in mission-critical operations — it treats agent communication as an event-driven messaging problem rather than an orchestration code problem. Every agent interaction flows through the event mesh. That means observability, fault tolerance, horizontal scaling, and message delivery guarantees are structural properties of the platform, not layers you add later. Security and governance sit in the gateway layer: SSO, RBAC, action-level permissions, and user-delegated access come as first-class capabilities. Gartner’s August 2025 Innovation Insight on AI Agent Development Frameworks listed Solace as one of eight representative vendors in this space.

All of these are genuine capabilities, not marketing. But a precise reading of each reveals where the framework layer ends and the infrastructure layer begins, and which approach leaves you to assemble that infrastructure yourself.

Capability	LangGraph	CrewAI	Microsoft AF	Solace Agent Mesh	You Build
Workflow orchestration	✓	✓	✓	✓	–
State persistence	✓ native	◑ opaque	◑ limited	✓ event-driven	–
Human-in-the-Loop	✓	◑ basic	✓	✓	–
Distributed trace/Audit	◑ LangSmith	✗	◑ Azure only	✓ built-in	✓
Agent IAM / least-privilege	✗	✗	◑ Azure-bound	✓ SSO+RBAC	✓
Multi-agent trust boundary	✗	✗	✗	✓ gateway	✓
Cost/resource efficiency and governance	◑ manual	✗	◑ Azure Cost	◑ data management	✓
Compliance / data residency	✗	✗	◑ Azure platform	✓ SOC2·ISO27001	✓

Table 7: Framework coverage vs enterprise requirements

Table 7 is not a criticism of the orchestration frameworks. They were all designed to solve the hard problem of orchestration: how to define, sequence, and execute agent behaviour. They solve that well. But they were not designed to provide a production security model, a cross-layer audit trail, agent-scoped IAM, multi-agent trust validation, or workflow-level cost governance. Those requirements sit above the framework layer. If you want them, you assemble them yourself — LangSmith or Langfuse for tracing, a secrets manager for credentials, a policy engine for access control, an event broker for reliable delivery. Each one is a separate project.

Solace Agent Mesh starts from a different place entirely. It treats the enterprise infrastructure requirements as the foundation, not the finishing touch. Because every agent interaction flows through the event broker, observability and fault tolerance are built in rather than bolted on. Because security sits in the gateway layer by default, it doesn’t require a separate identity integration project. The trade-off is real: Agent Mesh is less composable for teams that want to wire up arbitrary orchestration patterns from scratch, and it introduces a dependency on Solace’s event broker infrastructure. But for enterprises moving from proof-of-concept to governed production, that trade-off is often exactly what’s needed.

Think of the framework as the engine. The three pillars — operational integrity, security and trust, governance at scale — are the chassis, the brakes, and the safety systems. Shipping an engine without the rest of the vehicle is what produces the 40% cancellation rate Gartner is projecting. The real question isn’t which framework to pick. It’s whether the infrastructure layer that production demands gets assembled piecemeal over time or is built in from the start.

The hardest part isn’t the build

None of the failure modes, trust gaps, or governance deficits described here are exotic. They’re the standard challenges of distributed systems, applied to a new class of actor that reasons instead of just executes. We’ve solved versions of these problems before — in microservices, in cloud-native architectures, in event-driven integration. The patterns exist. They just haven’t been applied to agents yet.

The teams that treat these as architectural requirements from day one, rather than operational problems to fix after launch, are the ones whose agentic systems survive contact with production. The others will have great demos. You built the agent — now build the infrastructure around it.
