The ease with which AI agents can be built locally, requiring just a few LLM calls, a prompt, and tool definitions, belies the significant complexities that emerge when these agents transition into production environments. As organizations increasingly integrate AI agents into their core operations, a parallel to Google’s seminal 2015 paper, "Hidden Technical Debt in Machine Learning Systems," is becoming apparent. This foundational research illuminated the vast infrastructure surrounding the relatively small "ML Code" box, a pattern now being replicated by agentic engineering systems. These systems, capable of dynamic decision-making, autonomous tool usage, and complex reasoning, are not merely code but intricate ecosystems that generate substantial technical debt.
The core challenge lies in the disproportionate infrastructure requirements of agentic systems compared to the agent code itself. While building a functional agent might be a matter of minutes for an individual, scaling these agents to departmental or organizational use, involving real data and critical consequences, unveils a multitude of unforeseen infrastructure needs. Agentic engineering systems inherit all the maintenance burdens of traditional software while adding a new layer of agent-specific complexities. The rapid proliferation of new agents, often created by numerous employees, threatens to outpace human oversight, leading to a decentralized and potentially unmanageable landscape.
Defining an agent as any process with dynamic decision-making capabilities that can autonomously determine tool usage and execution paths through reasoning and reflection underscores the need for robust supporting infrastructure. The current paradigm, where agent code represents a fraction of the overall system, highlights that the true complexity resides in the surrounding infrastructure. Through extensive conversations with engineering leaders and direct experience, seven key infrastructure blocks have been identified, each representing a category of work often overlooked during the initial development and demonstration phases of AI agents. Some of these, like observability, integrations, and governance, will resonate with those familiar with traditional software engineering. However, others, such as human-in-the-loop mechanisms, robust evaluation frameworks for non-deterministic systems, and comprehensive agent registries, are uniquely critical for agentic AI.
1. Integrations: The Gateway to Operational Data
Agents necessitate access to an organization’s diverse operational systems, including CI/CD pipelines, cloud providers, incident management tools, observability platforms, code repositories, and secret managers. The absence of centralized integration management often leads to each team independently establishing its own agent connections. This can result in a proliferation of credentials, such as individual GitLab Personal Access Tokens (PATs), Snowflake credentials, Kubernetes service accounts, and monitoring tokens, creating hundreds of integration points that require individual configuration, debugging, and lifecycle management.
This decentralized approach to integrations introduces inconsistencies. When each engineer manages their own credentials, agents can perceive different data landscapes. For instance, one agent might have broad access to all repositories via a comprehensive PAT, while another is restricted to a specific team’s scope, leading to divergent operational views for identical agent types. Furthermore, API changes from integrated services, like a breaking change in GitLab’s API, can trigger widespread debugging efforts across multiple teams, with varying degrees of efficiency in resolution.
The quality and consistency of data flowing through these integrations are equally critical. When multiple teams connect to the same data source via disparate paths, their agents may derive different answers to identical queries. A disparity in data history—for example, a 30-day deployment history versus a three-year view—will inevitably lead to differing agent outputs. While the Model Context Protocol (MCP) provides a standardized way for agents to invoke tools, it does not inherently manage the credentials, data scope, or impact of external API changes, leaving significant technical debt in the integration layer.
Hidden Technical Debt in Integrations:
- Credential Sprawl and Management: Hundreds of individual API keys, tokens, and service accounts requiring constant rotation, revocation, and auditing.
- Inconsistent Data Access: Agents operating with different scopes of access, leading to varied and potentially incorrect conclusions.
- Fragile Dependencies: High susceptibility to breaking changes in external APIs, requiring widespread, uncoordinated debugging.
- Data Silos and Inconsistencies: Diverse integration paths to the same data sources creating conflicting information for agents.
- Lack of Centralized Control: Inability to enforce uniform access policies or respond rapidly to security threats across all integrations.
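One way to pay down this debt is to route every agent integration through a single broker that owns scoping, expiry, revocation, and auditing. The following is a minimal sketch, not a prescribed design; the class name (`IntegrationBroker`) and scope strings (`repo:read`) are illustrative:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Credential:
    token: str
    scopes: set[str]          # e.g. {"repo:read"} -- illustrative scope names
    issued_at: float = field(default_factory=time.time)
    ttl_seconds: float = 3600.0

    def expired(self) -> bool:
        return time.time() - self.issued_at > self.ttl_seconds

class IntegrationBroker:
    """Single point that issues scoped, auditable credentials to agents."""

    def __init__(self) -> None:
        self._grants: dict[tuple[str, str], Credential] = {}
        self.audit_log: list[tuple[str, str, str]] = []

    def grant(self, agent_id: str, system: str, scopes: set[str], token: str) -> None:
        self._grants[(agent_id, system)] = Credential(token=token, scopes=scopes)

    def checkout(self, agent_id: str, system: str, needed_scope: str) -> str:
        cred = self._grants.get((agent_id, system))
        if cred is None or cred.expired():
            raise PermissionError(f"{agent_id} has no live credential for {system}")
        if needed_scope not in cred.scopes:
            raise PermissionError(f"{agent_id} lacks scope {needed_scope} on {system}")
        self.audit_log.append((agent_id, system, needed_scope))  # every checkout is recorded
        return cred.token

    def revoke_system(self, system: str) -> None:
        """Cut every agent's access to one system at once, e.g. after a breach."""
        self._grants = {k: v for k, v in self._grants.items() if k[1] != system}
```

With a broker in place, a breaking change in a provider's API has one place to be fixed, and revoking a compromised system is a single call rather than a hunt across teams.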
2. Context Lake: Fueling Agentic Decision-Making
The efficacy of an AI agent is intrinsically tied to the context it can access and leverage. This context can be broadly categorized into two types: runtime context and decision traces.

Runtime Context: This refers to the dynamic, real-time data agents require for specific executions. This includes information about services, their ownership, recent deployments, and operational configurations. For a coding agent tasked with adding a retry mechanism to a service, runtime context would encompass the service’s language and framework, established retry patterns within the organization, downstream service dependencies, and any recent configuration changes affecting timeout settings. Relying on static documentation, such as Markdown files, for this ever-changing information is a significant source of technical debt. Service ownership can shift, dependencies evolve, and configurations are updated hourly. An agent referencing an outdated Markdown file might operate under erroneous assumptions, leading to flawed actions.
Decision Traces: These are historical records of past actions, the rationale behind them, and their subsequent outcomes. Without access to this history, each agent execution begins anew, risking the repetition of past mistakes. For example, an agent attempting to fix a flaky test might re-open a pull request that was previously rejected due to downstream contract violations, or which was intended for deprecation. The absence of decision traces means that valuable institutional knowledge is lost after each agent run. When multiple agents interact with the same systems, their inability to learn from each other’s histories creates a compounding problem, leading to repeated errors and inefficiencies. While LLM providers are introducing memory features, scaling these to manage the historical data for dozens of agents necessitates a reliable mechanism for serving relevant memory to specific agents.
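A shared decision-trace store is one way to keep this institutional knowledge available to later runs. The sketch below is a minimal illustration; the field names and outcome labels are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class DecisionTrace:
    agent_id: str
    action: str                  # e.g. "open_pr"
    target: str                  # e.g. "payments-service"
    rationale: str
    outcome: str                 # e.g. "accepted" | "rejected" | "rolled_back"

class TraceStore:
    """Append-only log of agent decisions, queryable by later runs."""

    def __init__(self) -> None:
        self._traces: list[DecisionTrace] = []

    def record(self, trace: DecisionTrace) -> None:
        self._traces.append(trace)

    def history(self, action: str, target: str) -> list[DecisionTrace]:
        """Return prior attempts at the same action on the same target."""
        return [t for t in self._traces if t.action == action and t.target == target]

    def previously_rejected(self, action: str, target: str) -> bool:
        return any(t.outcome == "rejected" for t in self.history(action, target))
```

Before re-opening a pull request, the flaky-test agent could first consult `previously_rejected("open_pr", target)` and skip work that has already been tried and turned down.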
Hidden Technical Debt in Context Lake:
- Outdated and Inaccurate Context: Reliance on static documentation for dynamic operational data leads to flawed agent decisions.
- Knowledge Loss: Each agent run starting from scratch due to the absence of historical learning.
- Repetition of Errors: Agents re-performing tasks that have already been attempted and resolved (or deemed unnecessary).
- Lack of Cross-Agent Learning: Agents unable to benefit from the experiences of other agents operating within the same ecosystem.
- Inefficient Context Retrieval: Difficulty in efficiently serving the most relevant historical data to specific agents for their current tasks.
3. Agent Registry: Navigating the Agent Landscape
The rapid proliferation of AI agents, often created by individual employees, poses a significant challenge to visibility and control. With the potential for agents to outnumber employees by a factor of five to ten, organizations face an escalating "org chart" of autonomous processes. These agents, operating across various tools like Claude Code, Cursor, n8n, Zapier, and cloud platforms, can access critical infrastructure and make decisions without clear oversight.
A common pattern emerges where multiple teams independently develop similar agents due to a lack of awareness of existing solutions. This leads to overlapping responsibilities, conflicting behaviors, and hidden dependencies. Before agents can be effectively shared or governed, their existence must be cataloged and understood.
Beyond mere visibility, agents require a standardized operational framework, akin to an employee handbook. This includes guidelines on their expected behavior, available skills, and operational protocols. Currently, individual engineers often create skill files independently, leading to fragmentation, duplication, and inaccuracies. These scattered skills can contradict platform-distributed context, undermining consistency. A centralized platform team is often best positioned to define and distribute these critical operational instructions.
The challenge then becomes delivering these instructions—coding rules, commands, skills, and hooks—consistently and personally to the appropriate agents. This information may need to be layered, with organizational standards, team-specific guidelines, and individual agent configurations. Reliably disseminating this information to thousands of agents, ensuring the right instructions reach the right agents, is a substantial infrastructure undertaking.
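At its core, layered instruction delivery is a precedence merge: organizational defaults, overridden by team guidelines, overridden by per-agent configuration. A deliberately minimal sketch, with hypothetical keys:

```python
def resolve_instructions(org: dict, team: dict, agent: dict) -> dict:
    """Merge instruction layers; more specific layers override broader ones."""
    merged: dict = {}
    for layer in (org, team, agent):   # later (more specific) layers win
        merged.update(layer)
    return merged
```

The hard part at scale is not the merge itself but distributing the resolved result to thousands of agents and knowing which version each one is running.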
Finally, the process of creating new agents must be managed without stifling innovation. Platform teams are tasked with establishing standardized agent creation processes, analogous to service catalogs for traditional software. Without a template, agents may be spun up with no defined owner, lifecycle state, or connection to the services they operate on, leading to orphaned processes with expired credentials and unknown functionality. A standardized template ensures every agent is born with essential metadata—owner, description, tools used, services touched, and lifecycle state—enabling governance from inception. This standardized creation process, accessible from developer tools like Cursor, can actually accelerate the delivery of high-quality, governable agents.
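One possible shape for such a creation template is a manifest that refuses to instantiate without the metadata described above. The field names mirror that list; the lifecycle states are illustrative:

```python
from dataclasses import dataclass, field

VALID_STATES = {"experimental", "active", "deprecated", "retired"}

@dataclass
class AgentManifest:
    name: str
    owner: str                     # team or individual accountable for this agent
    description: str
    tools: list[str] = field(default_factory=list)
    services_touched: list[str] = field(default_factory=list)
    lifecycle_state: str = "experimental"

    def __post_init__(self) -> None:
        if not self.owner:
            raise ValueError("every agent needs an owner")
        if self.lifecycle_state not in VALID_STATES:
            raise ValueError(f"unknown lifecycle state: {self.lifecycle_state}")
```

Validation at creation time means the registry never has to chase down ownership after the fact.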
Hidden Technical Debt in Agent Registry:

- Lack of Agent Visibility: Difficulty in identifying existing agents, leading to duplication and conflicting functionalities.
- Fragmented Skill Management: Inconsistent, duplicated, or inaccurate skill definitions scattered across various repositories.
- Inconsistent Agent Onboarding: Agents created without standardized templates, leading to a lack of ownership, lifecycle tracking, and operational metadata.
- Scalability Challenges in Instruction Delivery: Difficulty in reliably distributing and updating operational guidelines to a vast number of agents.
- Uncontrolled Agent Sprawl: The creation of numerous unmanaged agents with unknown capabilities and access privileges.
4. Measurement: Quantifying Agent Performance and Impact
Assessing the effectiveness of AI agents presents a multi-faceted challenge, as different stakeholders require distinct metrics. Site Reliability Engineers (SREs) focus on understanding agent actions, while ML engineers and product managers track performance trends, and VPs of Engineering evaluate return on investment. End-users need assurance that agents are learning from their feedback.
Observability: This addresses the fundamental question of "What are your agents doing?" Events, traces, and logs are crucial for understanding agent actions, data access, and operational status. The expanded attack surface of agents, encompassing intricate workflows like autonomous Jira ticket resolution, requires end-to-end traceability. Debugging issues in workflows involving code generation, repository access, and pull request creation demands comprehensive visibility into each step.
Evals: Measuring agent improvement or degradation in response to changes in prompts, skills, tools, or models is critical. Unlike traditional software with deterministic unit tests, agentic systems produce variable outputs. Evals provide a method to answer: "After a change, is the agent still performing adequately?" Without this, changes can be deployed untested, leading to silent degradation of output quality.
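Because outputs vary between runs, an eval typically grades properties of the output against a pass-rate threshold rather than asserting exact strings. A minimal harness sketch; the grading functions and threshold are placeholders:

```python
from typing import Callable

def run_evals(agent: Callable[[str], str],
              cases: list[tuple[str, Callable[[str], bool]]],
              threshold: float = 0.9) -> tuple[float, bool]:
    """Run each case's grader on the agent's output; gate on overall pass rate."""
    passed = sum(1 for prompt, grade in cases if grade(agent(prompt)))
    score = passed / len(cases)
    return score, score >= threshold
```

Graders can check that a summary names the affected service or that the output parses as JSON, tolerating the wording differences that would break a deterministic unit test.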
Business Impact: The financial justification for AI agents—answering "What is AI doing for your business?"—is a significant concern for engineering leadership. While tracking direct costs like token usage and compute is feasible, measuring ROI is more complex. Quantifying the number of tickets resolved, engineering time saved, or reductions in Mean Time To Recovery (MTTR) is challenging and requires trust in the collected data. A focus solely on cost without demonstrable ROI creates difficult conversations.
Feedback Loops: Capturing human feedback on agent outputs—whether it’s a thumbs-up/thumbs-down on a generated PR or a corrective instruction—is vital for agent improvement. These loops are more impactful than evals for refining agent behavior. In the demo phase, feedback mechanisms are often absent or ignored, leading to agents that fail to learn from user interactions.
Hidden Technical Debt in Measurement:
- Lack of Comprehensive Observability: Inability to trace the complete execution path of complex agentic workflows.
- Unreliable Performance Tracking: Absence of robust evaluation frameworks for non-deterministic agent outputs.
- Difficulty in Demonstrating ROI: Challenges in quantifying the business value and impact of agent deployments.
- Ineffective Feedback Mechanisms: Failure to capture and act upon human feedback, hindering agent learning and improvement.
- Uncontrolled Cost Overruns: Agents continuing to consume resources without effective cost monitoring or limits.
5. Human-in-the-Loop: Balancing Autonomy and Oversight
The spectrum of agentic operations ranges from fully manual to fully autonomous. Most effective agents operate in a middle ground, where human intervention is strategically placed based on the action, environment, and associated risk. Human-in-the-loop (HITL) mechanisms enable agents to move closer to autonomy safely by defining checkpoints for human approval.
For example, a deployment agent might operate autonomously in staging environments but require explicit approval before deploying to production. The rules governing these approvals are conditional and vary by agent, action, environment, and team. Hard-coding approval logic as simple if statements that trigger Slack notifications is unsustainable for a large number of agents across multiple teams. This leads to inconsistent implementation, where one agent might roll back production without approval, while another requires multiple sign-offs for the same action.
The orchestration of approvals itself introduces complexity. Determining who is notified, through which channel, what the timeout is for a response, and how to handle approver unavailability are critical considerations. Maintaining separate approval systems via Slack, email, and custom UIs creates significant overhead.
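Both problems shrink if approval rules live in one declarative table that maps an (agent, action, environment) triple to a requirement, including its channel and timeout. The rules below are illustrative examples, not recommended policy:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ApprovalRule:
    requires_approval: bool
    approvers: tuple[str, ...] = ()
    channel: str = "slack"          # where the request is routed
    timeout_minutes: int = 30       # escalate if nobody responds in time

# (agent, action, environment) -> rule; "*" is a wildcard agent
POLICIES: dict[tuple[str, str, str], ApprovalRule] = {
    ("deploy-agent", "deploy", "staging"):    ApprovalRule(False),
    ("deploy-agent", "deploy", "production"): ApprovalRule(True, ("oncall-lead",)),
    ("*", "rollback", "production"):          ApprovalRule(True, ("oncall-lead", "service-owner")),
}

def rule_for(agent: str, action: str, env: str) -> ApprovalRule:
    for key in ((agent, action, env), ("*", action, env)):
        if key in POLICIES:
            return POLICIES[key]
    return ApprovalRule(True)   # default-deny: unknown actions need a human
```

Unknown combinations fall through to default-deny, so a newly created agent cannot roll back production simply because nobody wrote a rule for it yet.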

Beyond approval workflows, HITL is essential for providing visibility into agent operations. As engineers transition to managing agents, a control plane is needed to monitor active agents, initiate tasks, identify agents requiring attention, and intervene when necessary. This visibility is crucial for building trust in agent capabilities. Teams that can observe and intervene in an agent’s initial deployments will develop confidence in its autonomous operation, a trust that is impossible to build without such transparency.
Hidden Technical Debt in Human-in-the-Loop:
- Inconsistent Approval Workflows: Lack of a standardized approach to human oversight across different agents and teams.
- Unscalable Approval Logic: Hard-coded approval mechanisms that do not adapt to complex operational scenarios.
- Fragmented Notification Systems: Multiple channels and methods for seeking human approval, creating operational complexity.
- Lack of a Centralized Control Plane: Difficulty in monitoring agent activities, initiating tasks, and intervening when necessary.
- Erosion of Trust: Without transparent oversight, teams may be hesitant to delegate critical tasks to agents.
6. Governance: Establishing Rules and Enforcement
Traditional governance processes for human engineers involve requesting, approving, and logging access to sensitive systems. This ensures that access is scoped, audited, and accountable. In contrast, agents often operate with the broad credentials of their creators, bypassing established governance protocols and scope reviews.
Effective governance for agents requires specific rules concerning data access, tool usage, and operational permissions. These rules must be defined centrally by platform teams and applied uniformly across all agents. Enforcement is equally critical. The ability to instantly disable a tool across all agents upon discovery of a vulnerability is a key governance capability that is often missing.
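That kill-switch capability only exists if every tool invocation passes through a single gate the platform team controls. A minimal sketch, with illustrative names:

```python
class ToolGate:
    """Central chokepoint every agent consults before invoking a tool."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()

    def disable(self, tool: str) -> None:
        # One call takes the tool away from every agent at once.
        self._disabled.add(tool)

    def enable(self, tool: str) -> None:
        self._disabled.discard(tool)

    def check(self, agent_id: str, tool: str) -> None:
        if tool in self._disabled:
            raise PermissionError(f"{tool} is disabled org-wide (requested by {agent_id})")
```

If agents call tools directly with their creators' credentials, no such gate exists, and a vulnerability response becomes a team-by-team scramble.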
Even with initial access controls, unforeseen issues can arise. Comprehensive audit trails are essential to determine which agent performed an action, what data it accessed, what credentials were used, and who initiated the process. Agents operating under their creator’s credentials obscure this audit trail, making it appear as individual user activity. When multiple agents share a service account, attributing specific actions becomes impossible. In chained agent workflows, the audit log may only capture the final modification, omitting the preceding reasoning and decision-making process.
Cost governance is another significant aspect. Agents can incur substantial costs by continuing to operate in error loops or prolonged reasoning cycles. Without cost limits, these processes can burn through tokens for hours, leading to unexpected invoices. The ability to break down LLM spend by agent, team, or use case is crucial for financial accountability.
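Cost governance ultimately reduces to metering each call against a per-agent budget before it executes. A sketch with made-up prices and budget figures:

```python
from collections import defaultdict

class CostMeter:
    """Attribute token spend to agents and cut off runaway loops."""

    def __init__(self, price_per_1k_tokens: float, budgets: dict[str, float]) -> None:
        self.price = price_per_1k_tokens
        self.budgets = budgets                     # USD cap per agent
        self.spend: dict[str, float] = defaultdict(float)

    def charge(self, agent_id: str, tokens: int) -> None:
        cost = tokens / 1000 * self.price
        if self.spend[agent_id] + cost > self.budgets.get(agent_id, 0.0):
            raise RuntimeError(f"{agent_id} exceeded its budget; halting")
        self.spend[agent_id] += cost

    def breakdown(self) -> dict[str, float]:
        """Spend per agent, for the 'which team is burning tokens' question."""
        return dict(self.spend)
```

An agent stuck in an error loop hits its cap and stops within minutes instead of burning tokens for hours, and `breakdown()` gives finance the per-agent attribution the section above calls for.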
Hidden Technical Debt in Governance:
- Uncontrolled Agent Permissions: Agents operating with excessive or unreviewed credentials inherited from their creators.
- Inconsistent Policy Enforcement: Lack of a centralized system for defining and enforcing agent-specific governance rules.
- Weak Audit Trails: Difficulty in tracking agent actions, data access, and credential usage, hindering accountability.
- Inability to Respond to Threats: Challenges in rapidly disabling compromised agents or tools across the entire organization.
- Unmanaged Cost Escalation: Agents incurring significant expenses due to unchecked operational loops or inefficient reasoning.
7. Orchestration: Weaving Together Complex Workflows
Most agentic workflows are not isolated agent operations but rather intricate combinations of agents, tools, and human participants. The technical debt in orchestration arises not from individual steps but from the connections between them—routing, failure handling, and ownership.
Consider an incident response workflow: an alert triggers a triage agent, which identifies a deployment issue and hands off to a deployment agent for rollback. A verification agent then checks the fix. If the triage agent misidentifies the root cause as a deployment issue when it’s actually a database timeout, the rollback is unnecessary, the underlying problem persists, and a human must eventually intervene to clean up the incorrect rollback and address the actual issue. This highlights orchestration debt: workflows that execute confidently but incorrectly, with the source of the error being difficult to trace.

The introduction of new incident types, such as security breaches or configuration drifts, further complicates routing and testing. While individual changes may be minor, the absence of a defined owner for decision-making on how these components connect leads to ad-hoc and inconsistent integration.
Traditional workflow orchestration, seen in CI/CD pipelines or cloud-native services, is typically deterministic. Step A reliably produces a known output consumed by Step B. Agent workflows introduce non-determinism, making every downstream step unpredictable. When a runbook is replaced by an agent that reasons through a problem, the full range of potential paths becomes unknown and untestable.
Furthermore, there is often no clear contract between agents. Unlike services with defined APIs and schemas, agents communicate via prompts and natural language. A subtle shift in an agent’s output due to a model update or prompt change can break subsequent agents in the chain.
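A lightweight mitigation is to pin down the handoff: each agent emits a small structured payload that the next agent validates before acting, so a drifting model or prompt fails loudly instead of silently. The schema below imagines a triage-to-deployment handoff; all field names are hypothetical:

```python
import json

# Required fields for the triage -> deployment handoff (illustrative).
ALLOWED_ROOT_CAUSES = {"deployment", "database", "config", "unknown"}
REQUIRED_FIELDS = {"root_cause", "confidence", "service"}

def validate_handoff(payload: str) -> dict:
    """Reject a malformed handoff instead of letting it silently derail the next step."""
    data = json.loads(payload)
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"handoff missing fields: {sorted(missing)}")
    if data["root_cause"] not in ALLOWED_ROOT_CAUSES:
        raise ValueError(f"unknown root_cause: {data['root_cause']}")
    if not (0.0 <= float(data["confidence"]) <= 1.0):
        raise ValueError("confidence must be in [0, 1]")
    if not data["service"]:
        raise ValueError("service must be non-empty")
    return data
```

This does not make the triage agent's reasoning correct, but it turns a subtle output shift into an explicit, attributable failure at the boundary between agents.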
While deployment pipelines are designed to be deterministic and highly controlled, incident response workflows are inherently non-deterministic, requiring agents to investigate and adapt. The lack of shared rules on when to employ deterministic versus non-deterministic workflows can lead to collisions when workflows span multiple teams with differing approaches.
Finally, the ownership of the overall workflow must be clearly defined. Even if individual agents have owners, the responsibility for the end-to-end workflow can become ambiguous. When an agent-driven action causes an issue, determining accountability—whether it lies with the agent’s owner, the service owner, or the workflow orchestrator—can lead to confusion, particularly in cross-team incidents. Debugging individual step failures is straightforward; tracing the decision three handoffs ago that led the workflow astray requires a level of traceability that many organizations have not yet established.
Hidden Technical Debt in Orchestration:
- Unreliable Workflow Execution: Workflows that confidently take incorrect actions due to misidentified root causes or faulty reasoning.
- Non-Deterministic Dependencies: The unpredictable nature of agent outputs breaking downstream processes.
- Fuzzy Agent Interfaces: Lack of formal contracts or schemas between agents, leading to compatibility issues.
- Ambiguous Workflow Ownership: Difficulty in assigning responsibility when issues arise in multi-agent, multi-team workflows.
- Untraceable Decision Chains: The inability to pinpoint the exact decision that led a workflow down an incorrect path.
When the Debt Hits: Stages of Impact
The accumulation of this hidden technical debt manifests at specific trigger points as agent adoption scales. In the initial exploration phase, with a single engineer and a single agent, debt is minimal. However, as teams begin using agents for real work, integration and context management become immediate pain points. Agents accessing unauthorized customer data due to un-scoped credentials or misidentifying service owners due to missing context are common early failures.
When multiple teams independently deploy agents, the debt accelerates. Agent registry, measurement, and human-in-the-loop challenges surface concurrently. At this stage, it’s estimated that up to 50% of a team’s capacity can be consumed by building the surrounding infrastructure.
At full production scale, with agents embedded across the engineering organization, governance and orchestration become paramount. Some organizations proactively address this, recognizing that "chaos will exist" and striving to establish standards from the outset. Others learn through painful experience, like a platform engineering VP retrofitting governance after observing teams independently build redundant agents. Regardless of the approach, the infrastructure will eventually be built; the question is whether it’s constructed proactively or reactively after significant incidents.

A Parallel to Microservices: The Platform Engineering Moment
The current evolution of agentic engineering systems echoes the trajectory of microservices adoption. Initially, teams selected their own technologies and managed their own infrastructure, leading to fragmentation. Eventually, platform engineering emerged to create standards and streamline development. A similar platform engineering moment is now occurring with agents.
Platform engineering, traditionally focused on velocity through self-service and reducing ticket loads, now faces the challenge of managing agents. Engineers are rapidly creating agents using tools like Cursor and Claude Code, bypassing traditional platform queues. The platform team’s immediate priority is to gain visibility into existing agents and establish control. Only then can they fulfill their core mission: making agent creation and usage faster, safer, and more efficient for everyone.
One DevEx team described their evolving role as facilitating developer understanding and interaction with agents, and enabling broader agent creation, rather than directly designing agents themselves. This shift signifies a focus on empowering development teams within a governed framework.
Addressing the Debt: A Path Forward
The initial step in addressing this technical debt is to establish visibility. This involves auditing GitHub organizations for AI-related workflows, assessing the number of active API tokens across LLM providers, and identifying AI nodes in workflow tools. The goal is not an exhaustive inventory but a foundational count.
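That foundational count can start as a plain filesystem scan over local checkouts, flagging CI workflow files that mention AI tooling. A rough sketch; the keyword list is an assumption and will both over- and under-count:

```python
from pathlib import Path

AI_KEYWORDS = ("openai", "anthropic", "llm", "agent", "claude", "gpt")

def scan_workflows(repo_root: Path) -> list[Path]:
    """Find CI workflow files that reference AI tooling -- a first, rough census."""
    hits = []
    for wf in repo_root.glob("**/.github/workflows/*.y*ml"):
        text = wf.read_text(errors="ignore").lower()
        if any(kw in text for kw in AI_KEYWORDS):
            hits.append(wf)
    return sorted(hits)
```

The point is not precision but a defensible starting number: a list of candidate workflows that a platform team can then classify against its agreed definition of "agent."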
A critical challenge is agreeing on a working definition of what constitutes an "agent." Is a GitHub Actions automation an agent? Is a scheduled task in Claude Code? Or an n8n workflow with an AI node? A clear, agreed-upon definition is necessary for cataloging agents.
The debate between centralized versus democratized infrastructure development is also pertinent. Should the platform team build all necessary infrastructure, or provide guardrails for teams to build their own? Both models have proven effective, with the choice often dependent on organizational culture and tolerance for centralized control.
Ultimately, organizations face a choice: build this essential infrastructure now, or do so after an incident involving leaked customer data, exorbitant token costs, or an uncommanded production rollback. The infrastructure will be built regardless; the only variable is whether it precedes the pain or follows it.
