The Two-Loop Tax: Redefining Integration Testing for the Agent-Driven Era

For years, integration tests have been a staple of Continuous Integration (CI) pipelines, typically triggered by code pushes and delivering results minutes, or even tens of minutes, later. This established workflow, while functional when human developers manually initiated code changes, is rapidly becoming obsolete. The advent of sophisticated coding agents, capable of iterating and generating code in mere seconds, has fundamentally altered the development feedback loop. The traditional round trip to a remote CI pipeline is now a significant bottleneck, too slow to keep pace with the rapid iteration cycles driven by these AI-powered tools.

"Developers are driving coding agents that iterate in seconds, and the round trip to a remote pipeline is too slow to fit inside the loop where the work is happening," highlights a recent analysis of this evolving landscape. This shift necessitates a reevaluation of integration testing strategies to align with the speed and autonomy of modern development agents.

The core challenge lies in what is being termed the "two-loop tax." Historically, software validation has been segmented into two distinct loops: an inner loop, encompassing rapid, local checks like unit tests and linters that provide immediate feedback, and an outer loop, comprising slower, more comprehensive tests like integration and end-to-end tests executed in CI/CD pipelines.

For coding agents, this separation creates a critical disconnect. An agent can generate a code modification in seconds. However, waiting for fifteen minutes or more for a verdict from a remote CI pipeline is an unacceptably long delay that disrupts the agent’s efficient workflow. Consequently, agents tend to "ship" their work from the inner loop, relying solely on the limited, often mocked, validation that this inner loop can provide. This approach leaves the crucial task of verifying the code’s integration with the full system to human developers, effectively leaving the validation loop open and incomplete.

A natural inclination might be to embed agents directly within CI systems, empowering them to author pipelines, generate test stubs, and respond to failures. While this direction is undoubtedly part of the future, the fundamental primitive required for agent-in-CI success is the same one needed to revitalize the inner loop: a unit of validation that is small enough for an agent to author, select, and execute against a real integration environment within seconds.

By establishing this foundational primitive, the inner loop gains the capability to invoke validation directly during the authoring phase. Simultaneously, the outer loop can leverage the same primitive on code pushes. This allows the developer’s agent to interact with CI not as a distant gatekeeper, but as an accessible verification library, eliminating the need to wait for a remote pipeline’s delayed response. The focus on the inner loop stems from its current status as the most acute time-to-verdict bottleneck. However, the same underlying primitive possesses the power to collapse these two distinct loops into a single, cohesive validation process.


Environments: From Bottleneck to Enabler

A significant part of the solution lies in embracing ephemeral, on-demand environments. These environments, already being adopted by some teams for their inner-loop validation, are crucial for providing isolated, scoped, and production-grade fidelity. They offer access to real downstream services and dependencies, scaling to meet the demands of autonomous agents. The pattern for building such environments is broadly applicable, often involving technologies that can spin up and tear down isolated instances of the application stack on demand.

The critical advantage these environments provide is a realistic integration landscape, available instantaneously and precisely scoped to the work being undertaken. An agent can summon such an environment in seconds, validate its changes against it, and then dismantle it once its purpose is served. This eliminates the long-standing challenge of provisioning and managing complex integration environments, a hurdle that previously made rapid, on-demand validation impractical.
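
The summon, validate, dismantle cycle described above can be sketched as a context manager. This is a minimal illustration rather than any vendor's API: the `Environment` class, the `ephemeral_environment` helper, and the `preview.example.test` URL are hypothetical names introduced here, and provisioning is simulated in memory.

```python
from contextlib import contextmanager
from dataclasses import dataclass
import uuid


@dataclass
class Environment:
    """Hypothetical handle to an isolated, on-demand environment."""
    env_id: str
    base_url: str
    active: bool = True


@contextmanager
def ephemeral_environment(changed_service: str):
    # Assumption: a real provisioner would deploy the changed workload
    # next to real dependencies; here the environment is only simulated.
    env = Environment(
        env_id=f"{changed_service}-{uuid.uuid4().hex[:8]}",
        base_url=f"https://{changed_service}.preview.example.test",
    )
    try:
        yield env
    finally:
        # Teardown always runs, even if the validation step raises.
        env.active = False


def validate_in_environment(changed_service: str, check) -> bool:
    """Summon an environment, run one check against it, then dismantle it."""
    with ephemeral_environment(changed_service) as env:
        return bool(check(env))
```

Tying teardown to a `finally` block is the design choice that matters here: an agent that provisions environments many times per session cannot rely on a separate cleanup step being remembered.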

Workflow Matters: The Evolution of Integration Test Definition

While robust environments are a necessary prerequisite, they are not sufficient on their own. The other half of the integration testing equation – how validation is defined, selected, executed, and acted upon – is where the most deeply ingrained assumptions of the pre-agent era persist. Traditional integration testing is heavily reliant on pipelines: monolithic, heavyweight artifacts designed for remote execution with elaborate bootstrapping processes, triggered by code pushes, and following a fixed sequence of steps. None of these characteristics are compatible with the interactive, session-based nature of an agent’s workflow.

This leads to the crucial question: what should the ideal primitive for agent-driven validation look like? It needs to be compact enough to reside within an agent’s working session, yet robust enough to reliably detect integration bugs against a live system. This primitive should be something an agent can author in seconds, which humans can easily review, and which the team can maintain and reuse. Fundamentally, it should function more like a function that an agent invokes when needed, rather than a self-scheduled pipeline.

This concept has been formalized as a "plan."

Actions and Plans: A New Paradigm for Validation

A plan represents a small, two-layer system designed for agent-driven validation. Understanding these layers is key to grasping the new approach.

The foundational layer consists of actions. These are typed, deterministic building blocks, a concept familiar to anyone who has worked with workflow runners like GitHub Actions. Within this context, two key attributes are noteworthy:

Reusable Components: Actions serve as standardized, reusable units of work. This promotes consistency and reduces redundancy in test definitions.
Action Catalog: A curated catalog of available actions ensures that agents can discover and utilize pre-defined validation capabilities, accelerating the plan authoring process.

The second layer comprises plans. A plan is essentially a Directed Acyclic Graph (DAG) composed of these actions, stitched together to validate a single, user-visible behavior end-to-end. These are specifically intended to be authored by agents. The defining characteristics of plans include:

Agent-Authored: Designed for agents to create and manage, facilitating seamless integration into their rapid development cycles.
Small and Focused: Each plan targets a specific user-facing behavior, making them manageable and efficient.
Selection Hints: Metadata that allows agents to intelligently select the most relevant plan for a given code change.
Versioned and Reusable: Plans are version-controlled and can be reused across different development efforts, building a valuable knowledge base of system behavior.
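
To make the two layers concrete, here is a rough sketch of a plan as a DAG of actions, executed in dependency order. The class names (`Action`, `Step`, `Plan`), the `needs` field, and the `upstream` argument are assumptions made for illustration; the source does not describe the actual schema or runner beyond the YAML example that follows.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable


@dataclass(frozen=True)
class Action:
    """A reusable, deterministic building block from the catalog."""
    action_id: str
    run: Callable[[dict], dict]  # takes args, returns outputs


@dataclass
class Step:
    step_id: str
    action: Action
    args: dict = field(default_factory=dict)
    needs: tuple = ()  # ids of steps that must finish first (hypothetical field)


@dataclass
class Plan:
    selection_hint: str
    steps: list

    def execute(self) -> dict:
        """Run steps in dependency order; return outputs keyed by step id."""
        graph = {s.step_id: set(s.needs) for s in self.steps}
        by_id = {s.step_id: s for s in self.steps}
        outputs = {}
        # static_order() raises CycleError if the graph is not a DAG.
        for step_id in TopologicalSorter(graph).static_order():
            step = by_id[step_id]
            upstream = {dep: outputs[dep] for dep in step.needs}
            outputs[step_id] = step.action.run({**step.args, "upstream": upstream})
        return outputs
```

Because each step only sees its own arguments and the outputs of the steps it depends on, the same action can be reused across many plans without knowing which plan it is running in.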

Consider an illustrative example of a plan for a ride-request feature:

spec:
  selectionHint: "End-to-end ride-request check for HotROD: pick pickup +
    dropoff in the React app, request a ride, assert the resulting
    itinerary shows both location names."
  steps:
  - id: e2e_ride
    action: { actionID: <playwright-action-id> }
    args:
      values:
        script: |
          test('itinerary shows pickup and dropoff', async ({ page }) => {
            await page.goto(process.env.BASE_URL + '/');
            await page.getByRole('button', { name: 'Request Ride' }).click();
            await expect(page.locator('.itinerary')).toContainText("Rachel's Floral Designs");
          });

This plan pairs a clear selectionHint with a single step whose action (a Playwright-based one, judging by the placeholder ID) runs one specific end-to-end scenario. The agent can author or select this plan based on the code changes it has made, so that only relevant validation is triggered.

The Inner Loop Advantage: Immediate, Real-World Feedback

The efficacy of plans for the inner loop stems from their inherent design. Each plan encapsulates a single, end-to-end user-visible behavior, rather than an entire pipeline or test suite. This focused scope makes them small enough to integrate seamlessly into an agent’s working session. Furthermore, the selectionHint mechanism enables agents to precisely choose the most appropriate plan for a given code modification.

For instance, a change impacting the ride-request functionality would trigger only the one or two plans directly relevant to that behavior, bypassing unrelated tests concerning billing or authentication. Crucially, these plans execute against a real integration environment, delivering a meaningful verdict within seconds, often before a pull request is even initiated. This dramatically accelerates the feedback loop, allowing agents to correct issues while the context is still fresh.
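
How an agent matches a change to a plan is not specified in the source; in practice the agent would probably reason over the hints in natural language. As a crude stand-in, the sketch below ranks plans by keyword overlap between a change summary and each selection hint. The function names, the stopword list, and the default limit of two plans are all illustrative assumptions.

```python
import re

# Small, arbitrary stopword list chosen for this illustration.
STOPWORDS = {"the", "a", "an", "in", "for", "and", "of", "to", "end"}


def tokens(text: str) -> set:
    """Lowercase word tokens, split on non-alphanumerics, minus stopwords."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t} - STOPWORDS


def select_plans(change_summary: str, hints: dict, limit: int = 2) -> list:
    """Return up to `limit` plan names whose hints overlap the change most."""
    change = tokens(change_summary)
    scored = [(len(change & tokens(hint)), name) for name, hint in hints.items()]
    # Drop plans with no overlap; break ties alphabetically for determinism.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:limit]]
```

Plans with no overlap are dropped rather than ranked last, which mirrors the point of the paragraph above: unrelated billing or authentication checks should not run at all for a ride-request change.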

Practical Implementation: The HotROD Example

To illustrate this process in practice, consider the HotROD rideshare demo application, a Kubernetes-based system comprising four Go services simulating a ride-hailing backend. This application, augmented with dependencies like Redis, MySQL, and Kafka, provides a realistic environment for testing complex interactions.

Imagine an agent performing a seemingly minor refactor: renaming a Go struct field from Name to LocationName within the location service. This change, while potentially compiling and passing unit tests, could easily introduce integration bugs that are invisible at this granular level.

Before submitting a pull request, the agent would match its change against the selectionHint of each available plan and retrieve the relevant ride-request plan from the catalog. This plan, likely employing Playwright, would then execute a simulated booking flow within an ephemeral environment that includes the modified location service.

The plan would fail. The frontend, still expecting the old Name field, would encounter a missing data point, resulting in an empty itinerary and a failed assertion. The agent would receive a structured report detailing the exact assertion that failed and capturing relevant diagnostic information.
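
The failure mode is easy to reproduce in miniature. The sketch below simulates the producer and the consumer in a few lines, assuming the struct field is serialized to JSON under its own name so that the rename changes the wire format. The function names and the `ID` value are illustrative and are not taken from the HotROD code.

```python
import json


def location_service_response(renamed: bool) -> str:
    """Producer: serializes a location; `renamed` mimics the field rename."""
    key = "LocationName" if renamed else "Name"
    return json.dumps({"ID": 123, key: "Rachel's Floral Designs"})


def render_itinerary(payload: str) -> str:
    """Consumer: a frontend that still reads the old `Name` key."""
    location = json.loads(payload)
    return f"Pickup: {location.get('Name', '')}"


def producer_unit_test(renamed: bool) -> bool:
    """A unit test scoped to the producer only checks its own output shape."""
    body = json.loads(location_service_response(renamed))
    expected_key = "LocationName" if renamed else "Name"
    return body[expected_key] == "Rachel's Floral Designs"


def integration_check(renamed: bool) -> bool:
    """Plan-style assertion: the rendered itinerary shows the location name."""
    rendered = render_itinerary(location_service_response(renamed))
    return "Rachel's Floral Designs" in rendered
```

The producer's unit test passes before and after the rename because it is updated together with the field. Only the check that crosses the service boundary notices that the consumer now renders an empty name.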

Upon receiving this feedback, the agent could immediately trace the failure to the API contract. It would then identify the frontend as the affected consumer, edit the necessary four files within the frontend code, and re-run the plan. This time, the plan would pass, confirming that the fix has propagated correctly to the consumer. The resulting pull request would then land on the reviewer’s desk already validated against the real cluster, with an associated environment URL and an auditable plan run.

Transforming the Software Development Lifecycle (SDLC)

This paradigm shift fundamentally alters the Software Development Lifecycle (SDLC). Integration tests, once relegated to the outer loop of CI, now migrate to the agent’s session, executed against a live cluster even before a pull request is created. Bugs that historically surfaced only in staging environments – issues that were locally correct but systemically broken – are now caught and rectified within the agent’s immediate feedback loop. Staging environments can consequently revert to their intended role as a final sanity check, rather than serving as the primary discovery mechanism for integration defects.

The engineer driving the agent experiences this transformation most acutely. Validation, previously a waiting game with CI, now arrives directly within the authoring session. Integration failures are surfaced while the engineer and the agent are actively collaborating on the code, rather than after the development cycle has seemingly concluded. Subsequent reviews, whether by the engineer themselves or a teammate, shift from a proxy for validation to a review of actual behavior. What reaches human eyes is a diff that has already passed against the real cluster, accompanied by an environment URL and an auditable plan run.

Over time, teams will accumulate a versioned library of these plans. Each plan becomes a precise definition of "correctness" for a specific system behavior. This growing library serves a dual purpose: it acts as the reference against which agents validate their work, and it provides human-readable documentation of the system’s intended functionality.

"As agents compress code generation into seconds at one end of the pipeline, leaving integration validation pinned to the outer loop turns CI into a growing backlog rather than a feedback mechanism," the analysis argues. Moving validation into the inner loop means the verdict on a change arrives while the agent is still working on it. Otherwise, the gap between generated code and shipped code will continue to widen for cloud-native teams, undermining the very agility that these advanced tools promise.

Agent Skills: The Delivery Mechanism for Modern Validation

The practical implementation of these primitives relies on "agent skills." These are scoped, loadable instructions that modern development harnesses, such as Claude Code, Cursor, and Codex, can integrate as first-class extensions. Skills are a natural fit for several reasons:

Agent-Centric Integration: They reside within the agent’s operational environment, enabling seamless integration with existing workflows.
Scoped Functionality: By design, skills are narrowly focused. Authoring a plan and running a plan are distinct tasks, and splitting them ensures each skill remains small and unambiguous, allowing the agent to select the appropriate tool without confusion.

The plan-based validation workflow is delivered through two distinct skills, bifurcated by the nature of the work:

signadot-plan: This skill handles the authorship of plans. Developers or agents can describe the behavior they wish to validate, and the skill will generate a draft plan, composed from the action catalog, ready for review and integration into the codebase.
signadot-validate: This skill acts as the runner. It analyzes code diffs, identifies the most relevant plan using selection hints, provisions an ephemeral environment, executes the plan against the live cluster, and surfaces any failures that the agent can then act upon.
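
The runner's sequence (select a plan for the diff, provision an environment, execute, surface failures) can be outlined as below. This is a sketch of the control flow only and does not reflect how the signadot-validate skill is implemented: `Verdict`, `validate`, and the injected `select`, `provision`, and `run_plan` callables are hypothetical, and environment teardown is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Structured result an agent can act on."""
    plan: Optional[str]
    env_url: Optional[str]
    failures: list

    @property
    def passed(self) -> bool:
        # "No plan matched" is never treated as a pass.
        return self.plan is not None and not self.failures


def validate(diff: str,
             select: Callable[[str], Optional[str]],
             provision: Callable[[], str],
             run_plan: Callable[[str, str], list]) -> Verdict:
    """Select a plan for the diff, run it in a fresh environment, report."""
    plan = select(diff)
    if plan is None:
        # Skip provisioning entirely when nothing relevant can be run.
        return Verdict(None, None, ["no plan matched the diff"])
    env_url = provision()
    return Verdict(plan, env_url, run_plan(plan, env_url))
```

Returning the environment URL alongside the failures is what would let the resulting pull request carry both the verdict and a pointer to where it was produced.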

This modular approach ensures that each skill is optimized for its specific function, simplifying agent interaction. A quickstart guide is available to walk users through this plan-based validation process end-to-end, demonstrating its practical application.

While these skills currently operate within the agent’s session to address the immediate bottleneck, the underlying primitives are extensible. The same plans can be seamlessly integrated into CI pipelines, running against the same ephemeral environments as a safety net for any validation gaps missed in the inner-loop checks. Further details on this extended functionality are anticipated.
