Greptile, Cursor, and Devin agree that agents should run their code. What they run it against matters.

Agentic development is converging on a shared conclusion: deploying AI-generated code at scale requires runtime verification. Code review and unit tests against mock data are no longer treated as sufficient, because whether generated code actually works can only be determined by executing it in a realistic environment. Tool vendors are building this step into the agent’s development loop so that agents can validate their own work before a human gets involved.

This shift addresses a bottleneck: the volume of code that AI agents can produce. Once agents open pull requests faster than humans can review them, review itself becomes the primary constraint. Stripe’s internal agents, for example, reportedly ship over 1,000 reviewed pull requests per week. To sustain that pace without overwhelming senior engineers, agents must be able to run their code, interpret failures, and iterate on fixes autonomously.

The Convergence on Runtime Verification

The clearest sign of this movement is that agent development tools increasingly execute the code they produce. Static analysis offers useful insight into code structure and potential issues, but it cannot predict actual runtime behavior. Race conditions that depend on concurrent requests, or regressions that appear only after a user interface renders, remain largely invisible to static checks.

The trend is visible across the leading agent platforms. Greptile recently unveiled TREX, a feature that executes each code change in a disposable, sandboxed environment and produces artifacts such as logs, traces, and screenshots as a record of how the code behaved. Cursor’s cloud agents clone repositories into dedicated virtual machines to build and test them. OpenAI’s Codex Cloud takes an analogous approach. Devin works inside a full environment that includes its own shell and test runner. The implementations differ, but the objective is the same: give agents a dedicated place to execute and validate their code before it reaches human reviewers.

This emphasis on runtime verification closes a gap in many agentic architectures. With execution built into the agent’s workflow, teams can keep iterating quickly even as the volume of AI-generated code grows. The open question is how much fidelity the runtime environment needs before a code change can be considered verified.

The Limitations for Cloud-Native Architectures

Most of these tools give each agent an environment that mirrors a developer’s local setup: the service being changed, its direct dependencies, and mock implementations of external systems. That is a significant improvement over purely static analysis, but it is a critical limitation for teams building complex, cloud-native applications.

Cloud-native architectures are distributed, made up of many interconnected services. In such systems, an agent’s change can work correctly in isolation inside its sandbox and still fail when it interacts with the broader system. Mocks, by design, can only confirm assumptions that were already made; they cannot reveal unexpected behavior that emerges from the interplay between services, real-time data streams, and live user traffic.
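The point about mocks can be made concrete with a small sketch. Everything in it is hypothetical and invented for illustration: the inventory service, its response shape, and the function names. It assumes the agent believes stock is reported as a single integer, while the real dependency reports stock per warehouse.

```python
class MockInventory:
    """Stand-in written from the agent's own assumption: stock is one integer."""

    def get_stock(self, sku: str) -> dict:
        return {"sku": sku, "stock": 5}


class RealInventory:
    """Hypothetical real dependency: stock is reported per warehouse."""

    def get_stock(self, sku: str) -> dict:
        return {"sku": sku, "stock": {"east": 3, "west": 2}}


def can_fulfil(inventory, sku: str, quantity: int) -> bool:
    """The agent's change: compare the requested quantity to reported stock."""
    return inventory.get_stock(sku)["stock"] >= quantity


if __name__ == "__main__":
    # Passes: the mock encodes the same assumption as the code under test.
    print(can_fulfil(MockInventory(), "sku-1", 2))

    # Fails: comparing a dict with an int raises TypeError at runtime.
    try:
        can_fulfil(RealInventory(), "sku-1", 2)
    except TypeError as exc:
        print(f"real dependency disagrees with the mock: {exc}")
```

The mock-backed check is green because the mock and the change share one author and one assumption. Only the call against the real response shape exposes the mismatch.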

These integration points are precisely where the most costly and challenging bugs tend to reside. Issues related to inter-service communication, data consistency across distributed databases, or performance degradations under load are unlikely to be revealed by unit tests or mock interactions. Instead, they surface during integration tests, end-to-end testing, and other system-level validations that require the code to run against the actual, operational system.

Furthermore, non-functional requirements such as performance, scalability, and security are equally susceptible to being overlooked in isolated testing environments. Load regressions, resource contention, and subtle runtime security vulnerabilities can only be accurately assessed when the code operates within a realistic system context. A simple sandbox, by definition, cannot replicate the complexity and dynamic nature of a production cloud-native environment.

The intuitive solution, replicating the entire production system for each agent’s verification environment, is impractical. Deploying and managing dozens of stateful services, complete with their data and configuration, for every agent iteration would be a prohibitive operational burden, especially at the rate agents generate code. And even if a copy were feasible, it would be a static snapshot rather than the evolving reality of a live system.

An Architecture for Comprehensive System Verification

The way forward is to stop giving each agent its own isolated copy of the system and instead let all agents verify against a single, shared, production-like environment, with strong isolation between their testing activities.

The pattern starts with a shared cluster that hosts the real dependencies and services and behaves like production. To verify a code change, only the modified service is deployed into this baseline environment. Request-level isolation then routes each agent’s traffic exclusively to that agent’s version of the service. The agent’s requests reach its deployed service, which in turn communicates with the real services underneath. All other agents’ traffic stays segregated and unaffected, running against the stable baseline.
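A minimal sketch of the routing decision described above, under stated assumptions: the header name `x-routing-key`, the service names, and the addresses are all hypothetical, and real platforms implement this in a proxy or service mesh rather than in application code. The sketch also assumes every service forwards the header on its downstream calls, since the routing key must travel with the request for the override to apply deeper in the call chain.

```python
from typing import Mapping

# Hypothetical header carrying the agent's routing key.
ROUTING_HEADER = "x-routing-key"

# Shared baseline: one stable deployment of every service (hypothetical names).
BASELINE = {
    "checkout": "checkout.baseline:8080",
    "payments": "payments.baseline:8080",
}

# Per-agent overrides: routing key -> {service -> sandboxed deployment}.
# Each agent deploys only the service it modified.
SANDBOXES = {
    "agent-42": {"checkout": "checkout.agent-42:8080"},
}


def resolve(service: str, headers: Mapping[str, str]) -> str:
    """Pick the deployment that should receive this request.

    Requests tagged with a routing key go to that agent's version of a
    service if one exists; everything else falls through to the baseline.
    """
    key = headers.get(ROUTING_HEADER)
    overrides = SANDBOXES.get(key, {})
    return overrides.get(service, BASELINE[service])


if __name__ == "__main__":
    tagged = {ROUTING_HEADER: "agent-42"}
    print(resolve("checkout", tagged))  # the agent's modified service
    print(resolve("payments", tagged))  # unmodified dependency: baseline
    print(resolve("checkout", {}))      # untagged traffic: baseline
```

The design choice worth noting is that isolation lives in the request, not in the infrastructure: an agent that changed only one service adds only one entry to the override table, and untagged traffic never sees it.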

This approach provides agents with a realistic runtime context, allowing their changes to interact with live services, data, and policies. This direct interaction makes system-level behaviors and integration dynamics observable, moving beyond the limitations of mocked interactions.

This model also scales with the way agents operate. Because each verification layers a single modified service onto the existing system instead of provisioning an entire replica, many agents can verify concurrently and cost-effectively without interfering with one another. The verification environment is lightweight and ephemeral, and it reuses infrastructure that teams already manage. Platforms like Signadot are pursuing this model; Signadot currently focuses on Kubernetes-based implementations, with the expectation that similar principles will apply as agent tooling matures on other platforms.

This architecture fits existing agent workflows. The agent writes code, executes it, analyzes any failures, and iterates, all before the code is submitted for human review. What changes is the standard for a passing verification. Instead of reaching a "green" status against mocks built to match the agent’s own assumptions, the code must now demonstrate correctness against the actual, complex system it is intended to serve.
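The write, run, analyze, iterate cycle can be sketched as a short loop. This is an illustrative outline, not any vendor’s implementation: `run_checks` and `revise` are hypothetical callables standing in for "execute the change against the shared environment" and "ask the agent for a fix", and the attempt limit is an arbitrary assumption.

```python
from typing import Callable, Optional


def verify_loop(
    change: str,
    run_checks: Callable[[str], Optional[str]],
    revise: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Run a change, feed failures back to the agent, and retry.

    run_checks returns None on success or a failure description.
    revise takes the change and the failure and returns a new change.
    Returns the verified change, or None if attempts run out.
    """
    for _ in range(max_attempts):
        failure = run_checks(change)
        if failure is None:
            return change  # verified: ready for human review
        change = revise(change, failure)
    return None  # still failing: escalate instead of opening a PR


if __name__ == "__main__":
    # Toy stand-ins: a change passes once it contains the word "fixed".
    def checks(c: str) -> Optional[str]:
        return None if "fixed" in c else "integration test failed"

    def fix(c: str, failure: str) -> str:
        return c + "+fixed"

    print(verify_loop("v1", checks, fix))
```

The loop itself is the same whether `run_checks` targets mocks or the shared environment; only what counts as a pass differs.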

The Trend is Sound, the Scope is the Challenge

The growing adoption of runtime verification is a fundamental and positive change in how software is delivered. Leading teams increasingly see the agent’s job not as code generation alone but as a continuous cycle of writing, proving, and debugging. That cycle is what makes AI-generated code reliable.

The pivotal question for organizations running cloud-native systems is how far this verification reaches. Isolated, per-agent environments are sufficient for validating many individual code changes, but they fall short when a change’s success depends on its interaction with the broader system. Proving that a change is correct inside a distributed application requires integration and system-level verification against a runtime that accurately represents the whole system, not a simplified stand-in.

The teams that benefit most from background agents will be those whose verification loops cover the entire system rather than only the service being modified.