The Endurance Gap: New AI Coding Agents Face Scrutiny Over Long-Horizon Task Reliability

The promise of artificial intelligence agents capable of writing complex code has taken a significant leap forward with the introduction of new tools designed to tackle extended, multi-step development tasks. However, a growing body of research and industry observation highlights a critical challenge: the "endurance gap." This refers to the point at which AI coding agents, despite initial proficiency, begin to falter and compound errors as tasks extend beyond a certain number of steps, often leading to project failure. This challenge is now at the forefront of AI development, prompting both innovation and rigorous independent evaluation.

Xiaomi’s MiMo Code, an open-source, terminal-native harness, has entered this competitive landscape, with the company claiming it surpasses Anthropic’s Claude Code on agentic tasks exceeding 200 steps. While this claim is self-reported, derived from Xiaomi’s beta testing and developer surveys rather than independent evaluation, it underscores a pivotal area of focus: long-horizon reliability. The ability of an AI to maintain coherence and accuracy across hundreds of dependent coding operations is emerging as the new frontier, moving beyond simple code generation to complex project scaffolding and maintenance.

The Cracks in Long-Horizon Coding

The initial allure of AI coding agents lies in their ability to rapidly generate functional applications from straightforward prompts. However, the true test of their utility emerges when tasked with sustained development, involving intricate edits, iterative testing, and continuous revisions over extended periods. It is during these prolonged engagements that the inherent limitations of current AI architectures become apparent.

Developers and researchers have observed recurring failure modes that contribute to these breakdowns. These often include:

Hypothesis Lock-In: An agent forms an initial assumption about the optimal approach or solution and rigidly adheres to it, even when subsequent steps reveal it to be incorrect. This can lead to a cascade of ill-fitting patches and workarounds.
Error Compounding: Small, undetected errors early in the process are not corrected but instead become foundational for subsequent operations. As the task progresses, these initial missteps multiply, leading to a significant divergence from the intended outcome and eventual system failure.
State Management Failures: Similar to a long batch job without saved checkpoints, an agent can lose track of its progress. A crash or an unexpected interruption can force a complete restart from the beginning, rather than resuming from a stable point.

These issues have been documented by various industry players. For instance, an account from the team behind Ejentum, a development platform, noted that agentic coding failures often occur around the 30-step mark, attributing them to hypothesis lock-in and error compounding. The comparison to un-checkpointed batch jobs captures the state management problem: without a mechanism to save and restore progress, any interruption can negate all prior work.
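A rough calculation shows why small per-step error rates matter so much over long runs. The sketch below assumes that every step succeeds independently with the same probability, a simplification that does not come from the accounts above; the function name and the 99% figure are illustrative only.

```python
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every one of `steps` dependent steps succeeds.

    Assumes steps fail independently with a fixed probability. This is a
    simplification: real agents can sometimes notice and repair mistakes.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Illustrative 99% per-step success rate (an assumption, not a measurement).
    for steps in (10, 30, 200):
        p = run_success_probability(0.99, steps)
        print(f"{steps:>3} steps: {p:.1%} chance of an error-free run")
```

Under these assumptions, a 99% per-step success rate leaves roughly a 74% chance of an error-free 30-step run and roughly 13% at 200 steps, which is one way to read the drop-off that developers describe.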

Berkeley’s "Agent’s Last Exam" Sets a New Standard

To address the need for more objective and realistic assessments, researchers at the University of California, Berkeley, have developed a novel benchmark designed to evaluate AI agents on tasks that mirror real-world professional projects. The "Agent’s Last Exam," created by Dawn Song and Yiyou Sun at UC Berkeley’s RDI lab, was shaped by over 250 industry experts across 55 occupations. This benchmark is deliberately stringent, engineered to expose the limitations of AI agents rather than showcase their strengths.

Each task within "Agent’s Last Exam" is a direct conversion of a previously shipped professional project into a code-graded test. Critically, there is no human judge involved in the evaluation process. The AI agent is granted full access to a simulated graphical user interface and command line, and it must complete the task autonomously. The benchmark then solely scores the final artifact produced by the agent.
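To make "code-graded" concrete, the following is a minimal sketch of a grader that inspects only the final artifact and returns pass or fail with no human in the loop. It is not the benchmark's actual grading code: the function name, the JSON file format, and the specific checks are all invented for illustration.

```python
import json
import tempfile
from pathlib import Path


def grade_artifact(artifact: Path) -> bool:
    """Hypothetical grader: pass only if the final file meets every check.

    The checks (valid JSON object, numeric "items", "total" equal to their
    sum) are made up for this example and are not taken from the benchmark.
    """
    if not artifact.is_file():
        return False  # the agent never produced the deliverable
    try:
        data = json.loads(artifact.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False  # deliverable exists but is malformed
    if not isinstance(data, dict):
        return False
    items = data.get("items")
    if not isinstance(items, list):
        return False
    if not all(isinstance(x, (int, float)) for x in items):
        return False
    # A plausible-looking file with a wrong total still fails.
    return data.get("total") == sum(items)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "report.json"
        report.write_text(json.dumps({"items": [1, 2, 3], "total": 6}))
        print("pass" if grade_artifact(report) else "fail")
```

The point of the design is that intermediate effort earns nothing: a run that did most of the work but shipped a wrong total is scored the same as a run that produced no file at all.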

The headline finding from Berkeley’s research is stark: even the most advanced configuration tested, a combination of Codex and GPT-5.5, achieved less than 50% success on the easiest tier of tasks. On the hardest tier, the pass rate plummeted to under 10%. Mainstream agents, including Claude Code, demonstrated near-zero success rates on these more demanding assignments. This suggests that while AI agents can complete some professional tasks, the most challenging long-horizon work remains largely out of reach.

These findings emerged concurrently with Anthropic’s release of Fable 5, which was accompanied by marketing emphasizing its "job-ready" capabilities. Berkeley’s benchmark provided a quantitative counterpoint, directly questioning the readiness of current agents for complex, sustained development. The true value of the exam lies in its focus on the finished product, differentiating it from performance metrics based on demos or intermediate progress. A model might excel in short coding challenges but fail to deliver a complete or correct artifact in longer, more complex scenarios – a failure mode that a code-graded benchmark is designed to detect.

The Evolving Harness Layer in AI Coding

The critical layer enabling AI agents to manage complex, multi-step tasks is the "harness." This is the architectural component responsible for maintaining the task’s state, pacing the execution, and making strategic decisions about the next course of action. Current development in this area is exploring three primary approaches:

Nested Sub-Agents (Claude Code): Anthropic’s Claude Code employs a hierarchical structure of nested sub-agents, capped at five levels deep. A frontier model handles high-level planning, while specialized, more cost-effective sub-agents execute tasks and spawn their own assistants as needed. This modular approach aims to distribute complexity and leverage specialized AI capabilities.
Coordinator and Executor Model (Arbor): Researchers from Renmin University of China developed Arbor, which utilizes a long-lived coordinator agent paired with short-lived executor agents. A key innovation in Arbor is its persistent hypothesis tree, which meticulously checkpoints progress, allowing for seamless resumption of tasks even after interruptions. This approach prioritizes robust state management and fault tolerance.
Terminal-Native Harness (MiMo Code): Xiaomi’s MiMo Code addresses the long-horizon reliability challenge from an open-source perspective. It is a terminal-native harness specifically tuned to optimize performance for task runs extending beyond 200 steps, aiming to provide a stable and efficient environment for extended coding operations.
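The depth cap described for the nested sub-agent approach can be pictured with a toy delegation tree. The sketch below is not Claude Code's implementation or API: the Agent class, the spawn method, and the choice to count the top-level planner as level 1 are assumptions made for illustration. Only the cap of five comes from the description above.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 5  # cap taken from the description above; the rest is hypothetical


@dataclass
class Agent:
    """Toy stand-in for an agent that can delegate to deeper sub-agents."""

    task: str
    depth: int = 1  # assumption: the top-level planner counts as level 1
    children: list["Agent"] = field(default_factory=list)

    def spawn(self, subtask: str) -> "Agent":
        """Create a sub-agent one level deeper, refusing to exceed the cap."""
        if self.depth >= MAX_DEPTH:
            raise RuntimeError(
                f"depth cap of {MAX_DEPTH} reached; cannot delegate {subtask!r}"
            )
        child = Agent(task=subtask, depth=self.depth + 1)
        self.children.append(child)
        return child


if __name__ == "__main__":
    agent = Agent(task="plan the refactor")
    for level in range(2, MAX_DEPTH + 1):
        agent = agent.spawn(f"subtask at level {level}")
    print(agent.depth)  # deepest allowed level
    try:
        agent.spawn("one level too deep")
    except RuntimeError as err:
        print(err)
```

A hard cap like this bounds how far work can fan out, at the cost of forcing the deepest agent to finish its subtask itself rather than delegating further.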

These approaches are not mutually exclusive, and their effectiveness can vary significantly. The evidence behind each also differs in how far it has been independently verified; Xiaomi’s 200-step figure, for example, is self-reported.

A common thread among these advanced harnesses is the explicit externalization of state. Arbor, in particular, emphasizes the hypothesis tree as the primary mechanism for long-term survival, rather than solely relying on the context window of the language model. This design philosophy draws parallels with the principles of durable workflow engines developed years ago, which underscored the necessity of checkpointing any process that cannot withstand failure. The core principle remains: whatever a system cannot reliably save and restore will likely need to be redone in the event of an interruption.
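The checkpointing principle can be shown in a few lines. The sketch below is a generic illustration of externalized state, not Arbor's hypothesis tree or any vendor's harness: the function names, the JSON checkpoint file, and the idea of modeling each step as a function that mutates a shared dictionary are all assumptions.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List


def _atomic_write(path: Path, payload: dict) -> None:
    """Write to a temp file and rename it over the target, so an
    interrupted write does not corrupt the previous checkpoint."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, path)


def run_with_checkpoints(
    steps: List[Callable[[dict], None]], checkpoint: Path
) -> dict:
    """Run steps in order, saving state to disk after each one.

    On restart, completed steps are skipped and work resumes from the
    last saved state instead of from the beginning.
    """
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text(encoding="utf-8"))
    else:
        saved = {"next_step": 0, "state": {}}

    state = saved["state"]
    for index in range(saved["next_step"], len(steps)):
        steps[index](state)  # each step mutates the shared state dict
        _atomic_write(checkpoint, {"next_step": index + 1, "state": state})
    return state


if __name__ == "__main__":
    def scaffold(state: dict) -> None:
        state["scaffolded"] = True

    def add_tests(state: dict) -> None:
        state["tests"] = 12

    with tempfile.TemporaryDirectory() as tmp:
        result = run_with_checkpoints(
            [scaffold, add_tests], Path(tmp) / "checkpoint.json"
        )
        print(result)
```

If the process dies during the second step, rerunning the same call with the same checkpoint path skips the first step and retries only the second. The state must be JSON-serializable here, which mirrors the broader point: only what is written outside the model's context window survives an interruption.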

The Significance of the Endurance Gap for Enterprises

For enterprises seeking to integrate AI agents into their development pipelines, the endurance gap is not merely an academic curiosity but a critical procurement consideration. An AI agent that fails silently at step 30 of a production task, returning a plausible but flawed artifact based on an early incorrect assumption, can lead to significant downstream costs. These manifest as costly rework, elusive silent defects, and an erosion of trust in the development process, necessitating human oversight and extensive debugging.

Berkeley’s benchmark provides a tangible, code-graded floor for evaluating how far an AI agent can reliably carry out a task before human intervention becomes essential. This allows organizations to treat "endurance" as a distinct line item in their procurement evaluations. Instead of relying solely on coding leaderboard scores, which may not reflect long-horizon performance, buyers can now demand evidence of an agent’s ability to sustain complex tasks.

When evaluating potential AI coding agents, teams should inquire about:

State Management: How effectively does the candidate harness maintain task state across extended runs?
Resumption Capabilities: Can the agent reliably resume from a checkpoint after an interruption?
Published Endurance Ceilings: What are the agent’s own published metrics regarding the maximum number of steps it can successfully complete?
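One way a team might turn these questions into a number is to log, for each trial run, the step at which a human first had to intervene, then compute how many steps a chosen share of runs completed cleanly. The metric below is a hypothetical definition for illustration, not a published standard; the function name, the 90% default, and the input format are assumptions.

```python
from typing import List, Optional


def reliable_step_count(
    first_intervention: List[Optional[int]],
    run_length: int,
    required_rate: float = 0.9,
) -> int:
    """Largest step count that at least `required_rate` of runs completed
    without human intervention.

    Each entry is the 1-based step at which a human first had to step in,
    or None if the run finished all `run_length` steps unaided.
    """
    if not first_intervention:
        raise ValueError("need at least one run")
    # A run interrupted at step k completed k - 1 clean steps.
    clean_steps = [
        run_length if step is None else step - 1 for step in first_intervention
    ]
    best = 0
    for n in range(1, run_length + 1):
        survivors = sum(1 for clean in clean_steps if clean >= n)
        if survivors / len(clean_steps) >= required_rate:
            best = n
        else:
            break  # survival only falls as n grows, so stop at the first miss
    return best


if __name__ == "__main__":
    # Four hypothetical 200-step runs; one needed help at step 31.
    runs = [None, None, None, 31]
    print(reliable_step_count(runs, run_length=200))
    print(reliable_step_count(runs, run_length=200, required_rate=0.75))
```

With the sample data, the 90% threshold yields 30 steps, while relaxing it to 75% yields the full 200. The gap between those two answers is the reason a buyer would want the required rate stated alongside any endurance figure.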

Some vendors may focus on model benchmarks, often omitting crucial details about long-horizon behavior. This can lead to misaligned expectations, where buyers are seeking solutions for extended tasks but are presented with data that answers a different question.

It is important to weigh both Xiaomi’s 200-step claim and Berkeley’s challenging benchmark with care. Xiaomi’s claim may yet hold up under independent testing. Conversely, Berkeley’s most difficult tiers might represent scenarios that no team would realistically assign to an unsupervised agent. As AI models continue to improve their coherence over longer spans, the "endurance gap" may naturally narrow. However, this does not diminish the current reality: an AI agent’s finished output is often a more accurate reflection of its capabilities than any demonstration.

The Road Ahead: Verifiable Performance and Procurement Shifts

For developers and IT decision-makers monitoring the evolving AI tool landscape, the focus is shifting away from headline model performance toward two questions: how long an agent can stay coherent, and whether its claimed performance has been independently verified. MiMo Code, Arbor, and Claude Code’s sub-agent architecture represent early entrants in a contest that the field is only beginning to quantify.

The crucial next step will be independent verification of claims regarding long-horizon capabilities. Whether benchmarks like "Agent’s Last Exam" corroborate or challenge the endurance claims of various agents will be a key determinant of market trust. Furthermore, the hardening of harness technology within the commercial AI coding agent market is already underway.

When this evolving landscape matures and the first truly contested leaderboards emerge, the most critical figure for buyers will be the step count an agent can reliably complete before human intervention is required. This number will transform vendor promises into concrete procurement requirements, ensuring that the tools being adopted are genuinely capable of supporting the complex, long-term development workflows that drive modern software delivery. Ultimately, the leaderboard that truly matters is the one graded on shipped work, not self-reported wins, and this discipline promises to benefit every team integrating AI into their delivery pipelines.