In mid-December, a solo developer embarked on the ambitious project of building KubeStellar Console from scratch: a multi-cluster management dashboard for Kubernetes, now housed in the Cloud Native Computing Foundation (CNCF) Sandbox as part of the KubeStellar project. The technology stack comprises Go for the backend, React and TypeScript for the frontend, and Helm for packaging. The project was started not by a team, but by one individual collaborating with two AI coding agents running in parallel terminal sessions.
The initial two weeks of this project were characterized by a period of remarkable productivity, often described as a "honeymoon phase" by those who have experimented with AI coding assistants. Code generated by the agents emerged at an unprecedented pace, with features that might typically require days of human effort being delivered in mere hours. The developer reported a rapid implementation of a pre-existing mental wishlist of functionalities, showcasing the perceived efficiency gains of AI assistance.
However, this initial euphoria soon gave way to a stark realization. Builds began to fail in increasingly complex and difficult-to-diagnose ways. Architectural decisions made in previous iterations were silently overwritten, and project scope expanded organically without explicit instruction. A persistent issue arose where the AI agent would modify files beyond its designated scope, leading to a cascade of problems. The common experience of fixing one issue only to trigger three others became the norm, prompting a shift from code review to extensive code reversion. The anticipated tenfold increase in productivity began to feel like a net negative, ultimately leading to the decision to abandon the initial approach and re-evaluate the strategy.
This arc, from initial elation to profound frustration, appears to be a common experience in the field of AI-assisted software development. The prevailing industry advice often suggests granting AI agents more autonomy – allowing them to run for longer periods, modify a greater number of files, and engage in self-correction. However, in this developer’s experience, this approach tends to exacerbate failure modes rather than mitigate them. The crucial insight is that the true leverage of AI assistance lies not solely within the model itself, but within the surrounding codebase that orchestrates and constrains the AI’s operations. To enable an AI agent to perform more effectively, the surrounding code must possess a greater capacity for measurement and feedback.
Over the subsequent four months, KubeStellar Console has undergone a significant transformation, reaching a more robust and mature state. The project now boasts 63 CI/CD workflows and 32 nightly test suites, achieving an impressive 91% code coverage across twelve parallel shards. Over an 82-day period, the pull request acceptance rate stabilized at approximately 81%. A testament to the improved system’s efficiency, community-reported bugs are now being addressed with merged fixes in roughly thirty minutes, and feature requests are being integrated as pull requests in about an hour. Critically, this advancement was not a result of a superior AI model, but rather a consequence of the codebase itself learning to measure and adapt.
This maturation was achieved through the implementation of five distinct "tightening loops," conceptualized as the rungs of an "AI Codebase Maturity Model": Assisted, Instructed, Measured, Adaptive, and Self-Sustaining. These loops were implemented and evolved in a specific order, reflecting a dependency on preceding stages.
1. Externalizing Preferences: The "Instructed" Stage
The initial and most cost-effective intervention involved externalizing the developer’s own coding preferences and criteria. This began with the creation of a CLAUDE.md file at the repository’s root, followed by a .github/copilot-instructions.md file detailing pull request conventions. A more granular development guide was subsequently introduced, cataloging the primary reasons for rejecting AI-generated pull requests. This single guide ultimately encompassed approximately 90% of rejection criteria, leading to more consistent agent behavior and a significant reduction in recurring mistakes across different AI models. While not yet a formal measurement, this step filtered out sufficient noise to pave the way for effective measurement.
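What such a guide contains is easiest to see by example. The excerpt below is a hedged sketch of the kind of rules involved, assembled from points made elsewhere in this article; it is not the actual content of the KubeStellar Console files.

```markdown
# Development guide (illustrative excerpt)

## Most common reasons an AI-generated PR is rejected
- It touches files outside the scope named in the task.
- It adds a new dependency without a justification in the PR description.
- It changes public behavior without updating the corresponding tests and docs.
- Its end-to-end tests rely on fixed sleeps instead of waiting on observable state.

## Conventions
- Backend is Go; frontend is React and TypeScript; packaging is Helm.
- Every bug fix ships a regression test in the same PR.
```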
2. Establishing the Trust Layer: The "Measured" Stage
A pivotal shift occurred when tests were re-conceptualized not merely as a mechanism for ensuring correctness, but as the fundamental "trust layer" for autonomous workflows. In an AI-driven development process, tests serve as the sole signal for the agent to ascertain whether its actions are improving or degrading the system.
Over a four-week period, 32 nightly test suites were implemented, pushing code coverage to 91% across twelve parallel shards. These suites encompassed a broad spectrum of quality checks, including compliance, performance, nil safety, accessibility, internationalization, and visual regression. Concurrently, an auto-qa-tuning.json file was introduced to log PR acceptance rates per category. This file proved to be foundational for all subsequent developments.
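The article does not show the file’s layout, so the shape below is an assumption. A minimal TypeScript model of a per-category acceptance log, seeded with the two categories whose figures appear in the next stage, might look like this:

```typescript
// Assumed shape for auto-qa-tuning.json; field names are illustrative, not the project's actual schema.
interface CategoryStats {
  merged: number;         // PRs in this category that were merged
  closed: number;         // PRs closed without merging
  acceptanceRate: number; // merged / (merged + closed)
  weight: number;         // rotation weight used when scheduling new work in this category
}

type AutoQaTuning = Record<string, CategoryStats>;

const tuning: AutoQaTuning = {
  // Accessibility counts are placeholders consistent with the quoted 62% acceptance rate.
  accessibility: { merged: 45, closed: 28, acceptanceRate: 0.62, weight: 0.93 },
  // Operator figures are taken from the article: 11 merges against 129 closed, roughly 8%.
  operator: { merged: 11, closed: 129, acceptanceRate: 0.08, weight: 0 },
};
```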
While test coverage volume and breadth are important, the most critical factor, and the one that nearly derailed the project, is determinism. A flaky test in a human-driven workflow might be a minor annoyance, but in an autonomous system where test results gate code merges, it represents a slow and insidious erosion of the entire trust model. For instance, a Playwright end-to-end test for drag-and-drop functionality that passed only 85% of the time, while tolerable for human developers, became a significant impediment in the autonomous workflow. This unreliability led to random blocking of valid pull requests and the inadvertent approval of weaker ones. A three-day effort was dedicated to resolving this single test, which ultimately traced back to an animation-completion timing issue within the CI environment. The generalized lesson was clear: autonomous systems cannot be built upon unreliable signals.
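The article does not include the fix itself. As a general illustration of the technique, the sketch below (with hypothetical selectors, not KubeStellar Console’s) replaces a fixed delay with an explicit wait on the state the animation produces, which is the standard way to make this kind of Playwright test deterministic in CI:

```typescript
import { test, expect } from "@playwright/test";

test("card can be dragged onto the target column", async ({ page }) => {
  await page.goto("/board");

  const card = page.getByTestId("card-42");
  const target = page.getByTestId("column-done");

  await card.dragTo(target);

  // Flaky version: `await page.waitForTimeout(500)` and hope the move animation has finished.
  // Deterministic version: assert on the observable end state instead of a fixed delay.
  await expect(target.getByTestId("card-42")).toBeVisible();
  await expect(card).toHaveAttribute("data-dragging", "false");
});
```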
3. Measurement Precedes Automation: The "Adaptive" Stage
With acceptance rates reliably logged, the process of automation became significantly safer. The Auto-QA system began executing four times daily, performing checks across eight distinct quality assurance layers. The system dynamically adjusted rotation weights, which dictate the focus areas for development, based on incoming data. For example, when accessibility PRs exhibited a 62% acceptance rate, their weight was increased to 0.93. Conversely, operator-category PRs, with an 8% acceptance rate (11 merges against 129 closed), had their weight reduced to zero, redirecting CI cycles.
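The article does not spell out the weighting rule, so the policy below is an assumption: it cuts off categories whose acceptance rate falls under a floor and otherwise scales the weight up from the observed rate, which happens to reproduce the two adjustments quoted above.

```typescript
// Illustrative rotation-weight policy; the thresholds are assumptions, not the project's actual values.
const FLOOR = 0.2; // categories accepted less often than this stop receiving CI cycles
const BOOST = 1.5; // productive categories are weighted above their raw acceptance rate

function rotationWeight(acceptanceRate: number): number {
  if (acceptanceRate < FLOOR) return 0;       // operator PRs at 8% -> weight 0
  return Math.min(1, acceptanceRate * BOOST); // accessibility PRs at 62% -> 0.93
}
```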
Several additional feedback loops were established around this core mechanism:
- Automated Refactoring: Agents were tasked with identifying and refactoring code based on established patterns and metrics, ensuring consistency and maintainability.
- Automated Documentation Updates: When code changes were made, agents were instructed to automatically update relevant documentation, keeping it synchronized with the codebase.
- Automated Security Scanning: Regular security scans were integrated into the CI pipeline, with agents flagging and potentially addressing vulnerabilities.
- Automated Dependency Management: The system was designed to monitor and manage project dependencies, alerting to outdated or insecure packages (a minimal sketch of this loop follows the list).
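As one concrete example, the dependency loop can be as small as a script in the nightly workflow. The sketch below is illustrative rather than the project’s actual implementation: it reads `npm outdated --json` and fails the job when a dependency has fallen a major version behind, so an upgrade task can be queued.

```typescript
// Illustrative dependency-freshness check for the frontend packages.
import { execSync } from "node:child_process";

interface OutdatedEntry {
  current?: string;
  latest?: string;
}

function readOutdated(): Record<string, OutdatedEntry> {
  try {
    return JSON.parse(execSync("npm outdated --json", { encoding: "utf8" }) || "{}");
  } catch (err: any) {
    // `npm outdated` exits non-zero when anything is outdated; the JSON is still on stdout.
    return JSON.parse(err.stdout?.toString() || "{}");
  }
}

const major = (version: string) => parseInt(version.split(".")[0], 10);

const behind = Object.entries(readOutdated())
  .filter(([, v]) => v.current && v.latest && major(v.latest) > major(v.current))
  .map(([name, v]) => `${name}: ${v.current} -> ${v.latest}`);

if (behind.length > 0) {
  console.error("Dependencies a major version behind:\n" + behind.join("\n"));
  process.exit(1); // the surrounding workflow can then open or update a tracking issue
}
```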
The overarching principle is that automation must always follow measurement. Inverting this order, as demonstrated by the initial struggles, leads to failure at scale.
4. The Codebase as the Operating Manual: The "Self-Sustaining" Stage
At some point, hard to pinpoint exactly, the system transitioned to a state of self-sufficiency, no longer requiring constant human intervention to operate. Its behavior came to be dictated by its own artifacts: instruction files, test suites, workflow rules, and historical acceptance-rate data. Community members began submitting issues at all hours, and these issues were systematically triaged, assigned, resolved, tested, and queued for review before the developer even became aware of them.
A specific incident exemplified this shift. A user reported a bug where a cluster was marked as "healthy" despite pods being stuck in an ImagePullBackOff state. Before the developer could intervene, the system had already provided an explanation: cluster health, in this context, refers to infrastructure health (node readiness, API reachability), which is architecturally distinct from workload health. This was not a bug but a mismatch between the user’s mental model of Kubernetes and the dashboard’s presentation. The design decision was already embedded in the tests, health-check logic, and documentation, and the AI agent could articulate it because the codebase itself "understood" it. This practical manifestation of "the code is the model" underscores the implications of this stage.
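That separation is straightforward to make concrete. The sketch below uses names invented for illustration rather than taken from the KubeStellar Console code, but it shows how infrastructure health and workload health can be computed as independent signals:

```typescript
// Illustrative only: the dashboard's real types and health-check logic are not shown in the article.
interface ClusterStatus {
  nodesReady: boolean;   // every node reports Ready
  apiReachable: boolean; // the cluster's API server responds
  failingPods: number;   // pods stuck in states such as ImagePullBackOff or CrashLoopBackOff
}

// Infrastructure health: what the cluster's "healthy" badge reflects.
const infraHealthy = (s: ClusterStatus): boolean => s.nodesReady && s.apiReachable;

// Workload health: reported separately, so broken pods never flip the cluster badge.
const workloadsHealthy = (s: ClusterStatus): boolean => s.failingPods === 0;

// The reported case: infrastructure healthy, workloads not.
const reported: ClusterStatus = { nodesReady: true, apiReachable: true, failingPods: 3 };
console.log(infraHealthy(reported), workloadsHealthy(reported)); // true false
```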
5. Questioning Over Commanding: The "Why" Prompt
A single prompting habit proved disproportionately effective. Instead of issuing commands like "fix this bug," the developer began posing questions such as "Why didn’t you catch this?" The former typically yields a superficial patch, while the latter encourages a root-cause analysis. This deeper inquiry often results in the creation of a new test, instruction, or rule that proactively prevents an entire class of similar failures. Commanding leads to a series of isolated fixes, whereas questioning fosters compounding improvements. Over time, these questions are instrumental in transforming the codebase into a self-improving system and are the very mechanism by which instruction files are generated when starting from scratch.
Implications for Maintainers and Leaders
For engineering leaders, the focus should shift from optimizing the choice of AI model to strengthening the surrounding feedback systems. AI models are becoming increasingly commoditized, with replacements often requiring minimal effort. The true differentiation and long-term value lie in the "intelligence infrastructure" – the meticulously crafted instruction files, robust test suites, comprehensive metrics, and intelligent workflow rules.
For open-source maintainers, this paradigm offers a potential solution to the pervasive burnout issue frequently discussed within CNCF communities. If a codebase can sufficiently encode a maintainer’s judgment, enabling AI agents to handle triage, generate pull requests, and explain design decisions to users, the community can then steer project development primarily through issue reporting. This allows maintainers to evolve into architects of their systems rather than being consumed by their daily operations. This model is currently operational for KubeStellar Console, and its scalability beyond a solo-maintained Sandbox project remains a subject for broader community exploration.
Most teams are currently operating within the initial loop, focused on prompt writing and output review – the universal starting point. The objective is not to prematurely reach the final stage, but to identify the specific loop that is currently acting as a bottleneck and address it systematically. The codebase now encapsulates the learned knowledge, and the tests capture what cannot be held in active memory. The developer’s role, which remains essential, is to define what is worth building, to exercise judgment in saying "no," and to articulate the vision of what constitutes "good."
