At CES 2026, NVIDIA CEO Jensen Huang articulated a vision in which artificial intelligence transitions from a specialized technology to a ubiquitous force, applicable across every company and industry. That proliferation, he asserted, hinges on open innovation throughout the entire AI ecosystem. If Huang's trajectory is indeed the future, supported by the rapid advancement and widespread adoption of open-source models like DeepSeek, Llama, and Mistral, then the underlying infrastructure that powers AI development and deployment cannot remain proprietary. This shift demands a fundamental re-evaluation of how AI workloads are managed and scaled, with Kubernetes emerging as a critical, albeit still evolving, platform.
Kubernetes, the de facto standard for container orchestration, has been instrumental in managing AI workloads for a significant portion of its existence. While not initially architected with AI’s unique computational demands in mind, the Kubernetes community has consistently demonstrated ingenuity in adapting the platform. Early efforts focused on making Graphics Processing Units (GPUs), the workhorses of modern AI, manageable within the Kubernetes framework, even when core APIs offered only rudimentary support, such as a simple integer count for available GPUs. Now, as AI increasingly dominates compute resource consumption, the community is actively bridging the gap between what was merely "possible" and what constitutes "first-class" support for AI operations. This article delves into the current state of this critical infrastructure evolution.
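To make that original constraint concrete, here is roughly what GPU allocation looks like under the classic device plugin API: a pod can request only an opaque integer count of a vendor-advertised resource (the image name below is a placeholder).

```yaml
# Classic device-plugin GPU request: the scheduler sees only an integer.
# There is no way to express "half a GPU", "two GPUs on the same NVLink
# domain", or any other attribute of the underlying device.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-training-job
spec:
  containers:
  - name: trainer
    image: example.com/cuda-trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 2   # whole devices only, no further detail
```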
The Evolving Landscape of AI Hardware Management in Kubernetes
The foundational Kubernetes device plugin API, while functional for basic GPU allocation, proved insufficient as AI workloads became more sophisticated. The original API struggled to accommodate scenarios requiring nuanced resource management, such as the partitioning of a single GPU for multiple workloads, the sharing of a single physical device across several pods, or the high-speed, low-latency interconnects essential for distributed training jobs spanning multiple nodes. These limitations directly impacted the efficiency and scalability of complex AI tasks.
Dynamic Resource Allocation (DRA) marks a significant step toward addressing these challenges. DRA lets hardware vendors expose detailed, structured information about their devices through ResourceSlices, while workloads articulate their precise needs via ResourceClaims. The Kubernetes scheduler then matches claims to available devices, taking into account attributes such as device capabilities, sharing policies, and network topology. DRA reached General Availability (GA) in Kubernetes 1.34 and provides the fundamental building blocks for sophisticated AI hardware management. The next phase of development is refining the policies and mechanisms that let these primitives be used to their full potential for AI performance.
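As a sketch of how these pieces fit together (the device class, claim name, and image are illustrative, and the request shape is written against the `resource.k8s.io/v1` API that went GA in 1.34; consult the current DRA documentation for the authoritative schema), a workload might claim a GPU like this:

```yaml
# A ResourceClaim describing what the workload needs...
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com   # illustrative vendor device class
---
# ...and a Pod that references the claim instead of an integer count.
apiVersion: v1
kind: Pod
metadata:
  name: dra-training-job
spec:
  containers:
  - name: trainer
    image: example.com/trainer:latest      # placeholder image
    resources:
      claims:
      - name: gpu                          # bind this container to the claim
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu
```

The key shift from the device-plugin model is that the claim carries structured attributes the scheduler can reason about, rather than an opaque count.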
Optimizing Kubernetes for AI Workload Scheduling
The demanding nature of distributed AI training and inference necessitates advanced scheduling capabilities. Specifically, distributed training jobs often require "gang scheduling," a mechanism ensuring that all participating pods are launched simultaneously or not at all. This prevents resource deadlocks and ensures the integrity of the training process. Beyond mere availability, efficient placement of AI workloads is intrinsically linked to the cluster’s physical topology. Strategically landing pods on nodes that share high-speed network interconnects or are part of the same network spine can dramatically reduce communication overhead, a critical factor in training speed and inference latency.
In response to these needs, the KAI Scheduler, now a CNCF Sandbox project, offers a comprehensive solution. It provides DRA-aware gang scheduling, allowing for fine-grained resource control. Furthermore, it incorporates hierarchical queues with sophisticated fairness policies, ensuring equitable resource distribution. Its topology-aware placement capabilities enable intelligent decisions about where to deploy workloads, optimizing for network proximity and reducing latency. Complementing KAI Scheduler, Topograph is an open-source tool designed to discover and expose the underlying network topology of a cluster. This information is invaluable for schedulers, enabling them to make more informed placement decisions across diverse environments, including hybrid and multi-cloud deployments. Discussions surrounding the Workload API within the broader Kubernetes community are actively pushing these advanced scheduling patterns further upstream, aiming for native integration into future Kubernetes releases.
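KAI Scheduler's own CRDs are best taken from its documentation, but the gang-scheduling contract itself can be illustrated with the upstream scheduler-plugins coscheduling API, which expresses the same all-or-nothing semantics (the group name, scheduler name, and image below are illustrative):

```yaml
# A PodGroup from the sigs.k8s.io/scheduler-plugins coscheduling plugin:
# either all three workers can be placed, or none of them start.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: ddp-training
spec:
  minMember: 3               # all 3 workers must be schedulable together
  scheduleTimeoutSeconds: 60 # give up and retry rather than deadlock
---
apiVersion: v1
kind: Pod
metadata:
  name: ddp-worker-0
  labels:
    scheduling.x-k8s.io/pod-group: ddp-training  # membership label
spec:
  schedulerName: scheduler-plugins-scheduler     # coscheduling-enabled scheduler
  containers:
  - name: worker
    image: example.com/ddp-trainer:latest        # placeholder image
```

Without this guarantee, a partially scheduled training job can hold GPUs idle while waiting for peers that never arrive, which is exactly the deadlock gang scheduling exists to prevent.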
The Challenge of Serving AI Workloads in Production
Inference, the process of deploying trained AI models to generate predictions or insights, represents a growing concentration of production GPU cycles. It is also an area where Kubernetes' default assumptions often fall short. The standard Horizontal Pod Autoscaler (HPA), for instance, typically scales on CPU and memory utilization. Large Language Model (LLM) inference, however, has distinct scaling signals, such as KV cache utilization, request queue depth, and time-to-first-token. Scaling on the wrong signals leads either to underutilization of expensive GPUs or to missed latency targets, hurting both user experience and operational efficiency.
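For comparison, here is roughly what an inference-aware policy looks like when expressed through the standard `autoscaling/v2` HPA with a custom per-pod metric; the metric name and threshold are hypothetical and would have to be exported by the serving stack through a custom metrics adapter:

```yaml
# HPA driven by a per-pod inference signal instead of CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: kv_cache_utilization   # hypothetical metric from a metrics adapter
      target:
        type: AverageValue
        averageValue: "800m"         # scale out above ~80% average KV cache use
```

Even this is a partial fix: the HPA reacts after the signal moves, whereas LLM serving often needs routing-level decisions made per request, which is where gateway-level solutions come in.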
To address these inference-specific scaling challenges, the Inference Gateway project extends the Kubernetes Gateway API with model-aware routing capabilities. This allows for more intelligent traffic management tailored to the needs of AI models. Simultaneously, collaborative efforts within the llm-d and Dynamo communities are focused on developing distributed serving solutions. These initiatives explore advanced techniques such as prefix-cache-aware routing and disaggregated prefill/decode operations. Such advancements introduce novel scheduling and autoscaling demands, pushing the boundaries of current orchestration capabilities. While the foundational building blocks for these sophisticated serving architectures are emerging, the necessary abstractions to seamlessly integrate them are likely to span both core Kubernetes primitives and higher-level control planes.
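As a rough sketch of the model-aware routing pattern (field names follow the experimental `inference.networking.x-k8s.io` API and may change as the project evolves; the pool, gateway, and endpoint-picker names are placeholders), an HTTPRoute can send traffic to an InferencePool instead of a plain Service:

```yaml
# An InferencePool groups model-server pods behind a routing extension
# that picks endpoints using inference-aware signals.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool
spec:
  selector:
    app: llama-server            # pods running the model server
  targetPortNumber: 8000
  extensionRef:
    name: llama-endpoint-picker  # model-aware endpoint-selection extension
---
# Standard Gateway API route whose backend is the pool, not a Service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llama-pool
```

The design choice worth noting is that routing intelligence lives in a pluggable extension rather than in the gateway itself, which is what lets techniques like prefix-cache-aware routing evolve independently of the core Gateway API.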
The evolution of AI workloads on Kubernetes is not static; it is a continuous process of adaptation. The next wave of complexity involves orchestrating autonomous AI agents. These agents, increasingly being containerized and deployed as workloads on Kubernetes, represent a new class of compute that demands robust management and orchestration strategies.
The Imperative for Open Infrastructure in the Age of AI
The sentiment that "open-source AI doesn’t stop at the model weights" resonates deeply within the community. The underlying infrastructure required to develop, deploy, and scale these models must also embrace openness. The Kubernetes AI Conformance Program, launched at KubeCon North America in November 2025 with twelve initial certified vendors, signifies a crucial step towards standardization and interoperability. However, the most effective patterns for solving the complex challenges of AI infrastructure are often developed within organizations that are at the forefront of AI adoption. Currently, this invaluable knowledge is largely siloed within individual companies. To accelerate progress and foster broader innovation, this expertise needs to be contributed upstream, into the open-source community, where it can be shared, iterated upon, and compounded for the benefit of all.
The ongoing development within Kubernetes and its surrounding ecosystem reflects a clear recognition that the future of AI is inextricably linked to open, adaptable, and scalable infrastructure. As AI continues its rapid ascent, the principles of open innovation must extend beyond the models themselves to encompass the very foundation upon which they are built and operated.
This guest column is published in anticipation of KubeCon + CloudNativeCon Europe, the flagship conference of the Cloud Native Computing Foundation. Scheduled to take place in Amsterdam, the Netherlands, from March 23-26, 2026, the event will convene a diverse assembly of adopters, technologists, and thought leaders from leading open-source and cloud-native communities. The discussions and collaborations fostered at such events are vital for shaping the future of cloud-native technologies, particularly in the context of AI’s transformative impact.
