Where the choke points are in AI systems and what to do about them.

The rapid proliferation of artificial intelligence, particularly generative AI and large language models (LLMs), has fundamentally shifted the focus of semiconductor design from raw computational power to the efficiency of data movement. In the current architectural landscape, the primary challenge is no longer just how many floating-point operations per second (FLOPS) a processor can execute, but how effectively data can be transported across the system to feed those processing units. As AI models grow in complexity, the industry is encountering significant "choke points" that threaten to stall progress. These bottlenecks occur at the intersection of memory bandwidth, interconnect latency, and power consumption, necessitating a reevaluation of how chips and systems are designed.

Nandan Nayampally, Chief Commercial Officer at Baya Systems, recently highlighted these challenges in a discussion with Semiconductor Engineering, emphasizing that AI is essentially a data-centric problem. The industry is currently grappling with a massive influx of data that must be processed, stored, and moved with unprecedented speed. This requires sophisticated tradeoffs to ensure that systems remain efficient while maintaining the flexibility to handle future, unpredictable workloads. To address these issues, designers are increasingly turning to advanced Network on Chip (NoC) and Network across Chip (NaC) architectures to manage the flow of information within and between silicon dies.

The Architecture of Movement: Defining NoC and NaC

The transition from monolithic system-on-chip (SoC) designs to more complex, heterogeneous architectures has elevated the importance of the interconnect. A Network on Chip (NoC) serves as the internal communications framework for a single die, connecting various components such as CPU cores, GPUs, AI accelerators, and memory controllers. As chips grow larger and incorporate more specialized processing units, traditional bus or crossbar architectures become insufficient due to congestion and physical routing constraints.

A Network across Chip (NaC) extends this concept to multi-die environments, such as chiplets integrated into a single package or across multiple packages on a board. In AI systems, where a single chip may not provide enough memory or compute power, NaC becomes the critical link that allows multiple silicon components to function as a unified system. The challenge lies in ensuring that the transition from NoC to NaC is seamless, maintaining low latency and high bandwidth even as data crosses physical boundaries.

The distinction between these two layers of connectivity is vital because the "choke points" often reside at the interfaces. If the NoC is highly optimized but the NaC lacks the necessary throughput to feed it, the entire AI accelerator will experience idle cycles, wasting both time and energy.

Identifying Primary Choke Points in AI Hardware

The most pervasive bottleneck in modern AI systems is the "memory wall." While logic speeds have increased significantly over the decades, memory access speeds have not kept pace. In AI workloads, which require the constant shuffling of massive weight matrices and activation data, the inability to move data from memory to the processing elements quickly enough results in significant performance degradation.

Bandwidth Limitations: High-bandwidth memory (HBM) has emerged as a temporary solution, offering terabytes per second of throughput. However, the cost and thermal density of HBM create new challenges. Even with HBM3e, the latest standard, the appetite of AI models for data often exceeds available bandwidth.
Latency Congestion: As the number of processing elements on a chip increases, the distance data must travel grows. This increases latency, which is particularly detrimental for real-time AI applications like autonomous driving or live language translation.
Power Consumption: It is an industry axiom that moving a bit of data across a chip can consume orders of magnitude more energy than performing a mathematical operation on that bit. In data centers where power delivery and cooling are the primary limiting factors for scaling, inefficient data movement is a critical failure point.
Data Coherency Overload: In many-core systems, maintaining data coherency—ensuring that all processors have the most up-to-date version of a piece of data—requires significant overhead. If every data movement requires a "handshake" to verify coherency, the system’s effective throughput drops.

The Evolution of Interconnect Strategy: A Chronology

To understand the current state of AI system design, it is necessary to examine the evolution of interconnect technology over the last two decades.

2000s: The Bus Era: Early SoCs relied on simple bus architectures like ARM’s AMBA. These were effective for low-core-count chips where data traffic was predictable and relatively light.
2010s: Rise of the NoC: As mobile processors and early cloud chips moved to octa-core and beyond, buses became congested. Companies like Arteris and Sonics (later acquired) pioneered NoC technology, using packet-switched networking concepts on silicon to allow multiple parallel data paths.
2020-2022: The AI Explosion: The emergence of Transformers and LLMs shifted the requirement from general-purpose NoCs to AI-optimized interconnects. Designers began prioritizing "all-to-all" communication patterns and massive spatial arrays.
2023-Present: The Chiplet and NaC Era: With the physical limits of reticle-sized chips being reached, the industry has pivoted toward chiplets. This has necessitated the development of standards like UCIe (Universal Chiplet Interconnect Express) and the integration of NoC and NaC into a singular, cohesive fabric.

Data Coherency: When and Where It Makes Sense

One of the most complex decisions for AI architects is determining where to apply data coherency. Nandan Nayampally notes that while coherency is essential for general-purpose computing (where multiple CPUs must share a consistent view of memory), it can be a "choke point" for AI accelerators.

In many AI workloads, data follows a predictable, "streamed" path. In these cases, hardware-managed coherency is often unnecessary and adds needless latency. However, as AI systems move toward "agentic" models or heterogeneous tasks where an AI engine must interact closely with a traditional CPU, selective coherency becomes vital. The modern approach is to design "islands" of coherency, where only specific clusters of cores maintain synchronized memory, while the rest of the chip operates on a more efficient, non-coherent streaming model.

Supporting Data and Market Analysis

The urgency of solving these choke points is reflected in the massive capital expenditures by hyperscalers. According to market research data from Gartner and IDC, the AI semiconductor market is expected to grow at a compound annual growth rate (CAGR) of over 20% through 2027. However, the cost of designing these chips is also skyrocketing. A 3nm AI chip design can cost upwards of $500 million, with a significant portion of that budget allocated to physical design and ensuring signal integrity across the interconnect.

Furthermore, energy statistics suggest that data movement accounts for roughly 30% to 50% of the total power consumption in a typical AI training server. Reducing the distance data travels through better NoC/NaC mapping can result in a 10% to 15% improvement in overall system efficiency, which translates to millions of dollars in energy savings for large-scale data centers.

Industry Reactions and Strategic Implications

The semiconductor industry has reacted to these choke points by moving toward specialized interconnect IP providers. Companies like Baya Systems are entering the fray to provide tools that allow architects to simulate data movement at the "pre-silicon" stage. By modeling how data flows through a proposed NoC before the chip is actually built, designers can identify and eliminate bottlenecks early in the development cycle.

Official responses from major players like NVIDIA and AMD indicate a similar focus. NVIDIA’s NVLink technology is essentially a proprietary NaC solution designed to bypass the limitations of standard PCIe connections, allowing GPUs to share data at speeds that mimic on-chip interconnects. Similarly, AMD’s Infinity Fabric is a foundational technology that allows their CPUs and GPUs to scale across multiple chiplets.

The broader implication for the industry is a shift in value. The "secret sauce" of a successful AI chip is no longer just the mathematical engine; it is the fabric that binds the engines together. As models continue to scale toward trillions of parameters, the ability to manage data movement across thousands of interconnected chips will define the next era of computing.

Conclusion: Designing for the Future of AI

To overcome the choke points in AI systems, the industry must embrace a holistic approach to data movement. This involves not only better physical interconnects but also smarter software-hardware co-design. Architects must be willing to make difficult tradeoffs between flexibility and efficiency, deciding when to prioritize low-latency streaming and when to invest in the complexity of data coherency.

As Nandan Nayampally and Baya Systems suggest, the future lies in "software-defined hardware" where the interconnect can be tuned to the specific requirements of the workload. By identifying bottlenecks in the design phase and utilizing advanced NoC and NaC architectures, the industry can ensure that the next generation of AI systems is limited only by human ingenuity, rather than the physics of moving data.