Miami-based startup Subquadratic has announced the launch of its inaugural AI model, built on what it describes as a groundbreaking Subquadratic Selective Attention (SSA) architecture. The technology aims to overcome a fundamental limitation of transformer-based language models: the quadratic scaling cost of attention with respect to context length. While leading AI labs are converging on million-token context windows, Subquadratic claims SSA scales linearly, enabling context windows of up to 12 million tokens at launch, with plans to expand to 50 million tokens. If the claims hold up, the practical use of very large context windows, long hampered by computational constraints, could take a significant leap forward.
As of 2026, frontier AI models commonly offer context windows at the million-token mark. A significant caveat persists, however: most models struggle to use that much information effectively. The MRCR v2 benchmark, a multi-reference retrieval task, illustrates the point. The leading performer, GPT-5.5, scores 74.0%, while other prominent models such as Claude Opus 4.7 lag considerably behind at 32.2%. The disparity highlights the ongoing struggle to translate sheer context window size into tangible performance gains.
The core impediment to efficient large-context processing in current transformer models, an architecture established in 2017 by the seminal "Attention Is All You Need" paper, is the quadratic relationship between context length and computational cost: doubling the input sequence length quadruples the attention workload. That limitation has driven numerous workaround strategies, including Retrieval Augmented Generation (RAG), agentic decomposition, and hybrid model architectures. These techniques help manage and optimize information processing, but they fundamentally represent trade-offs designed to circumvent the quadratic bottleneck rather than remove it.
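To make the arithmetic concrete, the sketch below (illustrative only; the per-token budget of 512 pairs is an arbitrary stand-in for any linearly scaling scheme) compares how many query-key scores dense attention and a linear alternative would compute as context grows:

```python
# Illustrative only: counts of query-key score computations, ignoring
# hidden dimensions and constant factors.

def dense_attention_pairs(n_tokens: int) -> int:
    # Dense attention: every token attends to every token, O(n^2) pairs.
    return n_tokens * n_tokens

def linear_attention_pairs(n_tokens: int, pairs_per_token: int = 512) -> int:
    # A linearly scaling scheme touches a bounded number of pairs per token.
    return n_tokens * pairs_per_token

for n in (128_000, 256_000, 1_000_000, 12_000_000):
    dense = dense_attention_pairs(n)
    linear = linear_attention_pairs(n)
    print(f"{n:>12,} tokens: dense {dense:.2e} pairs, "
          f"linear {linear:.2e} pairs, ratio {dense / linear:,.0f}x")
```

Going from 128K to 256K tokens doubles the linear count but quadruples the dense count, and by 12 million tokens the gap is roughly four orders of magnitude.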
Subquadratic’s entry into this space with its SSA architecture represents a bold claim to sidestep this long-standing challenge. The company, comprising 11 Ph.D. researchers, asserts that SSA achieves linear scaling in both compute and memory requirements relative to context length. This purported breakthrough allows their model to process 12 million tokens efficiently, a scale far beyond the practical reach of current frontier models. The company reports that at one million tokens, SSA operates 52 times faster than dense attention mechanisms. Furthermore, on a needle-in-a-haystack retrieval task at 12 million tokens, Subquadratic’s model achieved an impressive 92.1% accuracy. On the MRCR v2 benchmark, the model scored 83, outperforming OpenAI’s GPT-5.5 by a substantial nine points.
These are significant claims that, if validated, could redefine the capabilities of large language models. The benchmarks released by Subquadratic are indeed noteworthy. On SWE-bench Verified, which evaluates models on real-world software engineering tasks, their model achieved an 82.4% success rate, surpassing Anthropic’s Opus 4.6 (81.4%) and Google’s Gemini 3.1 Pro (80.6%). Crucially, Subquadratic emphasizes that these results are delivered at a significantly lower operational cost.
Subquadratic is making its technology available through an API, offering the full 12-million-token context window. Additionally, the company is launching specialized tools: SubQ Code, a coding agent, and SubQ Search, a deep research tool, both powered by the new architecture.
The Historical Pursuit of Efficient Long-Context AI
The problem of quadratic attention cost is not a new one. Researchers have been exploring solutions for years, tracing back to the early days of transformer research. The overarching pattern has been a series of attempts to trade one desirable characteristic for another, with no single approach successfully replacing dense attention at the frontier of model scale and performance.
One early strategy was fixed-pattern sparse attention, exemplified by models like Longformer. This approach achieved linear scaling by restricting each token’s attention to a localized sliding window. While effective when relevant information is spatially close, its performance degrades when key details are dispersed across longer sequences.
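The idea reduces to a mask that only permits nearby pairs; the following minimal sketch (generic, in the spirit of Longformer's local attention rather than its actual implementation) shows why the cost becomes linear:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Each token only sees tokens within `window` positions of itself, so the
    number of allowed pairs grows linearly with seq_len, not quadratically.
    """
    positions = np.arange(seq_len)
    return np.abs(positions[:, None] - positions[None, :]) <= window

mask = sliding_window_mask(seq_len=16, window=2)
print("allowed pairs:", int(mask.sum()), "out of", mask.size)  # 74 of 256
```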
State-space models, such as Mamba, Mamba-2, RWKV, and RetNet, offered an alternative by replacing the all-pairs comparison of attention with a recurrent state mechanism that compresses past information. However, this compression is inherently lossy. Studies, including one by Nvidia at an 8B parameter scale, indicated that pure Mamba-2 lagged behind transformers on tasks like MMLU and phonebook lookup, with performance gaps only narrowing when attention mechanisms were reintroduced.
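In its simplest form, the recurrent-state idea folds the entire history into a fixed-size vector. The toy linear recurrence below (not Mamba's actual selective-scan mechanism) shows both why the cost is linear and why the compression is lossy:

```python
import numpy as np

def recurrent_summary(tokens: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """Fold a sequence of token vectors into one fixed-size state vector.

    The state is the same size no matter how long the sequence is, which is
    the source of both the O(n) cost and the lossy recall of distant details.
    """
    state = np.zeros(tokens.shape[1])
    for x in tokens:                       # one O(d) update per token: O(n) total
        state = decay * state + (1 - decay) * x
    return state

rng = np.random.default_rng(0)
seq = rng.normal(size=(10_000, 64))        # 10,000 tokens, 64-dim embeddings
print(recurrent_summary(seq).shape)        # (64,): constant-size memory
```

Because earlier tokens are progressively decayed, a precise fact buried far back in the sequence, such as a single phone number, may be unrecoverable from the final state, which is consistent with the retrieval gaps the Nvidia study observed.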
Hybrid architectures represent a pragmatic compromise. Models like Jamba, Kimi Linear, Qwen3-Next, and Nvidia’s Nemotron v3 combine efficient layers with a few dense attention layers reserved for critical retrieval. The cost benefits are real, but the retained dense layers keep the overall complexity at O(n²), which caps the ultimate scalability: a hybrid that is three times cheaper at 32K tokens is still only a constant factor cheaper at 10 million tokens, because its dense attention components continue to grow quadratically.
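Back-of-the-envelope arithmetic makes the distinction between a constant-factor saving and a change in scaling law concrete (the 3x figure comes from the example above; the cost model is simplified to the quadratic term only):

```python
# Simplified cost model, in arbitrary units: dense attention ~ n^2,
# a hybrid ~ n^2 / 3 (a constant-factor saving), and an idealized
# linear-scaling model ~ c * n, with c chosen so dense and linear
# coincide at 32K tokens.

def dense_cost(n):  return n ** 2
def hybrid_cost(n): return n ** 2 / 3
def linear_cost(n, c=32_000): return c * n

for n in (32_000, 1_000_000, 10_000_000):
    print(f"{n:>12,} tokens: dense {dense_cost(n):.2e}, "
          f"hybrid {hybrid_cost(n):.2e}, linear {linear_cost(n):.2e}")
```

The hybrid stays three times cheaper than dense attention at every length, but both curves grow quadratically, while the linear model's cost grows only proportionally, so the gap widens without bound.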

More recent innovations have shifted focus from modifying the attention pattern or compressing states to intelligently selecting which positions to attend to. DeepSeek’s Native Sparse Attention, which received the ACL 2025 best paper award, and its successor, DeepSeek Sparse Attention (DSA), implemented in DeepSeek V3.2-Exp, utilize a lightning indexer to route attention to a select subset of keys. While the attention over these selected keys is genuinely sparse, the indexer itself must still score every query against every key, introducing a quadratic bottleneck in the selection process.
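A simplified select-then-attend sketch (a generic top-k selector, not DeepSeek's actual lightning indexer) makes the bottleneck visible: the attention step touches only k keys per query, but the selection step still builds the full n-by-n score matrix:

```python
import numpy as np

def indexer_select_then_attend(Q, K, V, k=64):
    """Toy selective attention: score all pairs, then attend to the top k.

    The attention below is genuinely sparse (k keys per query), but the
    selection step still computes an (n, n) score matrix, so the overall
    cost remains quadratic in sequence length.
    """
    scores = Q @ K.T                                # (n, n): the quadratic step
    topk = np.argsort(-scores, axis=-1)[:, :k]      # top-k key indices per query
    out = np.empty_like(Q)
    for i, idx in enumerate(topk):                  # sparse attention over k keys
        w = np.exp(scores[i, idx] - scores[i, idx].max())
        out[i] = (w / w.sum()) @ V[idx]
    return out

rng = np.random.default_rng(0)
n, d = 1_024, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(indexer_select_then_attend(Q, K, V).shape)    # (1024, 64)
```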
Alex Whedon, CTO of Subquadratic, explained this challenge to The New Stack: "Sparse attention basically means instead of doing what transformers do, which is if you have 1,000 words, you look at every possible relationship between all 1,000 words, which is 1,000 squared combinations. You realize that only a portion of those actually matter and you only process the portion that matter." The critical hurdle, however, lies in making this selection process itself efficient.
SSA’s Differentiated Approach
Subquadratic’s SSA architecture claims to circumvent the "indexer trap" encountered by previous sparse attention methods. Its core innovation lies in a content-dependent selection mechanism. For any given query, the model dynamically identifies and prioritizes the most relevant positions based on the actual content of the query and keys. Crucially, this selection process itself does not exhibit quadratic scaling.
"For prompt A, words one and six are going to be important to each other," Whedon elaborated. "For prompt B, maybe it’s words two and three. It’s different for every single input." This adaptive and content-aware attention mechanism is the key to SSA’s purported efficiency gains. Whedon further contrasted this with hybrid approaches, stating that while hybrids offer "a scalar benefit," a pure subquadratic mechanism provides a "scaling-law advantage." SubQ’s reported benchmarks reflect this, showing a 7.2x speedup at 128K tokens and a 52.2x speedup at 1 million tokens compared to dense attention.
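Subquadratic has not published how its selection step avoids a full scoring pass, so the sketch below is strictly a generic illustration of one family of content-dependent selection techniques (hash-style bucketing, reminiscent of earlier work such as Reformer's LSH attention) and should not be read as SSA's actual mechanism. It groups tokens by content with O(n) work and restricts attention to within-bucket pairs:

```python
import numpy as np

def bucketed_attention_pairs(X, n_buckets=64, seed=0):
    """Hash tokens into content-dependent buckets; count within-bucket pairs.

    Bucketing each token costs O(n). If the buckets stay roughly balanced,
    the number of within-bucket pairs is about n^2 / n_buckets, and letting
    n_buckets grow with n keeps total work subquadratic. Purely illustrative.
    """
    rng = np.random.default_rng(seed)
    n_bits = int(np.log2(n_buckets))
    proj = rng.normal(size=(X.shape[1], n_bits))            # random projections
    bits = (X @ proj) > 0                                    # content-dependent hash bits
    buckets = bits.astype(int) @ (1 << np.arange(n_bits))    # bucket id per token
    counts = np.bincount(buckets, minlength=n_buckets)
    return int((counts ** 2).sum())                          # pairs actually attended

rng = np.random.default_rng(1)
n, d = 8_192, 64
X = rng.normal(size=(n, d))
print("dense pairs:   ", n * n)                    # 67,108,864
print("bucketed pairs:", bucketed_attention_pairs(X))
```

Whatever mechanism SSA actually uses, the reported speedups (7.2x at 128K tokens, 52.2x at 1 million) are roughly what one would expect if the cost grew linearly rather than quadratically, since the ratio itself grows about in proportion to context length.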
Performance Benchmarks: A Deep Dive
Subquadratic’s performance claims are supported by a series of benchmark results. On the RULER benchmark at 128K tokens, the model scored 97.1%, surpassing Claude Opus 4.6’s 94.8%. The MRCR v2 results underscore the advantage further: Subquadratic’s score of 83 opens a wide gap over the other leading frontier models.
On SWE-bench Verified, Subquadratic reported an 82.4% success rate, narrowly outperforming Opus 4.6 (81.4%) and Gemini 3.1 Pro (80.6%). A particularly striking result is the model’s performance on a needle-in-a-haystack benchmark at 12 million tokens, a context length at which no other frontier model currently operates: 92.1% accuracy.
However, some caveats accompany these results. The technical paper notes that each model was run only once, owing to the high inference cost of current frontier models. The company also acknowledges that the narrow margin of victory on SWE-bench Verified comes down to "harness as much as model," suggesting that optimizations in the testing harness contributed to the result. Furthermore, Whedon himself describes the SubQ model as "way smaller than the big labs," implying that its performance might not directly translate to models of comparable parameter counts from major research institutions.
What Subquadratic is Shipping Now
In its initial rollout, Subquadratic is offering two beta products: an API that exposes the full 12-million-token context window and SubQ Code, a command-line interface (CLI) agent built upon the same architecture. Both are deployed on what the company refers to as "neoclouds" rather than major hyperscale cloud providers. CEO Justin Dangel cited cost as the primary driver for this decision, stating, "they’re very expensive."
While Subquadratic is not open-sourcing the model weights, they plan to provide training tools for enterprises to conduct their own post-training on the SSA architecture. The ambitious target of a 50-million-token context window is slated for release in the fourth quarter.
The history of AI research is replete with ambitious announcements of long-context capabilities that have struggled to materialize into widespread practical use. A notable example is Magic.dev, which announced a 100-million-token context window model in August 2024 with a claimed 1000x efficiency advantage, securing over $500 million in funding. However, as of early 2026, there is no public evidence of their LTM-2-mini model being utilized outside of Magic.dev’s internal operations. This cautionary tale underscores the importance of sustained, demonstrable performance and broad adoption in validating such technological claims.
Funding and Future Outlook
Subquadratic has secured $29 million in funding to date at a valuation of $500 million. Investors include former SoftBank Vision Fund partner Javier Villamizar and Tinder co-founder Justin Mateen. The company was formerly known as Aldea and initially focused on speech models before pivoting to its current research direction. The technical case for the SSA architecture appears sound, but its long-term impact will depend on whether the model consistently delivers on these claims in real-world use, something long-context entrants have historically struggled to do. The AI industry will be watching closely as Subquadratic aims to set a new standard for efficient, large-context language understanding.
