Frontier LLMs Exhibit Significant Disagreement on Real-World Fact-Checking Claims, Raising Concerns for AI Reliability

The frontier model race, a period of intense competition among leading AI developers, is splintering loyalty among both users and developers. Variation in inference capabilities across large language models (LLMs) is an accepted norm, but a recent analysis suggests a more fundamental issue: even at the highest tier, these models show a surprising lack of consensus on basic, real-world facts. This divergence challenges the assumption that frontier LLMs, despite their different architectures and training data, would converge on the same answers about the world.

An investigation published this month on the claim-verification platform Lenz found that, out of 1,000 real-user fact-check claims – statements asserted as true about the real world – a panel of five prominent frontier LLMs disagreed on 67% of them. A claim counted as a disagreement in either of two cases: at least one model dissented from a majority verdict, or no clear majority emerged at all. The finding bears directly on how far AI can be trusted in sensitive applications.

The Disagreement Landscape: A Four-Bucket Rubric

The study involved five leading LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search integration, and Sonar Pro. Each model was presented with the same real-world claim and tasked with assigning a verdict from a four-bucket rubric: "True," "Mostly True," "Misleading," or "False." The logic behind this rubric is straightforward: since only one category can accurately describe a claim, any divergence among the models indicates a lack of label consistency, suggesting that at least one model is misinterpreting or misrepresenting the factual basis of the claim.
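The per-claim bookkeeping this implies can be sketched in a few lines of Python. This is a minimal illustration, not Lenz's code: the function name `classify_panel` and the outcome labels are hypothetical, and it assumes "majority" means a strict majority of the panel, which the article does not spell out.

```python
from collections import Counter

# The four rubric buckets named in the study.
BUCKETS = ("True", "Mostly True", "Misleading", "False")


def classify_panel(verdicts):
    """Label one claim's panel outcome (hypothetical helper).

    Returns "unanimous" when every model agrees, "dissent" when a strict
    majority exists but at least one model breaks from it, and
    "no_majority" when no bucket holds more than half of the verdicts.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    for verdict in verdicts:
        if verdict not in BUCKETS:
            raise ValueError(f"unknown bucket: {verdict!r}")

    # Size of the most common bucket for this claim.
    top_count = Counter(verdicts).most_common(1)[0][1]
    if top_count == len(verdicts):
        return "unanimous"
    if top_count > len(verdicts) / 2:  # assumption: strict majority
        return "dissent"
    return "no_majority"
```

Under this reading, a claim counts toward the headline disagreement figure whenever the result is anything other than "unanimous".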

According to Lenz, the intentional inclusion of models that represent a spectrum of inference modes commonly found in production AI systems was crucial to the study’s design. These modes span from latency-sensitive inference, critical for real-time interactions like chatbots, to throughput-aware, resource-constrained, and scalable inference, which are vital for batch processing and large-scale data analysis. The study aimed to capture the real-world performance of these models under conditions that mimic their deployment in diverse applications.

Understanding AI Inference in Practice

The concept of "inference" in AI refers to the process by which a trained model uses its learned patterns to make predictions or generate outputs based on new, unseen data. This process is often categorized by its performance characteristics and operational context. Low-latency, high-throughput inference prioritizes speed and the ability to handle a large volume of requests quickly, making it essential for interactive applications. Conversely, offline or batch inference involves accumulating data over time before processing it in a single, optimized run, typically for cost efficiency or when immediate responses are not required. The study’s selection of models and claims aimed to probe how these different inference strategies might influence factual accuracy and agreement.

A Fresh Corpus of Real-World Claims

A critical aspect of the Lenz study was its methodology for selecting claims. The research, led by Kosta Jordanov, founder of Lenz and co-founder of Wiser, an IT consulting and software engineering group, utilized real claims that users had fact-checked on the Lenz platform since February 15, 2026. This approach was deliberately chosen to move beyond standard benchmark questions, which LLMs might have encountered during their training phases.

"We’ve excluded private claims, near-duplicate claims, and any claims containing personally identifiable information (PII)," Jordanov explained to The New Stack. "The interesting thing about this corpus is that, unlike the standard benchmark questions, the models have not seen these claims during training – i.e., it’s a fresh real-world corpus across science, healthcare, politics, law, and other domains on topics that people care about and fact-check." This ensures that the models are being tested on their ability to reason about novel information and apply their knowledge to contemporary issues, rather than simply recalling memorized answers.

The claims were meticulously curated to represent a diverse range of topics and domains, reflecting the breadth of information users seek to verify in their daily lives. This included scientific breakthroughs, health-related queries, political developments, legal interpretations, and various other subjects that are of common interest and often subject to misinformation. By focusing on such a corpus, the researchers aimed to provide a more realistic assessment of LLM capabilities in a world saturated with information of varying veracity.

Quantifying the Divergence: Beyond Mere Dissent

The 67% figure counts any disagreement at all, so the analysis also broke down how severe the disagreement was. In 34% of the claims the disagreement was substantial, meaning the models' verdicts fell two or more buckets apart on the rubric. More alarmingly, 21% of the claims drew polar opposite verdicts, with at least one model classifying a claim as "False" while another declared it "True." Disagreement this deep means AI systems can present flatly contradictory information to users, with potentially serious consequences.
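These severity tiers can be illustrated with a short Python sketch. It rests on two assumptions that are not confirmed by Lenz: that the buckets form an ordered scale from "True" to "False", and that "substantial" means the most distant verdicts on a claim sit at least two steps apart. The names `spread` and `disagreement_rates` are hypothetical.

```python
# Assumed ordering of the rubric, from most to least true.
ORDER = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def spread(verdicts):
    """Distance in buckets between the two most distant verdicts."""
    ranks = [ORDER[v] for v in verdicts]
    return max(ranks) - min(ranks)


def disagreement_rates(panels):
    """Share of claims at each severity tier.

    `panels` holds one list of verdicts per claim.
    """
    if not panels:
        raise ValueError("need at least one claim")
    spreads = [spread(p) for p in panels]
    n = len(spreads)
    return {
        "any": sum(s >= 1 for s in spreads) / n,          # not unanimous
        "substantial": sum(s >= 2 for s in spreads) / n,  # two or more buckets apart
        "polar": sum(s == 3 for s in spreads) / n,        # "True" versus "False"
    }
```

Because each tier is a subset of the one before it, the rates can only shrink from "any" to "polar", which matches the 67%, 34%, and 21% sequence reported.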

The implications of such divergence are particularly acute for live production AI systems and tools. When LLMs are integrated into applications that provide information, make recommendations, or even automate decision-making, these factual disagreements can translate into tangible risks.

Implications for AI Developers and Deployers

The practical takeaway from this research is that relying on a single frontier LLM for factual verification is inherently risky. A single model's verdict on a real-world claim is one opinion drawn from a distribution of responses that is demonstrably unstable across models. Adding a second model often yields a different verdict, which underscores the lack of a shared factual grounding.

"For many applications, that’s fine," Jordanov clarified. "But if a software engineering team operates a system where legal, financial, or reputational risk is involved – and it delivers untrue or hallucinated content to users – you should think about the ways in which you validate the AI-generated content before it reaches users." This statement underscores the need for robust validation mechanisms, particularly in high-stakes environments where factual accuracy is paramount.

The research also probed why frontier models might converge confidently on clear "True" or "False" verdicts but fracture on more nuanced categories like "Mostly True" and "Misleading." While definitive answers remain elusive, Jordanov posited that these middle-ground categories might inherently possess greater ambiguity. "What we measured, though, is that some models use the middle buckets way less often than others," he noted. "Gemini is quite ‘confident’ and classified only 6% of the claims in the two middle buckets vs. 45% for Opus 4.7." This suggests differences in how models interpret and utilize the full spectrum of certainty, potentially leading to a more polarized output.
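The statistic Jordanov cites is simple to reproduce for any set of verdicts. A minimal Python sketch, assuming one verdict per claim for a given model; the function name `middle_bucket_rate` is hypothetical:

```python
# The two "middle" buckets of the four-bucket rubric.
MIDDLE = {"Mostly True", "Misleading"}


def middle_bucket_rate(verdicts):
    """Fraction of one model's verdicts that fall in the middle buckets."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v in MIDDLE for v in verdicts) / len(verdicts)
```

Comparing this rate across models gives a rough view of how readily each one hedges rather than committing to "True" or "False".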

Is Claude Opus 4.7 an Outlier?

The study also examined the performance of Claude Opus 4.7, a model that had previously faced criticism for flaky performance. Claude Opus 4.7 aligned with the peer majority least often, at 70%. This raises the question of whether this divergence is a cause for concern for its developer, Anthropic.
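One plausible way to compute such a figure is sketched below in Python. The article does not describe Lenz's exact method, so this sketch makes its own assumptions: the "peer majority" is a strict majority among the other models on the panel, and claims where the peers have no majority are skipped. The function names and model keys are hypothetical.

```python
from collections import Counter


def peer_majority(peer_verdicts):
    """Return the peers' strict-majority verdict, or None if there is none."""
    label, count = Counter(peer_verdicts).most_common(1)[0]
    return label if count > len(peer_verdicts) / 2 else None


def alignment_rate(rows, model):
    """Share of claims where `model` matches the majority of its peers.

    `rows` holds one dict per claim, mapping model name to verdict.
    Claims with no peer majority are left out of the denominator.
    """
    agree = total = 0
    for row in rows:
        peers = [v for name, v in row.items() if name != model]
        majority = peer_majority(peers)
        if majority is None:
            continue  # assumption: skip claims where peers are split
        total += 1
        agree += row[model] == majority
    return agree / total if total else None
```

A low rate under this measure shows only that a model departs from its peers, not that its verdicts are wrong.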

However, Jordanov cautioned against drawing hasty conclusions. "Not necessarily," he stated. "Our limited preliminary research shows that the majority is often wrong, and sometimes we see wrong unanimous verdicts; i.e., having a different opinion than the majority does not necessarily mean being wrong." This observation is critical: the study, by design, does not establish "ground truths" – universally accepted factual benchmarks. Its primary aim is to quantify the differences in verdicts among the models, not to determine which model is definitively correct. Establishing ground truths for such a diverse set of claims would require extensive human expert review across multiple domains, a monumental task.

Echoes from Academia: The Problem of Epistemic Divergence

The findings from Lenz are not isolated. Academic research is also increasingly highlighting the issue of "epistemic divergence" among LLMs, even when they achieve comparable benchmark accuracy. A February study by Eddie Yang and Dashun Wang at Cornell University, published on arXiv, analyzed two major reasoning benchmarks, MMLU-Pro and GPQA. Their research indicated that LLMs exhibiting similar accuracy scores still disagreed on a significant percentage of items – ranging from 16% to 66% overall, and 16% to 38% among top-performing frontier models.
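A toy Python example shows how identical accuracy can coexist with total disagreement. The data below is invented purely for illustration and is not drawn from MMLU-Pro, GPQA, or the study; the function names are hypothetical.

```python
def accuracy(predictions, gold):
    """Fraction of items a model answers correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def disagreement(preds_a, preds_b):
    """Fraction of items on which two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Invented four-item multiple-choice benchmark.
gold = ["A", "B", "C", "D"]
model_1 = ["A", "B", "A", "A"]  # right on the first two items
model_2 = ["B", "A", "C", "D"]  # right on the last two items

acc_1 = accuracy(model_1, gold)               # 0.5
acc_2 = accuracy(model_2, gold)               # 0.5
item_gap = disagreement(model_1, model_2)     # 1.0
```

Both models score 50%, yet they never give the same answer, because each is right on a different half of the items. An accuracy leaderboard alone would treat them as equivalent.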

"Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence," the Cornell researchers wrote. "Using two major reasoning benchmarks – MMLU-Pro and GPQA – we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models." This academic validation lends further weight to the Lenz findings, suggesting that the current evaluation methods for LLMs may be insufficient to capture the full extent of their potential disagreements on factual matters. The benchmarks, often designed for ease of automated scoring, may not adequately represent the complex, nuanced nature of real-world knowledge and reasoning.

The Path Forward: Incorporating Human Expertise

Jordanov confirmed that the current Lenz analysis is just the initial phase of a larger research agenda. "We do plan a follow-up where we measure the models against human-provided labels, and also measure the source-based multi-step multi-model Lenz pipeline against those labels and against the frontier models," he stated. "The time-consuming part is the methodologically correct labeling by human experts in all of those domains, but we aim to publish in the coming months."

This next phase is crucial for establishing a baseline of human consensus against which LLM performance can be more definitively measured. It aims to answer critical questions about where LLM panels systematically diverge from human understanding, how Lenz’s own verification pipeline compares to both human consensus and individual LLMs, and which categories of claims are most prone to causing divergence. Understanding these patterns – whether due to rubric ambiguity, temporal framing of claims, domain specialization, or calibration drift in the models – is essential for developing more reliable AI systems.

The overarching goal of this research, Jordanov emphasized, is not to create a competitive leaderboard for LLMs. Instead, the focus is on mapping the "structure of disagreement." By understanding the patterns and causes of divergence, developers and researchers can better anticipate potential pitfalls, implement necessary safeguards, and ultimately build AI systems that are more trustworthy and aligned with human understanding of truth and fact. The current landscape of frontier LLMs, while impressive in many respects, clearly indicates that the journey towards truly reliable AI fact-finders is far from over.