The first public benchmark testing how well AI models predict real-world events. 12 models. Same headlines. Scored against reality.
| # | Model | Score | Accuracy | Rounds |
|---|---|---|---|---|
Given a week of Iran escalation headlines (Feb 20-27), DeepSeek R1 and Mistral Pixtral were the only models to predict military action. All Claude models (Opus 4.6, Sonnet 4.6, Haiku 4.5) and all Amazon Nova models explicitly predicted that no strikes would occur. The war began on March 1.
Given a week of Iran escalation headlines, predict what happens next.
US and Israel launched massive strikes on Iran. Ayatollah Khamenei killed. Full-scale war began. Oil surged 35%. Middle East airspace closed.
💡 haiku-4.5 scored highest (4.0/5). Lowest: gpt-oss-120b (1.6/5). Average across all models: 2.6/5.
Apple leaks and rumors — predict what gets announced.
Apple announced the iPhone 17e, MacBook Neo (a $599 laptop and a surprise new product line), M4 iPad Air, and 7 new products at its March event.
💡 opus-4.6 scored highest (1.6/5). Lowest: kimi-k2.5 (1.0/5). Average across all models: 1.1/5.
Pentagon pressures AI companies on military contracts.
OpenAI made controversial Pentagon surveillance deal. Altman admitted it was 'opportunistic and sloppy'. ChatGPT uninstalls surged 295%. Anthropic refused. Major backlash.
💡 sonnet-4.6 scored highest (2.2/5). Lowest: gpt-oss-120b (1.0/5). Average across all models: 1.4/5.
Iran war just started — predict market impact.
Oil surged 35% in one week (biggest gain since 1983). Dow futures sank 1000+ points. Gas prices spiked. Defense-tech stocks boomed. Treasury market worst weekly rout since 'liberation day'.
💡 lenz-1.0 scored highest (2.8/5). Lowest: qwen3-next-80b (1.2/5). Average across all models: 1.6/5.
Will Trump's escalating tariff policy succeed or face further major setbacks?
Trump's escalating tariff policy faced setbacks as the Supreme Court ruled them unlawful, prompting a scramble for refunds among US companies and trade groups. Meanwhile, geopolitical tensions surged due to Trump's aggressive stance on Iran, overshadowing tariff discussions.
💡 nova-2-lite scored highest (2.2/5). Lowest: qwen3-next-80b (1.0/5). Average across all models: 1.5/5.
Will the US launch a military strike against Iran?
The US, in collaboration with Israel, launched military strikes against Iran, resulting in the reported death of Supreme Leader Ayatollah Khamenei and significant regional repercussions. These strikes caused widespread chaos, including disruptions in air travel and damage to key infrastructure such as Dubai International Airport.
💡 nova-2-lite scored highest (3.0/5). Lowest: mistral-large-3 (1.4/5). Average across all models: 2.1/5.
Will Western non-combat troops be deployed to Ukraine in 2026?
Based on the provided headlines, there is no information regarding the deployment of Western non-combat troops to Ukraine in 2026. The headlines focus on various unrelated global and local events.
💡 llama4-maverick scored highest (1.0/5). Lowest: llama4-maverick (1.0/5). Average across all models: 0.9/5.
Will local resistance significantly slow Big Tech's data center expansion plans?
Local resistance, specifically in Iowa County, has led to extensive zoning rules for data centers, while geopolitical tensions have caused physical damage to Amazon's data centers in the UAE and the Middle East.
💡 opus-4.6 scored highest (2.6/5). Lowest: lenz-1.0 (1.0/5). Average across all models: 1.5/5.
Will the Stop Killing Games campaign achieve meaningful regulation of digital products?
The follow-up headlines primarily focused on geopolitical tensions and developments in the gaming industry, with no direct mention of the Stop Killing Games campaign or its impact on digital product regulation. The gaming news revolved around new console announcements and game reviews.
💡 All models tied at 1.0/5, so mistral-large-3 is listed as both highest and lowest. Average across all models: 1.0/5.
Will the US launch military strikes against Iran?
The US launched military strikes against Iran, leading to a surge in oil prices and regional instability. The conflict has drawn in other actors, including Hezbollah and Israel, escalating tensions across the Middle East.
💡 llama4-maverick scored highest (4.3/5). Lowest: sonnet-4.6 (1.8/5). Average across all models: 2.4/5.
Will Trump escalate his tariff policy into a full-scale trade war?
Trump did not escalate his tariff policy into a full-scale trade war; instead, the Supreme Court ruled some tariffs unlawful, prompting a process for refunds and legal challenges from companies like Nintendo and Lenovo.
💡 haiku-4.5 scored highest (3.0/5). Lowest: mistral-large-3 (1.2/5). Average across all models: 2.0/5.
Will the killing of El Mencho lead to sustained cartel violence in Mexico?
El Mencho, the Jalisco cartel leader, was killed and buried in a golden casket in a Guadalajara cemetery on March 3, 2026, but the follow-up headlines do not provide specific details on the resulting cartel violence in Mexico.
💡 nova-2-lite scored highest (3.0/5). Lowest: mistral-large-3 (1.0/5). Average across all models: 1.7/5.
Will the US implement strict federal regulations on AI development?
There were no follow-up headlines indicating the US implemented strict federal regulations on AI development within 7-14 days after the original question date. The news primarily focused on AI advancements, ethical concerns, and international competition in AI.
💡 sonnet-4.6 scored highest (2.5/5). Lowest: kimi-k2.5 (1.0/5). Average across all models: 1.5/5.
Will Trump make a formal attempt to purchase or assert control over Greenland?
Donald Trump did not make a formal attempt to purchase or assert control over Greenland. The follow-up headlines focus on other issues, including gender equality in schools, legal challenges by Anthropic against the Trump administration, and Canadian reactions to Trump's previous sovereignty threats.
💡 deepseek-v3.2 scored highest (1.6/5). Lowest: haiku-4.5 (1.0/5). Average across all models: 1.0/5.
Will the US launch military strikes against Iran?
The US launched military strikes against Iran, with Israel's involvement, leading to significant escalations, including the reported death of Ayatollah Khamenei and retaliatory attacks on a US oil tanker.
💡 opus-4.6 scored highest (3.4/5). Lowest: mistral-large-3 (1.6/5). Average across all models: 2.4/5.
Will Congress intervene to limit Trump's tariff authority after the Supreme Court ruling?
Congress did not intervene to limit Trump's tariff authority after the Supreme Court ruling, as subsequent news coverage focused predominantly on Trump's foreign policy actions and statements regarding Iran and NATO.
💡 haiku-4.5 scored highest (2.6/5). Lowest: qwen3-next-80b (1.0/5). Average across all models: 1.5/5.
Will AI coding tools cause a major crash in traditional software and cybersecurity stock valuations?
There was no major crash in traditional software and cybersecurity stock valuations reported in the follow-up news headlines. Instead, the focus was on geopolitical tensions, tech updates, and regulatory considerations around AI.
💡 lenz-1.0 scored highest (2.6/5). Lowest: mistral-large-3 (1.0/5). Average across all models: 1.3/5.
Will cartel violence in Mexico disrupt the upcoming World Cup?
The follow-up headlines do not mention any disruptions to the World Cup related to cartel violence in Mexico. Instead, they focus on various unrelated global events and sports updates.
💡 llama4-maverick scored highest (1.0/5). Lowest: llama4-maverick (1.0/5). Average across all models: 0.8/5.
Will the Pentagon's integration of Grok into classified systems succeed without security breaches?
There is no follow-up news regarding the Pentagon's integration of Grok into classified systems or any related security breaches. The headlines focus on global political events, entertainment, and other unrelated topics.
💡 All models tied at 1.0/5, so nova-2-lite is listed as both highest and lowest. Average across all models: 1.0/5.
Will rising tensions between the US and Iran lead to military conflict?
Rising tensions between the US and Iran escalated into a military conflict, resulting in the death of Ayatollah Khamenei and the naming of a new supreme leader in Iran. The conflict has also impacted global geopolitics, affecting regions such as Taiwan and causing concern among Chinese investors.
💡 kimi-k2.5 scored highest (3.0/5). Lowest: sonnet-4.6 (1.2/5). Average across all models: 2.0/5.
Will Paramount's bid win over Warner Bros. Discovery against Netflix's offer?
Warner Bros. Discovery accepted Paramount's bid, and the FCC chair deemed the merger preferable to Netflix's previous offer.
💡 glm-4.7 scored highest (3.7/5). Lowest: gpt-oss-120b (1.4/5). Average across all models: 2.4/5.
How will AI reshape the future job market and workforce dynamics?
Meta announced the formation of a new AI engineering organization focused on superintelligence, and Amazon showcased the future of AI in consulting with its cloud reboot, reflecting significant investments and strategic shifts in the tech industry toward AI.
💡 llama4-maverick scored highest (1.4/5). Lowest: nova-premier (1.0/5). Average across all models: 1.1/5.
What impact will the US's new consular policy have on Israeli-Palestinian relations?
The new consular policy's impact on Israeli-Palestinian relations wasn't directly addressed in the follow-up headlines, which primarily focused on tensions and military actions involving the US, Iran, and Israel. The geopolitical climate appears increasingly volatile, overshadowing specific diplomatic efforts.
💡 nova-2-lite scored highest (2.6/5). Lowest: qwen3-next-80b (1.0/5). Average across all models: 1.7/5.
Will solar energy continue to outpace other renewable sources in the US?
The follow-up headlines primarily focus on geopolitical tensions affecting global energy markets rather than specific developments in US renewable energy. Solar energy trends in the US are not directly addressed in these updates.
💡 All models tied at 1.0/5, so nova-premier is listed as both highest and lowest. Average across all models: 1.0/5.
Will Trump broker a ceasefire between Russia and Ukraine within his first 100 days?
Donald Trump did not broker a ceasefire between Russia and Ukraine within his first 100 days. The conflict continued with ongoing military strikes and rising tensions between the involved parties.
💡 kimi-k2.5 scored highest (4.0/5). Lowest: gpt-oss-120b (1.2/5). Average across all models: 2.7/5.
Will North Korea launch a military attack on South Korea or US assets?
North Korea did not launch a military attack on South Korea or US assets. Instead, tensions were heightened in the Middle East with military actions between the US and Iran, while North Korea continued its military posturing and missile tests.
💡 sonnet-4.6 scored highest (2.8/5). Lowest: gpt-oss-120b (1.2/5). Average across all models: 2.0/5.
Will Trump and Xi Jinping reach a comprehensive trade agreement?
Trump and Xi Jinping did not reach a comprehensive trade agreement. Instead, the Trump administration initiated new Section 301 trade probes into China, the EU, and others on March 12, 2026.
💡 nova-premier scored highest (2.6/5). Lowest: lenz-1.0 (1.0/5). Average across all models: 1.4/5.
Will the US government impose major restrictions on AI companies in 2026?
The US government did not impose major restrictions on AI companies by early March 2026, but the Pentagon labeled AI company Anthropic a supply chain risk.
💡 glm-4.7 scored highest (2.0/5). Lowest: qwen3-next-80b (1.0/5). Average across all models: 1.4/5.
Will Anthropic accept the Pentagon's terms for AI use?
Anthropic was labeled a supply chain risk by the Pentagon on March 6, 2026, indicating that it did not accept the Pentagon's terms for AI use. Despite this, Anthropic continued to engage in other projects, such as collaborating with Firefox on AI bug hunting.
💡 llama4-maverick scored highest (3.0/5). Lowest: gpt-oss-120b (1.0/5). Average across all models: 1.6/5.
Who will ultimately acquire Warner Bros. Discovery?
Based on the follow-up headlines, it appears that Paramount is attempting to merge with Warner Bros. Discovery, prompting opposition from the Teamsters union.
💡 kimi-k2.5 scored highest (2.0/5). Lowest: qwen3-next-80b (1.0/5). Average across all models: 1.5/5.
Will the US and Iran reach a nuclear deal or will tensions escalate to military conflict?
Tensions between the US and Iran escalated to military conflict, resulting in the death of Ayatollah Khamenei and the naming of a new supreme leader in Iran. The conflict has also impacted global stakeholders, including Chinese investors.
💡 opus-4.6 scored highest (4.8/5). Lowest: gpt-oss-120b (1.2/5). Average across all models: 2.7/5.
Will AI cause mass layoffs across industries as Jack Dorsey predicts?
Atlassian announced layoffs of 10% of its workforce, attributing the cuts to changes brought by the AI era, while Marvell's stock surged due to strong AI-driven demand.
💡 llama4-maverick scored highest (3.0/5). Lowest: gpt-oss-120b (1.0/5). Average across all models: 1.5/5.
Will Anthropic reverse its position and comply with Pentagon demands to be removed from the US government blacklist?
Anthropic did not reverse its position and comply with Pentagon demands to be removed from the US government blacklist. Instead, the company launched a marketplace for Claude-powered software.
💡 haiku-4.5 scored highest (1.8/5). Lowest: nova-2-lite (1.0/5). Average across all models: 1.1/5.
Will US regulators approve the Paramount-Warner Bros. Discovery merger despite antitrust concerns?
As of March 13, 2026, the Teamsters union has publicly urged the Department of Justice (DOJ) to block the proposed merger between Paramount and Warner Bros. Discovery due to antitrust concerns. No final decision from US regulators has been announced.
💡 llama4-maverick scored highest (2.6/5). Lowest: glm-4.7 (1.0/5). Average across all models: 1.5/5.
Will the Trump administration succeed in terminating the federal student loan repayment plan by end of 2026?
The Trump administration did not terminate the federal student loan repayment plan by the end of 2026. Instead, millions of borrowers were removed from Biden's affordable repayment plan due to a court reversal.
💡 sonnet-4.6 scored highest (2.8/5). Lowest: nova-2-lite (1.0/5). Average across all models: 1.9/5.
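Each round's 💡 line reports the highest score, the lowest score, and the average across models. The benchmark's own scoring code is not shown here, so the following is only a minimal sketch of how such a line could be derived from per-model scores. The function name `summarize_round`, the `scores` mapping, and the example values are all hypothetical and are not taken from the benchmark data.

```python
from statistics import mean


def summarize_round(scores: dict[str, float], scale: int = 5) -> str:
    """Build a one-line round summary from a {model: score} mapping.

    Hypothetical helper: assumes one score per model on a 0..scale range.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    # max() and min() return the first matching key on ties, so a round in
    # which every model ties reports the same model as highest and lowest.
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    avg = mean(scores.values())
    return (
        f"{best} scored highest ({scores[best]:.1f}/{scale}). "
        f"Lowest: {worst} ({scores[worst]:.1f}/{scale}). "
        f"Average across all models: {avg:.1f}/{scale}."
    )


# Hypothetical scores for illustration only; not real benchmark results.
example_scores = {"model-a": 4.0, "model-b": 2.5, "model-c": 1.0}
summary = summarize_round(example_scores)
print(summary)
```

Under this sketch, a round where every model receives the same score names a single model as both highest and lowest. That is one plausible reading of the rounds above where the same model appears in both positions, though it is an assumption about how the summaries were generated.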