OpenAI Pits AI Agents Against Each Other to Red-Team Smart Contracts

CryptoBreaking

OpenAI has unveiled a benchmarking framework aimed at measuring how effectively AI agents can detect, mitigate, and even exploit security vulnerabilities in crypto smart contracts. The project, titled “EVMbench: Evaluating AI Agents on Smart Contract Security,” was released in collaboration with Paradigm and OtterSec, two organizations with deep exposure to blockchain security and investment. The study assesses AI agents against a curated set of 120 potential weaknesses drawn from 40 smart contract audits, seeking to quantify not just detection and patching capabilities but also the theoretical exploit potential of these agents in a controlled environment.

Key takeaways

EVMbench tests AI agents against 120 vulnerabilities drawn from 40 smart contract audits, many of them sourced from open-source audit competitions.

Among the models tested, Anthropic’s Claude Opus 4.6 led with an average detect award of $37,824, followed by OpenAI’s OC-GPT-5.2 at $31,623 and Google’s Gemini 3 Pro at $25,112.

OpenAI frames the benchmark as a step toward measuring AI performance in “economically meaningful environments,” not just toy tasks, highlighting the real-world implications for attackers and defenders in the crypto security landscape.

The researchers note that smart contracts secure billions of dollars in assets, underscoring the strategic value of AI-enabled tooling for both offensive and defensive activities.

Industry observers have tied these developments to broader discussions about AI-driven payments and the role of stablecoins in everyday transactions, with major executives predicting growing agentic usage in the coming years.

The context for such work is underscored by 2025’s crypto-security incident data, which shows attackers pulled roughly $3.4 billion from crypto funds, reinforcing the demand for robust AI-enabled auditing and defense mechanisms.

Detect awards for AI agents are detailed in the OpenAI PDF accompanying the study, which also describes the evaluation methodology and the scenarios used to simulate real-world smart-contract risk. The authors emphasize that while AI agents have evolved to automate a wide range of routine tasks, assessing their performance in “economically meaningful environments” is essential to understanding how they’ll perform under pressure in production systems.

“Smart contracts secure billions of dollars in assets, and AI agents are likely to be transformative for both attackers and defenders.”

OpenAI notes that it expects agentic technologies to broaden the scope of payments and settlement, including stablecoins used in automated workflows. The discussion around AI-enabled payments extends beyond security testing to the broader question of how autonomous systems will participate in daily financial activity. The company’s own projections suggest that agentic payments could become more commonplace, grounding AI capabilities in practical use cases that touch everyday consumer transactions.

In tandem with the benchmark results, Circle CEO Jeremy Allaire has publicly forecast that billions of AI agents could be transacting with stablecoins for everyday payments within the next five years. That view intersects with a recurring theme in crypto circles: the potential for crypto to become the native currency of AI agents, a narrative that has gained notable attention from industry leaders and investors alike. While such predictions remain speculative, the underlying trend is clear: AI automation is moving from the lab to the transaction layer, where it could reshape how value moves across networks.

The study arrives at a moment when crypto security continues to be a significant risk factor for investors. Attackers pulled roughly $3.4 billion from crypto funds in 2025, a figure that highlights the urgency of improved tooling and faster, more reliable patching mechanisms. The EVMbench framework is positioned, in part, as a way to measure whether AI agents can meaningfully contribute to defensive capabilities at scale, reducing exploitation opportunities and accelerating threat mitigation.

To build the benchmark, researchers drew on 120 curated vulnerabilities spanning 40 smart contract audits, with many weaknesses traced back to open-source audit competitions. OpenAI argues the benchmark will help track AI progress in recognizing and mitigating contract-level weaknesses at scale, offering a standardized way to compare future AI models as they evolve. The study also provides a lens into how AI might be used to standardize risk assessment across a wide range of smart-contract architectures, rather than focusing solely on isolated cases.
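To make the "average detect award" figure more concrete, here is a minimal sketch of how such a metric could be computed. It assumes each vulnerability carries a dollar payout and that an agent's award for a run is the sum of payouts for the known vulnerabilities it correctly reports. The vulnerability IDs, payout amounts, and function names are hypothetical and are not taken from the EVMbench paper, whose actual scoring may differ.

```python
from statistics import mean

# Hypothetical payouts (USD) per known vulnerability; illustrative, not EVMbench data.
AWARDS = {"vuln-a": 20_000.0, "vuln-b": 5_000.0, "vuln-c": 12_500.0}


def detect_award(found, awards):
    """Sum payouts for the distinct known vulnerabilities reported in one run.

    Duplicate reports count once; reports that match no known
    vulnerability earn nothing.
    """
    return sum(awards[v] for v in set(found) if v in awards)


def average_detect_award(runs, awards):
    """Average the per-run detect award across all runs."""
    return mean(detect_award(run, awards) for run in runs)


# Three hypothetical runs of one agent.
runs = [
    ["vuln-a", "vuln-c"],               # two distinct findings
    ["vuln-b"],                         # one finding
    ["vuln-a", "vuln-a", "not-a-bug"],  # a duplicate and a false positive
]
print(average_detect_award(runs, AWARDS))
```

Under this assumed scheme, a model that finds fewer but higher-value bugs can outscore one that finds more low-value bugs, which is one reason a dollar-weighted score may read differently from a simple detection rate.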

Smart contracts weren’t built for humans: Dragonfly

In a contemporaneous thread on X, Haseeb Qureshi, a partner at Dragonfly, argued that crypto’s promise of replacing property rights and traditional contracts never materialized, not because the technology failed but because it was never designed with human intuition in mind. He highlighted the persistent fear of signing large transactions in an environment where drainer wallets and other attack vectors remain a constant threat, in stark contrast to the comparatively smoother experience of traditional bank transfers.

Qureshi contends that the next phase of crypto transactions could be enabled by AI-intermediated, self-driving wallets. Such wallets would monitor risk, manage complex operations, and autonomously respond to threats on behalf of users, potentially reducing the friction and fear that characterize large transfers today.

“A technology often snaps into place once its complement finally arrives. GPS had to wait for the smartphone, TCP/IP had to wait for the browser. For crypto, we might just have found it in AI agents.”

The broader takeaway from this thread is that AI agents may play a critical role in transforming how people interact with crypto—shifting from manual, error-prone transactions to automated, risk-aware processes that can scale with adoption. As AI agents begin to demonstrate more competence in handling security concerns, users could see improved reliability and resilience in decentralized finance workflows, even as the underlying technologies continue to mature.

What to watch next

Publication of the full EVMbench dataset and independent replication of its results across additional AI models and architectures.

Broader adoption of AI-assisted auditing workflows by auditors, exchanges, and DeFi projects looking to bolster security postures.

Explorations into agentic wallets and autonomous payment flows, including regulatory and compliance considerations for AI-managed assets.

Follow-up benchmarks comparing more AI systems as new versions roll out, tracking improvements in detection accuracy and patching speed.

Sources & verification

OpenAI: EVMbench: Evaluating AI Agents on Smart Contract Security — PDF: https://cdn.openai.com/evmbench/evmbench.pdf

OpenAI: Introducing EVMbench — https://openai.com/index/introducing-evmbench/

Crypto security losses in 2025 (reporting coverage): https://cointelegraph.com/news/crypto-3-4-billion-losses-2025-wallet-hacks

Dragonfly: Haseeb Qureshi on AI and crypto UX (X post): https://x.com/hosseeb/status/2024136762424185208

China’s AI lead and crypto implications (analysis): https://cointelegraph.com/news/china-ai-lead-future

AI Eye — IronClaw and AI bot developments in Polymarket coverage: https://cointelegraph.com/magazine/ironclaw-secure-private-sounds-cooler-openclaw-ai-eye/

Key figures and next steps

The EVMbench study suggests that large language models and related AI agents are beginning to perform meaningful security work in the smart contract space, with quantifiable differences across models. Claude Opus 4.6’s lead in average detect awards signals that certain models may be more adept at spotting vulnerabilities within complex contract logic, while others trail, offering a spectrum of capabilities that researchers will likely want to refine. The involvement of Paradigm and OtterSec in the project underscores the growing view that AI-enabled security and automated risk management could become essential to operating at scale in decentralized environments.

As the field evolves, observers will be watching for how quickly AI agents can transition from detection to remediation, and whether these agents can operate reliably in live systems without introducing new risks. The conversation about AI-driven wallets and autonomous payments touches on a broader set of questions around security governance, user consent, and regulatory alignment. If the trajectory suggested by OpenAI and its partners continues, AI-assisted tools could become a core component of future crypto infrastructure, changing both the risk calculus and the user experience in meaningful ways. The next round of benchmarks, alongside real-world deployments, will help determine how quickly this vision materializes and what safeguards must accompany it.

This article was originally published as OpenAI Pits AI Agents Against Each Other to Red-Team Smart Contracts on Crypto Breaking News – your trusted source for crypto news, Bitcoin news, and blockchain updates.
