Security research just changed. Not incrementally, but in a way that forces the industry to rethink how software gets hardened. Many flaws in software go unnoticed for years because finding and exploiting them has required expertise held by only a few skilled researchers. With the latest frontier AI models, the cost, effort, and level of expertise required to find and exploit vulnerabilities have all dropped dramatically.
In 2025 alone, more than 48,000 CVEs were published, a 38% increase from 2023. The human security workforce was already stretched thin before AI entered the picture. Now AI is both widening the attack surface and, increasingly, the best tool we have for scanning it.
The Leap From Pattern Matching to Real Reasoning
Traditional static analysis tools work by matching code against known vulnerability patterns. That helps, but only to a point: if a flaw doesn't fit a known pattern, the tool never sees it. Finding the subtle, context-dependent vulnerabilities that attackers often exploit has required skilled human researchers, who are dealing with ever-expanding backlogs.
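To make the limitation concrete, here is a toy, hypothetical rule-based scanner (not any vendor's actual implementation): it reduces to regex matching against a fixed ruleset, so anything outside the ruleset sails through.

```python
import re

# Toy rule-based scanner. The two rules below are illustrative,
# not a real tool's ruleset.
RULES = [
    ("hardcoded-password", re.compile(r"password\s*=\s*['\"].+['\"]", re.IGNORECASE)),
    ("weak-hash", re.compile(r"\b(md5|sha1)\s*\(", re.IGNORECASE)),
]

def scan(source: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every line matching a known pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule_name, pattern in RULES:
            if pattern.search(line):
                findings.append((lineno, rule_name))
    return findings

code = 'password = "hunter2"\ndigest = md5(data)\n'
print(scan(code))  # → [(1, 'hardcoded-password'), (2, 'weak-hash')]
```

A business-logic flaw or broken access check matches no rule here and is silently passed over, which is exactly the gap the LLM-based approach aims to close by reasoning about what the code does rather than what it looks like.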
The new generation of AI security tools works differently. OpenAI's Aardvark, now rebranded as Codex Security, does not rely on traditional program analysis techniques like fuzzing or software composition analysis. Instead, it uses LLM-powered reasoning and tool-use to understand code behavior. It looks for bugs as a human security researcher might: by reading code, analyzing it, writing and running tests, and using tools.
What These Tools Are Actually Finding
The results from early deployments are hard to dismiss. Over 30 days of beta testing, Codex Security scanned more than 1.2 million commits across external repositories, identifying 792 critical findings. OpenAI's scans on the same repositories over time demonstrated increasing precision and declining false positive rates, with the latter falling by more than 50% across all repositories.
Anthropic's results with Claude Mythos Preview go further. Mythos Preview identified a 27-year-old denial-of-service vulnerability in OpenBSD's TCP SACK implementation: an integer overflow condition that allows a remote attacker to crash any OpenBSD host responding over TCP. It also discovered a 16-year-old vulnerability in FFmpeg's H.264 codec, introduced in a 2003 commit but only exposed by a 2010 refactor, and overlooked since by every fuzzer and human reviewer who examined the code.
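The advisory details aren't reproduced here, but the general failure mode behind this class of bug, an integer overflow in length or counter arithmetic, can be sketched. This is a hypothetical illustration in Python simulating fixed-width wraparound, not the actual OpenBSD SACK code (which is C):

```python
# Toy illustration of integer-overflow wraparound (hypothetical values;
# not the actual OpenBSD code). In C, unsigned 16-bit arithmetic wraps
# silently; we simulate that with a mask.
MASK16 = 0xFFFF

def add_wrapping(a: int, b: int) -> int:
    """Add two values as a 16-bit unsigned counter would, wrapping on overflow."""
    return (a + b) & MASK16

# An attacker-influenced increment pushes the counter past its maximum.
total = add_wrapping(0xFFF0, 0x0020)
print(hex(total))  # → 0x10, not the true sum 0x10010
```

Downstream code that trusts such a wrapped value as a length or count can under-allocate, loop incorrectly, or trip an assertion, which is how an arithmetic slip becomes a remotely triggerable crash.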
Historically, converting a known vulnerability into a working exploit has taken skilled researchers days to weeks. That timeline has compressed substantially.
Key Players in AI-Driven Vulnerability Research
- Claude Mythos Preview (Anthropic): Used to identify thousands of zero-day vulnerabilities in every major operating system and every major web browser.
- Codex Security (OpenAI): Evolved from Aardvark; available to ChatGPT Enterprise, Business, and Edu customers.
- ÆSIR (Trend Micro): Combines MIMIR for real-time threat intelligence and FENRIR for zero-day discovery, enabling scans of massive codebases in hours and continuous protection for customers.
- Big Sleep (Google DeepMind + Project Zero): Found and reported 20 flaws in popular open source software including FFmpeg and ImageMagick.
The Signal-to-Noise Problem
Not everything is clean. According to Bobby Kuzma, director of offensive cyber operations at ProCircular, "AI tools, when properly applied and validated, do provide high impact findings, but we're also seeing programs being overwhelmed by huge numbers of reports, most of which are slop."
Triaging the growing volume of variable-quality reports is straining under-resourced programs. The curl project, maintainers of the widely used command-line tool for transferring data over network protocols, has publicly pleaded with researchers to stop submitting unvalidated AI-generated bug reports.
Crystal Hazen, senior bug bounty program manager at HackerOne, describes the current moment as "the era of the bionic hacker," where human researchers use agentic AI systems to collect data, triage, and advance discovery. The human-in-the-loop still matters, especially for validating findings before they flood maintainers' inboxes.
The Dual-Use Reality
Although the risks from AI-augmented cyberattacks are serious, there is reason for optimism: the same capabilities that make AI models dangerous in the wrong hands make them invaluable for finding and fixing flaws in important software.
A significant share of the world's code will be scanned by AI in the near future, given how effective models have become at finding long-hidden bugs. Attackers will use AI to find exploitable weaknesses faster than ever. But defenders who move quickly can find those same weaknesses, patch them, and reduce the risk of an attack.
Final Thoughts
What strikes me most here isn't the raw vulnerability counts. It's the age of the bugs being found. A 27-year-old flaw in OpenBSD. A 16-year-old codec bug in FFmpeg. These weren't obscure codebases sitting in a dusty corner. They were actively maintained, widely reviewed, and battle-tested by some of the best engineers in the world. AI found what decades of human scrutiny missed, and it did it at a fraction of the cost.
The noise problem is real and shouldn't be minimized. Flooding open-source maintainers with low-confidence AI reports creates its own kind of damage. The tools that will win long-term are the ones that prioritize precision over recall, like Codex Security's approach of grounding findings in system context before surfacing them. False positives aren't free.
The arms race framing is tired, but the underlying tension is genuine: the same model capability that finds a decades-old remotely exploitable flaw for a defender can find it for an attacker too. How the industry manages access, disclosure, and responsible deployment of these tools over the next 12 months will matter more than any individual benchmark score. What's your take on where the line should be drawn? Drop your thoughts in the comments.
FAQ
What is AI-powered vulnerability research?
AI-powered vulnerability research uses large language models and autonomous agents to scan codebases, reason about code behavior, and identify security flaws, going beyond traditional rule-based static analysis tools.
How is AI different from traditional security scanners?
Static analysis is typically rule-based, meaning it matches code against known vulnerability patterns. That catches common issues like exposed passwords or outdated encryption, but often misses more complex vulnerabilities like flaws in business logic or broken access control. AI models reason about code contextually, the way a human researcher would.
Are AI bug-hunting tools reliable?
Results are mixed. Top tools like Codex Security are improving precision, but several software project maintainers have complained of bug reports that are actually hallucinations, with some calling them the bug bounty equivalent of AI slop. Human validation remains essential.
Which companies are leading in AI security research?
Anthropic (Claude Mythos / Project Glasswing), OpenAI (Codex Security), Google DeepMind (Big Sleep), and Trend Micro (ÆSIR) are among the most active players pushing AI-driven vulnerability discovery in 2025 and 2026.
Can attackers use these same AI tools?
Yes. The same capabilities that help defenders find and fix vulnerabilities could help attackers exploit them. This is why companies like Anthropic are restricting access to their most capable security models to vetted partners only.