How AI can revolutionize vulnerability research

Generative artificial intelligence (GenAI) and AI agents combine code analysis, reasoning and automation capabilities to save researchers time and uncover previously overlooked flaws.

AI agents like Google’s Big Sleep and Code Intelligence’s Spark have already enabled the autonomous discovery of security flaws in popular open-source projects, while the incorporation of large language models (LLMs) into fuzzing workflows has enabled researchers to cover more ground while removing tedious and time-intensive manual steps. 

“As these approaches continue to evolve, they will significantly impact the workflows of both software developers and security researchers,” Code Intelligence co-founder and Chief Product Officer Khaled Yakdan told SC Media.

The benefits of AI-powered vulnerability research

Google’s Big Sleep AI agent, which was developed in a collaboration between the Project Zero and Google DeepMind teams, exemplifies the benefits AI-powered vulnerability discovery can potentially bring to security researchers.

The vision of the project, the successor to Google’s Project Naptime, is to save researchers significant time and manual work by automating certain tasks, which Project Zero jokingly stated will allow them to “take regular naps.”

While automated techniques like fuzzing tools and static and dynamic application security testing (SAST/DAST) tools are already used to save time and improve efficiency in vulnerability research, AI agents’ unique reasoning capabilities and ability to leverage other tools take them a step beyond traditional automation.

“While traditional security testing tools follow predefined heuristics and algorithms, AI agents leverage modern LLMs to understand, generate, and adapt test cases intelligently," Yakdan explained.

"When AI agents are incorporated into traditional vulnerability testing tools, they can enhance bug detection capabilities, ease triaging and debugging, and accelerate the setup process."

The Spark agent, for example, makes fuzzing faster and more effective by using LLM intelligence to automatically generate high-quality fuzz tests, typically a time-intensive manual process. Spark can save up to 1,000 hours of manual effort when testing a codebase with 100,000 lines of code, according to Code Intelligence.
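
To make that concrete, a fuzz test of the kind Spark generates is typically a small harness that feeds engine-mutated bytes into one API entry point. Below is a minimal libFuzzer-style sketch; parse_config is a hypothetical stand-in for a library function, not anything taken from Spark’s actual output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function under test: a stand-in for any library
   API that consumes untrusted bytes. */
extern int parse_config(const uint8_t *data, size_t len);

/* Standard libFuzzer entry point. The fuzzing engine calls this
   millions of times with mutated inputs, watching for crashes and
   sanitizer reports. Writing one such harness per API entry point
   is the manual effort tools like Spark aim to automate. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  parse_config(data, size);
  return 0; /* non-zero return values are reserved by libFuzzer */
}
```

Built with clang’s -fsanitize=fuzzer,address flags, a harness like this runs unattended; the hard part at scale is producing hundreds of them with sensible inputs, which is where LLM generation pays off.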

“Additionally, Spark iteratively refines these tests to achieve thorough code coverage, enabling a more efficient and scalable approach to security testing,” Yakdan added.

Time savings aren’t the only potential benefit of LLM-enhanced vulnerability testing. Google has noted that the reasoning abilities of AI agents like Big Sleep have the potential to uncover vulnerabilities that have long gone undetected by traditional fuzzing due to limitations in the available fuzzing harnesses for certain projects.

AI-enhanced fuzzing can also increase code coverage: more than 370,000 lines of new code coverage were added across 272 C/C++ projects after LLM capabilities were introduced to Google’s OSS-Fuzz.

Additionally, the time gained back from automating traditionally manual tasks can empower security professionals to give more focus to higher-level efforts like strategic analysis and studying novel attack vectors.

“Developers will receive more reliable vulnerability reports and, in some cases, even proposed fixes, streamlining the remediation process and reducing the burden of manual debugging," Yakdan said.

"Security researchers, on the other hand, will transition from performing repetitive vulnerability discovery tasks to guiding and fine-tuning AI agents, setting research objectives, and validating complex findings."

Real-world results of AI vulnerability discovery

AI tools are already proving their worth in discovering new vulnerabilities. During beta testing in October 2024, Code Intelligence’s Spark autonomously discovered a heap-based use-after-free vulnerability in the popular open-source wolfSSL library.

“The discovery required no manual intervention—beyond setting up the project and typing ‘cifuzz spark’ into the command line,” Code Intelligence Software Developer Peter Samarin wrote in a statement.
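
For readers unfamiliar with the bug class Spark flagged, the snippet below shows the general shape of a heap use-after-free in C; it is purely illustrative and is not the actual wolfSSL code.

```c
#include <stdlib.h>

int main(void) {
  char *buf = malloc(16);
  if (!buf)
    return 1;
  free(buf);    /* memory handed back to the allocator... */
  buf[0] = 'A'; /* ...then written through the dangling pointer.
                   Under AddressSanitizer, a fuzzer input reaching
                   this line reports heap-use-after-free. */
  return 0;
}
```

Bugs of this class are prized by attackers because the freed memory can be reallocated and filled with attacker-controlled data, which is why an autonomous discovery in a TLS library drew attention.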

Google has also embraced AI-enhanced fuzzing, adding LLM capabilities to its OSS-Fuzz system in August 2023. Since then, the AI-powered version of OSS-Fuzz has uncovered more than two dozen bugs in open-source projects, including a 20-year-old, previously undiscovered out-of-bounds read/write flaw in OpenSSL reported in September 2024.

Big Sleep, which is driven by Gemini 1.5 Pro, takes a different approach to AI agent-based vulnerability research. Using a human-like workflow and interacting with a set of tools to probe codebases for vulnerabilities, it discovered a stack buffer underflow in the widely used open-source database engine SQLite in October 2024.
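
A stack buffer underflow is the mirror image of the better-known overflow: an access before the start of a stack array rather than past its end. The sketch below is illustrative only and does not reproduce the SQLite flaw Big Sleep found.

```c
#include <string.h>

/* Walk backward trimming trailing spaces. If the string is all
   spaces, the missing i >= 0 guard lets i reach -1, so s[-1] is
   accessed: a read before the buffer, i.e., an underflow. */
static void trim_trailing(char *s) {
  int i = (int)strlen(s) - 1;
  while (s[i] == ' ') /* should be: i >= 0 && s[i] == ' ' */
    s[i--] = '\0';
}

int main(void) {
  char all_spaces[4] = "   ";
  trim_trailing(all_spaces); /* walks below index 0 */
  return 0;
}
```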

While manual intervention in these automated processes is greatly reduced, human decision-making still plays a key role: researchers ultimately review the results of AI tests and report the discovered vulnerabilities to project maintainers.

Current challenges and future prospects for AI-driven vulnerability research

While GenAI and agentic AI technologies have progressed rapidly over the past few years, applying these tools to vulnerability research is not without its hurdles. As seen with general-purpose LLMs like OpenAI’s ChatGPT or Google’s Gemini, issues like hallucinations and misuse by bad actors need to be addressed when working with GenAI models.

“One of the main challenges of using AI agents for vulnerability discovery is verification and validation of findings. Without proper validation, AI agents risk generating false positives, which can reduce trust in their effectiveness. Similarly, AI-driven vulnerability fixes pose a challenge: how can we ensure that a proposed fix truly resolves the issue without introducing new bugs or regressions?” Yakdan said.

Code Intelligence takes measures to reduce false positives and verify Spark’s results, Yakdan noted, using methods such as repeatedly re-running fuzz tests to confirm both the vulnerability and the reliability of proposed fixes, in addition to human review.
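
A sketch of that replay idea: run the saved crashing input through the fuzz target many times, expecting a sanitizer abort on every pass before the fix and none after. LLVMFuzzerTestOneInput is the standard libFuzzer entry point; the driver around it is our own scaffolding, not Spark’s implementation.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

extern int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s <reproducer-file>\n", argv[0]);
    return 2;
  }
  FILE *f = fopen(argv[1], "rb");
  if (!f) { perror("fopen"); return 2; }
  fseek(f, 0, SEEK_END);
  long n = ftell(f);
  rewind(f);
  if (n <= 0) { fclose(f); return 2; }
  uint8_t *data = malloc((size_t)n);
  if (!data || fread(data, 1, (size_t)n, f) != (size_t)n) {
    fclose(f);
    return 2;
  }
  fclose(f);

  /* Repeated replays weed out flaky, non-deterministic findings:
     a genuine bug under sanitizers should abort on every pass. */
  for (int i = 0; i < 100; i++)
    LLVMFuzzerTestOneInput(data, (size_t)n);

  free(data);
  puts("no crash in 100 replays");
  return 0;
}
```

In practice, a libFuzzer binary already replays a reproducer passed as a file argument; the standalone loop above just makes the repeat-until-confident check explicit.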

As with any new technology, there is also the possibility that threat actors will leverage AI to attempt to uncover zero-days to exploit. Reports by OpenAI last year revealed that threat actors are already attempting to use LLMs to perform tasks like reconnaissance and to search for information on known vulnerabilities.  

“A fundamental challenge in cybersecurity is the asymmetry between attackers and defenders: attackers only need to find a single exploitable weakness, while defenders must secure the entire software stack,” Yakdan explained. “However, defenders have a critical advantage—they typically have access to the source code and can proactively apply the best available security testing methods before attackers strike.”

Yakdan says cyber defenders can get ahead of threat actors by using state-of-the-art technologies like automation and AI to find and remediate vulnerabilities before they can be exploited, and by integrating continuous vulnerability discovery into secure development workflows.

Looking ahead, AI-enhanced vulnerability remediation is the next step in advancing vulnerability research, automating not only the discovery of security flaws but also their fixing and mitigation.

“We are planning to enhance Spark with root cause analysis and automated patching, enabling it to not only find vulnerabilities, but also to propose and validate fixes," said Yakdan.

"The ultimate goal is to provide developers with a fully integrated solution that tests software, identifies security flaws, suggests patches, and verifies their correctness—all within an automated workflow.”
