Offensive Security with Large Language Models (1)

Applications of large language models in offensive security
Xint
Sep 27, 2024

Introduction

The rise of large language models (LLMs) has been nothing short of revolutionary. Once confined to assisting with language tasks like text generation and summarization, these models are now breaking into more technical fields — including cybersecurity. LLMs, like OpenAI’s ChatGPT or Anthropic’s Claude, are transforming how we tackle security challenges. But what if these models could help hackers identify vulnerabilities or even exploit them? That’s exactly where we find ourselves today. Offensive security is no longer the exclusive domain of seasoned hackers but is now increasingly accessible to anyone with the right AI tools.

In this post, we’ll explore how LLMs are shaking up the world of offensive security. While much of the focus on LLMs has been on defense — like detecting threats or writing secure code — there’s an entire wave of research exploring how these models can be used to attack systems, whether through fuzzing, vulnerability detection, or even automating complex hacking techniques.

Cybersecurity and Large Language Models

LLMs have predominantly been used in defensive security. For example, Microsoft’s Copilot for Security helps write security-conscious code, while AI and LLM-based technologies are used to detect threats based on threat intelligence. A recent paper titled Large Language Models in Cybersecurity: State-of-the-Art notes that while defensive applications are well-documented, the use of LLMs in offensive security remains a relatively underexplored area.

However, LLMs are now beginning to gain traction in offensive security, particularly for automatic vulnerability detection and fuzzing. Traditionally, identifying vulnerabilities and evaluating their exploitability demanded significant human expertise, time, and resources. Now, with the rise of LLMs like OpenAI’s ChatGPT and Anthropic’s Claude, these models are evolving beyond simple pattern recognition and beginning to interpret the deeper logic and flow of code and binaries. This advancement has sparked growing interest and research into how LLMs can automate vulnerability detection with greater precision.

OSS-Fuzz-Gen: Merging Google’s OSS-Fuzz and AI

Open source testing has long been a key area of research, with Google’s OSS-Fuzz being one of the most prominent tools for automatically fuzz testing open-source projects. While OSS-Fuzz has been operational for over eight years, it requires human input to write seed corpora (sample inputs) and fuzzer harnesses (test programs), limiting its coverage to around 30%. To address these limitations, OSS-Fuzz-Gen was developed, incorporating large language models (LLMs) to automate harness generation.

In OSS-Fuzz-Gen, LLMs generate the necessary harnesses based on a provided list of Application Programming Interfaces (APIs). If compilation errors arise, the LLM refines the harness using error messages until it’s executable. After each fuzzing round, the system evaluates the coverage and iteratively improves the harness. This automated approach drastically reduces the need for human intervention while improving the efficiency of fuzzing tests.
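To make the loop concrete, here is a minimal Python sketch of this generate-compile-refine cycle. The helpers `llm_complete`, `compile_harness`, and `run_fuzzer` are hypothetical stand-ins, not part of OSS-Fuzz-Gen’s actual codebase.

```python
# Minimal sketch of an LLM-driven harness generation loop in the spirit of
# OSS-Fuzz-Gen. `llm_complete`, `compile_harness`, and `run_fuzzer` are
# hypothetical stand-ins for the real pipeline components.

def generate_harness(project: str, api_signature: str, max_fix_attempts: int = 5):
    """Ask the LLM for a fuzz harness, then iteratively repair compile errors."""
    harness = llm_complete(
        f"Write a libFuzzer harness for {project} that exercises:\n{api_signature}"
    )
    for _ in range(max_fix_attempts):
        ok, error_log = compile_harness(project, harness)
        if ok:
            return harness
        # Feed the compiler diagnostics back to the model and ask for a fix.
        harness = llm_complete(
            "The harness below fails to compile.\n"
            f"Harness:\n{harness}\nCompiler output:\n{error_log}\n"
            "Return a corrected harness."
        )
    return None  # give up on this API if it never compiles


def fuzz_and_refine(project: str, api_signature: str, rounds: int = 3):
    """Run fuzzing rounds, then ask the LLM to extend coverage after each one."""
    harness = generate_harness(project, api_signature)
    for _ in range(rounds):
        if harness is None:
            return
        coverage = run_fuzzer(project, harness)  # fraction of branches reached
        harness = llm_complete(
            f"This harness only reached {coverage:.0%} branch coverage:\n{harness}\n"
            "Rewrite it to exercise additional code paths."
        )
```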

The benefits of OSS-Fuzz-Gen are evident in its results. For example, in the TinyXML2 project, coverage increased by 31% compared to OSS-Fuzz alone. Other projects experienced improvements ranging from 1–10%. These results show that OSS-Fuzz-Gen has evolved beyond being a simple testing platform and is emerging as a tool for continuous code quality enhancement.

LLM-Based Fuzzing Innovation: PromptFuzz

[Figure: PromptFuzz’s fuzz driver generation process]

Building on the advancements of OSS-Fuzz-Gen, further research into LLM-based fuzzing has led to the development of tools like PromptFuzz (Prompt Fuzzing for Fuzz Driver Generation). Introduced in November 2023, PromptFuzz automates the generation of fuzzer harnesses by parsing API-related data types from header files, significantly reducing setup time. While OSS-Fuzz-Gen focuses on individual APIs, PromptFuzz goes a step further by generating harnesses for API sequences, accounting for interactions between different APIs — a critical improvement for detecting more complex vulnerabilities.

Additionally, PromptFuzz introduces a sophisticated method for analyzing constraints on test inputs based on the fuzzing results. This process helps bypass trivial branches in the code and improves coverage by discovering previously uncovered unique branches. By doing so, PromptFuzz enhances the efficiency of fuzzing efforts, advancing automated software testing beyond what traditional tools could achieve.
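As a rough illustration of the difference, the sketch below prompts for a harness over an API sequence parsed from a header file. `parse_header_apis` and `llm_complete` are hypothetical helpers; PromptFuzz’s real implementation is considerably more elaborate.

```python
# Rough sketch of generating a harness for an API *sequence*, in the spirit of
# PromptFuzz. `parse_header_apis` and `llm_complete` are hypothetical helpers.

def build_sequence_harness(header_path: str, sequence_len: int = 3) -> str:
    # e.g. [("png_create_read_struct", ["const char*", "png_voidp", ...]), ...]
    apis = parse_header_apis(header_path)
    prompt = (
        "Write a libFuzzer harness that chains the following C APIs in a "
        "realistic order, passing the outputs of earlier calls into later ones:\n"
    )
    for name, param_types in apis[:sequence_len]:
        prompt += f"- {name}({', '.join(param_types)})\n"
    prompt += (
        "Derive all arguments from the fuzzer-provided data buffer and respect "
        "any size or ordering constraints implied by the signatures."
    )
    return llm_complete(prompt)
```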

Project Naptime: Google Project Zero’s LLM Initiative

Google Project Zero’s latest venture, Project Naptime, is generating buzz among offensive security researchers for its innovative application of LLMs in vulnerability discovery and exploitation. Led by security experts Sergei Glazunov and Mark Brand, the project aims to explore how LLMs can significantly enhance the offensive security landscape by automating critical tasks that traditionally require deep technical expertise.

Earlier this year, Meta launched CyberSecEval 2, a benchmark for detecting memory safety bugs and crafting exploits using LLMs. The Project Zero team managed to outperform these benchmarks dramatically — achieving up to a 20-fold increase in efficiency — by deploying an LLM-driven fuzzing framework. The team humorously dubbed the project Naptime because they can “take regular naps while it helps us out with our jobs.” You can find more details in their blog post.

What sets Project Naptime apart is its well-engineered framework that maximizes the strengths of LLMs, built on several key design principles:

  1. Space for Reasoning: Encouraging extensive reasoning in well-defined contexts.

  2. Interactive Environment: Providing an interactive environment in which the model can act, observe results, and correct its mistakes.

  3. Specialized Tools: Providing LLMs with the same debugging and scripting tools that human researchers use.

  4. Perfect Verification: Structuring the vulnerability discovery process so that findings can be verified and reproduced reliably.

  5. Sampling Strategy: Employing sampling strategies to efficiently explore multiple hypotheses (e.g., vulnerability discovery and exploitation).

[Figure: Naptime architecture]

As illustrated in the diagram above, Project Naptime’s architecture allows AI agents to imitate the workflow of human security researchers. The AI agent is equipped with essential tools such as a debugger, a code browser for navigating the target codebase, Python for scripting, and a reporter to track and verify the overall process. In essence, the AI agent mirrors human roles by utilizing the same tools offensive security experts employ, enhanced with LLM capabilities.

This setup allows Naptime to imitate the iterative, hypothesis-driven techniques used by researchers in vulnerability detection. The design principles ensure not only effective vulnerability identification but also the reproducibility of results, which is critical in security research.
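Project Zero has not released Naptime’s implementation, but the agent loop it describes might look roughly like the sketch below, in which every tool and function name is hypothetical.

```python
# Illustrative sketch of a tool-driven agent loop like the one Naptime describes
# (debugger, code browser, Python scripting, reporter). Every name here is
# hypothetical; Project Zero has not published the actual implementation.

TOOLS = {
    "code_browser": browse_source,    # look up functions and definitions in the target
    "debugger": run_under_debugger,   # set breakpoints, inspect memory, catch crashes
    "python": run_python_snippet,     # craft and mutate candidate inputs
    "reporter": submit_finding,       # record a crash for independent verification
}

def naptime_style_agent(task_description: str, max_steps: int = 50):
    history = [task_description]
    for _ in range(max_steps):
        # Give the model room to reason, then ask it to pick exactly one tool call.
        action = llm_next_action("\n".join(history), tools=list(TOOLS))
        observation = TOOLS[action.tool](**action.arguments)
        history.append(f"{action.tool}({action.arguments}) -> {observation}")
        if action.tool == "reporter" and observation.verified:
            break  # a reproducible crash was confirmed; end this trajectory
```

Running many such trajectories in parallel and keeping only the verified findings corresponds to the sampling and verification principles listed above.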

Naptime uses commercial LLMs, such as GPT and Gemini, as its AI agents. It was evaluated against Meta’s CyberSecEval 2 benchmark, which includes a comprehensive assessment of memory-safety vulnerabilities (e.g., buffer overflows and advanced memory corruption).

Ultimately, the Project Zero team confirmed that LLMs could be effectively employed to solve isolated CTF (Capture The Flag) style security challenges in controlled environments. However, real-world security work is far more intricate. It requires not only understanding large systems but also prioritizing and analyzing key factors from an attacker’s viewpoint. While LLMs show immense promise, replicating these successes in real-world environments remains a challenge. As the Project Zero team emphasized, for LLMs to truly match human researchers, they will need to operate with the same flexibility, continuously generating, testing, and refining complex hypotheses — just as human security experts do every day.

AIxCC: Strengthening Open Source Security with AI

[Figure: Winners of the AIxCC semifinals]

In today’s interconnected world, the rapid development and widespread use of open-source software have significantly expanded the attack surface for cyber threats. With open-source software integrated into critical sectors like transportation, energy, and emergency systems, its security is of paramount concern. To address these growing risks, the U.S. Defense Advanced Research Projects Agency (DARPA) launched the AI Cyber Challenge (AIxCC). This competition challenges participants to build AI systems that can autonomously detect, prove, and patch vulnerabilities in open-source projects.

The aim of AIxCC is to create more secure open-source software by leveraging advanced AI models. Participants are tasked with analyzing open-source projects, identifying vulnerabilities introduced in new commits, and submitting the problematic input corpus along with patches to demonstrate the effectiveness of their AI systems.
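A simplified, hypothetical sketch of that task shape is shown below; the real competition interface, constraints, and scoring are considerably more involved, and all helper names here are illustrative.

```python
# Simplified sketch of the AIxCC workflow described above: inspect a new commit,
# ask the LLM where it introduces a vulnerability, then submit a triggering
# input and a candidate patch. All helper names are hypothetical.

def analyze_commit(repo: str, commit: str) -> None:
    diff = get_commit_diff(repo, commit)
    finding = llm_complete(
        "Review this diff for memory-safety or logic vulnerabilities and, if one "
        f"exists, describe an input that triggers it:\n{diff}"
    )
    pov_input = craft_proof_of_vulnerability(finding)  # the triggering corpus entry
    if reproduce_crash(repo, commit, pov_input):
        patch = llm_complete(f"Write a minimal patch fixing this issue:\n{finding}\n{diff}")
        submit(pov_input=pov_input, patch=patch)
```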

In comparison to DARPA’s Cyber Grand Challenge (CGC) held in 2016, which focused on lower-level technical skills like binary-level patching, AIxCC centers around source code vulnerabilities. Participants can leverage LLMs to detect vulnerabilities at the source code level and develop patches. AIxCC requires participants to rely solely on LLM APIs provided by the competition, emphasizing higher-level software security and automation using AI tools.

More details on Theori’s involvement in AIxCC can be found in a dedicated post.

Conclusion

In this article, we examined the growing role of LLMs in offensive security. As their ability to interpret and generate code advances, LLMs are playing an increasingly vital role in automating software security and vulnerability testing. Major organizations like Google, DARPA, and Meta are actively leveraging LLMs to enhance the detection and remediation of security flaws in code. While these technologies are still evolving and face limitations when dealing with complex security challenges, they hold immense promise for the future of open-source software security.

Looking ahead, the key to advancing offensive security lies in blending human expertise with LLM-driven automation. This combination has the potential to push cybersecurity research forward in unprecedented ways. Rather than replacing human security researchers, these technologies are poised to free up time by automating routine tasks, allowing experts to focus on innovation and tackling more complex security challenges. In our next blog post, we’ll take a closer look at how LLMs are used to strengthen the security of web services. Stay tuned!

About Xint

Xint is an advanced Unified Security Posture Management (USPM) platform from Theori, the leading cybersecurity firm specializing in offensive security solutions. Our offensive approach ensures we deliver optimal security solutions by analyzing vulnerabilities from an attacker’s perspective. Serving a diverse client base of all sizes, from ambitious startups to Fortune 500 enterprises, we are the leading experts in identifying and addressing industry-agnostic threats.

To learn more about Xint,
▪️ Visit our website
▪️ Visit Theori’s website
▪️ Follow us on X
▪️ Follow us on LinkedIn

