AIxCC and RoboDuck
As part of Theori's open-sourcing of the Cyber Reasoning System (CRS) nicknamed RoboDuck that we developed for DARPA's AI Cyber Challenge (AIxCC), we wanted to talk a bit about how our system is designed. Our system is unique among the competitors in that it has a full pipeline to develop Proofs of Vulnerability (POVs) without fuzzing, symbolic execution, or related techniques. For more details about our LLM agents, check out this post; for walkthroughs of some agent trajectories, check out this post.
About AIxCC
The AI Cyber Challenge asks teams to produce a CRS that takes in large C or Java codebases from oss-fuzz and produces patches and POVs. A challenge may be a "full-mode" task, in which the CRS must attempt to find any triggerable bugs in the repository, or a "delta-mode" task, in which the CRS is restricted to looking for bugs only in a single code diff. Additionally, CRSs may receive static analysis reports in SARIF format that may be valid or may be false positives, and are asked to submit their assessment.
The specifics of the scoring get a bit complicated and can be found in the official rules document. At a high level, SARIF assessments are worth 1 point, POVs are worth 2 points, and patches are worth 6 points. The score for all submissions decays over time from 100% to 50% of the base score as challenges approach expiration. There is an additional score multiplier for accuracy, which is reduced when faulty or duplicate patches are submitted.
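To make the shape of the scoring concrete, here is a minimal sketch in Python. It assumes a simple linear time decay and a pre-computed accuracy multiplier purely for illustration; the exact decay curve and multiplier formula are defined in the official rules document.

```python
# Illustrative only: base points come from the text above, the linear decay
# and the accuracy_multiplier input are assumptions for this sketch.
BASE_POINTS = {"sarif": 1, "pov": 2, "patch": 6}

def submission_score(kind: str, elapsed_fraction: float, accuracy_multiplier: float) -> float:
    """Score one submission.

    elapsed_fraction: 0.0 at challenge start, 1.0 at expiration.
    accuracy_multiplier: 1.0 for a clean record, lower after faulty or
    duplicate patches.
    """
    base = BASE_POINTS[kind]
    decay = 1.0 - 0.5 * min(max(elapsed_fraction, 0.0), 1.0)  # 100% -> 50%
    return base * decay * accuracy_multiplier

# Example: a patch submitted halfway to expiration with a perfect accuracy record.
print(submission_score("patch", elapsed_fraction=0.5, accuracy_multiplier=1.0))  # 4.5
```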
All CRSs must run in the Azure cloud, be deployable from a single command (such as terraform or make), and run without any human intervention. Although they cannot use generic internet resources, CRSs are allowed to access LLM provider APIs. Additionally, while each CRS runs within a fixed, organizer-provided budget, the actual amount spent does not factor into the scoring calculations.
Our Approach
From the outset, team Theori opted for an "LLM-first" approach. Although we have extensive experience with fuzzing and other "standard" techniques (including members who participated in DARPA's Cyber Grand Challenge on the winning team Mayhem), our design focused on using LLMs for all aspects of the challenge, with fuzzing and static analysis techniques available as backup. This choice was made for several reasons: it was the most exciting and interesting way to build a system; as LLMs improve, such a system will reap the benefits for "free"; and even when AIxCC was announced in 2023, agentic LLM systems already showed great promise in security tasks. In contrast, some teams took a more cautious approach: using LLMs to produce inputs or grammars to assist existing techniques like fuzzing.
For patching, most teams have a fairly similar high-level approach: using LLMs or LLM agents to write source code that patches any found bugs.
A high-level diagram of our system architecture can be found here.
Although advances have continued in parallel outside of AIxCC, we are still not aware of any other system that uses LLMs at repository scale not just to identify bugs, but also to generate the inputs that trigger them. Moreover, our system operates with zero human input.
Bug Finding
The first step our system must take is finding bugs. This is done with both static analysis and fuzzing. Static analysis is performed both with traditional methods (in our case, fbinfer) and with two different LLM-based methods, each run with two different LLMs.
Static Analysis with Infer
We chose to use Infer for its support of multiple languages and its ability to identify bugs without needing specific rules written for each project. By performing interprocedural value analysis, Infer can spot bugs like null dereferences, out-of-bounds errors, or overflows. In practice, there are a lot of complications. To start, Infer needs to understand the source code well, which means instrumenting the compilation process. Although this is simple enough to do for C, for Java this ended up not being included in our CRS submission at all.
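For a sense of what the compilation instrumentation looks like, here is a rough sketch of driving Infer's compile-time capture from Python. It assumes a make-based build; the exact flags and how capture is integrated in our CRS differ.

```python
# Sketch only: assumes Infer is installed and the project builds with `make`.
import subprocess

def infer_capture_and_analyze(build_cmd: list[str], results_dir: str = "infer-out") -> None:
    # Capture phase: Infer wraps the build so it can observe every compile step.
    subprocess.run(["infer", "capture", "--results-dir", results_dir, "--"] + build_cmd, check=True)
    # Analysis phase: interprocedural analysis over the captured translation units.
    subprocess.run(["infer", "analyze", "--results-dir", results_dir], check=True)

infer_capture_and_analyze(["make", "-j8"])
```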
The next issue is that Infer generally produces reports when it cannot prove an operation is safe. While this is great, it also results in a large number of false positives when Infer fails to prove pieces of certain operations. Further, like most real-world codebases, Infer itself has bugs, which result in false negatives: true bugs that should be reported but are missed. Especially in the overflow-checking routines (which seem to be less used), Infer had some bugs that we fixed for our CRS.
In the end, Infer was still a valuable tool, but with our workflow and criteria, it had a false positive rate of around 99.9%.
Static Analysis with LLMs
While only a piece of the overall architecture of our CRS, static bug finding with LLMs is an incredibly deep and interesting area with a lot of research and engineering behind it. Overall, the concept is simple: feed code samples to LLMs and ask them to identify bugs. Due to the size of the codebases in question, this cannot be done with more complicated methods like LLM agents, and must instead be done with single-shot LLM completions.
Even then, a number of complications remain: how to prompt the LLMs, what particular code to include in the context when requesting vulnerabilities to be identified, how to structure the output, and so on. Our CRS has two main ways to perform this analysis: using "single function" contexts, or using large code chunks. In "single function" mode, the LLM is asked to analyze only one function at a time without much additional context. In the large-code-chunk mode, several entire source files can be added, selected carefully to group likely related code in the same context and promote useful interprocedural analysis.
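As a rough illustration of the "single function" mode, here is a minimal single-shot sketch assuming an OpenAI-style chat completions API. The prompt, output schema, and model name here are placeholders; the real prompts and models are best read from the source.

```python
# Minimal sketch of single-shot LLM static analysis on one function.
# Prompt wording, JSON schema, and model choice are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are a security auditor. Identify any memory-safety or logic bugs "
    "in the following function. Respond with a JSON list of findings, each "
    "with 'line', 'type', and 'description'. Respond with [] if none."
)

def analyze_function(source: str, model: str = "gpt-4o") -> list[dict]:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": source},
        ],
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return []  # treat unparseable output as "no findings"
```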
There is not much space in this blog post to go into details; the best way to learn more about the implementation of this component is to read the source. Interestingly, LLM-based static analysis found several bugs that were missed by other, traditional techniques. Of course, like most static analysis tools, it also still has a high false positive rate, depending on the models used and the mode of operation.
When given a diff, this task is much simpler: an LLM agent is given the diff itself and asked to look for bugs caused by the change. We use two agents in parallel, one with a version of the diff that is pruned to remove code that may not be relevant, based on compilation introspection. This analysis has far fewer false positives given its drastically reduced scope.
Fuzzing
Our fuzzing approach was to keep things simple. We use the libFuzzer harnesses provided by the projects to ensure basic fuzzing functionality. Given that many real-world projects use common data types, we also attempt to match harnesses with corpora downloaded before the competition.
To help when fuzzers get "stuck", an LLM-based agent attempts to use coverage data to find large pieces of never-reached code and craft inputs for them. Even in corpora bootstrapped with seeds that had saturated coverage, this agent is commonly able to produce seeds that lead to new coverage. For more information specifically about this feature, check out this writeup. Additionally, when agents attempt to generate POVs, any crafted inputs are also fed to the fuzzers as seeds. In some cases, we observed the fuzzer using these seeds to beat the POV generator at crashing its target!
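The seed feedback loop itself is mechanically simple; a minimal sketch of dropping agent-crafted inputs into a libFuzzer corpus directory might look like the following. The naming scheme and directory layout are assumptions.

```python
# Sketch: persist an agent-crafted input as a corpus file so the fuzzer can
# pick it up the next time it scans the corpus directory.
import hashlib
import pathlib

def add_seed(corpus_dir: str, data: bytes) -> pathlib.Path:
    path = pathlib.Path(corpus_dir) / hashlib.sha1(data).hexdigest()
    path.write_bytes(data)
    return path
```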
As fuzzer results are not "bug reports" like those produced by static analysis, fuzzer crashes must be triaged to produce natural-language descriptions. This is done first by bucketing crashes by stack hash, then by any patches that have been developed, and finally by using an LLM agent to produce a bug report for the crash and attempt to match it against any existing vulnerabilities.
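To show what the first bucketing stage looks like, here is a simplified sketch of deduplicating crashes by a hash of the top frames of their stack trace. The frame depth and normalization are illustrative choices, not the exact ones used in our CRS.

```python
# Sketch of stack-hash bucketing: crashes with the same top frames
# (ignoring addresses/offsets) land in the same bucket.
import hashlib
from collections import defaultdict

def stack_hash(frames: list[str], depth: int = 5) -> str:
    normalized = [f.split("+")[0].strip() for f in frames[:depth]]
    return hashlib.sha256("\n".join(normalized).encode()).hexdigest()[:16]

buckets: dict[str, list[dict]] = defaultdict(list)

def bucket_crash(crash: dict) -> str:
    key = stack_hash(crash["frames"])
    buckets[key].append(crash)
    return key
```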
Bug Filtering
Given the high number of false positives produced by static analysis, it is essential to filter them down before attempting to patch or trigger them. This is a very "real world" problem: many humans deal with the pain of filtering through thousands or tens of thousands of static analysis reports to find the hidden handful of valuable results.
This component is itself broken down into two pieces: vulnerability scoring and agent-based analysis.
Vulnerability Scoring
This component is another piece that is deceptively complex and important. At its core, this is a single LLM classification asking an LLM to output a single token indicating whether a bug is "likely" or "unlikely" to be a real result. Rather than using the binary output, the log-probability of each token is used, and several samples are drawn to overcome nondeterminism due to MoE scheduling. The output of this step is a score between 0 and 1, which can be used to prioritize bugs for the next step of analysis.
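A rough sketch of this idea, assuming an OpenAI-style API that exposes token log-probabilities, is shown below. The prompt, labels, sample count, and model name are placeholders; only the overall shape (single-token answer, log-probabilities averaged across samples) reflects the description above.

```python
# Illustrative log-probability scoring for one static analysis report.
import math
from openai import OpenAI

client = OpenAI()

def likelihood_score(report: str, model: str = "gpt-4o", samples: int = 5) -> float:
    """Return an estimate in [0, 1] that the report describes a real bug."""
    scores = []
    for _ in range(samples):
        resp = client.chat.completions.create(
            model=model,
            max_tokens=1,
            logprobs=True,
            top_logprobs=5,
            messages=[
                {"role": "system",
                 "content": "Answer with exactly one word: 'likely' if this "
                            "static analysis report is a real, triggerable "
                            "bug, otherwise 'unlikely'."},
                {"role": "user", "content": report},
            ],
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        # Probability mass assigned to an answer starting with "likely".
        p_likely = sum(math.exp(t.logprob) for t in top
                       if t.token.strip().lower().startswith("likely"))
        scores.append(p_likely)
    return sum(scores) / len(scores)  # average across samples for stability
```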
Agent Based Analysis
To further analyze vulnerabilities, we use an LLM agent. This agent has access to source-browsing tools and is tasked both with determining whether the report is valid and with providing additional information on reports it deems valid. While vulnerability scoring costs around $0.001 per report, this analysis is closer to $0.50. With potentially tens of thousands of reports, only the top 20% of scored vulnerabilities are analyzed, up to a budget cap set per task. This two-phase approach keeps costs under control and balances false positives and false negatives in a way that is simple to adjust.
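The prioritization policy is easy to express in a few lines. In this sketch, the 20% cut-off and per-report cost come from the text above, while the budget plumbing and function names are assumptions.

```python
# Sketch of selecting which scored reports get the expensive agent analysis.
def select_for_agent_analysis(reports, budget_usd, cost_per_report=0.50, top_fraction=0.2):
    """reports: list of (score, report) tuples from the scoring phase."""
    ranked = sorted(reports, key=lambda r: r[0], reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))          # top 20% by score
    affordable = int(budget_usd // cost_per_report)           # per-task budget cap
    return [report for _, report in ranked[:min(cutoff, affordable)]]
```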
POV Generation
Identified bugs that pass through both stages of the filtering are then passed to the POV generator. This is another LLM agent, this time tasked with developing inputs that trigger the specified bug.
Input Encoder
All POVs use what we call an "input encoder". This is a Python function that takes parameters and produces a binary blob that can be used by a harness. The parameters the function takes are up to the LLM agent producing the script to decide. Ideally, the parameters are semantically meaningful pieces that can be encoded easily (for example, user and password strings), though in some cases the input is just a single binary or text blob.
This encoder is tested in isolation: for example, if there is a username parameter, a test string can be injected and the agent can set a breakpoint to verify that the correct value made it to the correct variable.
This encoder is complex and essential to get right, and it is used by any POV producer targeting the same harness. By centralizing the effort and doing extra testing, we can cache the resulting encoder for future attempts. Having this as a "given" helps focus the POV producer on the task of triggering the bug, without splitting its effort between that and debugging the input format.
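For a flavor of what such an encoder looks like, here is a hypothetical one for an imaginary login-style harness. The wire format, field names, and parameters are invented for illustration; in practice the LLM agent writes an encoder tailored to each harness.

```python
# Hypothetical input encoder: packs semantically meaningful parameters
# into a binary blob that an (imaginary) harness would parse.
import struct

def encode(username: str, password: str, opcode: int = 1) -> bytes:
    u = username.encode()
    p = password.encode()
    # opcode byte, then two little-endian length fields, then the payloads.
    return struct.pack("<BHH", opcode, len(u), len(p)) + u + p

blob = encode("admin", "hunter2")  # `blob` is what gets fed to the harness
```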
POV Producer
The POV Producer itself is relatively simple. It does not have access to source code directly; instead, it is given a generic "source questions" agent. It also has access to a debug agent (not a debugger directly), and the ability to test POVs and fetch input encoders. For more information on the specifics, check out our other blog post.
The POV producer runs in parallel with three different LLM models: Anthropic's Claude Sonnet 3.5 and 4, and OpenAI's o3. While this parallel execution increases our running cost, it reduces the latency of the system, which helps with the time component of the scoring system. Whenever a producer succeeds (verified by running the POV and checking for a crash), the sibling jobs are terminated.
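A minimal sketch of this race-to-first-success pattern with asyncio is below. The model names and the `produce_and_verify_pov` coroutine are placeholders; the point is only the structure of running producers in parallel and cancelling the siblings once one verified POV exists.

```python
# Sketch: run one POV producer per model and keep the first verified result.
import asyncio

MODELS = ["model-a", "model-b", "model-c"]  # placeholder names

async def first_successful_pov(bug_report):
    tasks = [asyncio.create_task(produce_and_verify_pov(bug_report, m))  # placeholder coroutine
             for m in MODELS]
    try:
        for done in asyncio.as_completed(tasks):
            pov = await done
            if pov is not None:       # verified: the input actually crashed the target
                return pov
        return None
    finally:
        for t in tasks:
            t.cancel()                # terminate the sibling producers
```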
Patch Generation
At the same time that bugs are given to the POV generator, they are also passed to the patch generator. Once again, this is an optimization to reduce latency at the cost of increased spending. The patch generator is much like what one might expect from a modern coding agent, though it was largely written in mid-2024. One interesting feature is that agents produce diffs, which are validated by a sequence alignment function to identify where to place them and to correct minor ambiguities. Like many pieces of our CRS, there is a lot of engineering and research behind this single small component.
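As a simplified illustration of the sequence alignment idea, the sketch below locates where a model-written hunk belongs in a file using Python's difflib. The threshold and windowing are assumptions; the real implementation also handles minor ambiguities such as whitespace drift.

```python
# Sketch: find the best placement for a diff hunk via sequence alignment.
import difflib

def locate_hunk(original_lines: list[str], hunk_context: list[str]) -> int | None:
    """Return the line index in the original file that best matches the hunk's context."""
    best_ratio, best_idx = 0.0, None
    window = len(hunk_context)
    for i in range(len(original_lines) - window + 1):
        ratio = difflib.SequenceMatcher(
            None, original_lines[i:i + window], hunk_context
        ).ratio()
        if ratio > best_ratio:
            best_ratio, best_idx = ratio, i
    return best_idx if best_ratio > 0.8 else None  # threshold is illustrative
```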
When POVs are created, or fuzzing inputs are found and associated with a vulnerability, any associated patches are automatically tested to verify that they actually remediate the findings. If not, the patching process is restarted, and the failing test cases are used in the automated verification of the new patch to ensure it correctly remediates the bugs.
Scaffolding and Glue
Unfortunately, a large amount of the work in building a CRS is in the support code that makes sure the system can do its job. Although our CRS performs well in our testing, the main failure modes we have needed to address have been issues in this scaffolding rather than fundamental issues with the overall approach.
Orchestration and Scheduling
The first obvious requirement is orchestration of all these agents and tasks. Our CRS used a single async Python process to manage all of the work, which included launching thousands of concurrent workers. Job scheduling managed access to intentionally limited resources such as LLM APIs and Docker hosts.
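A minimal sketch of this pattern is shown below: many workers are created up front, while semaphores gate access to the scarce resources. The concrete limits and resource names are illustrative, not the values used in our CRS.

```python
# Sketch: a single async process with semaphores guarding limited resources.
import asyncio

LLM_SLOTS = asyncio.Semaphore(64)      # concurrent LLM API requests (illustrative)
DOCKER_SLOTS = asyncio.Semaphore(16)   # concurrent remote container jobs (illustrative)

async def with_slot(semaphore: asyncio.Semaphore, coro_factory):
    """Run a job only once a slot for its limited resource is free."""
    async with semaphore:
        return await coro_factory()

async def main(jobs):
    # Thousands of workers can exist at once; the semaphores decide how many
    # actually touch each scarce resource at a time.
    await asyncio.gather(*(with_slot(LLM_SLOTS, job) for job in jobs))
```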
Source Browsing
Rather than using a generic tool such as the ability to run arbitrary Unix commands, all of the source browsing by our agents is done through specific, dedicated tools. A bit more about this can be found in our blog post about agent design. In terms of tooling, this means supporting several data sources, ranging from simple ones such as TreeSitter and GTags to more complex ones such as Joern and clang-ast.
Builds
For almost all the tasks (though notably not the LLM static analysis), the CRS needs to build the target project. In general, this is a difficult problem, only made possible by the packaging already done by oss-fuzz. Still, we need several different builds: builds with debugging information, builds with coverage information, instrumented builds for static analysis, builds with each configured sanitizer, and builds with any patches attempted. Some builds are done eagerly to reduce latency, and others lazily to reduce resource usage, with caching to alleviate repeated work.
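One way to picture the caching of build variants is a memo table keyed by project, build kind, and any applied patch, as in the toy sketch below. The key fields and the `run_build` helper are assumptions for illustration only.

```python
# Toy sketch: memoize builds by (project, build kind, patch) so repeated
# requests for the same configuration reuse earlier work.
import hashlib

_build_cache: dict[tuple, str] = {}

def get_build(project: str, kind: str, patch: bytes | None = None) -> str:
    """kind: e.g. 'debug', 'coverage', 'infer', 'asan'. Returns a path to build output."""
    patch_id = hashlib.sha256(patch).hexdigest()[:12] if patch else "base"
    key = (project, kind, patch_id)
    if key not in _build_cache:
        _build_cache[key] = run_build(project, kind, patch)  # placeholder build step
    return _build_cache[key]
```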
Docker Running
Resource-intensive work like builds and fuzzing is not run on the same host as the actual CRS, as this would lead to resource exhaustion. Rather than using a dynamic scaling approach like Kubernetes, we opt for a fixed allocation of hosts running Docker. Any request to run non-trivial code is sent off to a remote host. This allows simple scaling and predictable costs for our system, minimizes unpredictability for a system that isn't allowed to have a site-reliability engineer watching over it, and minimizes the time spent waiting for tasks to spin up.
Because these containers run remotely, a lot of data must be shipped to and from them over the network. In the version of our CRS submitted for the finals, this is done simply by sending tarfiles or other simple files over the network.
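A rough sketch of that data movement, packaging a workspace as a tarball and posting it to a runner endpoint, is shown below. The endpoint, header, and transport details are invented for illustration; only the "ship tarfiles over the network" idea comes from the text.

```python
# Sketch: package a workspace and ship it to a (hypothetical) remote runner.
import io
import tarfile
import urllib.request

def pack_workspace(path: str) -> bytes:
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(path, arcname=".")
    return buf.getvalue()

def submit_job(runner_url: str, workspace: str, command: list[str]) -> bytes:
    payload = pack_workspace(workspace)
    req = urllib.request.Request(f"{runner_url}/run", data=payload,          # hypothetical endpoint
                                 headers={"X-Command": " ".join(command)})   # hypothetical header
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # e.g. a tarball of results
```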
Conclusion
Despite the challenges involved in creating a system that can find, trigger, and patch bugs in large, unknown codebases, we were able to realize a novel solution that performs well across a variety of projects and languages. While some choices and design decisions appear straightforward in hindsight, even the decision to use LLMs to directly craft POVs was not one taken by other teams. Although this post is still not a complete view of our CRS, this outline should provide enough information to follow along in our code and to better understand our other blog posts.