Winning the AIxCC Qualification Round

Theori’s Cyber Reasoning System (CRS) “Robo Duck” not only cleared the bar to get us $2M and a spot at the AIxCC finals in 2025, it also took first place among all submissions in this highly competitive event.
Sep 23, 2024

In August, Theori’s CTF team, as part of the Maple Mallard Magistrates, won Defcon CTF for the 3rd year in a row, the first team ever to do so! But while we were diligently hacking, so too was our DARPA AIxCC submission. Theori’s Cyber Reasoning System (CRS) “Robo Duck”, written before we went to Las Vegas, not only cleared the bar to get us $2M and a spot at the AIxCC finals in 2025, it also took first place among all submissions in the highly competitive event, earning the most achievements and finding the most unique classes of bugs!

Total Achievements earned over time

About AIxCC

DARPA’s AIxCC is the spiritual successor to the Cyber Grand Challenge, which culminated in 2016. The fundamental concept is simple: create computer systems capable of finding, exploiting, and patching software vulnerabilities with no human involvement at all. The Cyber Grand Challenge focused on a specific area: Linux-like programs compiled to x86. Although working with compiled programs is difficult (a lot of information is lost when source code is transformed into machine code), it also has several advantages: there is an obvious specification defined by the loader and processor, the program is runnable, and without dynamic libraries (which were not in scope for CGC) everything is self-contained. In the Cyber Grand Challenge, hundreds of programs were specifically crafted for the competition to test different areas of bug finding and patching.

For AIxCC, a lot has changed. Challenge projects (CPs) are now open source repositories which can span a gigabyte of source code. For the qualifying round, this source code could be in Java or C. The CPs could take several inputs from different harnesses, which may go through arbitrarily complex pathways (network stacks, file systems, maybe even other applications) before they get to the actual challenge code. While source code access has benefits, it comes with a lot of costs as well!

To limit the scope to something still practically relevant, but also achievable and scorable, AIxCC considers only commits made on top of a base repository of code, meaning existing bugs in the repository are not considered, only those added in the subsequent commits. A found bug does not count on its own; it must be accompanied by a specific input which triggers the bug, a commit hash for where the bug was introduced, and the specific type of security check which we expect to fail when the bug is triggered. Patching can only be performed to fix an accepted bug, and is submitted as source diffs which must compile and keep functionality for a set of public and private tests.

Additionally, there is a standardization layer applied on top of all repositories. Each repository has a small shell script which can build the project, run public tests against the project, and run an input against one of potentially several harnesses for the project. This helps alleviate trying to build arbitrary projects without access to the internet and standardizes the interface for scoring. An early version of a CP (with some outdated interfaces) is available publicly here.
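To make that standardization layer concrete, here is a minimal sketch of how a CRS might drive it from Python. The `run.sh` entry point and the `build`, `run_tests`, and `run_pov` subcommands are modeled loosely on the public exemplar mentioned above; treat the exact names and arguments as illustrative assumptions rather than the real competition interface.

```python
import subprocess
from pathlib import Path

class ChallengeProject:
    """Hypothetical wrapper around a CP's standardization script.

    We assume a ./run.sh entry point with build / run_tests / run_pov
    subcommands; the real interface may differ in names and arguments.
    """

    def __init__(self, cp_root: str):
        self.cp_root = Path(cp_root)

    def _run(self, *args: str) -> subprocess.CompletedProcess:
        # Every interaction goes through the project's own script, so the
        # CRS never needs to know how to build or run the project itself.
        return subprocess.run(["./run.sh", *args], cwd=self.cp_root,
                              capture_output=True, text=True)

    def build(self) -> bool:
        return self._run("build").returncode == 0

    def run_tests(self) -> bool:
        return self._run("run_tests").returncode == 0

    def run_pov(self, blob_path: str, harness: str) -> subprocess.CompletedProcess:
        # Feed a candidate crashing input (a "proof of vulnerability")
        # to one of the project's harnesses.
        return self._run("run_pov", blob_path, harness)
```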

Inside the AIxCC village at Defcon

Of course, with unlimited compute resources there are a lot of possibilities for finding and patching bugs, but these may not scale to the huge volume of software that is actually being produced today. Further, there may be fairness issues if teams could use their own resources in the competition. To that end, all systems in AIxCC are limited to 4 hours to analyze each system with access to 3 moderate compute nodes and $100 in credits for LLM APIs.

While this set of requirements is somewhat daunting for a fully automated system, it is fairly practical: at Theori we perform a lot of code audits which could immediately benefit from tooling that is capable of identifying and fixing bugs, and by requiring inputs we ensure that we don’t waste our very valuable analysts’ time with false positives. Further, scaling up to analyzing hundreds of projects means costs and resources can balloon quickly if not limited.

Entering AIxCC

At Theori, we were excited when AIxCC was first announced. We work not only in security but also have a great team working on using and securing AI, so this event seemed right up our alley. We are also a very competitive group: our employees have played CTFs for over a decade, competed in and won Defcon CTF more times than anyone else, and our company’s CTF team The Duck often dominates the leaderboards of competitions. In addition, we have expertise from the Cyber Grand Challenge, with members from the winning team. We definitely believe in showing our skills through actions, not words, and we believe competitions like this are a great way for us to do so.

Official announcement of our qualification at Defcon

Although the funded track of AIxCC was compelling, we chose not to submit because there was not quite enough information available at the time for us to write a proposal we believed in. Still, we kept close tabs on the competition, and when more information was available and the open track was about to close at the end of April, we decided to submit a draft for what we wanted to build, which was accepted in May. With so many uncertainties and a busy work schedule, we spared two highly experienced employees for the project, both of whom were excited by the prospects of AIxCC.

Our Approach

Bug Finding

Given the design and goals of the competition, Theori’s CRS focused on merging traditional static and dynamic techniques with frontier LLM models. While fuzzing and symbolic execution are exciting and versatile, our compute resources and the design of the competition led us to focus on a design that uses LLMs to enhance the efficacy of these existing methods as much as possible.

Of course, the naive approach of “dump all the code to an LLM and ask it for the bugs” doesn’t work, nor does “dump each commit to an LLM and ask it for the new bugs”. Bugs commonly exist not simply in a line of code, but as part of the context for a larger program: shortening a buffer, removing a length check, adding a new function, can all be safe or unsafe depending on the surrounding context. Although some frontier models now support context sizes up to 2M tokens, there are a number of limitations: that is still too small to ingest an entire large codebase; even if large contexts are supported, model performance tends to drop at high context utilization; and the cost and throughput suffer greatly at high token counts. With the cheapest model available in the competition (Claude 3 Haiku at $0.25/M tokens), just ingesting all of the Linux Kernel across several queries would cost around $130, already over the budget of our analysis.

Token counts for context windows of available models and the Linux Kernel source tree for comparison. *While Gemini supports 2M context sizes, that model was not available for the competition
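To sanity-check that cost estimate, the arithmetic is straightforward; the sketch below uses an assumed token count for the kernel tree (chosen to be consistent with the ~$130 figure above), not a measured value.

```python
# Back-of-envelope cost of pushing an entire codebase through an LLM once.
# The token count is an illustrative assumption, not a measurement.
PRICE_PER_MTOK = 0.25          # Claude 3 Haiku input price used in the event
KERNEL_TOKENS = 520_000_000    # assumed rough token count for the Linux kernel tree
LLM_BUDGET = 100               # per-CP LLM credit limit

cost = KERNEL_TOKENS / 1_000_000 * PRICE_PER_MTOK
print(f"~${cost:.0f} just to read the code once")            # -> ~$130
print(f"over the ${LLM_BUDGET} budget: {cost > LLM_BUDGET}")  # -> True
```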

Still, there are a number of ways in which LLMs can assist the vulnerability finding process: generating test cases, writing new or more specific fuzzer harnesses, modifying source to remove fuzzer roadblocks, triaging found bugs, and so on. Modern LLMs can also be extended beyond simple text to call user-defined tools which expand their capabilities even further, and new features and model improvements are coming out at an incredible pace. We are looking forward to seeing what we can put together for the final round in August 2025.
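As a flavor of what tool use can look like in a CRS, here is a small, hypothetical sketch: the model is offered a couple of code-navigation tools and the system dispatches whatever calls come back. The tool-call format and the functions themselves are stand-ins for illustration, not a particular vendor API and not our actual toolset.

```python
import subprocess
from pathlib import Path

# Hypothetical code-navigation tools exposed to an LLM during bug finding.
def read_source(path: str, start: int, end: int) -> str:
    lines = Path(path).read_text(errors="replace").splitlines()
    return "\n".join(lines[start - 1:end])

def grep_repo(pattern: str, repo: str = ".") -> str:
    out = subprocess.run(["grep", "-rn", pattern, repo],
                         capture_output=True, text=True)
    return out.stdout[:4000]  # keep the context fed back to the model bounded

TOOLS = {"read_source": read_source, "grep_repo": grep_repo}

def handle_tool_call(call: dict) -> str:
    # In this sketch a tool call looks like {"name": "...", "arguments": {...}}.
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return "error: unknown tool"
    try:
        return str(fn(**call.get("arguments", {})))
    except Exception as exc:
        # A malformed tool call should never crash the analysis loop.
        return f"error: {exc}"
```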

Patching

In principle, patching with LLMs is straightforward: given the original, buggy code, prompt the LLM to generate new code which keeps the same “functionality” but fixes the security bug. Of course, there are ambiguities here that humans and LLMs both run up against: what exactly is the correct “functionality”? For the semifinals, we limited this to “check that patches compile and do not fail the public test suite”, but with hidden functionality tests, this obviously is an incomplete picture. Further, outputting valid new code or patches is non-trivial. If the original code was part of a large file, should we output the entire file again, potentially exhausting our LLM context, or should we only output a small chunk or diff? Unfortunately, hallucinations in LLMs are very tricky to work around, and even attempting to repeat code with only a few small changes often results in accidental modifications due to hallucinations. While our patching is based on LLMs, it was substantially more difficult than we anticipated to get it to work reliably.
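One way around re-emitting whole files is to ask the model for small search-and-replace edits and only apply an edit when its “search” text matches the original exactly, so a hallucinated or subtly altered snippet gets rejected instead of silently corrupting the file. The sketch below illustrates that idea under those assumptions; it is not our exact patching pipeline.

```python
from pathlib import Path

def apply_edit(path: str, search: str, replace: str) -> bool:
    """Apply one LLM-proposed search/replace edit, defensively.

    The edit only goes through if `search` appears exactly once in the
    file; anything ambiguous or hallucinated is refused rather than
    risking an accidental modification elsewhere in the code.
    """
    source = Path(path).read_text()
    if source.count(search) != 1:
        return False
    Path(path).write_text(source.replace(search, replace, 1))
    return True
```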

CWE Breakdown By Team (https://www.datawrapper.de/_/OKOr5/)

To test the security side of a patch, we can simply re-run any inputs we found which trigger the bug against the patched code. Due to the rules of AIxCC, it was not possible to submit a patch without a crashing input, so these inputs were always available to guide and validate our patching process.
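Putting the pieces together, validating a candidate patch reduces to three checks: the project still builds, the public tests still pass, and the known crashing inputs no longer trip the sanitizer. A sketch, reusing the hypothetical ChallengeProject wrapper from earlier (treating a non-zero return code as the crash signal is an assumption):

```python
def patch_is_acceptable(cp: "ChallengeProject", crash_inputs: list[str],
                        harness: str) -> bool:
    # 1. The patched project must still compile.
    if not cp.build():
        return False
    # 2. Functionality: the public test suite must still pass. Hidden
    #    tests can still fail, so this is only a lower bound.
    if not cp.run_tests():
        return False
    # 3. Security: every known crashing input must no longer crash.
    for blob in crash_inputs:
        if cp.run_pov(blob, harness).returncode != 0:
            return False
    return True
```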

Robustness

Perhaps the most important area for a fully automated system participating in a competition is robustness. If your system falls over, it doesn’t matter how good your bug finding and patching abilities are. Factor in that the system must take in arbitrary code repositories to analyze and parse results from LLMs, which are notorious for making up nonsense, and it’s easy to see that robustness is trickier here than in many other projects. Our system was written in Python for ease of development, but using a dynamically typed language also carries a lot of risks. While testing before the event can help to reduce bugs and surprises, you never know what curveballs may come up during the event.

Unfortunately there is no easy solution here: trying to write software on a compressed schedule that can stand up to anything is a difficult problem. While failing fast and loud is useful during development, without the possibility of human intervention, error reporting or log messages don’t actually help improve reliability. Rather than asserting on errant results, we instead need to return the most useful result we can so the system can continue to make forward progress. Beyond that, we followed normal good practices: we focused on building a simple and modular architecture for our CRS and used our software auditing skills to carefully check our code for assumptions or issues which could not only skew our results but also prevent our whole system from making progress.
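A small example of that philosophy in practice: when the model is asked for structured output, parse it defensively and fall back to something usable instead of raising. A minimal sketch along those lines:

```python
import json
import re

def parse_llm_json(text: str, default: dict | None = None) -> dict:
    """Best-effort extraction of a JSON object from an LLM response.

    Models often wrap answers in prose or markdown fences, so we try the
    raw text first, then any brace-delimited span, and finally fall back
    to a default instead of crashing the pipeline.
    """
    for candidate in (text, *re.findall(r"\{.*\}", text, re.DOTALL)):
        try:
            obj = json.loads(candidate)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            continue
    return default if default is not None else {}
```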

Artifacts

What does the output actually look like from our CRS? While we don’t have access to everything our system did at the competition (logs, etc.), we were given the vulnerabilities and patches that were submitted. Here is an overview:

Counts of POVs and Patches per problem (https://www.datawrapper.de/_/IUSBI/)

Curious what those look like? Check them out at https://github.com/theori-io/aixcc-public. Note that this repository only includes artifacts found by our CRS in publicly released challenges, which for now unfortunately only includes nginx.

While hopefully that provides a taste of some of the work Theori put into our CRS for AIxCC, it is of course only a high level overview. After the final event, the source code for all the qualifying CRSs will be released so teams will be better able to discuss the precise approaches. We will definitely be checking out the CRSs made by all of the other awesome teams when we can, and we expect to see a lot of diversity and great ideas among the approaches teams take!

