CVE-2026-46181: RCU race may crash ML training nodes

CISO Take

CVE-2026-46181 is a race condition in the Linux kernel's RDMA mlx4 driver: mlx4_srq_event() misuses RCU, so the kernel can crash if a hardware event arrives before a Shared Receive Queue (SRQ) object finishes initializing or while it is being freed. For AI/ML teams, RDMA/InfiniBand is the backbone of high-performance distributed training clusters running PyTorch DDP, Horovod, and MPI-based frameworks, so a node panic during a multi-day training run means lost compute and potential checkpoint corruption. The listed CVSS 3.1 base score is 7.8 (High), but there is no public exploit and no CISA KEV listing, and the realistic impact is on availability, which places operational risk at low-to-moderate. Patch by applying the three upstream Linux stable commits listed under "What should I do?", or obtain the backported kernel update from your distribution vendor (RHEL, Ubuntu, SUSE).

Sources: NVD, ATLAS

What is the risk?

Low-to-moderate risk for AI/ML infrastructure. The CVSS 3.1 base score is 7.8 (High), but the EPSS exploitation probability is about 0.1% and no public exploit exists. Triggering the bug requires a specific timing race between hardware RDMA event delivery and the SRQ object lifecycle, which makes deterministic exploitation difficult. For organizations running large-scale distributed training on InfiniBand-connected GPU clusters, however, an unexpected kernel panic terminates long-running jobs and threatens availability. The upstream fix (replacing the RCU lookup with the spinlock and reordering the refcount setup) is already present in stable kernel commits, making patching straightforward.

How does the attack unfold?

Infrastructure Access

Adversary gains local access to or network adjacency on an RDMA-connected ML training cluster node.

AML.T0041

Race Condition Trigger

RDMA hardware event is delivered to mlx4_srq_event() during SRQ object initialization or teardown, hitting the unprotected RCU window.

AML.T0049

Kernel Crash / Node DoS

Kernel panic crashes the training node, aborting active distributed training jobs and potentially corrupting in-progress model checkpoints.

AML.T0029

How severe is it?

CVSS 3.1

7.8 / 10

EPSS

0.1%

chance of exploitation in 30 days

Higher than 2% of all CVEs

Source: EPSS v3 — FIRST.org

Exploitation Status

No known exploitation

Sophistication

Advanced

What is the attack surface?

AV Local

AC Low

PR Low

UI None

S Unchanged

C High

I High

A High
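As a worked check, the metrics above (AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H) can be run through the CVSS 3.1 base score equations. The sketch below hard-codes the metric weights published in the FIRST.org CVSS 3.1 specification and handles only the Scope Unchanged case; the function and dictionary names are illustrative, not part of any library.

```python
import math

# Metric weights from the CVSS 3.1 specification (Scope Unchanged only).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR = {"N": 0.85, "L": 0.62, "H": 0.27}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value):
    """Round up to one decimal place, as defined in the CVSS 3.1 spec."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_unchanged(av, ac, pr, ui, c, i, a):
    """Base score for a vector whose Scope is Unchanged."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score_unchanged("L", "L", "L", "N", "H", "H", "H")
print(score)  # 7.8
```

The impact sub-score comes to roughly 5.87 and the exploitability sub-score to roughly 1.83; their sum (about 7.71) rounds up to 7.8, which matches the score listed above.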

What should I do?

5 steps

Apply the upstream stable kernel patches: commits 1e2a448, 8b7833f, and c934130 from git.kernel.org/stable.
Check your Linux distribution for backported kernel updates incorporating this fix.
If immediate patching is not feasible, monitor dmesg and kernel logs for RDMA-related oops or panic messages on training nodes.
Ensure checkpoint-and-resume is enabled for all long-running training jobs to minimize data loss if a crash occurs.
Evaluate whether graceful degradation to alternative networking is possible for critical workloads during the patching window.
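
The log-monitoring step above could be approximated with a small scanner like the one below. It is a minimal sketch, not a vetted detection rule: the crash markers and the 30-line context window are assumptions, and the function name is hypothetical. It flags a kernel log line when it looks like an oops or panic and "mlx4" appears on that line or shortly after it.

```python
import re

# Assumed crash markers; tune these to what your kernels actually print.
CRASH_MARKERS = re.compile(r"Oops|Kernel panic|BUG:|general protection fault")
DRIVER_HINT = re.compile(r"mlx4")


def find_mlx4_crashes(lines, window=30):
    """Return (index, line) pairs for crash markers with 'mlx4' nearby.

    `lines` is any iterable of kernel log lines; `window` is how many
    following lines are searched for a mention of the mlx4 driver.
    """
    lines = list(lines)
    hits = []
    for index, line in enumerate(lines):
        if not CRASH_MARKERS.search(line):
            continue
        context = lines[index:index + window + 1]
        if any(DRIVER_HINT.search(entry) for entry in context):
            hits.append((index, line.rstrip()))
    return hits
```

On a training node the input could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout.splitlines()`, or from whatever log shipper the cluster already uses. Note that after a full panic the in-memory log is usually gone, so persistent or remote logging matters more than the scanner itself.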

How is it classified?

DoS Framework AML.T0010.000 - Hardware

Which compliance frameworks are affected?

This CVE is relevant to:

ISO 42001

A.7.4 - AI System Infrastructure Security

NIST AI RMF

MANAGE 2.2 - AI Risk Treatment and Prioritization

Frequently Asked Questions

What is CVE-2026-46181?

CVE-2026-46181 is a race condition in the Linux kernel's RDMA mlx4 driver: mlx4_srq_event() misuses RCU, so the kernel can crash if a hardware event arrives before a Shared Receive Queue (SRQ) object finishes initializing or while it is being freed. On RDMA/InfiniBand training clusters running PyTorch DDP, Horovod, or MPI-based frameworks, the practical consequence is a node panic that aborts the job and may corrupt an in-progress checkpoint. There is no public exploit and no CISA KEV listing. The fix is available as three upstream Linux stable commits and through backported kernel updates from distribution vendors (RHEL, Ubuntu, SUSE).

Is CVE-2026-46181 actively exploited?

No confirmed active exploitation of CVE-2026-46181 has been reported, but organizations should still patch proactively.

How to fix CVE-2026-46181?

1. Apply the upstream stable kernel patches: commits 1e2a448, 8b7833f, and c934130 from git.kernel.org/stable.
2. Check your Linux distribution for backported kernel updates incorporating this fix.
3. If immediate patching is not feasible, monitor dmesg and kernel logs for RDMA-related oops or panic messages on training nodes.
4. Ensure checkpoint-and-resume is enabled for all long-running training jobs to minimize data loss if a crash occurs.
5. Evaluate whether graceful degradation to alternative networking is possible for critical workloads during the patching window.
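
For the checkpoint-and-resume step, one common way to limit corruption when a node dies mid-write is to write each checkpoint to a temporary file and rename it into place only once it is complete. The sketch below assumes the training state is JSON-serializable and uses hypothetical function names; a real job would substitute its framework's own save and load calls.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)     # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last complete checkpoint, or `default` if none exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

A training loop would call `load_checkpoint` once at startup to pick its starting step and `save_checkpoint` every N steps. If the node panics during a save, the previous checkpoint file is still intact and the leftover temporary file can be discarded.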

What systems are affected by CVE-2026-46181?

This vulnerability affects the following AI/ML architecture patterns: Distributed ML training clusters (InfiniBand/RDMA), HPC GPU clusters with Mellanox mlx4 adapters, On-premises multi-node training infrastructure.
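
A quick way to see whether a node uses the mlx4 driver at all is to look for mlx4 modules in /proc/modules. The sketch below is only a first-pass inventory aid with hypothetical function names: a loaded module shows that the driver is in use, not that the running kernel is unpatched, so patch status still has to be confirmed against the vendor's kernel advisory.

```python
def loaded_mlx4_modules(proc_modules_text):
    """Return names of loaded kernel modules whose name starts with 'mlx4'.

    `proc_modules_text` is the content of /proc/modules, where the first
    whitespace-separated field of each line is the module name.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("mlx4"):
            names.append(fields[0])
    return names


def read_proc_modules(path="/proc/modules"):
    """Read the module list on a Linux host."""
    with open(path) as handle:
        return handle.read()
```

On a Linux node, `loaded_mlx4_modules(read_proc_modules())` returns an empty list when no mlx4 module is loaded, which is a reasonable signal that the node can be deprioritized for this CVE.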

What is the CVSS score for CVE-2026-46181?

CVE-2026-46181 has a CVSS v3.1 base score of 7.8 (HIGH). The EPSS exploitation probability is 0.11%.

What is the AI security impact?

Affected AI Architectures

Distributed ML training clusters (InfiniBand/RDMA)

HPC GPU clusters with Mellanox mlx4 adapters

On-premises multi-node training infrastructure

MITRE ATLAS Techniques

AML.T0010.000 Hardware

Compliance Controls Affected

ISO 42001: A.7.4

NIST AI RMF: MANAGE 2.2

What are the technical details?

Original Advisory

In the Linux kernel, the following vulnerability has been resolved:

RDMA/mlx4: Fix mis-use of RCU in mlx4_srq_event()

Sashiko points out the radix_tree itself is RCU safe, but nothing ever frees the mlx4_srq struct with RCU, and it isn't even accessed within the RCU critical section. It also will crash if an event is delivered before the srq object is finished initializing.

Use the spinlock since it isn't easy to make RCU work, use refcount_inc_not_zero() to protect against partially initialized objects, and order the refcount_set() to be after the srq is fully initialized.
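
The pattern the advisory describes can be illustrated with a user-space analogy. The sketch below is not the kernel patch: it is a Python model with hypothetical class and method names, in which a lock stands in for the spinlock, a dict stands in for the radix tree, and a counter stands in for the kernel refcount. It shows the three ideas from the advisory: look up under the lock, take a reference only if the count is non-zero, and set the count last during initialization.

```python
import threading


class Srq:
    """Hypothetical stand-in for the driver's SRQ object."""

    def __init__(self, srqn):
        self.srqn = srqn
        self.handler = None   # filled in during initialization
        self.refcount = 0     # 0 = not yet published, or being torn down
        self._lock = threading.Lock()

    def inc_not_zero(self):
        """Take a reference only if the object is live."""
        with self._lock:
            if self.refcount == 0:
                return False
            self.refcount += 1
            return True

    def put(self):
        """Drop a reference; returns True when the last one is gone."""
        with self._lock:
            self.refcount -= 1
            return self.refcount == 0


class SrqTable:
    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the spinlock
        self._srqs = {}                # plays the role of the radix tree

    def insert(self, srqn):
        srq = Srq(srqn)
        with self._lock:
            self._srqs[srqn] = srq     # findable, but refcount is still 0
        return srq

    def publish(self, srq, handler):
        srq.handler = handler          # finish initialization first
        with srq._lock:
            srq.refcount = 1           # set the refcount last

    def free(self, srq):
        with self._lock:
            self._srqs.pop(srq.srqn, None)
        srq.put()                      # drop the initial reference

    def event(self, srqn, event_type):
        with self._lock:
            srq = self._srqs.get(srqn)
            if srq is None or not srq.inc_not_zero():
                return False           # unknown, half-initialized, or dying
        srq.handler(event_type)        # safe: we hold a reference
        srq.put()
        return True
```

In this model an event that arrives between `insert` and `publish` is dropped instead of calling a handler that has not been set, which corresponds to the "partially initialized objects" case in the advisory.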

Exploitation Scenario

A local attacker or compromised cluster management node with access to a distributed ML training cluster could craft or time RDMA events to arrive during the narrow window when an SRQ object is being initialized or torn down, triggering a kernel panic on a target training node. More realistically, this manifests as an accidental crash under high RDMA event load during large-scale distributed training runs. A targeted insider could repeatedly crash specific nodes to cause expensive job restarts and exhaust GPU time budgets.