CVE-2026-46181: Linux kernel RDMA/mlx4: RCU race may crash ML training nodes
AWAITING NVD
CVE-2026-46181 is a race condition in the Linux kernel's RDMA mlx4 driver: incorrect RCU usage in mlx4_srq_event() can crash the kernel if a hardware event arrives before a Shared Receive Queue (SRQ) object finishes initializing, or while it is being freed. For AI/ML teams, RDMA/InfiniBand is the backbone of high-performance distributed training clusters running PyTorch DDP, Horovod, and MPI-based frameworks; a node panic during a multi-day training run means lost compute and potential checkpoint corruption. No CVSS score has been assigned, there is no public exploit, and the CVE is not listed in CISA KEV, so operational risk is low-to-moderate and concerns availability rather than confidentiality or integrity. To patch, apply the three upstream Linux stable commits referenced, or obtain the backported kernel update from your distribution vendor (RHEL, Ubuntu, SUSE).
What is the risk?
Low-to-moderate risk for AI/ML infrastructure. No CVSS score has been assigned and no public exploit exists. The vulnerability requires a specific timing race between hardware RDMA event delivery and SRQ object lifecycle, making deterministic exploitation difficult. However, for organizations running large-scale distributed training on InfiniBand-connected GPU clusters, an unexpected kernel panic terminates long-running jobs and threatens availability. The upstream fix (spinlock replacement, refcount ordering) is already present in stable kernel commits, making patching straightforward.
What should I do?
1. Apply the upstream stable kernel patches: commits 1e2a448, 8b7833f, and c934130 from git.kernel.org/stable.
2. Check your Linux distribution for backported kernel updates incorporating this fix.
3. If immediate patching is not feasible, monitor dmesg and kernel logs for RDMA-related oops or panic messages on training nodes.
4. Ensure checkpoint-and-resume is enabled for all long-running training jobs to minimize data loss if a crash occurs.
5. Evaluate whether graceful degradation to alternative networking is possible for critical workloads during the patching window.
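The log-monitoring step can be approximated with a small filter over kernel log lines. This is a minimal sketch, not a vetted detection rule: the advisory gives no exact crash signature, so both regular expressions are assumptions about what an mlx4-related oops might look like, and the function name `find_suspect_lines` is hypothetical.

```python
import re

# Assumed patterns: generic kernel crash markers, and any mention of mlx4.
CRASH_RE = re.compile(
    r"(Oops|BUG:|general protection fault|kernel panic|Call Trace)",
    re.IGNORECASE,
)
MLX4_RE = re.compile(r"mlx4", re.IGNORECASE)


def find_suspect_lines(lines, window=20):
    """Return crash-marker lines that mention mlx4 on the same line
    or within the following `window - 1` lines of log output."""
    lines = list(lines)
    hits = []
    for i, line in enumerate(lines):
        if not CRASH_RE.search(line):
            continue
        # Look at the marker line plus the lines that follow it,
        # where a backtrace would normally name the faulting function.
        context = lines[i:i + window]
        if any(MLX4_RE.search(entry) for entry in context):
            hits.append(line.rstrip("\n"))
    return hits
```

The function accepts any iterable of lines, so it could be fed the captured output of `dmesg` or `journalctl -k`. A match is only a prompt for manual inspection; it does not confirm that this CVE was the cause.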
Frequently Asked Questions
What is CVE-2026-46181?
CVE-2026-46181 is a race condition in the Linux kernel's RDMA mlx4 driver: incorrect RCU usage in mlx4_srq_event() can crash the kernel if a hardware event arrives before a Shared Receive Queue (SRQ) object finishes initializing, or while it is being freed. The impact is on availability (a node panic during a distributed training run) rather than on confidentiality or integrity.
Is CVE-2026-46181 actively exploited?
No confirmed active exploitation of CVE-2026-46181 has been reported, but organizations should still patch proactively.
How to fix CVE-2026-46181?
1. Apply the upstream stable kernel patches: commits 1e2a448, 8b7833f, and c934130 from git.kernel.org/stable.
2. Check your Linux distribution for backported kernel updates incorporating this fix.
3. If immediate patching is not feasible, monitor dmesg and kernel logs for RDMA-related oops or panic messages on training nodes.
4. Ensure checkpoint-and-resume is enabled for all long-running training jobs to minimize data loss if a crash occurs.
5. Evaluate whether graceful degradation to alternative networking is possible for critical workloads during the patching window.
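Step 4 matters because a node panic in the middle of a checkpoint write can leave a truncated file behind. One common way to reduce that risk is to write to a temporary file, flush it to disk, and atomically rename it over the previous checkpoint. The sketch below illustrates this on POSIX systems; the function name `save_checkpoint_atomically` and its bytes-based interface are hypothetical, and it is not a substitute for the checkpoint API of any training framework.

```python
import os
import tempfile


def save_checkpoint_atomically(payload: bytes, path: str) -> None:
    """Write `payload` to `path` so that readers see either the old
    checkpoint or the complete new one, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # force file contents to stable storage
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the node crashes before the rename, the previous checkpoint is still intact and the job can resume from it.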
What systems are affected by CVE-2026-46181?
This vulnerability affects the following AI/ML architecture patterns: Distributed ML training clusters (InfiniBand/RDMA), HPC GPU clusters with Mellanox mlx4 adapters, On-premises multi-node training infrastructure.
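To check whether a given node falls into these categories, one rough heuristic is to see whether an mlx4 kernel module is loaded. The sketch below parses text in the format of /proc/modules, where the module name is the first whitespace-separated field. The helper names are hypothetical, and a loaded module only shows that the driver is in use; it says nothing about whether the running kernel already contains the fix.

```python
# Kernel module names of the mlx4 driver family.
MLX4_MODULES = {"mlx4_core", "mlx4_ib", "mlx4_en"}


def loaded_mlx4_modules(proc_modules_text: str) -> list:
    """Return the sorted names of mlx4 modules found in /proc/modules text."""
    found = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] in MLX4_MODULES:
            found.add(fields[0])
    return sorted(found)


def read_proc_modules(path: str = "/proc/modules") -> str:
    """Read the loaded-module list from a Linux host."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On a Linux node, `loaded_mlx4_modules(read_proc_modules())` returns an empty list when no mlx4 module is loaded, in which case the node is unlikely to exercise the affected code path.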
What is the CVSS score for CVE-2026-46181?
No CVSS score has been assigned yet.
AI Security Impact
MITRE ATLAS Techniques
AML.T0010.000 Hardware
Technical Details
Original Advisory
In the Linux kernel, the following vulnerability has been resolved:
RDMA/mlx4: Fix mis-use of RCU in mlx4_srq_event()
Sashiko points out the radix_tree itself is RCU safe, but nothing ever frees the mlx4_srq struct with RCU, and it isn't even accessed within the RCU critical section. It also will crash if an event is delivered before the srq object is finished initializing.
Use the spinlock since it isn't easy to make RCU work, use refcount_inc_not_zero() to protect against partially initialized objects, and order the refcount_set() to be after the srq is fully initialized.
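The three parts of the fix described above (lookup under a lock, an increment that fails when the count is zero, and setting the refcount only after initialization) can be illustrated with a userspace model. This is a Python analogy and not the kernel patch: all class and method names are hypothetical, a `threading.Lock` stands in for the spinlock, and a dict stands in for the radix tree.

```python
import threading


class Srq:
    def __init__(self, srqn):
        self.srqn = srqn
        self.refcount = 0    # zero means "not yet live" or "being freed"
        self.handler = None  # filled in late during initialization
        self.events = []


class SrqTable:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for the spinlock
        self._table = {}               # stands in for the radix tree

    def insert(self, srq):
        """Make the object findable while it is still partially initialized."""
        with self._lock:
            self._table[srq.srqn] = srq

    def publish(self, srq, handler):
        """Finish initialization, then set the refcount last."""
        srq.handler = handler
        with self._lock:
            srq.refcount = 1  # analogue of refcount_set() after full init

    def _get_not_zero(self, srqn):
        """Analogue of lookup plus refcount_inc_not_zero() under the lock."""
        with self._lock:
            srq = self._table.get(srqn)
            if srq is None or srq.refcount == 0:
                return None
            srq.refcount += 1
            return srq

    def _put(self, srq):
        with self._lock:
            srq.refcount -= 1

    def deliver_event(self, srqn, event):
        """Return True if the event was handled, False if it was skipped."""
        srq = self._get_not_zero(srqn)
        if srq is None:
            return False  # unknown, partially initialized, or being freed
        try:
            srq.handler(srq, event)
        finally:
            self._put(srq)
        return True

    def remove(self, srq):
        with self._lock:
            self._table.pop(srq.srqn, None)
            srq.refcount -= 1  # drop the initial reference
```

In this model, an event that arrives between `insert()` and `publish()` is skipped because the refcount is still zero, which corresponds to the "partially initialized objects" case the advisory mentions. Without that check, `deliver_event()` would call a handler that is still `None`.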
Exploitation Scenario
A local attacker or compromised cluster management node with access to a distributed ML training cluster could craft or time RDMA events to arrive during the narrow window when an SRQ object is being initialized or torn down, triggering a kernel panic on a target training node. More realistically, this manifests as an accidental crash under high RDMA event load during large-scale distributed training runs. A targeted insider could repeatedly crash specific nodes to cause expensive job restarts and exhaust GPU time budgets.
Related Vulnerabilities
CVE-2026-33660 10.0 TensorFlow: type confusion NPD in tensor conversion
Same attack type: DoS
CVE-2023-25668 9.8 TensorFlow: unauthenticated RCE via heap buffer overflow
Same attack type: DoS
CVE-2022-23587 9.8 TensorFlow: integer overflow in Grappler enables RCE
Same attack type: DoS
CVE-2022-35939 9.8 TensorFlow: ScatterNd OOB write enables RCE/crash
Same attack type: DoS
CVE-2022-41900 9.8 TensorFlow: heap OOB RCE in FractionalMaxPool op
Same attack type: DoS