CVE-2026-46181: Linux kernel RDMA/mlx4: RCU race may crash ML training nodes

AWAITING NVD
Published May 28, 2026
CISO Take

CVE-2026-46181 is a race condition in the Linux kernel's RDMA mlx4 driver: misuse of RCU in mlx4_srq_event() can crash the kernel if a hardware event arrives before a Shared Receive Queue (SRQ) object finishes initializing, or while it is being freed. For AI/ML teams, RDMA/InfiniBand is the backbone of high-performance distributed training clusters running PyTorch DDP, Horovod, and MPI-based frameworks, so a node panic during a multi-day training run means lost compute and potential checkpoint corruption. No CVSS score has been assigned, there is no public exploit, and the CVE is not listed in CISA KEV, which places operational risk at low-to-moderate; the impact is on availability rather than confidentiality or integrity. To patch, apply the three upstream Linux stable commits referenced, or obtain the backported kernel update from your distribution vendor (RHEL, Ubuntu, SUSE).

Sources: NVD, ATLAS

What is the risk?

Low-to-moderate risk for AI/ML infrastructure. No CVSS score has been assigned and no public exploit exists. Triggering the bug requires a specific timing race between hardware RDMA event delivery and the SRQ object lifecycle, which makes deterministic exploitation difficult. However, for organizations running large-scale distributed training on InfiniBand-connected GPU clusters, an unexpected kernel panic terminates long-running jobs and threatens availability. The upstream fix is already present in stable kernel commits, so patching is straightforward: it replaces the RCU-based lookup with a spinlock, uses refcount_inc_not_zero() to reject partially initialized objects, and sets the refcount only after the SRQ is fully initialized.

Attack Kill Chain

Infrastructure Access
Adversary gains local access to, or network adjacency with, an RDMA-connected ML training cluster node.
AML.T0041
Race Condition Trigger
RDMA hardware event is delivered to mlx4_srq_event() during SRQ object initialization or teardown, hitting the unprotected RCU window.
AML.T0049
Kernel Crash / Node DoS
Kernel panic crashes the training node, aborting active distributed training jobs and potentially corrupting in-progress model checkpoints.
AML.T0029

Severity & Risk

CVSS 3.1
N/A
EPSS
N/A
Exploitation Status
No known exploitation
Sophistication
Advanced

What should I do?

  1. Apply the upstream stable kernel patches: commits 1e2a448, 8b7833f, and c934130 from git.kernel.org/stable.

  2. Check your Linux distribution for backported kernel updates incorporating this fix.

  3. If immediate patching is not feasible, monitor dmesg and kernel logs for RDMA-related oops or panic messages on training nodes.

  4. Ensure checkpoint-and-resume is enabled for all long-running training jobs to minimize data loss if a crash occurs.

  5. Evaluate whether graceful degradation to alternative networking is possible for critical workloads during the patching window.
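The log monitoring in step 3 could be approximated with a small scanner like the one below. This is a minimal sketch, not a detection rule from the advisory: the advisory does not publish a crash signature, so the marker patterns, the `find_suspect_lines` name, and the 20-line context window are all assumptions chosen for illustration.

```python
import re

# Assumed heuristics: generic kernel crash markers plus mlx4/SRQ identifiers.
# The advisory does not give an exact crash signature, so tune these locally.
CRASH_MARKERS = re.compile(
    r"(kernel panic|Oops|BUG:|general protection fault|NULL pointer dereference)",
    re.IGNORECASE,
)
RDMA_HINTS = re.compile(r"(mlx4|srq)", re.IGNORECASE)


def find_suspect_lines(log_text, context=20):
    """Return crash-marker lines with an mlx4/SRQ mention within `context` lines."""
    lines = log_text.splitlines()
    suspects = []
    for i, line in enumerate(lines):
        if not CRASH_MARKERS.search(line):
            continue
        # Look at nearby lines, since the driver name usually shows up
        # in the call trace rather than on the crash line itself.
        window = lines[max(0, i - context): i + context + 1]
        if any(RDMA_HINTS.search(w) for w in window):
            suspects.append(line)
    return suspects
```

The function takes log text as a string, so it can be fed the captured output of `dmesg` or `journalctl -k`, or the contents of a saved log file. A hit only means a crash marker appeared near an mlx4 or SRQ mention; it does not confirm that this CVE was the cause.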

Classification

Compliance Impact

This CVE is relevant to:

ISO 42001
A.7.4 - AI System Infrastructure Security
NIST AI RMF
MANAGE 2.2 - AI Risk Treatment and Prioritization

Frequently Asked Questions

What is CVE-2026-46181?

CVE-2026-46181 is a race condition in the Linux kernel's RDMA mlx4 driver: misuse of RCU in mlx4_srq_event() can crash the kernel if a hardware event arrives before a Shared Receive Queue (SRQ) object finishes initializing, or while it is being freed. The impact is on availability: a kernel panic on an RDMA-connected training node aborts distributed training jobs and may corrupt in-progress checkpoints.

Is CVE-2026-46181 actively exploited?

No confirmed active exploitation of CVE-2026-46181 has been reported, but organizations should still patch proactively.

How to fix CVE-2026-46181?

  1. Apply the upstream stable kernel patches: commits 1e2a448, 8b7833f, and c934130 from git.kernel.org/stable.

  2. Check your Linux distribution for backported kernel updates incorporating this fix.

  3. If immediate patching is not feasible, monitor dmesg and kernel logs for RDMA-related oops or panic messages on training nodes.

  4. Ensure checkpoint-and-resume is enabled for all long-running training jobs to minimize data loss if a crash occurs.

  5. Evaluate whether graceful degradation to alternative networking is possible for critical workloads during the patching window.
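For the checkpoint step, one common way to limit corruption from a mid-write crash is to write each checkpoint to a temporary file and rename it into place. The sketch below assumes checkpoints are available as bytes and live on a local POSIX filesystem; `write_checkpoint_atomically` is a hypothetical helper, not part of any training framework, and rename semantics on network or parallel filesystems may differ.

```python
import os
import tempfile


def write_checkpoint_atomically(data: bytes, final_path: str) -> None:
    """Write `data` to `final_path` without ever exposing a partial file there."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temp file must be in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to stable storage
        # Readers see either the old checkpoint or the new one, never a mix.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the node panics before the rename, the previous checkpoint at `final_path` is left untouched and the job can resume from it.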

What systems are affected by CVE-2026-46181?

This vulnerability affects the following AI/ML architecture patterns: Distributed ML training clusters (InfiniBand/RDMA), HPC GPU clusters with Mellanox mlx4 adapters, On-premises multi-node training infrastructure.
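To triage which nodes match these patterns, one option is to check whether any mlx4 driver module is loaded. The sketch below parses text in the format of /proc/modules. The module names mlx4_core, mlx4_ib, and mlx4_en are the usual names for this driver family, but treating a loaded module as "in scope" is an assumption: a driver built into the kernel will not appear in /proc/modules, and the check says nothing about whether the running kernel is already patched.

```python
from pathlib import Path

# Usual module names for the mlx4 driver family.
MLX4_MODULES = {"mlx4_core", "mlx4_ib", "mlx4_en"}


def loaded_mlx4_modules(proc_modules_text):
    """Return the mlx4 modules named in /proc/modules-formatted text."""
    loaded = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field is the module name.
        if fields and fields[0] in MLX4_MODULES:
            loaded.add(fields[0])
    return loaded


def mlx4_modules_on_this_node(proc_modules_path="/proc/modules"):
    """Read the local module list; empty set if the file is unavailable."""
    path = Path(proc_modules_path)
    if not path.exists():  # non-Linux host or restricted container
        return set()
    return loaded_mlx4_modules(path.read_text())
```

A non-empty result only flags the node for review against the vendor's fixed kernel versions.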

What is the CVSS score for CVE-2026-46181?

No CVSS score has been assigned yet.

AI Security Impact

Affected AI Architectures

Distributed ML training clusters (InfiniBand/RDMA)
HPC GPU clusters with Mellanox mlx4 adapters
On-premises multi-node training infrastructure

MITRE ATLAS Techniques

AML.T0010.000 Hardware

Compliance Controls Affected

ISO 42001: A.7.4
NIST AI RMF: MANAGE 2.2

Technical Details

Original Advisory

In the Linux kernel, the following vulnerability has been resolved:

RDMA/mlx4: Fix mis-use of RCU in mlx4_srq_event()

Sashiko points out the radix_tree itself is RCU safe, but nothing ever frees the mlx4_srq struct with RCU, and it isn't even accessed within the RCU critical section. It also will crash if an event is delivered before the srq object is finished initializing.

Use the spinlock since it isn't easy to make RCU work, use refcount_inc_not_zero() to protect against partially initialized objects, and order the refcount_set() to be after the srq is fully initialized.
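The pattern the advisory describes can be modeled in userspace to show why it closes the race. The sketch below is a Python toy, not the kernel patch: `Srq`, `SrqTable`, and their methods are hypothetical names, a `threading.Lock` stands in for the spinlock, and `get_unless_zero()` imitates the behavior of refcount_inc_not_zero().

```python
import threading


class Srq:
    """Toy stand-in for an SRQ object; not the kernel's struct mlx4_srq."""

    def __init__(self):
        self._ref_lock = threading.Lock()
        self.refcount = 0   # 0 = still initializing, or already being freed
        self.events = []

    def publish(self):
        """Set the first reference only once initialization is complete."""
        with self._ref_lock:
            self.refcount = 1

    def get_unless_zero(self):
        """Model of refcount_inc_not_zero(): succeed only if a reference exists."""
        with self._ref_lock:
            if self.refcount == 0:
                return False
            self.refcount += 1
            return True

    def put(self):
        """Drop one reference."""
        with self._ref_lock:
            self.refcount -= 1


class SrqTable:
    """Lookup table guarded by a lock, standing in for the spinlock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.by_number = {}

    def insert(self, srqn):
        """Make the object findable while its refcount is still zero."""
        srq = Srq()
        with self.lock:
            self.by_number[srqn] = srq
        return srq

    def deliver_event(self, srqn, event):
        """Hand an event to the SRQ, or drop it if the object is not usable."""
        with self.lock:
            srq = self.by_number.get(srqn)
            if srq is None or not srq.get_unless_zero():
                return False
        srq.events.append(event)   # safe: we hold our own reference
        srq.put()
        return True
```

In this model, an event that arrives between `insert()` and `publish()` is dropped rather than delivered to a half-built object, which corresponds to the ordering of refcount_set() after full initialization in the advisory.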

Exploitation Scenario

A local attacker or compromised cluster management node with access to a distributed ML training cluster could craft or time RDMA events to arrive during the narrow window when an SRQ object is being initialized or torn down, triggering a kernel panic on a target training node. More realistically, this manifests as an accidental crash under high RDMA event load during large-scale distributed training runs. A targeted insider could repeatedly crash specific nodes to cause expensive job restarts and exhaust GPU time budgets.

Timeline

Published
May 28, 2026
Last Modified
May 28, 2026
First Seen
May 28, 2026
