Linux mlx5: RDMA hang DoS on AI training clusters — AWAITING NVD (CVE-2026-45973)

CISO Take

A race condition in the Linux kernel's mlx5 RDMA driver causes an indefinite hang during firmware reset in LAG (link aggregation) mode; recovering the affected node effectively requires a hard reboot. AI and ML teams operating distributed training clusters on Mellanox/NVIDIA ConnectX adapters using RDMA fabrics (InfiniBand or RoCE) are at risk: the hang occurs during firmware reset procedures that are routine in maintenance and failover workflows, which makes this an operational reliability threat rather than a remote exploitation scenario. There is no public exploit, no CVSS score, and no CISA KEV listing, so the likelihood of deliberate exploitation is low, but unintentional triggering during maintenance on unpatched systems is a real concern for high-availability AI compute environments. Apply the available stable kernel patches (four commits published at kernel.org) and audit any mlx5 LAG configurations on AI compute nodes before the next scheduled maintenance window.

Sources: NVD, ATLAS

What is the risk?

Low-to-medium risk for most enterprise environments, elevated for organizations running distributed AI training at scale. Exploitation requires local or privileged access to initiate a firmware reset on an mlx5 adapter in LAG mode — it is not remotely exploitable under normal conditions. The impact is availability-only: a system hang requiring a hard reboot. For AI training clusters dependent on continuous RDMA fabric availability, the blast radius extends beyond the single node: a hang during gradient synchronization can stall all participating nodes in a training job, requiring coordinated recovery and full job restart.

Attack Kill Chain

Trigger Condition

A firmware reset is initiated on an mlx5 adapter in LAG bonding mode — either via routine maintenance, failover event, or privileged adversarial action on the compute node.

Race Window Opens

The slave adapter enters error state during firmware reset while the master adapter remains active, creating a window where UMR operations succeed at post but completions are never delivered.

Kernel Deadlock

The kernel blocks indefinitely in __mutex_lock awaiting UMR completion inside __mlx5_ib_dereg_mr, preventing device unload and rendering the node completely unresponsive.

AI Service Disruption

Distributed training jobs or RDMA-dependent inference workloads that include the hung node stall or fail on all participating nodes; recovery requires a hard reboot of the hung node and a full job restart.

AML.T0029

Severity & Risk

CVSS 3.1

N/A

EPSS

N/A

Exploitation Status

No known exploitation

Sophistication

Advanced

What should I do?

5 steps

1. Apply the stable kernel fix: update to a stable kernel release that contains commit 613f5d4139b6, 6d838873da9c, c8fb5c965ac7, or ebc2164a4cd4 (kernel.org stable branches), or cherry-pick the commit that applies to your tree.
2. Inventory all systems with mlx5 RDMA adapters in LAG/bond mode using 'ip link show type bond' and 'lsmod | grep mlx5'.
3. On unpatched systems, avoid initiating firmware resets during active training jobs; gate maintenance windows on job idle state.
4. Monitor dmesg for hung-task traces that include 'mlx5_ib', 'ib_core', and '__mutex_lock', as in the call trace in the original advisory.
5. Configure watchdog timers or cluster health monitors to detect and alert on node unresponsiveness within RDMA-connected training pools.
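
The monitoring step can be sketched in Python as a scan of kernel log text for hung-task reports whose stack matches the advisory's call trace. This is a minimal sketch under assumptions: the three symbols in SIGNATURE are a heuristic chosen from the trace in the original advisory, the function name find_umr_hangs is hypothetical, and the "blocked for more than N seconds" header only appears when the kernel's hung-task detector is enabled.

```python
import re

# Symbols taken from the call trace in the advisory; matching all three is a
# heuristic chosen for this sketch, not an official detection signature.
SIGNATURE = ("__mutex_lock", "__mlx5_ib_dereg_mr", "ib_unregister_device")

# Header line printed by the kernel hung-task detector.
HUNG_TASK = re.compile(r"task (\S+) blocked for more than (\d+) seconds")


def find_umr_hangs(log_text):
    """Return (task, seconds) for each hung-task report matching SIGNATURE."""
    hits = []
    current = None  # (task, seconds) of the report currently being read
    seen = set()    # SIGNATURE symbols seen so far in that report
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            # A new report starts: decide whether the previous one matched.
            if current is not None and seen.issuperset(SIGNATURE):
                hits.append(current)
            current = (match.group(1), int(match.group(2)))
            seen = set()
        elif current is not None:
            seen.update(sym for sym in SIGNATURE if sym in line)
    # Flush the last report in the log.
    if current is not None and seen.issuperset(SIGNATURE):
        hits.append(current)
    return hits
```

Feed it the text output of dmesg or journalctl -k, collected by whatever agent already ships node logs. Because a fully hung node may stop shipping logs, this complements the watchdog step and does not replace it.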

Classification

DoS, Inference, AML.T0010.000 - Hardware, AML.T0029 - Denial of AI Service

Compliance Impact

This CVE is relevant to:

ISO 42001

A.6.2 - AI System Resources and Infrastructure

NIST AI RMF

MANAGE 2.2 - Mechanisms for AI Risk Management — System Reliability

Frequently Asked Questions

What is CVE-2026-45973?

CVE-2026-45973 is a race condition in the Linux kernel's mlx5 RDMA driver. During a firmware reset in LAG (link aggregation) mode, the driver can hang indefinitely while waiting for UMR completion during device unload, and the affected node effectively requires a hard reboot to recover. The impact is availability-only. There is no public exploit, no CVSS score has been assigned, and the CVE is not in CISA KEV.

Is CVE-2026-45973 actively exploited?

No confirmed active exploitation of CVE-2026-45973 has been reported, but organizations should still patch proactively.

How to fix CVE-2026-45973?

1. Apply the stable kernel fix: update to a stable kernel release that contains commit 613f5d4139b6, 6d838873da9c, c8fb5c965ac7, or ebc2164a4cd4 (kernel.org stable branches), or cherry-pick the commit that applies to your tree.
2. Inventory all systems with mlx5 RDMA adapters in LAG/bond mode using 'ip link show type bond' and 'lsmod | grep mlx5'.
3. On unpatched systems, avoid initiating firmware resets during active training jobs; gate maintenance windows on job idle state.
4. Monitor dmesg for hung-task traces that include 'mlx5_ib', 'ib_core', and '__mutex_lock', as in the call trace in the original advisory.
5. Configure watchdog timers or cluster health monitors to detect and alert on node unresponsiveness within RDMA-connected training pools.
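
For fleets where running the inventory commands by hand is impractical, the inventory step can be approximated in Python. This is a minimal sketch, assuming the Linux bonding driver exposes each bond under /proc/net/bonding with "Slave Interface:" lines and that /sys/class/net/<iface>/device/driver links to the bound driver; the function names are hypothetical. A bond with mlx5_core slaves is an indicator that mlx5 LAG may be in use, not proof of it.

```python
import os
import re

# Each slave appears in /proc/net/bonding/<bond> as "Slave Interface: <name>".
SLAVE_LINE = re.compile(r"^Slave Interface:\s*(\S+)", re.MULTILINE)


def bond_slaves(bonding_text):
    """Extract slave interface names from the text of /proc/net/bonding/<bond>."""
    return SLAVE_LINE.findall(bonding_text)


def driver_of(iface, sysfs="/sys/class/net"):
    """Return the kernel driver bound to iface, or None if it cannot be read."""
    link = os.path.join(sysfs, iface, "device", "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        return None


def mlx5_bonds(proc_dir="/proc/net/bonding", sysfs="/sys/class/net"):
    """Map each bond name to the slaves that are driven by mlx5_core."""
    result = {}
    if not os.path.isdir(proc_dir):
        return result  # bonding driver not loaded, or not a Linux host
    for bond in sorted(os.listdir(proc_dir)):
        with open(os.path.join(proc_dir, bond)) as fh:
            slaves = bond_slaves(fh.read())
        mlx = [s for s in slaves if driver_of(s, sysfs) == "mlx5_core"]
        if mlx:
            result[bond] = mlx
    return result
```

Calling mlx5_bonds() on a compute node returns an empty dict when no bond has mlx5 slaves; any non-empty result marks the node for patching before its next firmware maintenance.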

What systems are affected by CVE-2026-45973?

This vulnerability affects the following AI/ML architecture patterns: Distributed training clusters, RDMA-based GPU compute nodes, High-performance compute AI infrastructure, Multi-node large model training pipelines, RDMA-accelerated inference serving nodes.

What is the CVSS score for CVE-2026-45973?

No CVSS score has been assigned yet.

AI Security Impact

Affected AI Architectures

Distributed training clusters

RDMA-based GPU compute nodes

High-performance compute AI infrastructure

Multi-node large model training pipelines

RDMA-accelerated inference serving nodes

MITRE ATLAS Techniques

AML.T0010.000 Hardware

AML.T0029 Denial of AI Service

Compliance Controls Affected

ISO 42001: A.6.2

NIST AI RMF: MANAGE 2.2

Technical Details

Original Advisory

In the Linux kernel, the following vulnerability has been resolved:

RDMA/mlx5: Fix UMR hang in LAG error state unload

During firmware reset in LAG mode, a race condition causes the driver to hang indefinitely while waiting for UMR completion during device unload. See [1].

In LAG mode the bond device is only registered on the master, so it never sees sys_error events from the slave. During firmware reset this causes UMR waits to hang forever on unload as the slave is dead but the master hasn't entered error state yet, so UMR posts succeed but completions never arrive.

Fix this by adding a sys_error notifier that gets registered before MLX5_IB_STAGE_IB_REG and stays alive until after ib_unregister_device(). This ensures error events reach the bond device throughout teardown.

[1] Call Trace:
  __schedule+0x2bd/0x760
  schedule+0x37/0xa0
  schedule_preempt_disabled+0xa/0x10
  __mutex_lock.isra.6+0x2b5/0x4a0
  __mlx5_ib_dereg_mr+0x606/0x870 [mlx5_ib]
  ? __xa_erase+0x4a/0xa0
  ? _cond_resched+0x15/0x30
  ? wait_for_completion+0x31/0x100
  ib_dereg_mr_user+0x48/0xc0 [ib_core]
  ? rdmacg_uncharge_hierarchy+0xa0/0x100
  destroy_hw_idr_uobject+0x20/0x50 [ib_uverbs]
  uverbs_destroy_uobject+0x37/0x150 [ib_uverbs]
  __uverbs_cleanup_ufile+0xda/0x140 [ib_uverbs]
  uverbs_destroy_ufile_hw+0x3a/0xf0 [ib_uverbs]
  ib_uverbs_remove_one+0xc3/0x140 [ib_uverbs]
  remove_client_context+0x8b/0xd0 [ib_core]
  disable_device+0x8c/0x130 [ib_core]
  __ib_unregister_device+0x10d/0x180 [ib_core]
  ib_unregister_device+0x21/0x30 [ib_core]
  __mlx5_ib_remove+0x1e4/0x1f0 [mlx5_ib]
  auxiliary_bus_remove+0x1e/0x30
  device_release_driver_internal+0x103/0x1f0
  bus_remove_device+0xf7/0x170
  device_del+0x181/0x410
  mlx5_rescan_drivers_locked.part.10+0xa9/0x1d0 [mlx5_core]
  mlx5_disable_lag+0x253/0x260 [mlx5_core]
  mlx5_lag_disable_change+0x89/0xc0 [mlx5_core]
  mlx5_eswitch_disable+0x67/0xa0 [mlx5_core]
  mlx5_unload+0x15/0xd0 [mlx5_core]
  mlx5_unload_one+0x71/0xc0 [mlx5_core]
  mlx5_sync_reset_reload_work+0x83/0x100 [mlx5_core]
  process_one_work+0x1a7/0x360
  worker_thread+0x30/0x390
  ? create_worker+0x1a0/0x1a0
  kthread+0x116/0x130
  ? kthread_flush_work_fn+0x10/0x10
  ret_from_fork+0x22/0x40

Exploitation Scenario

A system administrator performs routine firmware maintenance on an mlx5 adapter operating in LAG bonding mode on an AI training node. The firmware reset causes the slave adapter to enter an error state, but the master adapter has not yet entered error state, and the bond device, which is registered only on the master, never sees the slave's sys_error events. Outstanding UMR (User-Mode Memory Registration) operations, used for RDMA memory registration during active training, post to hardware successfully, but their completions are never delivered. The kernel blocks indefinitely in __mutex_lock awaiting UMR completion inside __mlx5_ib_dereg_mr, preventing device teardown. The training node becomes unresponsive, disrupting the distributed NCCL all-reduce operations across all peer nodes in the training job, and recovery requires a hard reboot and a full checkpoint-based restart of the training run.
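
The race in this scenario, and the effect of the sys_error notifier described in the original advisory, can be illustrated with a toy Python model. This is not kernel code. The class, its methods, and the short timeout that stands in for an indefinite wait are all hypothetical, and the assumption that a device known to be in error state refuses the UMR post is inferred from the advisory text, not taken from the driver source.

```python
import threading


class ToyBondDevice:
    """Toy model of the bond device that posts UMRs and waits for completions."""

    def __init__(self, has_sys_error_notifier):
        # True models the patched driver, False the unpatched one.
        self.has_sys_error_notifier = has_sys_error_notifier
        self.in_error = threading.Event()    # bond device knows it is in error
        self.completion = threading.Event()  # set only if hardware completes

    def slave_sys_error(self):
        # Unpatched: the bond device is registered only on the master,
        # so the slave's sys_error event never reaches it.
        if self.has_sys_error_notifier:
            self.in_error.set()

    def dereg_mr(self, toy_timeout=0.1):
        """Return 'post_failed', 'completed', or 'hung'."""
        if self.in_error.is_set():
            return "post_failed"  # error state is visible, so no wait begins
        # The post succeeds, then the caller waits for a completion.
        if self.completion.wait(toy_timeout):
            return "completed"
        return "hung"  # the real wait has no timeout; this stands in for it


def firmware_reset(has_sys_error_notifier):
    """Slave dies during reset, then teardown deregisters a memory region."""
    dev = ToyBondDevice(has_sys_error_notifier)
    dev.slave_sys_error()  # the dead slave never delivers the completion
    return dev.dereg_mr()
```

In this model firmware_reset(False) ends in the "hung" branch, matching the scenario above, while firmware_reset(True) returns without waiting because the error state is visible before the UMR is posted.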