CVE-2026-45973: Linux mlx5: RDMA hang DoS on AI training clusters
AWAITING NVD
A race condition in the Linux kernel's mlx5 RDMA driver causes an indefinite system hang during firmware reset in LAG (link aggregation) mode, effectively requiring a hard reboot to recover the affected node. AI and ML teams operating distributed training clusters on Mellanox/NVIDIA ConnectX adapters using RDMA fabrics (InfiniBand or RoCE) are at risk: the hang occurs during firmware reset procedures that are routine in maintenance and failover workflows, making this an operational reliability threat rather than a remote exploitation scenario. There is no public exploit, no CVSS score, and this is not in CISA KEV, so the likelihood of deliberate exploitation is low, but unintentional triggering during maintenance on unpatched systems is a real concern for high-availability AI compute environments. Apply the available stable kernel patches (four commits published at kernel.org) and audit any mlx5 LAG configurations on AI compute nodes before the next scheduled maintenance window.
What is the risk?
Low-to-medium risk for most enterprise environments, elevated for organizations running distributed AI training at scale. Exploitation requires local or privileged access to initiate a firmware reset on an mlx5 adapter in LAG mode — it is not remotely exploitable under normal conditions. The impact is availability-only: a system hang requiring a hard reboot. For AI training clusters dependent on continuous RDMA fabric availability, the blast radius extends beyond the single node: a hang during gradient synchronization can stall all participating nodes in a training job, requiring coordinated recovery and full job restart.
Attack Kill Chain
Severity & Risk
What should I do?
5 steps
- Apply stable kernel patches: cherry-pick commits 613f5d4139b6, 6d838873da9c, c8fb5c965ac7, or ebc2164a4cd4 from kernel.org stable branches.
- Inventory all systems with mlx5 RDMA adapters in LAG/bond mode using 'ip link show type bond' and 'lsmod | grep mlx5'.
- On unpatched systems, avoid initiating firmware resets during active training jobs; gate maintenance windows on job idle state.
- Monitor for kernel hangs via dmesg patterns involving 'mlx5_ib', 'ib_core', and '__mutex_lock' with indefinite wait traces.
- Configure watchdog timers or cluster health monitors to detect and alert on node unresponsiveness within RDMA-connected training pools.
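The monitoring step above can be sketched as a small log filter. This is a minimal illustration, not a vetted detector: the marker strings come from the call trace in the original advisory, while the function name find_hang_signatures and the 40-line window are assumptions made for the example.

```python
from typing import Iterable, List

# Marker strings taken from the advisory's call trace.
HANG_MARKERS = ("__mutex_lock", "mlx5_ib", "ib_core")


def find_hang_signatures(lines: Iterable[str], window: int = 40) -> List[int]:
    """Return indices of 'Call Trace:' lines whose next `window` lines
    mention every marker in HANG_MARKERS.

    The window size is an assumption; tune it to the trace depth you see.
    """
    buffered = list(lines)
    hits = []
    for i, line in enumerate(buffered):
        if "Call Trace:" not in line:
            continue
        block = "\n".join(buffered[i : i + window])
        if all(marker in block for marker in HANG_MARKERS):
            hits.append(i)
    return hits


if __name__ == "__main__":
    import sys

    # Example: dmesg | python3 this_script.py
    for index in find_hang_signatures(sys.stdin.read().splitlines()):
        print(f"possible mlx5 UMR hang trace starting at log line {index + 1}")
```

A match only says that a trace with these symbols was logged; it does not confirm that the node is hung, so it is best paired with the watchdog or health-check signal from the last step.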
Classification
Compliance Impact
This CVE is relevant to:
Frequently Asked Questions
What is CVE-2026-45973?
A race condition in the Linux kernel's mlx5 RDMA driver causes an indefinite system hang during firmware reset in LAG (link aggregation) mode, effectively requiring a hard reboot to recover the affected node. AI and ML teams operating distributed training clusters on Mellanox/NVIDIA ConnectX adapters using RDMA fabrics (InfiniBand or RoCE) are at risk: the hang occurs during firmware reset procedures that are routine in maintenance and failover workflows, making this an operational reliability threat rather than a remote exploitation scenario. There is no public exploit, no CVSS score, and this is not in CISA KEV, placing deliberate exploitation likelihood as low — but unintentional triggering during maintenance on unpatched systems is a real concern for high-availability AI compute environments. Apply the available stable kernel patches (four commits published at kernel.org) and audit any mlx5 LAG configurations on AI compute nodes before the next scheduled maintenance window.
Is CVE-2026-45973 actively exploited?
No confirmed active exploitation of CVE-2026-45973 has been reported, but organizations should still patch proactively.
How to fix CVE-2026-45973?
1. Apply stable kernel patches: cherry-pick commits 613f5d4139b6, 6d838873da9c, c8fb5c965ac7, or ebc2164a4cd4 from kernel.org stable branches.
2. Inventory all systems with mlx5 RDMA adapters in LAG/bond mode using 'ip link show type bond' and 'lsmod | grep mlx5'.
3. On unpatched systems, avoid initiating firmware resets during active training jobs; gate maintenance windows on job idle state.
4. Monitor for kernel hangs via dmesg patterns involving 'mlx5_ib', 'ib_core', and '__mutex_lock' with indefinite wait traces.
5. Configure watchdog timers or cluster health monitors to detect and alert on node unresponsiveness within RDMA-connected training pools.
What systems are affected by CVE-2026-45973?
This vulnerability affects the following AI/ML architecture patterns: Distributed training clusters, RDMA-based GPU compute nodes, High-performance compute AI infrastructure, Multi-node large model training pipelines, RDMA-accelerated inference serving nodes.
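To check whether a given node matches these patterns, one option is to look for bond devices whose members are bound to an mlx5 driver. The sketch below reads the standard Linux sysfs entries /sys/class/net/<bond>/bonding/slaves and /sys/class/net/<iface>/device/driver; the helper names driver_of and mlx5_bonds are hypothetical, and bond membership is only a proxy, since it does not prove that hardware LAG is active on the adapter.

```python
from pathlib import Path
from typing import Dict, List

SYSFS_NET = Path("/sys/class/net")


def driver_of(iface: str, root: Path = SYSFS_NET) -> str:
    """Return the kernel driver name bound to `iface`, or '' if unresolved."""
    link = root / iface / "device" / "driver"
    try:
        # The driver entry is a symlink whose target directory is named
        # after the driver (for example mlx5_core).
        return link.resolve(strict=True).name
    except OSError:
        return ""


def mlx5_bonds(root: Path = SYSFS_NET) -> Dict[str, List[str]]:
    """Map each bond device to its member interfaces that use an mlx5 driver."""
    result: Dict[str, List[str]] = {}
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        slaves_file = dev / "bonding" / "slaves"
        if not slaves_file.is_file():
            continue  # not a bond device
        members = slaves_file.read_text().split()
        mlx5_members = [m for m in members if driver_of(m, root).startswith("mlx5")]
        if mlx5_members:
            result[dev.name] = mlx5_members
    return result


if __name__ == "__main__":
    for bond, members in mlx5_bonds().items():
        print(f"{bond}: mlx5 members {', '.join(members)}")
```

An empty result on a node means no bond currently has mlx5 members; it says nothing about whether the running kernel already carries the fix.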
What is the CVSS score for CVE-2026-45973?
No CVSS score has been assigned yet.
AI Security Impact
Affected AI Architectures
MITRE ATLAS Techniques
AML.T0010.000 Hardware
AML.T0029 Denial of AI Service
Compliance Controls Affected
Technical Details
Original Advisory
In the Linux kernel, the following vulnerability has been resolved:

RDMA/mlx5: Fix UMR hang in LAG error state unload

During firmware reset in LAG mode, a race condition causes the driver to hang indefinitely while waiting for UMR completion during device unload. See [1].

In LAG mode the bond device is only registered on the master, so it never sees sys_error events from the slave. During firmware reset this causes UMR waits to hang forever on unload as the slave is dead but the master hasn't entered error state yet, so UMR posts succeed but completions never arrive.

Fix this by adding a sys_error notifier that gets registered before MLX5_IB_STAGE_IB_REG and stays alive until after ib_unregister_device(). This ensures error events reach the bond device throughout teardown.

[1] Call Trace:
 __schedule+0x2bd/0x760
 schedule+0x37/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.6+0x2b5/0x4a0
 __mlx5_ib_dereg_mr+0x606/0x870 [mlx5_ib]
 ? __xa_erase+0x4a/0xa0
 ? _cond_resched+0x15/0x30
 ? wait_for_completion+0x31/0x100
 ib_dereg_mr_user+0x48/0xc0 [ib_core]
 ? rdmacg_uncharge_hierarchy+0xa0/0x100
 destroy_hw_idr_uobject+0x20/0x50 [ib_uverbs]
 uverbs_destroy_uobject+0x37/0x150 [ib_uverbs]
 __uverbs_cleanup_ufile+0xda/0x140 [ib_uverbs]
 uverbs_destroy_ufile_hw+0x3a/0xf0 [ib_uverbs]
 ib_uverbs_remove_one+0xc3/0x140 [ib_uverbs]
 remove_client_context+0x8b/0xd0 [ib_core]
 disable_device+0x8c/0x130 [ib_core]
 __ib_unregister_device+0x10d/0x180 [ib_core]
 ib_unregister_device+0x21/0x30 [ib_core]
 __mlx5_ib_remove+0x1e4/0x1f0 [mlx5_ib]
 auxiliary_bus_remove+0x1e/0x30
 device_release_driver_internal+0x103/0x1f0
 bus_remove_device+0xf7/0x170
 device_del+0x181/0x410
 mlx5_rescan_drivers_locked.part.10+0xa9/0x1d0 [mlx5_core]
 mlx5_disable_lag+0x253/0x260 [mlx5_core]
 mlx5_lag_disable_change+0x89/0xc0 [mlx5_core]
 mlx5_eswitch_disable+0x67/0xa0 [mlx5_core]
 mlx5_unload+0x15/0xd0 [mlx5_core]
 mlx5_unload_one+0x71/0xc0 [mlx5_core]
 mlx5_sync_reset_reload_work+0x83/0x100 [mlx5_core]
 process_one_work+0x1a7/0x360
 worker_thread+0x30/0x390
 ? create_worker+0x1a0/0x1a0
 kthread+0x116/0x130
 ? kthread_flush_work_fn+0x10/0x10
 ret_from_fork+0x22/0x40
Exploitation Scenario
A system administrator performs routine firmware maintenance on an mlx5 adapter operating in LAG bonding mode on an AI training node. The firmware reset leaves the slave adapter dead, but the master has not yet entered the error state, and because the bond device is registered only on the master it never sees the slave's sys_error events. Outstanding UMR (User-Mode Memory Registration) operations, which the driver uses to register and deregister RDMA memory regions during active training, post to hardware successfully but their completions are never delivered. Device unload then blocks indefinitely inside __mlx5_ib_dereg_mr (the call trace shows the task parked in __mutex_lock), preventing device teardown. The training node becomes unresponsive, disrupting the distributed NCCL all-reduce operations across all peer nodes in the training job, requiring a hard reboot and a full checkpoint-based restart of the training run.
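The race in this scenario can be illustrated with a user-space toy model. This is not kernel code and does not reproduce the driver's logic: ToyBondDevice, on_sys_error, dereg_mr and simulate are hypothetical names, the timeout stands in for "waits forever", and having the notifier wake the waiter directly is an assumption made for illustration, since the advisory only states that the notifier makes error events reach the bond device.

```python
import threading


class ToyBondDevice:
    """User-space stand-in for the bond device; purely illustrative."""

    def __init__(self) -> None:
        self.completion = threading.Event()  # set when a UMR completion arrives
        self.in_error = threading.Event()    # set when a sys_error event is seen

    def on_sys_error(self) -> None:
        # Models the notifier added by the fix: record the error first,
        # then wake anyone waiting on a completion (assumed behaviour).
        self.in_error.set()
        self.completion.set()

    def dereg_mr(self, timeout: float) -> str:
        """Wait for a UMR completion; return 'completed', 'aborted' or 'hung'."""
        if self.in_error.is_set():
            return "aborted"
        if not self.completion.wait(timeout):
            return "hung"  # the timeout stands in for an indefinite wait
        return "aborted" if self.in_error.is_set() else "completed"


def simulate(notifier_registered: bool, timeout: float = 0.5) -> str:
    """The slave is dead, so no completion ever arrives on its own."""
    dev = ToyBondDevice()
    if notifier_registered:
        # With a notifier, the slave's error event reaches the bond device
        # shortly after the wait begins.
        threading.Timer(0.05, dev.on_sys_error).start()
    return dev.dereg_mr(timeout)


if __name__ == "__main__":
    print("without notifier:", simulate(False))
    print("with notifier:   ", simulate(True))
```

In this model the run without a notifier ends in "hung" because nothing ever signals the waiter, while the run with a notifier ends in "aborted" once the error event is delivered.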
References
Timeline
Related Vulnerabilities
CVE-2026-33660 10.0 TensorFlow: type confusion NPD in tensor conversion
Same attack type: DoS
CVE-2022-35939 9.8 TensorFlow: ScatterNd OOB write enables RCE/crash
Same attack type: DoS
CVE-2022-23587 9.8 TensorFlow: integer overflow in Grappler enables RCE
Same attack type: DoS
CVE-2022-41900 9.8 TensorFlow: heap OOB RCE in FractionalMaxPool op
Same attack type: DoS
CVE-2023-25668 9.8 TensorFlow: unauthenticated RCE via heap buffer overflow
Same attack type: DoS