CVE-2026-45907: Linux mlx5e: deadlock DoS in Mellanox NIC recovery paths
Awaiting NVD
A lock-ordering inversion in the Linux kernel's Mellanox mlx5e Ethernet driver (net/mlx5e) creates ABBA deadlocks across four NIC health-recovery code paths: TX error CQE, RX timeout, PTP queue unhealthy, and TX timeout. When any of these recovery handlers fires, it acquires the netdev instance lock before calling into devlink, which then tries to acquire the devlink lock. That inverts the order established by the driver's init flow (devlink → rtnl → netdev) and can hang the host's networking stack indefinitely. The flaw is not remotely exploitable in isolation, has no CVSS score, and has no reported active exploitation, but it is a real availability concern for GPU and HPC clusters where Mellanox/NVIDIA ConnectX NICs underpin high-bandwidth distributed ML training over RDMA/RoCE fabrics: a kernel hang on a training node kills in-flight jobs and requires a reboot. Teams running Linux-based AI training infrastructure on Mellanox hardware should pick up the fixed stable kernel commits (4329514c, 63f9d5fb, 83ac0304) as part of routine kernel maintenance.
What is the risk?
LOW-MEDIUM for AI/ML infrastructure. The vulnerability is not remotely exploitable by network packet alone and requires either local access or the ability to reliably trigger NIC hardware error conditions (TX CQE faults, RX timeouts). No CVSS assigned, no public exploit, not in CISA KEV, EPSS not scored. Exploitation complexity is high. Impact is limited to availability — networking stack freeze requiring reboot — with no confidentiality or integrity implications. Risk is elevated in shared GPU cluster environments where a single node hang disrupts multi-tenant distributed training jobs and can cascade into checkpoint loss.
What should I do?
5 steps
1. Apply stable kernel patches from the three upstream commits: 4329514c61ab, 63f9d5fb4d80, 83ac0304a2d7 (git.kernel.org/stable).
2. Update to a patched kernel version once vendor distros (RHEL, Ubuntu, SUSE) incorporate the fix; monitor vendor security channels.
3. Audit kernel logs (dmesg, journalctl -k) on Mellanox NIC hosts for lockdep warnings or soft lockup traces indicating ABBA deadlock conditions.
4. In critical training clusters, consider temporarily disabling mlx5e devlink health reporter auto-recovery via devlink CLI as a stop-gap, accepting manual intervention in exchange for deadlock elimination.
5. Run lockdep-enabled debug kernels in staging environments before promoting kernel updates to production GPU nodes.
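The log audit in the steps above can be partly automated. The following is a minimal sketch in Python, assuming the kernel log has already been captured as text (for example from journalctl -k). The pattern list is an assumption chosen for illustration: it covers the usual lockdep report headers and the lockup and hung-task watchdog messages, and it is not specific to mlx5e. The function name find_deadlock_hints is hypothetical.

```python
import re

# Message fragments that typically appear in lockdep reports and in
# lockup / hung-task watchdog output. This list is an illustrative
# assumption; extend it to match your kernel configuration.
PATTERNS = [
    r"possible circular locking dependency detected",
    r"possible recursive locking detected",
    r"watchdog: BUG: soft lockup",
    r"INFO: task .* blocked for more than \d+ seconds",
]
_COMPILED = [re.compile(p) for p in PATTERNS]


def find_deadlock_hints(log_text: str) -> list[str]:
    """Return every log line that matches at least one pattern above."""
    return [
        line
        for line in log_text.splitlines()
        if any(rx.search(line) for rx in _COMPILED)
    ]
```

A match is only a hint that the host deserves a closer look. Confirming this specific bug still means reading the accompanying stack trace for the mlx5e recovery functions named in the advisory.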
Frequently Asked Questions
What is CVE-2026-45907?
CVE-2026-45907 is a lock-ordering inversion in the Linux kernel's Mellanox mlx5e Ethernet driver (net/mlx5e) that creates ABBA deadlocks across four NIC health-recovery code paths: TX error CQE, RX timeout, PTP queue unhealthy, and TX timeout. When any of these recovery handlers fires, it acquires the netdev instance lock before calling into devlink, which then tries to acquire the devlink lock. That inverts the order established by the driver's init flow (devlink → rtnl → netdev) and can hang the host's networking stack indefinitely. The flaw is not remotely exploitable in isolation, has no CVSS score, and has no reported active exploitation, but it is a real availability concern for GPU and HPC clusters where Mellanox/NVIDIA ConnectX NICs underpin distributed ML training over RDMA/RoCE fabrics: a kernel hang on a training node kills in-flight jobs and requires a reboot. The fix is in stable kernel commits 4329514c, 63f9d5fb, and 83ac0304.
Is CVE-2026-45907 actively exploited?
No confirmed active exploitation of CVE-2026-45907 has been reported, but organizations should still patch proactively.
How to fix CVE-2026-45907?
1. Apply stable kernel patches from the three upstream commits: 4329514c61ab, 63f9d5fb4d80, 83ac0304a2d7 (git.kernel.org/stable).
2. Update to a patched kernel version once vendor distros (RHEL, Ubuntu, SUSE) incorporate the fix; monitor vendor security channels.
3. Audit kernel logs (dmesg, journalctl -k) on Mellanox NIC hosts for lockdep warnings or soft lockup traces indicating ABBA deadlock conditions.
4. In critical training clusters, consider temporarily disabling mlx5e devlink health reporter auto-recovery via devlink CLI as a stop-gap, accepting manual intervention in exchange for deadlock elimination.
5. Run lockdep-enabled debug kernels in staging environments before promoting kernel updates to production GPU nodes.
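For the stop-gap in step 4, the devlink health CLI accepts an auto_recover setting per reporter. The sketch below only builds the argument list and does not run anything. The helper name auto_recover_off_cmd and the device handle in the usage note are hypothetical, and the assumption that the relevant mlx5e reporters are named tx and rx should be confirmed on the host with devlink health show before use.

```python
def auto_recover_off_cmd(dev: str, reporter: str) -> list[str]:
    """Build a `devlink health set` command that turns auto-recovery off.

    `dev` is a devlink handle as printed by `devlink health show`;
    `reporter` is the reporter name (assumed here to be "tx" or "rx").
    """
    return [
        "devlink", "health", "set", dev,
        "reporter", reporter,
        "auto_recover", "false",
    ]
```

For a hypothetical handle pci/0000:08:00.0, calling auto_recover_off_cmd("pci/0000:08:00.0", "tx") yields the argument list for devlink health set pci/0000:08:00.0 reporter tx auto_recover false, which can be passed to subprocess.run once the handle has been verified. Whether this setting avoids every affected path has not been checked here, so treat it as a temporary measure until the patched kernel is deployed.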
What systems are affected by CVE-2026-45907?
This vulnerability affects the following AI/ML architecture patterns: Distributed ML training clusters, GPU cluster RDMA/RoCE networking, High-performance AI inference serving infrastructure.
What is the CVSS score for CVE-2026-45907?
No CVSS score has been assigned yet.
AI Security Impact
Affected AI Architectures
MITRE ATLAS Techniques
AML.T0029 Denial of AI Service
AML.T0112 Machine Compromise
Compliance Controls Affected
Technical Details
Original Advisory
In the Linux kernel, the following vulnerability has been resolved:
net/mlx5e: Fix deadlocks between devlink and netdev instance locks
In the mentioned "Fixes" commit, various work tasks triggering devlink health reporter recovery were switched to use netdev_trylock to protect against concurrent tear down of the channels being recovered. But this had the side effect of introducing potential deadlocks because of incorrect lock ordering.
The correct lock order is described by the init flow:
probe_one -> mlx5_init_one (acquires devlink lock)
-> mlx5_init_one_devl_locked -> mlx5_register_device
-> mlx5_rescan_drivers_locked -...-> mlx5e_probe -> _mlx5e_probe
-> register_netdev (acquires rtnl lock)
-> register_netdevice (acquires netdev lock)
=> devlink lock -> rtnl lock -> netdev lock.
But in the current recovery flow, the order is wrong:
mlx5e_tx_err_cqe_work (acquires netdev lock)
-> mlx5e_reporter_tx_err_cqe -> mlx5e_health_report
-> devlink_health_report (acquires devlink lock => boom!)
-> devlink_health_reporter_recover -> mlx5e_tx_reporter_recover
-> mlx5e_tx_reporter_recover_from_ctx
-> mlx5e_tx_reporter_err_cqe_recover
The same pattern exists in:
mlx5e_reporter_rx_timeout
mlx5e_reporter_tx_ptpsq_unhealthy
mlx5e_reporter_tx_timeout
Fix these by moving the netdev_trylock calls from the work handlers lower in the call stack, in the respective recovery functions, where they are actually necessary.
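The ordering argument in the advisory can be checked mechanically. The sketch below is a toy model in Python, not kernel code: each flow is reduced to the sequence of locks it acquires, and a pair of locks is flagged when one flow takes them in one order and another flow takes them in the reverse order. The lock names are labels taken from the advisory text; the function names and the FIXED_RECOVERY_FLOW sequence (devlink lock first, netdev lock taken lower in the recovery function) are assumptions made for illustration.

```python
from collections import defaultdict

# Lock acquisition sequences, abstracted from the advisory's call chains.
INIT_FLOW = ["devlink", "rtnl", "netdev"]
BROKEN_RECOVERY_FLOW = ["netdev", "devlink"]  # work handler takes netdev first
FIXED_RECOVERY_FLOW = ["devlink", "netdev"]   # assumed shape after the fix


def lock_order_edges(chains):
    """Map each lock to the set of locks acquired while it is held."""
    edges = defaultdict(set)
    for chain in chains:
        for i, held in enumerate(chain):
            for later in chain[i + 1:]:
                edges[held].add(later)
    return edges


def find_inversions(chains):
    """Return sorted (a, b) pairs that are taken both as a->b and b->a."""
    edges = lock_order_edges(chains)
    return sorted(
        (a, b)
        for a in edges
        for b in edges[a]
        if a < b and a in edges.get(b, ())
    )
```

With these definitions, find_inversions([INIT_FLOW, BROKEN_RECOVERY_FLOW]) reports the devlink/netdev pair, while find_inversions([INIT_FLOW, FIXED_RECOVERY_FLOW]) reports nothing. This is the same pairwise check that lockdep performs at runtime in far more general form.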
Exploitation Scenario
An attacker with local access to a GPU training cluster node, or one who can inject malformed packets or manipulate hardware to induce persistent TX CQE errors on a Mellanox NIC, causes mlx5e_tx_err_cqe_work to execute. This work handler acquires the netdev instance lock and calls mlx5e_reporter_tx_err_cqe, which goes through mlx5e_health_report into devlink_health_report and attempts to acquire the devlink lock. Concurrently, a legitimate devlink probe or teardown operation holds the devlink lock and waits for the netdev lock. Neither thread can proceed. This is a classic ABBA deadlock: the host networking stack freezes with no kernel crash and no automatic recovery. In a shared GPU cluster, this terminates every distributed training job on the node, causes checkpoint loss for long-running experiments, and requires operator intervention to reboot.
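The interleaving described in this scenario can be reproduced in user space. The following Python sketch is an analogy using threading.Lock, not a model of the kernel's locking primitives: one thread plays the recovery work handler (netdev lock, then devlink lock), the other plays the concurrent devlink operation (devlink lock, then netdev lock). A barrier forces both threads to hold their first lock before either asks for the second, and a timeout replaces the indefinite hang so the demonstration terminates. All names here are hypothetical.

```python
import threading


def abba_demo(timeout: float = 0.2) -> dict:
    """Run two threads that take two locks in opposite order.

    Returns a dict mapping thread name to whether it obtained its
    second lock before the timeout expired.
    """
    netdev_lock = threading.Lock()
    devlink_lock = threading.Lock()
    sync = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            sync.wait()  # both threads now hold their first lock
            ok = second.acquire(timeout=timeout)
            got_second[name] = ok
            sync.wait()  # keep holding `first` until both attempts finish
            if ok:
                second.release()

    threads = [
        threading.Thread(target=worker,
                         args=("recovery", netdev_lock, devlink_lock)),
        threading.Thread(target=worker,
                         args=("devlink_op", devlink_lock, netdev_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second
```

Both entries in the returned dict are False: each thread times out waiting for the lock the other one holds. In the kernel there is no such timeout on the blocking side, which is why the affected host stays hung until it is rebooted.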
Related Vulnerabilities
CVE-2026-33660 10.0 TensorFlow: type confusion NPD in tensor conversion
Same attack type: DoS
CVE-2023-25668 9.8 TensorFlow: unauthenticated RCE via heap buffer overflow
Same attack type: DoS
CVE-2022-23587 9.8 TensorFlow: integer overflow in Grappler enables RCE
Same attack type: DoS
CVE-2022-35939 9.8 TensorFlow: ScatterNd OOB write enables RCE/crash
Same attack type: DoS
CVE-2022-41900 9.8 TensorFlow: heap OOB RCE in FractionalMaxPool op
Same attack type: DoS