CVE-2026-46178: Linux mlx4 RDMA: resource leak on SRQ creation error
AWAITING NVD
CVE-2026-46178 is a resource leak in the Linux kernel's mlx4 RDMA driver: the error unwind in mlx4_ib_create_srq() did not call mlx4_srq_free() to undo mlx4_srq_alloc(), so a failed SRQ creation can leak kernel resources on affected systems. For most AI/ML teams this is low priority. Distributed training clusters that use RDMA/InfiniBand on older Mellanox mlx4-generation adapters (ConnectX-3 and earlier; ConnectX-4 and later use the mlx5 driver), as found in some PyTorch DDP, Horovod, or MPI-based large-scale training environments, should be aware that repeated SRQ creation failures could degrade node stability over time. There is no assigned CVSS score, no public exploit, and no CISA KEV entry, so this belongs in routine patch cadence rather than emergency response. Apply the kernel stable-branch patches referenced in the CVE (five commits are listed under References) and include affected training nodes in your standard kernel update schedule.
What is the risk?
Low risk. This is a resource leak on an error path in the RDMA kernel driver; it does not enable direct code execution, privilege escalation, or data exfiltration. Triggering the leak requires either local access to the host or hardware error conditions that repeatedly drive SRQ creation into the faulty cleanup path. No CVSS score has been assigned, no public exploit exists, and the vulnerability is not in CISA KEV. Impact is limited to systems running older Mellanox mlx4-generation hardware (ConnectX-3 and earlier); adapters that use the mlx5 driver (ConnectX-4 and later) are not affected. Risk to AI/ML workloads is indirect and limited to training cluster stability.
What should I do?
1. Identify training or inference nodes running Linux kernels with the mlx4_ib RDMA driver on Mellanox mlx4-generation hardware (ConnectX-3 and earlier).
2. Apply the kernel patches from the stable branches referenced in the CVE advisory (five git.kernel.org commits: 0dbd6197, 388617f4, c54c7e4c, c5dc30da, e01b8c92).
3. If immediate patching is not feasible, consider periodically rebooting affected nodes during maintenance windows as a temporary measure; restarting the RDMA workload alone is unlikely to reclaim resources leaked inside the kernel driver.
4. Monitor kernel memory usage (e.g., /proc/meminfo, kernel OOM events) on RDMA-enabled nodes.
5. Operators whose Mellanox adapters use the mlx5 driver (ConnectX-4 and later) are not affected and require no action.
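The identification step above can be sketched in Python. This is a minimal illustration, not a vetted inventory tool: it assumes the standard /proc/modules layout (module name as the first field on each line), and the function names and the three result labels are hypothetical choices made for this example.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return module names from /proc/modules-style text (first field per line)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def classify_node(proc_modules_text: str) -> str:
    """Classify a node by which Mellanox RDMA driver module is loaded.

    The labels are illustrative only:
      "review"      -> mlx4_ib is loaded; include the node in the patch schedule
      "mlx5-only"   -> only mlx5_ib is loaded; the advisory says no action needed
      "no-mlx-rdma" -> neither module is loaded
    """
    mods = loaded_modules(proc_modules_text)
    if "mlx4_ib" in mods:
        return "review"
    if "mlx5_ib" in mods:
        return "mlx5-only"
    return "no-mlx-rdma"


if __name__ == "__main__":
    path = Path("/proc/modules")
    # On systems without /proc/modules, fall back to an empty module list.
    text = path.read_text() if path.exists() else ""
    print(classify_node(text))
```

A loaded mlx4_ib module only shows that the driver is present; whether the running kernel already contains the fix still has to be checked against the distribution's kernel changelog.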
Frequently Asked Questions
What is CVE-2026-46178?
CVE-2026-46178 is a resource leak in the Linux kernel's mlx4 RDMA driver: the error unwind in mlx4_ib_create_srq() did not call mlx4_srq_free() to undo mlx4_srq_alloc(), so failed SRQ creations can leak kernel resources. It matters mainly to distributed training clusters that use RDMA/InfiniBand on older Mellanox mlx4-generation adapters, where repeated SRQ creation failures could degrade node stability over time. With no assigned CVSS score, no public exploit, and no CISA KEV entry, it belongs in routine patch cadence rather than emergency response.
Is CVE-2026-46178 actively exploited?
No confirmed active exploitation of CVE-2026-46178 has been reported, but organizations should still patch proactively.
How to fix CVE-2026-46178?
Apply the kernel stable-branch patches listed under References (five git.kernel.org commits: 0dbd6197, 388617f4, c54c7e4c, c5dc30da, e01b8c92), prioritizing nodes that run the mlx4_ib RDMA driver. Until patches are applied, monitor kernel memory usage on RDMA-enabled nodes (e.g., /proc/meminfo, kernel OOM events). The full step list is under "What should I do?".
What systems are affected by CVE-2026-46178?
The vulnerability affects Linux systems that use the mlx4_ib RDMA driver. In AI/ML terms, the affected architecture patterns are distributed ML training clusters, RDMA/InfiniBand HPC training infrastructure, and multi-node training pipelines (PyTorch DDP, Horovod).
What is the CVSS score for CVE-2026-46178?
No CVSS score has been assigned yet.
AI Security Impact
MITRE ATLAS Techniques
AML.T0029 Denial of AI Service
Compliance Controls Affected
Technical Details
Original Advisory
In the Linux kernel, the following vulnerability has been resolved:
RDMA/mlx4: Fix resource leak on error in mlx4_ib_create_srq()
Sashiko points out that mlx4_srq_alloc() was not undone during error unwind, add the missing call to mlx4_srq_free().
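The class of bug described in the advisory can be illustrated with a toy Python model. This is not the kernel code: ToyDevice, the "buffer" resource, and the create_srq() parameters are hypothetical stand-ins, and the advisory does not say which later step fails. The sketch only shows why an error path that skips one release leaves a resource held after every failed call.

```python
class ToyDevice:
    """Hypothetical stand-in for driver state; tracks resources currently held."""

    def __init__(self):
        self.held = []

    def acquire(self, name):
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)


def create_srq(dev, fail_after_alloc, undo_alloc):
    """Toy model of a multi-step creation routine with error unwind.

    fail_after_alloc: simulate a failure in some step after the allocation.
    undo_alloc: True models the fixed unwind, False models the leaky one.
    """
    dev.acquire("buffer")        # an earlier setup step (illustrative)
    dev.acquire("srq_alloc")     # models mlx4_srq_alloc()
    if fail_after_alloc:
        # Error unwind: release resources in reverse order of acquisition.
        if undo_alloc:
            dev.release("srq_alloc")   # models the added mlx4_srq_free()
        dev.release("buffer")
        return False
    return True


leaky, fixed = ToyDevice(), ToyDevice()
for _ in range(3):
    create_srq(leaky, fail_after_alloc=True, undo_alloc=False)
    create_srq(fixed, fail_after_alloc=True, undo_alloc=True)

print(len(leaky.held), len(fixed.held))  # 3 0
```

In the model, each failed call without the undo step leaves one "srq_alloc" entry behind, which mirrors how repeated failures accumulate leaked resources.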
Exploitation Scenario
An adversary with local access to an HPC or ML training node equipped with Mellanox mlx4 RDMA hardware could craft a workload or tool that repeatedly triggers RDMA Shared Receive Queue (SRQ) creation failures, for example by exhausting a specific resource limit or inducing a transient hardware error condition. Each failed mlx4_ib_create_srq() call that reaches the faulty unwind leaks a kernel SRQ object. Over a multi-day distributed training job, accumulated leaks could reduce available kernel memory, eventually triggering OOM kills on training processes or destabilizing the node. This is an indirect, low-sophistication denial of service against training infrastructure rather than a targeted AI attack, and it requires physical or local access to the affected host.
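A slow leak of this kind is easier to notice as a trend than as a single reading. The sketch below parses /proc/meminfo-style text and flags a steady decline across samples. Treat it as a starting point under stated assumptions: the helper names and the threshold are hypothetical, and which counter (MemAvailable, Slab, SUnreclaim) would actually move for this particular leak is not stated in the advisory.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}.

    Values are in kB for most fields; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def steady_decline(samples: list[int], min_drop_kb: int) -> bool:
    """True if no sample rises above the previous one and the total drop
    from first to last sample is at least min_drop_kb."""
    if len(samples) < 2:
        return False
    non_increasing = all(b <= a for a, b in zip(samples, samples[1:]))
    return non_increasing and (samples[0] - samples[-1]) >= min_drop_kb


# Example with made-up readings of MemAvailable taken at fixed intervals.
readings = [8_000_000, 7_900_000, 7_750_000, 7_600_000]
print(steady_decline(readings, min_drop_kb=200_000))  # True
```

In practice the samples would come from reading /proc/meminfo at a fixed interval on each RDMA-enabled node and passing one field, such as MemAvailable, to steady_decline(). A declining trend is only a hint, since ordinary workload growth produces the same signal.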
References
- git.kernel.org/stable/c/0dbd619716fb07b7de1acd64fec673ee6e1adde7
- git.kernel.org/stable/c/388617f44d81604a760742a0b5de292d411e63e3
- git.kernel.org/stable/c/c54c7e4cb679c0aaa1cb489b9c3f2cd98e63a44c
- git.kernel.org/stable/c/c5dc30da990045105c9762248d23076223e7878a
- git.kernel.org/stable/c/e01b8c9286c470b71a38acd320106f2c4f2826a1
Related Vulnerabilities
- CVE-2026-33660 10.0 TensorFlow: type confusion NPD in tensor conversion (same attack type: DoS)
- CVE-2022-35939 9.8 TensorFlow: ScatterNd OOB write enables RCE/crash (same attack type: DoS)
- CVE-2022-23587 9.8 TensorFlow: integer overflow in Grappler enables RCE (same attack type: DoS)
- CVE-2022-41900 9.8 TensorFlow: heap OOB RCE in FractionalMaxPool op (same attack type: DoS)
- CVE-2023-25668 9.8 TensorFlow: unauthenticated RCE via heap buffer overflow (same attack type: DoS)