Linux mlx4 RDMA: resource leak on SRQ creation error — AWAITING NVD (CVE-2026-46178)

CISO Take

CVE-2026-46178 is a resource leak in the Linux kernel's mlx4 RDMA driver: the error unwind in mlx4_ib_create_srq() did not call mlx4_srq_free(), so an SRQ allocated by mlx4_srq_alloc() is never released when a later step of creation fails. For most AI/ML teams this is low priority. Distributed training clusters that run RDMA/InfiniBand on older Mellanox mlx4-generation adapters (the ConnectX-3 family; ConnectX-4 and later use the mlx5 driver), for example under PyTorch DDP, Horovod, or MPI-based large-scale training, should be aware that repeated SRQ creation failures could degrade node stability over time. There is no assigned CVSS score, no public exploit, and no CISA KEV entry, which places this in routine patch cadence rather than emergency response. Apply the kernel stable-branch patches referenced in the CVE and include affected training nodes in your standard kernel update schedule.

Sources: NVD, ATLAS

What is the risk?

Low risk. This is a resource leak on an error path in the mlx4 RDMA kernel driver; it does not enable direct code execution, privilege escalation, or data exfiltration. Triggering it requires either local access to the node or specific hardware error conditions that repeatedly drive SRQ creation into the faulty cleanup path. No CVSS score has been assigned, no public exploit exists, and the vulnerability is not in CISA KEV. Impact is limited to systems running older Mellanox mlx4-generation hardware (the ConnectX-3 family); systems on mlx5-based hardware (ConnectX-4 and later) are not affected. Risk to AI/ML workloads is indirect and limited to training cluster stability.

Attack Kill Chain

Initial Access

Attacker gains local access to an RDMA-enabled ML training node running the mlx4_ib kernel driver on Mellanox mlx4-generation (ConnectX-3 family) hardware.

AML.T0049

Resource Exhaustion Trigger

Attacker repeatedly triggers error conditions in mlx4_ib_create_srq(), causing each failed call to leak a kernel SRQ object due to the missing mlx4_srq_free() call.

Impact

Accumulated kernel memory leaks degrade available memory on the training node, ultimately causing OOM kills of distributed training processes or node instability.

AML.T0029

Severity & Risk

CVSS 3.1

N/A

EPSS

N/A

Exploitation Status

No known exploitation

Sophistication

Advanced

What should I do?

5 steps

Identify training or inference nodes running Linux kernels with the mlx4_ib RDMA driver on Mellanox mlx4-generation (ConnectX-3 family) hardware.
Apply the kernel patches from the stable branches referenced in the CVE advisory (git.kernel.org commits: 0dbd6197, 388617f4, c54c7e4c, c5dc30da, e01b8c92).
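If immediate patching is not feasible, consider scheduling node reboots or mlx4 driver reloads during maintenance windows as a temporary measure; a kernel-side leak is not reclaimed when the user-space RDMA workload exits, so restarting the workload alone is unlikely to help.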
Monitor kernel memory usage (e.g., /proc/meminfo, kernel OOM events) on RDMA-enabled nodes.
Operators using mlx5-based Mellanox hardware (ConnectX-4 and later) are not affected and require no action.
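
The identification step above can be sketched as a small check of the loaded-module list. This is a minimal sketch, not a vendor tool: the function names are hypothetical, and it assumes the driver is built as a loadable module, since a driver compiled into the kernel does not appear in /proc/modules.

```python
from pathlib import Path
from typing import List


def loaded_mlx4_modules(proc_modules_text: str) -> List[str]:
    """Return the names of loaded kernel modules that start with 'mlx4_'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0].startswith("mlx4_"):
            names.append(fields[0])
    return sorted(names)


def node_uses_mlx4_ib(proc_modules_text: str) -> bool:
    """True when the mlx4_ib RDMA driver appears in the module list."""
    return "mlx4_ib" in loaded_mlx4_modules(proc_modules_text)


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    print("mlx4 modules:", loaded_mlx4_modules(text))
    print("mlx4_ib loaded:", node_uses_mlx4_ib(text))
```

Run across a fleet, a check like this separates nodes that need the patched kernel from mlx5-only nodes that need no action. A loaded module only shows the node is a candidate; whether the running kernel already contains the fix still has to be confirmed against the distribution's changelog.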

Classification

DoS, Inference, AML.T0029 - Denial of AI Service

Compliance Impact

This CVE is relevant to:

ISO 42001

A.6.2 - Resources for AI systems

NIST AI RMF

MANAGE 2.2 - Mechanisms are in place to respond to AI risks

Frequently Asked Questions

What is CVE-2026-46178?

CVE-2026-46178 is a resource leak in the Linux kernel's mlx4 RDMA driver: the error unwind in mlx4_ib_create_srq() did not call mlx4_srq_free(), so an SRQ allocated by mlx4_srq_alloc() is never released when a later step of creation fails. It matters mainly to distributed training clusters that run RDMA/InfiniBand on older Mellanox mlx4-generation adapters (the ConnectX-3 family), where repeated SRQ creation failures could degrade node stability over time. There is no assigned CVSS score, no public exploit, and no CISA KEV entry, so it belongs in routine patch cadence rather than emergency response.

Is CVE-2026-46178 actively exploited?

No confirmed active exploitation of CVE-2026-46178 has been reported, but organizations should still patch proactively.

How to fix CVE-2026-46178?

1. Identify training or inference nodes running Linux kernels with the mlx4_ib RDMA driver on Mellanox mlx4-generation (ConnectX-3 family) hardware.
2. Apply the kernel patches from the stable branches referenced in the CVE advisory (git.kernel.org commits: 0dbd6197, 388617f4, c54c7e4c, c5dc30da, e01b8c92).
3. If immediate patching is not feasible, consider scheduling node reboots or mlx4 driver reloads during maintenance windows as a temporary measure; a kernel-side leak is not reclaimed when the user-space RDMA workload exits.
4. Monitor kernel memory usage (e.g., /proc/meminfo, kernel OOM events) on RDMA-enabled nodes.
5. Operators using mlx5-based Mellanox hardware (ConnectX-4 and later) are not affected and require no action.

What systems are affected by CVE-2026-46178?

This vulnerability affects the following AI/ML architecture patterns: Distributed ML training clusters, RDMA/InfiniBand HPC training infrastructure, Multi-node training pipelines (PyTorch DDP, Horovod).

What is the CVSS score for CVE-2026-46178?

No CVSS score has been assigned yet.

AI Security Impact

Affected AI Architectures

Distributed ML training clusters

RDMA/InfiniBand HPC training infrastructure

Multi-node training pipelines (PyTorch DDP, Horovod)

MITRE ATLAS Techniques

AML.T0029 Denial of AI Service

Compliance Controls Affected

ISO 42001: A.6.2

NIST AI RMF: MANAGE 2.2

Technical Details

Original Advisory

In the Linux kernel, the following vulnerability has been resolved:

RDMA/mlx4: Fix resource leak on error in mlx4_ib_create_srq()

Sashiko points out that mlx4_srq_alloc() was not undone during error unwind, add the missing call to mlx4_srq_free().
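
The advisory describes a classic error-unwind omission: an allocation succeeds, a later step fails, and the cleanup path forgets to undo the allocation. The toy model below illustrates that pattern only. It is not the kernel code; `SrqPool`, `create_srq`, and `leaked_after_failures` are hypothetical names, and which later step fails in the real driver is left unspecified here.

```python
class SrqPool:
    """Toy stand-in for a device-side SRQ number allocator (hypothetical)."""

    def __init__(self, size):
        self.free_numbers = set(range(size))

    def alloc(self):
        # Plays the role of mlx4_srq_alloc(): reserve an SRQ number.
        return self.free_numbers.pop()

    def free(self, srqn):
        # Plays the role of mlx4_srq_free(): return the reservation.
        self.free_numbers.add(srqn)


def create_srq(pool, later_step_fails, undo_alloc):
    """Model a create path with one step after the allocation.

    undo_alloc=False models the unwind before the fix,
    undo_alloc=True models the unwind with the missing call added.
    """
    srqn = pool.alloc()
    if later_step_fails:
        if undo_alloc:
            pool.free(srqn)
        return None  # creation failed; the caller never learns srqn
    return srqn


def leaked_after_failures(pool_size, failures, undo_alloc):
    """Count reservations still held after a run of failed creations."""
    pool = SrqPool(pool_size)
    for _ in range(failures):
        create_srq(pool, later_step_fails=True, undo_alloc=undo_alloc)
    return pool_size - len(pool.free_numbers)
```

In this model each failed creation without the undo permanently consumes one slot, while the fixed path returns every slot. Because the caller never receives a handle to the half-created object, nothing outside the create path can release it later, which is why the fix belongs in the unwind itself.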

Exploitation Scenario

An adversary with local access to an HPC or ML training node equipped with Mellanox mlx4 RDMA hardware could craft a workload or tool that repeatedly triggers RDMA Shared Receive Queue (SRQ) creation failures, for example by exhausting a specific resource limit or inducing a transient hardware error condition. Each failed mlx4_ib_create_srq() call leaks a kernel SRQ object. Over the duration of a multi-day distributed training job, accumulated leaks could degrade available kernel memory, eventually triggering OOM kills on training processes or destabilizing the node. This is an indirect denial of service against training infrastructure rather than a targeted AI attack, and it requires local or physical access to the affected host.
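
The gradual degradation described in this scenario can be watched for by comparing /proc/meminfo snapshots over time. The sketch below is one way to do that under stated assumptions: the function names and the threshold are hypothetical, and the choice of the SUnreclaim field is an assumption, since the leaked SRQ resources may be accounted in a different counter on a given kernel.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def field_growth(before_text, after_text, field="SUnreclaim"):
    """Return how much a meminfo field grew between two snapshots."""
    before = parse_meminfo(before_text)
    after = parse_meminfo(after_text)
    return after[field] - before[field]


def exceeds_threshold(before_text, after_text, threshold_kib, field="SUnreclaim"):
    """True when the field grew by more than threshold_kib between snapshots."""
    return field_growth(before_text, after_text, field) > threshold_kib
```

Snapshots would typically be taken at fixed intervals during a long training job and compared against a baseline from job start. Steady growth in unreclaimable kernel memory has many possible causes, so an alert from a check like this is a prompt to investigate rather than proof that this specific leak is occurring.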