CVE-2025-4287: PyTorch NCCL: local DoS in distributed training reduce op
Severity: LOW | CISA SSVC decision: Track
Low-severity local DoS in PyTorch's NCCL reduce function (torch.cuda.nccl.reduce). Exploitation requires local access with unprivileged credentials; the primary risk is in shared GPU clusters or multi-tenant ML training environments, where a rogue user can crash distributed training jobs. Apply the upstream patch; if patching is blocked, restrict local access to training nodes.
Risk Assessment
Risk is low in typical deployments. The CVSS 3.3 score reflects the local-only attack vector and availability-only impact. Effective risk rises in shared HPC/GPU cluster environments where multiple teams share nodes; there, a low-privileged insider or a compromised account can disrupt expensive distributed training runs. The flaw is not exploitable remotely, no active exploitation has been observed, and there is no CISA KEV listing.
Recommended Action
1. Patch: apply upstream commit 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5, or update PyTorch once a patched release ships.
2. Workaround: restrict local shell access on GPU training nodes to authorized users via SSH key controls and namespace isolation.
3. In Kubernetes/containerized training (e.g., Kubeflow, Ray), enforce pod security standards and limit inter-pod privilege escalation.
4. Detection: monitor for unexpected process terminations or hangs in distributed training jobs correlated with nccl.reduce call stacks; check NCCL logs and PyTorch DDP error traces (see the logging sketch after this list).
5. Inventory all PyTorch 2.6.0+cu124 deployments in training infrastructure.
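To make step 4 concrete, here is a minimal detection sketch that turns on NCCL's own debug logging and wraps a training step so collective failures are logged before the job dies. It assumes a standard torch.distributed/DDP launch (e.g., torchrun); the log path and logger name are illustrative assumptions, and the "NCCL" substring match should be tuned to whatever your build's error messages actually contain.

```python
# Detection sketch (step 4): surface NCCL collective failures loudly.
# Assumes a standard torch.distributed launch; log path is an assumption.
import logging
import os

# NCCL debug settings must be in the environment before the process group
# (and thus the NCCL communicator) is created. %h and %p expand to
# hostname and PID in the debug file name.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_FILE", "/var/log/nccl.%h.%p.log")

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nccl-watch")

def guarded_step(step_fn, *args, **kwargs):
    """Run one training step; log NCCL failures before re-raising."""
    try:
        return step_fn(*args, **kwargs)
    except RuntimeError as exc:
        # PyTorch generally surfaces failed NCCL collectives as RuntimeError
        # mentioning "NCCL"; adjust this match for your build's messages.
        if "NCCL" in str(exc):
            log.error("NCCL failure (possible CVE-2025-4287 DoS): %s", exc)
        raise
```

Correlating these entries with scheduler records (user, node, time) is what turns an anonymous hung job into an attributable event.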
CISA SSVC Assessment
Source: CISA Vulnrichment (SSVC v2.0). Decision: Track, reached via the CISA Coordinator decision tree.
Frequently Asked Questions
What is CVE-2025-4287?
Low-severity local DoS in PyTorch's NCCL reduce function (torch.cuda.nccl.reduce). Exploitation requires local access with unprivileged credentials; the primary risk is in shared GPU clusters or multi-tenant ML training environments, where a rogue user can crash distributed training jobs. Apply the upstream patch; if patching is blocked, restrict local access to training nodes.
Is CVE-2025-4287 actively exploited?
No confirmed active exploitation of CVE-2025-4287 has been reported, but organizations should still patch proactively.
How to fix CVE-2025-4287?
1. Patch: apply commit 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5 or update PyTorch once a patched release ships. 2. Workaround: restrict local shell access to GPU training nodes to authorized users via SSH key controls and namespace isolation. 3. In Kubernetes/containerized training (e.g., Kubeflow, Ray), enforce pod security standards and limit inter-pod privilege escalation. 4. Detection: monitor for unexpected process terminations or hangs in distributed training jobs correlated with nccl.reduce call stacks (check NCCL logs and PyTorch DDP error traces). 5. Inventory all PyTorch 2.6.0+cu124 deployments in training infrastructure.
What systems are affected by CVE-2025-4287?
This vulnerability affects the following AI/ML architecture patterns: distributed training pipelines, multi-GPU model serving, training pipelines.
What is the CVSS score for CVE-2025-4287?
CVE-2025-4287 has a CVSS v3.1 base score of 3.3 (LOW). The EPSS exploitation probability is 0.08%.
Technical Details
NVD Description
A vulnerability was found in PyTorch 2.6.0+cu124. It has been rated as problematic. Affected by this issue is the function torch.cuda.nccl.reduce of the file torch/cuda/nccl.py. The manipulation leads to denial of service. It is possible to launch the attack on the local host. The exploit has been disclosed to the public and may be used. The patch is identified as 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5. It is recommended to apply a patch to fix this issue.
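Since the description pins the affected build string, the inventory step above can start with a simple probe on each node. A minimal sketch, assuming the advisory's 2.6.0+cu124 string is the build to flag; confirm the full affected range upstream before relying on it.

```python
# Inventory sketch: flag interpreters running the build named in the NVD
# description. Treating it as the only affected build is an assumption.
import torch

AFFECTED_BUILDS = {"2.6.0+cu124"}

def is_flagged(version: str = torch.__version__) -> bool:
    """True if this interpreter's torch build matches the advisory."""
    return version in AFFECTED_BUILDS

if __name__ == "__main__":
    status = "AFFECTED" if is_flagged() else "not a flagged build"
    print(f"torch {torch.__version__}: {status}")
```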
Exploitation Scenario
An adversary with low-privileged local access to a shared GPU training node (e.g., via a shared HPC account or a compromised ML engineer credential) triggers torch.cuda.nccl.reduce with crafted input that causes improper resource release. This crashes or hangs the NCCL collective operation, causing the entire distributed training job to stall — potentially destroying hours or days of in-progress model training with no data corruption of stored checkpoints. In a multi-tenant GPU cluster scenario (e.g., research institution or internal ML platform), this could be used as sabotage against a competing team's training run.
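Because the impact is lost in-progress training rather than corrupted checkpoints, frequent checkpointing caps what such sabotage can destroy. A minimal sketch follows; the interval, path, and saved state are illustrative assumptions for a generic PyTorch job.

```python
# Mitigation sketch: periodic checkpointing so a crashed or hung collective
# costs at most one interval of work. Interval and path are assumptions.
import torch

def maybe_checkpoint(step, model, optimizer, every=500, path="ckpt.pt"):
    # Stored checkpoints are not corrupted by this DoS (availability-only
    # impact), so saving often bounds the work lost to a mid-run crash.
    # In DDP, call this from rank 0 only to avoid concurrent writes.
    if step > 0 and step % every == 0:
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            path,
        )
```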
Weaknesses (CWE)
CWE-404: Improper Resource Shutdown or Release
CVSS Vector
CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:L
Related Vulnerabilities
CVE-2026-33660 (CVSS 10.0) TensorFlow: type confusion NPD in tensor conversion. Same attack type: DoS
CVE-2022-35939 (CVSS 9.8) TensorFlow: ScatterNd OOB write enables RCE/crash. Same attack type: DoS
CVE-2022-23587 (CVSS 9.8) TensorFlow: integer overflow in Grappler enables RCE. Same attack type: DoS
CVE-2022-41900 (CVSS 9.8) TensorFlow: heap OOB RCE in FractionalMaxPool op. Same attack type: DoS
CVE-2023-25668 (CVSS 9.8) TensorFlow: unauthenticated RCE via heap buffer overflow. Same attack type: DoS