CVE-2025-4287: PyTorch NCCL: local DoS in distributed training reduce op

Severity: LOW | CISA SSVC: Track*
Published May 5, 2025
CISO Take

Low-severity local denial-of-service (DoS) in PyTorch's NCCL reduce function (torch.cuda.nccl.reduce). Exploitation requires local access with unprivileged credentials, so the primary risk is in shared GPU clusters or multi-tenant ML training environments where a rogue user can crash distributed training jobs. Apply the upstream patch; if patching is blocked, restrict local access to training nodes.

Risk Assessment

Risk is low in typical deployments. CVSS 3.3 reflects the local-only attack vector and availability-only impact. Effective risk elevates in shared HPC/GPU cluster environments where multiple teams share nodes — there, a low-privileged insider or compromised account can disrupt expensive distributed training runs. Not exploitable remotely. No active exploitation observed. No CISA KEV listing.

Severity & Risk

CVSS 3.1: 3.3 / 10
EPSS: 0.1% chance of exploitation in 30 days (higher than 23% of all CVEs)
Exploitation Status: Exploit Available
Exploitation: MEDIUM
Sophistication: Trivial
Exploitation Confidence: medium (CISA SSVC: Public PoC)

Composite signal derived from CISA KEV, CISA SSVC, EPSS, trickest/cve, and Nuclei templates.

Attack Surface

AV (Attack Vector): Local
AC (Attack Complexity): Low
PR (Privileges Required): Low
UI (User Interaction): None
S (Scope): Unchanged
C (Confidentiality): None
I (Integrity): None
A (Availability): Low

Recommended Action

  1. Patch: apply commit 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5 or update PyTorch once a patched release ships.

  2. Workaround: restrict local shell access to GPU training nodes to authorized users via SSH key controls and namespace isolation.

  3. In Kubernetes/containerized training (e.g., Kubeflow, Ray), enforce pod security standards and limit inter-pod privilege escalation.

  4. Detection: monitor for unexpected process terminations or hangs in distributed training jobs correlated with nccl.reduce call stacks (check NCCL logs and PyTorch DDP error traces).

  5. Inventory all PyTorch 2.6.0+cu124 deployments in training infrastructure.
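Step 5's inventory check can be scripted. A minimal sketch, assuming the 2.6.0+cu124 build named in the NVD entry is the only affected version (other builds of 2.6.0 may warrant review as well):

```python
# Flag PyTorch installs that match the build named in the NVD entry.
# Assumption: the advisory lists only 2.6.0+cu124; treat this as a
# starting point for inventory, not a definitive affected-version gate.
AFFECTED_BUILDS = {"2.6.0+cu124"}

def is_affected(version: str) -> bool:
    """Return True if a reported torch.__version__ string matches an affected build."""
    return version.strip() in AFFECTED_BUILDS
```

Collect the version string on each training node, e.g. with `python -c "import torch; print(torch.__version__)"`, and feed each reported string through `is_affected()`.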

CISA SSVC Assessment

Decision: Track*
Exploitation: poc
Automatable: No
Technical Impact: partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Classification

Compliance Impact

This CVE is relevant to:

EU AI Act: Article 15 - Accuracy, robustness and cybersecurity
ISO 42001: A.10.2 - Availability of AI system resources
NIST AI RMF: MANAGE-2.4 - Mechanisms to respond to risks or harms
OWASP LLM Top 10: LLM05 - Supply Chain Vulnerabilities

Frequently Asked Questions

What is CVE-2025-4287?

Low-severity local denial-of-service (DoS) in PyTorch's NCCL reduce function (torch.cuda.nccl.reduce). Exploitation requires local access with unprivileged credentials, so the primary risk is in shared GPU clusters or multi-tenant ML training environments where a rogue user can crash distributed training jobs. Apply the upstream patch; if patching is blocked, restrict local access to training nodes.

Is CVE-2025-4287 actively exploited?

No confirmed active exploitation of CVE-2025-4287 has been reported, but organizations should still patch proactively.

How to fix CVE-2025-4287?

1. Patch: apply commit 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5 or update PyTorch once a patched release ships. 2. Workaround: restrict local shell access to GPU training nodes to authorized users via SSH key controls and namespace isolation. 3. In Kubernetes/containerized training (e.g., Kubeflow, Ray), enforce pod security standards and limit inter-pod privilege escalation. 4. Detection: monitor for unexpected process terminations or hangs in distributed training jobs correlated with nccl.reduce call stacks (check NCCL logs and PyTorch DDP error traces). 5. Inventory all PyTorch 2.6.0+cu124 deployments in training infrastructure.

What systems are affected by CVE-2025-4287?

This vulnerability affects the following AI/ML architecture patterns: distributed training pipelines and multi-GPU model serving.

What is the CVSS score for CVE-2025-4287?

CVE-2025-4287 has a CVSS v3.1 base score of 3.3 (LOW). The EPSS exploitation probability is 0.08%.

Technical Details

NVD Description

A vulnerability was found in PyTorch 2.6.0+cu124. It has been rated as problematic. Affected by this issue is the function torch.cuda.nccl.reduce of the file torch/cuda/nccl.py. The manipulation leads to denial of service. It is possible to launch the attack on the local host. The exploit has been disclosed to the public and may be used. The patch is identified as 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5. It is recommended to apply a patch to fix this issue.

Exploitation Scenario

An adversary with low-privileged local access to a shared GPU training node (e.g., via a shared HPC account or a compromised ML engineer credential) triggers torch.cuda.nccl.reduce with crafted input that causes improper resource release. This crashes or hangs the NCCL collective operation, stalling the entire distributed training job and potentially wasting hours or days of in-progress training; stored checkpoints are not corrupted. In a multi-tenant GPU cluster (e.g., a research institution or an internal ML platform), this could be used to sabotage a competing team's training run.
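A crashed or hung collective typically surfaces in NCCL and DDP logs before anyone notices the stalled job, which is what the detection step above keys on. A minimal log-triage sketch; the patterns below are illustrative assumptions based on common NCCL error strings, not an official signature set:

```python
import re

# Illustrative patterns for NCCL collective failures in training logs.
# Assumption: logs contain standard NCCL diagnostics ("NCCL WARN"/"NCCL ERROR")
# and/or NCCL result names such as ncclSystemError; tune for your stack.
SUSPECT_PATTERNS = [
    re.compile(r"NCCL (?:WARN|ERROR)", re.IGNORECASE),
    re.compile(r"ncclUnhandledCudaError|ncclSystemError|ncclInternalError"),
    re.compile(r"torch\.cuda\.nccl\.reduce"),
]

def suspicious_lines(log_lines):
    """Yield log lines that match any suspect pattern (one hit per line)."""
    for line in log_lines:
        if any(p.search(line) for p in SUSPECT_PATTERNS):
            yield line
```

Correlating these hits with job-scheduler events (which user was on the node when the collective died) is what turns a crash report into a sabotage signal.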

Weaknesses (CWE)

CWE-404 - Improper Resource Shutdown or Release

CVSS Vector

CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:L
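For tooling that ingests this advisory, the vector string above can be unpacked mechanically into the metric/value pairs shown in the Attack Surface section. A minimal sketch of a CVSS v3.1 vector parser (plain string handling, not tied to any CVSS library):

```python
def parse_cvss_vector(vector: str) -> dict:
    """Parse a 'CVSS:3.1/AV:L/...' string into a {metric: value} dict."""
    parts = vector.split("/")
    # The first token is the version label (e.g. 'CVSS:3.1'); the rest
    # are metric:value pairs such as 'AV:L'.
    return dict(p.split(":", 1) for p in parts[1:])

metrics = parse_cvss_vector("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:L")
# metrics["AV"] is "L" (Local); metrics["A"] is "L" (Low availability impact)
```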

Timeline

Published: May 5, 2025
Last Modified: May 5, 2025
First Seen: May 5, 2025

Related Vulnerabilities