CVE-2025-63396
LOW
Low-severity local DoS in PyTorch's profiling subsystem (v2.5, v2.7.1). It is not a production threat, but it is a real risk in shared ML compute environments, where omitting profiler.stop() can crash or hang training jobs during teardown. Enforce the context manager pattern ('with torch.profiler.profile(...)') in all internal ML code immediately as a zero-cost workaround. Track the upstream fix in issue #156563, and deploy patched PyTorch to training infrastructure once a fix is confirmed.
Affected Systems
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| pytorch | pip | — | No patch |
Recommended Action
1. Immediate workaround (zero cost): Mandate use of torch.profiler.profile exclusively as a context manager ('with' statement); this guarantees stop() is called on all exit paths, including exceptions.
2. Code audit: Scan the codebase for torch.profiler.profile instantiated without 'with' (grep/semgrep rule: 'torch.profiler.profile(' not preceded by 'with').
3. CI gate: Add a lint check to block merges with non-context-manager profiler usage.
4. Patch: Upgrade PyTorch when a fix is confirmed via pytorch/pytorch#156563; prioritize training infrastructure over dev environments.
5. Detection: Alert on training jobs stuck in the finalization phase for more than 5 minutes, which signals a potential profiler hang.
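The code audit and CI gate steps could be approximated with a small AST check instead of a text grep. The following is a minimal sketch, not an official rule: the function names `dotted_name` and `find_unmanaged_profilers` are hypothetical, and it only recognizes calls spelled as `torch.profiler.profile(...)` or `profiler.profile(...)`, so aliased imports such as `from torch.profiler import profile` would be missed.

```python
import ast


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_unmanaged_profilers(source):
    """Return line numbers of profiler.profile(...) calls not used as a 'with' item."""
    tree = ast.parse(source)

    # Collect every expression that appears directly as a 'with' context expression.
    managed = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.With, ast.AsyncWith)):
            for item in node.items:
                managed.add(id(item.context_expr))

    flagged = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and id(node) not in managed:
            name = dotted_name(node.func)
            if name is not None and (
                name == "profiler.profile" or name.endswith(".profiler.profile")
            ):
                flagged.append(node.lineno)
    return sorted(flagged)
```

A CI job might run this over each changed file and fail the build when the returned list is non-empty. Because it works on the parsed syntax tree, it does not need PyTorch installed on the lint runner.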
Technical Details
NVD Description
An issue was discovered in PyTorch v2.5 and v2.7.1. Omission of profiler.stop() can cause torch.profiler.profile (PythonTracer) to crash or hang during finalization, leading to a Denial of Service (DoS).
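To make the failure mode concrete, the sketch below contrasts two ways of ensuring stop() always runs. It is illustrative only: `run_profiled` and `profile_one_step` are hypothetical helper names, and the claim that these patterns avoid the hang rests on this advisory's recommended workaround, not on independent testing.

```python
def run_profiled(prof, step_fn):
    """Manual start/stop: the finally block runs stop() even if step_fn raises."""
    prof.start()
    try:
        return step_fn()
    finally:
        prof.stop()


def profile_one_step(step_fn):
    """Preferred form: the 'with' statement calls stop() on every exit path."""
    import torch  # imported lazily so run_profiled has no torch dependency

    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU]
    ) as prof:
        step_fn()
    return prof
```

`run_profiled` accepts any object with start() and stop() methods, so it can wrap an existing `torch.profiler.profile` instance in code that cannot easily be restructured around a 'with' block.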
Exploitation Scenario
A malicious insider or compromised CI pipeline submits a training script to a shared ML repository with profiling enabled and no stop() call in the exception handler. When the automated training platform hits a dataset condition that triggers the exception path, the profiler hangs during Python finalization. On a shared GPU cluster with resource isolation gaps, the hung job holds its allocated node indefinitely, denying training capacity to other teams until it is forcibly killed. The more common trigger is unintentional: a developer forgets profiler cleanup in async training code, causing intermittent job failures that burn engineering time to debug.
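One way to limit the impact of a hung job is to launch it under a wall-clock deadline so that it cannot hold a node indefinitely. The sketch below is an assumption-laden illustration: `run_with_deadline` and `run_script_with_deadline` are hypothetical names, and the deadline covers the whole job, not only the finalization phase.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd; return ("ok", returncode) or ("timeout", None) if it overruns."""
    try:
        done = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return ("timeout", None)
    return ("ok", done.returncode)


def run_script_with_deadline(script_path, timeout_s):
    """Convenience wrapper: run a Python script with the current interpreter."""
    return run_with_deadline([sys.executable, script_path], timeout_s)
```

A scheduler wrapper could treat the "timeout" result as the alert condition. Note that subprocess.run only kills the direct child process; a job that spawns its own worker processes would need process-group handling, which this sketch omits.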
CVSS Vector
CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:L
References
- pytorch.com (Product)
- github.com/Daisy2ang (Not Applicable)
- github.com/pytorch/pytorch (Product)
- github.com/pytorch/pytorch/issues/156563 (Exploit, Issue)