If your organization runs vLLM versions 0.5.5 through 0.11.0 to serve LLM inference APIs, any authenticated user with minimal access can deliberately stall your entire API server by sending crafted chat_template_kwargs parameters, denying service to all other users. Patch to vLLM 0.11.1 immediately; if patching is not possible today, add request-level timeouts and block or sanitize the chat_template_kwargs parameter at your API gateway before it reaches vLLM. This is a low-complexity, network-reachable DoS with no user interaction required.
What is the risk?
Operational risk exceeds the CVSS 6.5 score in AI-serving contexts. vLLM is the de facto standard for high-throughput LLM inference and is widely deployed in internal AI platforms, SaaS products, and model-serving infrastructure. The attack requires only low privileges (any valid API token), is network-accessible, and has low complexity, which makes it trivially scriptable. A single malicious actor can monopolize server processing time, causing cascading failures across all concurrent inference workloads. EPSS is currently low (0.00087), which indicates a low predicted likelihood of near-term exploitation, but the technique is straightforward enough that the risk of weaponization rises as awareness increases.
What systems are affected?
How severe is it?
What is the attack surface?
What should I do?
1. PATCH: Upgrade vLLM to 0.11.1 or later. This is the only complete fix.
2. WORKAROUND (if patching is not immediate): Inject a middleware or API gateway rule to strip or reject the chat_template_kwargs parameter from all incoming requests to /v1/chat/completions and /tokenize.
3. RATE LIMIT: Enforce per-client request rate limits and maximum concurrent request counts at the proxy or load balancer layer (nginx, Traefik, or Envoy).
4. TIMEOUT: Configure server-side request processing timeouts to cap how long any single request can hold the processing thread.
5. DETECT: Alert on requests to vLLM endpoints containing chat_template_kwargs in your WAF or API gateway logs; investigate any request with abnormally high server-side latency (>30s).
6. AUDIT: Review API access logs to identify any accounts that may have already attempted exploitation.
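As an illustration of the workaround step, here is a minimal Python sketch of a sanitizer that a gateway or proxy hook could call before forwarding a request to vLLM. The function name `sanitize_request` and the way it is wired into a gateway are assumptions for illustration, not part of vLLM or of any specific gateway product; only the two endpoint paths and the parameter name come from the advisory.

```python
import json

# Endpoints named in the advisory as accepting chat_template_kwargs.
AFFECTED_PATHS = {"/v1/chat/completions", "/tokenize"}
BLOCKED_FIELD = "chat_template_kwargs"


def sanitize_request(path: str, raw_body: bytes) -> tuple[bytes, bool]:
    """Hypothetical gateway hook: drop chat_template_kwargs from the JSON body.

    Returns (body_to_forward, was_modified). Requests to other paths and
    bodies that are not JSON objects are passed through unchanged so that
    the upstream server can apply its own validation.
    """
    if path.rstrip("/") not in AFFECTED_PATHS:
        return raw_body, False
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return raw_body, False
    if not isinstance(payload, dict) or BLOCKED_FIELD not in payload:
        return raw_body, False
    del payload[BLOCKED_FIELD]
    return json.dumps(payload).encode("utf-8"), True


if __name__ == "__main__":
    body = b'{"model": "m", "messages": [], "chat_template_kwargs": {"k": 1}}'
    cleaned, modified = sanitize_request("/v1/chat/completions", body)
    print(modified, cleaned.decode("utf-8"))
```

Stripping the field keeps well-behaved clients working, at the cost of silently ignoring any legitimate template options. A stricter variant would return an HTTP 400 whenever `was_modified` would be true; which behavior fits depends on whether any of your clients rely on the parameter.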
What does CISA's SSVC say?
Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.
How is it classified?
Which compliance frameworks are affected?
This CVE is relevant to:
Frequently Asked Questions
What is CVE-2025-62426?
CVE-2025-62426 is a denial-of-service vulnerability in vLLM versions 0.5.5 up to, but not including, 0.11.1. The /v1/chat/completions and /tokenize endpoints accept a chat_template_kwargs request parameter that is used before it is properly validated against the chat template, so an authenticated user can send crafted values that block the API server for long periods and delay all other requests. The issue is patched in vLLM 0.11.1.
Is CVE-2025-62426 actively exploited?
No confirmed active exploitation of CVE-2025-62426 has been reported, but organizations should still patch proactively.
How to fix CVE-2025-62426?
1. PATCH: Upgrade vLLM to 0.11.1 or later. This is the only complete fix.
2. WORKAROUND (if patching is not immediate): Inject a middleware or API gateway rule to strip or reject the chat_template_kwargs parameter from all incoming requests to /v1/chat/completions and /tokenize.
3. RATE LIMIT: Enforce per-client request rate limits and maximum concurrent request counts at the proxy or load balancer layer (nginx, Traefik, or Envoy).
4. TIMEOUT: Configure server-side request processing timeouts to cap how long any single request can hold the processing thread.
5. DETECT: Alert on requests to vLLM endpoints containing chat_template_kwargs in your WAF or API gateway logs; investigate any request with abnormally high server-side latency (>30s).
6. AUDIT: Review API access logs to identify any accounts that may have already attempted exploitation.
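The detection and audit steps could be sketched as a small log scan like the one below. The log format is an assumption: it supposes one JSON object per line with hypothetical fields `client_id`, `request_body`, and `latency_s`, which your gateway may name differently or not record at all. Only the parameter name and the 30-second threshold come from the guidance above.

```python
import json
from collections import Counter

LATENCY_THRESHOLD_S = 30.0  # threshold suggested in the DETECT step
SUSPECT_FIELD = "chat_template_kwargs"


def flag_suspicious(log_lines):
    """Count suspicious events per (client_id, reason) in JSON-lines logs.

    Field names (client_id, request_body, latency_s) are hypothetical and
    must be adapted to the actual gateway log schema.
    """
    hits = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        client = record.get("client_id", "unknown")
        if SUSPECT_FIELD in str(record.get("request_body", "")):
            hits[(client, "chat_template_kwargs")] += 1
        latency = record.get("latency_s")
        if isinstance(latency, (int, float)) and latency > LATENCY_THRESHOLD_S:
            hits[(client, "slow_request")] += 1
    return hits


if __name__ == "__main__":
    sample = [
        '{"client_id": "a", "request_body": "{\\"chat_template_kwargs\\": {}}", "latency_s": 45.2}',
        '{"client_id": "b", "request_body": "{\\"messages\\": []}", "latency_s": 0.8}',
    ]
    for (client, reason), count in flag_suspicious(sample).items():
        print(client, reason, count)
```

A match on the parameter name alone is not proof of an attack, since the parameter also has legitimate uses; accounts that combine it with repeated slow requests are the ones worth reviewing first.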
What systems are affected by CVE-2025-62426?
This vulnerability affects the following AI/ML architecture patterns: LLM inference serving, OpenAI-compatible API endpoints, model serving, multi-tenant AI platforms, internal AI developer platforms.
What is the CVSS score for CVE-2025-62426?
CVE-2025-62426 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.32%.
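For readers who want to see how the 6.5 base score follows from the vector CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H, the sketch below applies the CVSS v3.1 base-score formula. It is a simplified illustration that handles only unchanged scope (S:U), which is what this vector uses; the function names are chosen here for illustration.

```python
import math

# Metric weights from the CVSS v3.1 specification.
# PR weights are the ones that apply when scope is unchanged (S:U).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the CVSS v3.1 base score for an S:U vector."""
    prefix, *parts = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    metrics = dict(part.split(":") for part in parts)
    if metrics.get("S") != "U":
        raise ValueError("this sketch only handles unchanged scope (S:U)")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


if __name__ == "__main__":
    print(base_score("CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H"))  # 6.5
```

With only availability affected, the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is 8.22 × 0.85 × 0.77 × 0.62 × 0.85 ≈ 2.84; their sum, about 6.43, rounds up to 6.5.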
What is the AI security impact?
Affected AI Architectures
MITRE ATLAS Techniques
- AML.T0029 Denial of AI Service
- AML.T0034 Cost Harvesting
- AML.T0040 AI Model Inference API Access
- AML.T0049 Exploit Public-Facing Application
Compliance Controls Affected
What are the technical details?
Original Advisory
vLLM is an inference and serving engine for large language models (LLMs). From version 0.5.5 to before 0.11.1, the /v1/chat/completions and /tokenize endpoints allow a chat_template_kwargs request parameter that is used in the code before it is properly validated against the chat template. With the right chat_template_kwargs parameters, it is possible to block processing of the API server for long periods of time, delaying all other requests. This issue has been patched in version 0.11.1.
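The affected range stated above (0.5.5 up to, but not including, 0.11.1) can be checked with a few lines of Python. This is a minimal sketch that assumes plain numeric release strings such as "0.10.2"; it deliberately ignores suffixes like release-candidate or post-release tags, so treat a suffixed version near the boundary as needing manual review.

```python
import re

FIRST_AFFECTED = (0, 5, 5)  # first affected release per the advisory
FIRST_FIXED = (0, 11, 1)    # first patched release per the advisory


def parse_release(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from the start of a version string.

    Any suffix after the numeric part (rc, post, dev, local tags) is
    ignored, which is a simplification of real version ordering.
    """
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_affected(version: str) -> bool:
    """True if the version falls in the affected range 0.5.5 <= v < 0.11.1."""
    return FIRST_AFFECTED <= parse_release(version) < FIRST_FIXED


if __name__ == "__main__":
    for candidate in ("0.5.4", "0.5.5", "0.10.2", "0.11.0", "0.11.1"):
        print(candidate, is_affected(candidate))
```

On a host where vLLM is installed, the installed version string can be read with the standard library call `importlib.metadata.version("vllm")` and passed to `is_affected`.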
Exploitation Scenario
An adversary obtains low-privilege API credentials, for example through a free trial account, a compromised developer key, or a misconfigured multi-tenant setup. They craft a POST request to /v1/chat/completions with a malicious chat_template_kwargs payload designed to trigger excessive resource consumption in vLLM's template processing path (chat_utils.py L1602-1610). Because the parameter is used before it is validated, the API server can be blocked for a long period while it processes the malicious request. The adversary repeats this with a handful of concurrent requests to keep the server occupied. Legitimate inference requests then queue behind the malicious ones, which amounts to a full service outage for as long as the attack continues. The adversary may use this to cause SLA violations, disrupt a competing AI product, or create a smokescreen for other API abuse occurring simultaneously.
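Because the scenario relies on one client holding several requests open at once, a per-client cap on in-flight requests at the gateway limits how much of the server a single account can occupy. The sketch below shows the idea with a hypothetical `PerClientConcurrencyLimit` class; it is not a vLLM feature, and it only reduces exposure, since a single crafted request can still stall an unpatched server.

```python
import threading


class PerClientConcurrencyLimit:
    """Hypothetical gateway-side cap on concurrent requests per client."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._lock = threading.Lock()
        self._in_flight: dict[str, int] = {}

    def try_acquire(self, client_id: str) -> bool:
        """Reserve a slot for the client; return False if the cap is reached."""
        with self._lock:
            current = self._in_flight.get(client_id, 0)
            if current >= self._max:
                return False
            self._in_flight[client_id] = current + 1
            return True

    def release(self, client_id: str) -> None:
        """Free a slot once the client's request has completed or failed."""
        with self._lock:
            current = self._in_flight.get(client_id, 0)
            if current <= 1:
                self._in_flight.pop(client_id, None)
            else:
                self._in_flight[client_id] = current - 1


if __name__ == "__main__":
    limiter = PerClientConcurrencyLimit(max_in_flight=2)
    print([limiter.try_acquire("tenant-a") for _ in range(3)])
    limiter.release("tenant-a")
    print(limiter.try_acquire("tenant-a"))
```

A gateway would call `try_acquire` before forwarding a request, reject it (for example with HTTP 429) when it returns False, and call `release` in a finally block so that slots are returned even when the upstream call fails.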
Weaknesses (CWE)
CWE-770 Allocation of Resources Without Limits or Throttling (Primary)
The product allocates a reusable resource or group of resources on behalf of an actor without imposing any intended restrictions on the size or number of resources that can be allocated.
- [Requirements] Clearly specify the minimum and maximum expectations for capabilities, and dictate which behaviors are acceptable when resource allocation reaches limits.
- [Architecture and Design] Limit the amount of resources that are accessible to unprivileged users. Set per-user limits for resources. Allow the system administrator to define these limits. Be careful to avoid CWE-410.
Source: MITRE CWE corpus.
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
References
- github.com/advisories/GHSA-69j4-grxj-j64p
- nvd.nist.gov/vuln/detail/CVE-2025-62426
- github.com/vllm-project/vllm/blob/2a6dc67eb520ddb9c4138d8b35ed6fe6226997fb/vllm/entrypoints/chat_utils.py (Product)
- github.com/vllm-project/vllm/blob/2a6dc67eb520ddb9c4138d8b35ed6fe6226997fb/vllm/entrypoints/openai/serving_engine.py (Product)
- github.com/vllm-project/vllm/commit/3ada34f9cb4d1af763fdfa3b481862a93eb6bd2b (Patch)
- github.com/vllm-project/vllm/pull/27205 (Issue)
- github.com/vllm-project/vllm/security/advisories/GHSA-69j4-grxj-j64p (Vendor)
Timeline
Related Vulnerabilities
- CVE-2024-9053 (9.8) vllm: RCE via unsafe pickle deserialization in RPC server. Same package: vllm
- CVE-2024-11041 (9.8) vllm: RCE via unsafe pickle deserialization in MessageQueue. Same package: vllm
- CVE-2026-25960 (9.8) vllm: SSRF allows internal network access. Same package: vllm
- CVE-2025-47277 (9.8) vLLM: RCE via exposed TCPStore in distributed inference. Same package: vllm
- CVE-2025-32444 (9.8) vLLM: RCE via pickle deserialization on ZeroMQ. Same package: vllm