vllm's OpenAI-compatible server allows authenticated users to inject malicious Jinja templates via chat_template or chat_template_kwargs, exhausting CPU/memory and taking down LLM inference endpoints. The mitigation is non-trivial: blocking chat_template alone is insufficient because chat_template_kwargs can bypass controls via a dict.update overwrite. Upgrade to vllm >= 0.11.0 immediately; if not possible, restrict API access to fully-trusted clients and block both parameters at the gateway.
What is the risk?
Medium CVSS (6.5) understates operational risk for organizations running vllm as a shared inference service. Exploitability is high — any authenticated API user can trigger it with a single malformed request requiring no special AI/ML knowledge. The non-obvious bypass via chat_template_kwargs means operators who implement partial mitigations remain fully exposed. In multi-tenant, developer-facing, or internally-shared deployments, this is a realistic availability risk with immediate blast radius across all downstream AI-dependent workloads.
What systems are affected?
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| vLLM | pip | >= 0.5.1, < 0.11.0 | 0.11.0 |
Do you use vLLM? You're affected.
How severe is it?
What is the attack surface?
What should I do?
5 steps-
PATCH
Upgrade to vllm >= 0.11.0 (fix in PR #25794, commit 7977e50).
-
CRITICAL WORKAROUND
Block BOTH chat_template AND chat_template_kwargs at the API gateway — blocking only chat_template is insufficient due to the dict.update bypass path.
-
ACCESS CONTROL
Restrict vllm API endpoints to fully-trusted internal clients; never expose directly to end users or the internet without an authenticated proxy.
-
RESOURCE LIMITS
Implement request timeouts and per-request CPU/memory quotas on the inference server to contain blast radius.
-
DETECTION
Alert on requests containing chat_template or chat_template_kwargs fields in API request logs; monitor for sudden CPU/memory spikes on inference nodes correlating with individual API requests.
How is it classified?
Which compliance frameworks are affected?
This CVE is relevant to:
Frequently Asked Questions
What is CVE-2025-61620?
vllm's OpenAI-compatible server allows authenticated users to inject malicious Jinja templates via chat_template or chat_template_kwargs, exhausting CPU/memory and taking down LLM inference endpoints. The mitigation is non-trivial: blocking chat_template alone is insufficient because chat_template_kwargs can bypass controls via a dict.update overwrite. Upgrade to vllm >= 0.11.0 immediately; if not possible, restrict API access to fully-trusted clients and block both parameters at the gateway.
Is CVE-2025-61620 actively exploited?
No confirmed active exploitation of CVE-2025-61620 has been reported, but organizations should still patch proactively.
How to fix CVE-2025-61620?
1. PATCH: Upgrade to vllm >= 0.11.0 (fix in PR #25794, commit 7977e50). 2. CRITICAL WORKAROUND: Block BOTH chat_template AND chat_template_kwargs at the API gateway — blocking only chat_template is insufficient due to the dict.update bypass path. 3. ACCESS CONTROL: Restrict vllm API endpoints to fully-trusted internal clients; never expose directly to end users or the internet without an authenticated proxy. 4. RESOURCE LIMITS: Implement request timeouts and per-request CPU/memory quotas on the inference server to contain blast radius. 5. DETECTION: Alert on requests containing chat_template or chat_template_kwargs fields in API request logs; monitor for sudden CPU/memory spikes on inference nodes correlating with individual API requests.
What systems are affected by CVE-2025-61620?
This vulnerability affects the following AI/ML architecture patterns: LLM inference APIs, model serving, RAG pipelines, agent frameworks, multi-tenant AI platforms.
What is the CVSS score for CVE-2025-61620?
CVE-2025-61620 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.21%.
What is the AI security impact?
Affected AI Architectures
MITRE ATLAS Techniques
AML.T0029 Denial of AI Service AML.T0034 Cost Harvesting AML.T0040 AI Model Inference API Access AML.T0049 Exploit Public-Facing Application AML.T0050 Command and Scripting Interpreter Compliance Controls Affected
What are the technical details?
Original Advisory
### Summary A resource-exhaustion (denial-of-service) vulnerability exists in multiple endpoints of the OpenAI-Compatible Server due to the ability to specify Jinja templates via the `chat_template` and `chat_template_kwargs` parameters. If an attacker can supply these parameters to the API, they can cause a service outage by exhausting CPU and/or memory resources. ### Details When using an LLM as a chat model, the conversation history must be rendered into a text input for the model. In `hf/transformer`, this rendering is performed using a Jinja template. The OpenAI-Compatible Server launched by vllm serve exposes a `chat_template` parameter that lets users specify that template. In addition, the server accepts a `chat_template_kwargs` parameter to pass extra keyword arguments to the rendering function. Because Jinja templates support programming-language-like constructs (loops, nested iterations, etc.), a crafted template can consume extremely large amounts of CPU and memory and thereby trigger a denial-of-service condition. Importantly, simply forbidding the `chat_template` parameter does not fully mitigate the issue. The implementation constructs a dictionary of keyword arguments for `apply_hf_chat_template` and then updates that dictionary with the user-supplied `chat_template_kwargs` via `dict.update`. Since `dict.update` can overwrite existing keys, an attacker can place a `chat_template` key inside `chat_template_kwargs` to replace the template that will be used by `apply_hf_chat_template`. ```python # vllm/entrypoints/openai/serving_engine.py#L794-L816 _chat_template_kwargs: dict[str, Any] = dict( chat_template=chat_template, add_generation_prompt=add_generation_prompt, continue_final_message=continue_final_message, tools=tool_dicts, documents=documents, ) _chat_template_kwargs.update(chat_template_kwargs or {}) request_prompt: Union[str, list[int]] if isinstance(tokenizer, MistralTokenizer): ... else: request_prompt = apply_hf_chat_template( tokenizer=tokenizer, conversation=conversation, model_config=model_config, **_chat_template_kwargs, ) ``` ### Impact If an OpenAI-Compatible Server exposes endpoints that accept `chat_template` or `chat_template_kwargs` from untrusted clients, an attacker can submit a malicious Jinja template (directly or by overriding `chat_template` inside `chat_template_kwargs`) that consumes excessive CPU and/or memory. This can result in a resource-exhaustion denial-of-service that renders the server unresponsive to legitimate requests. ### Fixes * https://github.com/vllm-project/vllm/pull/25794
Exploitation Scenario
An attacker with API credentials — an internal developer, a compromised service account, or a malicious user on a shared platform — sends a POST to /v1/chat/completions with a chat_template_kwargs body containing a chat_template key embedding a malicious Jinja template (e.g., nested loops iterating over exponentially large ranges). Because dict.update overwrites the server-side chat_template value, vllm processes the attacker-controlled template, consuming all available CPU and memory. The inference server becomes unresponsive to all legitimate traffic. In a shared GPU cluster, this disrupts every team dependent on the endpoint until the process is manually killed and restarted, with no data exfiltration required to achieve full service disruption.
Weaknesses (CWE)
CWE-20 Improper Input Validation
Primary
CWE-400 Uncontrolled Resource Consumption
Primary
CWE-770 Allocation of Resources Without Limits or Throttling
Primary
CWE-20 — Improper Input Validation: The product receives input or data, but it does not validate or incorrectly validates that the input has the properties that are required to process the data safely and correctly.
- [Architecture and Design] Consider using language-theoretic security (LangSec) techniques that characterize inputs using a formal language and build "recognizers" for that language. This effectively requires parsing to be a distinct layer that effectively enforces a boundary between raw input and internal data representations, instead of allowing parser code to be scattered throughout the program, where it could be subject to errors or inconsistencies that create weaknesses. [REF-1109] [REF-1110] [REF-1111]
- [Architecture and Design] Use an input validation framework such as Struts or the OWASP ESAPI Validation API. Note that using a framework does not automatically address all input validation problems; be mindful of weaknesses that could arise from misusing the framework itself (CWE-1173).
Source: MITRE CWE corpus.
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H References
Timeline
Related Vulnerabilities
CVE-2024-9053 9.8 vllm: RCE via unsafe pickle deserialization in RPC server
Same package: vllm CVE-2024-11041 9.8 vllm: RCE via unsafe pickle deserialization in MessageQueue
Same package: vllm CVE-2026-25960 9.8 vllm: SSRF allows internal network access
Same package: vllm CVE-2025-47277 9.8 vLLM: RCE via exposed TCPStore in distributed inference
Same package: vllm CVE-2025-32444 9.8 vLLM: RCE via pickle deserialization on ZeroMQ
Same package: vllm