vllm's OpenAI-compatible server allows authenticated users to inject malicious Jinja templates via chat_template or chat_template_kwargs, exhausting CPU/memory and taking down LLM inference endpoints. The mitigation is non-trivial: blocking chat_template alone is insufficient because chat_template_kwargs can bypass controls via a dict.update overwrite. Upgrade to vllm >= 0.11.0 immediately; if not possible, restrict API access to fully-trusted clients and block both parameters at the gateway.
Risk Assessment
Medium CVSS (6.5) understates operational risk for organizations running vllm as a shared inference service. Exploitability is high — any authenticated API user can trigger it with a single malformed request requiring no special AI/ML knowledge. The non-obvious bypass via chat_template_kwargs means operators who implement partial mitigations remain fully exposed. In multi-tenant, developer-facing, or internally-shared deployments, this is a realistic availability risk with immediate blast radius across all downstream AI-dependent workloads.
Affected Systems
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| vllm | pip | >= 0.5.1, < 0.11.0 | 0.11.0 |
Do you use vllm? If the installed version falls within the vulnerable range above, you're affected.
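One quick way to check is a minimal sketch like the following; it assumes the `packaging` library is available (it ships alongside pip):

```python
# Check the installed vllm against the vulnerable range (>= 0.5.1, < 0.11.0).
# Assumes the `packaging` library is available (it ships alongside pip).
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

try:
    installed = Version(version("vllm"))
except PackageNotFoundError:
    print("vllm is not installed")
else:
    if Version("0.5.1") <= installed < Version("0.11.0"):
        print(f"vllm {installed} is vulnerable to CVE-2025-61620; upgrade to >= 0.11.0")
    else:
        print(f"vllm {installed} is outside the vulnerable range")
```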
Recommended Action
Five steps:

1. PATCH: Upgrade to vllm >= 0.11.0 (fix in PR #25794, commit 7977e50).
2. CRITICAL WORKAROUND: Block BOTH `chat_template` AND `chat_template_kwargs` at the API gateway; blocking only `chat_template` is insufficient due to the dict.update bypass path. A minimal filter sketch follows this list.
3. ACCESS CONTROL: Restrict vllm API endpoints to fully-trusted internal clients; never expose them directly to end users or the internet without an authenticated proxy.
4. RESOURCE LIMITS: Implement request timeouts and per-request CPU/memory quotas on the inference server to contain the blast radius.
5. DETECTION: Alert on requests containing `chat_template` or `chat_template_kwargs` fields in API request logs; monitor for sudden CPU/memory spikes on inference nodes that correlate with individual API requests.
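To make steps 2 and 5 concrete, here is a minimal sketch of a gateway-side filter. This is not vllm code: the FastAPI framework, route path, and log format are assumptions to adapt to whatever gateway you actually run.

```python
# Sketch of a gateway-side filter for steps 2 and 5. NOT vllm code;
# FastAPI, the route path, and the log format are assumptions.
import logging

from fastapi import FastAPI, HTTPException, Request

log = logging.getLogger("vllm-gateway")
app = FastAPI()

# Block BOTH parameters: chat_template_kwargs alone can smuggle in a
# chat_template key via the dict.update overwrite (see Technical Details).
FORBIDDEN_FIELDS = {"chat_template", "chat_template_kwargs"}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    blocked = FORBIDDEN_FIELDS & body.keys() if isinstance(body, dict) else set()
    if blocked:
        # Detection hook (step 5): alert on these log lines.
        client = request.client.host if request.client else "unknown"
        log.warning("blocked template override %s from %s", sorted(blocked), client)
        raise HTTPException(status_code=400,
                            detail="template overrides are not permitted")
    # Forward the validated body to the internal vllm server (e.g. with
    # httpx and a hard timeout, per step 4) and return its response;
    # omitted here to keep the sketch short.
    raise HTTPException(status_code=502, detail="forwarding not implemented")
```

Rejecting these requests outright, rather than silently stripping the fields, keeps the failure loud and gives you an auditable log trail for alerting.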
Frequently Asked Questions
What is CVE-2025-61620?
CVE-2025-61620 is a resource-exhaustion (denial-of-service) vulnerability in vllm's OpenAI-compatible server. Authenticated users can inject malicious Jinja templates via the chat_template or chat_template_kwargs parameters, exhausting CPU/memory and taking down LLM inference endpoints. Blocking chat_template alone is insufficient because chat_template_kwargs can overwrite it via dict.update. Upgrade to vllm >= 0.11.0; if that's not possible, restrict API access to fully-trusted clients and block both parameters at the gateway.
Is CVE-2025-61620 actively exploited?
No confirmed active exploitation of CVE-2025-61620 has been reported, but organizations should still patch proactively.
How to fix CVE-2025-61620?
1. PATCH: Upgrade to vllm >= 0.11.0 (fix in PR #25794, commit 7977e50).
2. CRITICAL WORKAROUND: Block BOTH chat_template AND chat_template_kwargs at the API gateway; blocking only chat_template is insufficient due to the dict.update bypass path.
3. ACCESS CONTROL: Restrict vllm API endpoints to fully-trusted internal clients; never expose them directly to end users or the internet without an authenticated proxy.
4. RESOURCE LIMITS: Implement request timeouts and per-request CPU/memory quotas on the inference server to contain the blast radius.
5. DETECTION: Alert on requests containing chat_template or chat_template_kwargs fields in API request logs; monitor for sudden CPU/memory spikes on inference nodes correlating with individual API requests.
What systems are affected by CVE-2025-61620?
This vulnerability affects the following AI/ML architecture patterns: LLM inference APIs, model serving, RAG pipelines, agent frameworks, multi-tenant AI platforms.
What is the CVSS score for CVE-2025-61620?
CVE-2025-61620 has a CVSS v3.1 base score of 6.5 (MEDIUM).
Technical Details
NVD Description
### Summary

A resource-exhaustion (denial-of-service) vulnerability exists in multiple endpoints of the OpenAI-Compatible Server due to the ability to specify Jinja templates via the `chat_template` and `chat_template_kwargs` parameters. If an attacker can supply these parameters to the API, they can cause a service outage by exhausting CPU and/or memory resources.

### Details

When using an LLM as a chat model, the conversation history must be rendered into a text input for the model. In Hugging Face `transformers`, this rendering is performed using a Jinja template. The OpenAI-Compatible Server launched by `vllm serve` exposes a `chat_template` parameter that lets users specify that template. In addition, the server accepts a `chat_template_kwargs` parameter to pass extra keyword arguments to the rendering function.

Because Jinja templates support programming-language-like constructs (loops, nested iterations, etc.), a crafted template can consume extremely large amounts of CPU and memory and thereby trigger a denial-of-service condition.

Importantly, simply forbidding the `chat_template` parameter does not fully mitigate the issue. The implementation constructs a dictionary of keyword arguments for `apply_hf_chat_template` and then updates that dictionary with the user-supplied `chat_template_kwargs` via `dict.update`. Since `dict.update` can overwrite existing keys, an attacker can place a `chat_template` key inside `chat_template_kwargs` to replace the template that will be used by `apply_hf_chat_template`.

```python
# vllm/entrypoints/openai/serving_engine.py#L794-L816
_chat_template_kwargs: dict[str, Any] = dict(
    chat_template=chat_template,
    add_generation_prompt=add_generation_prompt,
    continue_final_message=continue_final_message,
    tools=tool_dicts,
    documents=documents,
)
_chat_template_kwargs.update(chat_template_kwargs or {})

request_prompt: Union[str, list[int]]
if isinstance(tokenizer, MistralTokenizer):
    ...
else:
    request_prompt = apply_hf_chat_template(
        tokenizer=tokenizer,
        conversation=conversation,
        model_config=model_config,
        **_chat_template_kwargs,
    )
```

### Impact

If an OpenAI-Compatible Server exposes endpoints that accept `chat_template` or `chat_template_kwargs` from untrusted clients, an attacker can submit a malicious Jinja template (directly or by overriding `chat_template` inside `chat_template_kwargs`) that consumes excessive CPU and/or memory. This can result in a resource-exhaustion denial-of-service that renders the server unresponsive to legitimate requests.

### Fixes

* https://github.com/vllm-project/vllm/pull/25794
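The dict.update overwrite is easy to reproduce in isolation. The following self-contained snippet (all template strings are illustrative) shows why filtering chat_template at the top level of the request body is not enough:

```python
# Minimal reproduction of the dict.update overwrite (illustrative values).
# The server builds its kwargs with the trusted template first...
server_template = "{{ messages[0]['content'] }}"  # trusted default
_chat_template_kwargs = dict(chat_template=server_template)

# ...then merges the user-supplied chat_template_kwargs. dict.update
# overwrites existing keys, so a nested chat_template silently wins:
user_supplied = {"chat_template": "{% for i in range(10**9) %}x{% endfor %}"}
_chat_template_kwargs.update(user_supplied)

assert _chat_template_kwargs["chat_template"] == user_supplied["chat_template"]
print(_chat_template_kwargs["chat_template"])  # attacker-controlled template
```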
Exploitation Scenario
An attacker with API credentials — an internal developer, a compromised service account, or a malicious user on a shared platform — sends a POST to /v1/chat/completions with a chat_template_kwargs body containing a chat_template key embedding a malicious Jinja template (e.g., nested loops iterating over exponentially large ranges). Because dict.update overwrites the server-side chat_template value, vllm processes the attacker-controlled template, consuming all available CPU and memory. The inference server becomes unresponsive to all legitimate traffic. In a shared GPU cluster, this disrupts every team dependent on the endpoint until the process is manually killed and restarted, with no data exfiltration required to achieve full service disruption.
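For defenders, the request in this scenario looks roughly like the sketch below. Every concrete value (URL, model name, token, template) is a placeholder, and a real attack would use a far more expensive template than this tame loop; the point is the shape to watch for in request logs.

```python
# Rough shape of the malicious request; all values are placeholders.
import requests

payload = {
    "model": "some-model",
    "messages": [{"role": "user", "content": "hi"}],
    "chat_template_kwargs": {
        # Overwrites the server-side chat_template via dict.update.
        "chat_template": "{% for i in range(1000) %}{{ i }}{% endfor %}",
    },
}
requests.post(
    "http://vllm.internal:8000/v1/chat/completions",
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # any valid credentials
    timeout=30,
)
```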
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
Related Vulnerabilities
All in the same package (vllm):

| CVE | CVSS | Description |
|---|---|---|
| CVE-2024-9053 | 9.8 | vllm: RCE via unsafe pickle deserialization in RPC server |
| CVE-2024-11041 | 9.8 | vllm: RCE via unsafe pickle deserialization in MessageQueue |
| CVE-2026-25960 | 9.8 | vllm: SSRF allows internal network access |
| CVE-2025-47277 | 9.8 | vLLM: RCE via exposed TCPStore in distributed inference |
| CVE-2025-32444 | 9.8 | vLLM: RCE via pickle deserialization on ZeroMQ |