vllm's OpenAI-compatible server allows authenticated users to inject malicious Jinja templates via chat_template or chat_template_kwargs, exhausting CPU/memory and taking down LLM inference endpoints. The mitigation is non-trivial: blocking chat_template alone is insufficient because chat_template_kwargs can bypass controls via a dict.update overwrite. Upgrade to vllm >= 0.11.0 immediately; if not possible, restrict API access to fully-trusted clients and block both parameters at the gateway.
Risk Assessment
Medium CVSS (6.5) understates operational risk for organizations running vllm as a shared inference service. Exploitability is high — any authenticated API user can trigger it with a single malformed request requiring no special AI/ML knowledge. The non-obvious bypass via chat_template_kwargs means operators who implement partial mitigations remain fully exposed. In multi-tenant, developer-facing, or internally-shared deployments, this is a realistic availability risk with immediate blast radius across all downstream AI-dependent workloads.
Affected Systems
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| vllm | pip | >= 0.5.1, < 0.11.0 | 0.11.0 |
Do you use vllm? If the installed version falls within the vulnerable range above, you're affected.
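One quick way to check is a minimal sketch like the following; it assumes the `packaging` library is available (it ships alongside pip):

```python
# Check the installed vllm against the vulnerable range (>= 0.5.1, < 0.11.0).
# Assumes the `packaging` library is available (it ships alongside pip).
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

try:
    installed = Version(version("vllm"))
except PackageNotFoundError:
    print("vllm is not installed")
else:
    if Version("0.5.1") <= installed < Version("0.11.0"):
        print(f"vllm {installed} is vulnerable to CVE-2025-61620; upgrade to >= 0.11.0")
    else:
        print(f"vllm {installed} is outside the vulnerable range")
```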
Recommended Action
Five steps:

1. PATCH: Upgrade to vllm >= 0.11.0 (fix in PR #25794, commit 7977e50).
2. CRITICAL WORKAROUND: Block BOTH `chat_template` AND `chat_template_kwargs` at the API gateway; blocking only `chat_template` is insufficient due to the dict.update bypass path. A minimal filter sketch follows this list.
3. ACCESS CONTROL: Restrict vllm API endpoints to fully-trusted internal clients; never expose them directly to end users or the internet without an authenticated proxy.
4. RESOURCE LIMITS: Implement request timeouts and per-request CPU/memory quotas on the inference server to contain the blast radius.
5. DETECTION: Alert on requests containing `chat_template` or `chat_template_kwargs` fields in API request logs; monitor for sudden CPU/memory spikes on inference nodes that correlate with individual API requests.
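To make steps 2 and 5 concrete, here is a minimal sketch of a gateway-side filter. This is not vllm code: the FastAPI framework, route path, and log format are assumptions to adapt to whatever gateway you actually run.

```python
# Sketch of a gateway-side filter for steps 2 and 5. NOT vllm code;
# FastAPI, the route path, and the log format are assumptions.
import logging

from fastapi import FastAPI, HTTPException, Request

log = logging.getLogger("vllm-gateway")
app = FastAPI()

# Block BOTH parameters: chat_template_kwargs alone can smuggle in a
# chat_template key via the dict.update overwrite (see Technical Details).
FORBIDDEN_FIELDS = {"chat_template", "chat_template_kwargs"}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    blocked = FORBIDDEN_FIELDS & body.keys() if isinstance(body, dict) else set()
    if blocked:
        # Detection hook (step 5): alert on these log lines.
        client = request.client.host if request.client else "unknown"
        log.warning("blocked template override %s from %s", sorted(blocked), client)
        raise HTTPException(status_code=400,
                            detail="template overrides are not permitted")
    # Forward the validated body to the internal vllm server (e.g. with
    # httpx and a hard timeout, per step 4) and return its response;
    # omitted here to keep the sketch short.
    raise HTTPException(status_code=502, detail="forwarding not implemented")
```

Rejecting these requests outright, rather than silently stripping the fields, keeps the failure loud and gives you an auditable log trail for alerting.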
Frequently Asked Questions
What is CVE-2025-61620?
CVE-2025-61620 is a resource-exhaustion (denial-of-service) vulnerability in vllm's OpenAI-compatible server. Authenticated users can inject malicious Jinja templates via the chat_template or chat_template_kwargs parameters, exhausting CPU/memory and taking down LLM inference endpoints. Blocking chat_template alone is insufficient because chat_template_kwargs can overwrite it via dict.update. Upgrade to vllm >= 0.11.0; if that's not possible, restrict API access to fully-trusted clients and block both parameters at the gateway.
Is CVE-2025-61620 actively exploited?
No confirmed active exploitation of CVE-2025-61620 has been reported, but organizations should still patch proactively.
How to fix CVE-2025-61620?
1. PATCH: Upgrade to vllm >= 0.11.0 (fix in PR #25794, commit 7977e50).
2. CRITICAL WORKAROUND: Block BOTH chat_template AND chat_template_kwargs at the API gateway; blocking only chat_template is insufficient due to the dict.update bypass path.
3. ACCESS CONTROL: Restrict vllm API endpoints to fully-trusted internal clients; never expose them directly to end users or the internet without an authenticated proxy.
4. RESOURCE LIMITS: Implement request timeouts and per-request CPU/memory quotas on the inference server to contain the blast radius.
5. DETECTION: Alert on requests containing chat_template or chat_template_kwargs fields in API request logs; monitor for sudden CPU/memory spikes on inference nodes correlating with individual API requests.
What systems are affected by CVE-2025-61620?
This vulnerability affects the following AI/ML architecture patterns: LLM inference APIs, model serving, RAG pipelines, agent frameworks, multi-tenant AI platforms.
What is the CVSS score for CVE-2025-61620?
CVE-2025-61620 has a CVSS v3.1 base score of 6.5 (MEDIUM).
Technical Details
NVD Description
### Summary

A resource-exhaustion (denial-of-service) vulnerability exists in multiple endpoints of the OpenAI-Compatible Server due to the ability to specify Jinja templates via the `chat_template` and `chat_template_kwargs` parameters. If an attacker can supply these parameters to the API, they can cause a service outage by exhausting CPU and/or memory resources.

### Details

When using an LLM as a chat model, the conversation history must be rendered into a text input for the model. In Hugging Face `transformers`, this rendering is performed using a Jinja template. The OpenAI-Compatible Server launched by `vllm serve` exposes a `chat_template` parameter that lets users specify that template. In addition, the server accepts a `chat_template_kwargs` parameter to pass extra keyword arguments to the rendering function.

Because Jinja templates support programming-language-like constructs (loops, nested iterations, etc.), a crafted template can consume extremely large amounts of CPU and memory and thereby trigger a denial-of-service condition.

Importantly, simply forbidding the `chat_template` parameter does not fully mitigate the issue. The implementation constructs a dictionary of keyword arguments for `apply_hf_chat_template` and then updates that dictionary with the user-supplied `chat_template_kwargs` via `dict.update`. Since `dict.update` can overwrite existing keys, an attacker can place a `chat_template` key inside `chat_template_kwargs` to replace the template that will be used by `apply_hf_chat_template`.

```python
# vllm/entrypoints/openai/serving_engine.py#L794-L816
_chat_template_kwargs: dict[str, Any] = dict(
    chat_template=chat_template,
    add_generation_prompt=add_generation_prompt,
    continue_final_message=continue_final_message,
    tools=tool_dicts,
    documents=documents,
)
_chat_template_kwargs.update(chat_template_kwargs or {})

request_prompt: Union[str, list[int]]
if isinstance(tokenizer, MistralTokenizer):
    ...
else:
    request_prompt = apply_hf_chat_template(
        tokenizer=tokenizer,
        conversation=conversation,
        model_config=model_config,
        **_chat_template_kwargs,
    )
```

### Impact

If an OpenAI-Compatible Server exposes endpoints that accept `chat_template` or `chat_template_kwargs` from untrusted clients, an attacker can submit a malicious Jinja template (directly or by overriding `chat_template` inside `chat_template_kwargs`) that consumes excessive CPU and/or memory. This can result in a resource-exhaustion denial-of-service that renders the server unresponsive to legitimate requests.

### Fixes

* https://github.com/vllm-project/vllm/pull/25794
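The dict.update overwrite is easy to reproduce in isolation. The following self-contained snippet (all template strings are illustrative) shows why filtering chat_template at the top level of the request body is not enough:

```python
# Minimal reproduction of the dict.update overwrite (illustrative values).
# The server builds its kwargs with the trusted template first...
server_template = "{{ messages[0]['content'] }}"  # trusted default
_chat_template_kwargs = dict(chat_template=server_template)

# ...then merges the user-supplied chat_template_kwargs. dict.update
# overwrites existing keys, so a nested chat_template silently wins:
user_supplied = {"chat_template": "{% for i in range(10**9) %}x{% endfor %}"}
_chat_template_kwargs.update(user_supplied)

assert _chat_template_kwargs["chat_template"] == user_supplied["chat_template"]
print(_chat_template_kwargs["chat_template"])  # attacker-controlled template
```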
Exploitation Scenario
An attacker with API credentials — an internal developer, a compromised service account, or a malicious user on a shared platform — sends a POST to /v1/chat/completions with a chat_template_kwargs body containing a chat_template key embedding a malicious Jinja template (e.g., nested loops iterating over exponentially large ranges). Because dict.update overwrites the server-side chat_template value, vllm processes the attacker-controlled template, consuming all available CPU and memory. The inference server becomes unresponsive to all legitimate traffic. In a shared GPU cluster, this disrupts every team dependent on the endpoint until the process is manually killed and restarted, with no data exfiltration required to achieve full service disruption.
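For defenders, the request in this scenario looks roughly like the sketch below. Every concrete value (URL, model name, token, template) is a placeholder, and a real attack would use a far more expensive template than this tame loop; the point is the shape to watch for in request logs.

```python
# Rough shape of the malicious request; all values are placeholders.
import requests

payload = {
    "model": "some-model",
    "messages": [{"role": "user", "content": "hi"}],
    "chat_template_kwargs": {
        # Overwrites the server-side chat_template via dict.update.
        "chat_template": "{% for i in range(1000) %}{{ i }}{% endfor %}",
    },
}
requests.post(
    "http://vllm.internal:8000/v1/chat/completions",
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # any valid credentials
    timeout=30,
)
```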
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
Related Vulnerabilities
All in the same package (vllm):

| CVE | CVSS | Description |
|---|---|---|
| CVE-2024-9053 | 9.8 | vllm: RCE via unsafe pickle deserialization in RPC server |
| CVE-2024-11041 | 9.8 | vllm: RCE via unsafe pickle deserialization in MessageQueue |
| CVE-2026-25960 | 9.8 | vllm: SSRF allows internal network access |
| CVE-2025-47277 | 9.8 | vLLM: RCE via exposed TCPStore in distributed inference |
| CVE-2025-32444 | 9.8 | vLLM: RCE via pickle deserialization on ZeroMQ |