CVE-2025-29770: vLLM: DoS via unbounded grammar cache exhausts disk

GHSA-mgrm-fgjv-mhv8 MEDIUM PoC AVAILABLE
Published March 19, 2025
CISO Take

Any authenticated user of your vLLM inference API can crash it by flooding structured output requests with unique schemas, filling the host filesystem. Upgrade to vLLM 0.8.0 immediately — the fix is available and the attack is trivial to execute. If you cannot patch now, restrict per-request backend selection and apply filesystem quotas to contain the blast radius.

Risk Assessment

Medium CVSS but practically significant for production deployments. The attack requires only low-privilege API access and is technically trivial — a simple loop with randomized JSON schemas suffices. The per-request override of the guided_decoding_backend parameter makes default-configuration mitigations ineffective without patching. Risk is elevated for multi-tenant or externally accessible vLLM deployments. Low EPSS and absence from CISA KEV suggest no active exploitation yet, but the simplicity of the exploit warrants prompt action.

Affected Systems

Package Ecosystem Vulnerable Range Patched
vllm pip < 0.8.0 0.8.0

Severity & Risk

CVSS 3.1
6.5 / 10
EPSS
0.7%
chance of exploitation in 30 days
Higher than 71% of all CVEs
Exploitation Status
Exploit Available
Exploitation: MEDIUM
Sophistication
Trivial
Exploitation Confidence
Medium
Public PoC indexed (trickest/cve)
Composite signal derived from CISA KEV, CISA SSVC, EPSS, trickest/cve, and Nuclei templates.

Attack Surface

Attack Vector (AV): Network
Attack Complexity (AC): Low
Privileges Required (PR): Low
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): None
Integrity (I): None
Availability (A): High

Recommended Action

6 steps

  1. PATCH

    Upgrade vLLM to >= 0.8.0; the root fix is available.

  2. WORKAROUND (if patching is delayed)

    Block the guided_decoding_backend key in extra_body via an API gateway or middleware layer (see the filter sketch after this list); disable the outlines backend entirely if structured output is not required.

  3. RATE-LIMIT

    Apply per-user and per-IP rate limits on the /v1/chat/completions endpoint.

  4. ISOLATE

    Run vLLM in a container with a dedicated filesystem or enforced disk quota to limit the blast radius.

  5. DETECT

    Baseline the normal size of the outlines grammar cache directory (typically ~/.cache/outlines/) and alert on sudden or anomalous growth (see the monitoring sketch after this list).

  6. AUDIT

    Review API access logs for users submitting high volumes of structured output requests with unique schemas.
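
A minimal sketch of the step 2 filter, assuming a gateway or middleware layer that can parse and rewrite JSON request bodies. Clients using the OpenAI SDK's extra_body typically have those keys merged into the top level of the wire payload, so the sketch checks both locations; the function name is illustrative.

    # Hypothetical gateway-side sanitizer for step 2 (WORKAROUND): strip the
    # per-request backend override before the request reaches vLLM.
    import json

    BLOCKED_KEY = "guided_decoding_backend"

    def sanitize_request_body(raw_body: bytes) -> bytes:
        """Remove the outlines backend override wherever it appears."""
        payload = json.loads(raw_body)
        # OpenAI SDK clients merge extra_body keys into the top level of the
        # wire payload; also check a nested extra_body object to be safe.
        payload.pop(BLOCKED_KEY, None)
        extra = payload.get("extra_body")
        if isinstance(extra, dict):
            extra.pop(BLOCKED_KEY, None)
        return json.dumps(payload).encode()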
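
For step 5, a minimal monitoring sketch assuming the default cache location; the 500 MiB threshold is a placeholder that should be tuned to your measured baseline and wired into your alerting pipeline.

    # Monitoring sketch for step 5 (DETECT): flag anomalous growth of the
    # outlines grammar cache. Path and threshold are deployment assumptions.
    import pathlib

    CACHE_DIR = pathlib.Path.home() / ".cache" / "outlines"  # typical default
    THRESHOLD_BYTES = 500 * 1024 * 1024  # placeholder ceiling; tune to baseline

    def cache_size_bytes(root: pathlib.Path) -> int:
        """Total size of all files under the grammar cache directory."""
        return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())

    if CACHE_DIR.exists() and cache_size_bytes(CACHE_DIR) > THRESHOLD_BYTES:
        print(f"ALERT: {CACHE_DIR} exceeds baseline; possible CVE-2025-29770 abuse")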

CISA SSVC Assessment

Decision Track
Exploitation none
Automatable No
Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Classification

Compliance Impact

This CVE is relevant to:

EU AI Act
Article 15 - Accuracy, robustness and cybersecurity
ISO 42001
A.9.3 - AI system operation and monitoring
NIST AI RMF
MANAGE-2.2 - Mechanisms to sustain value of deployed AI systems
OWASP LLM Top 10
LLM04 - Model Denial of Service

Frequently Asked Questions

What is CVE-2025-29770?

CVE-2025-29770 is a denial-of-service vulnerability in vLLM's OpenAI-compatible inference API. The outlines guided-decoding backend caches each compiled grammar on the local filesystem with no size bound, so any authenticated user can fill the host disk by flooding the API with structured output requests that each carry a unique schema. The issue affects the V0 engine in versions before 0.8.0; upgrade to 0.8.0, or, if patching is delayed, restrict per-request backend selection and apply filesystem quotas to contain the blast radius.

Is CVE-2025-29770 actively exploited?

Proof-of-concept exploit code is publicly available for CVE-2025-29770, increasing the risk of exploitation.

How to fix CVE-2025-29770?

1. PATCH: Upgrade vLLM to >= 0.8.0; the root fix is available.
2. WORKAROUND (if patching is delayed): Block the guided_decoding_backend key in extra_body via an API gateway or middleware layer; disable the outlines backend entirely if structured output is not required.
3. RATE-LIMIT: Apply per-user and per-IP rate limits on the /v1/chat/completions endpoint.
4. ISOLATE: Run vLLM in a container with a dedicated filesystem or enforced disk quota to limit the blast radius.
5. DETECT: Baseline the normal size of the outlines grammar cache directory (typically ~/.cache/outlines/) and alert on sudden or anomalous growth.
6. AUDIT: Review API access logs for users submitting high volumes of structured output requests with unique schemas.

What systems are affected by CVE-2025-29770?

This vulnerability affects vLLM versions prior to 0.8.0 (the V0 engine). In deployment terms, it is relevant to the following AI/ML architecture patterns: model serving, LLM inference APIs, agent frameworks, and RAG pipelines.

What is the CVSS score for CVE-2025-29770?

CVE-2025-29770 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.66%.

Technical Details

NVD Description

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. The outlines library is one of the backends used by vLLM to support structured output (a.k.a. guided decoding). Outlines provides an optional cache for its compiled grammars on the local filesystem. This cache has been on by default in vLLM. Outlines is also available by default through the OpenAI compatible API server. The affected code in vLLM is vllm/model_executor/guided_decoding/outlines_logits_processors.py, which unconditionally uses the cache from outlines. A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service if the filesystem runs out of space. Note that even if vLLM was configured to use a different backend by default, it is still possible to choose outlines on a per-request basis using the guided_decoding_backend key of the extra_body field of the request. This issue applies only to the V0 engine and is fixed in 0.8.0.

Exploitation Scenario

An attacker with any valid API credential — including a free trial account — scripts a loop sending POST requests to the vLLM /v1/chat/completions endpoint. Each request includes a unique JSON schema in response_format and sets guided_decoding_backend=outlines in extra_body. vLLM compiles each schema and writes it to the local filesystem cache. After thousands of requests (automatable in minutes), the host disk fills up, the vLLM process crashes, and all inference is unavailable to legitimate users. No AI/ML expertise is required; the attack vector is identical to classic resource exhaustion attacks adapted for LLM inference infrastructure.
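
To make the request shape concrete, here is a single-request sketch for lab reproduction against a test deployment. The base URL, credential, and model name are placeholders, and the guided_json / guided_decoding_backend parameter names follow vLLM's guided-decoding extras for pre-0.8.0 versions, so verify them against your deployed version; the scenario above simply repeats this request with a fresh schema each time.

    # Lab-only sketch of one request from the scenario above. The base URL,
    # API key, and model name are hypothetical placeholders.
    import uuid
    import requests

    URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible API
    HEADERS = {"Authorization": "Bearer TEST-KEY"}     # placeholder credential

    def unique_schema() -> dict:
        """Each schema differs by one property name, forcing a fresh grammar
        compile and a new on-disk cache entry."""
        return {
            "type": "object",
            "properties": {f"field_{uuid.uuid4().hex}": {"type": "string"}},
        }

    body = {
        "model": "test-model",                    # hypothetical model name
        "messages": [{"role": "user", "content": "hi"}],
        "max_tokens": 1,                          # keep each request cheap
        "guided_json": unique_schema(),           # structured output schema
        "guided_decoding_backend": "outlines",    # per-request override
    }
    requests.post(URL, headers=HEADERS, json=body, timeout=30)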

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Timeline

Published
March 19, 2025
Last Modified
July 31, 2025
First Seen
March 19, 2025

Related Vulnerabilities