CVE-2025-29770: vLLM: DoS via unbounded grammar cache exhausts disk

GHSA-mgrm-fgjv-mhv8 MEDIUM PoC AVAILABLE
Published March 19, 2025
CISO Take

Any authenticated user of your vLLM inference API can crash it by flooding structured output requests with unique schemas, filling the host filesystem. Upgrade to vLLM 0.8.0 immediately — the fix is available and the attack is trivial to execute. If you cannot patch now, restrict per-request backend selection and apply filesystem quotas to contain the blast radius.

Risk Assessment

Medium CVSS but practically significant for production deployments. The attack requires only low-privilege API access and is technically trivial — a simple loop with randomized JSON schemas suffices. The per-request override of the guided_decoding_backend parameter makes default-configuration mitigations ineffective without patching. Risk is elevated for multi-tenant or externally accessible vLLM deployments. Low EPSS and absence from CISA KEV suggest no active exploitation yet, but the simplicity of the exploit warrants prompt action.

Affected Systems

Package Ecosystem Vulnerable Range Patched
vllm pip < 0.8.0 0.8.0

Severity & Risk

CVSS 3.1
6.5 / 10
EPSS
0.7%
chance of exploitation in 30 days
Higher than 71% of all CVEs
Exploitation Status
Exploit Available
Exploitation: MEDIUM
Sophistication
Trivial
Exploitation Confidence
Medium
Public PoC indexed (trickest/cve)
Composite signal derived from CISA KEV, CISA SSVC, EPSS, trickest/cve, and Nuclei templates.

Attack Surface

Attack Vector (AV): Network
Attack Complexity (AC): Low
Privileges Required (PR): Low
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): None
Integrity (I): None
Availability (A): High

Recommended Action

6 steps

  1. PATCH

    Upgrade vLLM to >= 0.8.0; the root fix is available.

  2. WORKAROUND (if patching is delayed)

    Block the guided_decoding_backend key in extra_body via an API gateway or middleware layer (see the filter sketch after this list); disable the outlines backend entirely if structured output is not required.

  3. RATE-LIMIT

    Apply per-user and per-IP rate limits on the /v1/chat/completions endpoint.

  4. ISOLATE

    Run vLLM in a container with a dedicated filesystem or enforced disk quota to limit the blast radius.

  5. DETECT

    Baseline the normal size of the outlines grammar cache directory (typically ~/.cache/outlines/) and alert on sudden or anomalous growth (see the monitoring sketch after this list).

  6. AUDIT

    Review API access logs for users submitting high volumes of structured output requests with unique schemas.
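
A minimal sketch of the step 2 filter, assuming a gateway or middleware layer that can parse and rewrite JSON request bodies. Clients using the OpenAI SDK's extra_body typically have those keys merged into the top level of the wire payload, so the sketch checks both locations; the function name is illustrative.

    # Hypothetical gateway-side sanitizer for step 2 (WORKAROUND): strip the
    # per-request backend override before the request reaches vLLM.
    import json

    BLOCKED_KEY = "guided_decoding_backend"

    def sanitize_request_body(raw_body: bytes) -> bytes:
        """Remove the outlines backend override wherever it appears."""
        payload = json.loads(raw_body)
        # OpenAI SDK clients merge extra_body keys into the top level of the
        # wire payload; also check a nested extra_body object to be safe.
        payload.pop(BLOCKED_KEY, None)
        extra = payload.get("extra_body")
        if isinstance(extra, dict):
            extra.pop(BLOCKED_KEY, None)
        return json.dumps(payload).encode()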
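
For step 5, a minimal monitoring sketch assuming the default cache location; the 500 MiB threshold is a placeholder that should be tuned to your measured baseline and wired into your alerting pipeline.

    # Monitoring sketch for step 5 (DETECT): flag anomalous growth of the
    # outlines grammar cache. Path and threshold are deployment assumptions.
    import pathlib

    CACHE_DIR = pathlib.Path.home() / ".cache" / "outlines"  # typical default
    THRESHOLD_BYTES = 500 * 1024 * 1024  # placeholder ceiling; tune to baseline

    def cache_size_bytes(root: pathlib.Path) -> int:
        """Total size of all files under the grammar cache directory."""
        return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())

    if CACHE_DIR.exists() and cache_size_bytes(CACHE_DIR) > THRESHOLD_BYTES:
        print(f"ALERT: {CACHE_DIR} exceeds baseline; possible CVE-2025-29770 abuse")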

CISA SSVC Assessment

Decision Track
Exploitation none
Automatable No
Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Classification

Compliance Impact

This CVE is relevant to:

EU AI Act
Article 15 - Accuracy, robustness and cybersecurity
ISO 42001
A.9.3 - AI system operation and monitoring
NIST AI RMF
MANAGE-2.2 - Mechanisms to sustain value of deployed AI systems
OWASP LLM Top 10
LLM04 - Model Denial of Service

Frequently Asked Questions

What is CVE-2025-29770?

CVE-2025-29770 is a denial-of-service vulnerability in vLLM's OpenAI-compatible inference API. The outlines guided-decoding backend caches each compiled grammar on the local filesystem with no size bound, so any authenticated user can fill the host disk by flooding the API with structured output requests that each carry a unique schema. The issue affects the V0 engine in versions before 0.8.0; upgrade to 0.8.0, or, if patching is delayed, restrict per-request backend selection and apply filesystem quotas to contain the blast radius.

Is CVE-2025-29770 actively exploited?

Proof-of-concept exploit code is publicly available for CVE-2025-29770, increasing the risk of exploitation.

How to fix CVE-2025-29770?

1. PATCH: Upgrade vLLM to >= 0.8.0; the root fix is available.
2. WORKAROUND (if patching is delayed): Block the guided_decoding_backend key in extra_body via an API gateway or middleware layer; disable the outlines backend entirely if structured output is not required.
3. RATE-LIMIT: Apply per-user and per-IP rate limits on the /v1/chat/completions endpoint.
4. ISOLATE: Run vLLM in a container with a dedicated filesystem or enforced disk quota to limit the blast radius.
5. DETECT: Baseline the normal size of the outlines grammar cache directory (typically ~/.cache/outlines/) and alert on sudden or anomalous growth.
6. AUDIT: Review API access logs for users submitting high volumes of structured output requests with unique schemas.

What systems are affected by CVE-2025-29770?

This vulnerability affects vLLM versions prior to 0.8.0 (the V0 engine). In deployment terms, it is relevant to the following AI/ML architecture patterns: model serving, LLM inference APIs, agent frameworks, and RAG pipelines.

What is the CVSS score for CVE-2025-29770?

CVE-2025-29770 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.66%.

Technical Details

NVD Description

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. The outlines library is one of the backends used by vLLM to support structured output (a.k.a. guided decoding). Outlines provides an optional cache for its compiled grammars on the local filesystem. This cache has been on by default in vLLM. Outlines is also available by default through the OpenAI compatible API server. The affected code in vLLM is vllm/model_executor/guided_decoding/outlines_logits_processors.py, which unconditionally uses the cache from outlines. A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service if the filesystem runs out of space. Note that even if vLLM was configured to use a different backend by default, it is still possible to choose outlines on a per-request basis using the guided_decoding_backend key of the extra_body field of the request. This issue applies only to the V0 engine and is fixed in 0.8.0.

Exploitation Scenario

An attacker with any valid API credential — including a free trial account — scripts a loop sending POST requests to the vLLM /v1/chat/completions endpoint. Each request includes a unique JSON schema in response_format and sets guided_decoding_backend=outlines in extra_body. vLLM compiles each schema and writes it to the local filesystem cache. After thousands of requests (automatable in minutes), the host disk fills up, the vLLM process crashes, and all inference is unavailable to legitimate users. No AI/ML expertise is required; the attack vector is identical to classic resource exhaustion attacks adapted for LLM inference infrastructure.
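
To make the request shape concrete, here is a single-request sketch for lab reproduction against a test deployment. The base URL, credential, and model name are placeholders, and the guided_json / guided_decoding_backend parameter names follow vLLM's guided-decoding extras for pre-0.8.0 versions, so verify them against your deployed version; the scenario above simply repeats this request with a fresh schema each time.

    # Lab-only sketch of one request from the scenario above. The base URL,
    # API key, and model name are hypothetical placeholders.
    import uuid
    import requests

    URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible API
    HEADERS = {"Authorization": "Bearer TEST-KEY"}     # placeholder credential

    def unique_schema() -> dict:
        """Each schema differs by one property name, forcing a fresh grammar
        compile and a new on-disk cache entry."""
        return {
            "type": "object",
            "properties": {f"field_{uuid.uuid4().hex}": {"type": "string"}},
        }

    body = {
        "model": "test-model",                    # hypothetical model name
        "messages": [{"role": "user", "content": "hi"}],
        "max_tokens": 1,                          # keep each request cheap
        "guided_json": unique_schema(),           # structured output schema
        "guided_decoding_backend": "outlines",    # per-request override
    }
    requests.post(URL, headers=HEADERS, json=body, timeout=30)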

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Timeline

Published
March 19, 2025
Last Modified
July 31, 2025
First Seen
March 19, 2025

Related Vulnerabilities