GHSA-hf3c-wxg2-49q9: vLLM: DoS via unbounded XGrammar schema cache

GHSA-hf3c-wxg2-49q9 MEDIUM
Published April 15, 2025
CISO Take

Any vLLM deployment exposing the OpenAI-compatible API to untrusted users is vulnerable to RAM exhaustion through crafted structured-output requests. Upgrade to vLLM 0.8.4 immediately; if patching is blocked, gate API access to authenticated, trusted clients only. This is low-effort to exploit and high-impact on availability of your AI inference infrastructure.

Risk Assessment

CVSS 6.5 (medium) understates operational risk for production inference servers. The attack requires only a low-privilege API account and no special AI knowledge — any authenticated user can trigger it by sending a stream of structured-output requests with unique JSON schemas. Availability impact is HIGH: successful exploitation exhausts all system RAM, crashing the inference server. For multi-tenant or internally shared vLLM deployments, one malicious insider or compromised account can take down AI services for all users.

Affected Systems

| Package | Ecosystem | Vulnerable Range | Patched |
|---------|-----------|------------------|---------|
| vllm    | pip       | >= 0.6.5, < 0.8.4 | 0.8.4  |

If you run vllm in the vulnerable range (>= 0.6.5, < 0.8.4), you're affected.

Severity & Risk

CVSS 3.1
6.5 / 10
EPSS
N/A
Exploitation Status
No known exploitation
Sophistication
Trivial

Attack Surface

Attack Vector (AV): Network
Attack Complexity (AC): Low
Privileges Required (PR): Low
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): None
Integrity (I): None
Availability (A): High

Recommended Action

5 steps
  1. Patch

    Upgrade vLLM to >= 0.8.4 — this is the only complete fix.

  2. Workaround (if patching is blocked)

    Restrict the OpenAI-compatible API to trusted, authenticated clients only; block or rate-limit external access.

  3. Detection

    Monitor RAM consumption on inference nodes for sustained growth correlated with structured-output requests; alert on memory usage > 80% sustained over 5 minutes.

  4. V0 engine hardening

    If you cannot upgrade, consider disabling the per-request guided_decoding_backend override or blocking the extra_body.guided_decoding_backend parameter at your API gateway.

  5. Inventory

    Audit which internal services call vLLM's structured output endpoints and their trust level.
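For step 4, the advisory's per-request override can be neutralized at the gateway layer. The sketch below is a minimal, hypothetical sanitizer, assuming your gateway hands you the parsed JSON request body before proxying to vLLM; `sanitize_request` is illustrative, not a vLLM or gateway API, and the field names follow the advisory (`extra_body.guided_decoding_backend`, V0 engine only).

```python
# Hypothetical gateway-side sanitizer for the V0 engine workaround:
# strip the per-request backend override before the request reaches vLLM.
# Adapt to however your gateway exposes the parsed JSON body.

def sanitize_request(body: dict) -> dict:
    """Drop extra_body.guided_decoding_backend so clients cannot force XGrammar."""
    extra = body.get("extra_body")
    if isinstance(extra, dict):
        extra.pop("guided_decoding_backend", None)  # remove the override if present
    return body
```

Rejecting such requests outright (HTTP 400) instead of silently stripping the field is equally valid and makes abuse attempts visible in gateway logs.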

Classification

Compliance Impact

This CVE is relevant to:

EU AI Act
Article 9 - Risk Management System — Robustness and Cybersecurity
ISO 42001
A.6.2.6 - AI System Availability and Resilience
NIST AI RMF
RMF-RS-1 - Reliable and Available AI Systems
OWASP LLM Top 10
LLM04 - Model Denial of Service

Frequently Asked Questions

What is GHSA-hf3c-wxg2-49q9?

GHSA-hf3c-wxg2-49q9 is a denial-of-service vulnerability in vLLM's structured output (guided decoding) feature. The default XGrammar backend caches a compiled grammar in RAM for every unique JSON schema it sees, with no eviction, so any authenticated user can exhaust server memory by streaming requests with unique schemas. Upgrade to vLLM 0.8.4 immediately; if patching is blocked, gate API access to authenticated, trusted clients only.

Is GHSA-hf3c-wxg2-49q9 actively exploited?

No confirmed active exploitation of GHSA-hf3c-wxg2-49q9 has been reported, but organizations should still patch proactively.

How to fix GHSA-hf3c-wxg2-49q9?

1. **Patch**: Upgrade vLLM to >= 0.8.4 — this is the only complete fix.
2. **Workaround (if patching is blocked)**: Restrict the OpenAI-compatible API to trusted, authenticated clients only; block or rate-limit external access.
3. **Detection**: Monitor RAM consumption on inference nodes for sustained growth correlated with structured-output requests; alert on memory usage > 80% sustained over 5 minutes.
4. **V0 engine hardening**: If you cannot upgrade, consider disabling the per-request guided_decoding_backend override or blocking the extra_body.guided_decoding_backend parameter at your API gateway.
5. **Inventory**: Audit which internal services call vLLM's structured output endpoints and their trust level.

What systems are affected by GHSA-hf3c-wxg2-49q9?

This vulnerability affects the following AI/ML architecture patterns: LLM inference serving, OpenAI-compatible API servers, Model serving, Agent frameworks, RAG pipelines.

What is the CVSS score for GHSA-hf3c-wxg2-49q9?

GHSA-hf3c-wxg2-49q9 has a CVSS v3.1 base score of 6.5 (MEDIUM).

Technical Details

NVD Description

### Impact

This report highlights a vulnerability in XGrammar, a library used by the structured output feature in vLLM. The XGrammar advisory is here: https://github.com/mlc-ai/xgrammar/security/advisories/GHSA-389x-67px-mjg3

The [xgrammar](https://xgrammar.mlc.ai/docs/) library is the default backend used by vLLM to support structured output (a.k.a. guided decoding). XGrammar provides a required, built-in cache for its compiled grammars, stored in RAM. XGrammar is available by default through the OpenAI-compatible API server with both the V0 and V1 engines.

A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a denial of service by consuming all of the system's RAM.

Note that even if vLLM was configured to use a different backend by default, it is still possible to choose XGrammar on a per-request basis using the `guided_decoding_backend` key of the `extra_body` field of the request with the V0 engine. This per-request choice is not available when using the V1 engine.

### Patches

* https://github.com/vllm-project/vllm/pull/16283

### Workarounds

There is no way to work around this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI-compatible API server.

### References

* https://github.com/mlc-ai/xgrammar/security/advisories/GHSA-389x-67px-mjg3
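The core defect is a grammar cache keyed by schema with no eviction. The sketch below contrasts that unbounded pattern with a bounded LRU cache, the general mitigation shape. This is an illustration only: `BoundedGrammarCache` is hypothetical, and the actual fix in vLLM PR #16283 bounds the XGrammar cache differently.

```python
from collections import OrderedDict

class BoundedGrammarCache:
    """LRU cache with a fixed capacity — the mitigation pattern (hypothetical sketch)."""

    def __init__(self, maxsize: int = 128):
        self.maxsize = maxsize
        self._store = OrderedDict()

    def get_or_compile(self, schema_key, compile_fn):
        if schema_key in self._store:
            self._store.move_to_end(schema_key)  # mark as recently used
            return self._store[schema_key]
        grammar = compile_fn(schema_key)         # expensive grammar compilation
        self._store[schema_key] = grammar
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)      # evict least recently used entry
        return grammar

# The vulnerable pattern: one cache entry per unique schema, never evicted,
# so memory grows linearly with the number of attacker-supplied schemas.
unbounded = {}
bounded = BoundedGrammarCache(maxsize=128)
for i in range(10_000):
    schema = f'{{"type": "object", "properties": {{"f{i}": {{"type": "string"}}}}}}'
    unbounded.setdefault(schema, object())
    bounded.get_or_compile(schema, lambda s: object())
```

After 10,000 unique schemas the unbounded dict holds 10,000 entries while the bounded cache stays at 128; a real bound would be sized to the working set of legitimate schemas.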

Exploitation Scenario

An attacker with a valid API key (insider threat, stolen credential, or paying trial user) writes a script that sends hundreds of /v1/chat/completions requests per minute, each specifying a unique JSON schema in the response_format field. vLLM's XGrammar backend compiles and caches a grammar object for each unique schema in RAM with no eviction policy. Within minutes, the inference server's available memory is exhausted, causing the process to OOM-crash or the OS to kill it, resulting in a complete outage of AI inference capabilities. The attacker needs no ML expertise — only knowledge of the OpenAI structured output API format, which is publicly documented.
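Because the attack signature is many *distinct* schemas from one client, a log-based heuristic can flag it before memory is exhausted. The sketch below is hypothetical (not a vLLM feature), and the threshold is illustrative; it assumes you can extract the API key and the raw schema string from each request log entry.

```python
import hashlib
from collections import defaultdict

# Illustrative threshold: distinct schemas per client per window.
# Tune to your legitimate schema churn before alerting on it.
UNIQUE_SCHEMA_THRESHOLD = 50

def schema_fingerprint(schema_json: str) -> str:
    """Stable fingerprint of a raw schema string."""
    return hashlib.sha256(schema_json.encode()).hexdigest()

def flag_suspicious(requests):
    """requests: iterable of (api_key, schema_json) pairs within one time window.

    Returns the set of API keys that submitted more distinct schemas
    than the threshold — the attack pattern described above.
    """
    seen = defaultdict(set)
    for api_key, schema in requests:
        seen[api_key].add(schema_fingerprint(schema))
    return {key for key, fps in seen.items() if len(fps) > UNIQUE_SCHEMA_THRESHOLD}
```

A legitimate service typically reuses a handful of schemas, so its fingerprint set stays small; the attacker's set grows with every request.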

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Timeline

Published
April 15, 2025
Last Modified
April 15, 2025
First Seen
March 24, 2026

Related Vulnerabilities