GHSA-hf3c-wxg2-49q9: vLLM: DoS via unbounded XGrammar schema cache

GHSA-hf3c-wxg2-49q9 MEDIUM
Published April 15, 2025
CISO Take

Any vLLM deployment exposing the OpenAI-compatible API to untrusted users is vulnerable to RAM exhaustion through crafted structured-output requests. Upgrade to vLLM 0.8.4 immediately; if patching is blocked, gate API access to authenticated, trusted clients only. This is low-effort to exploit and high-impact on availability of your AI inference infrastructure.

What is the risk?

CVSS 6.5 (medium) understates operational risk for production inference servers. The attack requires only a low-privilege API account and no special AI knowledge — any authenticated user can trigger it by sending a stream of structured-output requests with unique JSON schemas. Availability impact is HIGH: successful exploitation exhausts all system RAM, crashing the inference server. For multi-tenant or internally shared vLLM deployments, one malicious insider or compromised account can take down AI services for all users.

What systems are affected?

Package Ecosystem Vulnerable Range Patched
vLLM pip >= 0.6.5, < 0.8.4 0.8.4
83.4K 130 dependents Pushed 2d ago 34% patched ~32d to patch Full package profile →

Do you use vLLM? You're affected.

How severe is it?

CVSS 3.1
6.5 / 10
EPSS
N/A
Exploitation Status
No known exploitation
Sophistication
Trivial

What is the attack surface?

AV AC PR UI S C I A
AV Network
AC Low
PR Low
UI None
S Unchanged
C None
I None
A High

What should I do?

5 steps
  1. Patch

    Upgrade vLLM to >= 0.8.4 — this is the only complete fix.

  2. Workaround (if patching is blocked)

    Restrict the OpenAI-compatible API to trusted, authenticated clients only; block or rate-limit external access.

  3. Detection

    Monitor RAM consumption on inference nodes for sustained growth correlated with structured-output requests; alert on memory usage > 80% sustained over 5 minutes.

  4. V0 engine hardening

    If you cannot upgrade, consider disabling the per-request guided_decoding_backend override or blocking the extra_body.guided_decoding_backend parameter at your API gateway.

  5. Inventory

    Audit which internal services call vLLM's structured output endpoints and their trust level.

How is it classified?

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
Article 9 - Risk Management System — Robustness and Cybersecurity
ISO 42001
A.6.2.6 - AI System Availability and Resilience
NIST AI RMF
RMF-RS-1 - Reliable and Available AI Systems
OWASP LLM Top 10
LLM04 - Model Denial of Service

Frequently Asked Questions

What is GHSA-hf3c-wxg2-49q9?

Any vLLM deployment exposing the OpenAI-compatible API to untrusted users is vulnerable to RAM exhaustion through crafted structured-output requests. Upgrade to vLLM 0.8.4 immediately; if patching is blocked, gate API access to authenticated, trusted clients only. This is low-effort to exploit and high-impact on availability of your AI inference infrastructure.

Is GHSA-hf3c-wxg2-49q9 actively exploited?

No confirmed active exploitation of GHSA-hf3c-wxg2-49q9 has been reported, but organizations should still patch proactively.

How to fix GHSA-hf3c-wxg2-49q9?

1. **Patch**: Upgrade vLLM to >= 0.8.4 — this is the only complete fix. 2. **Workaround (if patching is blocked)**: Restrict the OpenAI-compatible API to trusted, authenticated clients only; block or rate-limit external access. 3. **Detection**: Monitor RAM consumption on inference nodes for sustained growth correlated with structured-output requests; alert on memory usage > 80% sustained over 5 minutes. 4. **V0 engine hardening**: If you cannot upgrade, consider disabling the per-request guided_decoding_backend override or blocking the extra_body.guided_decoding_backend parameter at your API gateway. 5. **Inventory**: Audit which internal services call vLLM's structured output endpoints and their trust level.

What systems are affected by GHSA-hf3c-wxg2-49q9?

This vulnerability affects the following AI/ML architecture patterns: LLM inference serving, OpenAI-compatible API servers, Model serving, Agent frameworks, RAG pipelines.

What is the CVSS score for GHSA-hf3c-wxg2-49q9?

GHSA-hf3c-wxg2-49q9 has a CVSS v3.1 base score of 6.5 (MEDIUM).

What is the AI security impact?

Affected AI Architectures

LLM inference servingOpenAI-compatible API serversModel servingAgent frameworksRAG pipelines

MITRE ATLAS Techniques

AML.T0010.001 AI Software
AML.T0029 Denial of AI Service
AML.T0034 Cost Harvesting
AML.T0040 AI Model Inference API Access
AML.T0049 Exploit Public-Facing Application

Compliance Controls Affected

EU AI Act: Article 9
ISO 42001: A.6.2.6
NIST AI RMF: RMF-RS-1
OWASP LLM Top 10: LLM04

What are the technical details?

Original Advisory

### Impact This report is to highlight a vulnerability in XGrammar, a library used by the structured output feature in vLLM. The XGrammar advisory is here: https://github.com/mlc-ai/xgrammar/security/advisories/GHSA-389x-67px-mjg3 The [xgrammar](https://xgrammar.mlc.ai/docs/) library is the default backend used by vLLM to support structured output (a.k.a. guided decoding). Xgrammar provides a required, built-in cache for its compiled grammars stored in RAM. xgrammar is available by default through the OpenAI compatible API server with both the V0 and V1 engines. A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service by consuming all of the system's RAM. Note that even if vLLM was configured to use a different backend by default, it is still possible to choose xgrammar on a per-request basis using the `guided_decoding_backend` key of the `extra_body` field of the request with the V0 engine. This per-request choice is not available when using the V1 engine. ### Patches * https://github.com/vllm-project/vllm/pull/16283 ### Workarounds There is no way to workaround this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server. ### References * https://github.com/mlc-ai/xgrammar/security/advisories/GHSA-389x-67px-mjg3

Exploitation Scenario

An attacker with a valid API key (insider threat, stolen credential, or paying trial user) writes a script that sends hundreds of /v1/chat/completions requests per minute, each specifying a unique JSON schema in the response_format field. vLLM's XGrammar backend compiles and caches a grammar object for each unique schema in RAM with no eviction policy. Within minutes, the inference server's available memory is exhausted, causing the process to OOM-crash or the OS to kill it, resulting in a complete outage of AI inference capabilities. The attacker needs no ML expertise — only knowledge of the OpenAI structured output API format, which is publicly documented.

Weaknesses (CWE)

CWE-1395 — Dependency on Vulnerable Third-Party Component: The product has a dependency on a third-party component that contains one or more known vulnerabilities.

  • [Requirements, Policy] In some industries such as healthcare [REF-1320] [REF-1322] or technologies such as the cloud [REF-1321], it might be unclear about who is responsible for applying patches for third-party vulnerabilities: the vendor, the operator/customer, or a separate service. Clarifying roles and responsibilities can be important to minimize confusion or unnecessary delay when third-party vulnerabilities are disclosed.
  • [Requirements] Require a Bill of Materials for all components and sub-components of the product. For software, require a Software Bill of Materials (SBOM) [REF-1247] [REF-1311].

Source: MITRE CWE corpus.

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Timeline

Published
April 15, 2025
Last Modified
April 15, 2025
First Seen
March 24, 2026

Related Vulnerabilities