GHSA-hf3c-wxg2-49q9 — MEDIUM (CVSS 6.5) AI Security Vulnerability

CISO Take

Any vLLM deployment exposing the OpenAI-compatible API to untrusted users is vulnerable to RAM exhaustion through crafted structured-output requests. Upgrade to vLLM 0.8.4 immediately; if patching is blocked, gate API access to authenticated, trusted clients only. This is low-effort to exploit and high-impact on availability of your AI inference infrastructure.

Risk Assessment

CVSS 6.5 (medium) understates operational risk for production inference servers. The attack requires only a low-privilege API account and no special AI knowledge — any authenticated user can trigger it by sending a stream of structured-output requests with unique JSON schemas. Availability impact is HIGH: successful exploitation exhausts all system RAM, crashing the inference server. For multi-tenant or internally shared vLLM deployments, one malicious insider or compromised account can take down AI services for all users.

Affected Systems

Package	Ecosystem	Vulnerable Range	Patched
vllm	pip	>= 0.6.5, < 0.8.4	`0.8.4`
78.9K 126 dependents Pushed 6d ago 56% patched ~32d to patch Full package profile →

Do you use vllm? You're affected.

Severity & Risk

CVSS 3.1

6.5 / 10

EPSS

N/A

Exploitation Status

No known exploitation

Sophistication

Trivial

Attack Surface

AV Network

AC Low

PR Low

UI None

S Unchanged

C None

I None

A High

Recommended Action

5 steps

Patch

Upgrade vLLM to >= 0.8.4 — this is the only complete fix.
Workaround (if patching is blocked)

Restrict the OpenAI-compatible API to trusted, authenticated clients only; block or rate-limit external access.
Detection

Monitor RAM consumption on inference nodes for sustained growth correlated with structured-output requests; alert on memory usage > 80% sustained over 5 minutes.
V0 engine hardening

If you cannot upgrade, consider disabling the per-request guided_decoding_backend override or blocking the extra_body.guided_decoding_backend parameter at your API gateway.
Inventory

Audit which internal services call vLLM's structured output endpoints and their trust level.

Classification

DoS Supply Chain Inference Framework API AML.T0010.001 - AI Software AML.T0029 - Denial of AI Service AML.T0034 - Cost Harvesting AML.T0040 - AI Model Inference API Access AML.T0049 - Exploit Public-Facing Application

Compliance Impact

This CVE is relevant to:

EU AI Act

Article 9 - Risk Management System — Robustness and Cybersecurity

ISO 42001

A.6.2.6 - AI System Availability and Resilience

NIST AI RMF

RMF-RS-1 - Reliable and Available AI Systems

OWASP LLM Top 10

LLM04 - Model Denial of Service

Frequently Asked Questions

What is GHSA-hf3c-wxg2-49q9?

Any vLLM deployment exposing the OpenAI-compatible API to untrusted users is vulnerable to RAM exhaustion through crafted structured-output requests. Upgrade to vLLM 0.8.4 immediately; if patching is blocked, gate API access to authenticated, trusted clients only. This is low-effort to exploit and high-impact on availability of your AI inference infrastructure.

Is GHSA-hf3c-wxg2-49q9 actively exploited?

No confirmed active exploitation of GHSA-hf3c-wxg2-49q9 has been reported, but organizations should still patch proactively.

How to fix GHSA-hf3c-wxg2-49q9?

1. **Patch**: Upgrade vLLM to >= 0.8.4 — this is the only complete fix. 2. **Workaround (if patching is blocked)**: Restrict the OpenAI-compatible API to trusted, authenticated clients only; block or rate-limit external access. 3. **Detection**: Monitor RAM consumption on inference nodes for sustained growth correlated with structured-output requests; alert on memory usage > 80% sustained over 5 minutes. 4. **V0 engine hardening**: If you cannot upgrade, consider disabling the per-request guided_decoding_backend override or blocking the extra_body.guided_decoding_backend parameter at your API gateway. 5. **Inventory**: Audit which internal services call vLLM's structured output endpoints and their trust level.

What systems are affected by GHSA-hf3c-wxg2-49q9?

This vulnerability affects the following AI/ML architecture patterns: LLM inference serving, OpenAI-compatible API servers, Model serving, Agent frameworks, RAG pipelines.

What is the CVSS score for GHSA-hf3c-wxg2-49q9?

GHSA-hf3c-wxg2-49q9 has a CVSS v3.1 base score of 6.5 (MEDIUM).

Technical Details

NVD Description

### Impact This report is to highlight a vulnerability in XGrammar, a library used by the structured output feature in vLLM. The XGrammar advisory is here: https://github.com/mlc-ai/xgrammar/security/advisories/GHSA-389x-67px-mjg3 The [xgrammar](https://xgrammar.mlc.ai/docs/) library is the default backend used by vLLM to support structured output (a.k.a. guided decoding). Xgrammar provides a required, built-in cache for its compiled grammars stored in RAM. xgrammar is available by default through the OpenAI compatible API server with both the V0 and V1 engines. A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service by consuming all of the system's RAM. Note that even if vLLM was configured to use a different backend by default, it is still possible to choose xgrammar on a per-request basis using the `guided_decoding_backend` key of the `extra_body` field of the request with the V0 engine. This per-request choice is not available when using the V1 engine. ### Patches * https://github.com/vllm-project/vllm/pull/16283 ### Workarounds There is no way to workaround this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server. ### References * https://github.com/mlc-ai/xgrammar/security/advisories/GHSA-389x-67px-mjg3

Exploitation Scenario

An attacker with a valid API key (insider threat, stolen credential, or paying trial user) writes a script that sends hundreds of /v1/chat/completions requests per minute, each specifying a unique JSON schema in the response_format field. vLLM's XGrammar backend compiles and caches a grammar object for each unique schema in RAM with no eviction policy. Within minutes, the inference server's available memory is exhausted, causing the process to OOM-crash or the OS to kill it, resulting in a complete outage of AI inference capabilities. The attacker needs no ML expertise — only knowledge of the OpenAI structured output API format, which is publicly documented.