Any vLLM deployment exposing the OpenAI-compatible API to untrusted users is vulnerable to RAM exhaustion through crafted structured-output requests. Upgrade to vLLM 0.8.4 immediately; if patching is blocked, gate API access to authenticated, trusted clients only. This is low-effort to exploit and high-impact on availability of your AI inference infrastructure.
What is the risk?
CVSS 6.5 (medium) understates operational risk for production inference servers. The attack requires only a low-privilege API account and no special AI knowledge — any authenticated user can trigger it by sending a stream of structured-output requests with unique JSON schemas. Availability impact is HIGH: successful exploitation exhausts all system RAM, crashing the inference server. For multi-tenant or internally shared vLLM deployments, one malicious insider or compromised account can take down AI services for all users.
What systems are affected?
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| vLLM | pip | >= 0.6.5, < 0.8.4 | 0.8.4 |
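The range in the table above can be checked against an installed package with a few lines of Python. This is a minimal sketch, not an official tool: the helper names (`release_tuple`, `is_vulnerable`, `installed_vllm_is_vulnerable`) are hypothetical, and it assumes that only the leading numeric part of the version string matters, so a pre-release such as `0.8.4rc1` is treated like `0.8.4`.

```python
import re
from importlib import metadata

VULNERABLE_FROM = (0, 6, 5)  # first affected release (inclusive), from the table above
PATCHED_IN = (0, 8, 4)       # first fixed release (exclusive upper bound)


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '0.7.3.post1' -> (0, 7, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # pad '0.8' to (0, 8, 0)
    return tuple(parts[:3])


def is_vulnerable(version: str) -> bool:
    """True if the version falls in the range >= 0.6.5, < 0.8.4."""
    return VULNERABLE_FROM <= release_tuple(version) < PATCHED_IN


def installed_vllm_is_vulnerable():
    """Return True/False for the installed vllm package, or None if it is not installed."""
    try:
        return is_vulnerable(metadata.version("vllm"))
    except metadata.PackageNotFoundError:
        return None
```

Running `installed_vllm_is_vulnerable()` inside each inference environment (container image, virtualenv) gives a quick first pass before a fuller inventory.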
If you run a vLLM version in this range and expose the OpenAI-compatible API server, you are affected.
What should I do?
1. **Patch**: Upgrade vLLM to >= 0.8.4. This is the only complete fix.
2. **Workaround (if patching is blocked)**: Restrict the OpenAI-compatible API to trusted, authenticated clients only; block or rate-limit external access.
3. **Detection**: Monitor RAM consumption on inference nodes for sustained growth correlated with structured-output requests; alert on memory usage > 80% sustained over 5 minutes.
4. **V0 engine hardening**: If you cannot upgrade and have configured a default backend other than xgrammar, disable the per-request `guided_decoding_backend` override, or block the `extra_body.guided_decoding_backend` parameter at your API gateway, so that clients cannot switch back to xgrammar. This step does not help while xgrammar remains the default backend.
5. **Inventory**: Audit which internal services call vLLM's structured output endpoints and their trust level.
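The Detection step above (memory usage > 80% sustained over 5 minutes) can be sketched as a small state machine. This is an illustration under assumptions, not a replacement for your monitoring stack: the class name `SustainedThresholdAlert` and the helper `linux_memory_used_pct` are hypothetical, and the helper assumes a Linux host where `/proc/meminfo` reports `MemTotal` and `MemAvailable`.

```python
class SustainedThresholdAlert:
    """Fires only when usage stays above the threshold for a full window."""

    def __init__(self, threshold_pct: float = 80.0, window_s: float = 300.0):
        self.threshold_pct = threshold_pct
        self.window_s = window_s
        self._breach_started = None  # timestamp of the first sample above threshold

    def observe(self, timestamp_s: float, used_pct: float) -> bool:
        """Record one sample; return True if the alert should fire."""
        if used_pct <= self.threshold_pct:
            self._breach_started = None  # usage dropped, so reset the window
            return False
        if self._breach_started is None:
            self._breach_started = timestamp_s
        return timestamp_s - self._breach_started >= self.window_s


def linux_memory_used_pct(path: str = "/proc/meminfo") -> float:
    """Percentage of RAM in use, computed from MemTotal and MemAvailable (Linux only)."""
    fields = {}
    with open(path) as handle:
        for line in handle:
            name, _, rest = line.partition(":")
            fields[name] = int(rest.split()[0])  # values are reported in kB
    return 100.0 * (1.0 - fields["MemAvailable"] / fields["MemTotal"])
```

A polling loop would call `alert.observe(time.monotonic(), linux_memory_used_pct())` every few seconds. Correlating a firing alert with the rate of structured-output requests helps separate this issue from ordinary load.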
Frequently Asked Questions
What is GHSA-hf3c-wxg2-49q9?
GHSA-hf3c-wxg2-49q9 is a denial-of-service vulnerability in vLLM's structured-output (guided decoding) feature. The default XGrammar backend caches every compiled grammar in RAM, so an authenticated client that sends a stream of requests with unique schemas can exhaust memory on the inference server. It affects vLLM >= 0.6.5, < 0.8.4 and is fixed in 0.8.4.
Is GHSA-hf3c-wxg2-49q9 actively exploited?
No confirmed active exploitation of GHSA-hf3c-wxg2-49q9 has been reported, but organizations should still patch proactively.
How to fix GHSA-hf3c-wxg2-49q9?
1. **Patch**: Upgrade vLLM to >= 0.8.4. This is the only complete fix.
2. **Workaround (if patching is blocked)**: Restrict the OpenAI-compatible API to trusted, authenticated clients only; block or rate-limit external access.
3. **Detection**: Monitor RAM consumption on inference nodes for sustained growth correlated with structured-output requests; alert on memory usage > 80% sustained over 5 minutes.
4. **V0 engine hardening**: If you cannot upgrade and have configured a default backend other than xgrammar, disable the per-request `guided_decoding_backend` override, or block the `extra_body.guided_decoding_backend` parameter at your API gateway, so that clients cannot switch back to xgrammar. This step does not help while xgrammar remains the default backend.
5. **Inventory**: Audit which internal services call vLLM's structured output endpoints and their trust level.
What systems are affected by GHSA-hf3c-wxg2-49q9?
This vulnerability affects the following AI/ML architecture patterns: LLM inference serving, OpenAI-compatible API servers, Model serving, Agent frameworks, RAG pipelines.
What is the CVSS score for GHSA-hf3c-wxg2-49q9?
GHSA-hf3c-wxg2-49q9 has a CVSS v3.1 base score of 6.5 (MEDIUM).
What is the AI security impact?
MITRE ATLAS Techniques
- AML.T0010.001 AI Software
- AML.T0029 Denial of AI Service
- AML.T0034 Cost Harvesting
- AML.T0040 AI Model Inference API Access
- AML.T0049 Exploit Public-Facing Application
What are the technical details?
Original Advisory
### Impact

This report is to highlight a vulnerability in XGrammar, a library used by the structured output feature in vLLM. The XGrammar advisory is here: https://github.com/mlc-ai/xgrammar/security/advisories/GHSA-389x-67px-mjg3

The [xgrammar](https://xgrammar.mlc.ai/docs/) library is the default backend used by vLLM to support structured output (a.k.a. guided decoding). Xgrammar provides a required, built-in cache for its compiled grammars stored in RAM. xgrammar is available by default through the OpenAI compatible API server with both the V0 and V1 engines.

A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service by consuming all of the system's RAM.

Note that even if vLLM was configured to use a different backend by default, it is still possible to choose xgrammar on a per-request basis using the `guided_decoding_backend` key of the `extra_body` field of the request with the V0 engine. This per-request choice is not available when using the V1 engine.

### Patches

* https://github.com/vllm-project/vllm/pull/16283

### Workarounds

There is no way to workaround this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server.

### References

* https://github.com/mlc-ai/xgrammar/security/advisories/GHSA-389x-67px-mjg3
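The per-request backend choice described in the advisory can be filtered out before a request reaches vLLM. The following is a minimal sketch of such a gateway-side filter, not part of vLLM. The function name `strip_backend_override` is hypothetical. The sketch assumes that the OpenAI Python client merges `extra_body` fields into the top level of the JSON request body, so the key is removed there, and also from a literal nested `extra_body` object in case a client sends one.

```python
import json

# Keys a client could use to select the decoding backend per request.
BLOCKED_KEYS = {"guided_decoding_backend"}


def strip_backend_override(raw_body: bytes) -> bytes:
    """Return the request body with any per-request backend override removed."""
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        return raw_body  # not a JSON object; leave it for the server to reject

    for key in BLOCKED_KEYS:
        payload.pop(key, None)

    # Defensive: also handle a literal nested "extra_body" object.
    nested = payload.get("extra_body")
    if isinstance(nested, dict):
        for key in BLOCKED_KEYS:
            nested.pop(key, None)

    return json.dumps(payload).encode("utf-8")
```

As the advisory notes, this only matters for the V0 engine, and only when the configured default backend is not xgrammar.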
Exploitation Scenario
An attacker with a valid API key (insider threat, stolen credential, or paying trial user) writes a script that sends hundreds of /v1/chat/completions requests per minute, each specifying a unique JSON schema in the response_format field. vLLM's XGrammar backend compiles and caches a grammar object for each unique schema in RAM with no eviction policy. Within minutes, the inference server's available memory is exhausted, causing the process to OOM-crash or the OS to kill it, resulting in a complete outage of AI inference capabilities. The attacker needs no ML expertise — only knowledge of the OpenAI structured output API format, which is publicly documented.
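Because the scenario above depends on each request carrying a previously unseen schema, one possible gateway-side control is to cap the number of distinct schemas each API key may submit within a time window. The sketch below is an assumption-laden illustration, not a vLLM feature: the class name `UniqueSchemaLimiter`, the default limits, and the use of a SHA-256 fingerprint over canonical JSON are all choices made for this example.

```python
import hashlib
import json
import time
from collections import defaultdict


class UniqueSchemaLimiter:
    """Caps the number of distinct JSON schemas each API key may use per window."""

    def __init__(self, max_unique: int = 20, window_s: float = 3600.0):
        self.max_unique = max_unique
        self.window_s = window_s
        self._seen = defaultdict(dict)  # api_key -> {schema fingerprint: last seen}

    @staticmethod
    def fingerprint(schema) -> str:
        """Hash a canonical serialization so key order does not create new entries."""
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def allow(self, api_key: str, schema, now: float = None) -> bool:
        """Return True if the request may proceed, False if the key is over its cap."""
        now = time.monotonic() if now is None else now
        seen = self._seen[api_key]

        # Forget fingerprints that have aged out of the window.
        for digest in [d for d, ts in seen.items() if now - ts >= self.window_s]:
            del seen[digest]

        digest = self.fingerprint(schema)
        if digest in seen or len(seen) < self.max_unique:
            seen[digest] = now
            return True
        return False
```

Legitimate clients tend to reuse a small set of schemas, so a modest cap is unlikely to affect them. A limiter like this reduces how fast the cache can grow but does not replace the upgrade.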
Weaknesses (CWE)
- CWE-1395: Dependency on Vulnerable Third-Party Component (Primary)
- CWE-770: Allocation of Resources Without Limits or Throttling (Primary)

CWE-1395 — Dependency on Vulnerable Third-Party Component: The product has a dependency on a third-party component that contains one or more known vulnerabilities.
- [Requirements, Policy] In some industries such as healthcare [REF-1320] [REF-1322] or technologies such as the cloud [REF-1321], it might be unclear about who is responsible for applying patches for third-party vulnerabilities: the vendor, the operator/customer, or a separate service. Clarifying roles and responsibilities can be important to minimize confusion or unnecessary delay when third-party vulnerabilities are disclosed.
- [Requirements] Require a Bill of Materials for all components and sub-components of the product. For software, require a Software Bill of Materials (SBOM) [REF-1247] [REF-1311].
Source: MITRE CWE corpus.
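To illustrate the CWE-770 weakness listed above, compare an unbounded cache with one that evicts its least recently used entry once a size limit is reached. The following is a generic sketch of that technique. It is not the actual change made in vLLM or XGrammar, and `BoundedGrammarCache` and `compile_fn` are hypothetical names standing in for a grammar compiler.

```python
from collections import OrderedDict


class BoundedGrammarCache:
    """LRU cache with a hard entry limit, so unique keys cannot grow memory forever."""

    def __init__(self, compile_fn, max_entries: int = 128):
        self._compile = compile_fn      # stand-in for an expensive grammar compiler
        self._max_entries = max_entries
        self._entries = OrderedDict()   # ordered oldest-used first

    def get(self, schema_key: str):
        """Return the compiled value for a key, compiling and caching it on a miss."""
        if schema_key in self._entries:
            self._entries.move_to_end(schema_key)  # mark as most recently used
            return self._entries[schema_key]

        compiled = self._compile(schema_key)
        self._entries[schema_key] = compiled
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)      # evict the least recently used entry
        return compiled

    def __len__(self) -> int:
        return len(self._entries)
```

With a limit in place, a stream of unique schemas costs repeated compilation time rather than ever-growing RAM, which turns a crash into a throughput problem that rate limiting can address.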
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
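The 6.5 base score can be reproduced from the vector using the CVSS v3.1 base-score equations. The sketch below is a partial calculator written for illustration: it handles only vectors with Scope: Unchanged (`S:U`), which covers this advisory, and the function names are hypothetical.

```python
import math

# Metric weights from the CVSS v3.1 specification (PR values are for Scope: Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a CVSS:3.1 vector with Scope: Unchanged."""
    prefix, *pairs = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    m = dict(pair.split(":") for pair in pairs)
    if m["S"] != "U":
        raise ValueError("this sketch only handles Scope: Unchanged")

    cia = WEIGHTS["CIA"]
    iss = 1 - (1 - cia[m["C"]]) * (1 - cia[m["I"]]) * (1 - cia[m["A"]])
    impact = 6.42 * iss
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * WEIGHTS["PR"][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H"))  # 6.5
```

For this vector the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is about 2.84. Their sum, about 6.43, rounds up to 6.5. The score is capped at medium because only availability is affected, even though that impact is rated High.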
Related Vulnerabilities
| CVE | CVSS | Summary | Relation |
|---|---|---|---|
| CVE-2024-9053 | 9.8 | vllm: RCE via unsafe pickle deserialization in RPC server | Same package: vllm |
| CVE-2024-11041 | 9.8 | vllm: RCE via unsafe pickle deserialization in MessageQueue | Same package: vllm |
| CVE-2026-25960 | 9.8 | vllm: SSRF allows internal network access | Same package: vllm |
| CVE-2025-47277 | 9.8 | vLLM: RCE via exposed TCPStore in distributed inference | Same package: vllm |
| CVE-2025-32444 | 9.8 | vLLM: RCE via pickle deserialization on ZeroMQ | Same package: vllm |