CVE-2025-32381: xgrammar: unbounded grammar cache causes LLM server DoS

GHSA-389x-67px-mjg3 MEDIUM
Published April 9, 2025
CISO Take

Any vLLM or xgrammar-powered inference endpoint that accepts user-supplied JSON schemas is vulnerable to a memory-exhaustion DoS — the only prerequisite is a low-privileged user session (CVSS PR:L). Patch to xgrammar 0.1.18 immediately; if patching is delayed, rate-limit structured-output requests and cap unique schema submissions per session. This is a low-sophistication attack: a script sending thousands of unique schemas can take down an inference node.

Risk Assessment

Medium severity in isolation, but operationally significant for AI inference infrastructure. The attack surface is broad — vLLM is widely deployed in enterprise LLM serving stacks, and the exploit requires only low-privilege API access. EPSS is low (0.3%), suggesting no active exploitation yet, and the CVE is not in CISA KEV. However, the simplicity of the attack (no special knowledge needed, just unique JSON schemas) and the high availability impact on inference nodes elevate operational risk above what the 6.5 CVSS score suggests.

Affected Systems

Package    Ecosystem    Vulnerable Range    Patched
xgrammar   pip          < 0.1.18            0.1.18

Do you use xgrammar? Any version below 0.1.18 is affected.

Severity & Risk

CVSS 3.1: 6.5 / 10 (Medium)
EPSS: 0.3% chance of exploitation in 30 days (higher than 55% of all CVEs)
Exploitation Status: No known exploitation
Sophistication: Trivial

Attack Surface

AV (Attack Vector): Network
AC (Attack Complexity): Low
PR (Privileges Required): Low
UI (User Interaction): None
S (Scope): Unchanged
C (Confidentiality): None
I (Integrity): None
A (Availability): High

Recommended Action

5 steps
  1. PATCH

    Upgrade xgrammar to >= 0.1.18 (cache size limit introduced). Update vLLM to a version referencing xgrammar 0.1.18+ (see vLLM PR #16283).

  2. SHORT-TERM WORKAROUND

    Rate-limit structured-output (JSON schema) requests per client/session at the API gateway or load balancer layer. Restrict unique schema submissions to a reasonable bound (e.g., 50/hour per API key).

  3. MONITORING

    Alert on memory growth patterns on inference nodes, particularly correlated with structured-output endpoint traffic. Set OOM kill alerts.
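A minimal memory-pressure probe for step 3 might look like the following. It assumes a Linux host (`/proc/meminfo`), and the 10% threshold and the `print`-based alert are placeholders for your real monitoring/paging integration; this complements, rather than replaces, kernel OOM-kill alerts from your orchestrator.

```python
def available_mem_fraction(meminfo_text: str) -> float:
    """Parse /proc/meminfo text and return MemAvailable / MemTotal."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])  # values are reported in kB
    return fields["MemAvailable"] / fields["MemTotal"]

def check_node(alert_threshold: float = 0.10) -> bool:
    """Return False (and emit an alert) when available memory is critically low."""
    with open("/proc/meminfo") as f:
        frac = available_mem_fraction(f.read())
    if frac < alert_threshold:
        print(f"ALERT: only {frac:.0%} of memory available")  # hook your pager here
        return False
    return True
```

Correlating the alert timestamps with structured-output endpoint traffic in your request logs is what distinguishes this DoS from ordinary load growth.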

  4. NETWORK CONTROLS

    Ensure inference endpoints are not publicly exposed without authentication; apply the principle of least privilege to schema-submission capabilities.

  5. VERIFY

    Confirm your vLLM deployment version and run pip show xgrammar to check the installed version.
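The check in step 5 can also be done programmatically. A minimal stdlib sketch follows; the simplistic numeric version parse is an illustration only — use `packaging.version` for real-world comparisons (pre-releases, local version segments, etc.).

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(v: str) -> tuple[int, ...]:
    # Naive numeric-only parse; sufficient for plain X.Y.Z strings.
    return tuple(int(p) for p in v.split(".")[:3])

def xgrammar_is_patched(min_version: str = "0.1.18") -> bool:
    try:
        installed = version("xgrammar")
    except PackageNotFoundError:
        return True  # not installed, so this host is not affected
    return parse_version(installed) >= parse_version(min_version)
```

Note that tuple comparison handles cases like 0.1.9 vs 0.1.18 correctly, where naive string comparison would not.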

CISA SSVC Assessment

Decision: Track
Exploitation: None
Automatable: No
Technical Impact: Partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Classification

Compliance Impact

This CVE is relevant to:

EU AI Act
Article 9 - Risk management system
ISO 42001
8.4 - AI system operation
NIST AI RMF
MANAGE 2.2 - Mechanisms to sustain the value of deployed AI systems are evaluated and applied
OWASP LLM Top 10
LLM10:2025 - Unbounded Consumption

Frequently Asked Questions

What is CVE-2025-32381?

CVE-2025-32381 is a denial-of-service vulnerability in xgrammar, the grammar-compilation library used by vLLM and other LLM serving stacks for structured output. xgrammar caches compiled grammars in memory without a size limit, so an attacker with low-privilege API access can submit thousands of unique JSON schemas, each of which forces a new cache entry, until the host's memory is exhausted and the inference process is OOM-killed. The fix, released in xgrammar 0.1.18, adds a cache size limit.

Is CVE-2025-32381 actively exploited?

No confirmed active exploitation of CVE-2025-32381 has been reported, but organizations should still patch proactively.

How to fix CVE-2025-32381?

1. PATCH: Upgrade xgrammar to >= 0.1.18 (cache size limit introduced). Update vLLM to a version referencing xgrammar 0.1.18+ (see vLLM PR #16283).
2. SHORT-TERM WORKAROUND: Rate-limit structured-output (JSON schema) requests per client/session at the API gateway or load balancer layer. Restrict unique schema submissions to a reasonable bound (e.g., 50/hour per API key).
3. MONITORING: Alert on memory growth patterns on inference nodes, particularly correlated with structured-output endpoint traffic. Set OOM kill alerts.
4. NETWORK CONTROLS: Ensure inference endpoints are not publicly exposed without authentication; apply the principle of least privilege to schema-submission capabilities.
5. VERIFY: Confirm your vLLM deployment version and run `pip show xgrammar` to check the installed version.

What systems are affected by CVE-2025-32381?

This vulnerability affects the following AI/ML architecture patterns: LLM inference servers, structured output pipelines, model serving, agent frameworks.

What is the CVSS score for CVE-2025-32381?

CVE-2025-32381 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.32%.

Technical Details

NVD Description

Summary: Xgrammar includes a cache for compiled grammars to increase performance with repeated use of the same grammar. This cache is held in memory. Since the cache is unbounded, a system making use of xgrammar can be abused to fill up a host's memory and cause a denial of service. For example, sending many small requests to an LLM inference server with unique JSON schemas would eventually cause this denial of service to occur.

Details: The fix is to add a limit to the cache size. This was done in https://github.com/mlc-ai/xgrammar/pull/243 An example of making use of the new cache size limit can be found in vLLM here: https://github.com/vllm-project/vllm/pull/16283

Impact: Any system making use of Xgrammar and taking requests as input from potentially untrusted parties would be vulnerable to this denial of service issue.
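The fix pattern described above — bounding the compiled-grammar cache — can be illustrated with a small LRU sketch. This is not xgrammar's actual implementation from PR #243; `compile_grammar` is a stand-in for the expensive compilation step, and the class and parameter names are illustrative.

```python
from collections import OrderedDict

class BoundedGrammarCache:
    """LRU-bounded cache: old entries are evicted instead of growing forever."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._cache: OrderedDict[str, object] = OrderedDict()

    def get(self, schema_key: str, compile_grammar):
        if schema_key in self._cache:
            self._cache.move_to_end(schema_key)   # mark as most recently used
            return self._cache[schema_key]
        grammar = compile_grammar(schema_key)     # the expensive step being cached
        self._cache[schema_key] = grammar
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)       # evict least recently used entry
        return grammar
```

With an unbounded dict in place of this class, every unique schema permanently consumes memory — which is exactly the condition the attack exploits.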

Exploitation Scenario

An adversary with low-privilege API access to a vLLM inference endpoint (e.g., a free-tier or trial user) writes a script generating thousands of structurally unique JSON schemas — each schema with slightly different property names or nesting. Each request to the `/v1/chat/completions` endpoint with a unique `response_format.json_schema` triggers xgrammar to compile and cache a new grammar object. With no eviction policy, the cache grows unbounded. After ~10,000-50,000 requests (depending on schema complexity and host RAM), the host's memory is exhausted, the inference process is OOM-killed, and the endpoint becomes unavailable for all users. The attack is fully automatable, requires no special AI/ML knowledge, and can be executed from a single low-bandwidth connection.

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Timeline

Published
April 9, 2025
Last Modified
April 9, 2025
First Seen
March 24, 2026

Related Vulnerabilities