CVE-2025-32381: xgrammar: unbounded grammar cache causes LLM server DoS

GHSA-389x-67px-mjg3 MEDIUM
Published April 9, 2025
CISO Take

Any vLLM or other xgrammar-powered inference endpoint that accepts user-supplied JSON schemas and runs xgrammar older than 0.1.18 is vulnerable to a memory-exhaustion DoS. The attacker needs nothing more than a valid low-privilege user session (CVSS PR:L). Upgrade to xgrammar 0.1.18 immediately; if patching is delayed, rate-limit structured-output requests and cap unique schema submissions per session. This is a low-sophistication attack: a script sending thousands of unique schemas can take down an inference node.

What is the risk?

Medium severity in isolation, but operationally significant for AI inference infrastructure. The attack surface is broad: vLLM is widely deployed in enterprise LLM serving stacks, and the exploit requires only low-privilege API access. EPSS is low (about 0.4%), which suggests no active exploitation yet, and the CVE is not in CISA KEV. However, the simplicity of the attack (no special knowledge needed, just unique JSON schemas) and the high availability impact on inference nodes raise the operational risk above what the 6.5 CVSS score suggests.

What systems are affected?

Package    Ecosystem   Vulnerable Range   Patched
XGrammar   pip         < 0.1.18           0.1.18

If you run XGrammar older than 0.1.18, you are affected.

How severe is it?

CVSS 3.1: 6.5 / 10
EPSS: 0.4% chance of exploitation in 30 days (higher than 33% of all CVEs)
Exploitation Status: No known exploitation
Sophistication: Trivial

What is the attack surface?

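Attack Vector (AV): Network
Attack Complexity (AC): Low
Privileges Required (PR): Low
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): None
Integrity (I): None
Availability (A): High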

What should I do?

  1. PATCH

    Upgrade xgrammar to >= 0.1.18 (cache size limit introduced). Update vLLM to a version referencing xgrammar 0.1.18+ (see vLLM PR #16283).

  2. SHORT-TERM WORKAROUND

    Rate-limit structured-output (JSON schema) requests per client/session at the API gateway or load balancer layer. Restrict unique schema submissions to a reasonable bound (e.g., 50/hour per API key).

  3. MONITORING

    Alert on memory growth patterns on inference nodes, particularly correlated with structured-output endpoint traffic. Set OOM kill alerts.

  4. NETWORK CONTROLS

    Ensure inference endpoints are not publicly exposed without authentication; apply the principle of least privilege to schema-submission capabilities.

  5. VERIFY

    Confirm your vLLM deployment version and run `pip show xgrammar` to check the installed version.
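The VERIFY step can also be scripted. The following is a minimal sketch, assuming the check runs in the same Python environment as the inference server; `parse_release` and `is_patched` are hypothetical helper names, and the parser deliberately ignores pre-release suffixes, so use a full version parser if you need strict PEP 440 ordering.

```python
from importlib.metadata import PackageNotFoundError, version

# First fixed xgrammar release according to this advisory.
PATCHED = (0, 1, 18)


def parse_release(v: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '0.1.18.post1' -> (0, 1, 18)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_patched(installed: str, patched: tuple[int, ...] = PATCHED) -> bool:
    """True if the installed release is at or above the patched release."""
    return parse_release(installed) >= patched


if __name__ == "__main__":
    try:
        installed = version("xgrammar")
    except PackageNotFoundError:
        print("xgrammar is not installed in this environment")
    else:
        status = "patched" if is_patched(installed) else "VULNERABLE"
        print(f"xgrammar {installed}: {status}")
```

Running this inside each serving image or virtual environment avoids relying on a host-level `pip` that may point at a different interpreter.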

What does CISA's SSVC say?

Decision Track
Exploitation none
Automatable No
Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
Article 9 - Risk management system
ISO 42001
8.4 - AI system operation
NIST AI RMF
MANAGE 2.2 - Mechanisms to sustain the value of deployed AI systems are evaluated and applied
OWASP LLM Top 10
LLM10:2025 - Unbounded Consumption

Frequently Asked Questions

What is CVE-2025-32381?

CVE-2025-32381 is a denial-of-service vulnerability in xgrammar versions before 0.1.18. xgrammar keeps compiled grammars in an in-memory cache with no size limit, so a client that submits many unique JSON schemas to an xgrammar-backed inference endpoint such as vLLM can exhaust host memory. The attacker needs only a valid low-privilege session (CVSS PR:L). The fix is to upgrade to xgrammar 0.1.18; until then, rate-limit structured-output requests and cap unique schema submissions per session.

Is CVE-2025-32381 actively exploited?

No confirmed active exploitation of CVE-2025-32381 has been reported, but organizations should still patch proactively.

How to fix CVE-2025-32381?

1. PATCH: Upgrade xgrammar to >= 0.1.18 (cache size limit introduced). Update vLLM to a version referencing xgrammar 0.1.18+ (see vLLM PR #16283).
2. SHORT-TERM WORKAROUND: Rate-limit structured-output (JSON schema) requests per client/session at the API gateway or load balancer layer. Restrict unique schema submissions to a reasonable bound (e.g., 50/hour per API key).
3. MONITORING: Alert on memory growth patterns on inference nodes, particularly correlated with structured-output endpoint traffic. Set OOM kill alerts.
4. NETWORK CONTROLS: Ensure inference endpoints are not publicly exposed without authentication; apply the principle of least privilege to schema-submission capabilities.
5. VERIFY: Confirm your vLLM deployment version and run `pip show xgrammar` to check the installed version.
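The short-term workaround of capping unique schema submissions could look roughly like the sketch below. It is an illustration only, not part of vLLM or xgrammar: `UniqueSchemaLimiter`, `fingerprint`, and the in-process storage are assumptions, and a real gateway would more likely keep this state in a shared store. The default of 50 schemas per hour mirrors the example bound given above.

```python
import hashlib
import json
import time
from collections import defaultdict


def fingerprint(schema: dict) -> str:
    """Hash a canonical JSON form so key order does not create 'new' schemas."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class UniqueSchemaLimiter:
    """Sliding-window cap on distinct JSON schemas per API key (hypothetical helper)."""

    def __init__(self, max_unique=50, window_seconds=3600.0, clock=time.monotonic):
        self.max_unique = max_unique
        self.window = window_seconds
        self.clock = clock
        # api_key -> {fingerprint: first-seen time}; dicts keep insertion order,
        # so the oldest fingerprint is always the first key.
        self._seen = defaultdict(dict)

    def allow(self, api_key: str, schema: dict) -> bool:
        now = self.clock()
        seen = self._seen[api_key]
        # Forget fingerprints first seen before the current window.
        while seen:
            oldest = next(iter(seen))
            if now - seen[oldest] < self.window:
                break
            del seen[oldest]
        fp = fingerprint(schema)
        if fp in seen:
            return True  # repeated schema: no new grammar would be compiled
        if len(seen) >= self.max_unique:
            return False  # too many distinct schemas in this window
        seen[fp] = now
        return True
```

Each per-key table holds at most `max_unique` fingerprints, so the limiter does not itself become an unbounded cache; its total size grows only with the number of API keys.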

What systems are affected by CVE-2025-32381?

This vulnerability affects the following AI/ML architecture patterns: LLM inference servers, structured output pipelines, model serving, agent frameworks.

What is the CVSS score for CVE-2025-32381?

CVE-2025-32381 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.41%.

What is the AI security impact?

Affected AI Architectures

LLM inference servers, structured output pipelines, model serving, agent frameworks

MITRE ATLAS Techniques

AML.T0029 Denial of AI Service
AML.T0034 Cost Harvesting
AML.T0049 Exploit Public-Facing Application

Compliance Controls Affected

EU AI Act: Article 9
ISO 42001: 8.4
NIST AI RMF: MANAGE 2.2
OWASP LLM Top 10: LLM10:2025

What are the technical details?

Original Advisory

### Summary

Xgrammar includes a cache for compiled grammars to increase performance with repeated use of the same grammar. This cache is held in memory. Since the cache is unbounded, a system making use of xgrammar can be abused to fill up a host's memory and cause a denial of service. For example, sending many small requests to an LLM inference server with unique JSON schemas would eventually cause this denial of service to occur.

### Details

The fix is to add a limit to the cache size. This was done in https://github.com/mlc-ai/xgrammar/pull/243

An example of making use of the new cache size limit can be found in vLLM here: https://github.com/vllm-project/vllm/pull/16283

### Impact

Any system making use of Xgrammar and taking requests as input from potentially untrusted parties would be vulnerable to this denial of service issue.
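To illustrate the general idea behind the fix, here is a minimal sketch of a size-limited cache with least-recently-used eviction. It is not xgrammar's implementation, and the advisory does not say which eviction policy the patch uses; `BoundedLRUCache`, `get_or_compile`, and the stand-in compile function are hypothetical names chosen for this example.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class BoundedLRUCache:
    """Cache that holds at most max_entries values, evicting the least recently used."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get_or_compile(self, key: Hashable, compile_fn: Callable):
        if key in self._data:
            # Cache hit: mark the entry as most recently used.
            self._data.move_to_end(key)
            return self._data[key]
        value = compile_fn(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            # Evict the least recently used entry (front of the ordering).
            self._data.popitem(last=False)
        return value

    def __len__(self) -> int:
        return len(self._data)


# Ten thousand unique "schemas" no longer translate into ten thousand entries.
cache = BoundedLRUCache(max_entries=100)
for i in range(10_000):
    # str.upper stands in for an expensive grammar compilation.
    cache.get_or_compile(f'{{"title": "schema-{i}"}}', lambda s: s.upper())
print(len(cache))  # 100
```

With a bound in place, a flood of unique schemas costs the attacker's requests repeated compilation time but can no longer grow the cache's memory without limit.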

Exploitation Scenario

An adversary with low-privilege API access to a vLLM inference endpoint (e.g., a free-tier or trial user) writes a script generating thousands of structurally unique JSON schemas — each schema with slightly different property names or nesting. Each request to the `/v1/chat/completions` endpoint with a unique `response_format.json_schema` triggers xgrammar to compile and cache a new grammar object. With no eviction policy, the cache grows unbounded. After ~10,000-50,000 requests (depending on schema complexity and host RAM), the host's memory is exhausted, the inference process is OOM-killed, and the endpoint becomes unavailable for all users. The attack is fully automatable, requires no special AI/ML knowledge, and can be executed from a single low-bandwidth connection.
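The scenario above implies a steady climb in process memory until the OOM kill, which is the pattern a monitoring check can look for. The sketch below fits a least-squares line to memory samples and estimates the time left before a limit is reached. The function names are hypothetical, the samples are assumed to come from whatever metrics source you already have, and a linear fit is a simplifying assumption.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, resident memory in bytes)


def growth_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory over time, in bytes per second."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / len(samples)
    mean_y = sum(y for _, y in samples) / len(samples)
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    numer = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    return numer / denom


def seconds_until_exhaustion(samples: Sequence[Sample], limit_bytes: float) -> Optional[float]:
    """Estimated seconds until memory hits limit_bytes, or None if it is not growing."""
    slope = growth_rate(samples)
    if slope <= 0:
        return None
    latest = samples[-1][1]
    return max(0.0, (limit_bytes - latest) / slope)
```

An alert rule could fire when the estimate drops below a chosen threshold while structured-output traffic is elevated, which ties the memory signal to the endpoint this CVE targets.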

Weaknesses (CWE)

CWE-770 — Allocation of Resources Without Limits or Throttling: The product allocates a reusable resource or group of resources on behalf of an actor without imposing any intended restrictions on the size or number of resources that can be allocated.

  • [Requirements] Clearly specify the minimum and maximum expectations for capabilities, and dictate which behaviors are acceptable when resource allocation reaches limits.
  • [Architecture and Design] Limit the amount of resources that are accessible to unprivileged users. Set per-user limits for resources. Allow the system administrator to define these limits. Be careful to avoid CWE-410.

Source: MITRE CWE corpus.

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
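As a worked check, the 6.5 base score can be reproduced from this vector using the CVSS v3.1 base-score equations. The sketch below handles only Scope: Unchanged vectors, which is all this CVE needs; the metric weights and the `roundup` rule follow the CVSS v3.1 specification, while the function and table names are chosen for this example.

```python
import math

# CVSS v3.1 metric weights (Privileges Required values are for Scope: Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(x: float) -> float:
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the CVSS v3.1 base score for a Scope: Unchanged vector."""
    prefix, *parts = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    m = dict(p.split(":") for p in parts)
    if m["S"] != "U":
        raise NotImplementedError("this sketch covers Scope: Unchanged only")
    w = {k: WEIGHTS[k][m[k]] for k in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H"))  # 6.5
```

Here the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is about 2.84, so the sum of roughly 6.43 rounds up to 6.5. The score is driven entirely by the availability impact, since confidentiality and integrity are both None.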

Timeline

Published: April 9, 2025
Last Modified: April 9, 2025
First Seen: March 24, 2026

Related Vulnerabilities