CVE-2025-32381 — MEDIUM (CVSS 6.5) AI Security Vulnerability

CISO Take

Any vLLM or xgrammar-powered inference endpoint that accepts user-supplied JSON schemas is vulnerable to a memory-exhaustion DoS. The attacker needs nothing more than a valid low-privilege user session (CVSS PR:L). Upgrade to xgrammar 0.1.18 or later immediately; if patching is delayed, rate-limit structured-output requests and cap unique schema submissions per session. This is a low-sophistication attack: a script sending thousands of unique schemas can take down an inference node.

Risk Assessment

Medium severity in isolation, but operationally significant for AI inference infrastructure. The attack surface is broad: vLLM is widely deployed in enterprise LLM serving stacks, and the exploit requires only low-privilege API access. EPSS is low (0.003, roughly 0.3%), which suggests no active exploitation yet, and the CVE is not in CISA KEV. However, the simplicity of the attack (no special knowledge needed, just unique JSON schemas) and the high availability impact on inference nodes elevate operational risk above what the 6.5 CVSS score suggests.

Affected Systems

Package	Ecosystem	Vulnerable Range	Patched
xgrammar	pip	< 0.1.18	`0.1.18`

If you run xgrammar below 0.1.18, directly or through vLLM, you are affected.

Severity & Risk

CVSS 3.1

6.5 / 10

EPSS

0.3%

chance of exploitation in 30 days

Higher than 55% of all CVEs

Source: EPSS v3 — FIRST.org

Exploitation Status

No known exploitation

Sophistication

Trivial

Attack Surface

AV Network

AC Low

PR Low

UI None

S Unchanged

C None

I None

A High

Recommended Action

1. PATCH: Upgrade xgrammar to >= 0.1.18 (cache size limit introduced). Update vLLM to a version referencing xgrammar 0.1.18+ (see vLLM PR #16283).
2. SHORT-TERM WORKAROUND: Rate-limit structured-output (JSON schema) requests per client/session at the API gateway or load balancer layer. Restrict unique schema submissions to a reasonable bound (e.g., 50/hour per API key).
3. MONITORING: Alert on memory growth patterns on inference nodes, particularly correlated with structured-output endpoint traffic. Set OOM kill alerts.
4. NETWORK CONTROLS: Ensure inference endpoints are not publicly exposed without authentication; apply the principle of least privilege to schema-submission capabilities.
5. VERIFY: Confirm your vLLM deployment version and run `pip show xgrammar` to check the installed version.
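The version check can also be scripted, for example as part of a deployment health check. The following is a minimal sketch using only the Python standard library; the helper names `parse_release`, `is_patched`, and `check_installed` are illustrative and not part of xgrammar or vLLM. It compares only the leading numeric release components, so pre-release suffixes such as `rc1` are ignored (an assumption that is adequate for a quick audit but not a full PEP 440 comparison).

```python
import re
from importlib import metadata

# First xgrammar release containing the cache size limit, per the advisory.
PATCHED_RELEASE = (0, 1, 18)


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '0.1.18.post1' -> (0, 1, 18)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_patched(version: str) -> bool:
    """True if the release components are at or above the patched release."""
    return parse_release(version) >= PATCHED_RELEASE


def check_installed(package: str = "xgrammar") -> str:
    """Report whether the installed package version is at or above the fix."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed in this environment"
    status = "patched" if is_patched(installed) else "VULNERABLE"
    return f"{package} {installed}: {status}"


if __name__ == "__main__":
    print(check_installed())
```

Run the script with the same interpreter that serves inference traffic, since a container or virtual environment may hold a different xgrammar version than the host.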

CISA SSVC Assessment

Decision Track

Exploitation none

Automatable No

Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Classification

DoS
Inference Framework
AML.T0029 - Denial of AI Service
AML.T0034 - Cost Harvesting
AML.T0049 - Exploit Public-Facing Application

Compliance Impact

This CVE is relevant to:

EU AI Act

Article 9 - Risk management system

ISO 42001

8.4 - AI system operation

NIST AI RMF

MANAGE 2.2 - Mechanisms to sustain the value of deployed AI systems are evaluated and applied

OWASP LLM Top 10

LLM10:2025 - Unbounded Consumption

Frequently Asked Questions

What is CVE-2025-32381?

CVE-2025-32381 is a denial-of-service vulnerability in xgrammar versions before 0.1.18. Xgrammar keeps compiled grammars in an in-memory cache with no size limit, so a client that submits many unique JSON schemas to an xgrammar-backed inference endpoint (such as vLLM) can exhaust the host's memory. The attacker needs only a valid low-privilege user session (CVSS PR:L), and the attack is low-sophistication: a script sending thousands of unique schemas can take down an inference node.

Is CVE-2025-32381 actively exploited?

No confirmed active exploitation of CVE-2025-32381 has been reported, but organizations should still patch proactively.

How to fix CVE-2025-32381?

1. PATCH: Upgrade xgrammar to >= 0.1.18 (cache size limit introduced). Update vLLM to a version referencing xgrammar 0.1.18+ (see vLLM PR #16283).
2. SHORT-TERM WORKAROUND: Rate-limit structured-output (JSON schema) requests per client/session at the API gateway or load balancer layer. Restrict unique schema submissions to a reasonable bound (e.g., 50/hour per API key).
3. MONITORING: Alert on memory growth patterns on inference nodes, particularly correlated with structured-output endpoint traffic. Set OOM kill alerts.
4. NETWORK CONTROLS: Ensure inference endpoints are not publicly exposed without authentication; apply the principle of least privilege to schema-submission capabilities.
5. VERIFY: Confirm your vLLM deployment version and run `pip show xgrammar` to check the installed version.

What systems are affected by CVE-2025-32381?

This vulnerability affects the following AI/ML architecture patterns: LLM inference servers, structured output pipelines, model serving, agent frameworks.

What is the CVSS score for CVE-2025-32381?

CVE-2025-32381 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.32%.

Technical Details

NVD Description

Summary: Xgrammar includes a cache for compiled grammars to increase performance with repeated use of the same grammar. This cache is held in memory. Since the cache is unbounded, a system making use of xgrammar can be abused to fill up a host's memory and cause a denial of service. For example, sending many small requests to an LLM inference server with unique JSON schemas would eventually cause this denial of service to occur.

Details: The fix is to add a limit to the cache size. This was done in https://github.com/mlc-ai/xgrammar/pull/243. An example of making use of the new cache size limit can be found in vLLM here: https://github.com/vllm-project/vllm/pull/16283

Impact: Any system making use of Xgrammar and taking requests as input from potentially untrusted parties would be vulnerable to this denial of service issue.
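To illustrate the general idea behind the fix, the sketch below shows a size-bounded cache with least-recently-used eviction. It is not xgrammar's implementation: the class name `BoundedGrammarCache`, the `compile_fn` callback, and the entry-count limit are all assumptions chosen for illustration, and the real fix may bound the cache differently (see the linked pull requests for the actual change).

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class BoundedGrammarCache(Generic[K, V]):
    """Cache compiled results, evicting the least recently used entry when full."""

    def __init__(self, compile_fn: Callable[[K], V], max_entries: int = 128) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._compile = compile_fn  # hypothetical stand-in for grammar compilation
        self._max_entries = max_entries
        self._entries: "OrderedDict[K, V]" = OrderedDict()

    def get(self, key: K) -> V:
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        value = self._compile(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            # Evict the oldest entry so memory use stays bounded.
            self._entries.popitem(last=False)
        return value

    def __contains__(self, key: object) -> bool:
        return key in self._entries

    def __len__(self) -> int:
        return len(self._entries)


if __name__ == "__main__":
    cache = BoundedGrammarCache(lambda schema: schema.upper(), max_entries=3)
    for i in range(1000):
        cache.get(f"schema-{i}")
    print(len(cache))  # stays at the configured limit
```

With a bound in place, a flood of unique schemas costs the attacker repeated compilation work on the server but no longer grows memory without limit, which is why rate limiting remains useful even after patching.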

Exploitation Scenario

An adversary with low-privilege API access to a vLLM inference endpoint (e.g., a free-tier or trial user) writes a script generating thousands of structurally unique JSON schemas — each schema with slightly different property names or nesting. Each request to the `/v1/chat/completions` endpoint with a unique `response_format.json_schema` triggers xgrammar to compile and cache a new grammar object. With no eviction policy, the cache grows unbounded. After ~10,000-50,000 requests (depending on schema complexity and host RAM), the host's memory is exhausted, the inference process is OOM-killed, and the endpoint becomes unavailable for all users. The attack is fully automatable, requires no special AI/ML knowledge, and can be executed from a single low-bandwidth connection.
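Because the scenario depends on each request carrying a schema the server has not seen before, one mitigation is to cap the number of distinct schemas per API key within a time window, in line with the 50/hour bound suggested in the recommended actions. The following is a minimal in-process sketch; `UniqueSchemaLimiter` and its parameters are hypothetical names, and a production deployment would more likely enforce this at the API gateway with shared state. Schemas are fingerprinted from a canonical JSON form so that reordering keys does not count as a new schema.

```python
import hashlib
import json
import time
from collections import defaultdict
from typing import Any, Callable, Dict


class UniqueSchemaLimiter:
    """Allow at most `max_unique` distinct schemas per API key per time window."""

    def __init__(
        self,
        max_unique: int = 50,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_unique = max_unique
        self._window = window_seconds
        self._clock = clock
        # api_key -> {schema fingerprint: time it was first seen in this window}
        self._seen: Dict[str, Dict[str, float]] = defaultdict(dict)

    @staticmethod
    def fingerprint(schema: Any) -> str:
        """Hash a canonical serialization so key order does not matter."""
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def allow(self, api_key: str, schema: Any) -> bool:
        now = self._clock()
        seen = self._seen[api_key]
        # Forget fingerprints whose window has expired.
        for fp, first_seen in list(seen.items()):
            if now - first_seen >= self._window:
                del seen[fp]
        fp = self.fingerprint(schema)
        if fp in seen:
            return True  # a repeated schema reuses an existing cache entry
        if len(seen) >= self._max_unique:
            return False  # too many distinct schemas from this key
        seen[fp] = now
        return True


if __name__ == "__main__":
    limiter = UniqueSchemaLimiter(max_unique=2, window_seconds=3600)
    print(limiter.allow("key-1", {"type": "string"}))
    print(limiter.allow("key-1", {"type": "integer"}))
    print(limiter.allow("key-1", {"type": "boolean"}))
```

Per-key state is bounded by `max_unique`, but the number of tracked keys is not, so this sketch assumes API keys are issued by the operator rather than chosen freely by clients.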