CVE-2026-44223: vLLM: speculative decoding DoS via penalty params

GHSA-83vm-p52w-f9pw MEDIUM CISA: TRACK*
Published May 6, 2026
CISO Take

A tensor shape mismatch in vLLM's extract_hidden_states speculative decoding proposer lets any authenticated API user crash the EngineCore process with a single request that sets any penalty parameter. The crash is deterministic and immediate, needs no special workload or concurrency, and the process stays down until an operator restarts it. Organizations running vLLM v0.18.0 through v0.19.1 with this speculative decoding configuration therefore face complete inference service unavailability, a serious availability risk for any AI inference platform exposed to untrusted or semi-trusted users. With roughly 130 downstream dependents and 42 prior CVEs in the same package, vLLM's vulnerability surface warrants systematic attention. EPSS is low (about 0.4%) and the vulnerability is absent from CISA KEV, so there is no evidence of active exploitation in the wild, although a public proof of concept exists. Upgrade to vLLM v0.20.0 immediately, or strip the repetition_penalty, frequency_penalty, and presence_penalty parameters at the API gateway as an interim control.

Sources: NVD, GitHub Advisory, ATLAS

What is the risk?

Medium CVSS (6.5) understates operational impact for affected configurations. The attack requires only low privileges (a valid API key or user account), is network-accessible with low complexity, and produces a complete and permanent service outage with no self-recovery. The constraint is narrow: only deployments using extract_hidden_states as the speculative decoding method on v0.18.0–v0.19.1 are affected. For organizations in that window, effective exploitability is trivial — any user aware of the CVE can weaponize it in one API call. Risk is HIGH for affected configs, LOW for all others.

How does the attack unfold?

Initial Access
Adversary obtains low-privilege API credentials to a vLLM inference endpoint running v0.18.0–v0.19.1 with extract_hidden_states speculative decoding enabled.
AML.T0012
Exploitation
Adversary sends a single chat completion API request containing any penalty parameter (e.g., repetition_penalty: 1.1), triggering the tensor shape mismatch on the first decode step.
AML.T0049
Impact
EngineCore process crashes immediately and permanently, rendering the entire vLLM inference service unavailable to all users until manual operator intervention.
AML.T0029

What systems are affected?

Package | Ecosystem | Vulnerable Range | Patched
vLLM | pip | >= 0.18.0, < 0.20.0 | 0.20.0
83.4K · 130 dependents · Pushed 2d ago · 34% patched · ~32d to patch
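The vulnerable range above can be checked mechanically. The following is a minimal sketch, assuming a plain `X.Y.Z` version string; the helper names `parse_release` and `is_in_vulnerable_range` are illustrative and not part of vLLM. It only tests the version range, so a positive result still requires confirming that extract_hidden_states speculative decoding is actually configured.

```python
import re
from importlib import metadata

VULNERABLE_MIN = (0, 18, 0)  # inclusive lower bound from the advisory
PATCHED = (0, 20, 0)         # first fixed release (exclusive upper bound)


def parse_release(version: str) -> tuple[int, int, int]:
    """Parse the leading numeric release, ignoring suffixes such as 'rc1' or '+cu124'.

    Limitation: a pre-release like '0.20.0rc1' is treated as 0.20.0, so
    pre-release builds should be reviewed by hand.
    """
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_in_vulnerable_range(version: str) -> bool:
    """Return True when the version satisfies >= 0.18.0 and < 0.20.0."""
    return VULNERABLE_MIN <= parse_release(version) < PATCHED


if __name__ == "__main__":
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vllm is not installed in this environment")
    else:
        status = "in vulnerable range" if is_in_vulnerable_range(installed) else "outside vulnerable range"
        print(f"vllm {installed}: {status}")
```

Running the script inside each serving environment gives a quick first-pass inventory; the speculative decoding configuration still has to be checked separately.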


How severe is it?

CVSS 3.1
6.5 / 10
EPSS
0.4%
chance of exploitation in 30 days
Higher than 28% of all CVEs
Exploitation Status
Exploit Available
Exploitation: MEDIUM
Sophistication
Trivial
Exploitation Confidence
medium
CISA SSVC: Public PoC
Composite signal derived from CISA KEV, VulnCheck KEV, CISA SSVC, EPSS, Metasploit, Exploit-DB, trickest/cve, Nuclei templates, and inthewild.io exploitation reports.

What is the attack surface?

AV Network
AC Low
PR Low
UI None
S Unchanged
C None
I None
A High

What should I do?

5 steps
  1. Patch: Upgrade vLLM to v0.20.0 or later immediately — the fix (PR #38610) slices the return tensor to correct shape.

  2. Workaround A: Switch speculative decoding method away from extract_hidden_states on affected versions.

  3. Workaround B: Reject or strip repetition_penalty, frequency_penalty, and presence_penalty fields at the API gateway or load balancer before requests reach vLLM.

  4. Detection: Monitor EngineCore process health and alert on unexpected restarts or crashes — a pattern of crashes correlated with penalty-parameter requests is a strong indicator.

  5. Audit: Inventory all vLLM instances across the environment and confirm speculative decoding configuration before deprioritizing.
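Workaround B can be sketched as a small filter that runs in a gateway or proxy before the request is forwarded. This is an illustrative sketch, not a vLLM or gateway API: the function name `strip_penalty_params` is hypothetical, and it assumes the penalty parameters arrive as top-level fields of a JSON request body.

```python
import json

# Sampling penalty parameters named in the advisory.
PENALTY_FIELDS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def strip_penalty_params(raw_body: bytes) -> tuple[bytes, list[str]]:
    """Remove penalty fields from a JSON request body.

    Returns the (possibly rewritten) body and the list of fields removed.
    Assumption: the fields are top-level keys of a JSON object.
    """
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        return raw_body, []  # not a JSON object; nothing to strip

    removed = [field for field in PENALTY_FIELDS if field in payload]
    if not removed:
        return raw_body, []  # leave untouched bodies byte-for-byte identical

    for field in removed:
        del payload[field]
    return json.dumps(payload).encode("utf-8"), removed
```

Returning the removed field names lets the gateway log which callers sent penalty parameters, which also supports the detection step. A stricter variant could reject such requests with an HTTP 400 instead of silently rewriting them.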

What does CISA's SSVC say?

Decision Track*
Exploitation poc
Automatable No
Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

How is it classified?

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
Article 15 - Accuracy, robustness and cybersecurity
ISO 42001
8.3 - AI Risk Treatment
NIST AI RMF
GOVERN 1.2 - Policies and processes for AI risk management
MANAGE 2.2 - Mechanisms to sustain value of AI systems over time
OWASP LLM Top 10
LLM04 - Model Denial of Service

Frequently Asked Questions

What is CVE-2026-44223?

CVE-2026-44223 is a denial-of-service vulnerability in vLLM's extract_hidden_states speculative decoding proposer. On vLLM v0.18.0 through v0.19.1 with that proposer enabled, a tensor shape mismatch lets any authenticated API user crash the EngineCore process with a single request that sets repetition_penalty, frequency_penalty, or presence_penalty. The crash is deterministic and immediate, and the inference service stays unavailable until an operator restarts it. The issue is fixed in vLLM v0.20.0.

Is CVE-2026-44223 actively exploited?

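No confirmed in-the-wild exploitation of CVE-2026-44223 has been reported, and it is not listed in CISA KEV. A public proof of concept exists (CISA SSVC rates Exploitation as "poc") and the exploit is trivial, so affected organizations should still patch proactively.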

How to fix CVE-2026-44223?

  1. Patch: Upgrade vLLM to v0.20.0 or later immediately; the fix (PR #38610) slices the return tensor to the correct shape.
  2. Workaround A: Switch the speculative decoding method away from extract_hidden_states on affected versions.
  3. Workaround B: Reject or strip the repetition_penalty, frequency_penalty, and presence_penalty fields at the API gateway or load balancer before requests reach vLLM.
  4. Detection: Monitor EngineCore process health and alert on unexpected restarts or crashes; a pattern of crashes correlated with penalty-parameter requests is a strong indicator.
  5. Audit: Inventory all vLLM instances across the environment and confirm the speculative decoding configuration before deprioritizing.

What systems are affected by CVE-2026-44223?

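The vulnerability affects vLLM (pip) versions >= 0.18.0 and < 0.20.0 when extract_hidden_states is used as the speculative decoding method. The affected AI/ML architecture patterns are LLM inference APIs, model serving, speculative decoding pipelines, and multi-tenant AI platforms.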

What is the CVSS score for CVE-2026-44223?

CVE-2026-44223 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.37%.

What is the AI security impact?

Affected AI Architectures

LLM inference APIs, model serving, speculative decoding pipelines, multi-tenant AI platforms

MITRE ATLAS Techniques

AML.T0029 Denial of AI Service
AML.T0049 Exploit Public-Facing Application

Compliance Controls Affected

EU AI Act: Article 15
ISO 42001: 8.3
NIST AI RMF: GOVERN 1.2, MANAGE 2.2
OWASP LLM Top 10: LLM04

What are the technical details?

Original Advisory

vLLM is an inference and serving engine for large language models (LLMs). From 0.18.0 to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0.

Exploitation Scenario

An adversary with a valid API key — an internal developer, a trial user, or an attacker who compromised credentials — sends a single chat completion request to the vLLM inference endpoint with the body parameter repetition_penalty set to 1.1. If the instance runs v0.18.0–v0.19.1 with extract_hidden_states speculative decoding, the EngineCore process crashes immediately upon processing the first decode step, taking down the entire inference service. No retry, no escalation, no special payload crafting required. In a multi-tenant environment, this single request denies service to all other users until an operator manually restarts the process — a scenario that maps directly to insider threat, credential compromise, or API abuse.
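Because the scenario is a single request followed by an immediate crash, the triggering caller can often be identified by correlating crash times with request logs. The sketch below assumes a hypothetical log format (a list of dicts with `time`, `user`, and `body` keys) and a hypothetical function name `find_suspect_requests`; real gateway and vLLM logs will need their own parsing.

```python
from datetime import datetime, timedelta

PENALTY_FIELDS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def find_suspect_requests(requests, crash_times, window_seconds=30):
    """Return (crash_time, user, penalty_fields) for each penalty-parameter
    request that arrived within `window_seconds` before a crash.

    `requests` is an assumed structure: dicts with a 'time' (datetime),
    a 'user' (str), and a 'body' (dict of request parameters).
    """
    window = timedelta(seconds=window_seconds)
    hits = []
    for crash in crash_times:
        for req in requests:
            used = [field for field in PENALTY_FIELDS if field in req["body"]]
            elapsed = crash - req["time"]
            if used and timedelta(0) <= elapsed <= window:
                hits.append((crash, req["user"], used))
    return hits


if __name__ == "__main__":
    t0 = datetime(2026, 5, 7, 12, 0, 0)
    sample_requests = [
        {"time": t0, "user": "alice", "body": {"repetition_penalty": 1.1}},
        {"time": t0, "user": "carol", "body": {"temperature": 0.7}},
    ]
    for crash, user, fields in find_suspect_requests(sample_requests, [t0 + timedelta(seconds=2)]):
        print(f"{crash.isoformat()}: crash preceded by {user} using {fields}")
```

The window is kept short because the advisory describes the crash as occurring on the first decode step; a hit is an indicator to investigate, not proof of intent, since penalty parameters are common in legitimate requests.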

Weaknesses (CWE)

CWE-131 — Incorrect Calculation of Buffer Size: The product does not correctly calculate the size to be used when allocating a buffer, which could lead to a buffer overflow.

  • [Implementation] When allocating a buffer for the purpose of transforming, converting, or encoding an input, allocate enough memory to handle the largest possible encoding. For example, in a routine that converts "&" characters to "&amp;" for HTML entity encoding, the output buffer needs to be at least 5 times as large as the input buffer.
  • [Implementation] Understand the programming language's underlying representation and how it interacts with numeric calculation (CWE-681). Pay close attention to byte size discrepancies, precision, signed/unsigned distinctions, truncation, conversion and casting between types, "not-a-number" calculations, and how the language handles numbers that are too large or too small for its underlying representation. [REF-7] Also be careful to account for 32-bit, 64-bit, and other potential differences that may affect the numeric representation.

Source: MITRE CWE corpus.

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
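The 6.5 base score can be reproduced from this vector with the CVSS v3.1 base score equations. The sketch below uses the metric weights and the Roundup procedure from the CVSS v3.1 specification, but handles only Scope: Unchanged vectors, which is all this CVE needs; the function names are illustrative.

```python
import math

# CVSS v3.1 metric weights (PR values shown are for Scope: Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, following the v3.1 spec's integer method."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a Scope: Unchanged CVSS v3.1 vector."""
    metrics = dict(item.split(":") for item in vector.split("/")[1:])
    if metrics["S"] != "U":
        raise NotImplementedError("this sketch covers Scope: Unchanged only")

    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


if __name__ == "__main__":
    print(base_score("CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H"))
```

For this vector the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is 8.22 × 0.85 × 0.77 × 0.62 × 0.85 ≈ 2.84, which sum to about 6.43 and round up to 6.5. The score is driven entirely by the availability impact, which is why it can understate the operational cost of a full service outage.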

Timeline

Published
May 6, 2026
Last Modified
June 22, 2026
First Seen
May 7, 2026

Related Vulnerabilities