CVE-2026-44223: vLLM: speculative decoding DoS via penalty params
GHSA-83vm-p52w-f9pw MEDIUM CISA: TRACK*
A tensor shape mismatch bug in vLLM's extract_hidden_states speculative decoding proposer allows any authenticated API user to crash the EngineCore process with a single request containing any penalty parameter. The crash is deterministic and immediate, requires no special workload or concurrency, and the process does not recover on its own. Organizations running vLLM v0.18.0 through v0.19.1 with this speculative decoding configuration face complete inference service unavailability until manual restart, making this an availability risk for any AI inference platform exposed to untrusted or semi-trusted users. With 126 downstream dependents and 42 prior CVEs in the same package, vLLM's vulnerability surface warrants systematic attention. EPSS data is not yet available and the vulnerability is absent from CISA KEV, which is consistent with no known active exploitation in the wild. Upgrade to vLLM v0.20.0 immediately, or strip repetition_penalty, frequency_penalty, and presence_penalty parameters at the API gateway as an interim control.
What is the risk?
Medium CVSS (6.5) understates operational impact for affected configurations. The attack requires only low privileges (a valid API key or user account), is network-accessible with low complexity, and produces a complete and permanent service outage with no self-recovery. The constraint is narrow: only deployments using extract_hidden_states as the speculative decoding method on v0.18.0–v0.19.1 are affected. For organizations in that window, effective exploitability is trivial — any user aware of the CVE can weaponize it in one API call. Risk is HIGH for affected configs, LOW for all others.
What systems are affected?
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| vLLM | pip | >= 0.18.0, < 0.20.0 | 0.20.0 |
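The range in the table can be checked mechanically during an inventory. The following is a minimal sketch, not an official tool: the helper names `parse_release` and `in_vulnerable_range` are hypothetical, and the parser deliberately ignores pre-release and local version suffixes, so release candidates need manual review. Being inside the range is necessary but not sufficient; the deployment must also use extract_hidden_states as its speculative decoding method.

```python
import re

AFFECTED_MIN = (0, 18, 0)  # first affected release, from the table above
PATCHED = (0, 20, 0)       # first fixed release, from the table above


def parse_release(version: str) -> tuple[int, int, int]:
    """Parse the leading numeric 'X.Y[.Z]' part of a version string.

    Simplification (assumption): suffixes such as 'rc1' or '+cu121'
    are ignored, so '0.20.0rc1' is treated like '0.20.0'.
    """
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_vulnerable_range(version: str) -> bool:
    """Return True if the version satisfies >= 0.18.0 and < 0.20.0."""
    return AFFECTED_MIN <= parse_release(version) < PATCHED
```

On a host where vLLM is installed, the installed version string can be read with the standard library call `importlib.metadata.version("vllm")` and passed to `in_vulnerable_range`.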
What should I do?
1. Patch: Upgrade vLLM to v0.20.0 or later immediately. The fix (PR #38610) slices the return tensor to the correct shape.
2. Workaround A: Switch the speculative decoding method away from extract_hidden_states on affected versions.
3. Workaround B: Reject or strip the repetition_penalty, frequency_penalty, and presence_penalty fields at the API gateway or load balancer before requests reach vLLM.
4. Detection: Monitor EngineCore process health and alert on unexpected restarts or crashes. A pattern of crashes correlated with penalty-parameter requests is a strong indicator.
5. Audit: Inventory all vLLM instances across the environment and confirm the speculative decoding configuration before deprioritizing.
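Workaround B can be sketched as a small body-rewriting function placed in whatever proxy or middleware sits in front of vLLM. This is an illustrative sketch under stated assumptions, not a vetted control: the function name `strip_penalty_params` is hypothetical, and it assumes the three penalty fields arrive as top-level keys of a JSON request body. Fields nested elsewhere in the payload would not be caught.

```python
import json

# Field names taken from the advisory text.
PENALTY_FIELDS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def strip_penalty_params(raw_body: bytes) -> tuple[bytes, list[str]]:
    """Remove top-level penalty fields from a JSON request body.

    Returns the (possibly rewritten) body and the list of removed field
    names, so the caller can log or reject the request instead.
    Raises json.JSONDecodeError for bodies that are not valid JSON.
    """
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        return raw_body, []  # not a JSON object; nothing to strip
    removed = [name for name in PENALTY_FIELDS if name in payload]
    if not removed:
        return raw_body, []  # leave untouched bodies byte-identical
    for name in removed:
        del payload[name]
    return json.dumps(payload).encode("utf-8"), removed
```

Returning the removed field names lets the gateway choose between silently stripping and rejecting with an error, and gives the detection step a ready-made log signal.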
What does CISA's SSVC say?
Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.
Frequently Asked Questions
Is CVE-2026-44223 actively exploited?
No confirmed active exploitation of CVE-2026-44223 has been reported, but organizations should still patch proactively.
How to fix CVE-2026-44223?
1. Patch: Upgrade vLLM to v0.20.0 or later immediately. The fix (PR #38610) slices the return tensor to the correct shape.
2. Workaround A: Switch the speculative decoding method away from extract_hidden_states on affected versions.
3. Workaround B: Reject or strip the repetition_penalty, frequency_penalty, and presence_penalty fields at the API gateway or load balancer before requests reach vLLM.
4. Detection: Monitor EngineCore process health and alert on unexpected restarts or crashes. A pattern of crashes correlated with penalty-parameter requests is a strong indicator.
5. Audit: Inventory all vLLM instances across the environment and confirm the speculative decoding configuration before deprioritizing.
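The detection step amounts to correlating crash times with recent requests that carried a penalty parameter. The sketch below shows one way to do that over already-parsed logs. Everything about the input shape is an assumption for illustration: the `(timestamp, request_id, body)` record layout, the function name `suspicious_requests`, and the 30-second default window are hypothetical and should be adapted to the actual gateway and process-supervisor logs.

```python
from datetime import datetime, timedelta

# Field names taken from the advisory text.
PENALTY_FIELDS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def suspicious_requests(requests, crash_times, window_seconds=30):
    """Return IDs of penalty-parameter requests followed closely by a crash.

    requests:    iterable of (timestamp, request_id, body_dict) tuples
                 (hypothetical record layout)
    crash_times: list of datetimes at which EngineCore was seen to die
    """
    window = timedelta(seconds=window_seconds)
    hits = []
    for timestamp, request_id, body in requests:
        if not any(field in body for field in PENALTY_FIELDS):
            continue  # request carried no penalty parameter
        # Flag the request if any crash falls within the window after it.
        if any(timedelta(0) <= crash - timestamp <= window for crash in crash_times):
            hits.append(request_id)
    return hits
```

A hit is only a correlation, not proof of intent: legitimate clients commonly set penalty parameters, so flagged request IDs are a starting point for review.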
What systems are affected by CVE-2026-44223?
This vulnerability affects the following AI/ML architecture patterns: LLM inference APIs, model serving, speculative decoding pipelines, multi-tenant AI platforms.
What is the CVSS score for CVE-2026-44223?
CVE-2026-44223 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.37%.
What is the AI security impact?
MITRE ATLAS Techniques
AML.T0029 Denial of AI Service
AML.T0049 Exploit Public-Facing Application
What are the technical details?
Original Advisory
vLLM is an inference and serving engine for large language models (LLMs). From 0.18.0 to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0.
Exploitation Scenario
An adversary with a valid API key — an internal developer, a trial user, or an attacker who compromised credentials — sends a single chat completion request to the vLLM inference endpoint with the body parameter repetition_penalty set to 1.1. If the instance runs v0.18.0–v0.19.1 with extract_hidden_states speculative decoding, the EngineCore process crashes immediately upon processing the first decode step, taking down the entire inference service. No retry, no escalation, no special payload crafting required. In a multi-tenant environment, this single request denies service to all other users until an operator manually restarts the process — a scenario that maps directly to insider threat, credential compromise, or API abuse.
Weaknesses (CWE)
- CWE-131 Incorrect Calculation of Buffer Size (Primary)
- CWE-704 Incorrect Type Conversion or Cast (Primary)
CWE-131 (Incorrect Calculation of Buffer Size): The product does not correctly calculate the size to be used when allocating a buffer, which could lead to a buffer overflow.
- [Implementation] When allocating a buffer for the purpose of transforming, converting, or encoding an input, allocate enough memory to handle the largest possible encoding. For example, in a routine that converts "&" characters to "&amp;" for HTML entity encoding, the output buffer needs to be at least 5 times as large as the input buffer.
- [Implementation] Understand the programming language's underlying representation and how it interacts with numeric calculation (CWE-681). Pay close attention to byte size discrepancies, precision, signed/unsigned distinctions, truncation, conversion and casting between types, "not-a-number" calculations, and how the language handles numbers that are too large or too small for its underlying representation. [REF-7] Also be careful to account for 32-bit, 64-bit, and other potential differences that may affect the numeric representation.
Source: MITRE CWE corpus.
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
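The 6.5 base score follows directly from this vector under the CVSS v3.1 base-score equations. The sketch below reproduces the arithmetic using the metric weights and round-up rule from the public specification. It is deliberately limited to Scope: Unchanged vectors, which is all this advisory needs, and the function names are illustrative.

```python
import math

# CVSS v3.1 metric weights (Privileges Required values are for Scope: Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a Scope: Unchanged CVSS v3.1 vector."""
    prefix, *parts = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError("only CVSS:3.1 vectors are handled in this sketch")
    metrics = dict(part.split(":") for part in parts)
    if metrics["S"] != "U":
        raise ValueError("Scope: Changed is out of scope for this sketch")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))
```

For this CVE the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is 8.22 × 0.85 × 0.77 × 0.62 × 0.85 ≈ 2.84. Their sum, about 6.43, rounds up to 6.5. The score is driven entirely by availability impact, since confidentiality and integrity are both rated None.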
Related Vulnerabilities
| CVE | CVSS | Package | Summary |
|---|---|---|---|
| CVE-2024-9053 | 9.8 | vllm | RCE via unsafe pickle deserialization in RPC server |
| CVE-2026-25960 | 9.8 | vllm | SSRF allows internal network access |
| CVE-2025-47277 | 9.8 | vllm | RCE via exposed TCPStore in distributed inference |
| CVE-2024-11041 | 9.8 | vllm | RCE via unsafe pickle deserialization in MessageQueue |
| CVE-2025-32444 | 9.8 | vllm | RCE via pickle deserialization on ZeroMQ |