### Summary

The `extract_hidden_states` speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a `RuntimeError` that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (`repetition_penalty`, `frequency_penalty`, or `presence_penalty`).
### Affected Systems
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| vllm | pip | >= 0.18.0, < 0.20.0 | 0.20.0 |
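The vulnerable range above can be checked programmatically. The following is a minimal sketch (not an official tool); it assumes simple `X.Y.Z` version strings with no pre-release tags:

```python
def is_vulnerable(version: str) -> bool:
    """Return True if a vllm version string falls in >= 0.18.0, < 0.20.0."""
    # Parse "X.Y.Z" into a comparable tuple of ints (assumes no pre-release tags).
    parsed = tuple(int(part) for part in version.split("."))
    return (0, 18, 0) <= parsed < (0, 20, 0)

# Boundary checks against the advisory's stated range:
assert is_vulnerable("0.18.0")       # first vulnerable release
assert is_vulnerable("0.19.1")       # last vulnerable release
assert not is_vulnerable("0.17.0")   # predates the refactor
assert not is_vulnerable("0.20.0")   # patched release
```

For real deployments, a library such as `packaging.version` handles pre-release and post-release tags more robustly than this tuple comparison.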
If you run vllm in the vulnerable range with `extract_hidden_states` speculative decoding configured, you are affected.
### Recommended Action
A patch is available: update vllm to version 0.20.0 or later.
### Frequently Asked Questions
**What is CVE-2026-44223?**

A vulnerability in vLLM in which the `extract_hidden_states` speculative decoding proposer crashes the server on any request with penalty parameters.

**Is CVE-2026-44223 actively exploited?**

No confirmed active exploitation of CVE-2026-44223 has been reported, but organizations should still patch proactively.

**How do I fix CVE-2026-44223?**

Update to the patched version: vllm 0.20.0.

**What is the CVSS score for CVE-2026-44223?**

CVE-2026-44223 has a CVSS v3.1 base score of 6.5 (MEDIUM).
### Technical Details

**NVD Description**
### Summary

The `extract_hidden_states` speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a `RuntimeError` that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (`repetition_penalty`, `frequency_penalty`, or `presence_penalty`). A single request with a penalty parameter (e.g., `"repetition_penalty": 1.1`) is sufficient to crash the server. The crash is deterministic and immediate: no concurrency, race condition, or special workload is required.

### Details

In vLLM v0.17.0, the `extract_hidden_states` proposer's `propose()` method returned `sampled_token_ids.unsqueeze(-1)`, producing a tensor of shape `(batch_size, 1)`. In [PR #37013](https://github.com/vllm-project/vllm/pull/37013) (first released in v0.18.0), the KV connector interface was refactored out of `propose()`. The return type changed from `tuple[Tensor, KVConnectorOutput | None]` to `Tensor`, and the `.unsqueeze(-1)` call was removed along with the KV connector output:

```python
# Before (v0.17.0):
return sampled_token_ids.unsqueeze(-1), kv_connector_output  # shape (batch_size, 1)

# After (v0.18.0+):
return sampled_token_ids  # shape (batch_size, 2) after first decode step
```

The refactor missed that `sampled_token_ids` changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as `(batch_size, max_spec_len + 1)`. With `num_speculative_tokens=1`, this produces shape `(batch_size, 2)` instead of the expected `(batch_size, 1)`, causing a broadcast shape mismatch during penalty application.

### Impact

Any vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with `extract_hidden_states` speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.
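The shape mismatch described above can be sketched with NumPy. This is an illustration of the broadcasting failure, not vLLM's actual code; the variable names are assumptions:

```python
import numpy as np

batch_size, max_spec_len = 4, 1

# v0.17.0 behavior: the proposer returned sampled_token_ids.unsqueeze(-1),
# i.e. a (batch_size, 1) tensor, which is what penalty application expects.
first_step = np.zeros(batch_size, dtype=np.int64)
expected = first_step[:, None]
assert expected.shape == (4, 1)

# After the first decode step, the rejection sampler allocates its output
# as (batch_size, max_spec_len + 1) -- here (4, 2), not (4, 1).
later_steps = np.zeros((batch_size, max_spec_len + 1), dtype=np.int64)
assert later_steps.shape == (4, 2)
# Broadcasting (4, 2) where (4, 1) is expected raises the shape error
# that crashes the EngineCore process.

# The v0.20.0 fix slices to the first column, so the shape is correct
# regardless of the rejection sampler's output width.
fixed = later_steps[:, :1]
assert fixed.shape == (4, 1)
```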
### Patches

Fixed in [PR #38610](https://github.com/vllm-project/vllm/pull/38610), first included in vLLM v0.20.0. The fix slices the return value to `sampled_token_ids[:, :1]`, ensuring the correct `(batch_size, 1)` shape regardless of the rejection sampler's output dimensions.

### Workarounds

- Upgrade to vLLM v0.20.0 or later.
- If upgrading is not possible, avoid using `extract_hidden_states` as the speculative decoding method on affected versions.
- Alternatively, reject or strip penalty parameters (`repetition_penalty`, `frequency_penalty`, `presence_penalty`) from incoming requests at an API gateway before they reach vLLM.
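The gateway-side workaround can be sketched as a small request filter. This is a hypothetical example (the function name and request shape are assumptions, not part of vLLM's API):

```python
# The three sampling parameters that trigger the crash on affected versions.
PENALTY_KEYS = frozenset(
    {"repetition_penalty", "frequency_penalty", "presence_penalty"}
)

def strip_penalty_params(request_body: dict) -> dict:
    """Return a copy of an inbound JSON request body with penalty
    parameters removed, so they never reach the vLLM backend."""
    return {k: v for k, v in request_body.items() if k not in PENALTY_KEYS}

# Example: a request that would crash an affected server.
req = {"prompt": "hello", "repetition_penalty": 1.1, "max_tokens": 16}
safe = strip_penalty_params(req)
assert "repetition_penalty" not in safe
assert safe == {"prompt": "hello", "max_tokens": 16}
```

Rejecting such requests with a 4xx response (rather than silently stripping the keys) may be preferable, since dropping a penalty parameter changes sampling behavior without telling the client.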
### CVSS Vector

`CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H`
### Related Vulnerabilities

Other vulnerabilities in the same package (vllm):

- CVE-2024-9053 (9.8): RCE via unsafe pickle deserialization in RPC server
- CVE-2026-25960 (9.8): SSRF allows internal network access
- CVE-2025-47277 (9.8): RCE via exposed TCPStore in distributed inference
- CVE-2024-11041 (9.8): RCE via unsafe pickle deserialization in MessageQueue
- CVE-2025-32444 (9.8): RCE via pickle deserialization on ZeroMQ