CVE-2026-44223 — MEDIUM (CVSS 6.5) AI Security Vulnerability

Q: What is CVE-2026-44223?

CVE-2026-44223 is an availability vulnerability in vLLM: with `extract_hidden_states` speculative decoding enabled, any request that carries a sampling penalty parameter crashes the server.

Q: Is CVE-2026-44223 actively exploited?

No confirmed active exploitation of CVE-2026-44223 has been reported, but organizations should still patch proactively.

Q: How to fix CVE-2026-44223?

Update to patched version: vllm 0.20.0.

Q: What is the CVSS score for CVE-2026-44223?

CVE-2026-44223 has a CVSS v3.1 base score of 6.5 (MEDIUM).

Affected Systems

Package	Ecosystem	Vulnerable Range	Patched
vllm	pip	>= 0.18.0, < 0.20.0	`0.20.0`

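Do you use vllm? You are affected if you run a version >= 0.18.0 and < 0.20.0 with `extract_hidden_states` configured as the speculative decoding method.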
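
One way to check a version string against the vulnerable range above is sketched below. The helper names are hypothetical, and the parser deliberately ignores pre-release and local suffixes, so a build such as `0.20.0rc1` should be treated as unverified rather than patched. A version inside the range is only exposed if `extract_hidden_states` speculative decoding is also configured.

```python
import re

# Range taken from the Affected Systems table: >= 0.18.0, < 0.20.0
VULNERABLE_MIN = (0, 18, 0)  # inclusive lower bound
PATCHED = (0, 20, 0)         # exclusive upper bound (first fixed release)


def parse_release(version: str) -> tuple[int, int, int]:
    """Extract the leading X.Y[.Z] numbers; suffixes like 'rc1' or '+cu121' are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version.strip().lstrip("v"))
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_vulnerable_range(version: str) -> bool:
    """True if the release numbers fall inside [0.18.0, 0.20.0)."""
    return VULNERABLE_MIN <= parse_release(version) < PATCHED


for candidate in ("0.17.0", "0.18.0", "0.19.1", "0.20.0"):
    print(candidate, in_vulnerable_range(candidate))
```

In a live environment the installed version string could be obtained with `importlib.metadata.version("vllm")` from the standard library and passed to `in_vulnerable_range`.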

Severity & Risk

Metric	Value
CVSS 3.1	6.5 / 10
EPSS	N/A
Exploitation Status	No known exploitation
Sophistication	N/A

Attack Surface

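Metric	Value
Attack Vector (AV)	Network
Attack Complexity (AC)	Low
Privileges Required (PR)	Low
User Interaction (UI)	None
Scope (S)	Unchanged
Confidentiality (C)	None
Integrity (I)	None
Availability (A)	High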

Recommended Action

Patch available

Update vllm to version 0.20.0

Technical Details

NVD Description

### Summary

The `extract_hidden_states` speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a `RuntimeError` that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (`repetition_penalty`, `frequency_penalty`, or `presence_penalty`). A single request with a penalty parameter (e.g., `"repetition_penalty": 1.1`) is sufficient to crash the server. The crash is deterministic and immediate: no concurrency, race condition, or special workload is required.

### Details

In vLLM v0.17.0, the `extract_hidden_states` proposer's `propose()` method returned `sampled_token_ids.unsqueeze(-1)`, producing a tensor of shape `(batch_size, 1)`.

In [PR #37013](https://github.com/vllm-project/vllm/pull/37013) (first released in v0.18.0), the KV connector interface was refactored out of `propose()`. The return type changed from `tuple[Tensor, KVConnectorOutput | None]` to `Tensor`, and the `.unsqueeze(-1)` call was removed along with the KV connector output:

```python
# Before (v0.17.0):
return sampled_token_ids.unsqueeze(-1), kv_connector_output  # shape (batch_size, 1)

# After (v0.18.0+):
return sampled_token_ids  # shape (batch_size, 2) after first decode step
```

The refactor missed that `sampled_token_ids` changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as `(batch_size, max_spec_len + 1)`. With `num_speculative_tokens=1`, this produces shape `(batch_size, 2)` instead of the expected `(batch_size, 1)`, causing a broadcast shape mismatch during penalty application.

### Impact

Any vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with `extract_hidden_states` speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.

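
The shape change described under Details can be modeled without vLLM or PyTorch. The sketch below uses nested lists in place of tensors; every function name is hypothetical, the fill value is a made-up placeholder, and it assumes `max_spec_len` equals `num_speculative_tokens`. It only illustrates the shapes involved, not the actual penalty code path that raises the `RuntimeError`.

```python
# Toy model of the shape change; nested lists stand in for tensors.
# None of these functions are vLLM APIs.

def shape(rows):
    """Return (num_rows, num_cols) for a rectangular list of lists."""
    return (len(rows), len(rows[0]) if rows else 0)


def rejection_sampler_output(batch_size, num_speculative_tokens, fill=-1):
    """Allocate a (batch_size, num_speculative_tokens + 1) buffer, mirroring the
    (batch_size, max_spec_len + 1) allocation described in the advisory."""
    return [[fill] * (num_speculative_tokens + 1) for _ in range(batch_size)]


def propose_vulnerable(sampled_token_ids):
    """v0.18.0 to v0.19.1 behavior: the buffer is returned unchanged."""
    return sampled_token_ids


def propose_patched(sampled_token_ids):
    """List equivalent of the slice sampled_token_ids[:, :1]."""
    return [row[:1] for row in sampled_token_ids]


sampled = rejection_sampler_output(batch_size=3, num_speculative_tokens=1)
print(shape(propose_vulnerable(sampled)))  # (3, 2)
print(shape(propose_patched(sampled)))     # (3, 1)
```

Slicing rather than reshaping keeps exactly one column per request regardless of how wide the sampler's buffer is, which is why the same slice would still yield `(batch_size, 1)` for larger `num_speculative_tokens` values in this model.
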
### Patches

Fixed in [PR #38610](https://github.com/vllm-project/vllm/pull/38610), first included in vLLM v0.20.0. The fix slices the return value to `sampled_token_ids[:, :1]`, ensuring the correct `(batch_size, 1)` shape regardless of the rejection sampler's output dimensions.

### Workarounds

- Upgrade to vLLM v0.20.0 or later.
- If upgrading is not possible, avoid using `extract_hidden_states` as the speculative decoding method on affected versions.
- Alternatively, reject or strip penalty parameters (`repetition_penalty`, `frequency_penalty`, `presence_penalty`) from incoming requests at an API gateway before they reach vLLM.
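
The gateway workaround could look roughly like the following filter. This is a minimal sketch, not a drop-in gateway: `strip_penalties` is a hypothetical helper, and it assumes the penalty parameters arrive as top-level keys of a JSON request body, as in the advisory's `"repetition_penalty": 1.1` example. Keys nested elsewhere in the payload would not be caught.

```python
import json

# Parameter names listed in the advisory.
PENALTY_KEYS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def strip_penalties(body: bytes) -> tuple[bytes, list[str]]:
    """Remove top-level penalty parameters from a JSON request body.

    Returns the cleaned body and the list of keys that were removed, so the
    caller can log the change or reject the request instead of forwarding it.
    """
    payload = json.loads(body)
    if not isinstance(payload, dict):
        # Not a JSON object: nothing to strip, pass it through unchanged.
        return body, []
    removed = [key for key in PENALTY_KEYS if key in payload]
    for key in removed:
        del payload[key]
    return json.dumps(payload).encode("utf-8"), removed


request = b'{"prompt": "Hello", "max_tokens": 16, "repetition_penalty": 1.1}'
cleaned, removed = strip_penalties(request)
print(removed)              # ['repetition_penalty']
print(json.loads(cleaned))  # {'prompt': 'Hello', 'max_tokens': 16}
```

Returning the removed keys lets a gateway choose between silently stripping and rejecting with an error; rejecting is more transparent to clients, since stripping changes sampling behavior without telling them.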