CVE-2026-54233: vLLM decompression bomb OOM

Q: Is CVE-2026-54233 actively exploited?

No confirmed active exploitation of CVE-2026-54233 has been reported, but organizations should still patch proactively.

Q: How to fix CVE-2026-54233?

1. Upgrade vLLM to v0.23.1rc0 or later. The fix (PR #44970, commit 1b1359c) adds a decoded-size budget check before `np.concatenate`.
2. If immediate upgrade is blocked, disable or firewall-restrict the `/v1/audio/transcriptions` endpoint at the reverse proxy level.
3. Apply upstream request body limits (e.g., nginx `client_max_body_size 10m`) independently of vLLM's internal check as defence-in-depth.
4. Enforce per-user concurrency limits on audio upload endpoints to cap simultaneous decompression.
5. Set hard memory limits on vLLM containers (`--memory` in Docker or cgroups) to contain blast radius to a single pod.
6. Monitor for sudden RSS spikes in vLLM containers as a detection signal: a single legitimate 30-second transcription should not cause gigabyte-scale allocation.
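Step 4 above (per-user concurrency limits) could be enforced in a gateway or wrapper placed in front of vLLM. The following is a minimal sketch, not vLLM code: `PerUserAudioLimiter`, `TooManyAudioRequests`, and the default of one concurrent decode per API key are hypothetical names and values chosen for illustration.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class TooManyAudioRequests(Exception):
    """Raised when a caller already holds its maximum number of decode slots."""


class PerUserAudioLimiter:
    """Caps simultaneous audio decodes per API key (hypothetical helper)."""

    def __init__(self, max_concurrent_per_user: int = 1) -> None:
        self._max = max_concurrent_per_user
        self._active: Dict[str, int] = {}
        self._lock = threading.Lock()

    @contextmanager
    def slot(self, user_id: str) -> Iterator[None]:
        # Reserve a slot before any decoding starts; reject instead of queueing
        # so that pending uploads cannot pile up in memory.
        with self._lock:
            current = self._active.get(user_id, 0)
            if current >= self._max:
                raise TooManyAudioRequests(user_id)
            self._active[user_id] = current + 1
        try:
            yield
        finally:
            # Always release the slot, even if the decode raised.
            with self._lock:
                remaining = self._active[user_id] - 1
                if remaining:
                    self._active[user_id] = remaining
                else:
                    del self._active[user_id]
```

A request handler would wrap the transcription call in `with limiter.slot(api_key):` and translate `TooManyAudioRequests` into an HTTP 429 response. Rejecting rather than blocking is a deliberate choice here, since a waiting request still holds its uploaded body in memory.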

Q: What systems are affected by CVE-2026-54233?

This vulnerability affects the following AI/ML architecture patterns: LLM inference serving, multimodal AI pipelines, speech-to-text services, model serving.

Q: What is the CVSS score for CVE-2026-54233?

CVE-2026-54233 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.03%.

CISO Take

vLLM's speech-to-text endpoint validates upload size on compressed bytes but never caps the decoded output, so a single valid 25MB OPUS file decodes to roughly 5.7GB of float32 PCM and drives peak memory to about 14.9GB: a textbook decompression bomb applied to audio. Any organization running vLLM ≤0.23.0 with the `/v1/audio/transcriptions` endpoint reachable by external or multi-tenant API users is exposed to availability loss across the entire inference server, not just the audio feature, since memory exhaustion crashes the shared process. The EPSS score is low (0.03%, higher than only about 9% of all CVEs), but the attack requires only low-privilege API access (`PR:L`), making it trivially reachable on any platform that issues API keys; three to five concurrent malicious uploads are enough to exhaust a typical deployment. Upgrade to vLLM v0.23.1rc0+ (PR #44970) immediately, or gate the audio endpoint behind strict network-level body size limits and per-user concurrency controls as a stop-gap.

Sources: NVD, EPSS, GitHub Advisory, ATLAS

What is the risk?

Operational risk exceeds the CVSS 6.5 medium rating for AI inference deployments. The 232x memory amplification ratio means bandwidth is not the bottleneck: a single attacker on a slow connection can trigger ~5.7GB of decoded PCM per request, and `np.concatenate` then allocates a second contiguous array, pushing peak RSS to ~14.9GB. Multi-tenant vLLM deployments where the audio endpoint is exposed to API key holders are the highest-risk scenario: the `PR:L` requirement is trivially satisfied by trial or free-tier keys. With 130 downstream dependents and 61 prior CVEs in the same package, vLLM is a high-value infrastructure target for availability attacks against AI inference fleets.
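The amplification figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the decoder emits 48 kHz mono float32 samples (an assumption, not stated in the advisory) and uses the advisory's ~8.7 hours and 25MB figures; the helper name is illustrative.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def decoded_pcm_bytes(duration_s: float, sample_rate_hz: int,
                      channels: int = 1, bytes_per_sample: int = 4) -> int:
    """Size of raw PCM for a clip; 4 bytes per sample corresponds to float32."""
    return round(duration_s * sample_rate_hz * channels * bytes_per_sample)


duration_s = 8.7 * 3600            # ~8.7 hours of audio, per the advisory
compressed_bytes = 25 * MIB        # default upload limit, per the advisory
# Assumption: 48 kHz mono output from the decoder.
decoded_bytes = decoded_pcm_bytes(duration_s, sample_rate_hz=48_000)

decoded_gib = decoded_bytes / GIB
amplification = decoded_bytes / compressed_bytes
print(f"decoded: {decoded_gib:.2f} GiB, amplification: {amplification:.0f}x")
```

Under these assumptions the result lands in the same range as the advisory's ~5.7GB and 232x. The ~14.9GB peak is higher because the decoded chunks and the concatenated copy coexist in memory.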

How does the attack unfold?

Initial Access

Attacker obtains low-privilege API credentials (e.g., trial or free-tier key) for a multi-tenant vLLM inference platform exposing the audio transcription endpoint.

AML.T0040

Payload Crafting

Attacker encodes ~8.7 hours of low-complexity audio at 6kbps into a ~25MB OPUS file — a valid upload within the documented size limit that expands 232x on decode.

AML.T0034.001

Memory Exhaustion

Each concurrent upload passes the compressed-byte size check and enters the audio decoder, which accumulates ~5.7GB of decoded float32 PCM; `np.concatenate` then allocates a second contiguous array, driving peak RSS to ~14.9GB per request.

AML.T0049

Service Denial

Multiple concurrent requests exhaust available server RAM, causing OOM termination of the vLLM worker and denying inference service to all tenants on the shared instance.

AML.T0029

What systems are affected?

Package	Ecosystem	Vulnerable Range	Patched
vLLM	pip	<= 0.23.0	0.23.1rc0 (PR #44970)

If you run vLLM ≤0.23.0 with the `/v1/audio/transcriptions` endpoint reachable by API users, you are affected.

How severe is it?

CVSS 3.1

6.5 / 10

EPSS

0.03%

chance of exploitation in 30 days

Higher than 9% of all CVEs

Source: EPSS v3 — FIRST.org

Exploitation Status

No known exploitation

Sophistication

Trivial

What is the attack surface?

AV Network

AC Low

PR Low

UI None

S Unchanged

C None

I None

A High
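The 6.5 base score follows from the metric values listed above. As a worked check, the sketch below applies the CVSS v3.1 base-score equations for the Scope: Unchanged case, with metric weights taken from the CVSS v3.1 specification; the function and variable names are illustrative only.

```python
import math

# CVSS v3.1 metric weights for the values in this vector.
AV_NETWORK = 0.85
AC_LOW = 0.77
PR_LOW_UNCHANGED = 0.62   # Privileges Required: Low, Scope: Unchanged
UI_NONE = 0.85
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value: float) -> float:
    """CVSS v3.1 Roundup: smallest one-decimal number >= value."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av: float, ac: float, pr: float, ui: float,
                               c: float, i: float, a: float) -> float:
    """Base score when Scope is Unchanged."""
    iss = 1 - (1 - c) * (1 - i) * (1 - a)
    impact = 6.42 * iss
    exploitability = 8.22 * av * ac * pr * ui
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


# AV:N / AC:L / PR:L / UI:N / S:U / C:N / I:N / A:H
score = base_score_scope_unchanged(
    AV_NETWORK, AC_LOW, PR_LOW_UNCHANGED, UI_NONE,
    CIA["N"], CIA["N"], CIA["H"],
)
print(score)  # 6.5
```

The score is availability-only: with C and I at None, the impact sub-score comes entirely from A: High, which is why a trivially exploitable issue still rates Medium.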

How is it classified?

DoS
Inference API
AML.T0029 - Denial of AI Service
AML.T0034.001 - Resource-Intensive Queries
AML.T0049 - Exploit Public-Facing Application

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act

Article 9 - Risk management system

ISO 42001

A.6.1 - AI risk assessment

NIST AI RMF

MANAGE 2.2 - Mechanisms to sustain effectiveness of AI risk management

OWASP LLM Top 10

LLM10:2025 - Unbounded Consumption

What is the AI security impact?

Affected AI Architectures

LLM inference serving, multimodal AI pipelines, speech-to-text services, model serving

MITRE ATLAS Techniques

AML.T0029 Denial of AI Service

AML.T0034.001 Resource-Intensive Queries

AML.T0049 Exploit Public-Facing Application

Compliance Controls Affected

EU AI Act: Article 9

ISO 42001: A.6.1

NIST AI RMF: MANAGE 2.2

OWASP LLM Top 10: LLM10:2025

What are the technical details?

Original Advisory

### Summary

vLLM's `/v1/audio/transcriptions` endpoint limits compressed upload size but not decoded PCM output. A 25MB OPUS file expands to ~14.9GB of float32 PCM at decode time. Tested on vLLM v0.19.0.

### Details

`SpeechToTextProcessor` rejects uploads over `VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` (default 25MB) based on compressed byte length, but the audio decoder in `audio.py` accumulates all decoded frames into memory with no size limit before returning:

```python
# speech_to_text.py L184-189
if len(audio_data) / 1024 ** 2 > self.max_audio_filesize_mb:
    raise VLLMValidationError(...)
y, sr = load_audio(buf, sr=self.asr_config.sample_rate)  # decoded size unchecked

# audio.py L77-107
chunks: list[npt.NDArray] = []
for frame in container.decode(stream):
    chunks.append(frame.to_ndarray())
audio = np.concatenate(chunks, axis=-1).astype(np.float32)  # single contiguous allocation
```

A 25MB OPUS file at 6kbps encodes ~8.7 hours of audio. Decoding produces ~5.7GB of float32 PCM (232x amplification), and `np.concatenate` then allocates a second contiguous array, bringing peak RSS to ~14.9GB from a single request. `SpeechToTextConfig.max_audio_clip_s` (default 30s) applies only after the full decode and does not prevent the allocation.

### Impact

An unauthenticated attacker can exhaust server memory with a small number of concurrent requests, each a valid upload within the documented size limit. Severity was assessed with reference to prior OOM vulnerability reports in vLLM.

### Fix

A fix for this vulnerability was merged here: https://github.com/vllm-project/vllm/pull/44970
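The general shape of a decoded-size budget, as opposed to a compressed-size check, can be sketched as follows. This is not the code from PR #44970: `collect_with_budget` and `DecodedAudioTooLarge` are hypothetical names, and plain `bytes` chunks stand in for decoded frames so the example runs without NumPy or an audio decoder. With NumPy arrays, the per-frame size would come from the array's `nbytes` attribute instead of `len()`.

```python
from typing import Iterable, List


class DecodedAudioTooLarge(ValueError):
    """Raised when decoded output exceeds the configured budget."""


def collect_with_budget(frames: Iterable[bytes], max_decoded_bytes: int) -> bytes:
    """Accumulate decoded frames, aborting as soon as the budget is exceeded.

    The check runs per frame, before the frame is stored, so memory use stays
    bounded by the budget rather than by what the compressed file decodes to.
    """
    chunks: List[bytes] = []
    total = 0
    for frame in frames:
        total += len(frame)
        if total > max_decoded_bytes:
            raise DecodedAudioTooLarge(
                f"decoded audio exceeds budget of {max_decoded_bytes} bytes"
            )
        chunks.append(frame)
    # Only reached when the total is within budget, so the joined copy is
    # bounded too.
    return b"".join(chunks)
```

Because `frames` is consumed lazily, a generator-backed decoder stops being pulled once the budget trips. One plausible way to choose the budget is to derive it from the maximum clip length the service intends to accept, though the right value depends on the deployment.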

Exploitation Scenario

An attacker registers or purchases low-tier API access to a multi-tenant vLLM platform. They craft a ~25MB OPUS file encoding approximately 8.7 hours of low-complexity audio at 6kbps — valid by the documented upload limit. They submit five concurrent POST requests to `/v1/audio/transcriptions`. Each request passes the compressed-byte size check, enters `load_audio()`, and accumulates decoded PCM frames into memory; `np.concatenate` then allocates a second contiguous ~5.7GB array, driving peak RSS to ~14.9GB per request. Five concurrent requests push the vLLM process past available RAM, triggering OOM termination within seconds and denying inference service to all users on the shared instance.
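How many concurrent uploads it takes depends on the host. As a rough sketch of the scenario above, the helper below estimates the smallest number of simultaneous requests whose combined peak exceeds free memory. The ~14.9GB per-request peak comes from the advisory; the RAM and baseline figures in the example are hypothetical.

```python
import math


def requests_to_exhaust(total_ram_gb: float, baseline_gb: float,
                        peak_per_request_gb: float = 14.9) -> int:
    """Smallest n such that n * peak_per_request_gb exceeds free RAM."""
    free_gb = total_ram_gb - baseline_gb
    if free_gb <= 0:
        return 0
    return math.floor(free_gb / peak_per_request_gb) + 1


# Hypothetical hosts: total RAM and memory already in use by the server.
for total, baseline in [(64, 8), (80, 10)]:
    n = requests_to_exhaust(total, baseline)
    print(f"{total}GB host, {baseline}GB baseline: {n} concurrent uploads")
```

The same calculation can be run in reverse when sizing per-user concurrency caps or container memory limits.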

Weaknesses (CWE)

CWE-409 Improper Handling of Highly Compressed Data (Data Amplification) Primary

CWE-409 — Improper Handling of Highly Compressed Data (Data Amplification): The product does not handle or incorrectly handles a compressed input with a very high compression ratio that produces a large output.

Source: MITRE CWE corpus.