CVE-2026-34760: vLLM: audio downmix mismatch enables adversarial input

HIGH
Published April 2, 2026
CISO Take

A perception gap between human-heard audio and AI-processed audio in vLLM (via Librosa's non-standard mono downmixing) allows attackers to craft stereo audio that sounds benign to human reviewers but delivers different frequency content to the AI model. This is a low-noise, hard-to-detect integrity attack vector against voice-enabled AI deployments. Upgrade to vLLM v0.18.0 immediately if running audio inference workloads.

What is the risk?

High-rated (CVSS 7.1) and architecturally significant for audio AI pipelines. CVSS Integrity impact is HIGH with only Low Privileges Required, meaning authenticated users can exploit the human-AI perception gap to manipulate model outputs. The CVSS vector rates Attack Complexity as Low, but crafting an effective payload requires advanced audio signal-processing skill, which limits opportunistic exploitation, and the very low EPSS score (0.00057) is consistent with no exploitation having been observed. Risk rises substantially for organizations using vLLM in voice assistants, audio moderation, or speech-to-text pipelines where human audits are trusted as ground truth.

Severity & Risk

CVSS 3.1
7.1 / 10
EPSS
0.1%
chance of exploitation in 30 days
Higher than 22% of all CVEs
Exploitation Status
No known exploitation
Sophistication
Advanced

Attack Surface

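Attack Vector (AV): Network
Attack Complexity (AC): Low
Privileges Required (PR): Low
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): None
Integrity (I): High
Availability (A): Low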

What should I do?

  1. Patch: Upgrade vLLM to v0.18.0 or later (fix is in commit c7f98b4).

  2. Audit: Identify all services using vLLM audio inference or Librosa's to_mono in your stack.

  3. Workaround (if patching is delayed): Implement pre-processing that applies ITU-R BS.775-4 weighted downmixing before audio reaches the model, or enforce mono-only audio input at ingestion.

  4. Detection: Log and compare audio preprocessing outputs against human-audited samples; anomalous divergence may indicate exploitation.

  5. Supply chain hygiene: Pin Librosa versions in your AI serving containers and validate against known-good checksums.
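The mono-only option in step 3 could be enforced at ingestion with a guard along these lines. This is a minimal sketch, not part of vLLM or Librosa: the names `require_mono` and `MultiChannelAudioError` are hypothetical, and it assumes audio has already been decoded into a NumPy array shaped `(channels, samples)` or `(samples,)`.

```python
import numpy as np


class MultiChannelAudioError(ValueError):
    """Hypothetical error raised when an upload has more than one channel."""


def require_mono(audio):
    """Return a 1-D mono signal, or reject multi-channel input outright.

    Assumes `audio` is shaped (samples,) or (channels, samples).
    Rejecting instead of downmixing means no downmix algorithm is involved,
    so the reviewer and the model receive the same samples.
    """
    audio = np.asarray(audio)
    if audio.ndim == 1:
        return audio
    if audio.ndim == 2 and audio.shape[0] == 1:
        return audio[0]
    raise MultiChannelAudioError(
        f"expected mono audio, got array with shape {audio.shape}"
    )
```

Rejecting multi-channel uploads is stricter than downmixing them, but it removes the ambiguity about which mix the model should receive.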

CISA SSVC Assessment

Decision Track
Exploitation none
Automatable No
Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Classification

Compliance Impact

This CVE is relevant to:

EU AI Act
Article 15 - Accuracy, robustness and cybersecurity
ISO 42001
8.4 - Data for AI systems
NIST AI RMF
MEASURE 2.5 - AI system performance and limitations testing
OWASP LLM Top 10
LLM05:2025 - Supply Chain Vulnerabilities

Frequently Asked Questions

What is CVE-2026-34760?

CVE-2026-34760 is an audio preprocessing flaw in vLLM versions 0.5.5 up to, but not including, 0.18.0. vLLM relies on Librosa's to_mono, which averages channels with numpy.mean instead of applying the weighted downmix specified in ITU-R BS.775-4. Attackers can therefore craft stereo audio that sounds benign to human reviewers but delivers different frequency content to the AI model, a low-noise, hard-to-detect integrity attack against voice-enabled AI deployments. Upgrade to vLLM v0.18.0 or later if you run audio inference workloads.

Is CVE-2026-34760 actively exploited?

No confirmed active exploitation of CVE-2026-34760 has been reported, but organizations should still patch proactively.

How to fix CVE-2026-34760?

  1. Patch: Upgrade vLLM to v0.18.0 or later (fix is in commit c7f98b4).
  2. Audit: Identify all services using vLLM audio inference or Librosa's to_mono in your stack.
  3. Workaround (if patching is delayed): Implement pre-processing that applies ITU-R BS.775-4 weighted downmixing before audio reaches the model, or enforce mono-only audio input at ingestion.
  4. Detection: Log and compare audio preprocessing outputs against human-audited samples; anomalous divergence may indicate exploitation.
  5. Supply chain hygiene: Pin Librosa versions in your AI serving containers and validate against known-good checksums.

What systems are affected by CVE-2026-34760?

This vulnerability affects the following AI/ML architecture patterns: multimodal inference pipelines, voice AI and speech-to-text services, audio content moderation systems, model serving with audio input, human-in-the-loop audio review pipelines.

What is the CVSS score for CVE-2026-34760?

CVE-2026-34760 has a CVSS v3.1 base score of 7.1 (HIGH). The EPSS exploitation probability is about 0.06% (score 0.00057).

Technical Details

NVD Description

vLLM is an inference and serving engine for large language models (LLMs). From version 0.5.5 to before version 0.18.0, Librosa defaults to using numpy.mean for mono downmixing (to_mono), while the international standard ITU-R BS.775-4 specifies a weighted downmixing algorithm. This discrepancy results in inconsistency between audio heard by humans (e.g., through headphones or regular speakers) and audio processed by AI models (in infrastructure that relies on Librosa, such as vLLM and Transformers). This issue has been patched in version 0.18.0.
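The difference between an unweighted mean and a weighted downmix can be illustrated with a small NumPy sketch. The five-channel layout (L, R, C, Ls, Rs) and the weights in `ASSUMED_WEIGHTS` are assumptions chosen for illustration and should be checked against ITU-R BS.775-4 before being relied on. Both function names are hypothetical, and the code is neither Librosa's implementation nor the vLLM patch.

```python
import numpy as np

# Assumed per-channel weights for an (L, R, C, Ls, Rs) to mono downmix.
# Illustrative values only; verify against ITU-R BS.775-4 before use.
ASSUMED_WEIGHTS = np.array([0.7071, 0.7071, 1.0, 0.5, 0.5])


def mean_downmix(audio):
    """Unweighted average over channels; `audio` is shaped (channels, samples)."""
    return np.mean(audio, axis=0)


def weighted_downmix(audio, weights=ASSUMED_WEIGHTS):
    """Weighted sum over channels using one coefficient per channel."""
    audio = np.asarray(audio, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if audio.ndim != 2 or audio.shape[0] != weights.shape[0]:
        raise ValueError("audio must be shaped (len(weights), samples)")
    return weights @ audio


# Content placed in the centre channel and, inverted, in one surround channel.
audio = np.zeros((5, 4))
audio[2] = 1.0   # C
audio[3] = -1.0  # Ls
print(mean_downmix(audio))      # the two channels cancel exactly
print(weighted_downmix(audio))  # unequal weights leave a residual signal
```

With equal weights the two channels cancel, while unequal weights leave part of the signal in the mix, so the two downmixes of the same file carry different content.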

Exploitation Scenario

An adversary uploads a crafted stereo audio file to a voice-enabled AI application (e.g., a vLLM-backed speech command interface or audio moderation system). The file is engineered so that the ITU-R weighted mix, which approximates what a human reviewer hears, contains normal, benign speech. The numpy.mean downmix that vLLM processes, however, yields a different frequency-domain representation containing adversarial perturbations or hidden commands. The AI model responds to the manipulated version while human auditors reviewing the 'same' audio file hear nothing suspicious. This bypasses human-in-the-loop safety reviews and enables adversarial audio injection with plausible deniability.
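One heuristic for flagging files of this kind is to measure how much of a stereo signal's energy sits in the inter-channel difference, which an unweighted mean discards. The sketch below is an assumption-laden illustration, not a detection method from the advisory or from vLLM. The function name `side_to_mid_ratio` is hypothetical, and any alerting threshold would need tuning on real traffic.

```python
import numpy as np


def side_to_mid_ratio(stereo, eps=1e-12):
    """Ratio of RMS energy in (L - R) / 2 to RMS energy in (L + R) / 2.

    `stereo` is shaped (2, samples). A high ratio means much of what a
    listener receives per ear is absent from the unweighted mean downmix.
    """
    stereo = np.asarray(stereo, dtype=float)
    if stereo.ndim != 2 or stereo.shape[0] != 2:
        raise ValueError("expected an array shaped (2, samples)")
    left, right = stereo
    mid = (left + right) / 2.0   # what an unweighted mean downmix keeps
    side = (left - right) / 2.0  # what an unweighted mean downmix discards
    mid_rms = np.sqrt(np.mean(mid ** 2))
    side_rms = np.sqrt(np.mean(side ** 2))
    return float(side_rms / (mid_rms + eps))
```

Ordinary stereo recordings also carry some side energy, so this ratio is best treated as a signal for routing files to closer review, not as proof of tampering.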


CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:H/A:L
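The 7.1 base score can be reproduced from this vector with the CVSS v3.1 base-score equations. The sketch below uses the metric weights and round-up rule from the FIRST specification; the function names `roundup` and `base_score` are chosen here for illustration.

```python
import math

# CVSS v3.1 base metric weights.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value):
    """Round up to one decimal place using the integer method from the spec."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the CVSS v3.1 base score from a vector string."""
    parts = vector.split("/")
    if parts[0] != "CVSS:3.1":
        raise ValueError("only CVSS:3.1 vectors are supported")
    m = dict(p.split(":") for p in parts[1:])
    changed = m["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:H/A:L"))  # 7.1
```

For this vector the impact sub-score is about 4.22 and the exploitability sub-score about 2.84; their sum of roughly 7.05 rounds up to 7.1.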

Timeline

Published
April 2, 2026
Last Modified
May 11, 2026
First Seen
April 2, 2026
