CVE-2026-34760: audio downmix mismatch enables

CISO Take

A perception gap between human-heard audio and AI-processed audio in vLLM (via Librosa's non-standard mono downmixing) allows attackers to craft stereo audio that sounds benign to human reviewers but delivers different frequency content to the AI model. This is a low-noise, hard-to-detect integrity attack vector against voice-enabled AI deployments. Upgrade to vLLM v0.18.0 immediately if running audio inference workloads.

What is the risk?

Rated High (CVSS 7.1) and architecturally significant for audio AI pipelines. CVSS Integrity impact is High with only Low Privileges Required, meaning authenticated users can exploit the human-AI perception gap to manipulate model outputs. Attack Complexity is scored Low, but crafting an effective payload takes advanced audio-signal skills, which limits opportunistic exploitation, and the very low EPSS (0.00057) is consistent with no active exploitation having been observed. Risk elevates substantially for organizations using vLLM in voice assistants, audio moderation, or speech-to-text pipelines where human audits are trusted as ground truth.

How severe is it?

CVSS 3.1

7.1 / 10

EPSS

0.3%

chance of exploitation in 30 days

Higher than 18% of all CVEs

Source: EPSS v3 — FIRST.org

Exploitation Status

No known exploitation

Sophistication

Advanced

What is the attack surface?

AV Network

AC Low

PR Low

UI None

S Unchanged

C None

I High

A Low
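
The 7.1 base score can be reproduced from the metrics listed above. The sketch below applies the published CVSS v3.1 base-score equations for the Scope Unchanged case; the function and table names are illustrative and not part of any official tooling.

```python
import math

# CVSS v3.1 numeric weights (PR values are those for Scope: Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value):
    """Round up to one decimal place as defined in the CVSS v3.1 specification."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_unchanged(av, ac, pr, ui, c, i, a):
    """Base score for a vector whose Scope is Unchanged."""
    cia = WEIGHTS["CIA"]
    iss = 1 - (1 - cia[c]) * (1 - cia[i]) * (1 - cia[a])
    impact = 6.42 * iss
    exploitability = (
        8.22 * WEIGHTS["AV"][av] * WEIGHTS["AC"][ac]
        * WEIGHTS["PR"][pr] * WEIGHTS["UI"][ui]
    )
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


# AV Network, AC Low, PR Low, UI None, S Unchanged, C None, I High, A Low
score = base_score_unchanged("N", "L", "L", "N", "N", "H", "L")
print(score)  # 7.1
```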

What should I do?

Patch: Upgrade vLLM to v0.18.0 or later (fix is in commit c7f98b4).
Audit: Identify all services using vLLM audio inference or Librosa's to_mono in your stack.
Workaround (if patching is delayed): Implement pre-processing that applies ITU-R BS.775-4 weighted downmixing before audio reaches the model, or enforce mono-only audio input at ingestion.
Detection: Log and compare audio preprocessing outputs against human-audited samples; anomalous divergence may indicate exploitation.
Supply chain hygiene: Pin Librosa versions in your AI serving containers and validate against known-good checksums.
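
The workaround step can be sketched as a small pre-processing layer placed in front of the model. This is a minimal sketch, not the vLLM fix: the function names are hypothetical, audio is assumed to be a NumPy array shaped (channels, samples), and the 5-channel weights are an illustrative assumption that should be checked against ITU-R BS.775-4 before use.

```python
import numpy as np

# ASSUMPTION: illustrative weights for a (L, R, C, LS, RS) channel order.
# Verify the coefficients against ITU-R BS.775-4 before relying on them.
ASSUMED_WEIGHTS_5CH = np.array([0.7071, 0.7071, 1.0, 0.5, 0.5])


def require_mono(audio):
    """Accept only single-channel audio; reject everything else at ingestion."""
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim == 1:
        return audio
    if audio.ndim == 2 and audio.shape[0] == 1:
        return audio[0]
    raise ValueError(f"expected mono audio, got shape {audio.shape}")


def weighted_downmix(audio, weights):
    """Downmix (channels, samples) audio with explicit per-channel weights."""
    audio = np.asarray(audio, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    if audio.ndim != 2 or weights.ndim != 1 or audio.shape[0] != weights.shape[0]:
        raise ValueError("audio must be (channels, samples) with one weight per channel")
    # Weighted sum over the channel axis, instead of an unweighted mean.
    return weights @ audio
```

Either function would run before the audio is handed to the serving stack, so that any later mean-based downmix only ever sees a single channel. Enforcing mono is the simpler control; the weighted path is only as good as the weights supplied.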

What does CISA's SSVC say?

Decision Track

Exploitation none

Automatable No

Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

How is it classified?

Adversarial Examples

Supply Chain

Inference Framework

AML.T0010.001 - AI Software

AML.T0015 - Evade AI Model

AML.T0043 - Craft Adversarial Data

AML.T0043.003 - Manual Modification

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act

Article 15 - Accuracy, robustness and cybersecurity

ISO 42001

8.4 - Data for AI systems

NIST AI RMF

MEASURE 2.5 - AI system performance and limitations testing

OWASP LLM Top 10

LLM05:2025 - Supply Chain Vulnerabilities

Frequently Asked Questions

What is CVE-2026-34760?

A perception gap between human-heard audio and AI-processed audio in vLLM (via Librosa's non-standard mono downmixing) allows attackers to craft stereo audio that sounds benign to human reviewers but delivers different frequency content to the AI model. This is a low-noise, hard-to-detect integrity attack vector against voice-enabled AI deployments. Upgrade to vLLM v0.18.0 immediately if running audio inference workloads.

Is CVE-2026-34760 actively exploited?

No confirmed active exploitation of CVE-2026-34760 has been reported, but organizations should still patch proactively.

How to fix CVE-2026-34760?

1. Patch: Upgrade vLLM to v0.18.0 or later (fix is in commit c7f98b4).
2. Audit: Identify all services using vLLM audio inference or Librosa's to_mono in your stack.
3. Workaround (if patching is delayed): Implement pre-processing that applies ITU-R BS.775-4 weighted downmixing before audio reaches the model, or enforce mono-only audio input at ingestion.
4. Detection: Log and compare audio preprocessing outputs against human-audited samples; anomalous divergence may indicate exploitation.
5. Supply chain hygiene: Pin Librosa versions in your AI serving containers and validate against known-good checksums.
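
For the audit step, one way to check a Python environment is to compare the installed vLLM version against the affected range given in the advisory (0.5.5 up to, but not including, 0.18.0). The helper names below are hypothetical. The parser is deliberately simple: it reads only the leading numeric release segment, so a pre-release such as 0.18.0rc1 is treated as 0.18.0.

```python
from importlib import metadata


def parse_release(version):
    """Return the leading numeric release tuple, e.g. '0.17.1.post1' -> (0, 17, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_affected(version):
    """True if the version falls in [0.5.5, 0.18.0), the range named in the advisory."""
    release = parse_release(version)
    release = release + (0,) * (3 - len(release))  # pad '0.18' to (0, 18, 0)
    return (0, 5, 5) <= release < (0, 18, 0)


def installed_vllm_status():
    """Return (version, affected) for the local vLLM install, or None if absent."""
    try:
        version = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None
    return version, is_affected(version)


if __name__ == "__main__":
    print(installed_vllm_status())
```

This covers only the environment it runs in; container images and services that vendor Librosa's to_mono directly still need a separate inventory.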

What systems are affected by CVE-2026-34760?

This vulnerability affects the following AI/ML architecture patterns: multimodal inference pipelines, voice AI and speech-to-text services, audio content moderation systems, model serving with audio input, human-in-the-loop audio review pipelines.

What is the CVSS score for CVE-2026-34760?

CVE-2026-34760 has a CVSS v3.1 base score of 7.1 (HIGH). The EPSS exploitation probability is 0.27%.

What is the AI security impact?

Affected AI Architectures

multimodal inference pipelines

voice AI and speech-to-text services

audio content moderation systems

model serving with audio input

human-in-the-loop audio review pipelines

MITRE ATLAS Techniques

AML.T0010.001 AI Software

AML.T0015 Evade AI Model

AML.T0043 Craft Adversarial Data

AML.T0043.003 Manual Modification

Compliance Controls Affected

EU AI Act: Article 15

ISO 42001: 8.4

NIST AI RMF: MEASURE 2.5

OWASP LLM Top 10: LLM05:2025

What are the technical details?

Original Advisory

vLLM is an inference and serving engine for large language models (LLMs). In vLLM versions from 0.5.5 up to, but not including, 0.18.0, audio is downmixed through Librosa, whose to_mono defaults to numpy.mean, while the international standard ITU-R BS.775-4 specifies a weighted downmixing algorithm. This discrepancy results in inconsistency between the audio heard by humans (e.g., through headphones or regular speakers) and the audio processed by AI models whose infrastructure relies on Librosa, such as vLLM and Transformers. This issue has been patched in version 0.18.0.
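
The discrepancy described in the advisory can be illustrated numerically. The sketch below uses NumPy only and does not call Librosa or vLLM: the unweighted mean stands in for the to_mono default, and the 5-channel weights and the synthetic "benign" and "payload" tones are assumptions chosen for illustration, to be checked against ITU-R BS.775-4.

```python
import numpy as np

# 0.1 s of audio at 16 kHz; both tones complete whole cycles in this window.
t = np.arange(1600) / 16000.0
benign = 0.5 * np.sin(2 * np.pi * 220 * t)     # stands in for normal speech
payload = 0.25 * np.sin(2 * np.pi * 3000 * t)  # stands in for hidden content
silence = np.zeros_like(t)

# ASSUMPTION: illustrative weights for a (L, R, C, LS, RS) channel order.
weights = np.array([0.7071, 0.7071, 1.0, 0.5, 0.5])

# The payload is placed so that it cancels under the weighted mix
# (1.0 * payload + 0.5 * (-2 * payload) = 0) but survives an unweighted mean.
audio = np.stack([silence, silence, benign + payload, -2 * payload, silence])

mean_mix = audio.mean(axis=0)   # unweighted average over channels
weighted_mix = weights @ audio  # weighted sum over channels

print("weighted mix equals benign tone:", np.allclose(weighted_mix, benign))
print("mean mix still carries payload:",
      np.allclose(mean_mix, (benign - payload) / 5))
```

Under these assumptions the two downmixes differ in content, not just in gain: the weighted mix reduces to the benign tone while the mean keeps the payload at the same relative level as the benign tone.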

Exploitation Scenario

An adversary uploads a crafted stereo audio file to a voice-enabled AI application (e.g., a vLLM-backed speech command interface or audio moderation system). The stereo file is engineered so that the ITU-R weighted mix, which is what a human hears when reviewing the file, contains normal, benign speech. The numpy.mean downmix, which is what vLLM processes, instead produces a different frequency-domain representation containing adversarial perturbations or hidden commands. The AI model responds to the manipulated version while human auditors reviewing the 'same' audio file hear nothing suspicious. This bypasses human-in-the-loop safety reviews and enables adversarial audio injection with plausible deniability.
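
One way to flag files like the one in this scenario is to compute both downmixes at ingestion and measure how far apart they are. The sketch below is an assumption-laden heuristic, not a vetted detector: the function names, the cosine-based metric, and the 0.05 threshold are all hypothetical, and audio is assumed to be a NumPy array shaped (channels, samples).

```python
import numpy as np


def downmix_divergence(audio, weights):
    """Return 1 - |cosine similarity| between the mean and weighted downmixes."""
    audio = np.asarray(audio, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    mean_mix = audio.mean(axis=0)
    weighted_mix = weights @ audio
    denom = np.linalg.norm(mean_mix) * np.linalg.norm(weighted_mix)
    if denom == 0.0:
        # At least one mix is silent; they only agree if both are silent.
        both_silent = not mean_mix.any() and not weighted_mix.any()
        return 0.0 if both_silent else 1.0
    cosine = float(np.dot(mean_mix, weighted_mix) / denom)
    # Clamp so floating-point noise cannot produce a negative divergence.
    return max(0.0, 1.0 - abs(cosine))


def needs_review(audio, weights, threshold=0.05):
    """Flag audio whose two downmixes differ by more than a hypothetical threshold."""
    return downmix_divergence(audio, weights) > threshold
```

A gain-only difference between the two mixes scores zero, so the metric is informative only when the assumed weights differ across channels. Any threshold would need tuning against known-good audio before it is used for alerting.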

Weaknesses (CWE)

CWE-20 Improper Input Validation Primary

CWE-20 — Improper Input Validation: The product receives input or data, but it does not validate or incorrectly validates that the input has the properties that are required to process the data safely and correctly.

[Architecture and Design] Consider using language-theoretic security (LangSec) techniques that characterize inputs using a formal language and build "recognizers" for that language. This effectively requires parsing to be a distinct layer that effectively enforces a boundary between raw input and internal data representations, instead of allowing parser code to be scattered throughout the program, where it could be subject to errors or inconsistencies that create weaknesses. [REF-1109] [REF-1110] [REF-1111]
[Architecture and Design] Use an input validation framework such as Struts or the OWASP ESAPI Validation API. Note that using a framework does not automatically address all input validation problems; be mindful of weaknesses that could arise from misusing the framework itself (CWE-1173).

Source: MITRE CWE corpus.