CVE-2024-12704: llama-index: DoS via infinite loop in LangChain LLM

GHSA-j3wr-m6xh-64hg | HIGH | PoC available | CISA SSVC: Track*
Published March 20, 2025
CISO Take

Any production service using llama-index with LangChain LLM streaming is exposed to a process hang with zero authentication required; an attacker need only send a single wrong-typed input. Upgrade llama-index-core to 0.12.6 or later immediately; if you cannot patch now, disable the streaming endpoint or gate it behind strict input validation. EPSS is low (0.27%), but the exploit is trivial and the blast radius covers every RAG and agent pipeline using this integration.

Risk Assessment

HIGH severity (CVSS 7.5) with a trivial exploitation path: network-accessible, no privileges, no user interaction. The impact is availability only; there is no data exposure or privilege escalation. An EPSS of 0.00271 (0.27%) suggests no observed mass exploitation yet, but the attack primitive (sending a wrong-typed input to a streaming endpoint) requires zero AI/ML expertise. Risk is elevated for any team running llama-index in production with LangChain LLM wrappers behind public-facing APIs.

Affected Systems

Package            Ecosystem   Vulnerable Range   Patched
llama-index-core   pip         < 0.12.6           0.12.6
llamaindex         pip         (not specified)    No patch

Severity & Risk

CVSS 3.1 Base Score: 7.5 / 10 (HIGH)
EPSS: 0.27% chance of exploitation in the next 30 days (higher than 58% of all CVEs)
Exploitation Status: Exploit available; public PoC indexed (trickest/cve)
Exploitation Likelihood: Medium (confidence: medium)
Sophistication: Trivial
CISA SSVC: Public PoC

Composite signal derived from CISA KEV, CISA SSVC, EPSS, trickest/cve, and Nuclei templates.

Attack Surface

Attack Vector (AV):        Network
Attack Complexity (AC):    Low
Privileges Required (PR):  None
User Interaction (UI):     None
Scope (S):                 Unchanged
Confidentiality (C):       None
Integrity (I):             None
Availability (A):          High

Recommended Action

6 steps
  1. PATCH

    Upgrade llama-index-core to >= 0.12.6 (patch commit d1ecfb77). This is the only complete fix.
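
    For pip-managed deployments this is a one-line upgrade (verify afterwards with the audit check in step 6):

      pip install --upgrade "llama-index-core>=0.12.6"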

  2. WORKAROUND

    If an immediate patch is not possible: replace calls to stream_complete on LangChainLLM instances with the synchronous complete, and remove streaming endpoints from public exposure.
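
    A minimal sketch of the swap, assuming a handler that currently streams. The handler name and the ChatOpenAI backend are illustrative (any LangChain LLM applies), and the import path follows the modular llama-index packaging; adjust to your version:

      from langchain_openai import ChatOpenAI            # illustrative backend
      from llama_index.llms.langchain import LangChainLLM

      llm = LangChainLLM(llm=ChatOpenAI())

      def handle_prompt(prompt: str) -> str:
          # Before (vulnerable): consuming llm.stream_complete(prompt) can
          # hang forever if the worker thread dies on a wrong-typed input.
          # After (workaround): complete() is synchronous and does not use
          # the streaming callback thread at all.
          return llm.complete(prompt).text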

  3. INPUT VALIDATION

    Add type-checking middleware to reject malformed inputs before they reach LLM wrappers.
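
    One possible shape for that guard, as a plain function in front of the wrapper (the function name is ours; schema-validating frameworks such as FastAPI with a declared prompt: str field achieve the same effect declaratively):

      def require_prompt(raw: object) -> str:
          # Reject anything that is not a non-empty string before it can
          # reach LangChainLLM.stream_complete; a wrong-typed input is
          # exactly the trigger for the hang described in this CVE.
          if not isinstance(raw, str) or not raw.strip():
              raise ValueError("prompt must be a non-empty string")
          return raw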

  4. CIRCUIT BREAKER

    Implement per-request timeouts (e.g., 30s) and process-level watchdogs (e.g., supervisord, Kubernetes liveness probes) to auto-restart hung workers.
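
    An in-process approximation of the per-request timeout (names are ours). Note that Python cannot kill a hung thread, so the timeout only frees the request path; the process-level watchdog is still needed to recycle the worker:

      from concurrent.futures import ThreadPoolExecutor, TimeoutError

      _pool = ThreadPoolExecutor(max_workers=8)

      def complete_with_timeout(llm, prompt: str, timeout_s: float = 30.0) -> str:
          future = _pool.submit(llm.complete, prompt)
          try:
              return future.result(timeout=timeout_s).text
          except TimeoutError:
              # The underlying thread may still be stuck; fail the request
              # and let the liveness probe restart the worker if it recurs.
              raise RuntimeError(f"LLM call exceeded {timeout_s}s; worker may be hung")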

  5. DETECTION

    Monitor for LLM inference worker threads that do not terminate within expected latency windows; alert on CPU spikes correlated with incomplete LLM responses.
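
    A minimal in-process watchdog along those lines (thresholds and names are illustrative; a real deployment would emit to its metrics system instead of logging):

      import logging, threading, time

      EXPECTED_MAX_LATENCY_S = 60.0
      _inflight: dict[int, float] = {}   # request id -> start time
      _lock = threading.Lock()

      def track(request_id: int) -> None:
          with _lock:
              _inflight[request_id] = time.monotonic()

      def done(request_id: int) -> None:
          with _lock:
              _inflight.pop(request_id, None)

      def watchdog() -> None:
          # Background thread: flag requests that have been in flight
          # longer than the expected latency window.
          while True:
              now = time.monotonic()
              with _lock:
                  stuck = [r for r, t0 in _inflight.items()
                           if now - t0 > EXPECTED_MAX_LATENCY_S]
              for r in stuck:
                  logging.warning("request %s in flight > %ss; possible hang",
                                  r, EXPECTED_MAX_LATENCY_S)
              time.sleep(5)

      threading.Thread(target=watchdog, daemon=True).start()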

  6. AUDIT

    Inventory all internal services importing llama-index and check version with: pip show llama-index-core | grep Version
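
    The same check from inside an environment, in Python (assumes the packaging library, which ships alongside pip, is importable):

      from importlib.metadata import PackageNotFoundError, version
      from packaging.version import Version

      try:
          installed = Version(version("llama-index-core"))
          status = "patched" if installed >= Version("0.12.6") else "VULNERABLE"
          print(f"llama-index-core {installed}: {status}")
      except PackageNotFoundError:
          print("llama-index-core not installed in this environment")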

CISA SSVC Assessment

Decision: Track*
Exploitation: PoC
Automatable: Yes
Technical Impact: Partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Classification

Compliance Impact

This CVE is relevant to:

EU AI Act: Article 9 (Risk management system)
ISO/IEC 42001: 8.4 (AI system operation)
NIST AI RMF: MANAGE-2.2 (Mechanisms are in place to respond to risks identified in AI systems)
OWASP LLM Top 10: LLM04 (Model Denial of Service)

Frequently Asked Questions

What is CVE-2024-12704?

CVE-2024-12704 is a denial-of-service vulnerability in the LangChainLLM class of llama-index (llama-index-core before 0.12.6). A wrong-typed input passed to stream_complete can terminate the worker thread before _llm.predict runs; with no exception handling for that case, get_response_gen loops forever and the process hangs. No authentication is required, and the fix is to upgrade llama-index-core to 0.12.6 or later.

Is CVE-2024-12704 actively exploited?

Proof-of-concept exploit code is publicly available for CVE-2024-12704, increasing the risk of exploitation.

How to fix CVE-2024-12704?

1. PATCH: Upgrade llama-index-core to >= 0.12.6 (patch commit d1ecfb77); this is the only complete fix.
2. WORKAROUND: If an immediate patch is not possible, replace calls to stream_complete on LangChainLLM instances with the synchronous complete, and remove streaming endpoints from public exposure.
3. INPUT VALIDATION: Add type-checking middleware to reject malformed inputs before they reach LLM wrappers.
4. CIRCUIT BREAKER: Implement per-request timeouts (e.g., 30s) and process-level watchdogs (e.g., supervisord, Kubernetes liveness probes) to auto-restart hung workers.
5. DETECTION: Monitor for LLM inference worker threads that do not terminate within expected latency windows; alert on CPU spikes correlated with incomplete LLM responses.
6. AUDIT: Inventory all internal services importing llama-index and check versions with: pip show llama-index-core | grep Version

What systems are affected by CVE-2024-12704?

This vulnerability affects the following AI/ML architecture patterns: RAG pipelines, agent frameworks, LLM serving (streaming), document processing pipelines, chatbot backends.

What is the CVSS score for CVE-2024-12704?

CVE-2024-12704 has a CVSS v3.1 base score of 7.5 (HIGH). The EPSS exploitation probability is 0.27%.

Technical Details

NVD Description

A vulnerability in the LangChainLLM class of the run-llama/llama_index repository, version v0.12.5, allows for a Denial of Service (DoS) attack. The stream_complete method executes the llm using a thread and retrieves the result via the get_response_gen method of the StreamingGeneratorCallbackHandler class. If the thread terminates abnormally before the _llm.predict is executed, there is no exception handling for this case, leading to an infinite loop in the get_response_gen function. This can be triggered by providing an input of an incorrect type, causing the thread to terminate and the process to continue running indefinitely.
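
The failure pattern reduces to a consumer polling a queue that its producer thread, having died on the bad input, will never feed. An illustrative reduction of that pattern (generic code, not llama-index internals; names are ours):

  import queue
  import threading

  def vulnerable_pattern(bad_input):
      q: queue.Queue = queue.Queue()

      def producer():
          # Mimics the worker thread: a wrong-typed input raises here,
          # before a token or end-of-stream sentinel is ever enqueued.
          bad_input.strip()   # AttributeError when bad_input is not a str
          q.put(None)         # sentinel: never reached

      threading.Thread(target=producer, daemon=True).start()

      # Mimics get_response_gen: polls the queue until a sentinel arrives.
      # Nothing checks whether the producer is still alive, so once it
      # dies this loop spins forever (the CPU spike noted under DETECTION).
      while True:
          try:
              if q.get_nowait() is None:
                  return
          except queue.Empty:
              continue

  vulnerable_pattern(12345)   # never returns: one request hangs the worker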

Exploitation Scenario

An adversary identifies a public-facing endpoint (chatbot, document Q&A, or RAG API) built on llama-index. They send an HTTP request with a malformed payload — for example, passing an integer or list where the LangChainLLM wrapper expects a string prompt. The LangChainLLM.stream_complete method launches a background thread that crashes before _llm.predict executes. The main thread, waiting in get_response_gen, enters an infinite loop with no exit condition. The worker process hangs indefinitely. The attacker repeats the request to exhaust all available workers, bringing the service down. No authentication, no AI/ML knowledge, and no special tooling required — a single malformed HTTP request is sufficient.

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

Timeline

Published: March 20, 2025
Last Modified: February 24, 2026
First Seen: March 20, 2025

Related Vulnerabilities